Using Consensus-Based Reasoning and Large Language Models to Extract Structured Data From Surgical Pathology Reports.
APA
Tripathi A, Waqas A, et al. (2026). Using Consensus-Based Reasoning and Large Language Models to Extract Structured Data From Surgical Pathology Reports. Laboratory Investigation, 106(2), 104272. https://doi.org/10.1016/j.labinv.2025.104272
MLA
Tripathi A, et al. "Using Consensus-Based Reasoning and Large Language Models to Extract Structured Data From Surgical Pathology Reports." Laboratory Investigation, vol. 106, no. 2, 2026, p. 104272.
PMID
41412350
Abstract
Surgical pathology reports provide essential diagnostic information critical for cancer staging, treatment planning, and cancer registry documentation. However, their writing styles and formats vary widely, reflecting each pathologist's stylistic choices, institutional norms, and inherited practices from residency training. When performing large-scale data analysis, this unstructured nature and variability across tumor types and institutions pose significant hurdles for automated data extraction. To overcome these challenges, we present a consensus-driven, reasoning-based framework that adapts multiple locally deployed large language models (LLMs) to extract both standard diagnostic variables (such as site, laterality, histology, stage, grade, and behavior) and organ-specific biomarkers. Each LLM generates structured outputs, accompanied by justifications, which are subsequently evaluated for accuracy and coherence by 3 separate reasoning models (DeepSeek-R1-large, Qwen3-32B, and QWQ-32B). Final consensus values are determined through aggregation. Board-certified pathologists conducted expert validation. This framework was applied to >6100 pathology reports from The Cancer Genome Atlas (TCGA), spanning 10 organ systems, and 510 reports from Moffitt Cancer Center. For the TCGA data set, automated evaluation demonstrated a mean accuracy of 84.9% ± 7.3%, with histology (88%), site (87%), stage (84%), and behavior (84%) showing the highest extraction accuracy averaged across all models. Expert review of 138 randomly selected reports confirmed high agreement for behavior (100.0%), histology (99%), grade (96%), and site (95%) in the TCGA data set, with slightly lower performance for stage (89%) and laterality (88%). In Moffitt Cancer Center reports (brain, breast, and lung), accuracy remained high (88.2% ± 7.2%), with behavior (99%), histology (97%), laterality (96%), grade (94%), and site (93%) achieving strong agreement.
Biomarker extraction achieved 70.6% ± 7.9% overall accuracy, with TP53 (84%) in brain tumors, Ki-67 (68%) in breast cancer, and ROS1 (82%) in lung cancer showing the highest accuracy. Interevaluator agreement analysis revealed high concordance (correlation values ≥0.93) across the 3 evaluation models. Statistical analyses revealed significant main effects of model type (F = 1716.82, P < .001), variable (F = 3236.68, P < .001), and organ system (F = 1946.43, P < .001), as well as model × variable × organ interactions (F = 24.74, P < .001), emphasizing the role of clinical context in model performance. These results highlight the potential of stratified, multiorgan evaluation frameworks with multievaluator consensus in LLM benchmarking for clinical applications. Overall, this consensus-based approach demonstrated that locally deployed LLMs can provide a transparent, accurate, and auditable solution for integration into real-world pathology workflows, such as synoptic reporting and cancer registry abstraction.
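The consensus step the abstract describes (per-field aggregation of multiple model outputs) can be sketched as simple majority voting over normalized extractions. This is a minimal illustration under stated assumptions: the function name, the agreement threshold, and the lowercase normalization are hypothetical, not the authors' actual implementation.

```python
from collections import Counter

def consensus_value(extractions, min_agreement=2):
    """Aggregate per-model extractions for one field (e.g. laterality)
    into a consensus value.

    extractions: values proposed by the extraction LLMs for one report field.
    Returns the majority value, or None when no value reaches the
    agreement threshold (hypothetical tie-breaking policy).
    """
    # Normalize so trivially different spellings ("Left" vs "left") agree.
    votes = Counter(v.strip().lower() for v in extractions if v and v.strip())
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value if count >= min_agreement else None

# Hypothetical example: three models extracting "laterality" for one report.
print(consensus_value(["Left", "left", "right"]))  # -> left
```

In practice, a pipeline like the one described would apply this per variable (site, histology, stage, and so on) and flag low-agreement fields for expert review rather than silently dropping them.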
MeSH Terms
Humans; Consensus; Data Mining; Large Language Models; Neoplasms; Pathology, Surgical