Using Consensus-Based Reasoning and Large Language Models to Extract Structured Data From Surgical Pathology Reports.
APA
Tripathi A, Waqas A, et al. (2026). Using Consensus-Based Reasoning and Large Language Models to Extract Structured Data From Surgical Pathology Reports. Laboratory Investigation, 106(2), 104272. https://doi.org/10.1016/j.labinv.2025.104272
MLA
Tripathi A, et al. "Using Consensus-Based Reasoning and Large Language Models to Extract Structured Data From Surgical Pathology Reports." Laboratory Investigation, vol. 106, no. 2, 2026, p. 104272.
PMID
41412350
Abstract
Surgical pathology reports provide essential diagnostic information critical for cancer staging, treatment planning, and cancer registry documentation. However, their writing styles and formats vary widely, reflecting each pathologist's stylistic choices, institutional norms, and inherited practices from residency training. When performing large-scale data analysis, this unstructured nature and variability across tumor types and institutions pose significant hurdles for automated data extraction. To overcome these challenges, we present a consensus-driven, reasoning-based framework that adapts multiple locally deployed large language models (LLMs) to extract both standard diagnostic variables (such as site, laterality, histology, stage, grade, and behavior) and organ-specific biomarkers. Each LLM generates structured outputs, accompanied by justifications, which are subsequently evaluated for accuracy and coherence by 3 separate reasoning models (DeepSeek-R1-large, Qwen3-32B, and QWQ-32B). Final consensus values are determined through aggregation. Board-certified pathologists conducted expert validation. This framework was applied to >6100 pathology reports from The Cancer Genome Atlas (TCGA), spanning 10 organ systems, and 510 reports from Moffitt Cancer Center. For the TCGA data set, automated evaluation demonstrated a mean accuracy of 84.9% ± 7.3%, with histology (88%), site (87%), stage (84%), and behavior (84%) showing the highest extraction accuracy averaged across all models. Expert review of 138 randomly selected reports confirmed high agreement for behavior (100.0%), histology (99%), grade (96%), and site (95%) in the TCGA data set, with slightly lower performance for stage (89%) and laterality (88%). In Moffitt Cancer Center reports (brain, breast, and lung), accuracy remained high (88.2% ± 7.2%), with behavior (99%), histology (97%), laterality (96%), grade (94%), and site (93%) achieving strong agreement.
Biomarker extraction achieved 70.6% ± 7.9% overall accuracy, with TP53 (84%) in brain tumors, Ki-67 (68%) in breast cancer, and ROS1 (82%) in lung cancer showing the highest accuracy. Interevaluator agreement analysis revealed high concordance (correlation values ≥0.93) across the 3 evaluation models. Statistical analyses revealed significant main effects of model type (F = 1716.82, P < .001), variable (F = 3236.68, P < .001), and organ system (F = 1946.43, P < .001), as well as model × variable × organ interactions (F = 24.74, P < .001), emphasizing the role of clinical context in model performance. These results highlight the potential of stratified, multiorgan evaluation frameworks with multievaluator consensus in LLM benchmarking for clinical applications. Overall, this consensus-based approach demonstrated that locally deployed LLMs can provide a transparent, accurate, and auditable solution for integration into real-world pathology workflows, such as synoptic reporting and cancer registry abstraction.
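The consensus step the abstract describes (per-field aggregation of multiple model outputs) can be sketched as simple majority voting over normalized extractions. This is a minimal illustration under stated assumptions: the function name, the agreement threshold, and the lowercase normalization are hypothetical, not the authors' actual implementation.

```python
from collections import Counter

def consensus_value(extractions, min_agreement=2):
    """Aggregate per-model extractions for one field (e.g. laterality)
    into a consensus value.

    extractions: values proposed by the extraction LLMs for one report field.
    Returns the majority value, or None when no value reaches the
    agreement threshold (hypothetical tie-breaking policy).
    """
    # Normalize so trivially different spellings ("Left" vs "left") agree.
    votes = Counter(v.strip().lower() for v in extractions if v and v.strip())
    if not votes:
        return None
    value, count = votes.most_common(1)[0]
    return value if count >= min_agreement else None

# Hypothetical example: three models extracting "laterality" for one report.
print(consensus_value(["Left", "left", "right"]))  # -> left
```

In practice, a pipeline like the one described would apply this per variable (site, histology, stage, and so on) and flag low-agreement fields for expert review rather than silently dropping them.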
MeSH Terms
Humans; Consensus; Data Mining; Large Language Models; Neoplasms; Pathology, Surgical