Agent-based large language model system for extracting structured data from breast cancer synoptic reports: a dual-validation study.
APA
Hart SN, Bergamaschi TS (2026). Agent-based large language model system for extracting structured data from breast cancer synoptic reports: a dual-validation study. JAMIA Open, 9(1), ooag016. https://doi.org/10.1093/jamiaopen/ooag016
MLA
Hart SN, et al. "Agent-based Large Language Model System for Extracting Structured Data from Breast Cancer Synoptic Reports: A Dual-Validation Study." JAMIA Open, vol. 9, no. 1, 2026, p. ooag016.
PMID
41756715
Abstract
[OBJECTIVES] To develop and validate an agent-based Large Language Model (LLM) system for extracting structured data from breast cancer synoptic pathology reports and assess the performance gap between synthetic and real-world validation.
[MATERIALS AND METHODS] We developed a modular artificial intelligence (AI) agent-based framework employing sequential specialized LLMs for parsing pathology reports and extracting structured data. We normalized College of American Pathologists (CAP) cancer protocols into 8 sections, 86 subsections, and 229 discrete fields. Seven leading LLMs (gemini-2.5-pro, llama3.3-70b, phi4-14b, deepseek-r1 14B/70B, gemma3-27b, gemini-2.0-flash-lite) were validated using dual evaluation: synthetic validation (864 controlled test cases) and real-world ground truth (6651 annotated fields from 90 pathology reports).
[RESULTS] Synthetic validation demonstrated strong performance (accuracy: 93.8%-99.0%). Real-world evaluation revealed field extraction recall ranging from 61.8% to 87.7%, demonstrating a substantial "reality gap" with performance drops of 11-32 percentage points. The gemini-2.5-pro model achieved the highest real-world recall (87.7%). Model size did not predict performance: the 14B-parameter deepseek-r1 (77.6%) outperformed its 70B-parameter counterpart (70.4%).
[DISCUSSION] The substantial performance degradation from synthetic to real-world data underscores the complexity of authentic clinical documentation. Smaller models can achieve competitive or superior recall, reducing computational costs. With even the best models missing 12%-38% of annotated fields, mandatory human verification is essential for clinical deployment.
[CONCLUSION] While LLM-based extraction systems show promise for pathology data extraction, synthetic validation alone provides false confidence. Rigorous real-world ground truth evaluation with expert annotation is essential before clinical deployment. These systems are best positioned as screening tools with mandatory human oversight rather than autonomous decision-making systems.
Background and significance
Breast cancer remains one of the most common malignancies worldwide, with pathology reports serving as the cornerstone of diagnosis, treatment planning, and prognosis.1 Synoptic pathology reports, structured according to College of American Pathologists (CAP) guidelines, provide comprehensive tumor characterization, including histologic type, grade, biomarker status, and staging information.2 While these reports follow standardized templates, critical information frequently resides in free-text fields rather than discrete, computable data elements.
This unstructured data problem creates a significant barrier to population-scale analytics, clinical research, and quality improvement initiatives.3 Manual extraction of data from free-text is labor-intensive, error-prone, and not scalable to the thousands of reports generated annually at large medical centers. Traditional natural language processing (NLP) approaches using rule-based or classical machine learning methods have shown limited success due to the complexity and variability of clinical language.4
Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in understanding and processing complex natural language.5,6 However, the application of LLMs to clinical data extraction faces significant challenges, including domain-specific terminology, the need for high accuracy in medical contexts, and the gap between performance on synthetic test cases vs real-world clinical data.7
In this study, we developed and validated a modular AI agent-based system that leverages specialized LLMs to extract, normalize, and structure unstructured data from breast cancer synoptic reports. We employed a dual-validation strategy, comparing performance on controlled synthetic test cases against manually curated real-world ground truth data, to provide a comprehensive assessment of both model capabilities and clinical readiness. Our findings reveal critical insights into the performance characteristics, limitations, and appropriate deployment strategies for LLM-based clinical data extraction systems.
Methods
Data schema development
We created a comprehensive and normalized data schema from the CAP cancer protocols for all 5 major breast cancer report types, including ductal carcinoma in situ, invasive carcinoma (biopsy and resection), and phyllodes tumors. To address the challenge of inconsistent data types and formatting within the original CAP checklists, we organized the reports into a hierarchical structure comprising 8 main sections (specimen, tumor, margins, regional lymph nodes, distant metastasis, pTNM classification, additional findings, comments), 86 subsections (eg, procedure, specimen laterality, tumor site), and 229 discrete fields. This standardized ontology established a clean and computable foundation for all downstream analysis.
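The schema files themselves are not reproduced in the paper; as a rough illustration, the three-level section/subsection/field hierarchy could be modeled in Python as below. Only the 8 top-level section names come from the text; the class names and example subsections are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaField:
    name: str
    options: list[str] = field(default_factory=list)  # valid ontology values


@dataclass
class Subsection:
    name: str
    fields: list[SchemaField] = field(default_factory=list)


@dataclass
class Section:
    name: str
    subsections: list[Subsection] = field(default_factory=list)


# The 8 top-level sections named in the normalized CAP schema;
# subsections shown here are illustrative placeholders.
SECTIONS = [
    Section("specimen", [Subsection("procedure"), Subsection("specimen laterality")]),
    Section("tumor", [Subsection("tumor site")]),
    Section("margins"),
    Section("regional lymph nodes"),
    Section("distant metastasis"),
    Section("pTNM classification"),
    Section("additional findings"),
    Section("comments"),
]
```

In the actual system the hierarchy is read from CSV files (by get_master_dict.py, described below) rather than hard-coded.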
Agent-based workflow architecture
We designed and implemented a flexible, AI-driven, modular workflow capable of parsing unstructured free-text from synoptic reports and mapping it to the standardized ontology (Figure 1). The system leverages a series of specialized agents to isolate, analyze, and structure complex clinical data that are not captured in discrete fields.
The automated extraction process is orchestrated by the AgentRunner.py module, which executes a series of modular tasks as defined by individual agents. The workflow consists of 2 parallel initialization steps followed by sequential processing:
Ontology definition: The get_master_dict.py tool reads the schema from CSV files, and get_ontology.py retrieves valid options for specific fields (eg, “procedure”).
Text segmentation: The regExpMatchAndTrim.py tool isolates relevant free-text blocks by searching for specific section patterns and extracting the corresponding text segments.
Prompt construction: The buildPrompt.py tool dynamically constructs detailed prompts combining the extracted text snippet with predefined ontology options and instructions guiding the LLM on matching text to ontology and formatting output as JSON.
LLM processing: The llmSelector.py tool sends the prompt to an LLM provider (such as Ollama or Gemini). The LLM analyzes the prompt and returns a structured JSON response, converting unstructured text into computable, normalized data points.
The callAgent.py tool enables modularity by allowing the main workflow to execute individual tasks as self-contained subagents.
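The internal logic of these modules is not published; a condensed sketch of one pass through the pipeline, with a stubbed LLM call standing in for llmSelector.py, might look like the following. The section pattern, prompt wording, and helper names are assumptions, not the authors' code.

```python
import json
import re


def trim_section(report_text: str, section_pattern: str) -> str:
    """Isolate the free-text block for one section (the regExpMatchAndTrim step)."""
    match = re.search(section_pattern, report_text, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def build_prompt(snippet: str, field_name: str, options: list[str]) -> str:
    """Combine the text snippet with ontology options and JSON-output instructions."""
    return (
        f"Match the report text to one of the allowed values for '{field_name}'.\n"
        f"Allowed values: {options}\n"
        f'Report text: "{snippet}"\n'
        'Respond only with JSON: {"field": ..., "value": ...}'
    )


def run_agent(report_text: str, field_name: str, options: list[str], llm_call) -> dict:
    """One pass of the pipeline: segment -> prompt -> LLM -> parsed JSON."""
    snippet = trim_section(report_text, r"Specimen:(.*?)(?:\n[A-Z]|\Z)")
    prompt = build_prompt(snippet, field_name, options)
    return json.loads(llm_call(prompt))


# Stub LLM for illustration; a real deployment would route the prompt to
# a provider (Ollama, Gemini, ...) via llmSelector.py.
fake_llm = lambda prompt: '{"field": "procedure", "value": "Total mastectomy"}'
result = run_agent(
    "Specimen: Total mastectomy, left breast\nTumor: ...",
    "procedure",
    ["Excision", "Total mastectomy", "Other"],
    fake_llm,
)
```

Requiring the model to answer in JSON constrained to the ontology's allowed values is what turns free text into computable, normalized data points.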
Model selection and evaluation
We evaluated 7 leading LLM architectures with varying parameter sizes spanning multiple vendors and architectural approaches. From Google, we tested gemini-2.5-pro and gemini-2.0-flash-lite. From Meta, we evaluated llama3.3-70b and gemma3-27b. From Microsoft, we tested phi4-14b. From DeepSeek, we evaluated both deepseek-r1-14b and deepseek-r1-70b. This diverse selection enabled comprehensive assessment across different model scales (14B-70B parameters) and training methodologies.
Synthetic validation dataset
We built a synthetic data generation pipeline that produces a diverse and challenging validation dataset with known ground-truth values. This approach allows for controlled, reproducible testing and objective performance comparisons across LLM architectures. The total simulation comprised 864 examples: 445 positive examples (cases where a correct ontology value was present in the text) and 419 negative examples (cases where the text did not contain a correct value for the given field). Synthetic cases were designed to test edge cases, ambiguous terminology, and complex clinical scenarios.
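The generation pipeline is not described in detail; schematically, the positive/negative split could be produced as in the sketch below, where the single example field and its phrasings are invented for illustration and only the 445/419 counts come from the text.

```python
import random

# Hypothetical one-field ontology for illustration only
ONTOLOGY = {"specimen laterality": ["Left", "Right"]}


def make_case(field_name: str, positive: bool, rng: random.Random) -> dict:
    """One synthetic test case with a known ground-truth label."""
    options = ONTOLOGY[field_name]
    if positive:
        # Positive case: a valid ontology value is present in the text
        value = rng.choice(options)
        return {"field": field_name,
                "text": f"Laterality: {value.lower()} breast",
                "truth": value}
    # Negative case: the text contains no valid value for this field
    return {"field": field_name,
            "text": "Laterality: not specified",
            "truth": None}


rng = random.Random(0)
cases = [make_case("specimen laterality", i < 445, rng) for i in range(864)]
n_pos = sum(c["truth"] is not None for c in cases)
```

Because every case carries a known `truth`, all confusion-matrix cells (including true negatives) are defined by construction, which is what permits the full metric suite on the synthetic set.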
Real-world ground truth dataset
To assess real-world efficacy and generalizability, we manually curated ground truth annotations from 90 de-identified breast cancer synoptic pathology reports, resulting in 6651 individual field-level records. Each record was independently reviewed and annotated by clinical experts to establish ground-truth values. This painstaking curation process provided definitive metrics on the system’s accuracy and readiness for application in clinical research and population health studies.
Performance metrics
Due to the structure of our ground-truth dataset, which contains only positive annotations (fields explicitly present in reports) without negative annotations (fields explicitly marked as absent), we restricted our evaluation to field-sample combinations with definitive ground truth. This approach avoids making unfounded assumptions about true negatives or false positives for unannotated fields.
For each model prediction on the 6651 annotated ground-truth fields, we classified the outcome as either a true positive (TP), when the model's extraction exactly matched the ground-truth value, or a false negative (FN), when the model failed to extract the correct value or extracted an incorrect value. Our primary performance metric was field-level accuracy (equivalent to recall over the annotated fields), calculated as Recall = TP / (TP + FN) × 100%.
This metric represents the percentage of annotated ground-truth fields that each model correctly extracted. We did not calculate precision, specificity, F1 score, or Matthews correlation coefficient (MCC) for the real-world evaluation, as these metrics require knowledge of true negatives, which are not available in our partially annotated ground-truth dataset. However, these metrics were calculated for the synthetic validation dataset, where complete ground truth (both positive and negative cases) was available by design.
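Concretely, the metric reduces to TP / (TP + FN) over annotated fields only; plugging in the counts reported in the Results section reproduces the published percentages:

```python
def field_level_recall(tp: int, fn: int) -> float:
    """Recall restricted to annotated ground-truth fields: TP / (TP + FN)."""
    return tp / (tp + fn)


# Counts reported for the real-world evaluation (6651 annotated fields total)
best = field_level_recall(5831, 820)    # gemini-2.5-pro
worst = field_level_recall(4111, 2540)  # llama3.3-70b
```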
Results
Synthetic validation performance
We first evaluated all 7 models on the synthetic validation dataset (864 test cases with complete ground truth, including both positive and negative examples). Table 1 presents the comprehensive performance metrics, demonstrating uniformly excellent performance across all models and metrics.
All models achieved accuracy above 93.8%, with sensitivity ranging from 90.2% to 97.8%, specificity from 96.0% to 99.5%, precision from 92.5% to 98.9%, F1 scores from 91.3% to 98.3%, and MCC values from 0.886 to 0.976. The top-performing model, gemini-2.5-pro, achieved 99.0% accuracy with near-perfect specificity (99.5%) and precision (98.9%). These results suggested that all models were highly capable of the extraction task when tested on controlled, synthetic data.
Real-world ground-truth performance
We then evaluated the same 7 models on real-world clinical synoptic reports using our manually curated ground-truth dataset. Table 2 presents the performance on the 6651 annotated positive fields from 90 breast cancer synoptic reports.
Models demonstrated substantial variability in their ability to accurately extract ground-truth fields, with recall ranging from 61.8% to 87.7%. The top-performing model, gemini-2.5-pro, correctly extracted 5831 of 6651 annotated fields (87.7% recall), while the lowest performing model, llama3.3-70b, correctly extracted 4111 fields (61.8% recall). The false negative counts ranged from 820 (gemini-2.5-pro) to 2540 (llama3.3-70b), indicating that even the best-performing models miss a significant portion of the ground-truth fields.
The reality gap: synthetic vs real-world performance
Our dual-validation approach revealed a substantial performance gap between controlled synthetic testing and real-world clinical data. Table 3 directly compares synthetic accuracy and real-world recall for each model, clearly illustrating this “reality gap.”
The performance drop ranged from 11.3 to 32.0 percentage points across all models. The best-performing model on synthetic data (gemini-2.5-pro, 99.0% accuracy) maintained the highest real-world recall (87.7%), but still showed an 11.3 percentage point drop. The largest gap was observed in llama3.3-70b, which dropped from 93.8% synthetic accuracy to 61.8% real-world recall (32.0 percentage point decrease). This substantial performance degradation underscores the complexity and variability inherent in authentic clinical documentation that cannot be fully captured by synthetic test cases.
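The "reality gap" figures are simple differences between synthetic accuracy and real-world recall; for the two extreme models reported above:

```python
# (synthetic accuracy %, real-world recall %) for the best- and worst-gap models
results = {
    "gemini-2.5-pro": (99.0, 87.7),  # smallest drop
    "llama3.3-70b": (93.8, 61.8),    # largest drop
}

# Gap in percentage points: synthetic accuracy minus real-world recall
gaps = {m: round(syn - real, 1) for m, (syn, real) in results.items()}
```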
Qualitative analysis: examples of correct and incorrect extractions
To illustrate the types of successes and failures encountered by the LLM-based extraction system, Table 4 presents representative examples from the synthetic validation dataset using the best-performing model (gemini-2.5-pro). True positive cases demonstrate the system’s ability to correctly parse clinical language and map it to the appropriate ontology fields, including extracting numeric values and handling structured field specifications. False negative cases reveal common failure modes, including selecting incorrect field types when ambiguous language is present (eg, confusing “Other” text fields with exact numeric distance fields when the text lacks clear numeric indicators) and misinterpreting context when multiple valid options exist. True negative cases show the system’s capability to correctly identify when specific fields are not mentioned in the report text. False positive cases, though rare (21/864 cases for gemini-2.5-pro), typically involve predicting field presence when text contains vague or generic terminology that does not clearly specify the required values. These examples underscore the importance of human review, particularly for ambiguous cases where clinical context is essential for correct interpretation.
Model size does not predict performance
Contrary to the common assumption that larger models perform better, model size did not directly correlate with extraction recall. The gemini-2.5-pro model achieved the highest recall (87.7%), correctly extracting 5831 of 6651 fields. Notably, the 14B-parameter deepseek-r1-14b (77.6% recall) outperformed its larger 70B-parameter counterpart deepseek-r1-70b (70.4% recall). Similarly, the 70B-parameter llama3.3-70b achieved the lowest recall (61.8%), underperforming several smaller models. Since all models were evaluated off-the-shelf without domain-specific fine-tuning, these performance differences likely stem from variations in base model architecture, pretraining data composition, and training methodologies rather than parameter count alone. This finding challenges the assumption that larger models necessarily deliver superior performance for specialized clinical NLP tasks.
Architecture-agnostic framework
Our modular framework successfully operated across diverse LLM architectures from multiple vendors (Google Gemini, Meta Llama, Microsoft Phi, DeepSeek) with varying parameter sizes (14B-70B). This architecture-agnostic design enables flexible model selection based on performance-cost tradeoffs and allows for easy integration of future model improvements.
Discussion
The reality gap in clinical NLP
The reality gap in clinical NLP
Our study demonstrates a critical finding that has significant implications for clinical AI deployment: synthetic validation alone provides false confidence in model readiness. As shown in Tables 1-3, all models demonstrated excellent performance on synthetic data (93.8%-99.0% accuracy) but showed substantial degradation on real-world data (61.8%-87.7% recall), with performance drops ranging from 11 to 32 percentage points. This “reality gap” underscores the complexity and variability inherent in authentic clinical documentation. While synthetic test cases allow for controlled evaluation of model capabilities, they cannot fully capture the nuances, abbreviations, errors, and contextual complexity present in actual clinical practice.8
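The size of this gap can be made concrete by comparing synthetic accuracy against real-world recall per model. The per-model synthetic figures below are illustrative placeholders consistent with the reported ranges, not the exact values from Tables 1-3; the real-world recall values are as reported.

```python
# Illustrative sketch: quantifying the "reality gap" as the percentage-point
# drop from synthetic accuracy to real-world recall. Synthetic values are
# hypothetical placeholders within the reported 93.8%-99.0% range; real-world
# recall values are as reported in the paper.
results = {
    # model: (synthetic_accuracy, real_world_recall), in percent
    "gemini-2.5-pro": (99.0, 87.7),
    "deepseek-r1-14b": (96.0, 77.6),
    "deepseek-r1-70b": (95.0, 70.4),
    "llama3.3-70b": (93.8, 61.8),
}

def reality_gap(synthetic: float, real: float) -> float:
    """Performance drop in percentage points between the two settings."""
    return synthetic - real

for model, (synthetic, real) in results.items():
    print(f"{model}: {reality_gap(synthetic, real):.1f} pp drop")
```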
Model size and selection for clinical tasks
The lack of correlation between model size and performance challenges the “bigger is better” paradigm in clinical NLP. The 14B-parameter deepseek-r1-14b model outperformed its 70B-parameter counterpart, and the 70B-parameter llama3.3-70b showed the poorest overall performance. Since all models were evaluated off-the-shelf without domain-specific fine-tuning, these differences reflect variations in base model architecture, pretraining data composition, and training methodologies rather than specialized clinical adaptation. For resource-constrained health-care settings, this finding is particularly encouraging, as it demonstrates that smaller, more efficient models can achieve competitive or superior performance compared to their larger counterparts, reducing computational requirements and deployment costs without sacrificing accuracy.
Appropriate deployment strategy
The observed performance characteristics define appropriate use cases for these systems. With recall ranging from 61.8% to 87.7% on ground-truth fields, even the best-performing models miss 12%-38% of annotated fields. This indicates that LLM-based extraction systems are well suited for screening and triage workflows, but should not be relied upon for autonomous clinical decision-making, where missed information is costly. Instead, these systems should be deployed with mandatory human verification, particularly for fields that directly impact treatment decisions.9
Potential appropriate applications include prepopulating structured databases for manual review and correction, flagging reports with potentially missing critical information, identifying candidates for clinical trial enrollment, and supporting quality assurance and audit workflows. In each of these use cases, the system serves as a first-pass screening tool that increases efficiency while maintaining human oversight for final decision-making. The error rate of 12%-38% emphasizes that human review is not optional but essential for clinical deployment.
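The first-pass screening pattern described above can be sketched as a simple triage step: any report whose extraction is missing a treatment-critical field is routed to mandatory manual review. The field names and record format below are hypothetical illustrations, not the paper's actual schema.

```python
# Sketch of a human-in-the-loop triage step: route any report whose extraction
# is missing a treatment-critical field to mandatory manual review.
# Field names and record format are hypothetical, not the paper's schema.
CRITICAL_FIELDS = {"histologic_type", "tumor_size", "margin_status", "er_status"}

def needs_review(extracted: dict) -> list:
    """Return the critical fields the LLM failed to extract (None or absent)."""
    return sorted(f for f in CRITICAL_FIELDS if extracted.get(f) is None)

reports = [
    {"id": "R001", "histologic_type": "invasive ductal carcinoma",
     "tumor_size": "2.1 cm", "margin_status": "negative", "er_status": "positive"},
    {"id": "R002", "histologic_type": "invasive lobular carcinoma",
     "tumor_size": None, "margin_status": "negative", "er_status": None},
]

for report in reports:
    missing = needs_review(report)
    if missing:
        print(f"{report['id']}: flag for review, missing {missing}")
    else:
        print(f"{report['id']}: prepopulate database, spot-check only")
```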
The essential role of real-world ground truth
Our findings emphasize that validation on expert-annotated clinical data is not optional but essential for assessing true clinical utility and safety before any clinical deployment. The labor-intensive process of manually curating 90 reports to generate 6651 ground-truth records was critical for revealing the actual performance characteristics of these systems. Regulatory frameworks and institutional review processes should require such real-world validation as a prerequisite for clinical AI system deployment.10
Limitations
Several limitations should be considered when interpreting our findings. First, our ground-truth dataset contains only positive annotations (fields explicitly present in reports) without negative annotations (fields explicitly marked as absent). This restricts our real-world evaluation to recall metrics and prevents calculation of precision, specificity, F1 score, and the Matthews correlation coefficient (MCC), which require knowledge of true negatives. While this approach ensures we only evaluate on field-sample combinations with definitive ground truth, it limits our ability to assess false positive rates. Second, our ground-truth dataset, while substantial, was derived from a single institution and may not fully represent the diversity of reporting practices across different health-care systems. Third, the manual annotation process, while rigorous, is subject to interrater variability. Fourth, our evaluation focused on breast cancer pathology reports, and generalizability to other cancer types or clinical domains remains to be established. Finally, the rapid pace of LLM development means that newer models may demonstrate improved performance characteristics.
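The positive-only annotation constraint can be illustrated concretely: with ground truth listing only fields known to be present, recall is computable but precision is not, because a spurious extraction cannot be distinguished from a correct extraction of an unannotated field. The snippet below is a minimal sketch of this point, not the study's evaluation code.

```python
# Minimal sketch of recall under positive-only annotation. Ground truth lists
# only (report, field) pairs known to be present; there are no negative labels,
# so false positives are unidentifiable and precision cannot be computed.
def recall(ground_truth: set, extracted: set) -> float:
    """Fraction of annotated (report, field) pairs the model recovered."""
    if not ground_truth:
        return 0.0
    return len(ground_truth & extracted) / len(ground_truth)

truth = {("R001", "tumor_size"), ("R001", "margin_status"), ("R002", "er_status")}
model_output = {("R001", "tumor_size"), ("R002", "er_status"),
                ("R002", "pr_status")}  # extra pair: correct or spurious? Unknown.

print(f"recall = {recall(truth, model_output):.3f}")  # 2 of 3 annotated pairs
# Precision would require knowing whether ("R002", "pr_status") is a false
# positive, which positive-only annotations cannot tell us.
```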
Additionally, while this LLM-based approach demonstrates promising performance, it introduces significant overhead compared to simpler methods. Traditional rule-based approaches using regular expressions or deterministic NLP methods require substantially fewer computational resources and far less engineering complexity. However, these simpler methods proved inadequate for our task for several reasons: the extreme variability in how pathologists express the same clinical findings (eg, “invasive ductal carcinoma” vs “IDC” vs “infiltrating ductal carcinoma”), the need to disambiguate context-dependent terminology, the requirement to handle negation and uncertainty statements, and the complexity of mapping free text to a 229-field normalized ontology spanning 8 report types. The LLM-based system requires considerable upfront costs, including: (1) token costs for processing each report through multiple agent calls (estimated at $0.05-0.50 per report depending on model choice), (2) engineering expertise to design and maintain the modular agent workflow, and (3) domain expert time to manually curate ground-truth data for validation. These resource requirements must be weighed carefully against the value of extracting structured data at scale. For institutions with limited computational resources or budgets, smaller models (eg, deepseek-r1-14b at 77.6% recall) may offer an acceptable performance-cost tradeoff compared to larger models.
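The performance-cost tradeoff above can be framed as a simple batch-cost estimate. The per-report costs below are assumed values within the paper's stated $0.05-0.50 range; the recall figures are as reported, and the average fields-per-report is derived from the study's 6651 fields over 90 reports.

```python
# Sketch: estimated cost to process a report batch vs expected fields recovered.
# Per-report costs are assumptions within the paper's $0.05-0.50 estimate;
# recall figures are as reported (87.7% gemini-2.5-pro, 77.6% deepseek-r1-14b).
def batch_tradeoff(n_reports: int, cost_per_report: float, recall: float,
                   fields_per_report: float = 74.0):  # ~6651 fields / 90 reports
    """Return (total cost in USD, expected number of correctly extracted fields)."""
    total_cost = n_reports * cost_per_report
    expected_fields = n_reports * fields_per_report * recall
    return total_cost, expected_fields

for model, cost, rec in [("gemini-2.5-pro", 0.50, 0.877),
                         ("deepseek-r1-14b", 0.05, 0.776)]:
    cost_usd, fields = batch_tradeoff(10_000, cost, rec)
    print(f"{model}: ${cost_usd:,.0f} for ~{fields:,.0f} extracted fields")
```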
Future directions
Future work should focus on several key areas to advance this technology toward clinical deployment. First, development of domain-specific fine-tuning strategies could improve real-world performance by training models on pathology-specific corpora. Second, investigation of ensemble approaches combining multiple models may leverage the complementary strengths of different architectures to achieve superior overall performance. Third, exploration of active learning strategies could efficiently improve model performance with minimal additional annotation by intelligently selecting the most informative examples for human review. Fourth, expansion to other cancer types and clinical domains would establish the generalizability of this approach beyond breast pathology. Fifth, development of uncertainty quantification methods would enable the system to identify low-confidence extractions requiring human review, improving safety and user trust. Finally, integration with clinical workflows and assessment of impact on efficiency and data quality through prospective clinical trials would provide the evidence base necessary for widespread adoption.
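One of the ensemble directions above could start as simply as per-field majority voting across model outputs. The sketch below assumes three models' extractions for a single field; the model names and values are hypothetical illustrations, not an implemented component of the system.

```python
# Sketch of a per-field majority-vote ensemble across model outputs.
# Model names and extracted values are hypothetical illustrations.
from collections import Counter

def majority_vote(values: list):
    """Return the most common non-null extraction, or None if all are null."""
    filled = [v for v in values if v is not None]
    if not filled:
        return None
    return Counter(filled).most_common(1)[0][0]

outputs = {
    "gemini-2.5-pro": "invasive ductal carcinoma",
    "deepseek-r1-14b": "invasive ductal carcinoma",
    "llama3.3-70b": None,  # field missed by this model
}
print(majority_vote(list(outputs.values())))  # → invasive ductal carcinoma
```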
Conclusion
This study presents a comprehensive evaluation of an agent-based LLM system for extracting structured data from breast cancer synoptic reports. We successfully developed a modular AI agent system that extracts, normalizes, and structures unstructured breast cancer data from synoptic report free-text into computable formats for population-scale analytics. Our architecture-agnostic framework works across diverse LLM architectures with varying parameter sizes, enabling flexible model selection based on performance-cost tradeoffs.
Through dual-validation methodology, we identified a critical “reality gap” with all models showing significant performance degradation from synthetic validation (93.8%-99.0% accuracy) to real-world ground truth (61.8%-87.7% recall), demonstrating that synthetic testing alone provides false confidence in clinical readiness. Real-world evaluation on 6651 annotated fields revealed that gemini-2.5-pro achieved the best performance (87.7% recall). Notably, we found that model size does not directly correlate with recall: the 14B-parameter deepseek-r1 (77.6%) outperformed its 70B-parameter counterpart (70.4%), and the 70B-parameter llama3.3-70b achieved the lowest recall (61.8%). Since all models were evaluated off-the-shelf without domain-specific fine-tuning, these differences likely reflect variations in base model architecture, pretraining data composition, and training methodologies rather than parameter count alone.
The observed performance characteristics define an appropriate deployment strategy. With even the best models missing 12%-38% of annotated fields, these systems are positioned for screening and triage workflows where they can increase efficiency, but they require mandatory human verification before clinical decisions. Our findings confirm that real-world ground-truth evaluation with expert annotation is essential for assessing true clinical utility and safety, and is a necessary step before any clinical deployment.
While LLM-based systems show considerable promise for unlocking the wealth of information in clinical free-text, our findings emphasize the need for rigorous real-world validation, appropriate deployment strategies with human oversight, and continued refinement of domain-specific training approaches. With these considerations in mind, such systems can serve as valuable tools to enhance the accessibility and utility of clinical data for research, quality improvement, and ultimately, better patient care.