Prompting large language models and evaluating inter- and intra-rater agreement for cancer progression assessment from radiology reports.
APA
Kristjánsson TÓ, Henriksen AF, et al. (2026). Prompting large language models and evaluating inter- and intra-rater agreement for cancer progression assessment from radiology reports. ESMO Real World Data and Digital Oncology, 11, 100689. https://doi.org/10.1016/j.esmorw.2026.100689
MLA
Kristjánsson TÓ, et al. "Prompting large language models and evaluating inter- and intra-rater agreement for cancer progression assessment from radiology reports." ESMO Real World Data and Digital Oncology, vol. 11, 2026, p. 100689.
PMID
41853219
Abstract
[BACKGROUND] Manual annotation of free-text radiology reports is time-consuming and costly, delaying real-world evidence (RWE) studies in oncology. This study aimed to evaluate the performance of large language models (LLMs) in annotating cancer progression from Danish free-text radiology reports. The objectives were to determine whether human-to-LLM inter-rater agreement was non-inferior to human-to-human agreement, establish human intra-rater agreement, and develop a framework for tuning LLM performance to RWE needs.
[MATERIALS AND METHODS] We identified 376 radiology reports from 184 patients with metastatic breast cancer from Danish electronic health records. Six human annotators, including two experts, classified radiology reports as progressive disease (PD) or non-PD. A 'reverse questioning' strategy was used to evaluate five LLM model series (Mistral, Gemma, Gemma 2, Llama 3, and Llama 3.1). Bootstrapping estimated confidence intervals (CIs) and assessed non-inferiority of the best-performing LLM ensemble compared with human agreement, using a non-inferiority margin of 0.1.
[RESULTS] The LLM framework was non-inferior to human annotators with a mean Cohen's kappa of 0.82 (95% CI 0.74-0.89) for human-to-LLM versus 0.79 (95% CI 0.71-0.86) for human-to-human agreement (P < 0.001). The best-performing ensemble model, Llama 3.1:70B, achieved 100% sensitivity, a specificity of 90%, and an F1 score of 84% on the test set. The mean human intra-rater variability was 0.87.
[CONCLUSIONS] The proposed LLM framework was non-inferior to human annotators in classifying cancer progression from free-text radiology reports. This offers significant potential for using LLMs as a tool for identifying tumor progression events in clinical assessment and research.
Introduction
Structured fields within electronic health record (EHR) systems offer potential for readily extractable data capture, yet clinically important variables are routinely documented in unstructured formats. Tumor response to antineoplastic treatment is a key clinical parameter in oncology, and is typically documented in reports as unstructured free text by radiologists.
Manual annotation of clinical parameters from free text to structured data is time-consuming and costly, slowing real-world evidence (RWE) studies and impeding the roll-out of common data models1 and the progress of international RWE oncology consortia such as ONCOVALUE,2,3 EHDEN,4 and EUCAIM.5
A promising solution to this problem is the rapid increase in the quality of large language models (LLMs) and our ability to optimize their performance through advanced prompting techniques, such as few-shot,6,7 chain-of-thought,7-9 self-consistency,10,11 and multi-agent collaboration.12,13 These advances have allowed LLMs to show effectiveness in health care applications, such as achieving expert-level performance on medical licensing examinations11,14 and supporting tasks across medical specialties,15 including radiology.16 These types of studies generally rely on clearly defined ‘ground truths’. In contrast, routine radiology reports in oncology involve a high degree of complexity and subjective interpretation, making it difficult to establish a single, definitive ground truth against which annotation performance can be compared. To address this in a clinically viable way, this study uses local installations of open-weight models, prioritizing implementation feasibility, data privacy, and a minimized regulatory burden. While LLM-supported tools for identifying tumor progression events could be valuable in both clinical assessment and research, their outputs can be inaccurate and biased,17 and benchmark scores often overestimate real-world performance.18 Therefore, to accurately assess the value of LLM tools, a robust method of evaluation is required.
In the context of free-text radiology reports, where interpretation is highly subjective, a fair comparison between humans and LLMs must measure human inter- and intra-rater agreement alongside human-to-LLM agreement. This evaluation is essential both to contextualize model performance and to assess its reliability.
Study objectives
The primary objective was to evaluate LLM performance in classifying progression events (yes/no) from Danish free-text radiology reports for metastatic breast cancer patients, and to test whether human-to-LLM agreement was non-inferior to human inter-rater agreement.
The secondary objectives were as follows:
•To quantify human intra-rater agreement for annotation of progression events.
•To develop and evaluate a secure and sustainable LLM-based framework that reduces the manual workload of interpreting free-text radiology reports for medical annotation tasks.
Materials and methods
Data source
We retrospectively identified the study population from the EHR system, Sundhedsplatformen (the Danish distribution of Epic Systems). The repository covers every public hospital in the Capital Region and Region Zealand, representing ∼2.6 million citizens, for whom systemic anticancer therapy is delivered exclusively within oncology departments (Danish National Guidelines19). The study population was limited to 376 reports from 184 patients who started a cyclin-dependent kinase 4/6 inhibitor for metastatic breast cancer between 2020 and 2022 (Figure 1). Although additional eligible reports were available, this subset was selected to balance feasibility with sufficient data for model development and evaluation. Multiple reports per patient were allowed and included all types of scans, ranging from baseline to follow-up studies, regardless of the level of detail in each report. The radiology reports were generated by multiple radiologists across several sites as part of routine clinical practice, and the corresponding image assessments were not standardized or required to conform to RECIST criteria.
The study was approved by The Research Ethics Committee Secretary in the Capital Region, and the study was filed with the Danish Data Protection Agency. As we worked with sensitive medical data, we used Ollama, an open-source framework, to run LLMs locally.
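To make this concrete, the sketch below shows how a locally hosted model might be queried through Ollama's Python client; the model tag, prompt wording, and temperature are illustrative placeholders rather than the study's exact configuration.

```python
# Minimal sketch of querying a locally hosted LLM via the Ollama Python client.
# The model tag, system prompt, and temperature are illustrative placeholders.
import ollama

def classify_report(report_text: str,
                    model: str = "llama3.1:70b",
                    temperature: float = 0.75) -> str:
    """Return the raw model answer (PD / non-PD / inconclusive) for one report."""
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system",
             "content": "Classify the radiology report as PD, non-PD, or inconclusive."},
            {"role": "user", "content": report_text},
        ],
        options={"temperature": temperature},  # sampling randomness; 0 = near-deterministic
    )
    return response["message"]["content"]
```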
Annotation of dataset
A one-page annotation guide in Danish and English was provided to the annotators, ensuring consistent labeling of each report as a progressive disease (PD) or non-PD event. The annotators reviewed the reports as independent observations and in a randomized sequence. To ensure a strict text-only assessment, they were blinded to external patient histories and prior imaging, relying solely on information explicitly stated within the specific report.
Annotation of all free-text reports was carried out by one expert (reference annotator). This provided the basis for an ∼50% training (191 scans), ∼10% validation (44 scans), and ∼40% test split (141 scans), with a balanced distribution of PD and non-PD in each set. The test dataset was independently annotated by a second expert annotator and four medical students, all with training in retrospective and prospective observational trial data gathering. The annotators were oncology clinicians, not radiologists, and they evaluated only the free-text reports, not the corresponding medical images.
Intra-rater agreement was evaluated by re-annotating the test set after a minimum 90-day washout period by the same annotators. Inter-rater variability and non-inferiority were evaluated on both annotation rounds.
Disagreements between the two expert annotators on the test dataset were identified and reevaluated to achieve mutual agreement. This expert consensus was used to evaluate the LLM’s final performance on the independent test set.
LLM methods
The English annotation guide was condensed into binary (yes/no) checklist items for prompting, e.g. ‘Is there at least one new metastasis?’ (see Appendix 1).
Our prompting strategy built on established few-shot, chain-of-thought, and self-consistency methods. We extended these with a novel technique, reverse questioning, which systematically prompts the model to generate arguments for both the PD and non-PD categories before being prompted to determine which argument is more likely correct (Figure 2). This method was specifically developed to prevent the model from being hypersensitive to negative clinical keywords and to correct its bias toward classifying reports as PD, the issues we observed empirically during model validation. In this setup, the LLMs had the option of classifying a report as inconclusive. To prioritize minimizing false negatives, these inconclusive assignments were converted to PD events before reporting evaluation metrics, reflecting the critical importance of identifying all PD events.
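As a concrete illustration of this flow, the sketch below paraphrases reverse questioning in code; the study's actual prompts are those in Appendix 1, and the wording and the naive answer parsing here are simplifying assumptions.

```python
# Sketch of reverse questioning: argue both categories, then adjudicate.
# Prompt wording and the one-word answer parsing are illustrative assumptions.
import ollama

def ask(prompt: str, model: str = "llama3.1:70b") -> str:
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

def reverse_question(report: str) -> str:
    for_pd = ask(f"Report:\n{report}\n\nArgue that this report shows progressive disease (PD).")
    for_non_pd = ask(f"Report:\n{report}\n\nArgue that this report shows no progression (non-PD).")
    verdict = ask(
        "Below are two arguments about the same radiology report.\n\n"
        f"Argument for PD:\n{for_pd}\n\n"
        f"Argument for non-PD:\n{for_non_pd}\n\n"
        "Which argument is more likely correct? Answer with exactly one word: "
        "PD, non-PD, or inconclusive."
    )
    answer = verdict.strip().lower()
    # Per the study's safety rule, inconclusive answers count as PD to avoid false negatives.
    if "inconclusive" in answer:
        return "PD"
    return "non-PD" if "non-pd" in answer else "PD"
```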
To increase the robustness and reliability of our methodology, we utilized a ‘wisdom of crowds’ ensemble approach, which resulted in 25 distinct outputs per report for each LLM model series, varying by quantization and temperature. We selected state-of-the-art open-weight models ranging from 7B to 70B parameters (Mistral,20 Gemma,21 Gemma 2,22 Llama 3,23 and Llama 3.123). This allowed us to investigate the performance variation between model series, while ensuring data privacy compliance using self-hosted local LLMs. For each report, we aggregated the 25 outputs by calculating the proportion classified as PD. We converted this proportion to a binary prediction (PD versus non-PD) using 10 thresholds from 10% to 100%, allowing performance to be evaluated across operating points.
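A minimal sketch of this aggregation step, assuming a matrix of 25 binary votes per report (random stand-ins for real model outputs):

```python
# Aggregate 25 ensemble votes per report into a PD fraction, then binarize
# at ten agreement thresholds (10%..100%). Vote data are random placeholders.
import numpy as np

votes = np.random.default_rng(0).integers(0, 2, size=(141, 25))  # 1 = PD vote

pd_fraction = votes.mean(axis=1)            # share of the 25 runs voting PD
thresholds = np.arange(0.1, 1.01, 0.1)      # ten operating points
predictions = {f"{t:.0%}": (pd_fraction >= t).astype(int) for t in thresholds}
```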
We used the training set for prompt tuning the reverse questioning approach, the validation set for model selection, and the test set for final performance evaluation. We also compared individual model performance to show the superiority of the ensemble method.
Evaluation metrics and statistics
The LLMs were compared using the F1 score, accuracy, sensitivity, specificity, and Cohen’s kappa. The best-performing model for the final analysis was the one that achieved the highest F1 score while maintaining 100% sensitivity on the validation set. The selected model was then used on the test set.
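Expressed as code, this selection rule might look as follows (a sketch; `predictions` is any mapping from operating point to binary predictions, as in the earlier sketch):

```python
# Pick the operating point with the highest F1 among those that keep
# sensitivity at 100% on the validation labels. A sketch of the stated rule.
from sklearn.metrics import f1_score, recall_score

def select_operating_point(y_true, predictions):
    eligible = {
        name: f1_score(y_true, y_pred)
        for name, y_pred in predictions.items()
        if recall_score(y_true, y_pred) == 1.0  # recall on PD == sensitivity
    }
    return max(eligible, key=eligible.get) if eligible else None
```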
Inter- and intra-rater agreements were assessed with Cohen’s kappa for pairs and Fleiss kappa for groups on the test set. Additionally, we report the percentage agreement when assessing the inter-rater agreement.
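For illustration, these agreement metrics could be computed as follows, with random labels standing in for the real annotations:

```python
# Agreement metrics on a reports x raters label matrix (random placeholder data).
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.random.default_rng(1).integers(0, 2, size=(141, 6))  # 6 annotators

pair_kappa = cohen_kappa_score(ratings[:, 0], ratings[:, 1])  # one rater pair
counts, _ = aggregate_raters(ratings)   # per-report category counts
group_kappa = fleiss_kappa(counts)      # all six raters at once
pct_agreement = (ratings[:, 0] == ratings[:, 1]).mean()       # raw agreement
```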
We assessed the non-inferiority of LLM versus human agreement on the test set using bootstrapping24 with a non-inferiority margin of 0.1. This margin was set below the observed human inter-rater variability across the two annotation rounds, keeping the threshold within the range of human agreement. We repeated this analysis without including the student annotations to specifically evaluate the model’s performance against the two expert oncologists. A significance level of 0.05 was used in all statistical tests.
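The exact procedure is described in Supplementary Material A; the sketch below shows one plausible shape of such a bootstrap test under the stated margin, not the authors' code.

```python
# One plausible bootstrap non-inferiority test (a sketch): resample reports,
# compare mean human-to-LLM kappa with mean human-to-human kappa minus the
# margin, and read off a one-sided CI lower bound and a bootstrap P value.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def bootstrap_noninferiority(human, llm, margin=0.1, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    n_reports, n_raters = human.shape
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_reports, n_reports)   # resample with replacement
        h, m = human[idx], llm[idx]
        k_hl = np.mean([cohen_kappa_score(h[:, i], m) for i in range(n_raters)])
        k_hh = np.mean([cohen_kappa_score(h[:, i], h[:, j])
                        for i in range(n_raters) for j in range(i + 1, n_raters)])
        diffs[b] = k_hl - (k_hh - margin)
    lower = np.quantile(diffs, 0.05)   # lower bound of one-sided 95% CI
    p_value = (diffs <= 0).mean()      # share of resamples crossing the margin
    return lower, p_value
```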
To assess the utility of the methodology, we calculated the reduction in manual workload, defined as the percentage of reports classified as non-PD. These reports represent the share of cases that could be automatically labeled without manual review, given 100% sensitivity.
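Computed directly, with placeholder predictions:

```python
# Workload reduction: share of reports auto-labeled non-PD, i.e. never sent
# for manual review (meaningful only while sensitivity stays at 100%).
import numpy as np

preds = np.random.default_rng(2).integers(0, 2, size=141)  # placeholder PD labels
workload_reduction = (preds == 0).mean()
print(f"Manual annotation workload reduced by {workload_reduction:.0%}")
```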
See Supplementary Material A, available at https://doi.org/10.1016/j.esmorw.2026.100689, for further details on methods.
Results
LLM selection
Llama 3.1:70b achieved the best performance on the validation set, with an F1 score of 0.83 and a Cohen’s kappa of 0.77 at 70% model agreement (Table 1), identifying 14 reports as PD compared with 10 from the reference annotator out of a total of 44 cases. Between the 30% and 70% agreement thresholds, the number of PD cases identified remained stable, at 15 and 14 PD cases, respectively, while maintaining 100% sensitivity (Figure 3). Performance metrics for all ensemble models at their highest F1 scores are shown in Supplementary Table S1, available at https://doi.org/10.1016/j.esmorw.2026.100689.
The subsequent analysis uses the Llama 3.1:70b ensemble model at a 70% agreement threshold.
LLM performance on the test set
When evaluated on the test set against the expert consensus, the ensemble model achieved an F1 score of 0.84, an accuracy of 0.92, and a specificity of 0.90, while maintaining 100% sensitivity. Of the 141 test cases, the model correctly identified all 28 expert-consensus PD cases and predicted an additional 11 false positives, potentially reducing the manual review workload by 72%. An overview of the performance on the test data for Llama 3.1:70b and the other ensemble models at all thresholds is visualized in Supplementary Figures S1, S2, and S3, as well as Tables S2 and S3, available at https://doi.org/10.1016/j.esmorw.2026.100689.
Individual model analysis showed that Llama 3.1:70b-instruct-q4_K_M with a temperature of 0.75 achieved the highest F1 score on the validation set (0.87) but dropped to 13th place on the test set (F1 of 0.74) when compared with expert consensus (Supplementary Tables S4 and S5 and Figures S4 and S5, available at https://doi.org/10.1016/j.esmorw.2026.100689).
LLM versus human agreement
Inter- and intra-rater agreement
Inter-rater agreement was substantial across both annotation rounds. In round 1 (Figure 4A), the annotators achieved a Fleiss kappa of 0.73, and the pairwise Cohen’s kappa ranged from 0.56 to 0.84, with the highest agreement between two students, who agreed on 132 of 141 cases. The two expert annotators agreed on 126 cases, resulting in a Cohen’s kappa of 0.65.
In round 2 (Figure 4B), the Fleiss kappa rose to 0.79 and the pairwise Cohen’s kappa ranged from 0.70 to 0.93, corresponding to agreement on 126 and 137 of the 141 cases, respectively.
All annotators achieved almost perfect intra-rater agreement (Table 2), with a mean Cohen’s kappa of 0.87 (range 0.80-0.92) and a mean self-agreement of 95% (range 94%-97%).
Non-inferiority assessment
For the first round, the mean Cohen’s kappa for human-to-LLM was 0.78 [95% confidence interval (CI) 0.69-0.85], compared with a mean human-to-human agreement of 0.73 (95% CI 0.64-0.80). The bootstrapped non-inferiority test with a margin of 0.1 showed that ensemble model agreement was non-inferior to human agreement (95% one-sided CI 0.003 to ∞, P < 0.001).
For the second round of annotations, the mean agreement was 0.82 (95% CI 0.74-0.89) and 0.79 (95% CI 0.71-0.86) for human-to-LLM and human-to-human, respectively. Again, the human-to-LLM agreement was non-inferior (95% CI −0.026 to ∞, P < 0.001).
Repeating the non-inferiority test using only the two expert annotators yielded similar results. The expert-to-expert agreement was 0.65 (95% CI 0.48-0.81) and 0.76 (95% CI 0.61-0.88) in the first and second rounds, respectively. The mean expert-to-LLM agreement was 0.69 (95% CI 0.58-0.80) and 0.77 (95% CI 0.67-0.87) for the corresponding rounds. In both rounds, the expert-to-LLM agreement was significantly non-inferior to the expert-to-expert agreement, with P values of 0.020 (95% CI −0.070 to ∞) and 0.026 (95% CI −0.078 to ∞) for the first and second rounds, respectively.
This indicates that the LLM achieved non-inferiority by reaching a level of inter-rater reliability no worse than the variability found between trained annotators.
Of the 141 radiology reports, 17 were labeled as PD by all six annotators in the first round, increasing to 21 in the second round. Reports labeled PD by at least one annotator decreased from 51 to 46 between rounds (Supplementary Table S6, available at https://doi.org/10.1016/j.esmorw.2026.100689).
Discussion
This study aimed to evaluate the performance of LLMs in identifying cancer progression events using real-world radiology reports. Our results demonstrate that our proposed reverse questioning framework is non-inferior to human annotators in classifying radiology reports as PD or non-PD (mean Cohen’s kappa of 0.79 for human-to-human versus 0.82 for human-to-LLM, P < 0.001). Importantly, our best-performing model could potentially reduce the manual review workload by 72%. Additionally, we quantified human inter- and intra-rater agreement and developed a practical framework for using LLMs to annotate progression events within Danish-language radiology reports from patients with metastatic breast cancer. This provides evidence for the LLM’s ability to perform at the level of human annotators.
Inter- and intra-rater agreement
Manual review of free-text radiology reports poses significant challenges due to the complex and nuanced nature of cancer progression. These challenges are particularly relevant in RWE, where radiologists might not provide structured labels, in contrast to prospective research settings where progression is systematically assessed using standardized RECIST criteria.25 This complexity is reflected in our human annotator agreement data, where the Fleiss kappa increased from 0.73 (95% CI 0.69-0.77) to 0.79 (95% CI 0.75-0.84) between annotation rounds. Across both rounds, the expert annotators demonstrated lower mean pairwise agreement compared with the students. This is likely because the expert annotators apply tacit rules not spelled out in the guide or because they are less adherent to the guide altogether.
To our knowledge, there are no inter-rater agreement benchmarks described specifically for free-text radiology reports in the real-world data (RWD) setting. Our findings show a mean intra-rater Cohen’s kappa of 0.87, and a 5% self-disagreement rate. These results are consistent with other studies, such as a study on asthma reports,26 which reported an inter-rater agreement of 0.75 and an intra-rater agreement of 0.81. Similarly, another study27 found an inter-rater agreement ranging from 0.62 to 0.76 and an intra-rater agreement ranging from 0.48 to 0.77. Although both studies likely involved less interpretive complexity than oncology progression assessment, they highlight that medical text interpretation tasks are inherently complex, increasing the need for consistent methods such as LLM-assisted annotation.
LLM performance
The reliability of our LLM framework is strengthened by its evaluation on unseen RWD. This approach was particularly important as the performance of LLMs has been seen to drop significantly when given simple perturbations such as changing names and numbers, suggesting that their high benchmark scores may be due to data memorization rather than genuine reasoning.18 Given these limitations with standard benchmarks, we used an independent test set to provide a fair evaluation of the LLMs in a clinical context, thereby ensuring a true and unbiased assessment of their reliability for high-stakes applications.
Our ‘wisdom of crowds’ approach significantly enhanced the robustness and reliability of predictions, especially given that our top-performing single LLM instance on the validation set dropped to 13th place when evaluated on the independent test set. The ensemble approach allows us to compute a certainty score (% of models agreeing), which has significant value in real-world scenarios, as the score can be used to configure models to a pre-defined sensitivity–specificity balance.
For instance, our implementation allows for further increasing the likelihood of identifying all PD cases. By decreasing the agreement threshold from 70% to 30%, the number of manual evaluations only increases from 39 to 47. This slight tradeoff dramatically boosts the confidence of maintaining 100% sensitivity in large real-world datasets. This ability to adjust the threshold introduces a key safety mechanism, enhancing the framework’s robustness and adaptability for future clinical implementation.
A post hoc correlation analysis (Supplementary Material C, available at https://doi.org/10.1016/j.esmorw.2026.100689) confirmed that report characteristics were heterogeneous. Importantly, these factors did not meaningfully confound model–expert agreement, supporting the applicability of our method in real-world clinical reports.
Implementation perspectives
The automated pipeline provides significant opportunities for the generation of robust datasets for large-scale RWE studies, such as investigating real-world progression-free survival and generating data to train image-to-category deep learning models. From a clinical perspective, it offers the opportunity for a second opinion when interpreting complex radiology reports. To support this, the framework allows for essential safeguards, such as a selective human-in-the-loop approach using ensemble agreement thresholds to flag ambiguous cases for manual expert review, ensuring the ongoing integrity and reliability of the automated outputs in real-world applications.
The use of local LLMs ensures that the methodology is implementable within secure hospital infrastructures, maintaining data privacy. However, while local model parameters remain locked, systematic random sampling of high-confidence reports should be carried out to monitor potential data drift and to align with quality management system standards, ensuring the technical integrity required for both clinical and research usage.
Limitations
Although the findings of this project are promising, key limitations should be addressed. The dataset relied on automatic translation from Danish to English using the LLMs’ own translational ability, a process that was not explicitly audited. This translation process should be considered when implementing a solution in other countries, as it risks losing clinically relevant information, especially when working with low-resource languages.28,29
This work did not involve annotations by radiologists, as our objective was to capture the oncologist’s interpretation of free-text reports. This methodological choice, while potentially introducing a bias from individual differences in how radiologists interpret and describe imaging, accurately reflects oncological decision-making pipelines that rely on free-text reports alone. Our focus was on oncologists’ binary PD/non-PD classification, not a substitution of the oncology RECIST25 standard.
The non-inferiority comparison was conducted against a mixed group of medical students and experts, rather than against experts alone. We did this to evaluate our framework against research practices using medical students to annotate RWD. Importantly, our sensitivity analysis excluding students did not change our conclusion.
Finally, prompt optimization focused on Gemma 2 and Llama 3, possibly leading to suboptimal performance for Gemma and Mistral. However, our study’s objective was not to provide a head-to-head comparison of different model types but to ensure we did not overlook a model with superior performance.
Future work
Future research should focus on extending the methodology from patient-level to lesion-specific tracking, which would improve granularity of the reverse questioning framework, allowing it to differentiate between systemic progression and mixed responses. Additionally, developing a clinical interface to expose the rationales would enable it to function as an explainable second opinion. Finally, external validity should be evaluated by extending the methodology to a broader range of cancer subtypes and clinical contexts, such as early-stage disease or first-relapse scenarios.
Conclusion
This study demonstrates that the proposed prompt-based reverse questioning LLM framework can classify PD events in metastatic breast cancer radiology reports with agreement that is non-inferior to human annotation. We quantified the human intra-rater variability, which demonstrated a key benefit of the LLM framework’s consistent performance compared with the inherent variability of human annotators. The method can be configured to meet specific research objectives, including maximizing specificity while maintaining 100% sensitivity, which in this setting demonstrated a potential reduction of the manual workload required for annotating radiological reports by 72%. These findings support the use of LLM-assisted annotation as a scalable approach for generating large annotated datasets with applicability for research, and future work could prove the framework to be valuable as decision support in a routine clinical setting.
Glossary
- Chain-of-thought prompting: an LLM prompting technique that encourages step-by-step reasoning.
- EHR: electronic health record. Digital versions of patients’ health care information.
- Expert consensus: annotated dataset where the two expert annotators discussed, agreed, and re-labeled scan reports which they initially disagreed on.
- Kappa: a measure of agreement between annotators, correcting for agreement occurring by chance. Cohen’s kappa is used for agreement between two raters, while Fleiss kappa is a generalization for agreement among three or more raters (a standard formula is given after this list).
- LLM: large language model. A type of artificial intelligence model focused on understanding and generating human language.
- Non-inferiority margin: the largest difference acceptable for a method to be considered no worse than a standard one.
- Non-progression (non-PD): the primary outcome indicating either a stable or improving disease in cancer patients. Defined as all states that are not progression.
- Ollama: an open-source framework for running LLMs locally.
- Progression (PD): the primary outcome indicating a worsening of disease in cancer patients.
- Quantization: the process of reducing the numerical precision of a model’s weights and activations to decrease model size.
- RWD: real-world data. Data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources.
- RWE: real-world evidence. Clinical evidence that is derived from the analysis of real-world data.
- RECIST: Response Evaluation Criteria in Solid Tumors. A standardized set of rules for evaluating objective response in solid tumors according to radiology images.
- Temperature: a parameter in LLMs which controls the randomness or creativity of the generated output. Ranging from 0 to 1, lower values lead to more deterministic results, while higher values lead to less predictable outputs.
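For reference, the standard two-rater form of Cohen’s kappa (a textbook definition, not specific to this study) is

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed proportion of agreement and $p_e$ is the proportion of agreement expected by chance.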
Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when reusing this material.