Large language models for toxicity extraction in oncology trials: A real-world benchmark in prostate radiotherapy.
PICO auto-extraction (heuristic, confidence 2/4)
P · Population
55 patients, 8968 toxicity records
I · Intervention
Not extracted
C · Comparison
Not extracted
O · Outcome
[CONCLUSIONS] Off-the-shelf LLMs can extract clinically relevant toxicities with performance approaching human inter-rater reliability, at variable but often negligible costs. While grade-level accuracy remains limited, LLM integration into oncology workflows is feasible, offering scalable, low-cost support for toxicity monitoring and data abstraction in clinical research.
- Sample size (n): 55
- Sensitivity: 74.0 % (highest reported value, Gemini 2.5 Pro)
APA
Mastroleo F, Borras-Osorio M, et al. (2026). Large language models for toxicity extraction in oncology trials: A real-world benchmark in prostate radiotherapy. Radiotherapy and oncology : journal of the European Society for Therapeutic Radiology and Oncology, 216, 111348. https://doi.org/10.1016/j.radonc.2025.111348
MLA
Mastroleo F, et al. "Large language models for toxicity extraction in oncology trials: A real-world benchmark in prostate radiotherapy." Radiotherapy and oncology : journal of the European Society for Therapeutic Radiology and Oncology, vol. 216, 2026, pp. 111348.
PMID
41419026
Abstract
[BACKGROUND] Accurate toxicity assessment is critical in oncology trials, yet current reporting frameworks such as the Common Terminology Criteria for Adverse Events (CTCAE) remain labor-intensive and subject to inter-observer variability. Large language models (LLMs) offer potential to automate extraction and grading of adverse events from clinical notes and patient-reported outcomes (PROs), but their comparative performance and cost-effectiveness remain underexplored.
[METHODS] We evaluated five off-the-shelf LLMs (Gemini 2.0 Flash, Gemini 2.5 Flash, Gemini 2.5 Pro, GPT-4o, and GPT-5) using a rule-augmented few-shot prompting strategy to extract CTCAE-graded gastrointestinal and genitourinary toxicities from a prospective prostate radiotherapy trial (NCT02874014; n = 55 patients, 8968 toxicity records). Binary and grade-level accuracy, precision, recall, specificity, F1 score, Cohen's kappa, and computational costs were assessed.
[RESULTS] All models achieved high binary accuracy (84.6-87.4 %) and moderate grade accuracy (79.1-82.3 %). GPT-4o reached the best binary (87.4 %) and grade (83.5 %) accuracy, while Gemini 2.5 Pro demonstrated highest sensitivity (74.0 %). Specificity peaked with GPT-4o (96.0 %). Cohen's kappa values indicated moderate agreement (0.552-0.560 for binary; 0.401-0.465 for grades). Costs for the entire extraction varied substantially: Gemini 2.0 Flash delivered competitive accuracy at $0.77 total, whereas Gemini 2.5 Pro and GPT-5 exceeded $21.
[CONCLUSIONS] Off-the-shelf LLMs can extract clinically relevant toxicities with performance approaching human inter-rater reliability, at variable but often negligible costs. While grade-level accuracy remains limited, LLM integration into oncology workflows is feasible, offering scalable, low-cost support for toxicity monitoring and data abstraction in clinical research.
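The metrics named in the abstract (accuracy, precision, recall, specificity, F1 score, Cohen's kappa) can all be derived from a 2x2 confusion matrix of model output against the human reference. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code: the names `BinaryCounts`, `binary_metrics`, and `cost_per_record` are hypothetical, and the example counts are invented for demonstration and do not come from the paper. The per-record cost simply divides a reported total by the 8968 records, assuming cost is spread evenly across records.

```python
from dataclasses import dataclass


@dataclass
class BinaryCounts:
    """Confusion-matrix counts for toxicity present (positive) vs absent."""
    tp: int  # model and reference both report a toxicity
    fp: int  # model reports a toxicity, reference does not
    fn: int  # reference reports a toxicity, model does not
    tn: int  # both report no toxicity


def _safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(c: BinaryCounts) -> dict:
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = _safe_div(c.tp + c.tn, total)
    precision = _safe_div(c.tp, c.tp + c.fp)
    recall = _safe_div(c.tp, c.tp + c.fn)  # also called sensitivity
    specificity = _safe_div(c.tn, c.tn + c.fp)
    f1 = _safe_div(2 * precision * recall, precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_yes = _safe_div(c.tp + c.fp, total) * _safe_div(c.tp + c.fn, total)
    p_no = _safe_div(c.fn + c.tn, total) * _safe_div(c.fp + c.tn, total)
    p_expected = p_yes + p_no
    kappa = _safe_div(p_observed - p_expected, 1 - p_expected)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "kappa": kappa,
    }


def cost_per_record(total_cost_usd: float, n_records: int) -> float:
    """Average cost per record, assuming cost is spread evenly."""
    return total_cost_usd / n_records


if __name__ == "__main__":
    # Invented counts for illustration only; not data from the trial.
    example = BinaryCounts(tp=40, fp=10, fn=20, tn=130)
    for name, value in binary_metrics(example).items():
        print(f"{name}: {value:.3f}")

    # $0.77 total and 8968 records are the figures quoted in the abstract.
    print(f"cost per record: ${cost_per_record(0.77, 8968):.6f}")
```

Kappa is reported alongside accuracy because toxicity records are typically imbalanced: when most records contain no toxicity, raw accuracy can be high by chance alone, which kappa corrects for.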
Free full-text papers with the same keywords (based on this paper's MeSH terms and keywords)
- A Phase I Study of Hydroxychloroquine and Suba-Itraconazole in Men with Biochemical Relapse of Prostate Cancer (HITMAN-PC): Dose Escalation Results.
- Self-management of male urinary symptoms: qualitative findings from a primary care trial.
- Clinical and Liquid Biomarkers of 20-Year Prostate Cancer Risk in Men Aged 45 to 70 Years.
- Diagnostic accuracy of Ga-PSMA PET/CT versus multiparametric MRI for preoperative pelvic invasion in the patients with prostate cancer.
- Comprehensive analysis of androgen receptor splice variant target gene expression in prostate cancer.
- Clinical Presentation and Outcomes of Patients Undergoing Surgery for Thyroid Cancer.