
Large language models for toxicity extraction in oncology trials: A real-world benchmark in prostate radiotherapy.

Radiotherapy and Oncology: Journal of the European Society for Therapeutic Radiology and Oncology, 2026, Vol. 216, p. 111348


Mastroleo F, Borras-Osorio M, Patel SP, Peterson S, Wilson R, Zhou M


Cite this paper
APA: Mastroleo F, Borras-Osorio M, et al. (2026). Large language models for toxicity extraction in oncology trials: A real-world benchmark in prostate radiotherapy. Radiotherapy and Oncology: Journal of the European Society for Therapeutic Radiology and Oncology, 216, 111348. https://doi.org/10.1016/j.radonc.2025.111348
MLA: Mastroleo F, et al. "Large language models for toxicity extraction in oncology trials: A real-world benchmark in prostate radiotherapy." Radiotherapy and Oncology: Journal of the European Society for Therapeutic Radiology and Oncology, vol. 216, 2026, p. 111348.
PMID: 41419026

Abstract

[BACKGROUND] Accurate toxicity assessment is critical in oncology trials, yet current reporting frameworks such as the Common Terminology Criteria for Adverse Events (CTCAE) remain labor-intensive and subject to inter-observer variability. Large language models (LLMs) offer potential to automate extraction and grading of adverse events from clinical notes and patient-reported outcomes (PROs), but their comparative performance and cost-effectiveness remain underexplored.

[METHODS] We evaluated five off-the-shelf LLMs (Gemini 2.0 Flash, Gemini 2.5 Flash, Gemini 2.5 Pro, GPT-4o, and GPT-5) using a rule-augmented few-shot prompting strategy to extract CTCAE-graded gastrointestinal and genitourinary toxicities from a prospective prostate radiotherapy trial (NCT02874014; n = 55 patients, 8968 toxicity records). Binary and grade-level accuracy, precision, recall, specificity, F1 score, Cohen's kappa, and computational costs were assessed.
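The binary and grade-level agreement metrics named in the methods can be illustrated with a toy computation. The sketch below treats toxicity presence as CTCAE grade ≥ 1 and computes Cohen's kappa on exact grade agreement; the grade lists are illustrative placeholders, not data from the trial.

```python
from collections import Counter

def binary_and_kappa(gold, pred):
    """Binary accuracy (toxicity present: grade >= 1) and Cohen's kappa
    on exact-grade agreement, from paired lists of CTCAE grades."""
    assert len(gold) == len(pred) and gold
    n = len(gold)
    # Binary accuracy: agreement on presence/absence of any toxicity
    binary_acc = sum((g >= 1) == (p >= 1) for g, p in zip(gold, pred)) / n
    # Observed exact-grade agreement
    po = sum(g == p for g, p in zip(gold, pred)) / n
    # Expected chance agreement from each rater's marginal grade frequencies
    gc, pc = Counter(gold), Counter(pred)
    pe = sum(gc[k] * pc[k] for k in gc) / (n * n)
    kappa = (po - pe) / (1 - pe) if pe < 1 else 1.0
    return binary_acc, kappa

# Hypothetical grades (0-3) for eight toxicity records
gold = [0, 0, 1, 2, 0, 3, 1, 0]
pred = [0, 1, 1, 2, 0, 2, 1, 0]
acc, kappa = binary_and_kappa(gold, pred)
# acc = 0.875; kappa ≈ 0.64 ("moderate" to "substantial" agreement)
```

In the paper's setting this calculation would run per toxicity record across all 8968 entries, with kappa reported separately for binary presence and for exact grades.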

[RESULTS] All models achieved high binary accuracy (84.6-87.4 %) and moderate grade accuracy (79.1-82.3 %). GPT-4o reached the best binary (87.4 %) and grade (83.5 %) accuracy, while Gemini 2.5 Pro demonstrated highest sensitivity (74.0 %). Specificity peaked with GPT-4o (96.0 %). Cohen's kappa values indicated moderate agreement (0.552-0.560 for binary; 0.401-0.465 for grades). Costs for the entire extraction varied substantially: Gemini 2.0 Flash delivered competitive accuracy at $0.77 total, whereas Gemini 2.5 Pro and GPT-5 exceeded $21.

[CONCLUSIONS] Off-the-shelf LLMs can extract clinically relevant toxicities with performance approaching human inter-rater reliability, at variable but often negligible costs. While grade-level accuracy remains limited, LLM integration into oncology workflows is feasible, offering scalable, low-cost support for toxicity monitoring and data abstraction in clinical research.
