Leveraging large language models to populate structured clinical case report forms from unstructured medical notes in radiation oncology.
Case report
TL;DR
Large language models can automatically extract and structure data from unstructured medical notes in an average of 16 s per note; comparison with routinely collected data also revealed inaccuracies in the routine ground truth.
OpenAlex topics ·
Topic Modeling
Artificial Intelligence in Healthcare and Education
Machine Learning in Healthcare
APA
Nachbar, M., Yi, N., et al. (2026). Leveraging large language models to populate structured clinical case report forms from unstructured medical notes in radiation oncology. Clinical and Translational Radiation Oncology, 58, 101143. https://doi.org/10.1016/j.ctro.2026.101143
MLA
Nachbar, Marcel, et al. "Leveraging large language models to populate structured clinical case report forms from unstructured medical notes in radiation oncology." Clinical and Translational Radiation Oncology, vol. 58, 2026, p. 101143.
PMID
41859030
Abstract
[BACKGROUND AND PURPOSE] Large language models (LLMs) have shown growing potential for clinical text processing, but their systematic application in radiation oncology, especially for non-English clinical documentation, remains underexplored. This study investigated whether pretrained LLMs can automatically extract, analyze, and structure radiotherapy-relevant information from routine unstructured medical notes, with the goal of supporting automated population of electronic case report forms (eCRFs).
[MATERIALS AND METHODS] This study examined prostate cancer patients treated with the MR-Linac, for whom ground truth data exist in the MOMENTUM database. A total of 100 patients were included, with 90 used for prompt development and 10 for independent testing. Medical notes were extracted, anonymized, and categorized by time points. The Llama-3.1-8b model was used, with prompts designed using chain-of-thought (CoT) logic with five in-context examples. The model output was post-processed, and extracted data was compared against ground truth.
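The prompting approach described above (chain-of-thought reasoning plus in-context examples, followed by post-processing of the model output) can be sketched roughly as below. This is a minimal illustration only: the field names, example note, and JSON output convention are hypothetical assumptions, and the paper's actual prompts, eCRF schema, and Llama-3.1-8b inference call are not reproduced here.

```python
# Sketch of a CoT extraction prompt with in-context examples and a
# post-processing step that recovers structured fields from free text.
# All field names and example content are hypothetical.
import json

IN_CONTEXT_EXAMPLES = [
    {
        "note": "Patient treated with 5 x 7.25 Gy on the MR-Linac; PSA 6.2 ng/ml.",
        "reasoning": "The note states five fractions of 7.25 Gy and a PSA value.",
        "fields": {"fractions": 5, "dose_per_fraction_gy": 7.25, "psa_ng_ml": 6.2},
    },
    # ... the study used five such examples per prompt
]

def build_prompt(note: str) -> str:
    """Assemble a CoT prompt: task instruction, worked examples, then the new note."""
    parts = ["Extract the radiotherapy fields below as JSON. Think step by step first."]
    for ex in IN_CONTEXT_EXAMPLES:
        parts.append(f"Note: {ex['note']}")
        parts.append(f"Reasoning: {ex['reasoning']}")
        parts.append(f"Answer: {json.dumps(ex['fields'])}")
    parts.append(f"Note: {note}")
    parts.append("Reasoning:")
    return "\n".join(parts)

def postprocess(model_output: str) -> dict:
    """Pull the final JSON object out of the model's free-text CoT answer."""
    start = model_output.rfind("{")
    end = model_output.rfind("}")
    if start == -1 or end < start:
        return {}
    try:
        return json.loads(model_output[start:end + 1])
    except json.JSONDecodeError:
        return {}
```

In practice the prompt string would be sent to a locally hosted Llama-3.1-8b instance, and `postprocess` would map the parsed dictionary onto the eCRF fields.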
[RESULTS] Medical notes were successfully processed, with predicted values generated in an average time of 16 s per note. The LLM achieved matching accuracies of 83.6% and 83.8% on the development and testing datasets, respectively. Analysis revealed that the model disagreed with specific values in 8.1% of development dataset cases and 8.6% of testing dataset cases. An independent manual review conducted before model evaluation showed that approximately 7.5% of routinely collected test data did not match the reviewed values, indicating inaccuracies in the routinely acquired ground truth.
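The evaluation above distinguishes between fields the model matched, fields where it disagreed with a specific ground-truth value, and fields it left unanswered. A simple per-field scorer along those lines could look like the following; the three-way outcome labels are an assumption, since the paper's exact matching rules are not given here.

```python
# Sketch of field-level scoring against ground truth: matching accuracy,
# explicit-value disagreement rate, and missing-prediction rate.
def score(predicted: dict, ground_truth: dict) -> dict:
    """Compare extracted fields to ground truth and return per-category rates."""
    match = disagree = missing = 0
    for field, true_value in ground_truth.items():
        if field not in predicted or predicted[field] is None:
            missing += 1          # model produced no value for this field
        elif predicted[field] == true_value:
            match += 1            # counted toward matching accuracy
        else:
            disagree += 1         # model committed to a different value
    total = len(ground_truth)
    return {
        "accuracy": match / total,
        "disagreement_rate": disagree / total,
        "missing_rate": missing / total,
    }
```

Note that such a scorer only flags disagreements; the study's finding that ~7.5% of the routine "ground truth" itself was inaccurate required a separate manual review of the reference data.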
[CONCLUSION] This study demonstrated the effectiveness of LLMs in structuring clinical data from non-English medical notes, with high accuracy in extracting and categorizing information. While multi-institutional validation is needed, the results indicate a significant healthcare impact through efficient data management, processing notes in 16 s and accurately populating CRFs with minimal staff involvement.