Using large language models for clinical staging of colorectal cancer from imaging reports: a pilot study.
APA
Kim JS, Baek SJ, et al. (2025). Using large language models for clinical staging of colorectal cancer from imaging reports: a pilot study. Annals of Surgical Treatment and Research, 109(5), 318-327. https://doi.org/10.4174/astr.2025.109.5.318
PMID: 41255477
Abstract
[PURPOSE] Accurate data collection and analysis are crucial in clinical research, particularly for extracting information from unstructured medical records in cancer research. Traditional methods often struggle with this task. Large language models (LLMs) specializing in natural language processing (NLP), like ChatGPT (OpenAI), show potential for automating this process. This study evaluated whether GPT-4 could accurately extract clinical stages of colorectal cancer (CRC) from imaging reports.
[METHODS] Using specific prompts based on the American Joint Committee on Cancer TNM staging system, GPT-4 was tested on the unstructured abdominal imaging reports of 100 CRC patients. The results were evaluated by a colorectal surgical oncologist and compared with data manually extracted by a nonspecialist data manager.
[RESULTS] GPT-4 demonstrated high accuracy in extracting lesion locations (96.0%) and T (89.0%), N (90.0%), and M (85.0%) stages, with an overall TNM stage extraction accuracy of 69.0%. The combined accuracy for TNM stage and lesion location was 67.0%. Human data managers had similar TNM stage accuracy but lower lesion-location accuracy (76.0%). Higher accuracy was observed when reports directly mentioned stages and were in English only.
[CONCLUSION] This study confirms that LLM-based NLP, with proper prompt engineering, can accurately extract clinical stages from CRC imaging reports, particularly in English-only contexts.
INTRODUCTION
In clinical research, accurately collecting and analyzing patient data is crucial. Traditionally, data collection has relied on manual processes, and even with clinical data warehouses (CDWs), extracting information from unstructured medical records requires significant human effort. The process is laborious, time-consuming, and costly, posing obstacles to effective clinical research [1,2]. Data managers need to be well versed in medical terminology and diseases, necessitating extensive training. Consistency in data collection is also challenging, requiring strict criteria and definitions to ensure high-quality databases. In cancer research, including colorectal cancer (CRC), data extraction errors are reduced when semi-structured records, such as pathology reports or functional test reports for cardiac or pulmonary assessments, are used. In contrast, collecting data from free-form medical records, such as imaging reports, surgical notes, medical charts, nursing records, and unstructured questionnaires, is difficult and often inaccurate, which reduces the reliability of clinical research outcomes [3-6].
With the increasing use of big data in clinical research, there is a growing need for methods that automatically extract data from unstructured medical records written in natural language. Numerous studies have reported the application of various artificial intelligence (AI) models to natural language processing (NLP) [3,4,6-9]. However, traditional NLP methods face significant challenges in clinical settings, including the burden of extensive preprocessing and pretraining, limited flexibility in understanding linguistic context or accurately representing data, and scalability problems stemming from difficult maintenance and expansion and from performance degradation.
In comparison, large language models (LLMs) are specifically designed for human conversation and language generation, making them better suited to NLP tasks than earlier AI models. Since the introduction of ChatGPT by OpenAI in November 2022, efforts have been made to apply LLMs in various fields, including medicine. Research using LLMs encompasses disease occurrence and prognosis prediction, medical imaging analysis, clinical decision support system development, virtual simulations for medical education and training, and automated content creation [10-27]. Among these applications, NLP using LLMs holds particular promise for clinical settings because of their superior contextual understanding and usability. Unlike traditional NLP methods, LLMs require minimal preprocessing and can be effectively employed through prompt engineering or fine-tuning with small amounts of labeled data, eliminating the need for extensive labeling of medical records. Despite this potential, there are relatively few studies on the use of LLMs in clinical settings. In particular, extracting clinical stages from the medical records of cancer patients remains an underexplored task due to the clinical expertise required during the prompt engineering phase [28]. This study aims to determine whether LLMs can accurately extract CRC clinical stages from imaging reports written in natural language.
METHODS
This retrospective single-arm study was conducted in April 2024 and was approved by the Institutional Review Board of Korea University Anam Hospital (No. 2024AN0254). We utilized GPT-4 (Legacy model, Turbo version), implemented in ChatGPT (OpenAI, Inc.), as a representative LLM to extract tumor location and clinical TNM (cTNM) stages from preoperative abdominal CT reports of patients with CRC at the time of their initial diagnosis. The reports were unstructured free-text documents prepared by more than 5 different radiologists. While the imaging reports were primarily written in English, some included mixed Korean, and the lengths of the reports varied significantly. In this context, clinical TNM stage refers to radiologic TNM rather than pathologic TNM. The TNM staging system is an internationally recognized standard system for assessing cancer progression. TNM represents tumor (T), nodes (N), and metastasis (M), systematically categorizing the primary tumor, the involvement of regional lymph nodes (LNs), and the presence of distant metastases.
This study used preoperative imaging reports from CRC patients who underwent resection at the Division of Colon and Rectal Surgery, Korea University Anam Hospital. A colorectal surgical oncologist manually extracted the lesion location and TNM stage from the reports and selected those suitable for the study. Reports with tumor locations in the cecum, ascending colon, descending colon, sigmoid colon, and rectum were included. For simplicity in classification, less common lesion locations such as the appendix, hepatic flexure, transverse colon, splenic flexure, rectosigmoid junction, and anal canal were excluded. Additionally, reports containing multiple lesions, presenting more than 1 stage for a single lesion, or lacking any mention of the primary colonic lesion were excluded. However, reports without mentions of LNs or metastasis were included, and the absence of such mentions was evaluated as N0 and M0, respectively. A total of 100 imaging reports meeting these criteria were selected and used for the study.
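The selection criteria above can be sketched as a simple filter. This is an illustrative reconstruction only: the study's screening was performed manually by a colorectal surgical oncologist, and the field names below are hypothetical.

```python
# Hypothetical sketch of the report-selection criteria; the actual
# screening in the study was done manually by a surgical oncologist.
INCLUDED_LOCATIONS = {"cecum", "ascending colon", "descending colon",
                      "sigmoid colon", "rectum"}

def is_eligible(report: dict) -> bool:
    """Apply the inclusion/exclusion criteria to one annotated report."""
    if report["location"] not in INCLUDED_LOCATIONS:
        return False  # appendix, flexures, transverse colon, etc., excluded
    if report["n_lesions"] != 1 or report["n_stages_for_lesion"] > 1:
        return False  # multiple lesions or ambiguous staging excluded
    return True

def with_defaults(report: dict) -> dict:
    """Reports lacking LN or metastasis mentions are read as N0 / M0."""
    report.setdefault("N", "N0")
    report.setdefault("M", "M0")
    return report
```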
Prompt engineering was performed to assign GPT-4 the role of data manager in the colorectal division, instructing it to extract the lesion location and TNM stage from reports according to specific rules (Appendix). To minimize potential errors commonly associated with LLMs (e.g., hallucination, misinterpretation, omission), we provided highly detailed extraction rules and a predefined output format. The extraction rules were as follows: (1) use the TNM stage if it is directly recorded in the report; (2) if it is not directly recorded but the definition of a TNM stage is included, use the corresponding stage; and (3) if neither a direct TNM stage nor a definition is included but imaging findings or descriptions correspond to a TNM stage, use the corresponding stage. The TNM staging definitions and regional LNs (based on the CRC lesion location) from the 8th edition of the American Joint Committee on Cancer (AJCC) guideline were provided as references for rule 2, along with radiological expressions for each T and N stage for rule 3. The specified extraction format was: (1) lesion location (1 word), (2) TNM stage, and (3) reason for extracting the TNM stage. Each report was processed in a new conversation to prevent carryover from previous extractions; each run combined the role assignment, extraction rules, extraction format, and report in a single message.
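The single-message structure described above (role, rules, references, format, report) can be sketched as follows. The rule and format wording is abbreviated and illustrative, not the study's verbatim prompt, which is given in the Appendix.

```python
# Illustrative sketch of assembling one self-contained prompt per report,
# mirroring the study's design of a fresh conversation for each report.
ROLE = "You are a data manager in the colorectal surgery division."

RULES = """Extract the lesion location and clinical TNM stage.
1. If the TNM stage is directly recorded in the report, use it.
2. Otherwise, if a TNM stage definition appears in the report, use the matching stage.
3. Otherwise, map the imaging findings to a TNM stage using the reference
   definitions and radiological expressions provided below."""

OUTPUT_FORMAT = """Answer in exactly this format:
1. Lesion location (one word)
2. TNM stage
3. Reason for the extracted TNM stage"""

def build_prompt(report_text: str, reference_tables: str) -> str:
    """Combine role, rules, AJCC references, format, and report in one message."""
    return "\n\n".join([ROLE, RULES, reference_tables, OUTPUT_FORMAT,
                        "Report:\n" + report_text])
```

Sending each such prompt in a new conversation prevents one report's extraction from influencing the next.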
GPT-4’s extracted results were evaluated to determine lesion location, T stage, N stage, M stage, TNM stage, and overall accuracy (Fig. 1). The manually extracted data, provided by a colorectal surgical oncologist, served as the reference standard for the assessment. If GPT-4’s answers matched the reference exactly or were broadly similar (e.g., M1b vs. M1), they were considered correct. This comparison determined GPT-4’s extraction accuracy, which was also compared with results previously extracted by a nonspecialist human data manager. The extraction accuracy for each item was also analyzed based on lesion location (colon or rectum), the inclusion of direct mentions of the T and N stages, and the use of mixed language (English and Korean) in the reports.
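The "exact or broadly similar" matching criterion (e.g., M1b counted as correct against M1) could be scored as below. This is one plausible reading of the criterion, not the study's actual scoring code; here a prediction matches when it equals the reference or when one answer is the coarser form of the other.

```python
def stage_matches(pred: str, ref: str) -> bool:
    """Exact match, or one answer is the coarser form of the other
    (e.g., 'M1b' vs. 'M1') - one reading of 'broadly similar'."""
    if pred == ref:
        return True
    return pred.startswith(ref) or ref.startswith(pred)

def accuracy(pairs) -> float:
    """Fraction of (predicted, reference) pairs judged correct."""
    hits = sum(stage_matches(p, r) for p, r in pairs)
    return hits / len(pairs)
```

Note that under this rule conflicting substages (e.g., T4a vs. T4b) still count as errors, while a missing substage letter does not.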
RESULTS
Table 1 shows the features of the CRC lesions: 58 colon cancer cases and 42 rectal cancer cases. Stage T3 was the most common, with 57 cases of N1 or N2 and 43 cases of N0. Most cases had no intra-abdominal metastasis. The median length of the imaging reports was 94 words (Q1, 73 words; Q3, 119 words) as shown in Table 2. Fifty-five cases had direct mentions of the T or N stages in the reports, and 35 reports mixed Korean and English.
Table 3 presents the accuracy of the lesion locations and TNM stages extracted by GPT-4 and the human data manager. GPT-4 showed 96.0% accuracy in lesion-location extraction and 89.0%, 90.0%, and 85.0% for the T, N, and M stages, respectively. Accuracy was higher for advanced T stages (T1, 50.0%; T2, 87.5%; T3, 91.4%; and T4, 90.0%) and positive LN cases (96.5%). GPT-4 had lower accuracy for intra-abdominal metastasis extraction (70.6%). Overall, it correctly extracted the TNM stage in 69.0% of cases and both the TNM stage and lesion location in 67.0%.
The human data manager had TNM stage extraction accuracy similar to that of GPT-4, but significantly lower lesion-location accuracy (76.0%, P < 0.001), resulting in only 54.0% of cases with both the TNM stage and lesion location extracted correctly. The human data manager’s substage accuracy was higher for stages T2 and T3 but lower for T1 and T4 (T1, 75%; T2, 81.3%; T3, 91.4%; and T4, 60.0%). Despite these differences, there was no statistically significant difference in overall T stage extraction accuracy between the human data manager and GPT-4. The human data manager also showed no significant differences in N stage accuracy but had larger discrepancies for M stages compared to GPT-4 (M0, 97.6%; M1, 64.7%).
The subgroup analyses revealed that GPT-4’s extraction accuracy did not differ based on the word count of the imaging reports or the cancer location (colon vs. rectum) (Table 4). However, the T and TNM stage accuracy was significantly higher when direct mentions of these stages were included in the reports (T stage, 96.4% vs. 80.0%, P = 0.008; TNM stage, 80.0% vs. 55.6%, P = 0.012). Although the difference was not statistically significant, the N stage accuracy trended higher when direct mentions were present (94.5% vs. 84.4%, P = 0.088). M stage accuracy did not differ between the groups.
Reports written exclusively in English showed significantly higher accuracy for T stage, TNM stage, and overall extraction than mixed-language reports (mixed vs. English-only: T stage, 77.1% vs. 95.4%, P = 0.005; TNM stage, 54.3% vs. 76.9%, P = 0.019; overall accuracy, 51.4% vs. 75.4%, P = 0.015). Lesion-location accuracy did not differ statistically but trended higher in English-only reports (91.4% vs. 98.5%, P = 0.089). N and M stage accuracy did not differ between the groups.
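The paper does not state which statistical test it used; as an illustration, a subgroup comparison like the one above can be checked with a pooled two-proportion z-test. The counts below (27/35 mixed-language, 62/65 English-only for T stage) are back-calculated from the reported percentages and group sizes, so treat them as an assumption.

```python
from math import erf, sqrt

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for H0: p1 == p2, using the pooled z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# T stage accuracy: mixed-language 27/35 (77.1%) vs. English-only 62/65 (95.4%)
p_value = two_proportion_z(27, 35, 62, 65)
```

With these back-calculated counts, the resulting p-value falls near the paper's reported P = 0.005, consistent with a test of this kind.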
DISCUSSION
Our study demonstrates that, with appropriate prompt engineering, LLMs can accurately extract clinical stages from imaging reports written in natural medical language. The extraction accuracy of the LLM was comparable to that of a human data manager, and it was notably higher when the imaging reports included direct mentions of the stage or were written solely in English.
Since the introduction of ChatGPT, various medical studies have explored LLM applications [10-27]. However, many of them have focused on single, straightforward tasks. For example, a previous study from our institution involved explaining the condition of CRC patients to GPT-3.5 and asking it to suggest treatment plans and then comparing those suggestions with previous multidisciplinary team results [13]. Although that might seem complex, the structure of the task was relatively simple because the state of the CRC patient was pre-assessed by human doctors, and GPT-3.5 did not influence that judgment. In contrast, our study required GPT-4 to complete a 2-step process: first, it had to process natural language inputs from imaging reports and output them in a structured format for lesion location and TNM staging; second, it had to use predefined rules to stage the patients from report descriptions. This study confirms that LLMs can complete such a complex, 2-stage task. Additionally, the fact that such tasks can be accomplished solely through prompt engineering is a significant advantage of LLMs.
We assigned GPT-4 the role of a data manager in the colorectal division, instructing it to extract the TNM stage and providing it with rules for that task. Notably, we found that most imaging reports do not use the exact definitions from the AJCC guidelines. For example, T3 is defined as “tumor invades through the muscularis propria into pericolorectal tissues,” but reports seldom use that phrasing. Instead, they describe it morphologically, such as “smooth or nodular extension of a discrete mass of tumor tissue beyond the contour of the bowel wall with extension into pericolic fat.” By including such radiological descriptions in the rules, we could improve the accuracy of the LLM’s judgment. This prompt engineering approach can be widely applied to other cancer types as well. We also specified the format for extracting answers. Without this, LLMs tend to respond in sentence form, which can cause readability issues. The formatted outputs can be converted into CSV, JSON, or Excel files for statistical analysis, indicating that such LLM capabilities could significantly enhance CDWs by extracting valuable information from unstructured data. However, when processing large volumes of medical data, practical considerations arise, such as the need to implement application programming interface (API)-based methods instead of relying solely on dialog-box-driven prompt engineering. This remains a separate topic for further discussion.
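The conversion of the formatted answers into analysis-ready files could look like the sketch below. The three-line answer layout follows the extraction format described in the Methods, but the parsing details are illustrative, not the study's pipeline.

```python
import csv
import io
import re

def parse_formatted_answer(text: str) -> dict:
    """Parse the three numbered fields the prompt asks the model to return.
    The numbered-answer layout is illustrative, not the study's verbatim output."""
    fields = {}
    labels = {"1": "location", "2": "tnm", "3": "reason"}
    for line in text.splitlines():
        m = re.match(r"\s*(\d)\.\s*(.+)", line)
        if m and m.group(1) in labels:
            fields[labels[m.group(1)]] = m.group(2).strip()
    return fields

def to_csv(answers: list[dict]) -> str:
    """Write parsed answers as CSV text, ready for statistical software."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["location", "tnm", "reason"])
    writer.writeheader()
    writer.writerows(answers)
    return buf.getvalue()
```

For large volumes of reports, the same parsing would sit behind an API-based loop rather than a manual dialog box, as the paragraph above notes.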
In our study, GPT-4 demonstrated high accuracy in extracting TNM stage (85%–90%) and lesion location (96%), comparable to an experienced human data manager. The higher accuracy in lesion-location extraction is likely because GPT-4 referenced only the report content, whereas human data managers incorporate additional records, such as colonoscopy and digital rectal examinations, that introduce variability. Similarly, the accuracy difference in M staging could stem from the inclusion of other imaging examinations. The key point is that GPT-4 accurately extracted clinical stages based solely on the provided examination results.
In this study, the extraction accuracy for lower T and N stages was lower, likely due to frequent false positives in descriptions of tumor depth or node metastasis. Negative prediction is more challenging than positive prediction for T and N stages [28]. For example, reports of early CRC (Tis or T1) mention only the presence of the primary tumor without specifying its depth, often leading to over-staging. For LNs, nonspecific mentions such as a “visible” LN, equivocal or benign-sounding descriptions, or the absence of any nodal mention often caused false positives. However, GPT-4 sometimes correctly identified negatives, suggesting that it considers the overall context of the report rather than just specific keywords.
Unlike the T and N stages, GPT-4 showed better negative M stage prediction than positive. Most of the errors in identifying metastatic cases occurred because the system did not classify non-regional LN metastasis, such as paraaortic LN, as M1, instead evaluating it as N positive, or because it made mistakes in the substaging of M1. These issues likely arise from insufficient prompt engineering details rather than inherent ambiguity in the reports or problems with the LLM’s comprehension. This issue could likely be improved by offering more specific and detailed explanations at the prompting stage.
The imaging reports used in this study were descriptive texts freely written by multiple radiologists without a standardized format or restrictions, resulting in heterogeneity in length, language usage, and structure. Therefore, we also examined factors that influenced GPT-4’s extraction accuracy for TNM staging: advances in the T, N, and M stages, lesion location (colon vs. rectum), direct stage mentions in reports, and mixed-language usage [6]. Direct mentions of T and N stages and English-only reports significantly improved extraction accuracy. As expected, reports with direct T and N stage mentions had high accuracy, with the rare errors caused by discrepancies between the descriptions and the stated stages. Because reports rarely present the M stage directly, those 2 groups did not appear to differ. The mixed-language reports showed lower accuracy for lesion location and T stage, with GPT-4 often suggesting higher stages than the reference.
Our study has limitations. First, discrepancies between the described findings and the T and N stages stated in the reports could affect accuracy. Under the working rules, if the TNM stage was presented in the report, the LLM was instructed to use that stage, so any deviation was considered incorrect; had the descriptive content been acknowledged more broadly, the accuracy might have been higher. Second, colorectal specialists determined the reference TNM stages, which could themselves be erroneous; reviewing instances in which the LLM produced ostensibly incorrect results revealed some reference errors. Third, properly evaluating the M stage requires comprehensively assessing metastasis to organs beyond the abdomen, such as the lungs, using various imaging results. However, this study evaluated extraction accuracy based only on metastasis confirmed in abdominal imaging. The human data manager might have reviewed multiple examinations comprehensively to extract the M stage, which could have led to some cases being judged incorrect. Future studies should incorporate multimodal evaluations.
Despite those limitations, our study is unique in using LLMs to complete complex, 2-step tasks involving judgment and showing that they achieved high accuracy comparable to that of experienced human data managers. The outputs can be directly applied to database construction and other practical uses. Moreover, this approach offers high potential for scalability and generalizability, as it can be extended beyond CRC to other cancer types, imaging modalities, report formats, and languages.
In conclusion, our study confirms that using LLM-based NLP to extract clinical stages for cancers such as CRC is feasible and highly accurate. Proper prompt engineering, including stage definitions and radiological descriptions, and the use of English-only reports enhances accuracy. Future efforts should simplify prompt engineering and integrate multimodal evaluations to facilitate the easy and direct application of LLM-based NLP in clinical settings.
Our study demonstrates that, with appropriate prompt engineering, LLMs can accurately extract clinical stages from imaging reports written in natural medical language. The extraction accuracy of the LLM was comparable to that of a human data manager, and it was notably higher when the imaging reports included direct mentions of the stage or were written solely in English.
Since the introduction of ChatGPT, various medical studies have explored LLM applications [10-27]. However, many have focused on single, straightforward tasks. For example, a previous study from our institution described the conditions of CRC patients to GPT-3.5, asked it to suggest treatment plans, and compared those suggestions with prior multidisciplinary team decisions [13]. Although that might seem complex, the task structure was relatively simple because the state of each CRC patient had been pre-assessed by human doctors, and GPT-3.5 did not influence that judgment. In contrast, our study required GPT-4 to complete a 2-step process: first, it had to process natural-language input from imaging reports and output lesion location and TNM staging in a structured format; second, it had to apply predefined rules to stage patients from the report descriptions. This study confirms that LLMs can complete such a complex, 2-step task. Additionally, the fact that such tasks can be accomplished solely through prompt engineering is a significant advantage of LLMs.
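The 2-step prompt structure described above can be sketched as follows. This is a minimal illustration only: the role text, the abridged staging rules, the output format, and the sample report are hypothetical stand-ins, not the prompts actually used in the study.

```python
# Illustrative sketch of a 2-step extraction prompt: (1) rules and a
# structured output format are given as the system message, (2) the raw
# imaging report is given as the user message. All text here is a
# hypothetical stand-in for the study's actual prompts.

ROLE = "You are a data manager in the colorectal surgery division."

RULES = """\
Staging rules (abridged, illustrative):
- If the report states a TNM stage directly, use that stage.
- T3: tumor extends beyond the bowel wall into pericolic/perirectal fat.
- N1: 1-3 regional lymph node metastases.
- M1: distant metastasis (liver, lung, peritoneum, non-regional nodes).
"""

OUTPUT_FORMAT = """\
Answer in exactly this format:
Location: <colon or rectum>
T: <Tis/T1/T2/T3/T4>
N: <N0/N1/N2>
M: <M0/M1>
"""

def build_messages(report_text: str) -> list[dict]:
    """Compose chat messages: role + rules + output format, then the report."""
    system = "\n\n".join([ROLE, RULES, OUTPUT_FORMAT])
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Imaging report:\n{report_text}"},
    ]

msgs = build_messages(
    "Rectal mass with extension into perirectal fat; "
    "two enlarged mesorectal nodes; no distant metastasis."
)
```

The messages could then be sent to any chat-style LLM endpoint; only the prompt composition is shown here.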
We assigned GPT-4 the role of a data manager in the colorectal division, instructing it to extract the TNM stage and providing it with rules for that task. Notably, we found that most imaging reports do not use the exact definitions from the AJCC guidelines. For example, T3 is defined as “tumor invades through the muscularis propria into pericolorectal tissues,” but reports seldom use that phrasing. Instead, they describe it morphologically, such as “smooth or nodular extension of a discrete mass of tumor tissue beyond the contour of the bowel wall with extension into pericolic fat.” By including such radiological descriptions in the rules, we could improve the accuracy of the LLM’s judgment. This prompt engineering approach can be widely applied to other cancer types as well. We also specified the format for extracting answers. Without this, LLMs tend to respond in sentence form, which can cause readability issues. The formatted outputs can be converted into CSV, JSON, or Excel files for statistical analysis, indicating that such LLM capabilities could significantly enhance clinical data warehouses (CDWs) by extracting valuable information from unstructured data. However, when processing large volumes of medical data, practical considerations arise, such as the need to implement application programming interface (API)-based methods instead of relying solely on dialog-box-driven prompt engineering. This remains a separate topic for further discussion.
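Converting the formatted output into analysis-ready files is straightforward once the format is fixed. The sketch below assumes a simple `key: value` line format (the study's exact output format is not reproduced here) and shows the conversion to JSON and CSV mentioned above:

```python
import csv
import io
import json

def parse_llm_output(text: str) -> dict:
    """Parse 'key: value' lines (the assumed output format) into a dict."""
    record = {}
    for line in text.strip().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            record[key.strip()] = value.strip()
    return record

def records_to_csv(records: list[dict]) -> str:
    """Serialize parsed records to CSV text for statistical analysis."""
    fieldnames = list(records[0])
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

raw = "Location: rectum\nT: T3\nN: N1\nM: M0"
rec = parse_llm_output(raw)
print(json.dumps(rec))        # JSON record for a clinical data warehouse
print(records_to_csv([rec]))  # CSV for spreadsheet or statistical software
```

Real pipelines would add validation (e.g., rejecting values outside the allowed stage vocabulary) before loading records into a CDW.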
In our study, GPT-4 demonstrated high accuracy in extracting TNM stage (85%–90%) and lesion location (96%), comparable to an experienced human data manager. The higher accuracy in lesion-location extraction is likely because GPT-4 referenced only the report content, whereas human data managers incorporate additional records, such as colonoscopy and digital rectal examination findings, which introduce variability. Similarly, the accuracy difference in M staging could stem from the inclusion of other imaging examinations. The key point is that GPT-4 accurately extracted clinical stages based solely on the provided examination results.
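The accuracy figures above illustrate why the overall TNM accuracy (69%) is lower than any single-stage accuracy: a case counts as correct only when T, N, and M all match the reference. A minimal sketch of this evaluation logic, using hypothetical toy data rather than the study's records:

```python
def field_accuracy(extracted: list[dict], reference: list[dict], field: str) -> float:
    """Fraction of cases where one extracted field matches the reference."""
    hits = sum(e[field] == r[field] for e, r in zip(extracted, reference))
    return hits / len(reference)

def overall_tnm_accuracy(extracted: list[dict], reference: list[dict]) -> float:
    """Fraction of cases where T, N, and M all match simultaneously.
    Requiring all three to match is why overall accuracy is lower than
    any individual stage accuracy."""
    hits = sum(all(e[f] == r[f] for f in ("T", "N", "M"))
               for e, r in zip(extracted, reference))
    return hits / len(reference)

# Hypothetical toy data: case 2 has an over-staged T.
extracted = [{"T": "T3", "N": "N1", "M": "M0"},
             {"T": "T3", "N": "N0", "M": "M0"}]
reference = [{"T": "T3", "N": "N1", "M": "M0"},
             {"T": "T2", "N": "N0", "M": "M0"}]
```

Here N-stage accuracy is 1.0 but overall TNM accuracy is 0.5, because the single T error fails the whole case.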
In this study, the extraction accuracy for lower T and N stages was lower, likely due to frequent false positives in descriptions of tumor depth or node metastasis. Negative prediction is more challenging than positive prediction for T and N stages [28]. For example, reports of early CRC (Tis or T1) mention only the presence of the primary tumor without specifying its depth, often leading to over-staging. For LNs, nonspecific mentions such as a “visible” LN, equivocal or benign-sounding descriptions, or the absence of any nodal mention often caused false positives. However, GPT-4 sometimes correctly identified negatives, suggesting that it considers the overall context of the report rather than just specific keywords.
Unlike the T and N stages, GPT-4 showed better negative M stage prediction than positive. Most of the errors in identifying metastatic cases occurred because the system did not classify non-regional LN metastasis, such as paraaortic LN, as M1, instead evaluating it as N positive, or because it made mistakes in the substaging of M1. These issues likely arise from insufficient prompt engineering details rather than inherent ambiguity in the reports or problems with the LLM’s comprehension. This issue could likely be improved by offering more specific and detailed explanations at the prompting stage.
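One way to address this error pattern is to make the non-regional LN rule explicit, either as an extra rule line in the prompt or as a post hoc check on the extracted fields. The keyword check below is purely illustrative (the term list and function are our own assumptions, not part of the study's method), and real reports would require far more robust handling of negation and context:

```python
# Illustrative post hoc check: flag cases where the report mentions
# non-regional lymph nodes (which the AJCC system classifies as M1 in CRC)
# but the extracted M stage is M0. Term list is a hypothetical example.
NON_REGIONAL_LN_TERMS = ("paraaortic", "para-aortic", "retroperitoneal lymph node")

def flag_possible_m1(report_text: str, extracted_m: str) -> bool:
    """Return True when a non-regional LN mention conflicts with an M0 extraction."""
    text = report_text.lower()
    mentions = any(term in text for term in NON_REGIONAL_LN_TERMS)
    return mentions and extracted_m == "M0"
```

Flagged cases could then be re-prompted with the explicit rule or routed to human review, rather than silently accepted.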
The imaging reports used in this study were descriptive texts freely written by multiple radiologists without a standardized format or restrictions, resulting in heterogeneity in length, language usage, and structure. Therefore, we also examined factors that influenced GPT-4’s extraction accuracy for TNM staging: the level of the T, N, and M stages, lesion location (colon vs. rectum), direct stage mentions in reports, and mixed-language usage [6]. Direct mentions of the T and N stages and English-only reports significantly improved extraction accuracy. As expected, reports with direct T and N stage mentions had high accuracy, with the rare errors caused by discrepancies between the descriptions and the stated stages. Because reports rarely state the M stage directly, accuracy did not appear to differ between reports with and without a direct M-stage mention. Mixed-language reports showed lower accuracy for lesion location and T stage, with GPT-4 often suggesting higher stages than the reference.
Our study has limitations. First, discrepancies between the described content and the T and N stages stated in the reports could affect accuracy. According to the working rules, if the TNM stage was presented in the report, the LLM was instructed to use that stage; therefore, any deviations were considered incorrect. However, if the content of the descriptions had been more broadly acknowledged, the accuracy might have been higher. Second, colorectal specialists evaluated the reference TNM stages, which could themselves be erroneous; reviewing instances in which the LLM outputs were scored as incorrect did reveal some reference errors. Third, evaluating the M stage properly requires a comprehensive assessment of metastasis to other organs, such as the lungs, in addition to abdominal organs, using various imaging results. However, this study evaluated extraction accuracy based only on metastasis confirmed in abdominal imaging. The human data manager might have reviewed multiple examinations comprehensively to extract the M stage, which could have led to some cases being scored as incorrect. Future studies should incorporate multimodal evaluation.
Despite those limitations, our study is unique in using an LLM to complete a complex, 2-step task involving judgment and in showing that it achieved accuracy comparable to that of experienced human data managers. The outputs can be directly applied to database construction and other practical uses. Moreover, this approach offers high potential for scalability and generalizability, as it can be extended beyond CRC to other cancer types, imaging modalities, report formats, and languages.
In conclusion, our study confirms that using LLM-based NLP to extract clinical stages for cancers such as CRC is feasible and highly accurate. Proper prompt engineering, including stage definitions and radiological descriptions, and the use of English-only reports enhance accuracy. Future efforts should simplify prompt engineering and integrate multimodal evaluations to facilitate the easy and direct application of LLM-based NLP in clinical settings.