Augmenting Large Language Model With Prompt Engineering and Supervised Fine-Tuning in Non-Small Cell Lung Cancer Tumor-Node-Metastasis Staging: Framework Development and Validation.
1/5 보강
[BACKGROUND] Accurate tumor node metastasis (TNM) staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC).
- 95% CI 0.850-0.959
APA
Jin R, Ling C, et al. (2026). Augmenting Large Language Model With Prompt Engineering and Supervised Fine-Tuning in Non-Small Cell Lung Cancer Tumor-Node-Metastasis Staging: Framework Development and Validation.. JMIR AI, 5, e77988. https://doi.org/10.2196/77988
MLA
Jin R, et al.. "Augmenting Large Language Model With Prompt Engineering and Supervised Fine-Tuning in Non-Small Cell Lung Cancer Tumor-Node-Metastasis Staging: Framework Development and Validation.." JMIR AI, vol. 5, 2026, pp. e77988.
PMID
41984624
DOI
10.2196/77988
Abstract
[BACKGROUND] Accurate tumor node metastasis (TNM) staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.
[OBJECTIVE] This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air (general language model), through advanced prompt engineering and supervised fine-tuning (SFT).
[METHODS] We constructed a curated dataset of 492 deidentified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the AJCC (American Joint Committee on Cancer) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using low-rank adaptation for the reasoning-intensive primary tumor characteristics (T) and regional lymph node involvement (N) staging tasks. The final hybrid model was evaluated on a completely held-out test set (black-box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.
[RESULTS] The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the black-box test set: 92% (95% CI 0.850-0.959) for T, 86% (95% CI 0.779-0.915) for N, 92% (95% CI 0.850-0.959) for distant metastasis status (M), and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI 0.790-0.922), 70% (95% CI 0.604-0.781), 78% (95% CI 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe category I errors, which are defined as misclassifications that could significantly influence subsequent clinical decisions. Our model committed 0 category I errors in M staging and fewer category I errors in T and N staging. Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (eg, 4 RTX 4090 GPUs) with latencies suitable and acceptable for clinical workflows.
[CONCLUSIONS] The proposed hybrid framework, integrating structured prompt engineering and applying SFT to reasoning-heavy tasks (T/N), enables the GLM-4-Air model to serve as a highly accurate, clinically reliable, and cost-efficient solution for automated NSCLC TNM staging. This work demonstrates the efficacy and potential of a domain-optimized smaller model compared with an off-the-shelf generalist model, holding promise for enhancing diagnostic standardization in resource-aware health care environments.
[OBJECTIVE] This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air (general language model), through advanced prompt engineering and supervised fine-tuning (SFT).
[METHODS] We constructed a curated dataset of 492 deidentified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the AJCC (American Joint Committee on Cancer) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using low-rank adaptation for the reasoning-intensive primary tumor characteristics (T) and regional lymph node involvement (N) staging tasks. The final hybrid model was evaluated on a completely held-out test set (black-box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.
[RESULTS] The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the black-box test set: 92% (95% CI 0.850-0.959) for T, 86% (95% CI 0.779-0.915) for N, 92% (95% CI 0.850-0.959) for distant metastasis status (M), and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI 0.790-0.922), 70% (95% CI 0.604-0.781), 78% (95% CI 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe category I errors, which are defined as misclassifications that could significantly influence subsequent clinical decisions. Our model committed 0 category I errors in M staging and fewer category I errors in T and N staging. Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (eg, 4 RTX 4090 GPUs) with latencies suitable and acceptable for clinical workflows.
[CONCLUSIONS] The proposed hybrid framework, integrating structured prompt engineering and applying SFT to reasoning-heavy tasks (T/N), enables the GLM-4-Air model to serve as a highly accurate, clinically reliable, and cost-efficient solution for automated NSCLC TNM staging. This work demonstrates the efficacy and potential of a domain-optimized smaller model compared with an off-the-shelf generalist model, holding promise for enhancing diagnostic standardization in resource-aware health care environments.
같은 제1저자의 인용 많은 논문 (5)
- Comparison of the efficacy of endoscopic submucosal dissection and transanal endoscopic microsurgery in the treatment of rectal neuroendocrine tumors ≤ 2 cm.
- Tarsal-Fixation With Aponeurotic Flap Linkage in Blepharoplasty: Bridge Technique.
- Dynamic T-Cell Reprogramming Modulates the Treatment Outcome of Neoadjuvant Immunochemotherapy in Non-Small-Cell Lung Cancer.
- Staged Management of Traumatic Diaphragmatic Hernia With Open Abdomen: A Case Report on Successful Abdominal Wall Reconstruction.
- Comments on "Opinions on the Treatment Strategy after Breast Augmentation by Polyacrylamide Hydrogel Injection".