Performance of latest AI models, RAG, and MCP on lung cancer-related questions.
APA
Zhao X, Yang M, et al. (2026). Performance of latest AI models, RAG, and MCP on lung cancer-related questions. Digital Health, 12, 20552076261427503. https://doi.org/10.1177/20552076261427503
MLA
Zhao X, et al. "Performance of latest AI models, RAG, and MCP on lung cancer-related questions." Digital Health, vol. 12, 2026, pp. 20552076261427503.
PMID
41836629
Abstract
[BACKGROUND] Large language models (LLMs) have advanced rapidly. However, concerns remain regarding their reliability in clinical settings due to the inherent issues of hallucinations and inadequate referencing.
[MATERIALS AND METHODS] We evaluated six current LLMs: GPT-4.1 (GPT), o3, Gemini-2.5-Pro-Preview-0506 (Gemini), Grok-3 (Grok), Qwen3-235B-A22B (Qwen), and Claude Sonnet 4 (Claude), as well as two technologies that extend LLM capabilities using external knowledge bases: retrieval-augmented generation (RAG) and Model Context Protocol (MCP). Each model was evaluated using 50 questions selected from a 132-question pool developed based on the Chinese Medical Association guideline for clinical diagnosis and treatment of lung cancer (2024 Edition). Three models (Qwen, GPT, and Grok) were further analyzed to assess performance changes with RAG and MCP integration. All responses were independently reviewed by two qualitative evaluators.
[RESULTS] Overall, o3 achieved the highest accuracy (50%), followed by GPT (48%) and Gemini (48%), then Grok (44%), Qwen (40%), and Claude (36%). Implementing RAG (LLM-RAG) or MCP (LLM-MCP) significantly improved accuracy, with statistically significant differences observed between baseline LLMs and their RAG- or MCP-enhanced counterparts. Lexical richness and semantic noise both diminished, whereas the semantic clarity and accuracy of verbs, noun-verb combinations, and content words improved.
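The abstract reports statistical differences between baseline and augmented models but does not name the test used. As a minimal sketch, assuming each question is graded simply correct or incorrect and the same questions are answered by both the baseline and the augmented model, an exact McNemar test is one common way to compare such paired outcomes. The function name `mcnemar_exact` and the example data below are hypothetical and are not taken from the study.

```python
from math import comb


def mcnemar_exact(baseline, augmented):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    baseline, augmented: equal-length sequences of booleans, one entry per
    question (True = answer graded correct).
    Returns (b, c, p_value), where b counts questions that only the augmented
    model answered correctly and c counts questions that only the baseline
    answered correctly.
    """
    if len(baseline) != len(augmented):
        raise ValueError("both models must be scored on the same questions")
    b = sum(1 for x, y in zip(baseline, augmented) if not x and y)
    c = sum(1 for x, y in zip(baseline, augmented) if x and not y)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, discordant pairs split 50/50 (binomial).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical grading of 50 questions (invented for illustration only).
baseline = [True] * 20 + [False] * 30
augmented = [True] * 18 + [False] * 2 + [True] * 14 + [False] * 16

b, c, p = mcnemar_exact(baseline, augmented)
print(f"gained={b}, lost={c}, p={p:.4f}")
```

Only the discordant questions (those where the two configurations disagree) enter the test, which is why a paired design can detect an improvement even with a 50-question sample.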
[CONCLUSIONS] The six latest LLMs performed similarly on lung cancer-related questions. The integration of RAG or MCP significantly enhanced accuracy while simplifying sentence structure, focusing more on the main topics, and using more accurate vocabulary.
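To see why accuracies ranging from 36% to 50% can still be described as similar, the reported percentages can be converted back to counts and given confidence intervals. The sketch below assumes each percentage is a proportion of exactly 50 graded questions and uses a Wilson 95% interval; the choice of interval is an assumption made here for illustration, not a method stated in the abstract.

```python
from math import sqrt

N_QUESTIONS = 50  # questions per model, as stated in the abstract

# Baseline accuracies (percent) as reported in the abstract.
REPORTED_ACCURACY_PCT = {
    "o3": 50, "GPT": 48, "Gemini": 48, "Grok": 44, "Qwen": 40, "Claude": 36,
}


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Convert percentages to whole-number counts of correct answers.
counts = {m: pct * N_QUESTIONS // 100 for m, pct in REPORTED_ACCURACY_PCT.items()}
intervals = {m: wilson_interval(k, N_QUESTIONS) for m, k in counts.items()}

for model, (low, high) in intervals.items():
    print(f"{model:7s} {counts[model]:2d}/{N_QUESTIONS}  95% CI {low:.3f} to {high:.3f}")
```

Under these assumptions the interval for the lowest-scoring model overlaps the interval for the highest-scoring one, which is consistent with the conclusion that the six baseline models performed similarly. Overlapping intervals are only an informal check and do not replace the study's own statistical analysis.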