
Performance of latest AI models, RAG, and MCP on lung cancer-related questions.

Digital Health, 2026, Vol. 12, p. 20552076261427503

Zhao X, Yang M, Tian K, Jiang H, Guo D, Wang Y, Du J


Cite this paper

APA Zhao X, Yang M, et al. (2026). Performance of latest AI models, RAG, and MCP on lung cancer-related questions. Digital health, 12, 20552076261427503. https://doi.org/10.1177/20552076261427503
MLA Zhao X, et al. "Performance of latest AI models, RAG, and MCP on lung cancer-related questions." Digital health, vol. 12, 2026, pp. 20552076261427503.
PMID 41836629

Abstract

[BACKGROUND] Large language models (LLMs) have advanced rapidly. However, concerns remain regarding their reliability in clinical settings due to the inherent issues of hallucinations and inadequate referencing.

[MATERIALS AND METHODS] We evaluated six current LLMs: GPT-4.1 (GPT), o3, Gemini-2.5-Pro-Preview-0506 (Gemini), Grok-3 (Grok), Qwen3-235B-A22B (Qwen3), and Claude Sonnet 4 (Claude), as well as two technologies that extend LLM capabilities using external knowledge bases: retrieval-augmented generation (RAG) and Model Context Protocol (MCP). Each model was evaluated using 50 questions selected from a 132-question pool developed based on the Chinese Medical Association guideline for clinical diagnosis and treatment of lung cancer (2024 Edition). Three models (Qwen3, GPT, and Grok) were further analyzed to assess performance changes with RAG and MCP integration. All responses were independently reviewed by two qualitative evaluators.
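
To make the RAG idea concrete, the following is a minimal sketch of the general technique, not the authors' pipeline. It assumes a toy word-overlap retriever in place of a real embedding search, and the passages, question, and function names are hypothetical placeholders rather than guideline text.

```python
import re
from collections import Counter

# Hypothetical placeholder passages; NOT text from the actual guideline.
PASSAGES = [
    "Section A: screening recommendations (placeholder text).",
    "Section B: staging workup (placeholder text).",
    "Section C: follow-up schedule (placeholder text).",
]
QUESTION = "What are the screening recommendations?"


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(question, passage):
    """Count tokens shared by question and passage (multiset intersection)."""
    return sum((Counter(tokenize(question)) & Counter(tokenize(passage))).values())


def retrieve(question, passages, k=2):
    """Return the k passages with the highest word overlap with the question."""
    ranked = sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Prepend the retrieved passages to the question before calling an LLM."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Use only the context below to answer.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


top_passage = retrieve(QUESTION, PASSAGES, k=1)
prompt = build_prompt(QUESTION, PASSAGES, k=1)
```

The resulting `prompt` would then be sent to the model in place of the bare question; a production system would typically swap the overlap score for vector similarity over an indexed knowledge base.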

[RESULTS] Overall, o3 achieved the highest accuracy (50%), followed by GPT (48%) and Gemini (48%), then Grok (44%), Qwen3 (40%), and Claude (36%). However, implementing RAG (LLM-RAG) or MCP (LLM-MCP) significantly improved accuracy, with statistical differences observed between baseline LLMs and their RAG- or MCP-enhanced counterparts. Lexical richness and semantic noise both diminished, whereas the semantic clarity and accuracy of verbs, noun-verb combinations, and content words improved.
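
The baseline percentages can be translated back into question counts with simple arithmetic. The sketch below assumes every model answered all 50 questions and that accuracy is the fraction answered correctly; the abstract does not report the raw counts, so these are derived values, not published figures.

```python
N_QUESTIONS = 50  # questions per model, as stated in the methods

# Baseline accuracy in percent, as reported in the abstract.
REPORTED_ACCURACY_PCT = {
    "o3": 50,
    "GPT": 48,
    "Gemini": 48,
    "Grok": 44,
    "Qwen3": 40,
    "Claude": 36,
}


def correct_count(accuracy_pct, n=N_QUESTIONS):
    """Convert a percentage accuracy into a number of correct answers."""
    return round(accuracy_pct * n / 100)


counts = {model: correct_count(pct) for model, pct in REPORTED_ACCURACY_PCT.items()}

# Gap between the best and worst baseline model, in questions.
spread = max(counts.values()) - min(counts.values())
```

Under these assumptions the six models range from 18 to 25 correct answers, a spread of 7 questions out of 50.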

[CONCLUSIONS] The six latest LLMs performed similarly on lung cancer-related questions. The integration of RAG or MCP significantly enhanced accuracy while simplifying sentence structure, focusing more on the main topics, and using more accurate vocabulary.
