Comparison of large language models and expert multidisciplinary team decisions in colorectal cancer.
APA
Qu B, Cao L, et al. (2026). Comparison of large language models and expert multidisciplinary team decisions in colorectal cancer. BMJ Health & Care Informatics, 33(1). https://doi.org/10.1136/bmjhci-2025-101780
PMID
41806973
Abstract
[OBJECTIVES] To evaluate the ability of large language models (LLMs) to simulate multidisciplinary team (MDT) decision-making in colorectal cancer, a malignancy that often requires complex treatment planning.
[METHODS] We retrospectively analysed 1423 colorectal cancer cases discussed at MDT meetings at Peking University Cancer Hospital between January 2023 and December 2024. Three LLMs-OpenAI o3-mini-2025-01-31, DeepSeek-R1 671b and Qwen qwq-plus-2025-03-05-were tested for their ability to replicate MDT recommendations using a standardised treatment categorisation framework. Each case was processed three times per model; only cases with consistent outputs across all three runs were included. Concordance between AI-generated decisions and expert MDT consensus was assessed using agreement percentages and Cohen's kappa.
[RESULTS] O3 demonstrated the highest intramodel stability, with an agreement rate of 81.0% (Fleiss' kappa=0.794), yielding 1153 cases with consistent outputs. Concordance with MDT consensus was comparable across the three models, ranging from 62.5% to 65.4%. Multivariable analysis of O3 outputs identified treatment-naïve status, non-metastatic disease and colon tumour location as independent predictors of higher concordance with experts.
[DISCUSSION] LLMs showed fair overall agreement with expert MDT decisions, with stronger performance in standardised and less complex clinical scenarios. Areas of higher concordance included treatment-naïve non-metastatic colon cancer, treated non-metastatic rectal cancer and treated non-metastatic colon cancer.
[CONCLUSION] LLMs can partially replicate expert MDT recommendations in colorectal cancer. Their integration into clinical workflows should aim to complement, rather than replace, human expertise.
Introduction
Artificial intelligence, particularly large language models (LLMs), is driving transformative changes in healthcare by learning from vast amounts of medical data.1–5 These technologies facilitate intelligent medical record analysis, automated medical image interpretation, clinical decision support for physicians and emergency medicine decision support.6–9
The multidisciplinary team (MDT) approach has emerged as a cornerstone of modern oncology practice,10 integrating expertise from surgery, medical oncology, radiation oncology, radiology, pathology and other specialties to deliver evidence-based, personalised treatment strategies. Colorectal cancer, the third most prevalent malignancy globally,11 exemplifies the necessity of such collaborative care.12 Robust clinical evidence confirms that MDTs substantially improve patient outcomes, including overall survival, disease-free survival and cancer-specific survival,13–15 with particularly pronounced benefits observed in advanced or metastatic disease.16
Prior research has indicated the potential utility of LLMs in MDT clinical decision-making. An Israeli study revealed that the treatment recommendations generated by ChatGPT-3.5 achieved 70.0% consistency (7 out of 10 cases) with the clinical decisions made by a breast cancer MDT.2 Another German research team found, in their analysis of primary head and neck cancer cases, that the Cohen’s kappa coefficient for clinical recommendation consistency between ChatGPT-3.5 and MDT experts was 0.44, while ChatGPT-V.4.0 reached 0.46.17 These findings provide substantial evidence that LLMs can fulfil a significant supporting role in the MDT clinical decision-making process.
This study leverages real-world clinical data to evaluate the capabilities of three different LLMs in simulating MDT discussions and generating relevant treatment plans for clinical cases. We analysed the consistency of these models in providing treatment recommendations and categorised their suggestions based on predefined clinical standards.
Methods
Datasets
This retrospective study utilised clinical data from the electronic medical record system of Peking University Cancer Hospital, Beijing, China, including patients who underwent MDT consultations between January 2023 and December 2024. Inclusion criteria were diagnosis of colorectal cancer, complete outpatient records, MDT discussions focused on oncological purposes (excluding management of hernias and stomas), availability of CT or MRI reports and documented final MDT treatment recommendations.
The collected information included demographic characteristics (age, gender), primary diagnosis, the most recent medical records, and reports from radiology, endoscopy, pathology and laboratory investigations. The consensus recommendations from expert MDTs were also documented.
MDT workflow and treatment categorisation framework
MDT meetings were convened regularly by core members including surgical oncologists, radiologists, medical oncologists, radiation oncologists, pathologists and hepatobiliary surgeons as needed. Each MDT session produced a structured record of consensus treatment recommendations.
To facilitate comparative analysis with AI-generated outputs, a standardised four-category treatment classification framework was established: category A: surgical intervention; category B: systemic treatment (including chemotherapy, immunotherapy and targeted therapy); category C: concurrent chemoradiotherapy (including radiotherapy alone); category D: recommendation for further evaluation.
Within the collected final MDT treatment recommendations, expert treatment decisions were coded based on the next-step treatment plan. For multistep treatment proposals, only the initial-step treatment decision was recorded. If experts simultaneously recommended two treatment options, this was marked as multiple-choice, and in subsequent analyses, LLMs considering either option were deemed consistent. The entire coding process strictly adhered to the expert consensus opinions in the original medical records.
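The coding scheme and agreement rule above can be sketched as follows. The four category labels come from the paper; the function name and the set-based representation of "multiple-choice" expert recommendations are our own illustrative choices, not the study's code.

```python
# Hypothetical sketch of the four-category coding scheme and the
# concordance rule described in the text.
CATEGORIES = {
    "A": "surgical intervention",
    "B": "systemic treatment (chemo-, immuno- or targeted therapy)",
    "C": "concurrent chemoradiotherapy (incl. radiotherapy alone)",
    "D": "recommendation for further evaluation",
}

def is_concordant(expert_codes: set[str], llm_code: str) -> bool:
    """An LLM output agrees if it matches any expert-coded option.

    For a single recommendation, expert_codes has one element; for a
    'multiple-choice' MDT recommendation it has two, and either counts.
    """
    return llm_code in expert_codes
```

For example, if the MDT marked a case as multiple-choice {A, C}, an LLM answering either A or C is scored as concordant.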
Simulation of MDT decision-making using LLMs
Three LLMs were evaluated: o3-mini-2025-01-31, DeepSeek-R1 671b and qwq-plus-2025-03-05. For simplicity, they are hereafter referred to as O3, DeepSeek and QwQ, respectively. To simulate an MDT discussion, a structured prompt instructed the LLMs to adopt the roles of various specialists typically involved in cancer care: a surgical oncologist, a gastroenterologist, a radiologist, an imaging specialist, a pathologist and a specialised nursing expert. For each case, LLM inputs were derived from clinical summaries prepared prior to MDT discussion. The input variables provided to the models are detailed in online supplemental Table S1. All inputs consisted of structured variables and report-based free-text summaries extracted from official clinical records. Case records were processed and provided to the language models in Chinese, without translation into other languages or additional automated text preprocessing.
All models were configured according to their official documentation and evaluated under standardised prompt settings designed to reflect real-world MDT use (details in online supplemental methods). For each clinical case, every LLM generated a simulated MDT discussion culminating in a treatment recommendation. The final decision was mapped into one of the four predefined treatment categories. The model was then prompted to return only the corresponding classification code. Figure 1 depicts the structured multirole prompt design used to emulate interdisciplinary discussions and treatment planning.
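The multirole prompt design can be illustrated with a minimal reconstruction. The role list mirrors the text; the exact prompt wording used in the study lives in its online supplemental methods, so everything below is a hypothetical approximation.

```python
# Illustrative reconstruction of the structured multirole MDT prompt;
# the study's actual wording is in its online supplemental methods.
ROLES = [
    "surgical oncologist", "gastroenterologist", "radiologist",
    "imaging specialist", "pathologist", "specialised nursing expert",
]

def build_mdt_prompt(case_summary: str) -> str:
    """Assemble a hypothetical multirole MDT prompt ending with the
    instruction to return only a single classification code."""
    roles = ", ".join(ROLES)
    return (
        f"Act as a multidisciplinary team comprising: {roles}.\n"
        "Discuss the case below and agree on the next treatment step.\n"
        f"Case: {case_summary}\n"
        "Reply with only one classification code: A (surgery), "
        "B (systemic treatment), C (chemoradiotherapy) or "
        "D (further evaluation)."
    )
```

Restricting the final answer to a single code is what makes the output directly comparable with the expert-coded categories.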
Outcome measures
This study employed a repeated-measures design in which each patient case was processed three independent times by each of the three models. Only results showing complete consistency across all three outputs from each model were included in the final analysis. The primary outcome measured the concordance between LLMs and expert MDT treatment decisions. The agreement was defined as an exact match between the LLM-generated recommendation and the expert-coded treatment decision. Secondary outcomes included a comparative analysis of different LLM performances, assessment of output consistency across repeated trials and identification of clinically relevant factors associated with discrepancies between LLM and expert decisions.
Statistical analysis
Fleiss’ kappa and percentage agreement were employed to evaluate model stability across repeated runs. Cohen’s kappa and percentage agreement were employed to evaluate the consistency between clinical expert and LLM decisions in the MDT context. Univariable binary logistic regression was employed to analyse for each potential influencing factor, with the significance level set at p<0.05. In the multivariate analysis phase, in addition to incorporating variables that showed statistical significance in the univariate analysis, clinically important predictors identified through clinical expertise were also included in the initial multivariate model. The backward elimination method was used for variable selection, with variables ultimately retained in the final model demonstrating statistical significance (p<0.05). Results were presented as ORs with corresponding 95% CI. All statistical analyses were performed using SPSS (V.25.0) and R (V.4.4.3).
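Cohen's kappa for two aligned label sequences (e.g. expert MDT codes vs one model's codes) can be computed directly from observed and chance agreement; a minimal self-contained sketch, not the study's SPSS/R code:

```python
from collections import Counter

def cohen_kappa(rater1: list[str], rater2: list[str]) -> float:
    """Chance-corrected agreement: (po - pe) / (1 - pe)."""
    n = len(rater1)
    # Observed agreement: fraction of exact matches.
    po = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    pe = sum(c1[k] * c2[k] for k in c1) / n**2
    return (po - pe) / (1 - pe)
```

On a toy example with 3/4 exact matches, kappa falls well below the raw 75% agreement, which is exactly why the paper reports kappa alongside agreement percentages.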
Results
Study cohort
Between January 2023 and December 2024, a total of 3165 colorectal cancer patients received MDT discussion at the Gastrointestinal Cancer Center, Unit III, Peking University Cancer Hospital, Beijing, China. After screening, 1423 MDT records were included for analysis (figure 2). Baseline characteristics of the 1423 colorectal cancer patients are shown in table 1. Most patients had rectal cancer with localised disease at presentation.
Performance of three LLMs
After three independent runs, a total of 1153, 910 and 971 cases were included in the formal analysis for O3, DeepSeek and QwQ, respectively. The intramodel agreement rates were 81.0% for O3, 63.9% for DeepSeek and 68.2% for QwQ (figure 3A). The 95% CI for O3’s intramodel agreement did not overlap with those of the other two models, indicating a statistically significant difference in consistency across runs.
With respect to concordance between LLM-generated recommendations and expert MDT decisions, agreement rates were 65.4% for O3, 62.5% for DeepSeek and 63.4% for QwQ (figure 3B). The corresponding Cohen’s kappa values, 0.380, 0.373 and 0.394, fell within the range considered to reflect fair agreement. As the 95% CIs of these kappa values overlapped, no significant differences in performance were observed among the three models. Detailed results are presented in online supplemental Table S2. A comparison of baseline characteristics among the cases included in the formal analysis revealed no significant differences across the three LLMs (online supplemental Table S3).
To address potential selection effects introduced by the strict three-run consistency criterion, we compared baseline characteristics between included and excluded cases for each model (online supplemental Table S4). Across all three models, excluded cases demonstrated a substantially higher proportion of missing clinical T and N staging information and a higher prevalence of metastatic disease. In the DeepSeek model, included patients were slightly younger than excluded patients; although this difference reached statistical significance (p=0.01), the median ages and IQRs were highly comparable, suggesting a minimal effect size. In the QwQ model, excluded cases included a higher proportion of colon tumours compared with included cases.
Given that the strict three-run consistency criterion excluded a substantial proportion of cases, we conducted a supplementary sensitivity analysis using a majority voting strategy (2/3 agreement). Under this approach, the number of included cases increased to 1416 for O3, 1395 for DeepSeek and 1388 for QwQ (online supplemental Table S5). Agreement with expert MDT decisions and corresponding Cohen’s kappa values were of similar magnitude to those observed in the primary analysis (online supplemental Table S5).
Patterns and determinants of LLM–MDT concordance
Using the primary analysis cohort defined by complete agreement across three independent runs (n=1153 for O3), we examined patterns of agreement and discordance between LLM-generated recommendations and expert MDT decisions and identified factors associated with LLM–expert concordance.
As shown in the Sankey diagram (figure 3C), the highest agreement was observed for category A (surgical intervention) and category C (concurrent chemoradiotherapy), with concordance rates of 73.8% and 86.3%, respectively. In contrast, concordance was substantially lower for category B (systemic treatment) and category D (recommendation for further evaluation), at 20.0% and 18.3%, respectively. The primary sources of discordance were cases in which experts recommended surgical intervention (A), but O3 selected concurrent chemoradiotherapy (C) in 13.5% of such cases (103 of 763), and cases in which experts advised further evaluation (D), while O3 recommended surgical intervention (A) in 50.6% (91 of 180).
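The quoted discordance proportions follow directly from the reported counts and can be recomputed:

```python
# Reported discordance counts from the Sankey analysis, recomputed.
a_to_c = 103 / 763  # experts chose A (surgery), O3 chose C (chemoradiotherapy)
d_to_a = 91 / 180   # experts chose D (further evaluation), O3 chose A (surgery)
print(round(100 * a_to_c, 1), round(100 * d_to_a, 1))  # 13.5 50.6
```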
Univariable and multivariable analyses were primarily performed using O3, the most stable model, to identify factors associated with LLM–expert MDT concordance. In univariable analysis of 1153 cases, colon tumour location (vs rectum) was the only significant predictor of concordance (OR=2.04, 95% CI 1.55 to 2.67, p<0.001; table 2). Multivariable analysis further identified three independent predictors of higher concordance: prior treatment, non-metastatic disease and colon tumour location.
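An odds ratio with a Wald 95% CI, as reported for colon vs rectal location, is computed from a 2x2 table of concordant/discordant counts by site. The sketch below uses hypothetical counts for illustration only; the study's actual counts are in its tables.

```python
import math

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """OR and Wald 95% CI from a 2x2 table
    [[a=colon concordant, b=colon discordant],
     [c=rectum concordant, d=rectum discordant]] -- illustrative only."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi
```

With made-up counts of 40/10 (colon) vs 20/10 (rectum), this yields OR = 2.0 with a CI spanning it, mirroring the direction of the reported colon-location effect.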
Similar associations were observed for DeepSeek, whereas QwQ showed a distinct pattern, with improved concordance mainly in cT0–2 colon tumours (online supplemental Table S6). Based on the three predictors identified by O3, patients were stratified into eight subgroups. Concordance rates across subgroups are shown in figure 3D, with higher-than-average agreement observed in treatment-naïve non-metastatic colon cancer, treated non-metastatic rectal cancer and treated non-metastatic colon cancer. Detailed subgroup distributions are provided in online supplemental Table S7, and the overall cross-tabulation of O3 and MDT recommendations for the full cohort is summarised in online supplemental Table S8.
Discussion
MDT is a cornerstone of oncology care, particularly for complex and clinically challenging malignancies. MDT discussions represent one of the most complex clinical decision-making scenarios, posing significant challenges for AI-assisted support. However, research on the integration of AI into MDT decision-making for oncology remains limited. In this study, we analysed a large-scale retrospective cohort (N=1423) with comprehensive clinical data to evaluate the performance of an LLM in colorectal cancer MDT decision-making. This study established standardised clinical input environments mirroring actual MDT workflows, enabling direct comparison between LLM and expert MDT recommendations.
Among the three evaluated LLMs, the O3 model achieved the highest concordance with expert consensus decisions and demonstrated superior internal consistency. However, despite this relative advantage, overall agreement remained limited, with Cohen’s kappa values below 0.4, indicating only minimal concordance.18
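Assuming the McHugh (2012) interpretation scale, which labels kappa between 0.21 and 0.39 as "minimal", the qualitative band for a given kappa can be looked up as below; this is our reading of the cited threshold, not code from the study.

```python
def interpret_kappa(k: float) -> str:
    """Map kappa to McHugh's (2012) qualitative bands (assumed to be
    the scale behind 'minimal concordance' for kappa below 0.4)."""
    if k < 0.21:
        return "none"
    if k < 0.40:
        return "minimal"
    if k < 0.60:
        return "weak"
    if k < 0.80:
        return "moderate"
    if k <= 0.90:
        return "strong"
    return "almost perfect"
```

On this scale the three models' kappas (0.373 to 0.394) all sit in the "minimal" band, while O3's intramodel Fleiss' kappa of 0.794 is "moderate".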
The strict three-run consistency criterion prioritised output stability but excluded cases where model recommendations varied across repeated runs. A supplementary majority voting analysis included more cases while yielding similar agreement and kappa estimates. This suggests that the modest concordance observed is driven primarily by the complexity of MDT decision-making, rather than by the choice of consistency criterion.
Importantly, MDT recommendations were used as a pragmatic reference for comparison rather than an outcome-based gold standard, and concordance should not be interpreted as evidence of correct or optimal treatment. This study did not evaluate the clinical safety of model outputs, nor did it aim to emulate MDT reasoning. Instead, the models were assessed as decision-support tools capable of summarising overall treatment direction in common scenarios. Against this background, the following sections examine specific patterns of agreement and discordance between model outputs and expert MDT decisions.
Several representative patterns of model-expert discordance, as illustrated in the Sankey diagram, warrant further analysis. One such pattern involved cases in which experts recommended surgical intervention (A), while O3 suggested concurrent chemoradiotherapy (C). This discrepancy frequently arose in tumours of the upper and mid rectum. For upper rectal cancers, which share biological behaviour and lymphatic drainage with colon cancer, surgery and systemic therapy are typically preferred over neoadjuvant radiotherapy.19 20 O3, however, did not clearly differentiate among rectal subsites and often recommended chemoradiotherapy regardless of tumour location.
In contrast, for mid-rectal cancers, where the risk of local recurrence is higher, neoadjuvant chemoradiotherapy is the standard of care.19 O3's recommendation of concurrent chemoradiotherapy (C) in these cases therefore aligns with established guidelines. The experts' preference for surgery, however, may reflect the surgical orientation of MDT members, underscoring the influence of clinical background on decision-making.
Another factor contributing to this disagreement was the differing interpretation of treatment response between O3 and clinical experts. In several cases, experts judged the response to neoadjuvant therapy as sufficient and advised proceeding to surgery, whereas O3 continued to recommend chemoradiotherapy, highlighting the model’s limited capacity to dynamically assess treatment efficacy.
In cases where experts recommended further evaluation while O3 suggested immediate surgery, the model often anticipated the eventual treatment direction but failed to account for the intermediate diagnostic process. Notably, for patients who achieved a clinical complete response (cCR), O3 predominantly recommended surgical resection, whereas clinicians frequently preferred a non-operative, organ-preserving ‘watch-and-wait’ (W&W) strategy. W&W involves carefully monitoring patients instead of performing immediate surgery, aiming to preserve bowel function and avoid surgical complications.21 This approach is increasingly supported by clinical studies,22 23 highlighting the unique role of human clinicians in balancing cancer control with quality-of-life considerations.
In non-metastatic rectal cancer, concordance between the model and expert decisions was lower in treatment-naïve patients than in those who had already received therapy (54.8% vs 71.2%; p<0.01). This difference reflects the complexity of rectal cancer management, which often varies by tumour location and frequently involves clinical trial protocols. Experts may recommend treatments that differ from standard guidelines when they are leading or involved in clinical trial design. This contributes to model-expert disagreement and highlights the unique role of clinicians in advancing care through research, which is an aspect that current AI models are not yet equipped to replicate.
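A subgroup comparison such as the one reported here (54.8% vs 71.2%, p<0.01) corresponds to a standard two-proportion z-test. The sketch below uses hypothetical group sizes chosen only to reproduce similar proportions, since the subgroup denominators are not restated in this section:

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Two-sided z-test for a difference between two proportions."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value

# Hypothetical counts: treatment-naive 164/300 (~54.7%) vs treated 285/400 (~71.3%)
z, p = two_proportion_z(164, 300, 285, 400)
```

With groups of this size, a 16-point gap in concordance is far beyond what sampling noise would produce, consistent with the reported p<0.01.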
Overall, concordance tended to be higher in more standardised clinical scenarios, whereas lower agreement was observed in cases involving contingent or indeterminate decision-making (category D). In this setting, recommendations for further evaluation generally reflected the need for additional diagnostic work-up, such as completion of staging with PET-CT or MRI, repeat endoscopic assessment, or molecular and genetic testing, rather than uncertainty about treatment intent. As a result, discordance in category D often reflects differences in how interim evaluation steps are represented, rather than fundamental disagreement regarding the eventual treatment approach.

Beyond these primary patterns of discordance, several clinically complex or exceptional scenarios also contributed to discrepancies between expert decisions and LLM outputs. When imaging and endoscopic findings were discordant, such as colonoscopy suggesting a mid-to-upper rectal lesion while CT indicated rectosigmoid involvement, LLMs often failed to adequately reconcile the conflicting information. Limitations were also evident in cases with tumour-related complications (eg, obstruction or perforation), which require urgent and individualised clinical judgement. In addition, for patients with multiple primary lesions, LLMs frequently oversimplified decision-making by focusing on a single lesion rather than adopting a comprehensive treatment strategy.

Taken together, these subgroup differences and patterns of discordance suggest that model-expert agreement varies systematically across clinical contexts. Consistent with these observations, univariate and multivariable logistic regression analyses showed that concordance was more likely in colon cancer, treated cases and non-metastatic disease.
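The regression analysis mentioned above can be illustrated with a self-contained sketch: a plain gradient-descent logistic regression over three binary covariates (colon site, prior treatment, non-metastatic disease) predicting a binary concordance outcome. All data below are hypothetical; the coefficient signs merely illustrate how "more likely concordant" subgroups surface as positive coefficients:

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Logistic regression fitted by plain stochastic gradient descent.
    Returns [intercept, coef_1, ..., coef_k]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1 / (1 + math.exp(-z))
            err = yi - p                      # gradient of the log-likelihood
            w[0] += lr * err
            for j, xj in enumerate(xi):
                w[j + 1] += lr * err * xj
    return w

# Hypothetical cases: [colon_site, previously_treated, non_metastatic] -> concordant?
X = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0],
     [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
y = [1, 1, 1, 0, 1, 0, 0, 0]
w = fit_logistic(X, y)
# Positive coefficients indicate higher odds of model-expert concordance.
```

In this toy data concordance requires at least two favourable factors, so the fit yields a negative intercept and three positive coefficients, mirroring the direction of the study's reported predictors.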
This study has several limitations. First, the strict three-run consistency criterion prioritised output stability but excluded cases in which model recommendations varied across repeated runs. As these excluded cases more often involved incomplete staging information or metastatic disease, this approach may have limited generalisability to cases with incomplete information or uncertain decision contexts, despite improving internal validity.
Second, this retrospective study relied on routine clinical records, in which some baseline information, particularly cT and cN staging, was not consistently documented. As a result, the information available to clinicians during MDT discussion and that provided to the language models may not have been fully aligned. Third, MDT recommendations were used as a pragmatic reference rather than an outcome-based gold standard, and concordance should not be interpreted as evidence of optimal or correct treatment.
Fourth, the number of metastatic cases was limited, and this was a single-centre cohort predominantly composed of older male patients with rectal cancer, which may limit the generalisability of the findings to other populations and practice settings. Fifth, MDT recommendations were grouped into four broad treatment categories, which did not reflect more detailed clinical decisions, such as treatment sequence or whether organ-preserving strategies were considered. Finally, the analysis was based on text information from MDT records and did not include imaging data or other non-text clinical information that often influences MDT decision-making.
Future studies should include longitudinal follow-up and patient outcomes, use prospective designs to assess clinical usefulness and safety, and explore improved ways of representing MDT decisions, as well as approaches that incorporate up-to-date guidelines. Future MDT models will likely involve collaborative interaction between LLMs and physicians. For routine cases, AI could rapidly generate high-accuracy treatment recommendations for expert review and confirmation, thereby potentially conserving clinical resources. In complex cases, AI may serve as an intelligent decision-support tool, synthesising clinical data and medical knowledge while offering novel perspectives to stimulate expert deliberation and optimise therapeutic strategies. Clinicians will retain their essential role in managing challenging cases through MDT discussions and in conducting clinical research to generate new evidence, thereby enhancing AI-assisted decision-making capabilities through an iterative improvement cycle.
MDT is a cornerstone of oncology care, particularly for complex and clinically challenging malignancies. MDT discussions represent one of the most complex clinical decision-making scenarios, posing significant challenges for AI-assisted support. However, research on the integration of AI into MDT decision-making for oncology remains limited. In this study, we analysed a large-scale retrospective cohort (N=1423) with comprehensive clinical data to evaluate the performance of an LLM in colorectal cancer MDT decision-making. This study established standardised clinical input environments mirroring actual MDT workflows, enabling direct comparison between LLM and expert MDT recommendations.
Supplementary material
10.1136/bmjhci-2025-101780 online supplemental file 1
10.1136/bmjhci-2025-101780 online supplemental file 2