
Development and validation of a diagnostic machine learning model for gastric cancer risk based on double-negative T cell-related features.

Cancer Cell International, 2026, Vol. 26(1), open access

Yin Z, Zhang G, Yin Z, Ma W, Yang J, Deng W


Cite this article

APA: Yin Z, Zhang G, et al. (2026). Development and validation of a diagnostic machine learning model for gastric cancer risk based on double-negative T cell-related features. Cancer Cell International, 26(1). https://doi.org/10.1186/s12935-025-04080-7
MLA: Yin Z, et al. "Development and validation of a diagnostic machine learning model for gastric cancer risk based on double-negative T cell-related features." Cancer Cell International, vol. 26, no. 1, 2026.
PMID: 41526926

Abstract

[BACKGROUND] Gastric cancer (GC) remains a major global health challenge, characterized by high morbidity and mortality rates. Early diagnosis is essential for improving patient outcome. This study aims to develop a diagnostic model based on specific signature genes by investigating the association between double-negative (DN) T cells and GC.

[METHODS] A bidirectional Mendelian randomization (MR) analysis was conducted to assess the causal relationship between immune cell phenotypes and GC pathogenesis. Three machine learning (ML) algorithms, combined with logistic regression, were employed to identify featured genes. Real-world cohorts and animal experiments were applied to validate the expression levels of DN T cells and selected model genes. Virtual screening was further performed to identify potential therapeutic candidates.

[RESULTS] DN T cells were identified as a significant risk factor for GC. A diagnostic model incorporating four genes (EML4, IL32, FXYD5, and TTC39C) was constructed using ML algorithms and demonstrated high predictive accuracy across multiple clinical cohorts. External validation and experimental analyses confirmed elevated DN T cell levels and increased expression of all model genes in GC tissues, which correlated with poor prognosis. Virtual screening identified compounds with strong binding affinity to the target proteins, indicating their potential for GC treatment.

[CONCLUSIONS] The study established a novel diagnostic model for GC based on DN T cell signature genes, which shows robust predictive performance and significant clinical benefit. The findings underscore the important role of DN T cells and model genes in GC, providing new insights into early diagnosis and potential therapeutic targets for effective management of GC.


Background
Gastric cancer (GC) is the fifth most common malignant tumor and the fourth leading cause of cancer-related deaths worldwide, particularly prevalent in developing countries with notably high incidence rates in East Asia, Eastern Europe, and South America [1]. Prognosis in GC varies significantly by stage; early-stage GC, confined to the mucosal and submucosal layers, can typically be treated with radical surgical resection, yielding a five-year survival rate exceeding 90% for most patients. In contrast, advanced GC necessitates a comprehensive treatment plan that incorporates surgery, chemotherapy, radiotherapy, and targeted therapies, resulting in five-year survival rates usually below 30% [2]. This underscores the critical importance of early diagnosis and intervention in improving the prognosis for GC patients. Currently, endoscopic biopsy is the predominant method for early screening of GC. However, the high subjectivity of pathological diagnoses and the time-consuming nature of immunohistochemistry represent significant limitations in clinical practice. The elevation of serum carcinoembryonic antigen (CEA), CA19-9, and CA72-4 provides some diagnostic utility for GC [3], but these markers exhibit limited specificity and sensitivity when used in isolation. Consequently, there is an urgent need to develop more sensitive, accurate, and efficient diagnostic models to enhance early screening and diagnostic methods for high-risk populations.
The tumor immune microenvironment (TIME) is a complex ecosystem influenced by interactions among various immune cells and chemokines, as well as regulatory and feedback mechanisms that significantly impact tumor development. Different subpopulations of T cells, as key components of the adaptive immune system, serve as important markers for predicting tumor biology and therapeutic sensitivity. T cells are primarily classified into two major groups based on their surface markers: CD4+ T cells and CD8+ T cells. Studies have shown that increased CD8+ T cell infiltration within the TIME correlates positively with favorable outcomes in cancer patients [4, 5]. Furthermore, higher levels of CD8+ T cell infiltration are typically observed in microsatellite instability-high (MSI-H) and high tumor mutation burden (TMB) subtypes, indicating a heightened responsiveness to antitumor therapies, particularly immunotherapeutics [6–8]. On the other hand, the proportion of CD4+ T cell infiltration can also serve as an effective predictor of long-term survival and therapeutic response in tumor patients [9, 10]. Beyond these major subsets, CD3+ CD4-CD8- double-negative (DN) T cells, though a minor population in peripheral blood, have garnered significant interest for their dual roles in tumor immunity. They can act as immune effectors against hematological malignancies [11], yet also potentially contribute to lymphomagenesis [12, 13]. In solid tumors, DN T cells are often enriched in the TIME, potentially originating from CD8+ T cells via CD8 downregulation [14]. Their functional impact is complex, but a consensus is emerging that they often act as potent negative regulators of anti-tumor immunity. 
Proposed mechanisms extend beyond the Fas/FasL pathway [15, 16] to include T-cell receptor (TCR)-dependent, cell-contact-mediated suppression of effector T cells (distinct from Tregs) and the secretion of inhibitory cytokines like IL-10 and TGF-β, which collectively dampen immune responses and promote processes like epithelial-mesenchymal transition (EMT) [17–19]. This multifaceted immunosuppressive profile underpins clinical observations linking DN T cell infiltration to treatment resistance, tumor progression, and immune evasion [20].
In the preliminary phase of this study, we conducted immune cell-related Mendelian randomization (MR) analyses and discovered a significant causal relationship between elevated DN T cell infiltration and gastric carcinogenesis. These findings suggest that the presence and functional status of DN T cells may serve as potential diagnostic and prognostic indicators in GC.
The rapid advancement of artificial intelligence (AI) and big data technologies has greatly accelerated the discovery of potential biomarkers. After establishing a causal relationship between the infiltration of DN T cells and gastric carcinogenesis, we annotated DN T cell subpopulations within the TIME of GC using single-cell RNA sequencing. We extracted marker genes specific to these subpopulations and applied multiple ML algorithms to identify key signature genes, ultimately constructing a novel diagnostic model. Subsequent validation in clinical cohorts and animal models confirmed significantly elevated levels of DN T cell infiltration and model gene expression in GC tissues compared to normal gastric tissues, with the extent of these increases strongly associated with poor patient prognosis. Finally, through molecular docking and subsequent molecular dynamics (MD) simulations targeting the model genes, we identified potential preventive and therapeutic agents. This integrated approach proposes a promising new strategy for the diagnosis and management of GC.

Methods

GWAS data acquisition
All GWAS data utilized in this study were sourced from the IEU OpenGWAS project website (https://gwas.mrcieu.ac.uk/). The GWAS data for GC was identified as ebi-a-GCST90018849, comprising 1,029 GC cases and 475,087 matched controls from healthy populations [21]. The data for 731 immune cell phenotypes ranged from ebi-a-GCST0001391 to ebi-a-GCST0002121, derived from a non-overlapping European cohort with 3,757 samples [22]. Expression Quantitative Trait Loci (eQTL) analyses combined with GWAS were employed to explore potential regulatory effects on gene expression for complex traits. All GWAS samples were obtained from European populations, with ethical approval from relevant committees, and informed consent was secured from all participants.

Bidirectional MR analysis
Bidirectional MR analysis was conducted to investigate potential bidirectional causal associations between the 731 immune cell phenotypes and GC. MR analysis rests on three fundamental assumptions: (1) SNPs must be associated with the exposure; (2) SNPs must be independent of any confounders; and (3) SNPs should influence the outcome solely through the exposure. The “TwoSampleMR” R package was used for the bidirectional MR analysis. Initially, SNPs strongly associated with each trait were selected at a significance threshold of p < 1e-5, and SNPs in linkage disequilibrium were pruned by clumping (r² < 0.001). A threshold of F > 10 was applied to exclude weak instrumental variables [23]. The MR analysis incorporated five methods: Inverse Variance Weighted (IVW), MR Egger, Simple mode, Weighted median, and Weighted mode [24–26]. Among these, IVW is typically regarded as the most reliable method, provided that the odds ratio (OR) estimates of the methods point in the same direction and no horizontal pleiotropy or heterogeneity is present. A p-value < 0.05 was considered indicative of a significant causal association between exposure and outcome. The “mr_pleiotropy_test” function was employed to evaluate horizontal pleiotropy, while the “mr_heterogeneity” function was used to assess heterogeneity based on both the IVW and MR Egger methods, with results visualized through funnel plots [27]. A p-value ≥ 0.05 indicated the absence of horizontal pleiotropy or heterogeneity. Sensitivity analysis was performed using the leave-one-out method. Ultimately, we identified a significant causal association between an increased number of DN T cells and the occurrence of GC.
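The instrument filter and the IVW estimate described above can be illustrated with a minimal sketch. It is not the TwoSampleMR implementation: it assumes the common approximation F = (β/SE)² for the per-SNP F-statistic and a fixed-effect IVW estimator, neither of which is specified in the text, and the summary statistics are invented for illustration.

```python
import math

def f_statistic(beta_exposure, se_exposure):
    """Approximate per-SNP instrument strength as (beta / se)^2 (an assumption)."""
    return (beta_exposure / se_exposure) ** 2

def ivw_estimate(snps):
    """Fixed-effect inverse-variance-weighted causal estimate.

    Each SNP is a dict with bx (SNP-exposure beta), by (SNP-outcome beta)
    and se_y (standard error of by).
    """
    numerator = sum(s["bx"] * s["by"] / s["se_y"] ** 2 for s in snps)
    denominator = sum(s["bx"] ** 2 / s["se_y"] ** 2 for s in snps)
    beta = numerator / denominator
    se = math.sqrt(1.0 / denominator)
    return beta, se

# Hypothetical summary statistics, not taken from the study.
snps = [
    {"bx": 0.10, "se_x": 0.02, "by": 0.015, "se_y": 0.010},
    {"bx": 0.08, "se_x": 0.02, "by": 0.010, "se_y": 0.012},
    {"bx": 0.03, "se_x": 0.02, "by": 0.020, "se_y": 0.015},  # weak instrument
]

# Keep only instruments with F > 10, then pool them with IVW.
strong = [s for s in snps if f_statistic(s["bx"], s["se_x"]) > 10]
beta_ivw, se_ivw = ivw_estimate(strong)
print(len(strong), round(beta_ivw, 4), round(se_ivw, 4))
```

In this toy input the third SNP has F = 2.25 and is dropped before pooling, which is the role the F > 10 threshold plays in the pipeline.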

Annotation of DN T cells and acquisition of characterized genes
Single-cell RNA sequencing (scRNA-seq) data of five GC samples from GSE167297, comprising a total of 22,464 cells, were downloaded from the TISCH database (http://tisch.comp-genomics.org/) [28]. To ensure data quality, we applied the following quality control criteria: cells were retained only if they contained between 200 and 6,000 detected genes; cells with total unique molecular identifier (UMI) counts below 500 or above 50,000 were excluded; and cells with a mitochondrial gene ratio exceeding 15% were removed to eliminate apoptotic or damaged cells. Potential doublets were further identified and removed using DoubletFinder (v2.0.3). The TISCH database provided initial quality control and normalization of the data. We converted the data into a “Seurat” object and applied the “ScaleData” function to scale the data [29]. The “RunPCA” function was used to extract the first 15 principal components for dimensionality reduction based on the top 2,000 highly variable genes. Cell clustering was performed using the “FindNeighbors” and “FindClusters” functions. Uniform Manifold Approximation and Projection (UMAP) and t-distributed Stochastic Neighbor Embedding (tSNE) were employed for 2D visualization of the data [28]. Subsequently, T cells were annotated based on the expression of the canonical marker genes CD3D and CD3E. From this T cell population, DN T cells were specifically defined as CD3D+ CD3E+ cells that were negative for CD4 and CD8A expression. The “FindAllMarkers” function was then used to identify characteristic genes of DN T cells, applying screening criteria of pct > 0.25 and logFC > 0.25 and removing non-coding genes. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were conducted to explore the associated biological signaling pathways.
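The quality control thresholds and the DN T cell definition above amount to two filters applied per cell. The sketch below expresses them in plain Python rather than Seurat. The cell records are invented, and treating any count above zero as "positive" is an assumption, since the text does not state the expression cutoff used for gating.

```python
def passes_qc(cell):
    """Apply the stated thresholds: 200-6,000 genes, 500-50,000 UMIs, <= 15% mitochondrial."""
    return (
        200 <= cell["n_genes"] <= 6000
        and 500 <= cell["n_umi"] <= 50000
        and cell["pct_mito"] <= 15.0
    )

def is_dn_t_cell(cell):
    """CD3D+ CD3E+ and negative for CD4 and CD8A (positivity = count > 0, an assumption)."""
    expr = cell["expr"]
    is_t_cell = expr.get("CD3D", 0) > 0 and expr.get("CD3E", 0) > 0
    return is_t_cell and expr.get("CD4", 0) == 0 and expr.get("CD8A", 0) == 0

# Hypothetical cells for illustration only.
cells = [
    {"id": "c1", "n_genes": 1500, "n_umi": 4000, "pct_mito": 5.0,
     "expr": {"CD3D": 3, "CD3E": 2}},                # DN T cell
    {"id": "c2", "n_genes": 1800, "n_umi": 5200, "pct_mito": 4.0,
     "expr": {"CD3D": 4, "CD3E": 1, "CD8A": 6}},     # CD8+ T cell
    {"id": "c3", "n_genes": 150, "n_umi": 300, "pct_mito": 2.0,
     "expr": {"CD3D": 1, "CD3E": 1}},                # fails QC (too few genes and UMIs)
    {"id": "c4", "n_genes": 2500, "n_umi": 9000, "pct_mito": 22.0,
     "expr": {"CD3D": 2, "CD3E": 2}},                # fails QC (mitochondrial ratio)
]

kept = [c for c in cells if passes_qc(c)]
dn_t_ids = [c["id"] for c in kept if is_dn_t_cell(c)]
print(dn_t_ids)
```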

ML algorithm screens for GC-related DN T cell signature genes
GC-related signature screening was performed for the identified DN T cell genes based on transcriptomic data from the TCGA-STAD cohort. Initially, differential expression analysis was conducted between GC and adjacent normal samples, using the thresholds |logFC| > 1 and p < 0.05, with p-values corrected for false discovery rate (FDR). Subsequently, three ML algorithms were employed for GC-related feature screening. The “glmnet” R package was utilized for Least Absolute Shrinkage and Selection Operator (LASSO) regression, which performs feature selection through a penalty term that shrinks the coefficients of uninformative features to zero [30]. The “randomForest” R package was used to run the Random Forest (RF) algorithm, which estimates the impact of each feature on prediction from its use across many decision trees, with a default of 100 iterations. The RF model was considered robust when 500 decision trees were constructed. Gene importance was assessed by the “decrease in accuracy” measure, and genes with importance scores greater than 1 were retained.
Additionally, the “e1071” and “caret” R packages were employed to implement the Support Vector Machine-Recursive Feature Elimination (SVM-RFE) algorithm, a backward selection method based on the maximum-margin principle of SVM [31]. After 10-fold cross-validation, the genes corresponding to the model with the highest accuracy and lowest error were selected for further analysis. Finally, the gene sets identified by the three ML algorithms were intersected to obtain the DN T cell signature genes closely associated with gastric carcinogenesis.
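The final consensus step is a set intersection over the three selections. A minimal sketch follows, using placeholder gene names and made-up RF importance scores. Only the "importance greater than 1" filter and the intersection itself come from the text.

```python
# Placeholder selections; gene names and scores are hypothetical.
lasso_genes = {"G1", "G2", "G3", "G5"}
svm_rfe_genes = {"G2", "G3", "G4", "G6"}
rf_importance = {"G1": 0.4, "G2": 3.1, "G3": 1.7, "G4": 2.2, "G6": 0.9}

# Random Forest keeps genes whose importance score exceeds 1.
rf_genes = {gene for gene, score in rf_importance.items() if score > 1}

# Signature genes are those selected by all three algorithms.
signature_genes = sorted(lasso_genes & rf_genes & svm_rfe_genes)
print(signature_genes)
```

Requiring agreement across all three methods is a conservative choice: a gene favored by only one algorithm is discarded, which trades sensitivity for robustness of the final signature.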

Nomogram construction and evaluation
Based on the identified GC-related DN T cell signature genes, independent variables were selected using multivariable logistic regression with forward-backward stepwise selection, and a nomogram was constructed for GC diagnosis. The probability of a patient being diagnosed with GC was calculated from the sum of the scores of the model genes. The “rms” and “regplot” R packages were employed to fit and visualize the nomogram, while calibration curves were used to assess the consistency between predicted and actual outcomes. Decision Curve Analysis (DCA) and the Clinical Impact Curve (CIC), based on the “rmda” R package, evaluated the net clinical benefit of the nomogram, with the baseline incidence in the population set to 0.02 [32]. Additionally, the area under the Receiver Operating Characteristic (ROC) curve (AUC) and the F1-score from the confusion matrix were computed to further assess the classification performance of the nomogram; higher AUC and F1-scores indicate better prediction accuracy. To validate the model’s accuracy, six external cohorts were used: GSE54129, GSE66229, GSE179252, GSE13911, GSE113255, and GSE184336.
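Two of the metrics named above have compact definitions worth making explicit. The sketch below computes the F1-score from a confusion matrix and the net benefit at a single threshold probability, using the standard decision-curve formula NB = TP/n − (FP/n) · pt/(1 − pt). The labels and predicted probabilities are invented, and the prevalence adjustment to 0.02 performed in the study is not reproduced here.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, FN, TN when predicting positive at y_prob >= threshold."""
    tp = fp = fn = tn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and truth == 1:
            tp += 1
        elif predicted and truth == 0:
            fp += 1
        elif not predicted and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

def net_benefit(tp, fp, n, threshold):
    """Net benefit at one threshold probability, as used in decision curve analysis."""
    return tp / n - (fp / n) * threshold / (1 - threshold)

# Hypothetical labels (1 = GC) and model probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

tp, fp, fn, tn = confusion_counts(y_true, y_prob, threshold=0.5)
f1 = f1_score(tp, fp, fn)
nb = net_benefit(tp, fp, len(y_true), threshold=0.5)
print(tp, fp, fn, tn, f1, nb)
```

A decision curve is obtained by repeating the net benefit calculation over a range of thresholds and comparing against the "classify everyone as positive" and "classify no one as positive" strategies.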

Cohort immune infiltration and prediction of treatment responsiveness
A set of 23 marker genes for immune cells was retrieved from the TISIDB database (http://cis.hku.hk/TISIDB/data/download/CellReports.txt) [33]. The relative abundance of these 23 immune cell infiltrations within the TIME of TCGA-STAD cohort patients was assessed using the ssGSEA algorithm. Four Immunophenoscores (IPS), downloaded from The Cancer Immunome Atlas (TCIA) database (https://tcia.at/home), were utilized to reflect the responsiveness of TCGA-STAD patients to different immunotherapies [34]. Higher IPS values indicate greater responsiveness to immunotherapy. The “oncoPredict” R package constructed a ridge regression model based on the “GDSC2” dataset, which was used to predict the LNIC50 and AUC values for seven common STAD drugs in TCGA-STAD patients [35]. Lower predicted LNIC50 and AUC values suggest increased sensitivity of the patients to the drugs.

Shapley additive explanation (SHAP) analysis
SHAP is a tool designed to interpret predictions made by ML models. Grounded in game theory, SHAP analysis quantifies the average marginal contribution of each feature to a specific prediction, thereby elucidating the results of the prediction model and effectively addressing the “black box” problem [36]. To mitigate class imbalance, we oversampled a limited number of samples in the dataset and subsequently divided the data into training and testing sets at a ratio of 7:3 for model construction and performance evaluation. The contribution of each feature gene to the model’s predictions was calculated using SHAP values based on a K-nearest neighbor (KNN) classifier, and the importance of each gene was ranked to assess its influence on model decisions.
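The "average marginal contribution" that SHAP reports is the Shapley value from cooperative game theory. For a handful of features it can be computed exactly by enumerating feature coalitions, as in the sketch below. This is not the SHAP library and not the KNN classifier used in the study: the scoring function, its weights, and the input values are hypothetical, and absent features are replaced by a single baseline value, which is one of several possible conventions.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline)."""
    n = len(x)

    def value(present):
        # Features outside the coalition take their baseline value.
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(z)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                members = set(coalition)
                total += weight * (value(members | {i}) - value(members))
        contributions.append(total)
    return contributions

# Hypothetical linear score over the four model genes (weights are made up).
genes = ["EML4", "IL32", "FXYD5", "TTC39C"]
weights = [0.5, 0.3, -0.2, 0.1]

def toy_score(z):
    return sum(w * v for w, v in zip(weights, z))

x = [1.0, 2.0, 3.0, 4.0]          # expression values of one sample (invented)
baseline = [0.0, 0.0, 0.0, 0.0]   # reference sample

phi = shapley_values(toy_score, x, baseline)
ranking = sorted(zip(genes, phi), key=lambda item: abs(item[1]), reverse=True)
print(ranking)
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the additivity property that makes the per-gene ranking interpretable.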

Clinical sample collection
The study prospectively enrolled 27 GC patients who were scheduled to undergo radical surgery at the Department of Gastrointestinal Surgery, Third Xiangya Hospital, Central South University, between January 2023 and December 2024 (the XY3-GC cohort). The inclusion criteria were: (1) initial diagnosis of primary gastric adenocarcinoma confirmed by preoperative gastroscopy and biopsy; (2) planned for curative resection; and (3) availability of complete clinicopathological data. Patients were excluded if they had: (1) distant metastasis (M1 stage); (2) a history of other malignancies; (3) severe cardiopulmonary, hepatic, or renal dysfunction contraindicating surgery; (4) autoimmune diseases, active infections, or chronic infectious diseases that could significantly alter immune status; or (5) receipt of neoadjuvant therapy prior to surgery. Tumor tissues and matched paracancerous tissues (from normal gastric mucosa > 2 cm from the tumor edge) were collected from all participants. All samples were stored in liquid nitrogen or neutral buffered formalin within 30 min of resection for subsequent experiments. The study was approved by the Institutional Ethics Committee, and written informed consent was obtained from all patients. Table 1 summarizes the clinical information of the patients.

Animal experimentation
Thirty six-week-old male 615 mice, averaging 20 g in weight, were obtained from the Laboratory Animal Center of the Institute of Hematology, Chinese Academy of Medical Sciences [License No. SCXK (Tianjin) 2020-0001]. The mice were housed in the SPF-grade laboratory of the Department of Laboratory Zoology at Central South University [License No. SYXK (Hunan) 2020-0019]. The mouse forestomach carcinoma (MFC) cell line was purchased from Wuhan Procell Life Technology (Wuhan, China) and cultured in RPMI-1640 complete medium supplemented with 10% fetal bovine serum and 1% penicillin-streptomycin at 37 °C in a 5% CO2 incubator. Upon collection, a cell suspension of 10^7 cells in 0.2 mL was injected subcutaneously into the right hind limb of two 615 mice. After approximately 14 days, when the tumors reached a diameter of 10 mm, the mice were euthanized, and the tumors were excised for passaging. This procedure was repeated until the 8th generation, at which point the tumors were preserved as a source for in situ transplantation.
For in situ transplantation, reserved tumor blocks were placed in serum-free RPMI-1640 medium, and the fibrous envelope, blood vessels, and necrotic tissues were removed. Viable tumor masses were selected and cut into approximately 1 mm³ pieces. Ten 615 mice were fasted for 12 h prior to surgery, then anesthetized with 0.3% sodium pentobarbital via intraperitoneal injection. The abdominal skin was disinfected and incised along the midline with ophthalmic scissors to carefully expose the peritoneum and stomach wall. The serosal layer at the greater curvature of the stomach was gently peeled away using ophthalmic forceps to create a localized niche in the gastric wall. In the five mice of the in situ tumor-forming group, the tumor blocks were embedded into this niche, and the surrounding serosa was lifted to envelop the tumor. A 5 µL drop of tissue adhesive (3M Vetbond, China) was added, and the stomachs were repositioned after the adhesive solidified. In the sham-operated group, the stomachs of five other mice were closed without implanting tumor blocks. Following surgery, the abdomens of all mice were closed in layers and bandaged with sterile gauze. After recovery, the mice were observed for activity, and the abdomen was gently palpated for the formation of a hard mass. Three weeks post-surgery, the mice were euthanized, and the entire stomach tissue was removed for subsequent experiments.

Histopathological analysis
Tumors and paracancerous tissues from 27 GC patients, along with five gastric in situ graft tumors and five normal gastric tissues from 615 mice, were fixed in neutral fixative at room temperature and subsequently embedded in paraffin. The gross morphology of mouse in situ graft tumors and normal gastric mucosa was examined using hematoxylin and eosin (H&E) staining. Paraffin-embedded tissue sections were baked at 60 °C for 12 h, deparaffinized in xylene for 20 min (repeated three times), and then dehydrated through graded ethanol (100%, 95%, 85%, and 75%) for 5 min each, followed by a wash with distilled water for 5 min. Sections were stained with hematoxylin for 10 min, rinsed with distilled water, and blued with PBS before being stained with eosin for 5 min and rinsed again. The sections underwent gradient alcohol dehydration, were placed in xylene for 10 min, and then sealed with neutral gum.
To quantify the expression levels of target proteins and GC tumor markers, the tissue sections were baked, deparaffinized, and rehydrated before being treated with 3% hydrogen peroxide at 25 °C for 25 min to block endogenous peroxidase. The sections were then incubated with primary antibodies: IL32 (1:200, Abiowell, China), EML4 (1:200, Abiowell, China), FXYD5 (1:200, Abiowell, China), and TTC39C (1:500, ThermoFisher, USA) overnight at 4 °C. Following this, CEACAM5 (1:200, Abiowell, China) and Cytokeratin 7 (1:200, Abiowell, China) were incubated overnight. The expression of CD3, CD4, and CD8a in the tissues was detected using multiplexed immunofluorescence. Following baking, deparaffinization, and rehydration, sections were immersed in citrate buffer (AWI0113a, Abiowell, China) for antigen retrieval. Antibody drops for CD8a (1:100, Abiowell, China) were applied overnight at 4 °C, followed by incubation with anti-rabbit HRP secondary antibody and fluorescent dye TSA-520 (AWI0688, Abiowell, China). This process was repeated for CD4 and CD3 using TSA-570 (AWI0689, Abiowell, China) and TSA-620 (AWI0690, Abiowell, China), respectively. Nuclei were stained with DAPI, rinsed, sealed with buffered glycerol, and stored in the dark for fluorescence microscopy.
All slides were scanned using a Pannoramic MIDI II-3D scanner (HISTECH Ltd.), and images were acquired at the same magnification in SlideViewer (Version 2.5). The area of immunohistochemical positivity was analyzed using ImageJ (Java 1.8.0_322).

RNA extraction and quantitative Real-Time polymerase chain reaction (qRT-PCR)
Total RNA was extracted using the Accurate SteadyPure Fast RNA Extraction Kit (AG21023, Hunan, China). RNA concentration and purity were assessed using a NanoDrop spectrophotometer (Thermo Fisher, USA). Reverse transcription to cDNA was performed using HiScript III RT SuperMix for qPCR (Vazyme #R323-01), and real-time quantification was conducted using ChamQ Universal SYBR qPCR Master Mix (Vazyme #Q711). GAPDH was used as the internal reference gene, and the 2^(−ΔΔCt) method was employed to calculate the relative expression levels of target genes. The primers used for qRT-PCR are detailed in Table S1.
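For readers unfamiliar with the ΔΔCt calculation, the arithmetic is short enough to show directly. The Ct values below are invented for illustration. The only elements taken from the text are the use of GAPDH as the reference gene and the 2 to the power of minus ΔΔCt formula.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^(-ddCt) method."""
    delta_ct_sample = ct_target_sample - ct_ref_sample      # normalize to reference gene
    delta_ct_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2 ** (-delta_delta_ct)

# Hypothetical Ct values: target gene vs. GAPDH in tumor and adjacent tissue.
fold_change = relative_expression(
    ct_target_sample=24.0, ct_ref_sample=18.0,    # tumor: dCt = 6
    ct_target_control=26.0, ct_ref_control=18.0,  # adjacent tissue: dCt = 8
)
print(fold_change)
```

Here the tumor sample reaches threshold two cycles earlier than the control after normalization, so ΔΔCt is −2 and the estimated expression is four times higher.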

Flow cytometry
Immune cells in tissue samples were analyzed using flow cytometry. Tissue cells were isolated via trypsinization, and cell aggregates were removed to obtain single-cell suspensions, which were subsequently incubated with Zombie NIR™ Dye (BioLegend, USA) and fluorescent antibodies against mouse CD45, CD11b, CD3, CD4, and CD8a (BD Pharmingen™, USA) at 4 °C for 40 min in the dark. The stained cells were washed three times with PBS and analyzed with flow cytometry to determine the expression levels of various fluorescent markers. Appropriate channel settings allowed the differentiation of various immune cell populations based on expression of CD45 and CD11b, while T cells and their subsets were further classified based on CD3, CD4, and CD8 expression. The relative proportions of cells in each subpopulation were analyzed and quantified using FlowJo (Version 10.7.2).

Western blotting
Total protein was extracted from tissue samples using RIPA lysis buffer supplemented with protease and phosphatase inhibitors. Protein concentration was determined using a BCA assay. Equal amounts of protein were mixed with 5× SDS-PAGE loading buffer (Epizyme, China), boiled at 95 °C for 10 min to denature, and then separated by electrophoresis on SDS-polyacrylamide gels (Epizyme, China). Subsequently, the proteins were transferred onto polyvinylidene fluoride (PVDF) membranes (Millipore, USA). The membranes were blocked with 5% non-fat milk in Tris-buffered saline containing 0.1% Tween-20 (TBST) for 1 h at room temperature and then incubated with primary antibodies diluted in blocking buffer at 4 °C overnight. The primary antibodies used were as follows: anti-FXYD5 (1:1000, Abiowell, China), anti-PD-L1 (1:1000, Cell Signaling Technology, USA), and anti-β-Tubulin (1:10,000, Abcam, UK). After incubation, the membranes were washed three times with TBST and incubated with horseradish peroxidase (HRP)-conjugated goat anti-rabbit IgG secondary antibody (1:10,000, Abcam, UK) for 1 h at room temperature. Protein bands were visualized using an enhanced chemiluminescence (ECL) substrate (Meilunbio, China) and captured with a chemiluminescence imaging system (Servicebio, China). β-Tubulin was used as an internal loading control for normalization.

Molecular Docking
Human EML4 (UniProt ID: Q9HC35), FXYD5 (UniProt ID: Q96DB9), IL32 (UniProt ID: P24001), and TTC39C (UniProt ID: Q8N584) were selected as receptor proteins based on previous analyses. The three-dimensional structures of these proteins were obtained from the AlphaFold database (version 2.0, accessed 2024/12/21). Structure files for approved drug molecules were acquired from the ZINC 20 database, and energy minimization (conformational optimization) was performed using the MMFF94 force field. Proteins were prepared by removing original ligands and water molecules using PyMOL (Version 2.5.7), followed by the addition of hydrogen atoms and calculation of Gasteiger charges using AutoDock Tools (Version 1.5.7). After pre-processing, molecular docking was executed using AutoDock Vina (Version 1.2.3) to obtain the docking free energies and binding poses. The molecular docking parameters were set as follows:
EML4: Center (X, Y, Z) = (53.593, 40.872, 22.749); Size (X × Y × Z) = (20.25 × 20.25 × 41.25),
FXYD5: Center (X, Y, Z) = (8.41, 2.008, −11.464); Size (X × Y × Z) = (66.85 × 66.85 × 66.85),
IL32: Center (X, Y, Z) = (3.206, −3.011, 22.422); Size (X × Y × Z) = (32.0 × 36.0 × 112.0),
TTC39C: Center (X, Y, Z) = (17.804, 1.899, 4.735); Size (X × Y × Z) = (104.0 × 56.0 × 62.0).
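The search boxes listed above map directly onto the center and size options of an AutoDock Vina configuration file. The sketch below generates such a file as text. The box values are copied from the list above, whereas the receptor and ligand file names are hypothetical, and no other Vina settings (for example exhaustiveness) are assumed because the text does not report them.

```python
# Search boxes as reported above; Vina interprets these values in angstroms.
DOCKING_BOXES = {
    "EML4": {"center": (53.593, 40.872, 22.749), "size": (20.25, 20.25, 41.25)},
    "FXYD5": {"center": (8.41, 2.008, -11.464), "size": (66.85, 66.85, 66.85)},
    "IL32": {"center": (3.206, -3.011, 22.422), "size": (32.0, 36.0, 112.0)},
    "TTC39C": {"center": (17.804, 1.899, 4.735), "size": (104.0, 56.0, 62.0)},
}

def vina_config(target, receptor_path, ligand_path):
    """Build the text of a Vina config file for one target protein."""
    box = DOCKING_BOXES[target]
    lines = [f"receptor = {receptor_path}", f"ligand = {ligand_path}"]
    for axis, center, size in zip("xyz", box["center"], box["size"]):
        lines.append(f"center_{axis} = {center}")
        lines.append(f"size_{axis} = {size}")
    return "\n".join(lines)

# File names are placeholders, not taken from the study.
config_text = vina_config("EML4", "EML4_receptor.pdbqt", "ligand.pdbqt")
print(config_text)
```

Writing the returned text to a file and passing it to Vina through its `--config` option would reproduce the box definition for that target. Looping over the four keys yields one configuration per receptor.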

Molecular dynamics simulations
Compounds exhibiting the largest absolute binding energy to each target protein were selected for MD simulations conducted using GROMACS (2024.3 GPU-CUDA). The AMBER14SB_OL15 force field was applied for the proteins, and the TIP3P model was utilized for water molecules. Topology files for the proteins were generated using GROMACS; for the EML4 protein, which contains non-standard amino acid residues, normalization was performed with pdbfixer in the conda environment prior to topology generation (specifically, MSE non-standard residues were substituted with MET). Ligand compounds were constructed using Antechamber Tools, and complex topology files were assembled through manual combination.
For the membrane protein FXYD5, a distinct simulation system was constructed to account for the lipid environment. A transmembrane fragment of FXYD5 (sequence: GLLVAAVLFITGIIILTSG), encompassing the complete α-helical structure from SER127 to LYS165 which contains the ligand binding site, was embedded into a mixed lipid bilayer composed of POPC and POPE (ratio 85:15) using the CHARMM-GUI online server. The protein was oriented perpendicular to the x-y plane of the membrane, with the ligand molecule maintaining its relative position to FXYD5. The system was solvated with TIP3P water layers above and below the membrane, resulting in final dimensions of 5.35 nm × 5.35 nm × 9.95 nm. During simulation, semi-isotropic pressure coupling was applied exclusively in the x and y directions to maintain membrane integrity.
For the other three target proteins (EML4, IL32, TTC39C), each complex was placed in a cubic box filled with TIP3P water molecules under periodic boundary conditions (PBC).
Across all systems, sodium and chloride ions were added to achieve charge neutrality. Energy minimization of the entire system was performed using the steepest descent method (maximum 50,000 steps). The LINCS algorithm constrained bond lengths and angle vibrations involving hydrogen atoms, while long-range electrostatic interactions were addressed by the Particle-Mesh Ewald (PME) method, with a short-range electrostatic cutoff set at 1.0 nm. The V-rescale thermostat was used to maintain a temperature of 298 K. All systems underwent initial pre-equilibration through 100 ps NVT simulation, followed by 100 ps NPT simulation using the C-rescale barostat (1.0 bar, isotropic for non-membrane systems, semi-isotropic for the FXYD5-membrane system). Finally, production simulations were conducted for 100 ns with a time step of 2.0 fs under constant temperature and pressure conditions.
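The simulation lengths above translate into integration step counts once a time step is fixed. The short calculation below makes that conversion explicit. The 2.0 fs time step is stated only for the production run, so applying it to the NVT and NPT equilibration phases is an assumption.

```python
def n_steps(duration_ps, timestep_fs):
    """Number of integration steps needed to cover duration_ps picoseconds."""
    return round(duration_ps * 1000 / timestep_fs)

TIMESTEP_FS = 2.0  # stated for production; assumed for equilibration

nvt_steps = n_steps(100, TIMESTEP_FS)                # 100 ps NVT
npt_steps = n_steps(100, TIMESTEP_FS)                # 100 ps NPT
production_steps = n_steps(100 * 1000, TIMESTEP_FS)  # 100 ns production

print(nvt_steps, npt_steps, production_steps)
```

Under these assumptions each equilibration phase takes 50,000 steps and the production run takes 50,000,000 steps, which are the values that would be entered as the step count in the corresponding run parameter files.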

Statistical analysis
ML and bioinformatics analyses were primarily performed using R (version 4.2.1), with Perl employed for batch data processing. Differential expression analysis was conducted with the “limma” R package. Group comparisons for continuous variables utilized the Wilcoxon test (two groups) or Kruskal-Wallis test (three or more groups). Categorical variables were compared using the chi-square test. Survival differences were assessed with Kaplan-Meier curves and log-rank tests via the “survival” and “survminer” packages. Correlation analyses were performed using Pearson’s method. SHAP analysis and molecular docking (including MMFF94 force field optimization) were implemented in Python (version 3.9.9). Statistical analysis for experimental data was carried out using GraphPad Prism (version 9.0). A two-tailed p-value < 0.05 was considered statistically significant for all tests.

Results

Exploring the causal impact of immune cell phenotypes on GC
To investigate the causal effects of 731 immune cell phenotypes on GC pathogenesis, we performed a two-sample MR analysis. The IVW method was used as the primary analytical approach, and phenotypes were screened on two criteria: effect estimates from all five MR methods pointing in the same direction, and no indication of horizontal pleiotropy or heterogeneity. We identified causal associations between 35 immune cell phenotypes and GC pathogenesis, with four phenotypes exhibiting significant results across all four MR methods: CD4-CD8- T cell Absolute Count, IgD- CD27- B cell Absolute Count, IgD+ CD38dim B cell Absolute Count, and IgD expression on IgD+ B cells. The first three phenotypes were found to be risk factors for the development of GC, whereas the last phenotype served as a protective factor (Table S2). Since the CD4-CD8- T cell Absolute Count exhibited the highest β value, indicating the most substantial causal effect on GC pathogenesis, we selected it for further investigation. An increased CD4-CD8- T cell Absolute Count significantly elevated the risk of GC pathogenesis (β = 0.138, 95% CI: 0.059–0.218, p_IVW = 6.4e-04, Fig. 1A) and showed no evidence of horizontal pleiotropy (p = 0.25, Fig. 1A) or heterogeneity (p_MR_Egger = 0.24, p_IVW = 0.22, MR_PRESSO outlier SNP = None, Fig. 1B). Figure 1C shows the total effect of the 20 SNPs on the onset of GC together with the individual effect of each SNP. A leave-one-out sensitivity analysis confirmed the robustness of the observed causal associations (Fig. 1D).
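The reported IVW estimate can be read on the odds ratio scale and checked for internal consistency with a few lines of arithmetic. The sketch assumes that β is a log odds ratio and that the confidence interval is a symmetric normal-approximation interval (β ± 1.96 · SE). Both assumptions are conventional for MR output but are not stated explicitly in the text.

```python
import math

beta, ci_low, ci_high = 0.138, 0.059, 0.218  # values reported above

# Odds ratio and its 95% CI, assuming beta is a log odds ratio.
odds_ratio = math.exp(beta)
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

# Back out the standard error from the CI width, then a two-sided normal p-value.
se = (ci_high - ci_low) / (2 * 1.96)
z = beta / se
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(odds_ratio, 3), round(or_low, 3), round(or_high, 3))
print(round(z, 2), p_value)
```

Under these assumptions the estimate corresponds to an odds ratio of about 1.15 per unit increase in the exposure, and the recomputed p-value is on the order of 6 to 7e-04, in line with the reported p_IVW once rounding of β and the interval bounds is taken into account.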

Screening of genes characterizing DN T cells in the GC microenvironment
To explore the biological functions of DN T cells in the GC microenvironment, we analyzed single-cell sequencing data from GSE167297 to construct a microenvironmental landscape of GC. DN T cells were annotated based on the expression profiles of CD3D, CD3E, CD4, and CD8A (Fig. 2A and C), leading to the identification of 113 marker genes. The results of GO (Figure S1A, S1B) and KEGG enrichment analyses (Figure S1C, S1D) indicated that these 113 marker genes were primarily enriched in T-cell-related pathways, including T-cell activation, T-cell differentiation, and T-cell receptor signaling pathways.
Subsequently, we conducted further gene screening using transcriptomic data. Differential analysis revealed that 53 out of the 113 marker genes exhibited significant expression differences between GC samples and normal gastric tissues, with 46 genes showing significantly higher expression levels in GC samples. To refine our search, three ML algorithms were applied for feature selection. The RF algorithm identified 17 feature genes (Fig. 2D and E), while the LASSO algorithm yielded 23 feature genes (Fig. 2F and G), and the SVM-RFE algorithm identified 18 feature genes (Fig. 2H and I). A total of 9 genes were common across the three algorithms: IL32, ABRACL, TTC39C, FYB1, FXYD5, EML4, ANKRD28, LEPROTL1, and SH2D2A (Fig. 2J).

Construction and evaluation of GC diagnostic nomogram
Based on the identified GC-related DN T cell signature genes, independent variables were selected using multivariable logistic regression with forward-backward stepwise selection, and a nomogram was constructed for GC diagnosis (Fig. 3A). The probability of a patient being diagnosed with GC was calculated by summing the scores of the model genes. The “rms” and “regplot” R packages were employed to fit and visualize the nomogram. To quantitatively assess the calibration of the nomogram, we calculated the Brier score and performed the Hosmer-Lemeshow goodness-of-fit test. The Brier score of 0.025 indicates excellent predictive accuracy (a score of 0 represents perfect accuracy, and lower scores are better), and the non-significant result of the Hosmer-Lemeshow test (p-value = 1) suggests no significant deviation between the predicted probabilities and observed outcomes, confirming good model fit. Calibration curves demonstrated high predictive consistency of the model (Fig. 3B), while DCA indicated that the model provided a high net clinical benefit for patients compared to the all-patient scenario and the no-diagnosis scenario (Fig. 3C). Additionally, the CIC generated from the DCA results showed that the nomogram offered superior overall net benefits across a wide and practical range of threshold probabilities (Fig. 3D).
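As a rough illustration of how a four-gene logistic model yields a probability and how the Brier score is computed, consider the following Python sketch. The intercept, coefficients, and expression values are hypothetical, since the fitted parameters are not given in the text; the study itself fitted the model in R.

```python
import math

def logistic_probability(intercept, coefficients, expression):
    """Predicted probability of GC from a logistic regression model."""
    linear = intercept + sum(coefficients[g] * expression[g] for g in coefficients)
    return 1.0 / (1.0 + math.exp(-linear))

def brier_score(probabilities, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)

# Hypothetical coefficients and intercept (not the fitted values from the study).
coef = {"EML4": 1.2, "IL32": 0.8, "FXYD5": 0.9, "TTC39C": 0.6}
intercept = -20.0

# Hypothetical expression profiles with their true labels (1 = GC, 0 = normal).
samples = [
    ({"EML4": 8.0, "IL32": 9.0, "FXYD5": 7.5, "TTC39C": 6.0}, 1),
    ({"EML4": 5.0, "IL32": 5.5, "FXYD5": 4.0, "TTC39C": 4.5}, 0),
]
probs = [logistic_probability(intercept, coef, x) for x, _ in samples]
labels = [y for _, y in samples]
score = brier_score(probs, labels)
```

A perfect classifier gives a Brier score of 0, while always predicting 0.5 gives 0.25, which is the scale against which a value such as 0.025 should be read.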
We further evaluated the classification performance of the nomogram using the ROC curve and confusion matrix. The results revealed that the nomogram achieved an AUC of 0.88 or higher in both the TCGA training cohort and the six external validation cohorts, which was significantly superior to the single-gene results of the four model genes (Figure S2). The F1-scores of the confusion matrices all exceeded 0.85, demonstrating that the nomogram has high predictive classification performance for GC diagnosis (Figure S3).
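The two classification metrics reported above can be sketched in a few lines of Python. The predicted probabilities and labels below are hypothetical, and the 0.5 decision threshold is an assumption; the text does not state which cutoff was used for the confusion matrices.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(scores, labels, threshold=0.5):
    """F1-score from the confusion matrix at a given probability threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical model outputs for four GC and four normal samples.
scores = [0.95, 0.80, 0.70, 0.40, 0.60, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
auc = roc_auc(scores, labels)
f1 = f1_score(scores, labels)
```

Because AUC is threshold-free and the F1-score depends on the chosen cutoff, the two metrics can diverge, which is one reason to report both.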

Interpretation of the GC diagnostic prediction model
To elucidate the decision-making process of the GC diagnostic model, we employed SHAP analysis to interpret the model’s output by assessing the contribution of each feature gene to the predictions. The summary plot (Fig. 3E) visualizes the direction and magnitude of the influence exerted by the four feature genes on the model’s predictions in the TCGA-STAD cohort. This plot demonstrates how each gene either positively or negatively affects the predicted probability of GC. The summary bar graph (Fig. 3F) shows the relative contributions of each feature gene to the model’s predictions, with average SHAP values indicating that EML4 is the most influential gene in the prediction model. The decision plot (Fig. 3G) provides a cumulative view of how the four genes influence the decision pathway for all samples, helping to trace the model’s reasoning through each sample’s data. Additionally, the dependency plot (Fig. 3H) illustrates the relationship between each feature gene’s actual value and its corresponding SHAP value, revealing interactions between two or more feature genes. This plot highlights how changes in a gene’s expression affect the model’s output, showing clear trends and interactions between the genes.
Our analysis reveals that genes with SHAP values greater than zero drive positive predictions in the model, suggesting that higher expression levels of each of the four target genes promote GC diagnosis. Moreover, the co-expression of these genes generally exhibits a synergistic effect, where the combined expression of these genes further amplifies the diagnostic prediction.
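For intuition about why above-average expression yields positive SHAP values, the following Python sketch uses the closed form for a linear model with independent features, where the SHAP value of a feature on the log-odds scale is its coefficient times the deviation from the background mean. The coefficients, intercept, and expression values are hypothetical, and the explainer actually used in the study is not specified in the text.

```python
def log_odds(intercept, coefficients, sample):
    """Linear predictor (log-odds) of a logistic regression model."""
    return intercept + sum(coefficients[g] * sample[g] for g in coefficients)

def linear_shap(coefficients, background_means, sample):
    """SHAP values on the log-odds scale for a linear model.

    Assumes independent features: phi_g = coef_g * (x_g - mean_g).
    """
    return {g: coefficients[g] * (sample[g] - background_means[g])
            for g in coefficients}

# Hypothetical model parameters and expression values (illustration only).
coef = {"EML4": 1.2, "IL32": 0.8, "FXYD5": 0.9, "TTC39C": 0.6}
intercept = -20.0
background = {"EML4": 6.0, "IL32": 7.0, "FXYD5": 5.5, "TTC39C": 5.0}  # cohort means
sample = {"EML4": 8.0, "IL32": 9.0, "FXYD5": 7.5, "TTC39C": 6.0}      # one patient

phi = linear_shap(coef, background, sample)

# Efficiency property: SHAP values sum to the shift from the average prediction.
shift = log_odds(intercept, coef, sample) - log_odds(intercept, coef, background)
```

With positive coefficients, every gene expressed above the cohort mean receives a positive SHAP value, and the values add up exactly to the difference between this patient's log-odds and the log-odds at the background mean.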

Exploring the biological functions of DN T model genes
To investigate the biological functions of the four DN T cell model genes, we first analyzed their expression at both single-cell and transcriptome levels. All four genes exhibited significantly higher expression in DN T cells (Fig. 4A) and showed increased expression in GC samples compared to normal gastric samples (Fig. 4B). Kaplan-Meier survival analysis based on data from the K-M Plotter database indicated that all four genes acted as significant prognostic risk factors for GC patients (Figs. 4C-E). Figure 4F illustrates the correlation between the expression of these four genes and the infiltration levels of 23 immune cell types, while Fig. 4G demonstrates their relationship with clinical treatment responsiveness metrics, including LNIC50 and AUC values for four IPS and seven drugs. Additionally, we performed MR analysis of model gene expression associated with GC onset using eQTL data. The results revealed that only increased expression of FXYD5 was a significant risk factor for GC development, whereas expression levels of the remaining genes showed no significant causal association with GC (Figure S4, Table S3).

Real-world clinical cohort validation of DN T cell infiltration and prognostic significance
To validate the correlation between DN T cells and GC progression, we collected tumor and matched paracancerous tissues from 27 GC patients. Using multiplex immunofluorescence, we quantified the infiltration of CD4+, CD8+, and DN T cells in tumor and adjacent normal tissues. The results demonstrated a significant increase in both the absolute count and the proportion of DN T cells in tumor tissues compared to paracancerous tissues (Figs. 5A-C). Based on the level of DN T cell infiltration in tumor sections, patients were stratified into DN T-High (n = 14) and DN T-Low (n = 13) groups. Survival analysis revealed that patients in the DN T-High group had significantly shorter progression-free survival (PFS) than those in the DN T-Low group (P = 0.047), with higher DN T infiltration observed in patients experiencing disease progression (Figs. 5D, E). Furthermore, elevated DN T cell counts were significantly associated with advanced clinical stage (Stage III) and the presence of signet-ring cell carcinoma, a subtype with poor prognosis (Fig. 5E).
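The survival comparison above rests on Kaplan-Meier estimation, which can be sketched in plain Python as follows. The follow-up times and event indicators are hypothetical and do not come from the 27-patient cohort; the study itself used the “survival” and “survminer” R packages.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  : follow-up time for each patient
    events : 1 if progression was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events at time t
        leaving = 0   # everyone leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical progression-free survival data in months (illustration only).
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Censored patients reduce the number at risk without lowering the curve, which is what allows patients with incomplete follow-up to contribute to the PFS estimate.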

Model gene expression landscape in clinical cohorts
Expression validation in the same cohort confirmed that all four model genes (EML4, IL32, FXYD5, TTC39C) were significantly upregulated in tumor tissues at both mRNA and protein levels (Figs. 6A, B). Among them, EML4, IL32, and FXYD5 showed particularly high expression in the DN T-High subgroup (Fig. 6C). Given FXYD5’s identification as a causal factor in GC and its high weight in the model, we further investigated its role. Western blot analysis of three paired DN T-High vs. DN T-Low tissues revealed a positive correlation between FXYD5 protein levels, DN T cell infiltration, and the expression of PD-L1—a key immune checkpoint molecule (Fig. 6D). This suggests that FXYD5 may contribute to an immunosuppressive microenvironment.

In vivo validation using a mouse model of GC
To further corroborate the association of DN T cells and model genes with GC, we established an in situ GC model in mice. Tumor blocks from the MFC cell line were transplanted into the gastric wall of 615 mice. Three weeks post-surgery, successful tumor formation was confirmed by gross observation and histology (H&E staining), which revealed loss of normal gastric architecture and features of malignant transformation (Figs. 7A–C). IHC further verified the tumor phenotype, showing elevated expression of GC markers CEA and CK7 (Fig. 7C).
Consistent with our human cohort findings, the in situ tumors exhibited significantly higher mRNA levels of EML4, IL32, and FXYD5 (Fig. 7D), and elevated protein expression of all four model genes (Fig. 7E), compared to normal gastric tissues.
Crucially, flow cytometry analysis demonstrated a significant increase in the proportion of DN T cells among total T cells, as well as within the entire live cell population, in the tumor-bearing mice compared to controls (Figs. 7F, G). These in vivo results collectively validate that DN T cell infiltration and the expression of our model genes are positively associated with GC development.

Virtual screening identifies potential therapeutic agents targeting model proteins
Given the significant upregulation and prognostic value of the four model genes in GC, we performed virtual screening to identify potential therapeutic compounds. Using the encoded proteins (EML4, IL32, FXYD5, TTC39C) as targets, we screened 5,903 approved drugs (including 1,615 FDA-approved). Binding affinity was assessed by molecular docking, with binding energy < 0 kcal/mol indicating spontaneous binding and values < −7.2 kcal/mol suggesting strong affinity [37]. The top-ranking compounds for each target were Eltrombopag (EML4, −7.3 kcal/mol), Antrafenine (IL32, −11 kcal/mol), ZINC000014880001 (FXYD5, −8.9 kcal/mol), and Ledipasvir (TTC39C, −12.9 kcal/mol) (Tables S4-S13).
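To put the binding-energy thresholds in context, the sketch below ranks the four reported top hits and converts an energy to an approximate dissociation constant using ΔG = RT ln Kd. Treating a docking score as a true binding free energy is an assumption made purely for illustration, as is the temperature of 298.15 K; docking scores are only rough estimates of affinity.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def estimated_kd(delta_g, temperature=298.15):
    """Approximate dissociation constant (mol/L) from a binding energy in kcal/mol."""
    return math.exp(delta_g / (R_KCAL * temperature))

# Top-ranked compound and binding energy (kcal/mol) per target, as reported above.
top_hits = {
    "Eltrombopag (EML4)": -7.3,
    "Antrafenine (IL32)": -11.0,
    "ZINC000014880001 (FXYD5)": -8.9,
    "Ledipasvir (TTC39C)": -12.9,
}

STRONG_AFFINITY = -7.2  # threshold for strong affinity used in the text

# Keep hits below the threshold and rank from most to least negative energy.
strong = {name: e for name, e in top_hits.items() if e < STRONG_AFFINITY}
ranking = sorted(strong, key=strong.get)

kd_at_threshold = estimated_kd(STRONG_AFFINITY)
```

Under these assumptions the −7.2 kcal/mol threshold corresponds to a dissociation constant in the low micromolar range, and all four reported hits fall below it.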
To evaluate binding stability, we conducted 100 ns MD simulations. The root-mean-square deviation (RMSD) analysis revealed distinct behaviors across complexes. The RMSD curves for EML4 and IL32 complexes closely overlapped with their respective protein monomers, indicating highly stable binding with minimal conformational perturbation (Figs. 8A, C). For the FXYD5 complex, simulated within a physiologically relevant POPC/POPE lipid bilayer, the RMSD plateaued within 0.25 nm of the monomer, demonstrating robust stability in a membrane environment (Fig. 8B). In contrast, the TTC39C complex exhibited a larger conformational shift (~1.5 nm RMSD), yet achieved a stable plateau, suggesting a structurally adapted but consistent binding mode (Fig. 8D).
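The RMSD values discussed above measure the average displacement of atoms relative to a reference structure. A minimal Python sketch follows; it assumes the frame has already been superimposed on the reference (no fitting step is performed here), and the coordinates are hypothetical.

```python
import math

def rmsd(reference, frame):
    """Root-mean-square deviation between two sets of 3D coordinates (nm).

    Assumes both structures list the same atoms in the same order and that
    the frame has already been aligned to the reference.
    """
    if len(reference) != len(frame):
        raise ValueError("coordinate sets must have the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_ref, atom_frame in zip(reference, frame)
        for a, b in zip(atom_ref, atom_frame)
    )
    return math.sqrt(squared / len(reference))

# Hypothetical coordinates in nm: every atom is displaced by (0.3, 0.0, 0.4).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
frame = [(x + 0.3, y, z + 0.4) for x, y, z in reference]
value = rmsd(reference, frame)
```

In this toy case every atom moves by 0.5 nm, so the RMSD is 0.5 nm; in a real trajectory the same calculation is repeated per frame to produce curves like those in Fig. 8.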
Further analysis of the MD trajectories provided insights into atomic-level interactions. Secondary structure analysis confirmed the preservation of key structural elements during simulations. The root-mean-square fluctuation (RMSF) indicated that residue flexibility was largely constrained upon ligand binding, particularly in the active sites. Solvent-accessible surface area (SASA) values remained stable, implying no significant unfolding or exposure of hydrophobic cores. Analysis of the radius of gyration (Rg) showed minimal fluctuations, indicating that the overall compactness of the protein-ligand complexes was maintained throughout the simulations (Figure S5). Additionally, the number of hydrogen bonds (H-bonds) between each ligand and its target, as analyzed in Figure S5, remained relatively constant throughout the simulation, underscoring the persistence of specific polar interactions that contribute to complex stability. Notably, the FXYD5-ZINC000014880001 complex maintained exceptional conformational integrity within the lipid bilayer, highlighting the robustness of this interaction under near-physiological conditions.
In summary, Eltrombopag, Antrafenine, ZINC000014880001, and Ledipasvir demonstrate high binding affinity and stable interactions with their respective target proteins, supporting their potential as candidate therapeutics for GC.

Discussion

Early-stage GC patients can achieve high rates of complete resection and low recurrence rates through radical surgery, whereas advanced GC is typically associated with local and distant metastases, making radical resection unfeasible. Advanced cases generally require combined radiotherapy and chemotherapy, leading to shorter survival periods and higher recurrence rates [38, 39]. Therefore, early screening and diagnosis of GC are essential for improving patient prognosis and reducing the risk of recurrence and healthcare burdens.
To our knowledge, this is the first study to integrate DN T cells, multi-omics data, and ML algorithms for GC diagnosis. Through bidirectional MR analysis, we identified a significant causal relationship between DN T cell abundance and GC pathogenesis among 731 immune cell phenotypes. Clinical and animal experiments further confirmed that elevated DN T cell infiltration is associated with advanced disease stage, poor histological subtypes, and increased recurrence, underscoring the potential of DN T cells as a prognostic biomarker in GC. DN T cells can have either pro- or anti-cancer roles depending on the tumor type. While studies have shown that DN T cells can effectively inhibit the growth of pancreatic cancer cells and promote apoptosis [40], they have also been found to significantly increase in tumors such as non-small cell lung cancer (NSCLC), hepatocellular carcinoma, and gliomas, where they suppress the immune response to tumor cells by secreting various cytokines, such as IL-10 and TGF-β [41–43]. Furthermore, a substantial increase in DN T cells has been confirmed in the lymph nodes of melanoma patients exhibiting treatment resistance and disease progression, suggesting their contribution to tumor invasiveness [44–46]. Our study is the first to report a causal association between DN T cells and GC pathogenesis, offering a new perspective on the disease’s pathogenesis and characterization.
After determining that DN T cells are significant risk factors for GC occurrence, we identified marker genes for DN T cells based on single-cell data and employed three ML algorithms—LASSO, RF and SVM-RFE—for feature screening. We subsequently constructed a prediction model for GC diagnosis using multifactorial logistic regression, which included the four identified genes. Several ML models for predicting GC risk and prognosis have been published; however, most are based on data mining and validation from one or a few cohorts, making it challenging to achieve optimal predictive performance with a single algorithm. Following multiple iterations of refinement, our predictive model demonstrated high AUC values and F1-scores across all training and validation cohorts, indicating exceptional classification performance. Results from DCA and CIC further confirmed that the model can provide substantial clinical benefit to patients. Our predictive model also exhibited superior performance in multiple external validation sets. Although ML is often perceived as a “black box,” making it difficult to accept such opaque decision-making tools in clinical settings, we employed SHAP analysis to elucidate the model’s functionality both globally and locally. By intuitively assigning weights to each modeled gene, we enabled personalized predictions based on specific input values, demonstrating the feasibility of advancing this model for clinical application.
We subsequently identified four key genes (IL32, EML4, FXYD5, and TTC39C) through single-cell sequencing and multiple ML algorithms. All four genes were significantly upregulated in GC tissues and correlated with poor prognosis. As an immune cytokine, IL32 secreted by cancer-associated fibroblasts (CAFs) promotes tumor invasion and metastasis via the integrin β3-p38MAPK signaling pathway [47]. EML4, a microtubule-associated protein, plays a crucial role in maintaining cytoskeletal stability and facilitating chromosome congression during mitosis. Its fusion with ALK results in the EML4-ALK oncogenic fusion gene, which constitutively activates downstream signaling pathways such as MAPK and PI3K/AKT, thereby promoting tumor initiation and progression [48]. Furthermore, emerging evidence indicates that EML4-ALK undergoes liquid–liquid phase separation via aromatic residues within its EML4 domain, leading to the formation of biomolecular condensates that enhance the activation of downstream effectors including STAT3 and drive lung tumorigenesis [49]. Notably, the EML4-ALK fusion has also been identified in gastrointestinal signet ring cell carcinoma [50]. Collectively, our data demonstrate that EML4 is highly expressed in GC tumor tissues and is significantly associated with patient prognosis. Given the established role of EML4-ALK in driving tumorigenesis through constitutive signaling and phase separation, we propose that EML4 may function through analogous oncogenic mechanisms to promote GC development and progression. FXYD5, identified in our study as a causal risk factor for GC, is a known activator of the TGF-β/SMAD signaling pathway and a driver of EMT, thereby promoting GC progression [51]. Beyond its pro-tumorigenic role, emerging evidence positions FXYD5 as a modulator of the tumor immune microenvironment, implicated in immune evasion and resistance to immunotherapy [52].
In line with this, our Western blot analysis of patient tissues revealed a compelling positive correlation between FXYD5 protein levels, the infiltration of DN T cells, and PD-L1 expression. This triad of associations suggests a novel, concerted role for FXYD5 in shaping an immunosuppressive niche. We hypothesize that FXYD5, potentially through its interaction with or influence on DN T cells, acts as a functional transducer that upregulates PD-L1, thereby equipping tumor cells to evade immune surveillance. Elucidating this FXYD5/PD-L1/DN T cell axis provides a rationale for future research and could unveil new therapeutic vulnerabilities in GC. TTC39C mainly mediates protein-protein interactions and is involved in the progression of various tumors [53]. These findings provide a theoretical basis for considering these model genes as potential preventive or therapeutic targets for GC. Building on this, we conducted a virtual screening of drugs using the characterized proteins as targets, ranking 5,903 small molecule compounds based on binding energy. We identified Eltrombopag, Antrafenine, ZINC000014880001, and Ledipasvir as compounds with the highest affinity for EML4, IL32, FXYD5, and TTC39C, respectively. MD simulations elucidated the interactions between drugs and target proteins from various perspectives. Eltrombopag is primarily used for treating thrombocytopenia; however, recent studies have found it can effectively inhibit breast cancer metastasis by targeting HuR proteins and disrupting the tumor microvasculature [54, 55]. While Ledipasvir is mainly used for chronic hepatitis C virus (HCV) infection, it has also been shown to down-regulate the Src-EPHA2-Akt pathway in colorectal and triple-negative breast cancers [56]. Antrafenine, a non-steroidal anti-inflammatory drug (NSAID), and ZINC000014880001 have not yet been extensively explored in the context of tumor treatment. These findings may aid in the development of clinical drugs for GC.
However, our study has some limitations. The lack of a unified expert consensus to guide the selection of features for the prediction model leaves ambiguity regarding the optimal number of clinical features to include. Future studies should incorporate additional clinical risk factors and more external validation cohorts to further optimize the model. Additionally, further experiments are warranted to explore the specific regulatory mechanisms by which DN T cells and model genes influence GC development. Moreover, the targeted drugs identified through virtual screening require further experimental validation to confirm their preventive or therapeutic effects on GC.

Conclusions

In this study, we established a causal link between DN T cells and GC, and developed a diagnostic model based on four signature genes that shows high predictive accuracy. Experimental validation confirmed the clinical relevance of the four model genes, and virtual screening identified potential therapeutic agents. This study provides a new strategy for early diagnosis and targeted treatment of GC.

Supplementary Information

Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.
