A targeted DNA methylation method for detecting gastrointestinal cancer in circulating cell-free DNA.
APA
Jiang Z, Guo Y, et al. (2026). A targeted DNA methylation method for detecting gastrointestinal cancer in circulating cell-free DNA. iScience, 29(1), 114342. https://doi.org/10.1016/j.isci.2025.114342
MLA
Jiang Z, et al. "A targeted DNA methylation method for detecting gastrointestinal cancer in circulating cell-free DNA." iScience, vol. 29, no. 1, 2026, p. 114342.
PMID
41509908
Abstract
Colorectal cancer (CRC) and gastric cancer (GC) are leading causes of cancer-related mortality. However, cost-effective methods for simultaneous detection of CRC and GC in circulating cell-free DNA (cfDNA) remain insufficiently explored. To address this, we developed targeted methylated CpG tandem amplification and sequencing (tMCTA-seq), a PCR-based method utilizing a set of locus-specific primers with a universal CGCGCGG primer, and targeted a panel of 110 loci. The method demonstrated high technical sensitivity below one haploid genome equivalent. Using a repeated nested cross-validation framework, the ensemble model, applied to 448 plasma samples (170 CRC, 101 GC, and 177 control participants), achieved areas under the curve (AUCs) of 0.928 (88.2% sensitivity and 90.7% specificity) for CRC and 0.926 (86.7% sensitivity and 94.4% specificity) for GC on the test set. Furthermore, tMCTA-seq differentiated between CRC and GC (AUC = 0.819). Thus, tMCTA-seq is a cost-effective, methylation-based approach for simultaneous detection of and differentiation between two major gastrointestinal cancers in blood.
Introduction
Colorectal cancer (CRC) and gastric cancer (GC) are among the top five most common cancers and leading causes of cancer deaths worldwide.1 Endoscopy is considered the gold standard for CRC and GC diagnosis and has been recommended for CRC screening, as well as GC screening in Asian countries with a high incidence of GC.2,3,4 However, endoscopy has limitations, including invasiveness, which leads to poor patient compliance. Blood-based methods, including DNA methylation analysis of circulating tumor DNA (ctDNA), have demonstrated potential in facilitating CRC and GC detection.5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
DNA methylation methods for analyzing ctDNA include PCR-based techniques that target individual markers, which are effective for detecting a particular type of cancer, as well as sequencing-based strategies that provide comprehensive insights into cancer heterogeneity, beneficial for broad-spectrum cancer screening. Previous studies have reported numerous PCR-based methylation assays for detecting CRC in blood. The methylated SEPT9-based assay was the first developed clinical assay.5 Several multi-marker methylation assays have also been reported, including those based on BCAT1/IKZF1, C9orf50/KCNQ5/CLIP4, IRF4/IKZF1/BCAT1, and SEPT9/BCAT1/IKZF1/BCAN/VAV3.7,9,13,14 For GC detection, methylation assays such as the RNF180/SEPT9-based assay have been reported.8,17 Sequencing-based, genome-scale DNA methylation approaches include whole-genome, semi-targeted, and targeted methods, which generally examine over 100,000 loci.10,11,15,16,18,21,23,24,25,26 Recently, several prospective cohort studies on ctDNA methylation-based CRC screening have been reported.6,12,19,20 However, there is still a need for a cost-effective method for simultaneous detection of CRC and GC, particularly in high-GC-risk countries like China.4
We previously developed the methylated CpG tandem amplification and sequencing (MCTA-seq) method, which examines approximately 20,000 CGCGCGG sites and 9,000 CpG islands (CGIs), for semi-targeted, genome-scale detection of hypermethylated CGIs in circulating cell-free DNA (cfDNA).24 We have used this method for screening well-performing markers for detecting gastrointestinal cancers.27,28 In this study, we established a targeted version of this assay, tMCTA-seq, and employed it to evaluate a panel of 110 mCGCGCGG-CpG markers for the simultaneous detection and differentiation of CRC and GC in blood, analyzing a total of 448 plasma samples from patients with CRC or GC and control participants without cancer.
Results
Development of the tMCTA-seq method
The MCTA-seq method comprises three steps following bisulfite treatment: random priming, CGCGCGG primer extension, and adapter primer amplification24 (Figure S1A). To develop a targeted MCTA-seq, we introduced a multiplex targeted PCR amplification step prior to the CGCGCGG primer extension, employing a set of locus-specific primers that target genomic regions upstream of selected CGCGCGG sites (Figure 1A). This created a single-sided, multiplex-nested PCR. Unlike traditional methods, this reaction employed a single “inner” CGCGCGG primer, which simplified primer design. Pre-amplifying the targeted CGCGCGG regions also increased the capture efficiency of the CGCGCGG primer due to an increased number of templates. Furthermore, we introduced a dUTP strategy to reduce the number of primer dimers and polymers (Figure S1B). dUTP, instead of dTTP, was used during the random priming step, with the random primers also containing two dUTPs. Following bead purification, the multiplex PCR was switched to dTTP, using a dUTP-tolerant polymerase for amplification. This strategy enabled the use of uracil-DNA glycosylase (UDG) to selectively degrade the dUTP-incorporated primer dimers and non-targeted templates, while preserving the dTTP-incorporated targeted products for further processing.
We found that two rounds of random priming enhanced the recovery of original target molecules by nearly 2-fold compared with one round, while three rounds yielded less increase; therefore, we chose two rounds for tMCTA-seq (Figure S2A). The multiplex PCR, as well as the increase in random priming rounds, greatly increased molecule number, and thus allowed for bead purification prior to the CGCGCGG capture that greatly reduced the primer dimers. The dUTP-dTTP-UDG strategy further reduced the number of primer dimers and polymers and increased the cleaned read ratio (Figure S2B). Finally, a dimer-free library was obtained, with the gel purification step in MCTA-seq being omitted (Figure S3).
On the basis of our previous studies,27,28 we established a panel of 110 mCGCGCGG-CpG sites (including 72 CRC markers and 68 GC markers, with 30 loci overlapping between the two cancer types) and designed primers to target these sites. Detailed information on these markers, including their MCTA-seq methylation values in CRC and GC tumor tissues and adjacent normal tissues, is provided in Table S1. To determine the capture efficiency and the limit of detection (LOD) of tMCTA-seq, we performed a dilution experiment in which fully methylated human genomic DNA (FMG) was serially diluted (2-fold) from 150 pg down to 2.3 pg and spiked into 6 ng of white blood cell (WBC) genomic DNA (gDNA). The average molecule capture efficiency for the 110 mCGCGCGG-CpGs was approximately 23%, and the LOD reached 2.3 pg, the lowest input tested, which is less than the equivalent of one haploid genome (Figure 1B). The average molecule capture efficiency of tMCTA-seq was approximately 2.5-fold that of MCTA-seq (average 1,128 and 449 unique molecular identifier [UMI] sequences for tMCTA-seq and MCTA-seq, respectively; Figure 1C; Table S2). The average sequencing depth of tMCTA-seq was 81 reads per UMI sequence compared with 3 reads per UMI sequence for MCTA-seq, a 27-fold increase (Figure 1D). After normalizing the total sequenced reads, we found that tMCTA-seq enriched the targeted mCGCGCGG-CpGs by approximately 100-fold compared with MCTA-seq.
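The dilution series and the comparison against one haploid genome can be checked with a short calculation. This is a minimal sketch: the figure of roughly 3.3 pg per haploid human genome is a commonly cited approximation assumed here, not a value reported in this study, and the function names are illustrative.

```python
# Assumption: one haploid human genome weighs roughly 3.3 pg (a commonly
# cited approximation, not a value reported in this study).
HAPLOID_GENOME_PG = 3.3


def dilution_series(start_pg, steps, fold=2.0):
    """Return the input masses of a serial dilution, highest first."""
    return [start_pg / fold**i for i in range(steps)]


def haploid_equivalents(mass_pg):
    """Convert a DNA mass in pg to haploid genome equivalents."""
    return mass_pg / HAPLOID_GENOME_PG


# Seven points of a 2-fold series take 150 pg down to about 2.3 pg.
series = dilution_series(150.0, steps=7)
for mass in series:
    print(f"{mass:8.2f} pg ~ {haploid_equivalents(mass):6.2f} haploid genomes")

# Fold changes quoted in the text, recomputed from the reported averages.
umi_fold = 1128 / 449   # unique molecules, tMCTA-seq vs. MCTA-seq
depth_fold = 81 / 3     # reads per UMI, tMCTA-seq vs. MCTA-seq
print(f"UMI fold: {umi_fold:.2f}, depth fold: {depth_fold:.0f}")
```

Under the assumed genome mass, the lowest input in the series corresponds to less than one haploid genome equivalent, in line with the statement in the text.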
The 68 GC markers exhibited higher overall methylation levels in MLH1-methylated GC tissues compared with those in MLH1-unmethylated ones, consistent with our previous findings28 (Figure S4; Table S1). In contrast, the 72 CRC markers showed similar total methylation levels between MLH1-methylated and unmethylated CRC tissues. These results suggest that the panel has relatively high sensitivity for MLH1-methylated GC tumors.
We compared tMCTA-seq with other methods, including qPCR and hybridization capture-based sequencing, in terms of cost and turnaround time. This comparison showed that tMCTA-seq is more cost-effective than the hybridization capture-based sequencing method, with the cost and turnaround time efficiency approaching those of the qPCR method (Table S3).
Together, we developed tMCTA-seq, a targeted DNA methylation method capable of simultaneously detecting over 100 methylation markers, with an average capture efficiency of 23% and an LOD less than the equivalent of one haploid genome.
Detecting CRC in blood
Subsequently, we applied the 110-marker panel tMCTA-seq to plasma samples from CRC patients (n = 170), GC patients (n = 101), and control participants (n = 177) (Tables S4 and S5). For both CRC and GC, cancer patients and controls were randomly split into a training set (70%) for model development and a hold-out test set (30%) for independent evaluation (Figure S5). We quantified the total UMI counts of the 72 mCGCGCGG-CpG CRC markers, and the values for both early-stage and advanced-stage CRC patients in the training set were significantly higher than those for the controls (median UMI counts of 15.5, 57, and 40 for stage I, II, and III CRC patients, respectively, compared with 3 for the controls; p < 0.0001, two-tailed Mann-Whitney-Wilcoxon [MWW] test); a consistent trend was observed in the test set (Figure S6A; Table S5).
We investigated seven base models: three variants of logistic regression (L1-regularized LASSO, L2-regularized ridge, and ElasticNet), two tree-based algorithms (random forest and gradient boosting), a support vector machine (SVM), and a marker count approach that used a single total count value of the 72 markers. We also developed an ensemble model integrating all seven base models (see STAR Methods). To assess model performance and stability, we implemented a repeated nested cross-validation procedure on the training set. This process was repeated 200 times; in each repetition, the data were partitioned into an 80% internal training subset and a 20% internal test subset. Model development, including feature selection and optimization via 5-fold cross-validation, was performed exclusively on the 80% internal training subset. The resulting models were then evaluated on the unseen 20% internal test subset to obtain robust performance estimates (Figure S5). Learning curves derived from this analysis showed that model performance plateaued as the sample size increased (Figure S7).
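The structure of the repeated 80%/20% procedure with an inner 5-fold loop can be sketched as follows. This is an illustration under stated assumptions rather than the study's pipeline: the "model" is reduced to a single cutoff on a total marker count (loosely mirroring the marker count approach), the cutoff is chosen by Youden's J, the data are synthetic, and all function names are hypothetical.

```python
import random
from statistics import mean


def auc(pos, neg):
    """AUC as the chance that a case outscores a control (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(samples, threshold):
    """Sensitivity and specificity of the rule 'count >= threshold'."""
    pos = [s for s, y in samples if y == 1]
    neg = [s for s, y in samples if y == 0]
    sens = sum(s >= threshold for s in pos) / len(pos)
    spec = sum(s < threshold for s in neg) / len(neg)
    return sens, spec


def stratified_split(samples, test_fraction, rng):
    """Hold out test_fraction of cases and of controls separately."""
    train, test = [], []
    for label in (0, 1):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        k = round(len(group) * test_fraction)
        test += group[:k]
        train += group[k:]
    return train, test


def pick_threshold(train, n_folds, rng):
    """Inner loop: cutoff with the best mean Youden's J over stratified folds."""
    folds = [[] for _ in range(n_folds)]
    for label in (0, 1):
        group = [s for s in train if s[1] == label]
        rng.shuffle(group)
        for i, sample in enumerate(group):
            folds[i % n_folds].append(sample)
    candidates = sorted({s for s, _ in train})
    return max(candidates,
               key=lambda t: mean(sum(sens_spec(f, t)) - 1 for f in folds))


def repeated_validation(samples, n_repeats=200, seed=0):
    """Outer loop: repeat the 80/20 split and score the held-out 20%."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        inner_train, inner_test = stratified_split(samples, 0.2, rng)
        threshold = pick_threshold(inner_train, 5, rng)
        sens, spec = sens_spec(inner_test, threshold)
        pos = [s for s, y in inner_test if y == 1]
        neg = [s for s, y in inner_test if y == 0]
        results.append({"auc": auc(pos, neg), "sens": sens, "spec": spec})
    return results


# Synthetic, purely illustrative counts (not study data).
data_rng = random.Random(1)
samples = ([(data_rng.randint(0, 10), 0) for _ in range(60)]
           + [(data_rng.randint(5, 60), 1) for _ in range(60)])
results = repeated_validation(samples, n_repeats=20)
print(f"mean AUC over repeats: {mean(r['auc'] for r in results):.3f}")
```

The point of the design is that the cutoff is always chosen without looking at the held-out 20%, so the spread of the per-repeat metrics reflects how stable the whole procedure is, not one lucky split.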
The results demonstrated that the ensemble model achieved the best performance, with an area under the curve (AUC) of 0.948 (±0.029 SD, 95% confidence interval [CI]: 0.890–0.993), which was the highest among all models (Figures S6B and S6C). The stage-specific AUCs were 0.900, 0.973, and 0.958 for stages I, II, and III, respectively, each of which ranked among the highest across all models (Figure 2A). The model demonstrated a sensitivity of 83.2% (95% CI: 66.7%–95.8%) and a specificity of 92.9% (95% CI: 68.0%–100%), corresponding to the upper-right point on the plot (Figure 2B). The stage-specific sensitivities, learning curve analysis, and Brier scores also supported the superior performance of the ensemble model (Figures S6D and S7; Table S4).
Having established the parameters and the classification threshold at 0.34 of the ensemble model on the training set (see STAR Methods, Table S4), we applied it to the hold-out test set. The model achieved AUCs of 0.796, 0.986, and 0.971 for stage I, II, and III patients, respectively, and 0.928 for stages I–III combined (Figures 2C and 2D). It achieved an overall sensitivity of 88.2% (45/51) and a specificity of 90.7% (49/54), with stage-specific sensitivities of 71.4% (10/14) for stage I, 93.8% (15/16) for stage II, and 95.2% (20/21) for stage III CRC patients (Figure 2E). Prediction scores for each patient and control individual are provided in Table S5.
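The overall figures follow directly from the stage-level counts, which a few lines of arithmetic can confirm. The sketch below uses only the counts quoted in the paragraph above; the helper name `pct` is illustrative.

```python
def pct(numerator, denominator):
    """Proportion as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Counts reported for the CRC hold-out test set: (detected, total) per stage.
detected_by_stage = {"I": (10, 14), "II": (15, 16), "III": (20, 21)}
controls_negative, controls_total = 49, 54

true_pos = sum(d for d, _ in detected_by_stage.values())
cases = sum(n for _, n in detected_by_stage.values())

for stage, (d, n) in detected_by_stage.items():
    print(f"stage {stage}: {d}/{n} = {pct(d, n)}%")
print(f"overall sensitivity: {true_pos}/{cases} = {pct(true_pos, cases)}%")
print(f"specificity: {controls_negative}/{controls_total} = "
      f"{pct(controls_negative, controls_total)}%")
```

With only 14 stage I patients in the test set, a single additional detected or missed case shifts the stage I sensitivity by about 7 percentage points, which is worth keeping in mind when reading the stage-specific values.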
We also conducted a comparative analysis of a 14-marker sub-panel selected from the 72-marker panel, which included six well-established CRC markers (SEPT9, C9orf50, KCNQ5, CLIP4, IRF4, and IKZF1). The 14-marker sub-panel showed inferior performance to the 72-marker panel, highlighting the effectiveness of our full panel (Figure S8).
Detecting GC in blood
Next, we examined the performance of tMCTA-seq for detecting GC. The total UMI counts of the 68 mCGCGCGG-CpG GC markers were significantly elevated in GC at both early and advanced stages compared with controls in the training set (median counts of 10, 18, and 39 for stage I, II, and III GC, respectively, versus 4 for controls; p < 0.0001, two-tailed MWW test); a consistent trend was observed in the test set (Figure S9A; Table S5). Following a similar approach to the CRC analysis, we investigated eight candidate models (the seven base models and the ensemble model) using the repeated nested cross-validation framework on the training set (Figure S5). The results demonstrated that the ensemble model again exhibited the best performance, achieving an AUC of 0.884 (±0.056 SD, 95% CI: 0.760–0.974) for stages I–III combined, the highest among all models (Figure S9B). The stage-specific AUCs were 0.799, 0.887, and 0.939 for stages I, II, and III, respectively, consistently outperforming the other models (Figure 3A). The model achieved a sensitivity of 75.3% (95% CI: 49.8%–92.9%) and a specificity of 86.3% (95% CI: 68.0%–100%), corresponding to the upper-right point on the plot (Figure 3B). The stage-specific sensitivities, Brier score, and learning curve analysis also supported the superior performance of the ensemble model (Figures S7C, S7D, S9C, and S9D).
Having established the parameters and the classification threshold at 0.38 of the ensemble model on the training set (see STAR Methods, Table S4), we applied it to the hold-out test set. The model achieved AUCs of 0.822, 0.942, and 0.991 for stage I, II, and III patients, respectively, and 0.926 for stages I–III combined (Figures 3C and 3D). It achieved an overall sensitivity of 86.7% (26/30) and a specificity of 94.4% (51/54), with stage-specific sensitivities of 66.7% (6/9) for stage I, 88.9% (8/9) for stage II, and 100% (12/12) for stage III GC patients (Figure 3E). Prediction scores for each patient and control individual are provided in Table S5.
We also evaluated the performance of SEPT9, which has been proposed as a ctDNA marker for GC. Analysis of two SEPT9 CGCGCGG loci in our panel yielded an overall sensitivity of 46.7% (14/30) for GC detection, with stage-specific sensitivities of 33.3% (3/9), 33.3% (3/9), and 66.7% (8/12) for stages I, II, and III, respectively, and a specificity of 88.9% (48/54). The performance was inferior to that of the 68-marker panel, supporting the effectiveness of our full panel.
Comparison with serological biomarkers
To further evaluate the clinical utility of tMCTA-seq, we compared its performance with commonly used serological biomarkers in clinical practice, including carbohydrate antigen 125 (CA125), carbohydrate antigen 72-4 (CA72-4), carbohydrate antigen 19-9 (CA19-9), and carcinoembryonic antigen (CEA) in the hold-out test sets (Figure 4A; Table S5).
For CRC detection, CEA (with a threshold of 5 ng/mL) showed sensitivities of 21.4% (3/14), 50.0% (8/16), and 23.8% (5/21) for stages I, II, and III, respectively, resulting in an overall sensitivity of 31.3% (16/51; Figure 4B). Even when all four serological biomarkers were combined (considering a positive result if any one marker was abnormal), the sensitivities were only 35.7% (5/14), 50.0% (8/16), and 52.3% (11/21) for stages I, II, and III, respectively, with an overall sensitivity of 47.1% (24/51; Figure 4B). In contrast, tMCTA-seq detected 88.6% (31/35) of CEA-negative CRC patients and 87.5% (14/16) of CEA-positive CRC patients (Figure 4B). Among CA19-9-positive CRC patients (n = 11), 90.9% (10/11) were detected by tMCTA-seq, while among CA19-9-negative CRC patients (n = 39), 87.2% (34/39) were detected. When combining tMCTA-seq with CEA, the sensitivities increased to 78.6% (11/14) and 100% (16/16) for stages I and II, respectively, with an overall sensitivity of 92.2% (47/51).
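The "positive if either test is positive" rule used when combining tMCTA-seq with CEA can be written as a one-line predicate. This is a sketch: the 0.34 score cutoff and the 5 ng/mL CEA threshold come from the text, while treating CEA strictly above 5 ng/mL as abnormal is an assumption, and the patient values are invented for illustration.

```python
TMCTA_CRC_THRESHOLD = 0.34   # ensemble score cutoff reported for CRC
CEA_THRESHOLD_NG_ML = 5.0    # CEA cutoff reported in the text


def combined_positive(tmcta_score, cea_ng_ml):
    """Call a sample positive if either test is positive.

    Assumption: CEA strictly above the cutoff counts as abnormal.
    """
    return (tmcta_score >= TMCTA_CRC_THRESHOLD
            or cea_ng_ml > CEA_THRESHOLD_NG_ML)


# Hypothetical patients: (tMCTA-seq CRC score, CEA in ng/mL).
patients = {"A": (0.80, 2.1), "B": (0.10, 9.4), "C": (0.20, 3.0)}
calls = {name: combined_positive(*values) for name, values in patients.items()}
print(calls)
```

An either-positive rule can only add detections relative to each test alone, so sensitivity cannot fall; the trade-off is that false positives from both tests accumulate, so specificity can only stay the same or drop.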
For GC detection, CEA showed sensitivities of 33.3% (3/9), 11.1% (1/9), and 33.3% (4/12) for stages I, II, and III, respectively, with an overall sensitivity of 26.7% (8/30; Figure 4C). When all four serological biomarkers were combined, the sensitivities were 33.3% (3/9), 11.1% (1/9), and 50.0% (6/12) for stages I, II, and III, respectively, with an overall sensitivity of 33.3% (10/30; Figure 4C). In comparison, tMCTA-seq detected 86.4% (19/22) of CEA-negative GC patients and 87.5% (7/8) of CEA-positive GC patients. tMCTA-seq detected all three CA19-9-positive GC patients (100%, 3/3) and 85.2% (23/27) of CA19-9-negative patients (Figure 4C). When combining tMCTA-seq with CEA, the sensitivities increased to 77.8% (7/9) for stage I, with an overall sensitivity of 90.0% (27/30).
Together, these results demonstrated that tMCTA-seq outperforms traditional serological biomarkers in detecting both CRC and GC.
Differentiation between CRC and GC in blood
Next, we explored the ability of tMCTA-seq to distinguish between CRC and GC in these samples. The 110-marker panel included 46 mCGCGCGG-CpGs as discriminative methylated markers between CRC and GC, comprising 22 CRC-specific markers (CRC versus GC) and 24 GC-specific markers (GC versus CRC; Table S1). For patients who were positive for both CRC and GC prediction scores (CRC prediction score ≥ 0.34, and GC prediction score ≥ 0.38; n = 63), we calculated the ratio of the total UMI counts of the CRC-specific markers to those of the GC-specific markers. The results showed that this ratio was significantly higher in CRC patients than in GC patients (Figure 5A). It achieved a predictive AUC value of 0.819 (95% CI: 0.707–0.930; Figure 5B). Using a log ratio value of 0 as the cutoff, we developed a classifier that predicted GC with high accuracy. The prediction accuracy for stage I, II, and III GC was 100% (5/5), 85.7% (6/7), and 90.9% (10/11), respectively, while the prediction accuracy for stage I, II, and III CRC was 66.7% (4/6), 80.0% (12/15), and 58.8% (10/17), respectively.
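The ratio-based call described above can be sketched in a few lines. Several details are assumptions made for illustration and are not stated in the text: the logarithm base (base 2 here), the pseudocount that guards against zero counts, and assigning a log ratio of exactly 0 to GC. The UMI totals are invented.

```python
import math


def log_ratio(crc_specific_umis, gc_specific_umis, pseudocount=1.0):
    """log2 of CRC-specific over GC-specific UMI totals.

    Assumption: a pseudocount is added to both totals to avoid division
    by zero; the study's handling of zero counts is not described here.
    """
    return math.log2((crc_specific_umis + pseudocount)
                     / (gc_specific_umis + pseudocount))


def classify(crc_specific_umis, gc_specific_umis, cutoff=0.0):
    """Call CRC when the log ratio exceeds the cutoff, otherwise GC."""
    ratio = log_ratio(crc_specific_umis, gc_specific_umis)
    return "CRC" if ratio > cutoff else "GC"


# Hypothetical UMI totals for two double-positive patients.
print(classify(20, 5))   # more CRC-specific signal
print(classify(3, 12))   # more GC-specific signal
```

A cutoff of 0 on the log scale simply asks which of the two marker groups contributes more molecules, which keeps the rule independent of the overall amount of tumor DNA in the sample.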
We evaluated strategies for combined CRC and GC detection with or without differentiating between them. Since blood-based CRC screening has been clinically approved, we established a CRC-prioritized approach for the differentiated strategy. All individuals who tested positive for CRC were recommended colonoscopy. Among them, those who also tested positive for GC and were subsequently classified as GC by the discriminative markers were advised to undergo a combined upper endoscopy and colonoscopy (Table S6). We also evaluated an undifferentiated strategy, for which all individuals testing positive for both CRC and GC markers were advised to undergo a combined upper endoscopy and colonoscopy (Table S6). We simulated a screening in 100,000 individuals, assuming a prevalence of 0.5% for CRC and 0.35% for GC in the target screening population4,20 (Table S7). A comparative analysis showed that the positive predictive value (PPV) increased from 4.2% in the undifferentiated strategy to 5.6% in the differentiated strategy (Figure 5C). However, the undifferentiated strategy showed higher sensitivity for GC detection (86.7%; 26/30 versus 80.0%; 24/30) (Figure 5D; Table S7). For both strategies, the sensitivity and PPV for CRC detection were 88.2% (45/51) and 7.1%, respectively, which were the same as those for the CRC-only detection strategy. It was noteworthy that a CRC-only approach would miss all GC patients, whereas the combined CRC and GC detection approach identifies a substantial proportion of them (Table S7).
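The dependence of PPV on prevalence can be illustrated with the standard relation between sensitivity, specificity, and prevalence. This is a generic sketch, not a reconstruction of the simulation in Table S7: the population size and 0.5% prevalence match the text, but the sensitivity and specificity below are hypothetical round numbers, so the output is not expected to match the reported PPVs.

```python
def screening_outcome(population, prevalence, sensitivity, specificity):
    """Expected true and false positives and PPV for a single test."""
    cases = population * prevalence
    non_cases = population - cases
    true_pos = cases * sensitivity
    false_pos = non_cases * (1 - specificity)
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "ppv": true_pos / (true_pos + false_pos),
    }


# Hypothetical test characteristics (not the study's values).
outcome = screening_outcome(
    population=100_000,
    prevalence=0.005,     # 0.5%, as assumed for CRC in the text
    sensitivity=0.90,     # hypothetical
    specificity=0.95,     # hypothetical
)
print(f"true positives:  {outcome['true_pos']:.0f}")
print(f"false positives: {outcome['false_pos']:.0f}")
print(f"PPV: {100 * outcome['ppv']:.1f}%")
```

At low prevalence, false positives from the large cancer-free majority dominate the positive calls, which is why single-digit PPVs arise even with high sensitivity and specificity, and why a strategy that reduces unnecessary referrals raises PPV.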
Together, our results demonstrated that tMCTA-seq can effectively distinguish between CRC and GC in blood. Combined CRC and GC detection identified GC patients who would be missed in a CRC-only approach.
Development of the tMCTA-seq method
The MCTA-seq method comprises three steps following bisulfite treatment: random priming, CGCGCGG primer extension, and adapter primer amplification24 (Figure S1A). To develop a targeted MCTA-seq, we introduced a multiplex targeted PCR amplification step prior to the CGCGCGG primer extension, employing a set of locus-specific primers that target genomic regions upstream of selected CGCGCGG sites (Figure 1A). This created a single-sided, multiplex-nested PCR. Unlike traditional methods, this reaction employed a single “inner” CGCGCGG primer, which simplified primer design. Pre-amplifying the targeted CGCGCGG regions also increased the capture efficiency of the CGCGCGG primer due to an increased number of templates. Furthermore, we introduced a dUTP strategy to reduce the number of primer dimers and polymers (Figure S1B). dUTP, instead of dTTP, was used during the random priming step, with the random primers also containing two dUTPs. Following bead purification, the multiplex PCR was switched to dTTP, using a dUTP-tolerant polymerase for amplification. This strategy enabled the use of uracil-DNA glycosylase (UDG) to selectively degrade the dUTP-incorporated primer dimers and non-targeted templates, while preserving the dTTP-incorporated targeted products for further processing.
We found that two rounds of random priming enhanced the recovery of original target molecules by nearly 2-fold compared with one round, while three rounds yielded less increase; therefore, we chose two rounds for tMCTA-seq (Figure S2A). The multiplex PCR, as well as the increase in random priming rounds, greatly increased molecule number, and thus allowed for bead purification prior to the CGCGCGG capture that greatly reduced the primer dimers. The dUTP-dTTP-UDG strategy further reduced the number of primer dimers and polymers and increased the cleaned read ratio (Figure S2B). Finally, a dimer-free library was obtained, with the gel purification step in MCTA-seq being omitted (Figure S3).
On the basis of our previous studies,27,28 we established a panel of 110 mCGCGCGG-CpG sites (including 72 CRC markers and 68 GC markers, with 30 loci overlapping between the two cancer types) and designed primers to target these sites. The detailed information of these markers, including their MCTA-seq methylation values in CRC and GC tumor tissues and adjacent normal tissues, is provided in Table S1. To determine the capture efficiency and the limit of detection (LOD) of tMCTA-seq, we performed a dilution experiment in which the fully methylated human genomic DNA (FMG) was serially diluted (2-fold) from 150 pg down to 2.3 pg and spiked into 6 ng of WBC gDNA. The average molecule capture efficiency for the 110 mCGCGCGG-CpGs was approximately 23%, with the LOD reaching its lowest point at 2.3 pg, which is less than the equivalent of one haploid genome (Figure 1B). The average molecule capture efficiency of tMCTA-seq was approximately 2.5-fold that of MCTA-seq (average 1,128 and 449 unique molecular identifier [UMI] sequences for tMCTA-seq and MCTA-seq, respectively; Figure 1C; Table S2). The average sequencing depth of tMCTA-seq was 81 reads per UMI sequence compared with 3 reads per UMI sequence for MCTA-seq, yielding a 27-fold increase (Figure 1D). After normalizing the total sequenced reads, we found that tMCTA-seq enriched the targeted mCGCGCGG-CpGs by approximately 100-fold compared with MCTA-seq.
The 68 GC markers exhibited higher overall methylation levels in MLH1-methylated GC tissues compared with those in MLH1-unmethylated ones, consistent with our previous findings28 (Figure S4; Table S1). In contrast, the 72 CRC markers showed similar total methylation levels between MLH1-methylated and unmethylated CRC tissues. These results suggest that the panel has relatively high sensitivity for MLH1-methylated GC tumors.
We compared tMCTA-seq with other methods, including qPCR and hybridization capture-based sequencing, in terms of cost and turnaround time. This comparison showed that tMCTA-seq is more cost-effective than the hybridization capture-based sequencing method, with the cost and turnaround time efficiency approaching those of the qPCR method (Table S3).
Together, we developed tMCTA-seq, a targeted DNA methylation method capable of simultaneously detecting over 100 methylation markers, with an average capture efficiency of 23% and an LOD less than the equivalent of one haploid genome.
Detecting CRC in blood
Subsequently, we applied the 110-marker panel tMCTA-seq to plasma samples from CRC patients (n = 170), GC patients (n = 101), and control participants (n = 177) (Tables S4 and S5). For both CRC and GC, cancer patients and controls were randomly split into a training set (70%) for model development and a hold-out test set (30%) for independent evaluation (Figure S5). We quantified the total UMI counts of the 72 mCGCGCGG-CpG CRC markers, and the values for both early-stage and advanced-stage CRC patients in the training set were significantly higher than those for the controls (median UMI counts of 15.5, 57, and 40 for stage I, II, and III CRC patients, respectively, compared with 3 for the controls; p < 0.0001, two-tailed Mann-Whitney-Wilcoxon [MWW] test); a consistent trend was observed in the test set (Figure S6A; Table S5).
We investigated seven base models: three variants of logistic regression (L1-regularized LASSO, L2-regularized ridge, and ElasticNet), two tree-based algorithms (random forest and gradient boosting), a support vector machine (SVM), and a marker count approach that used a single total count value of the 72 markers. We also developed an ensemble model integrating all seven base models (see STAR Methods). To assess model performance and stability, we implemented a repeated nested validation procedure on the training set. This process was repeated 200 times; in each repetition, the data were partitioned into an 80% internal training subset and a 20% internal test subset. Model development, including feature selection and optimization via 5-fold cross-validation, was performed exclusively on the 80% internal training subset. The resulting models were then evaluated on the unseen 20% internal test subset to obtain robust performance estimates (Figure S5). Learning curves derived from this analysis showed that model performance plateaued as the sample size increased (Figure S7).
The results demonstrated that the ensemble model achieved the best performance, with an area under the curve (AUC) of 0.948 (±0.029 SD, 95% confidence interval [CI]: 0.890–0.993), which was the highest among all models (Figures S6B and S6C). The stage-specific AUCs were 0.900, 0.973, and 0.958 for stages I, II, and III, respectively, each of which ranked among the highest across all models (Figure 2A). The model demonstrated a sensitivity of 83.2% (95% CI: 0.667–0.958) and specificity of 92.9% (95% CI: 0.680–1.000), corresponding to the upper-right point on the plot (Figure 2B). The stage-specific sensitivities, learning curve analysis, and Brier scores also supported the superior performance of the ensemble model (Figures S6D and S7; Table S4).
Having established the parameters and the classification threshold at 0.34 of the ensemble model on the training set (see STAR Methods, Table S4), we applied it to the hold-out test set. The model achieved AUCs of 0.796, 0.986, and 0.971 for stage I, II, and III patients, respectively, and 0.928 for stages I–III combined (Figures 2C and 2D). It achieved an overall sensitivity of 88.2% (45/51) and a specificity of 90.7% (49/54), with stage-specific sensitivities of 71.4% (10/14) for stage I, 93.8% (15/16) for stage II, and 95.2% (20/21) for stage III CRC patients (Figure 2E). Prediction scores for each patient and control individual are provided in Table S5.
We also conducted a comparative analysis of a 14-marker sub-panel selected from the 72-marker panel, which included six well-established CRC markers (SEPT9, C9orf50, KCNQ5, CLIP4, IRF4, and IKZF1). The 14-marker sub-panel showed inferior performance to the 72-marker panel, highlighting the effectiveness of our full panel (Figure S8).
Detecting GC in blood
Next, we examined the performance of tMCTA-seq for detecting GC. The total UMI counts of the 68 mCGCGCGG-CpG GC markers were significantly elevated in GC at both early and advanced stages compared with controls in the training set (median counts of 10, 18, and 39 for stage I, II, and III GC, respectively, versus 4 for controls; p < 0.0001, two-tailed MWW test); a consistent trend was observed in the test set (Figure S9A; Table S5). Following a similar approach to the CRC analysis, we investigated eight candidate models using the repeated nested cross-validation framework on the training set (Figure S5). The results demonstrated that the ensemble model again exhibited the best performance, achieving an AUC of 0.884 (±0.056 SD, 95% CI: 0.760–0.974) for stages I–III combined, the highest among all models (Figure S9B). The stage-specific AUCs were 0.799, 0.887, and 0.939 for stages I, II, and III, respectively, consistently outperforming the other models (Figure 3A). The model achieved a sensitivity of 75.3% (95% CI: 0.498–0.929), with a specificity of 86.3% (95% CI: 0.680–1.000), corresponding to the upper-right points on the plot (Figure 3B). The stage-specific sensitivities, Brier score, and learning curve analysis also supported the superior performance of the ensemble model (Figures S7C, S7D, S9C, and 9D).
Having established the parameters and the classification threshold at 0.38 of the ensemble model on the training set (see STAR Methods, Table S4), we applied it to the hold-out test set. The model achieved AUCs of 0.822, 0.942, and 0.991 for stage I, II, and III patients, respectively, and 0.926 for stages I–III combined (Figures 3C and 3D). It achieved an overall sensitivity of 86.7% (26/30) and a specificity of 94.4% (51/54), with stage-specific sensitivities of 66.7% (6/9) for stage I, 88.9% (8/9) for stage II, and 100% (12/12) for stage III GC patients (Figure 3E). Prediction scores for each patient and control individual are provided in Table S5.
We also evaluated the performance of SEPT9, which has been proposed as a ctDNA marker for GC. Analysis of two SEPT9 CGCGCGG loci in our panel yielded an overall sensitivity of 46.7% (14/30) for GC detection, with stage-specific sensitivities of 33.3% (3/9), 33.3% (3/9), and 66.7% (8/12) for stages I, II, and III, respectively, and a specificity of 88.9% (48/54). The performance was inferior to that of the 68-marker panel, supporting the effectiveness of our full panel.
Comparison with serological biomarkers
To further evaluate the clinical utility of tMCTA-seq, we compared its performance with commonly used serological biomarkers in clinical practice, including carbohydrate antigen 125 (CA125), carbohydrate antigen 72-4 (CA72-4), carbohydrate antigen 19-9 (CA19-9), and carcinoembryonic antigen (CEA) in the hold-out test sets (Figure 4A; Table S5).
For CRC detection, CEA (with a threshold of 5 ng/mL) showed sensitivities of 21.4% (3/14), 50.0% (8/16), and 23.8% (5/21) for stages I, II, and III, respectively, resulting in an overall sensitivity of 31.3% (16/51; Figure 4B). Even when all four serological biomarkers were combined (considering a positive result if any one marker was abnormal), the sensitivities were only 35.7% (5/14), 50.0% (8/16), and 52.3% (11/21) for stages I, II, and III, respectively, with an overall sensitivity of 47.1% (24/51; Figure 4B). In contrast, tMCTA-seq detected 88.6% (31/35) of CEA-negative CRC patients and 87.5% (14/16) of CEA-positive CRC patients (Figure 4B). Among CA19-9-positive CRC patients (n = 11), 90.9% (10/11) were detected by tMCTA-seq, while among CA19-9-negative CRC patients (n = 39), 87.2% (34/39) were detected. When combining tMCTA-seq with CEA, the sensitivities increased to 78.6% (11/14) and 100% (16/16) for stages I and II, respectively, with an overall sensitivity of 92.2% (47/51).
For GC detection, CEA showed sensitivities of 33.3% (3/9), 11.1% (1/9), and 33.3% (4/12) for stages I, II, and III, respectively, with an overall sensitivity of 26.7% (8/30; Figure 4C). When all four serological biomarkers were combined, the sensitivities were 33.3% (3/9), 11.1% (1/9), and 50.0% (6/12) for stages I, II, and III, respectively, with an overall sensitivity of 33.3% (10/30; Figure 4C). In comparison, tMCTA-seq detected 86.4% (19/22) of CEA-negative GC patients and 87.5% (7/8) of CEA-positive GC patients. tMCTA-seq detected all three CA19-9-positive GC patients (100%, 3/3) and 85.2% (23/27) of CA19-9-negative patients (Figure 4C). When combining tMCTA-seq with CEA, the sensitivities increased to 77.8% (7/9) for stage I, with an overall sensitivity of 90.0% (27/30).
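The "positive if any one marker is abnormal" rule used for the combined serological panel, and for combining tMCTA-seq with CEA, can be sketched as follows. This is a minimal illustration with hypothetical per-patient calls, not the study's data; the function name `combined_sensitivity` is an assumption introduced here.

```python
def combined_sensitivity(*calls_per_test):
    """Sensitivity of an OR-combination of tests among cancer patients.

    Each argument is a list of booleans (True = test positive), one entry
    per patient, in the same patient order for every test.
    """
    n = len(calls_per_test[0])
    if any(len(calls) != n for calls in calls_per_test):
        raise ValueError("all tests must cover the same patients")
    # A patient counts as detected if at least one test is positive.
    detected = [any(patient) for patient in zip(*calls_per_test)]
    return sum(detected) / n


# Hypothetical calls for five patients (illustrative only).
methylation = [True, True, False, True, False]
cea = [False, True, True, False, False]
print(combined_sensitivity(methylation, cea))
```

Because the rule is a logical OR, the combined sensitivity can never fall below that of the best single test; the cost, not modeled here, is that false positives in controls also accumulate.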
Together, these results demonstrated that tMCTA-seq outperforms traditional serological biomarkers in detecting both CRC and GC.
Differentiation between CRC and GC in blood
Next, we explored the ability of tMCTA-seq to distinguish between CRC and GC in these samples. The 110-marker panel included 46 mCGCGCGG-CpGs as discriminative methylated markers between CRC and GC, comprising 22 CRC-specific markers (CRC versus GC) and 24 GC-specific markers (GC versus CRC; Table S1). For patients who were positive for both CRC and GC prediction scores (CRC prediction score ≥ 0.34, and GC prediction score ≥ 0.38; n = 63), we calculated the ratio of the total UMI counts of the CRC-specific markers to those of the GC-specific markers. The results showed that this ratio was significantly higher in CRC patients than in GC patients (Figure 5A). It achieved a predictive AUC value of 0.819 (95% CI: 0.707–0.930; Figure 5B). Using a log ratio value of 0 as the cutoff, we developed a classifier that predicted GC with high accuracy. The prediction accuracy for stage I, II, and III GC was 100% (5/5), 85.7% (6/7), and 90.9% (10/11), respectively, while the prediction accuracy for stage I, II, and III CRC was 66.7% (4/6), 80.0% (12/15), and 58.8% (10/17), respectively.
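The ratio-based call described above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' code: the pseudocount, the log base, the handling of a log ratio exactly equal to 0, and the function name `classify_crc_vs_gc` are all assumptions, since the text only specifies a log ratio cutoff of 0.

```python
import math


def classify_crc_vs_gc(crc_umi_total, gc_umi_total, cutoff=0.0, pseudocount=1.0):
    """Label a double-positive sample from its CRC/GC marker UMI ratio.

    crc_umi_total: summed UMI counts over the CRC-specific markers.
    gc_umi_total:  summed UMI counts over the GC-specific markers.
    pseudocount is an assumption added to avoid division by zero; the
    source does not state how zero counts were handled.
    """
    log_ratio = math.log2(
        (crc_umi_total + pseudocount) / (gc_umi_total + pseudocount)
    )
    # Assumption: a log ratio exactly at the cutoff is assigned to GC.
    return "CRC" if log_ratio > cutoff else "GC"


print(classify_crc_vs_gc(40, 10))
print(classify_crc_vs_gc(5, 30))
```

With a cutoff of 0 the call depends only on which marker group contributes more UMIs, so the choice of log base does not affect the result.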
We evaluated strategies for combined CRC and GC detection with or without differentiating between them. Since blood-based CRC screening has been clinically approved, we established a CRC-prioritized approach for the differentiated strategy. All individuals who tested positive for CRC were recommended colonoscopy. Among them, those who also tested positive for GC and were subsequently classified as GC by the discriminative markers were advised to undergo a combined upper endoscopy and colonoscopy (Table S6). We also evaluated an undifferentiated strategy, for which all individuals testing positive for both CRC and GC markers were advised to undergo a combined upper endoscopy and colonoscopy (Table S6). We simulated a screening in 100,000 individuals, assuming a prevalence of 0.5% for CRC and 0.35% for GC in the target screening population4,20 (Table S7). A comparative analysis showed that the positive predictive value (PPV) increased from 4.2% in the undifferentiated strategy to 5.6% in the differentiated strategy (Figure 5C). However, the undifferentiated strategy showed higher sensitivity for GC detection (86.7%; 26/30 versus 80.0%; 24/30) (Figure 5D; Table S7). For both strategies, the sensitivity and PPV for CRC detection were 88.2% (45/51) and 7.1%, respectively, which were the same as those for the CRC-only detection strategy. It was noteworthy that a CRC-only approach would miss all GC patients, whereas the combined CRC and GC detection approach identifies a substantial proportion of them (Table S7).
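The dependence of PPV on prevalence, sensitivity, and specificity in a simulated screening population comes from counting expected true and false positives. The sketch below shows that arithmetic with deliberately hypothetical inputs. It is not the authors' simulation, and Table S7 may rest on additional assumptions, so its output should not be compared with the values reported above. The function name `screening_ppv` is an assumption.

```python
def screening_ppv(population, prevalence, sensitivity, specificity):
    """Expected PPV and positive counts for a single-disease screen."""
    cases = population * prevalence
    non_cases = population - cases
    true_pos = cases * sensitivity             # cases correctly flagged
    false_pos = non_cases * (1 - specificity)  # non-cases wrongly flagged
    ppv = true_pos / (true_pos + false_pos)
    return ppv, true_pos, false_pos


# Hypothetical inputs chosen for illustration, not taken from the study.
ppv, tp, fp = screening_ppv(
    100_000, prevalence=0.01, sensitivity=0.90, specificity=0.95
)
print(f"true positives={tp:.0f}, false positives={fp:.0f}, PPV={ppv:.1%}")
```

At low prevalence the false-positive term dominates the denominator, which is why restricting who is referred for a second procedure can raise PPV.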
Together, our results demonstrated that tMCTA-seq can effectively distinguish between CRC and GC in blood. Combined CRC and GC detection identified GC patients who would be missed in a CRC-only approach.
Discussion
In this study, we developed a targeted DNA methylation method, tMCTA-seq, and demonstrated its effectiveness for the detection and differentiation of CRC and GC in blood.
First, tMCTA-seq is a PCR-based targeted DNA methylation method with a unique design. Hybridization capture has been utilized to target numerous DNA methylation loci for ctDNA detection, with a notable example being the Galleri test, which captures over 100,000 regions for pan-cancer screening.11 However, PCR-based methods can be simpler, faster, and cheaper than hybridization capture-based methods when targeting a small to moderate number of loci29,30 (Table S3). Compared with conventional PCR-based targeted methods that require two sets of locus-specific primers, tMCTA-seq employs a single set of locus-specific primers in conjunction with a universal nested CGCGCGG primer. This approach simplifies the PCR assay and allows the incorporation of a UMI for accurate quantitative analysis. It also facilitates ctDNA detection, as the short CGCGCGG sequence aligns well with the short fragment length characteristic of ctDNA. In comparison to MCTA-seq, tMCTA-seq streamlines library preparation by omitting the gel purification step. Since tMCTA-seq targets the same mCGCGCGG-CpG sites as MCTA-seq, it allows for initial screening with MCTA-seq to identify well-performing ctDNA markers, which can then be applied to establish a cost-effective tMCTA-seq assay for clinical use.31
The technological performance of tMCTA-seq is robust. The average molecule capture efficiency of the targeted 110 mCGCGCGG is approximately 23%. Additionally, it is cost-effective, requiring only 6–7 million sequencing reads and allowing for saturated sequencing at a depth of approximately 80 reads per UMI for 6 ng of bisulfite-treated DNA. The primer design success rate was also high. For the previously identified 80 CRC mCGCGCGG-CpGs, we designed targeted primers for 74 loci (excluding 6 loci with low or high GC content), and 69 loci showed acceptable capture efficiency and were included in the final 110-marker panel.
Second, our diagnostic models, built upon tMCTA-seq data, could effectively detect both CRC and GC. On the test set, it detected stages I, II, and III of CRC with sensitivities of 71.4%, 93.8%, and 95.2%, respectively, at a specificity of 90.7%. It detected stages I, II, and III of GC with sensitivities of 66.7%, 88.9%, and 100%, respectively, at a specificity of 94.4%. For CRC detection, the results indicated that the performance of tMCTA-seq is comparable to that of the best-reported DNA methylation detection methods7,14,19,20,32,33,34 (Table S8). For GC detection, the results suggest that tMCTA-seq outperforms the methods that utilize fewer DNA methylation markers, including the recently established RNF180/SEPT9 qPCR assay validated in over 1,000 subjects17,35,36,37 (Table S9). The robust performance of tMCTA-seq may be attributable to its multi-marker design, which captures cancers’ molecular diversity, and its selection of carefully pre-screened mCGCGCGG-CpG cancer marker loci. Many of these selected loci feature high CpG density, making their detection challenging by using traditional methods. Since tMCTA-seq can detect CRC and GC simultaneously in a single assay, it should facilitate the noninvasive early detection of these cancers.
Third, compared with single-cancer or pan-cancer detection, the simultaneous detection of CRC and GC offers unique clinical benefits. Compared with blood-based CRC detection alone, combined CRC and GC detection additionally identifies GC patients in a single test. Compared to pan-cancer screening, focusing on these two cancer types, which have well-defined diagnostic and therapeutic pathways through the gastrointestinal endoscopy, helps reduce unnecessary anxiety, financial burden, and overtreatment. Particularly, combined endoscopic detection for CRC and GC is already an established clinical practice. Studies have suggested that this approach is cost-effective even in Europeans with intermediate GC risk levels.38,39 We wish to point out that, similar to blood-based CRC detection, blood-based GC detection serves as an alternative for patients who are unable or unwilling to undergo initial endoscopic screening.
We compared the combined CRC and GC detection with or without differentiation between them. The undifferentiated strategy identified more GC patients, while the differentiated strategy reduced gastroscopy burden and improved gastroscopy PPV. Considering that sensitivity is a critical factor for a gastrointestinal screening test, the undifferentiated strategy may be favorable. However, alternative strategies could be chosen for regions and populations with varying gastrointestinal cancer incidence rates. Assessing the cost-effectiveness of approaches prioritizing either PPV or sensitivity will require additional data from a larger sample cohort with input from health economics experts in the future.
In summary, we developed a targeted DNA methylation technique, tMCTA-seq, for simultaneous detection of and differentiation between CRC and GC in blood. This method offers a promising tool for noninvasive detection of gastrointestinal cancers.
Limitations of the study
This study has several limitations. First, its retrospective case-control design necessitates validation in large-scale, multi-center prospective studies of average-risk populations. Critically, future studies should include larger cohorts of pre-cancerous lesions and stage I cancer to rigorously evaluate early-detection capability and real-world specificity. Second, our non-cancer control group was not stratified by specific benign gastrointestinal conditions; future studies with such well-defined cohorts are needed to better assess assay specificity and potential false positives. Finally, the assay’s prediction accuracy was lower for CRC than for GC. This limitation was nonetheless compensated by our CRC-prioritized differentiation approach, which maintained effective CRC detection. Examining matched tissue samples from misclassified cfDNA cases can help elucidate the underlying causes of misclassification and uncover additional discriminatory markers.
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Dr. Xin Zhou (zhouxinasd@sina.cn).
Materials availability
All materials and reagents used in this study were either purchased from commercial suppliers or custom-synthesized by commercial vendors. Detailed information for all key resources, including suppliers and catalog numbers, is provided in the key resources table.
Data and code availability
• In compliance with national regulations regarding sharing of human genetic resources, the raw sequencing data must be managed under controlled access. Data are hosted at the GSA-Human database (https://ngdc.cncb.ac.cn/gsa-human/) under accession number HRA004986/PRJCA018045, which is also provided in the key resources table. Requests to access data should follow the GSA’s “Data Access Request Guidance,” which is available at https://ngdc.cncb.ac.cn/gsa-human/document. Applicants will be required to complete and sign a data access agreement. The data access committee, guided by the DAC chair (Fuchou Tang, tangfuchou@pku.edu.cn), regulates access in accordance with the institutional and national guidelines. Data are to be used solely for research purposes as approved in the data access agreement.
• The original code for the analysis is publicly available on GitHub at https://github.com/Guo-Yuqing/tMCTA-seq. It is also listed in the key resources table.
Acknowledgments
This work is supported by the Beijing Nova Program (no. 2022029), the National High Level Hospital Clinical Research Funding and Elite Medical Professionals Project of China-Japan Friendship Hospital (no. ZRJY2023-QM13), the National Natural Science Foundation of China (nos. 82003153 and 31901042), and the Beijing Natural Science Foundation (no. 7244406). We thank the open research fund of the National Center for Protein Sciences at Peking University in Beijing and the high-performance computing platform of the Center for Life Sciences (Peking University) for their support.
Author contributions
Z.J. conducted the investigation, developed methodologies, validated results, and drafted the original manuscript. Y.G. curated data, performed formal analysis, developed software, and drafted the original manuscript. Y.L. conducted the investigation, developed methodologies, validated results, and drafted the original manuscript. K.Q. curated data, performed formal analysis, developed software, and drafted the revised manuscript. X.L. curated data, developed software, and drafted the original manuscript. S.L. provided resources and drafted the original manuscript. J.R. developed methodologies and drafted the original manuscript. F.T. oversaw project administration, provided resources, and performed review and editing of the manuscript. W.F. provided resources, supervised the research, and performed review and editing of the manuscript. L.W. conceptualized the study, designed the methodology, oversaw project administration, supervised the research, and performed review and editing of the manuscript. X.Z. conceptualized the study, provided resources, supervised the research, and performed review and editing of the manuscript. All authors read and approved the final manuscript.
Declaration of interests
The authors declare no competing interests.
STAR★Methods
Key resources table
Experimental model and study participant details
The participants were enrolled from the Department of General Surgery, Peking University Third Hospital. The study was approved by the Ethics Committee of Peking University Third Hospital (IRB00006761-2016003). All participants provided informed consent for the collection of samples before inclusion in the study. In total, we collected 448 samples, including plasma samples from CRC patients (n = 170; mean age: 65 years; 55% males), GC patients (n = 101; mean age: 65 years; 74% males) and control participants (n = 177; mean age: 47 years; 56% males). All participants were Asian (Chinese). Detailed participant information can be found in Table S5.
Method details
Genomic DNA isolation and dilution experiment
Genomic DNA (gDNA) from human WBCs was extracted using the DNeasy Blood & Tissue Kit (Qiagen, 69506) according to the manufacturer’s protocol. The fully methylated human genomic DNA (FMG) was purchased from Zymo Research (D5014). For the dilution or control experiments, the FMG or WBC gDNA was sheared to 150–200 bp using the Covaris system (to mimic cfDNA). After bisulfite treatment, the Qubit ssDNA Assay Kit (Thermo Fisher, Q10212) was used to quantify the concentration of WBC gDNA and FMG. The bisulfite-treated FMG was diluted from 150 pg to 2.3 pg in a 1/2 gradient and spiked into 6 ng of bisulfite-treated WBC gDNA for subsequent tMCTA-Seq library preparation; WBC gDNA without FMG was used as the negative control. For each cfDNA experiment, a positive control (30 pg FMG spiked into 6 ng WBC) and a negative control (6 ng WBC only) were included.
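The spike-in series described above can be tabulated as a quick check. This is a minimal sketch: the number of steps (7) is inferred from halving 150 pg down to about 2.3 pg, and the figure of roughly 3.3 pg per haploid human genome is a commonly quoted approximation used here as an assumption, not a value stated in the text.

```python
HAPLOID_GENOME_PG = 3.3  # assumption: approximate mass of one haploid human genome
BACKGROUND_PG = 6000.0   # 6 ng of bisulfite-treated WBC gDNA, from the text


def dilution_series(start_pg=150.0, steps=7, factor=0.5):
    """Return the FMG input mass (pg) at each point of a serial dilution."""
    return [start_pg * factor**i for i in range(steps)]


for fmg_pg in dilution_series():
    genomes = fmg_pg / HAPLOID_GENOME_PG
    fraction = fmg_pg / (fmg_pg + BACKGROUND_PG)
    print(f"{fmg_pg:8.2f} pg  ~{genomes:5.1f} haploid genomes  {fraction:.3%} of input")
```

Under these assumptions the lowest point of the series (about 2.3 pg) corresponds to less than one haploid genome equivalent of fully methylated DNA.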
Blood sample processing and cfDNA extraction
To collect 2 mL of plasma, approximately 4 mL of peripheral blood was drawn from each participant. To remove WBCs as completely as possible, blood samples were processed with room temperature centrifugation at 1,350 g and 16,000 g, all within 6 h of collection. cfDNA was extracted from the plasma samples using the VAHTS Free-Circulating DNA Maxi Kit (Vazyme, N903), following the manufacturer’s protocol.
Library preparation
The extracted cfDNA was bisulfite-converted using the EZ DNA Methylation-Lightning Kit (Zymo Research, D5031) following the manufacturer’s protocol, and eluted in 21 μL elution buffer prior to library preparation. For the first step of semi-amplicon, the bisulfite-converted DNA was linearly amplified in a 9 μL reaction containing 1× NEBuffer 2, 0.33 μM of primer A, 2 μL of a mixture of dATP, dCTP, dGTP, and dUTP (Thermo Fisher, R0251). The primer A sequence was adapted from the MCTA-Seq primer A, with modifications including an increased sequence length to facilitate bead purification and incorporation of uracil.28 It consisted of an equimolar mixture of the following four primers at a total concentration of 5 μM: A1 (TTTCTCATTCTTCACAATACACATCTTACTTTCCCTACACGACGCTCTTCCGAUCUHHHHHHHHCGCH), A2 (TTTCTCATTCTTCACAATACACATCTTACTTTCCCTACACGACGCTCTTCCGAUCUHHHHHHHCGHCH), A3 (TTTCTCATTCTTCACAATACACATCTTACTTTCCCTACACGACGCTCTTCCGAUCUHHHHHHCGHHCH), and A4 (TTTCTCATTCTTCACAATACACATCTTACTTTCCCTACACGACGCTCTTCCGAUCUHHHHHCGHHHCH). The reaction mixture was incubated at 95°C for 1 min and then held at 4°C. After adding 5 units of Klenow fragment (exo-; NEB, M0212), the reaction was subjected to the following conditions: 4°C for 50 s, 10°C for 1 min, 20°C for 4 min, 30°C for 4 min, 37°C for 4 min, 95°C for 30 s and held at 4°C. An additional 2.5 μL of reaction mixture containing 1× NEBuffer 2, 0.1 μL of a mixture of dATP, dCTP, dGTP, and dUTP (Thermo Fisher, R0251) and 5 units of Klenow fragment (exo-; NEB, M0212) was added. The reaction was then subjected to the following conditions: 4°C for 50 s, 10°C for 1 min, 20°C for 4 min, 30°C for 4 min, 37°C for 4 min and 75°C for 20 min. A 28 μL product was obtained via 4× bead purification (Beckman Coulter, A63882). 
In the second step, the double-stranded product underwent amplification using 1× KAPA HiFi HotStart Uracil+ ReadyMix (Roche, KK2802), 0.2 μM of primer TA (5′-ACTTTCCCTACACGACGCTCT-3′) and a mixture of 110 locus-specific primers (0.1 μM each; Table S1). The reaction procedures were as follows: 98°C for 45 s, followed by 10 cycles of 98°C for 15 s, 60°C for 30 s, 72°C for 30 s and a final extension at 72°C for 1 min. Afterward, 1 unit of uracil-DNA glycosylase (UDG; Thermo Fisher, EN0362) was added, and the mixture was incubated at 37°C for 30 min to degrade the uracil bases, thereby facilitating the removal of non-targeted products. This was followed by 4× bead purification to obtain a 35.5 μL product. In the final step, the targeted mCGCGCGG-CpG loci were amplified in a 50 μL reaction containing 1× Ex Taq Buffer, 1 μL Hot Start Ex Taq (Takara, RR006A), 250 μM of each dNTP, 0.2 μM of primer B (5′-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTDDDDCGCGCGG-3′, D = A/T/G), 0.25 μM of primer C (5′-CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTGACTGGAGTTCAGACGTGTGCT-3′), and 0.25 μM of primer D (5′-AATGATACGGCGACCACCGAGATCTACACXXXXXXXXACACTCTTTCCCTACACGACGCTCTTCCGATCT-3′; XXXXXXXX in primers C and D corresponds to the Illumina index sequences). The reaction underwent the following conditions: 95°C for 3 min, 50°C for 30 s and 72°C for 1 min, followed by 19 cycles of 95°C for 30 s, 64°C for 30 s, 72°C for 1 min and a final extension at 72°C for 5 min. The amplified product was then purified via 1.2× bead purification. Sequencing was performed at 2 gigabases (Gb) per sample on the Illumina NovaSeq X Plus platform (sequenced by Novogene, PE150) to generate 150-bp paired-end reads.
Primer design
The NEB Tm calculator (https://tmcalculator.neb.com) was used to determine the annealing temperature, with the reaction conditions set for Taq DNA polymerase and primer concentration at 20 nM. For a specific CGCGCGG site, primers were designed using the following criteria: (1) GC content between 50% and 60%; (2) primer length between 20 and 28 bp; (3) annealing temperature of 60 ± 3°C; (4) the 3′ end preferably ending with CGC (corresponding to the 5′ end of the CGCGCGG sequence) while avoiding GGCGC; alternatively, the 3′ end ending with C (corresponding to the 5′ end of CGCGCGG); and (5) a distance of less than 20 bp between the 3′ end of the primer and the 5′ end of the CGCGCGG sequence.
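The sequence-level criteria above lend themselves to a simple automated check. The sketch below is an illustration only, not the authors' design script: the function name `check_primer`, the reading of "avoiding GGCGC" as "the 3′ end must not be GGCGC", and the example sequence are assumptions. The annealing temperature is not computed; it is passed in by the caller, for example after looking it up in the NEB Tm calculator.

```python
def check_primer(seq, distance_to_site, tm_celsius=None):
    """Return a dict mapping each design criterion to True or False.

    seq:              candidate primer sequence, written 5' to 3'.
    distance_to_site: bases between the primer 3' end and the CGCGCGG 5' end.
    tm_celsius:       optional annealing temperature supplied by the caller.
    """
    seq = seq.upper()
    gc_fraction = sum(base in "GC" for base in seq) / len(seq)
    checks = {
        "gc_content_50_60": 0.50 <= gc_fraction <= 0.60,
        "length_20_28": 20 <= len(seq) <= 28,
        # Minimum requirement: the 3' end is C.
        "three_prime_C": seq.endswith("C"),
        # Preferred: the 3' end is CGC but not GGCGC (assumed interpretation).
        "preferred_CGC_end": seq.endswith("CGC") and not seq.endswith("GGCGC"),
        "distance_lt_20": distance_to_site < 20,
    }
    if tm_celsius is not None:
        checks["tm_60_pm_3"] = 57.0 <= tm_celsius <= 63.0
    return checks


# Hypothetical 24-nt sequence, made up for illustration.
example = check_primer("ATTGCGTTAGGTCGATTCGAGCGC", distance_to_site=12, tm_celsius=60.5)
for criterion, passed in example.items():
    print(f"{criterion}: {passed}")
```

Returning one flag per criterion, rather than a single pass or fail, makes it easier to see why a locus was excluded, for instance because of low or high GC content.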
Quantification and statistical analysis
Data processing
The sequencing data were processed as previously described (refs. 27 and 28). In brief, we used Cutadapt (v3.4) to trim adapter sequences with the parameters ‘-a GATCGGAAGAGCACA -A GATCGGAAGAGCGTC -e 0.1’. Custom scripts were subsequently applied to remove tMCTA-Seq primer sequences. We then used fastp (v0.23.0) to filter out low-quality reads with the parameters ‘-A -q 20 -u 50 -n 5 -L 37’. To further improve the precision of the analysis, reads containing >3 unmethylated CHs (CC, CA, CT) were excluded. The clean reads were mapped to the human reference genome (hg19) using Bismark (v0.16.3), and Samtools (v1.3.1) was used to sort the aligned reads by coordinate. To remove non-specific PCR amplification products, we kept only the aligned reads containing ≥2 CpG sites within ±3 bp of the 5′ end.
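The CH filter can be illustrated with a short sketch. The authors' custom scripts are not shown, so this is only an assumed reading: a CH site is counted as a cytosine followed by A, C, or T in a read given in the orientation where unconverted cytosines appear as C. The names `count_ch` and `keep_read` are hypothetical.

```python
import re

# A cytosine followed by A, C, or T. The lookahead lets overlapping sites
# (for example the two CC sites in "CCC") be counted individually.
_CH_SITE = re.compile(r"C(?=[ACT])")

def count_ch(read):
    """Count CH dinucleotides (CC, CA, CT) in a read."""
    return len(_CH_SITE.findall(read.upper()))

def keep_read(read, max_ch=3):
    """Keep a read only if it has at most `max_ch` CH sites (>3 is excluded)."""
    return count_ch(read) <= max_ch

print(keep_read("TTGCGTTTCGATT"))  # True: every C is in a CpG context
print(keep_read("TCATCCTCTT"))     # False: four CH sites
```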
PCR duplicates were removed as follows. First, we extracted the 5-bp Unique Molecular Identifier (UMI) sequences from the 5′ end of R1 reads and retained only reads in which all five UMI bases had a quality score above 20. We then grouped and counted the aligned reads by the alignment coordinates of R1 and R2 together with the UMI sequence. When reads shared the same R1 and R2 alignment coordinates but their UMI sequences differed by a single base, we retained the reads belonging to the UMI with the highest count; when the UMI counts were identical, we retained all reads. When the R2 alignment coordinates and the UMI sequences were the same but the R1 alignment coordinates varied, we retained the reads associated with the most frequent UMI at the R1 alignment coordinates. Every UMI sequence was required to be supported by ≥2 reads to ensure the reliability of the results. The methylation value was calculated using the non-duplicated reads. Receiver operating characteristic (ROC) curves, heatmaps, boxplots, and scatterplots were all generated using custom R scripts and R packages.
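A simplified sketch of the UMI rules follows. It covers only two of the described rules (collapsing one-mismatch UMIs at identical R1/R2 coordinates, and requiring at least two supporting reads); the rule for varying R1 coordinates is omitted. The function names and the input layout, tuples of `(r1_pos, r2_pos, umi)`, are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def deduplicate(reads, min_reads=2):
    """reads: iterable of (r1_pos, r2_pos, umi) tuples for aligned read pairs.

    Returns {(r1_pos, r2_pos, umi): read_count} for the retained molecules.
    """
    counts = Counter(reads)
    by_coords = defaultdict(dict)
    for (r1, r2, umi), n in counts.items():
        by_coords[(r1, r2)][umi] = n

    kept = {}
    for (r1, r2), umis in by_coords.items():
        for umi, n in umis.items():
            # Drop a UMI when a one-mismatch neighbour at the same coordinates
            # has strictly more reads; ties keep both UMIs.
            absorbed = any(
                hamming(umi, other) == 1 and m > n for other, m in umis.items()
            )
            if not absorbed and n >= min_reads:
                kept[(r1, r2, umi)] = n
    return kept

reads = [(100, 250, "AACGT")] * 5 + [(100, 250, "AACGA")] * 2 + [(100, 250, "TTTTT")]
print(deduplicate(reads))  # {(100, 250, 'AACGT'): 5}
```

In the example, `AACGA` is treated as a sequencing-error copy of `AACGT`, and `TTTTT` is dropped because only one read supports it.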
Data partition and candidate classifier models
To build and evaluate the diagnostic models, the samples from CRC patients and control participants were partitioned into a training set (70% of samples) and a hold-out test set (30%). The GC and control cohorts were partitioned in the same way. The splits used type-stratified sampling based on the AJCC stage for cancer patients and on control status. This strategy kept the proportions of each cancer stage (Stage I, II, III) and of control participants consistent between the training and test sets, providing a basis for unbiased model evaluation. The random seed was set to 0 for data partitioning.
The training set alone was used for model comparison and training. The hold-out test set remained completely unseen during comparison and training and was used only once, for the final assessment of the selected model’s performance.
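The stratified 70/30 partition can be sketched with the standard library alone. This is an illustrative assumption about the mechanics, not the authors' code (scikit-learn's `train_test_split` with its `stratify` argument is the usual tool for this); the function name and the toy sample IDs are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(sample_ids, strata, test_fraction=0.3, seed=0):
    """Split samples so each stratum contributes ~test_fraction to the test set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample_id, stratum in zip(sample_ids, strata):
        groups[stratum].append(sample_id)

    train, test = [], []
    for stratum in sorted(groups):          # fixed order keeps the split reproducible
        members = groups[stratum][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy cohort: 20 samples in each of four strata (control, Stage I, II, III).
strata = ["control"] * 20 + ["I"] * 20 + ["II"] * 20 + ["III"] * 20
ids = [f"S{i:03d}" for i in range(len(strata))]
train_ids, test_ids = stratified_split(ids, strata)
print(len(train_ids), len(test_ids))  # 56 24
```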
A suite of machine learning models was developed and benchmarked to classify individuals as having cancer or being a healthy control, based on the UMI counts of the selected marker panels (72 markers for CRC, 68 for GC). The seven candidate base models were as follows:
1. Three variants of logistic regression: L1-regularized (LASSO), L2-regularized (Ridge), and ElasticNet-regularized, which are effective for handling high-dimensional data and performing implicit feature selection.
2. Two tree-based ensemble algorithms: Random Forest and Gradient Boosting, known for their high accuracy and robustness.
3. A Support Vector Machine (SVM) with a radial basis function (RBF) kernel.
4. A simple baseline model, termed the “Marker Count Method,” which uses the total number of detected markers as a single predictive feature.
For models sensitive to feature scaling, such as logistic regression and SVM, a ‘StandardScaler’ was integrated into a ‘scikit-learn’ pipeline. This ensured that the scaling parameters were learned only from the training portion of each cross-validation fold, preventing data leakage and yielding reliable performance estimates. The ensemble model was constructed by averaging the prediction scores of the seven base models.
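The three ideas in this subsection (scaling learned from training data only, the marker-count feature, and score averaging) can be sketched without any third-party library. The function names are hypothetical, "detected" is assumed to mean a UMI count above zero, and the scaler mirrors the usual mean and population-SD standardization rather than reproducing the authors' pipeline.

```python
from statistics import mean, pstdev

def fit_scaler(train_rows):
    """Learn per-feature mean and population SD from the training rows only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # A constant feature gets scale 1.0 so the division is always defined.
    scales = [pstdev(col) or 1.0 for col in columns]
    return means, scales

def apply_scaler(rows, means, scales):
    """Standardize rows with parameters learned elsewhere (no refitting)."""
    return [[(x - m) / s for x, m, s in zip(row, means, scales)] for row in rows]

def marker_count(umi_counts):
    """Baseline feature: number of markers with a UMI count above zero (assumed)."""
    return sum(1 for c in umi_counts if c > 0)

def ensemble_score(base_model_scores):
    """Unweighted average of the base models' prediction scores."""
    return sum(base_model_scores) / len(base_model_scores)

# Fit on the training fold, then reuse the same parameters on the validation fold.
means, scales = fit_scaler([[0, 10], [2, 10], [4, 10]])
print(apply_scaler([[2, 12]], means, scales))  # [[0.0, 2.0]]
print(marker_count([0, 3, 1, 0, 7]))           # 3
```

Fitting the scaler on the validation rows as well would let information from held-out samples leak into training, which is the failure the pipeline is meant to prevent.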
Model performance robustness and learning curve analysis
To assess the stability and generalization capability of each candidate model, we conducted a learning curve analysis using a repeated, nested cross-validation procedure within the main training set. The entire process was repeated 200 times to ensure the robustness of the results.
In each of the 200 repetitions, the main training set was first partitioned into an 80% training subset and a 20% internal test subset using type-stratified sampling. Model training and optimization occurred exclusively on the 80% training subset, while the 20% internal test subset was used for performance evaluation in that repetition. Specifically, a 5-fold cross-validation was implemented on the 80% training subset. For each fold, a model was trained on the other four folds (which were further subsampled to simulate training data sizes ranging from 10% to 100% for the learning curve) and used to generate predictions on the held-out validation fold. The ensemble model was constructed using predictions generated from a 5-fold, out-of-fold (OOF) procedure on the 80% training subset (for details, refer to the section “Ensemble model construction and final evaluation”). The optimal classification threshold was determined from these predictions as the value that maximized Youden’s J statistic (sensitivity + specificity - 1) on the validation data. Random seeds were set to 0–199 for the 200 cross-validation repetitions to ensure full reproducibility.
The model trained in each fold, along with its optimal threshold, was then applied to the 20% internal test subset to calculate performance metrics. The recorded metrics included AUC, sensitivity, specificity, Brier score, and stage-specific sensitivities. This procedure allowed us not only to compare the average performance of the models but also to evaluate the stability of their predictions (via confidence intervals) and to understand how their performance scales with increasing amounts of training data.
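Threshold selection by Youden's J can be sketched as follows. The function name and the toy scores are hypothetical, and the convention that a sample is called positive when its score is at or above the threshold is an assumption, since the text does not state how ties at the threshold are handled.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    labels: 1 = cancer, 0 = control. A sample is called positive when its
    score is >= threshold (assumed convention). Ties in J keep the lowest
    threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(youden_threshold(scores, labels))  # (0.35, 0.75)
```

At the chosen threshold of 0.35, sensitivity is 4/4 and specificity is 3/4, giving J = 0.75.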
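For reference, the recorded metrics can be computed as in this minimal sketch. The function names and toy data are hypothetical; samples are called positive when the score is at or above the threshold, which is an assumed convention.

```python
from collections import defaultdict

def sensitivity_specificity(scores, labels, threshold):
    """labels: 1 = cancer, 0 = control; positive call when score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg

def brier_score(probabilities, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)

def stage_sensitivity(scores, stages, threshold):
    """Sensitivity per stage, computed over cancer samples only."""
    hits, totals = defaultdict(int), defaultdict(int)
    for s, stage in zip(scores, stages):
        totals[stage] += 1
        hits[stage] += s >= threshold
    return {stage: hits[stage] / totals[stage] for stage in totals}

probs = [0.9, 0.2, 0.4, 0.7]
labels = [1, 0, 1, 0]
print(sensitivity_specificity(probs, labels, threshold=0.5))  # (0.5, 0.5)
print(round(brier_score(probs, labels), 3))                   # 0.225
```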
Statistical significance testing
To validate that the performance of our diagnostic approach was not a result of chance, we performed a permutation test. The labels of the training set (i.e., 'cancer' or 'control') were randomly shuffled 1,000 times, and on each shuffled dataset, the entire 5-fold cross-validation and modeling process was repeated to generate a null distribution of AUC scores. The p-value was calculated as the proportion of AUC scores from the null distribution that were greater than or equal to the AUC score obtained with the real, unshuffled labels. A small p-value (p < 0.001) indicates that the observed model performance is highly statistically significant.
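The p-value definition can be sketched as below. This is a deliberate simplification: the article repeats the entire cross-validation and modeling process on each shuffled dataset, whereas this sketch only shuffles labels against a fixed set of scores. The function names and the toy data are hypothetical.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a cancer sample outscores a control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(scores, labels, n_permutations=1000, seed=0):
    """Proportion of label shuffles whose AUC is >= the observed AUC."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    n_at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            n_at_least += 1
    return observed, n_at_least / n_permutations

# Toy data: ten controls with low scores, ten cancers with high scores.
scores = [i / 20 for i in range(20)]
labels = [0] * 10 + [1] * 10
observed_auc, p_value = permutation_p_value(scores, labels)
print(observed_auc, p_value)
```

With 1,000 permutations, a reported p < 0.001 corresponds to none of the shuffled runs reaching the observed AUC.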
Ensemble model construction and final evaluation
Based on the robustness analysis, an ensemble model that integrates the predictions of all seven candidate models demonstrated superior and more stable performance. The final ensemble model was constructed using predictions generated from a 5-fold, OOF procedure on the entire training set.
Specifically, for each fold, all seven base models were trained on the other four folds and used to predict the held-out validation fold. The average of these seven predictions constituted the OOF prediction for that fold. By repeating this for all five folds, we obtained an OOF prediction score for every sample in the training set. The classification threshold for the ensemble model was determined from the complete set of OOF scores as the value that achieved a specificity of 90%. The final ensemble model integrated the predictions of all seven base models across the five folds by averaging all 35 individual prediction scores.
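A minimal sketch of the two final steps, fixing the threshold at 90% specificity and averaging the 35 scores, is given below. The function names and toy scores are hypothetical, and the convention that a sample is called positive when its score is strictly above the threshold is an assumption not stated in the text.

```python
import math

def threshold_at_specificity(control_scores, target=0.90):
    """Smallest observed control score t such that at least `target` of the
    controls score <= t. Samples with score > t are called positive (assumed)."""
    ordered = sorted(control_scores)
    # The small epsilon guards against floating-point overshoot in target * n.
    k = math.ceil(target * len(ordered) - 1e-9)
    return ordered[k - 1]

def final_ensemble_score(fold_model_scores):
    """Average every base-model score from every fold (5 folds x 7 models = 35)."""
    flat = [s for fold in fold_model_scores for s in fold]
    return sum(flat) / len(flat)

# Toy OOF scores for ten control samples.
controls = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10]
cutoff = threshold_at_specificity(controls)
specificity = sum(1 for s in controls if s <= cutoff) / len(controls)
print(cutoff, specificity)  # 0.09 0.9
```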
Finally, this established ensemble model was applied to the completely independent hold-out test set to report the final, unbiased performance metrics, including overall AUC, stage-specific AUCs, sensitivity, and specificity.
Key resources table
Experimental model and study participant details
The participants were enrolled from the Department of General Surgery, Peking University Third Hospital. The study was approved by the Ethics Committee of Peking University Third Hospital (IRB00006761-2016003). All participants provided informed consent for the collection of samples before inclusion in the study. In total, we collected 448 samples, including plasma samples from CRC patients (n = 170; mean age: 65 years; 55% males), GC patients (n = 101; mean age: 65 years; 74% males) and control participants (n = 177; mean age: 47 years; 56% males). All participants were Asian (Chinese). Detailed participant information can be found in Table S5.
Method details
Genomic DNA isolation and dilution experiment
Genomic DNA (gDNA) from human WBCs was extracted using the DNeasy Blood & Tissue Kit (Qiagen, 69506) according to the manufacturer’s protocol. Fully methylated human genomic DNA (FMG) was purchased from Zymo Research (D5014). For the dilution and control experiments, the FMG or WBC gDNA was sheared to 150–200 bp using the Covaris system to mimic cfDNA. After bisulfite treatment, the Qubit ssDNA Assay Kit (Thermo Fisher, Q10212) was used to quantify the concentration of WBC gDNA and FMG. The bisulfite-treated FMG was serially diluted two-fold from 150 pg to 2.3 pg and spiked into 6 ng of bisulfite-treated WBC gDNA for subsequent tMCTA-Seq library preparation; WBC gDNA without FMG was used as the negative control. For each cfDNA experiment, a positive control (30 pg FMG spiked into 6 ng WBC gDNA) and a negative control (6 ng WBC gDNA only) were included.
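The two-fold dilution series can be worked through numerically. The seven-point length is inferred from the stated endpoints (150 pg down to 2.3 pg), and the figure of about 3.3 pg per haploid human genome is an assumed textbook approximation that does not come from the article.

```python
# Assumed approximate mass of one haploid human genome, in picograms.
HAPLOID_GENOME_PG = 3.3

def dilution_series(start_pg=150.0, steps=7):
    """Two-fold serial dilution starting at `start_pg`."""
    return [start_pg / (2 ** i) for i in range(steps)]

series = dilution_series()
for pg in series:
    equivalents = pg / HAPLOID_GENOME_PG
    print(f"{pg:8.2f} pg  ~ {equivalents:5.1f} haploid genome equivalents")
```

Under these assumptions the lowest input, about 2.3 pg, is less than one haploid genome equivalent.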
Blood sample processing and cfDNA extraction
To collect 2 mL of plasma, approximately 4 mL of peripheral blood was drawn from each participant. To remove WBCs as completely as possible, blood samples were centrifuged at room temperature at 1,350 × g and 16,000 × g, all within 6 h of collection. cfDNA was extracted from the plasma samples using the VAHTS Free-Circulating DNA Maxi Kit (Vazyme, N903), following the manufacturer’s protocol.
Library preparation
The extracted cfDNA was bisulfite-converted using the EZ DNA Methylation-Lightning Kit (Zymo Research, D5031) following the manufacturer’s protocol and eluted in 21 μL of elution buffer prior to library preparation. In the first step, which generates semi-amplicons, the bisulfite-converted DNA was linearly amplified in a 9 μL reaction containing 1× NEBuffer 2, 0.33 μM of primer A, and 2 μL of a mixture of dATP, dCTP, dGTP, and dUTP (Thermo Fisher, R0251). The primer A sequence was adapted from the MCTA-Seq primer A (ref. 28), with modifications including an increased sequence length to facilitate bead purification and the incorporation of uracil. It consisted of an equimolar mixture of the following four primers at a total concentration of 5 μM: A1 (TTTCTCATTCTTCACAATACACATCTTACTTTCCCTACACGACGCTCTTCCGAUCUHHHHHHHHCGCH), A2 (TTTCTCATTCTTCACAATACACATCTTACTTTCCCTACACGACGCTCTTCCGAUCUHHHHHHHCGHCH), A3 (TTTCTCATTCTTCACAATACACATCTTACTTTCCCTACACGACGCTCTTCCGAUCUHHHHHHCGHHCH), and A4 (TTTCTCATTCTTCACAATACACATCTTACTTTCCCTACACGACGCTCTTCCGAUCUHHHHHCGHHHCH). The reaction mixture was incubated at 95°C for 1 min and then held at 4°C. After adding 5 units of Klenow fragment (exo-; NEB, M0212), the reaction was subjected to the following conditions: 4°C for 50 s, 10°C for 1 min, 20°C for 4 min, 30°C for 4 min, 37°C for 4 min, and 95°C for 30 s, followed by a hold at 4°C. An additional 2.5 μL of reaction mixture containing 1× NEBuffer 2, 0.1 μL of a mixture of dATP, dCTP, dGTP, and dUTP (Thermo Fisher, R0251), and 5 units of Klenow fragment (exo-; NEB, M0212) was then added. The reaction was subjected to the following conditions: 4°C for 50 s, 10°C for 1 min, 20°C for 4 min, 30°C for 4 min, 37°C for 4 min, and 75°C for 20 min. A 28 μL product was obtained via 4× bead purification (Beckman Coulter, A63882).
Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.