A meta-analysis identifies driver genes and characterizes the molecular epidemiology of colorectal cancer.
메타분석
1/5 보강
PICO 자동 추출 (휴리스틱, conf 2/4)
유사 논문P · Population 대상 환자/모집단
환자: diverse demographics
I · Intervention 중재 / 시술
추출되지 않음
C · Comparison 대조 / 비교
추출되지 않음
O · Outcome 결과 / 결론
Our study demonstrates how combining evidence from multiple sources allows for new discoveries in cancer genomics.
Colorectal cancer is thought to develop through the stepwise accumulation of somatic mutations.
- 연구 설계 meta-analysis
APA
Olafsson S, Thorarinsson T, et al. (2026). A meta-analysis identifies driver genes and characterizes the molecular epidemiology of colorectal cancer.. Scientific reports, 16(1). https://doi.org/10.1038/s41598-026-39255-3
MLA
Olafsson S, et al.. "A meta-analysis identifies driver genes and characterizes the molecular epidemiology of colorectal cancer.." Scientific reports, vol. 16, no. 1, 2026.
PMID
41703114 ↗
Abstract 한글 요약
Colorectal cancer is thought to develop through the stepwise accumulation of somatic mutations. Recent years have seen the publications of several studies greatly advancing our understanding of the molecular events driving the disease. However, individual studies tend to be small and additional insights may be obtained through the combination of data from multiple sources. We performed targeted sequencing of 2172 colorectal cancers from Icelandic patients and combined these data with publicly available mutation calls from 9 515 additional tumours collected from the literature. Analysing microsatellite stable (MSS) and instable (MSI) tumours separately, we find evidence of positive selection of mutations in 112 genes that replicate across multiple studies of patients with diverse demographics. We carried out a meta-analysis of conditional selection, identifying 57 gene pairs where a mutation in one gene influences the selection of the other. We describe many associations with tumour phenotypes, including a strong association between mucinous histology and mutations in the transcription growth factor beta (TGFb) pathway, only in MSS tumours. Our study demonstrates how combining evidence from multiple sources allows for new discoveries in cancer genomics.
🏷️ 키워드 / MeSH 📖 같은 키워드 OA만
📖 전문 본문 읽기 PMC JATS · ~61 KB · 영문
Introduction
Introduction
Colorectal cancer (CRC) is the third most common cancer and the second leading cause of cancer-related deaths world-wide1. It is thought to arise through the stepwise accumulation and selection of somatic mutations throughout life. Identifying all CRC driver mutations, understanding their impact on the disease’s biology and epidemiology and exploiting this knowledge for drug discovery and development are fundamental goals of CRC genetics.
Multiple studies describing the somatic mutation landscape of CRC have been published in recent years that have greatly advanced our understanding of the molecular events driving the disease2–10. However, most studies include only a few hundred cases and have limited statistical power to make inferences about rarely mutated genes and/or individual molecular subtypes of CRC. Differences in patient demographics, the bioinformatic pipelines and the molecular and statistical methods employed contribute to heterogeneity in results and can lead to spurious or inconsistent findings.
In contrast to human genetics, where large-scale meta-analyses have become the norm, the combination of statistical evidence across multiple cohorts is not a common practice in cancer genomics. The Catalogue Of Somatic Mutations In Cancer (COSMIC) includes a widely used cancer gene census database, reporting 77 genes somatically mutated in CRC11,12. However, COSMIC does not provide formal statistical evidence for the role of those genes in CRC in humans, relying instead on expert manual curation of the scientific literature. The Integrative OncoGenomics (IntOGen) represents a second major effort to collectively analyse data from multiple studies to arrive at a consensus on driver genes13. IntOGen (v. 2023.05.31) reports 94 driver genes for CRC identified by any of seven driver detection methods applied on 1 406 microsatellite stable (MSS) CRC samples from several different sources13. However, while IntOGen driver detection methods are applied on multiple cohorts, no attempt is made to statistically combine evidence across cohorts. Furthermore, IntOGen excludes microsatellite instable (MSI) CRCs and is thus not able to identify genes under selection specifically in this subtype or to facilitate comparisons of the selection landscapes in MSI and MSS tumours.
Here, we performed targeted sequencing of 523 genes in 2 172 colorectal cancer patients from Iceland with extensive clinical metadata. We combined these data with publicly available mutation calls from 9 515 additional tumours from seven different sources and performed a meta-analysis of the point mutations found with the aim of identifying a robust list of cancer driver genes and to understand how mutations in these genes differ between tumour sub-groups and if and how they affect prognosis.
Colorectal cancer (CRC) is the third most common cancer and the second leading cause of cancer-related deaths world-wide1. It is thought to arise through the stepwise accumulation and selection of somatic mutations throughout life. Identifying all CRC driver mutations, understanding their impact on the disease’s biology and epidemiology and exploiting this knowledge for drug discovery and development are fundamental goals of CRC genetics.
Multiple studies describing the somatic mutation landscape of CRC have been published in recent years that have greatly advanced our understanding of the molecular events driving the disease2–10. However, most studies include only a few hundred cases and have limited statistical power to make inferences about rarely mutated genes and/or individual molecular subtypes of CRC. Differences in patient demographics, the bioinformatic pipelines and the molecular and statistical methods employed contribute to heterogeneity in results and can lead to spurious or inconsistent findings.
In contrast to human genetics, where large-scale meta-analyses have become the norm, the combination of statistical evidence across multiple cohorts is not a common practice in cancer genomics. The Catalogue Of Somatic Mutations In Cancer (COSMIC) includes a widely used cancer gene census database, reporting 77 genes somatically mutated in CRC11,12. However, COSMIC does not provide formal statistical evidence for the role of those genes in CRC in humans, relying instead on expert manual curation of the scientific literature. The Integrative OncoGenomics (IntOGen) represents a second major effort to collectively analyse data from multiple studies to arrive at a consensus on driver genes13. IntOGen (v. 2023.05.31) reports 94 driver genes for CRC identified by any of seven driver detection methods applied on 1 406 microsatellite stable (MSS) CRC samples from several different sources13. However, while IntOGen driver detection methods are applied on multiple cohorts, no attempt is made to statistically combine evidence across cohorts. Furthermore, IntOGen excludes microsatellite instable (MSI) CRCs and is thus not able to identify genes under selection specifically in this subtype or to facilitate comparisons of the selection landscapes in MSI and MSS tumours.
Here, we performed targeted sequencing of 523 genes in 2 172 colorectal cancer patients from Iceland with extensive clinical metadata. We combined these data with publicly available mutation calls from 9 515 additional tumours from seven different sources and performed a meta-analysis of the point mutations found with the aim of identifying a robust list of cancer driver genes and to understand how mutations in these genes differ between tumour sub-groups and if and how they affect prognosis.
Methods
Methods
This study complies with all relevant ethical regulations. Icelandic sample donors that were living at the time of sampling gave informed consent for genetic research of their tumour samples and for joining encrypted identifiers with patient phenotypes available at deCODE genetics and recorded in the Icelandic cancer registry. Genetic study of samples from deceased donors is permissible without consent conditional on approval by the Icelandic National Bioethics Committee. The study was reviewed and approved by the Icelandic National Bioethics Committee (refs: VSN-17–137 and VSN-18–142).
Statistics and reproducibility
This is a descriptive study with no interventions. No randomization of patients or blinding of investigators was applied and sample size was not predetermined. Mutations from the MSKCC and AC-ICAM were not used in the dN/dS meta-analysis (see below) as synonymous mutations have not been made publicly available. They were used for discovering associations between mutations and patient and tumour phenotypes. We excluded a total of 151 samples from the combined cohort where the CRC subtype assignment was uncertain (Supplementary Note). We also excluded 127 POLE mutant tumours, i.e. tumours with evidence of a mutation in the exonuclease domain of polymerase epsilon, as this is a small group of tumours and its inclusion in the dN/dS analysis could lead to spurious findings from the dNdScv software. dNdScv seeks to adjust the expanded mutation count for each gene based on the trinucleotide context of the mutated base (the bases 3’ and 5’ to it). While this is sufficient to capture the effects of many mutational processes, the effects of sequence context on mutation rates in the context of POLE-mutant tumours extend well beyond the trinucleotide context, potentially introducing biases to the dNdScv analysis14.
The Icelandic cohort of colorectal cancers
We obtained samples and attempted sequencing of 2 548 tumour cores from formalin-fixed and paraffin embedded CRCs dating from 1983 to 2019. DNA isolated from the cores was sequenced using the TSO-500 targeted gene panel from Illumina. In accordance with the manufacturer’s instructions, we excluded samples if the median sequencing coverage was less than 150X, if the median insert size of the DNA library was less than 70 base pairs or if less than 90% of the exon regions assessed had coverage less than 50X. Somatic mutations were called using the TruSight Oncology 500 analysis pipeline (v 1.3.1.3 and 2.1.0.60)15. See the Supplementary Note for further information.
The Icelandic Cancer Registry (http://www.cancerregistry.is) contains > 99% of all diagnoses of solid cancers made in Iceland from January 1 st, 195516. Based on encrypted unique national identifiers, we obtained from the registry specific information relating to the cancer such as the year and age of diagnosis, tumour stage at the time of surgery, anatomical site of the primary tumour, survival information and more. In addition, tumour slides were reviewed by one of three pathologists working on the project to determine mucinous status and other pathological variables as previously described17.
Previously published colorectal cancer datasets
For this study, we downloaded mutation calls and associated clinical meta-data for > 9 000 colorectal cancer patients from several sources listed below. All data were downloaded in August 2022 except for AC-ICAM and Uppsala, where the data were downloaded in February and November 2024, respectively.
We accessed data from six studies which have made mutation calls from whole-exome sequencing (WES) publicly available:
Giannakis et al.2 sequenced formalin-fixed tumours from 619 individuals from two prospective studies carried out in the United states: The Nurses‘ Health Study (n = 380) and the Health Professionals Follow-up Study (n = 239). Samples were sequenced to a median depth of 90x.
Grasso et al.4 report the latest results from the whole-exome sequencing of colorectal cancers belonging to The Cancer Genome Atlas (TCGA) (RRID: SCR_003193). The authors of Grasso et al. report the mutation calls as a supplementary table. However, this table does not include synonymous variants which are necessary for dN/dS analysis. We therefore used the TCGAmutations package18 in R (v 0.4.0) to download mutation calls and associated meta-data from TCGA projects Colon Adenocarcinoma (COAD) and Rectum Adenocarcinoma (READ) and joined this with the metadata provided by Grasso et al. After joining, 553 patients from TCGA were included.
Vasaikar et al.8 performed WES to a minimum depth of 150x as part of a proteomic analysis of 106 fresh-frozen, primary colon cancers collected in the United States by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). This study does not include any rectal cancers.
Zhao et al.7 report the results from whole-exome sequencing of 1 005 tumour-normal pairs from Chinese patients as part of the ChangKang Project. Tumour samples were sequenced to a median depth of 219x. We downloaded a maf file from the ChangKang project website (https://changkang.hapyun.com/download/).
Roelands et al.19 performed WES to a median depth of 129X on 281 fresh-frozen samples collected at Leiden University in the Netherlands. This was part of an effort to create the Atlas and Compass of Immune-Cancer-Microbiome (AC-ICAM) resource.
Nunes et al.20 performed WES on 1 063 fresh frozen primary CRC tumours and their paired control samples collected at Uppsala University Hospital or Umeå University Hospital in Sweden. All samples yielded data of ≥ 60× read coverage.
The remaining data come from studies which used targeted sequencing of gene panels:
Zaidi et al.3 carried out targeted sequencing of 205 genes (listed in the supplementary information of the original publication) in 2 105 formalin-fixed colorectal cancers. The average sequencing coverage of the tumours was 857x. Samples are a part of the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO) and the Colon Cancer Family Registry (CCFR) cohorts. Ancestry of individual patients is not reported but GECCO is a collaboration of researchers from North America, Australia, Asia and Europe while samples belonging to the CCFR were collected in Ontario and Seattle.
A large number of patients sequenced at the Memorial Sloan Kettering Cancer Center (MSKCC) in New York were included in this study. These have been sequenced using one of four different versions of the MSK-IMPACT (integrated mutation profiling of actionable targets) gene panel and published as part of one or more of several MSKCC studies which have used partially overlapping patient cohorts. Clinical metadata and mutation calls were downloaded through cBioPortal (RRID: SCR_014555) (https://www.cbioportal.org/). To avoid double-counting samples, we merged all the MSKCC studies and then used the mutation calls and phenotypes reported in the most recent study in which the sample appears, assuming that methods used for mutation calling and filtering and metadata collection have improved over time. When both primary and metastatic tumours from the same individual were sequenced, we retained only the sample representing the primary tumour.
A dN/dS meta-analysis
We used the dNdScv software14 (RRID: SCR_023123) to calculate the ratio of the mutation rate at non-synonymous sites and the mutation rate at synonymous sites after taking into account several variables affecting the mutation rate. dN/dS ratios were estimated for each gene in each study separately for MSS and MSI tumours. The parameters used for dndscv are described in the Supplementary Note. We calculated P-values for missense variants, nonsense/splice variants and indels for each gene.
We used Stouffer’s Z-score method to conduct a meta-analysis weighted by the square root of the sample size. Where k is the number of studies and Is the weight of study i which has sample size Ni.
We combined Pmis, Ptru and PInd for each study and P-values for each annotation class were further combined into PGlobal. We considered a gene significant if any of Pmis, Ptru or PGlobal passed Bonferroni correction for multiple testing (threshold = 0.05/20 091 coding genes = 2.5E-6). To guard against artifacts, we set two conditions a gene must fulfil for us to consider it significant: First, we required P < 0.1 in at least three studies to guard against study specific artifacts. Second, if PInd was smaller than Pmis and Ptru, then we required Ptru<0.01. In other words, genes driven by indel mutations should show at least suggestive evidence for enrichment of nonsense mutations to be considered significant. The genes removed by these criteria are ELF3, PLK1, TRIM51, CDKN1B and JUN for MSS tumours and AXIN2, MBD6, KLF3, BCL9L, TTK, ZFP36L2, RPL22, ALG10, EFNB3, CD93, CCDC73, YIF1A, COL18A1, CDH10, ZNF800, CTCF, TRAM1L1, DAZAP1, TEAD2, PCDH10, TM9SF3, AXIN1, FAM214B, GAMT, PTPRK, LMTK3, SREK1IP1, MAP9, COL5A1, KANSL1L, CSNK1A1L, FAM83E and PNPLA5 for MSI tumours. To combine the dN/dS ratios, we calculated a weighted average of the estimated dN/dS ratios for each study, with weights corresponding to the square roots of the sample size of each study. We calculated standard errors based on the Z-scores for each gene to derive 95% confidence intervals.
In the MSI meta-analysis, 18 genes (MUC22, AC026740.1, AL390778.1, BOP1, C2orf72, CTD-2368P22.1, DUX4L4, FAM230A, GCNT6, HUG1, KDM5D, NLGN4Y, NTF4, PCDH11Y, RAB35, USP9Y, UTY and ONECUT3) had no mutations at all and reached significance for a depletion of mutations. This could be indicative of negative selection of mutations in these genes. However, we considered mapping problems to be a more probable explanation and excluded these genes from further analysis.
DepMap analysis
For this study, we used CRISPR screening data from the DepMap project (Release 25Q3)21,22. For each cell line-gene pair, we consideredthat there was a dependency if the gene effect score was lower than − 1. We calculated the proportion of cell lines that were dependent on each gene found significant in the dN/dS analysis, excluding AMY1C because DepMap lacked data for that gene.
Meta-analysis of epistasis among driver genes
We used the Coselens software23 (RRID: SCR_022578) (v. 1.0) to detect patterns of co-occurrence and mutual exclusivity among the driver genes. Coselens controls for biases in tumour mutation load (Supplementary Note). It extends dNdScv to compare the number of driver mutations per tumour after partitioning the dataset by mutation status of a “split-gene”. We considered genes mutated in > 200 tumours as split genes, which resulted in 29 and 43 split genes for MSS and MSI tumours, respectively. We then tested each gene significant in the dN/dS analysis for each group for epistasis with each of the split genes, resulting in 4 759 tests (29 × 87 + 43 × 52). Results were considered significant if P < 1.1E-5 (0.05/4 759). Estimates of differences in driver prevalence by split gene were generated for each study separately and combined using Stouffer’s Z-score method as described above.
Molecular epidemiology of driver genes
We tested genes mutated in > 50 tumours for associations with patient and tumour phenotypes using logistic regression. For each study and gene, we fitted a fixed-effect logistic regression model to predict if tumours carried any non-synonymous coding mutation in the gene based on sex and age of the patient and the location of the primary tumour. Tumours carrying only synonymous mutations in a given gene were assigned a wild-type status. The age distribution in each study was inverse-normal transformed prior to fitting the model. We used likelihood ratio tests to test if the inclusion of each variable improved the fit of the model. To test tumour stage, tumour grade, mucinous histology, BMI and smoking status, we similarly tested if adding these variables to a model which included age, sex and location improved the fit of the model using a likelihood ratio test. We used the METAL software24 (RRID: SCR_002013) to carry out a fixed effect, inverse-variance weighted meta-analysis of all studies.
To assess mutation impact on prognosis, we fitted Cox proportional hazards regression models using tumour stage, age and location as covariates and tested if adding the mutation status of the gene improved the fit of the model using a likelihood ratio test.
This study complies with all relevant ethical regulations. Icelandic sample donors that were living at the time of sampling gave informed consent for genetic research of their tumour samples and for joining encrypted identifiers with patient phenotypes available at deCODE genetics and recorded in the Icelandic cancer registry. Genetic study of samples from deceased donors is permissible without consent conditional on approval by the Icelandic National Bioethics Committee. The study was reviewed and approved by the Icelandic National Bioethics Committee (refs: VSN-17–137 and VSN-18–142).
Statistics and reproducibility
This is a descriptive study with no interventions. No randomization of patients or blinding of investigators was applied and sample size was not predetermined. Mutations from the MSKCC and AC-ICAM were not used in the dN/dS meta-analysis (see below) as synonymous mutations have not been made publicly available. They were used for discovering associations between mutations and patient and tumour phenotypes. We excluded a total of 151 samples from the combined cohort where the CRC subtype assignment was uncertain (Supplementary Note). We also excluded 127 POLE mutant tumours, i.e. tumours with evidence of a mutation in the exonuclease domain of polymerase epsilon, as this is a small group of tumours and its inclusion in the dN/dS analysis could lead to spurious findings from the dNdScv software. dNdScv seeks to adjust the expanded mutation count for each gene based on the trinucleotide context of the mutated base (the bases 3’ and 5’ to it). While this is sufficient to capture the effects of many mutational processes, the effects of sequence context on mutation rates in the context of POLE-mutant tumours extend well beyond the trinucleotide context, potentially introducing biases to the dNdScv analysis14.
The Icelandic cohort of colorectal cancers
We obtained samples and attempted sequencing of 2 548 tumour cores from formalin-fixed and paraffin embedded CRCs dating from 1983 to 2019. DNA isolated from the cores was sequenced using the TSO-500 targeted gene panel from Illumina. In accordance with the manufacturer’s instructions, we excluded samples if the median sequencing coverage was less than 150X, if the median insert size of the DNA library was less than 70 base pairs or if less than 90% of the exon regions assessed had coverage less than 50X. Somatic mutations were called using the TruSight Oncology 500 analysis pipeline (v 1.3.1.3 and 2.1.0.60)15. See the Supplementary Note for further information.
The Icelandic Cancer Registry (http://www.cancerregistry.is) contains > 99% of all diagnoses of solid cancers made in Iceland from January 1 st, 195516. Based on encrypted unique national identifiers, we obtained from the registry specific information relating to the cancer such as the year and age of diagnosis, tumour stage at the time of surgery, anatomical site of the primary tumour, survival information and more. In addition, tumour slides were reviewed by one of three pathologists working on the project to determine mucinous status and other pathological variables as previously described17.
Previously published colorectal cancer datasets
For this study, we downloaded mutation calls and associated clinical meta-data for > 9 000 colorectal cancer patients from several sources listed below. All data were downloaded in August 2022 except for AC-ICAM and Uppsala, where the data were downloaded in February and November 2024, respectively.
We accessed data from six studies which have made mutation calls from whole-exome sequencing (WES) publicly available:
Giannakis et al.2 sequenced formalin-fixed tumours from 619 individuals from two prospective studies carried out in the United states: The Nurses‘ Health Study (n = 380) and the Health Professionals Follow-up Study (n = 239). Samples were sequenced to a median depth of 90x.
Grasso et al.4 report the latest results from the whole-exome sequencing of colorectal cancers belonging to The Cancer Genome Atlas (TCGA) (RRID: SCR_003193). The authors of Grasso et al. report the mutation calls as a supplementary table. However, this table does not include synonymous variants which are necessary for dN/dS analysis. We therefore used the TCGAmutations package18 in R (v 0.4.0) to download mutation calls and associated meta-data from TCGA projects Colon Adenocarcinoma (COAD) and Rectum Adenocarcinoma (READ) and joined this with the metadata provided by Grasso et al. After joining, 553 patients from TCGA were included.
Vasaikar et al.8 performed WES to a minimum depth of 150x as part of a proteomic analysis of 106 fresh-frozen, primary colon cancers collected in the United States by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). This study does not include any rectal cancers.
Zhao et al.7 report the results from whole-exome sequencing of 1 005 tumour-normal pairs from Chinese patients as part of the ChangKang Project. Tumour samples were sequenced to a median depth of 219x. We downloaded a maf file from the ChangKang project website (https://changkang.hapyun.com/download/).
Roelands et al.19 performed WES to a median depth of 129X on 281 fresh-frozen samples collected at Leiden University in the Netherlands. This was part of an effort to create the Atlas and Compass of Immune-Cancer-Microbiome (AC-ICAM) resource.
Nunes et al.20 performed WES on 1 063 fresh frozen primary CRC tumours and their paired control samples collected at Uppsala University Hospital or Umeå University Hospital in Sweden. All samples yielded data of ≥ 60× read coverage.
The remaining data come from studies which used targeted sequencing of gene panels:
Zaidi et al.3 carried out targeted sequencing of 205 genes (listed in the supplementary information of the original publication) in 2 105 formalin-fixed colorectal cancers. The average sequencing coverage of the tumours was 857x. Samples are a part of the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO) and the Colon Cancer Family Registry (CCFR) cohorts. Ancestry of individual patients is not reported but GECCO is a collaboration of researchers from North America, Australia, Asia and Europe while samples belonging to the CCFR were collected in Ontario and Seattle.
A large number of patients sequenced at the Memorial Sloan Kettering Cancer Center (MSKCC) in New York were included in this study. These have been sequenced using one of four different versions of the MSK-IMPACT (integrated mutation profiling of actionable targets) gene panel and published as part of one or more of several MSKCC studies which have used partially overlapping patient cohorts. Clinical metadata and mutation calls were downloaded through cBioPortal (RRID: SCR_014555) (https://www.cbioportal.org/). To avoid double-counting samples, we merged all the MSKCC studies and then used the mutation calls and phenotypes reported in the most recent study in which the sample appears, assuming that methods used for mutation calling and filtering and metadata collection have improved over time. When both primary and metastatic tumours from the same individual were sequenced, we retained only the sample representing the primary tumour.
A dN/dS meta-analysis
We used the dNdScv software14 (RRID: SCR_023123) to calculate the ratio of the mutation rate at non-synonymous sites and the mutation rate at synonymous sites after taking into account several variables affecting the mutation rate. dN/dS ratios were estimated for each gene in each study separately for MSS and MSI tumours. The parameters used for dndscv are described in the Supplementary Note. We calculated P-values for missense variants, nonsense/splice variants and indels for each gene.
We used Stouffer’s Z-score method to conduct a meta-analysis weighted by the square root of the sample size. Where k is the number of studies and Is the weight of study i which has sample size Ni.
We combined Pmis, Ptru and PInd for each study and P-values for each annotation class were further combined into PGlobal. We considered a gene significant if any of Pmis, Ptru or PGlobal passed Bonferroni correction for multiple testing (threshold = 0.05/20 091 coding genes = 2.5E-6). To guard against artifacts, we set two conditions a gene must fulfil for us to consider it significant: First, we required P < 0.1 in at least three studies to guard against study specific artifacts. Second, if PInd was smaller than Pmis and Ptru, then we required Ptru<0.01. In other words, genes driven by indel mutations should show at least suggestive evidence for enrichment of nonsense mutations to be considered significant. The genes removed by these criteria are ELF3, PLK1, TRIM51, CDKN1B and JUN for MSS tumours and AXIN2, MBD6, KLF3, BCL9L, TTK, ZFP36L2, RPL22, ALG10, EFNB3, CD93, CCDC73, YIF1A, COL18A1, CDH10, ZNF800, CTCF, TRAM1L1, DAZAP1, TEAD2, PCDH10, TM9SF3, AXIN1, FAM214B, GAMT, PTPRK, LMTK3, SREK1IP1, MAP9, COL5A1, KANSL1L, CSNK1A1L, FAM83E and PNPLA5 for MSI tumours. To combine the dN/dS ratios, we calculated a weighted average of the estimated dN/dS ratios for each study, with weights corresponding to the square roots of the sample size of each study. We calculated standard errors based on the Z-scores for each gene to derive 95% confidence intervals.
In the MSI meta-analysis, 18 genes (MUC22, AC026740.1, AL390778.1, BOP1, C2orf72, CTD-2368P22.1, DUX4L4, FAM230A, GCNT6, HUG1, KDM5D, NLGN4Y, NTF4, PCDH11Y, RAB35, USP9Y, UTY and ONECUT3) had no mutations at all and reached significance for a depletion of mutations. This could be indicative of negative selection of mutations in these genes. However, we considered mapping problems to be a more probable explanation and excluded these genes from further analysis.
DepMap analysis
For this study, we used CRISPR screening data from the DepMap project (Release 25Q3)21,22. For each cell line-gene pair, we consideredthat there was a dependency if the gene effect score was lower than − 1. We calculated the proportion of cell lines that were dependent on each gene found significant in the dN/dS analysis, excluding AMY1C because DepMap lacked data for that gene.
Meta-analysis of epistasis among driver genes
We used the Coselens software23 (RRID: SCR_022578) (v. 1.0) to detect patterns of co-occurrence and mutual exclusivity among the driver genes. Coselens controls for biases in tumour mutation load (Supplementary Note). It extends dNdScv to compare the number of driver mutations per tumour after partitioning the dataset by mutation status of a “split-gene”. We considered genes mutated in > 200 tumours as split genes, which resulted in 29 and 43 split genes for MSS and MSI tumours, respectively. We then tested each gene significant in the dN/dS analysis for each group for epistasis with each of the split genes, resulting in 4 759 tests (29 × 87 + 43 × 52). Results were considered significant if P < 1.1E-5 (0.05/4 759). Estimates of differences in driver prevalence by split gene were generated for each study separately and combined using Stouffer’s Z-score method as described above.
Molecular epidemiology of driver genes
We tested genes mutated in > 50 tumours for associations with patient and tumour phenotypes using logistic regression. For each study and gene, we fitted a fixed-effect logistic regression model to predict if tumours carried any non-synonymous coding mutation in the gene based on sex and age of the patient and the location of the primary tumour. Tumours carrying only synonymous mutations in a given gene were assigned a wild-type status. The age distribution in each study was inverse-normal transformed prior to fitting the model. We used likelihood ratio tests to test if the inclusion of each variable improved the fit of the model. To test tumour stage, tumour grade, mucinous histology, BMI and smoking status, we similarly tested if adding these variables to a model which included age, sex and location improved the fit of the model using a likelihood ratio test. We used the METAL software24 (RRID: SCR_002013) to carry out a fixed effect, inverse-variance weighted meta-analysis of all studies.
To assess mutation impact on prognosis, we fitted Cox proportional hazards regression models using tumour stage, age and location as covariates and tested if adding the mutation status of the gene improved the fit of the model using a likelihood ratio test.
Results
Results
We performed targeted sequencing of 523 genes in 2 172 colorectal tumours from Iceland using the Illumina TSO500 panel15 (Methods). We combined these data with publicly available mutation calls from eight previously published sequencing studies of colorectal cancer (Fig. 1, Supplementary Table S1). This included 553 cases from The Cancer Genome Atlas (TCGA)4,25, 106 from the Clinical Proteomic Tumor Analysis Consortium (CPTAC)8, 612 cases from the Nurses Health and Health Professionals Follow-up Studies (NHS/HPFS)2, 974 cases collected as part of the Chinese ChangKang study7, 280 cases from the Atlas and Compass of Immune–Cancer–Microbiome interactions (AC-ICAM) study19, 1 058 cases from Uppsala University Hospital and Umeå University Hospital (Uppsala), 2 019 cases from the Genetics and Epidemiology of Colorectal Cancer Consortium and the Colon Cancer Family Registry (GECCO/CCFR)3 and 4 044 cases obtained by aggregating data from several partially overlapping cohorts underlying recent publications from the Memorial Sloan Kettering Cancer Center (MSKCC)5,6,9,10,26–28. The combined cohort consists of 10 187 (87.4%) non-hypermutated MSS tumours and 1 473 (12.6%) MSI-Hypermutators. There are large differences between these groups across every level of analysis and we stratify by subtype in every analysis presented below. In addition, a subset of the cases is used for each analysis depending on the completeness of the clinical meta-data for each patient and the genes assessed for mutations in each study, as some studies performed targeted sequencing of gene panels.
dNdScv meta-analysis identifies 112 recurrently mutated genes
To look for evidence of positive selection, we used dNdScv (v.0.0.1.0)14 to estimate the ratio of the mutation rate at non-synonymous and synonymous sites (dN/dS) after correcting for trinucleotide context and genomic covariates affecting the mutation rate, including chromatin state, expression level and replication timing (Supplementary Note). We calculated dN/dS ratios for all protein coding genes separately for MSI and MSS tumours in each study (Fig. 2), excluding MSKCC and AC-ICAM because they have not made synonymous mutations available (Methods). We combined the P-values for each gene in MSI and MSS tumours separately using Stouffer’s Z-score method, assigning to each study a weight proportional to the square root of the sample size. To guard against genes driven by study-specific artifacts, we required at least three studies to have P < 0.1 to claim a gene as significant. We also removed genes that showed an enrichment of indels but no evidence for an enrichment of single nucleotide variants annotated as truncating mutations (Methods).
After Bonferroni correction for multiple testing, 112 out of 20 091 protein coding genes tested showed dN/dS ratios significantly greater than 1 (Supplementary Table S2) in the MSS (87 genes) and MSI (52 genes) groups of tumours, indicating positive selection of mutations in these genes (Figs. 2 and 3a). In the MSS and MSI tumours, 45 and 27 genes, respectively, reach significance in the meta-analysis but not in any one individual study (Fig. 3b,c), highlighting the power of combining data across multiple cohorts. We excluded 8 (MSS) and 6 (MSI) genes which only reached significance in a single study, since these may represent study-specific artifacts or mutations only selected in particular demographic groups.
We compared the list of significant genes to the IntOGen and COSMIC databases and found that 37 and 26 genes significant in the MSS and MSI groups, respectively, are linked with CRC in neither of these consensus databases (Supplementary Figure S1). In the MSS group, this includes genes with well-established roles in other cancers13, like IDH1 (p.R132 hotspot), CTCF, PPM1D, KLF5 and ARID2, which we show can also drive colorectal cancer. Other genes belong to established CRC pathways. For example, we found an enrichment of nonsense mutations in RASA1, a regulator of RAS/MAPK signalling, and in MLK4 and MAP2K7, two mitogen-activated protein kinases. Members of the transforming growth factor beta (TGFb) pathway, including TGFBR1, ACVR1B and RGMB, also reached significance in this study. Finally, the list of significant genes not in IntOGen or COSMIC includes several genes involved in transcriptional regulation, including SIN3A, ZBTB7A, TRPS1 and TNRC6B. Among the MSI group of tumours, many significant genes not listed in IntOGen or COSMIC are involved in antigen presentation or downstream apoptotic response, including CD58, HLA-B, HLA-C, NLRC5, RFX5 and CASP8. Although not listed in IntOGen or COSMIC, mutations affecting the antigen presentation machinery are known to associate with MSI CRCs4. Using the Cancer Dependency Map21,22, we tested whether driver genes represent dependencies in CRC cell-lines and found this to generally not be the case (Supplementary Table 2c).
Comparing the selection landscape between MSS and MSI tumours
Statistical power for driver gene detection varies drastically between the MSS and MSI cohorts. Not only is the sample size for MSI tumours much smaller, but the high background mutation rate of this tumour type reduces dN/dS estimates genome-wide compared to MSS tumours. Despite these limitations, it seems clear that the selection landscapes of MSS and MSI tumours overlap considerably, with 27 genes fulfilling criteria for significance in both groups (Fig. 4a). Furthermore, out of the 60 genes reaching significance in MSS and not in the MSI cohort, 40 have P < 0.05 in the MSI cohort, suggesting mutations in these genes may be positively selected in both CRC subtypes, although they only achieve statistical significance in the MSS cohort.
We compared the selection of mutations in MSS and MSI tumours in two ways. First, we compared the dN/dS ratios for missense and truncating mutations (Supplementary Figure S2a and b). However, dN/dS estimates are expected to be generally lower for MSI tumours owing to the high background mutation burden. To formally compare the number of driver mutations per MSS and MSI tumour after accounting for the differences in mutation burden and trinucleotide mutation spectrum, we used the Coselens software (v. 1.0)23. Coselens extends dNdScv to estimate the number of driver mutations in each of the 112 significant genes and tests for differences between two groups of tumours using a likelihood ratio test.
Besides mutations in APC, KRAS and TP53, which are known to be more common in MSS than MSI tumours, and mutations in components of the antigen precenting machinery, such as CD58, HLA-B, HLA-C, NLRC5 and RFX5, which are known to facilitate immune escape in MSI tumours4, we identified several other genes showing evidence of different magnitude of selection between MSS and MSI tumours (Fig. 4b, Supplementary Figure S2c and d, Supplementary Table S3).
The most striking example is PCBP1, for which we observed 134 tumours carrying a missense mutation (p.L100[PQR]) in a gain-of-function hotspot29. All 134 tumours in which these mutations are found are MSS (Supplementary Figure S3), indicating MSS-specific selection. PCBP1 encodes Poly(RC)-Binding Protein 1 which binds to mRNA and regulates transcription of multiple genes. There are several other examples of genes involved in transcriptional regulation, including ZBTB7A, KLF5, CHD4, TBX3 and SIN3A, which have dN/dS ratios > 1 in MSS tumours but show no evidence of selection in MSI tumours.
Genes mapping to canonical CRC pathways (WNT-signalling, TGFb, RAS/MAPK, TP53), tend to show dN/dS ratios either significantly or nominally significantly greater than 1 in both MSS and MSI tumours. However, our analysis reveals preferences for particular genes within these pathways in each tumour type. In the WNT pathway for example, we observe more truncating drivers in APC, TCF7L2 and AMER1 in MSS tumours while ZNRF3, CTNNB1, BCL9L and RNF43 harbour more drivers per MSI tumour. In the RAS/MAPK pathway, KRAS and NRAS have more drivers per MSS tumour while BRAF, MAP2K7 and MAP3K1 have more drivers per MSI tumour. Finally, several genes mapping to the TGFb pathway show more drivers in MSI than in MSS tumours, including BMPR2, ACVR1B and TGFBR2.
Mutational co-occurrence meta-analysis
When a driver mutation arises, it’s effect on the fitness of the host clone depends, among other things, on the set of mutations already carried by the clone. Mutations can interact synergistically to increase the fitness of (pre)cancerous clones or interfere with one another such that the total fitness increase is less than the sum of the gain expected from each driver in isolation. To look for evidence of mutational co-occurrence or exclusivity, we used the Coselens software (v. 1.0)23. Coselens extends dNdScv to compare the number of driver mutations in a gene (query gene) depending on the mutation status of a second gene (split gene) while controlling for systematic biases in tumour mutation load and spectrum (Supplementary Note).
We identified 54 gene-pairs for MSS and 9 for MSI that passed Bonferroni correction for multiple testing (Fig. 5, Supplementary Table S4, Methods). Among these are gene pairs known to exhibit patterns of mutual exclusivity (for example KRAS and BRAF or NRAS and TP53-ATM) and co-occurrence (for example BRAF-RNF43) but also gene pairs that are less well established, including co-occurrence of mutations in KRAS and PCBP1 and of mutations in PIK3CA and CHD4 in MSS tumours. Stratifying by MSI status reveals patterns of conditional selection seemingly specific to MSS tumours, including mutual exclusivity of mutations in PIK3CA/KRAS and TP53. Among MSI tumours, there is a dearth of MLH1 mutations among BRAF mutated tumours and an enrichment of MSH2 mutations among KRAS mutant tumours. The former may result from the co-occurrence of BRAFV600E and hypermethylation of the MLH1 promoter in the CpG island methylator phenotype (CIMP) and the latter from the double mutual exclusivity between MLH1/MSH2 on one hand and BRAF/KRAS on the other.
Molecular epidemiology of driver mutations
Driver point-mutations are often acquired early in life and have influenced the fitness of the normal cells from which the cancer develops decades before diagnosis30. This likely manifests as differences in the mutation frequency of tumours depending on demographic and epidemiological factors which influence selection of pre-neoplastic cells. To look for associations between mutations and tumour sub-groups, we regressed mutation status on sex, age and tumour location (proximal colon, distal colon, rectum) per gene identified in the dN/dS analysis with > 50 mutated tumours (Supplementary Figures S4 and S5, Supplementary table S5). We tested 79 and 95 genes in the MSS and MSI groups, respectively. Coefficients were considered significantly different from zero if P < 2.9E-4 (0.05/(79 + 95)). We note that in these, and all subsequent analyses of this manuscript, the MSKCC and AC-ICAM cohorts can be included as only non-synonymous coding mutations are studied.
Mutations in three genes, BRAF, KRAS and USP9X, associate with CRC in a sex dependent manner (Fig. 6a). Consistent with previous reports31, BRAF and KRAS interact with sex in both MSS and MSI tumours. Interestingly KRAS mutations are more common in men with MSI tumours but in women with MSS tumours. BRAF mutations are more commonly found in women in both groups. USP9X is only significantly associated with sex in MSI tumours, where USP9X mutations are more common in women.
Mutations in ten genes interact with age in our models and some show large differences between MSS and MSI tumours (Fig. 6b). APC, KRAS and TP53 each show significant or nominally significant but opposing age effects in the two groups while BRAF, CTNNB1, MLH1, RNF43 and SLITRK1 associate with age only in the MSI group and ZFP36L2 only in the MSS group. FBXW7 is the only gene showing a comparable age effect between MSI and MSS tumours. Taken together, these results highlight the importance of stratifying by MSI status when studying age effects in CRCs.
The location of the primary tumour within the intestine has a major effect on the mutation frequencies of individual genes. Location interacts with mutations in 18 genes in MSS tumours after correcting for age and sex (Fig. 6c). In addition, mutations in ERBB3 (both overall and when restricting to oncogenic hotspots) are more common in MSI tumours arising in the rectum than in proximal colon but show no location effect in MSS tumours. Several genes which interact with location in MSS tumours, including BRAF, TP53, TCF7L2 and FBXW7, show suggestive associations for MSI tumours (P < 0.05) and effect sizes comparable to MSS tumours, suggesting that location effects may be consistent in both subtypes but statistical power to discover location effects is greater in the MSS group.
We next tested whether pathological stage associates with mutations in particular genes after correcting for sex, age and tumour location among 6 533 MSS and 1 098 MSI tumours with available stage information. Mutations in TP53, BRAF and SMAD4 are increasingly prevalent at more advanced tumour stages in MSS tumours and show a non-significant trend in the same direction in MSI tumours (Fig. 6d and e).
MSS mucinous tumours show an enrichment of mutations in the TGFb pathway
Most CRCs have some degree of extracellular mucinous component visible during histological assessment. When mucin accounts for > = 50% of the tumour volume, the cancer is said to have a mucinous histology. This histological subtype accounts for 10–15% of all CRCs and is more common among the MSI subtype32. In our cohort, 30% and 7.5% of MSI and MSS tumours have mucinous histology, respectively. We tested if mucinous histology associates with mutation status separately for MSI (N = 236 mucinous, 563 non-mucinous) and MSS tumours (N = 395 mucinous, 4 880 non-mucinous) adjusting for age, sex and tumour location. We found nine genes which significantly associate with mucinous histology in MSS tumours only. As previously described, mutations in KRAS are more common and mutations in TP53 are less common in tumours with mucinous histology33,34 (Fig. 7a and b). Strikingly, out of the seven additional genes associated with mucinous histology, six belong to the TGFb pathway. Mutations in SMAD4, SMAD3, SMAD2, ACVR1B, TGFBR1 and IDH1 are each significantly more common in MSS mucinous tumours. 48% of mucinous MSS tumours carry a mutation in one of these genes compared to 19% of non-mucinous MSS tumours and if additional genes affecting the TGFb pathway are considered (ACVR2A, TGFBR2, BMPR2), the fraction of mucinous MSS tumours that carry a mutation rises to 64% vs. 33% in non-mucinous tumours. This enrichment is only seen for MSS tumours but we note that MSI-tumours almost universally (~ 95%) carry a point mutation in one or more of the TGFb pathway genes listed above, which may help to explain why a mucinous phenotype is seen in such a high fraction of MSI tumours.
Mucinous histology is a somewhat subjective binary classification of tumours. To derive an objective, quantitative estimate of mucin levels in tumours, we used a previously published convolution neural network model35 to perform automated tissue decomposition into eight tissue classes in 1 864 (NMSS=1 645, NMSI=219) haematoxylin and eosin stained tissue slides from the deCODE-CRC cohort17 (Methods, Fig. 7c). We tested if quantitative mucin levels predict mutation status of any of the nine genes that associated with mucinous histology in a model which included binary mucinous status, as defined by a pathologist, as a covariate. Mucin levels associate with mutation status in KRAS (OR = 1.33 (1.17–1.50 95% CI), P = 7.3E-6), TP53 (0.71 (0.61–0.82, 95% CI), P = 6.3E-6), SMAD4 (OR = 1.50 (1.22–1.85, 95% CI), P = 6.7E-5) and SMAD2 (OR = 1.74 (1.14–2.64, 95% CI), P = 7.4E-3) (Fig. 7d). Out of the nine genes associated with mucinous histology, these four are among the most commonly mutated in the cohort and the genes for which we have the greatest statistical power to detect an association with mucin levels.
Driver mutations predict overall survival
Six of the cohorts report overall survival from diagnosis. We tested for associations between mutations and overall survival after correcting for tumour stage, sex, age and location of the tumour within the colon. In agreement with previous reports36–38, we found that among MSS tumours (N = 4 814), mutations in APC are consistently associated with improved prognosis (HR = 0.74, (0.67–0.82 95% CI), P = 1.6E-9) while mutations in BRAF associate with worse prognosis (HR = 1.41, (1.23–1.61, 95% CI), P = 1.1E-6) (Supplementary Figures S6 and S7). The association with BRAF was further strengthened when we tested BRAF V600E carriers against wild type (HR = 1.64 (1.40–1.92, 95% CI), P = 8.9E-10). In addition to APC and BRAF, mutations in RNF43 are associated with worse prognosis in MSS tumours in our dataset (HR = 1.66 (1.36–2.02, 95% CI), P = 4.6E-7, Fig. 8a and b). No gene reached significance in the MSI group after correction for multiple testing.
We performed targeted sequencing of 523 genes in 2 172 colorectal tumours from Iceland using the Illumina TSO500 panel15 (Methods). We combined these data with publicly available mutation calls from eight previously published sequencing studies of colorectal cancer (Fig. 1, Supplementary Table S1). This included 553 cases from The Cancer Genome Atlas (TCGA)4,25, 106 from the Clinical Proteomic Tumor Analysis Consortium (CPTAC)8, 612 cases from the Nurses Health and Health Professionals Follow-up Studies (NHS/HPFS)2, 974 cases collected as part of the Chinese ChangKang study7, 280 cases from the Atlas and Compass of Immune–Cancer–Microbiome interactions (AC-ICAM) study19, 1 058 cases from Uppsala University Hospital and Umeå University Hospital (Uppsala), 2 019 cases from the Genetics and Epidemiology of Colorectal Cancer Consortium and the Colon Cancer Family Registry (GECCO/CCFR)3 and 4 044 cases obtained by aggregating data from several partially overlapping cohorts underlying recent publications from the Memorial Sloan Kettering Cancer Center (MSKCC)5,6,9,10,26–28. The combined cohort consists of 10 187 (87.4%) non-hypermutated MSS tumours and 1 473 (12.6%) MSI-Hypermutators. There are large differences between these groups across every level of analysis and we stratify by subtype in every analysis presented below. In addition, a subset of the cases is used for each analysis depending on the completeness of the clinical meta-data for each patient and the genes assessed for mutations in each study, as some studies performed targeted sequencing of gene panels.
dNdScv meta-analysis identifies 112 recurrently mutated genes
To look for evidence of positive selection, we used dNdScv (v.0.0.1.0)14 to estimate the ratio of the mutation rate at non-synonymous and synonymous sites (dN/dS) after correcting for trinucleotide context and genomic covariates affecting the mutation rate, including chromatin state, expression level and replication timing (Supplementary Note). We calculated dN/dS ratios for all protein coding genes separately for MSI and MSS tumours in each study (Fig. 2), excluding MSKCC and AC-ICAM because they have not made synonymous mutations available (Methods). We combined the P-values for each gene in MSI and MSS tumours separately using Stouffer’s Z-score method, assigning to each study a weight proportional to the square root of the sample size. To guard against genes driven by study-specific artifacts, we required at least three studies to have P < 0.1 to claim a gene as significant. We also removed genes that showed an enrichment of indels but no evidence for an enrichment of single nucleotide variants annotated as truncating mutations (Methods).
After Bonferroni correction for multiple testing, 112 out of 20 091 protein coding genes tested showed dN/dS ratios significantly greater than 1 (Supplementary Table S2) in the MSS (87 genes) and MSI (52 genes) groups of tumours, indicating positive selection of mutations in these genes (Figs. 2 and 3a). In the MSS and MSI tumours, 45 and 27 genes, respectively, reach significance in the meta-analysis but not in any one individual study (Fig. 3b,c), highlighting the power of combining data across multiple cohorts. We excluded 8 (MSS) and 6 (MSI) genes which only reached significance in a single study, since these may represent study-specific artifacts or mutations only selected in particular demographic groups.
We compared the list of significant genes to the IntOGen and COSMIC databases and found that 37 and 26 genes significant in the MSS and MSI groups, respectively, are linked with CRC in neither of these consensus databases (Supplementary Figure S1). In the MSS group, this includes genes with well-established roles in other cancers13, like IDH1 (p.R132 hotspot), CTCF, PPM1D, KLF5 and ARID2, which we show can also drive colorectal cancer. Other genes belong to established CRC pathways. For example, we found an enrichment of nonsense mutations in RASA1, a regulator of RAS/MAPK signalling, and in MLK4 and MAP2K7, two mitogen-activated protein kinases. Members of the transforming growth factor beta (TGFb) pathway, including TGFBR1, ACVR1B and RGMB, also reached significance in this study. Finally, the list of significant genes not in IntOGen or COSMIC includes several genes involved in transcriptional regulation, including SIN3A, ZBTB7A, TRPS1 and TNRC6B. Among the MSI group of tumours, many significant genes not listed in IntOGen or COSMIC are involved in antigen presentation or downstream apoptotic response, including CD58, HLA-B, HLA-C, NLRC5, RFX5 and CASP8. Although not listed in IntOGen or COSMIC, mutations affecting the antigen presentation machinery are known to associate with MSI CRCs4. Using the Cancer Dependency Map21,22, we tested whether driver genes represent dependencies in CRC cell-lines and found this to generally not be the case (Supplementary Table 2c).
Comparing the selection landscape between MSS and MSI tumours
Statistical power for driver gene detection varies drastically between the MSS and MSI cohorts. Not only is the sample size for MSI tumours much smaller, but the high background mutation rate of this tumour type reduces dN/dS estimates genome-wide compared to MSS tumours. Despite these limitations, it seems clear that the selection landscapes of MSS and MSI tumours overlap considerably, with 27 genes fulfilling criteria for significance in both groups (Fig. 4a). Furthermore, out of the 60 genes reaching significance in MSS and not in the MSI cohort, 40 have P < 0.05 in the MSI cohort, suggesting mutations in these genes may be positively selected in both CRC subtypes, although they only achieve statistical significance in the MSS cohort.
We compared the selection of mutations in MSS and MSI tumours in two ways. First, we compared the dN/dS ratios for missense and truncating mutations (Supplementary Figure S2a and b). However, dN/dS estimates are expected to be generally lower for MSI tumours owing to the high background mutation burden. To formally compare the number of driver mutations per MSS and MSI tumour after accounting for the differences in mutation burden and trinucleotide mutation spectrum, we used the Coselens software (v. 1.0)23. Coselens extends dNdScv to estimate the number of driver mutations in each of the 112 significant genes and tests for differences between two groups of tumours using a likelihood ratio test.
Besides mutations in APC, KRAS and TP53, which are known to be more common in MSS than MSI tumours, and mutations in components of the antigen precenting machinery, such as CD58, HLA-B, HLA-C, NLRC5 and RFX5, which are known to facilitate immune escape in MSI tumours4, we identified several other genes showing evidence of different magnitude of selection between MSS and MSI tumours (Fig. 4b, Supplementary Figure S2c and d, Supplementary Table S3).
The most striking example is PCBP1, for which we observed 134 tumours carrying a missense mutation (p.L100[PQR]) in a gain-of-function hotspot29. All 134 tumours in which these mutations are found are MSS (Supplementary Figure S3), indicating MSS-specific selection. PCBP1 encodes Poly(RC)-Binding Protein 1 which binds to mRNA and regulates transcription of multiple genes. There are several other examples of genes involved in transcriptional regulation, including ZBTB7A, KLF5, CHD4, TBX3 and SIN3A, which have dN/dS ratios > 1 in MSS tumours but show no evidence of selection in MSI tumours.
Genes mapping to canonical CRC pathways (WNT-signalling, TGFb, RAS/MAPK, TP53), tend to show dN/dS ratios either significantly or nominally significantly greater than 1 in both MSS and MSI tumours. However, our analysis reveals preferences for particular genes within these pathways in each tumour type. In the WNT pathway for example, we observe more truncating drivers in APC, TCF7L2 and AMER1 in MSS tumours while ZNRF3, CTNNB1, BCL9L and RNF43 harbour more drivers per MSI tumour. In the RAS/MAPK pathway, KRAS and NRAS have more drivers per MSS tumour while BRAF, MAP2K7 and MAP3K1 have more drivers per MSI tumour. Finally, several genes mapping to the TGFb pathway show more drivers in MSI than in MSS tumours, including BMPR2, ACVR1B and TGFBR2.
Mutational co-occurrence meta-analysis
When a driver mutation arises, it’s effect on the fitness of the host clone depends, among other things, on the set of mutations already carried by the clone. Mutations can interact synergistically to increase the fitness of (pre)cancerous clones or interfere with one another such that the total fitness increase is less than the sum of the gain expected from each driver in isolation. To look for evidence of mutational co-occurrence or exclusivity, we used the Coselens software (v. 1.0)23. Coselens extends dNdScv to compare the number of driver mutations in a gene (query gene) depending on the mutation status of a second gene (split gene) while controlling for systematic biases in tumour mutation load and spectrum (Supplementary Note).
We identified 54 gene-pairs for MSS and 9 for MSI that passed Bonferroni correction for multiple testing (Fig. 5, Supplementary Table S4, Methods). Among these are gene pairs known to exhibit patterns of mutual exclusivity (for example KRAS and BRAF or NRAS and TP53-ATM) and co-occurrence (for example BRAF-RNF43) but also gene pairs that are less well established, including co-occurrence of mutations in KRAS and PCBP1 and of mutations in PIK3CA and CHD4 in MSS tumours. Stratifying by MSI status reveals patterns of conditional selection seemingly specific to MSS tumours, including mutual exclusivity of mutations in PIK3CA/KRAS and TP53. Among MSI tumours, there is a dearth of MLH1 mutations among BRAF mutated tumours and an enrichment of MSH2 mutations among KRAS mutant tumours. The former may result from the co-occurrence of BRAFV600E and hypermethylation of the MLH1 promoter in the CpG island methylator phenotype (CIMP) and the latter from the double mutual exclusivity between MLH1/MSH2 on one hand and BRAF/KRAS on the other.
Molecular epidemiology of driver mutations
Driver point-mutations are often acquired early in life and have influenced the fitness of the normal cells from which the cancer develops decades before diagnosis30. This likely manifests as differences in the mutation frequency of tumours depending on demographic and epidemiological factors which influence selection of pre-neoplastic cells. To look for associations between mutations and tumour sub-groups, we regressed mutation status on sex, age and tumour location (proximal colon, distal colon, rectum) per gene identified in the dN/dS analysis with > 50 mutated tumours (Supplementary Figures S4 and S5, Supplementary table S5). We tested 79 and 95 genes in the MSS and MSI groups, respectively. Coefficients were considered significantly different from zero if P < 2.9E-4 (0.05/(79 + 95)). We note that in these, and all subsequent analyses of this manuscript, the MSKCC and AC-ICAM cohorts can be included as only non-synonymous coding mutations are studied.
Mutations in three genes, BRAF, KRAS and USP9X, associate with CRC in a sex dependent manner (Fig. 6a). Consistent with previous reports31, BRAF and KRAS interact with sex in both MSS and MSI tumours. Interestingly KRAS mutations are more common in men with MSI tumours but in women with MSS tumours. BRAF mutations are more commonly found in women in both groups. USP9X is only significantly associated with sex in MSI tumours, where USP9X mutations are more common in women.
Mutations in ten genes interact with age in our models and some show large differences between MSS and MSI tumours (Fig. 6b). APC, KRAS and TP53 each show significant or nominally significant but opposing age effects in the two groups while BRAF, CTNNB1, MLH1, RNF43 and SLITRK1 associate with age only in the MSI group and ZFP36L2 only in the MSS group. FBXW7 is the only gene showing a comparable age effect between MSI and MSS tumours. Taken together, these results highlight the importance of stratifying by MSI status when studying age effects in CRCs.
The location of the primary tumour within the intestine has a major effect on the mutation frequencies of individual genes. Location interacts with mutations in 18 genes in MSS tumours after correcting for age and sex (Fig. 6c). In addition, mutations in ERBB3 (both overall and when restricting to oncogenic hotspots) are more common in MSI tumours arising in the rectum than in proximal colon but show no location effect in MSS tumours. Several genes which interact with location in MSS tumours, including BRAF, TP53, TCF7L2 and FBXW7, show suggestive associations for MSI tumours (P < 0.05) and effect sizes comparable to MSS tumours, suggesting that location effects may be consistent in both subtypes but statistical power to discover location effects is greater in the MSS group.
We next tested whether pathological stage associates with mutations in particular genes after correcting for sex, age and tumour location among 6 533 MSS and 1 098 MSI tumours with available stage information. Mutations in TP53, BRAF and SMAD4 are increasingly prevalent at more advanced tumour stages in MSS tumours and show a non-significant trend in the same direction in MSI tumours (Fig. 6d and e).
MSS mucinous tumours show an enrichment of mutations in the TGFb pathway
Most CRCs have some degree of extracellular mucinous component visible during histological assessment. When mucin accounts for > = 50% of the tumour volume, the cancer is said to have a mucinous histology. This histological subtype accounts for 10–15% of all CRCs and is more common among the MSI subtype32. In our cohort, 30% and 7.5% of MSI and MSS tumours have mucinous histology, respectively. We tested if mucinous histology associates with mutation status separately for MSI (N = 236 mucinous, 563 non-mucinous) and MSS tumours (N = 395 mucinous, 4 880 non-mucinous) adjusting for age, sex and tumour location. We found nine genes which significantly associate with mucinous histology in MSS tumours only. As previously described, mutations in KRAS are more common and mutations in TP53 are less common in tumours with mucinous histology33,34 (Fig. 7a and b). Strikingly, out of the seven additional genes associated with mucinous histology, six belong to the TGFb pathway. Mutations in SMAD4, SMAD3, SMAD2, ACVR1B, TGFBR1 and IDH1 are each significantly more common in MSS mucinous tumours. 48% of mucinous MSS tumours carry a mutation in one of these genes compared to 19% of non-mucinous MSS tumours and if additional genes affecting the TGFb pathway are considered (ACVR2A, TGFBR2, BMPR2), the fraction of mucinous MSS tumours that carry a mutation rises to 64% vs. 33% in non-mucinous tumours. This enrichment is only seen for MSS tumours but we note that MSI-tumours almost universally (~ 95%) carry a point mutation in one or more of the TGFb pathway genes listed above, which may help to explain why a mucinous phenotype is seen in such a high fraction of MSI tumours.
Mucinous histology is a somewhat subjective binary classification of tumours. To derive an objective, quantitative estimate of mucin levels in tumours, we used a previously published convolution neural network model35 to perform automated tissue decomposition into eight tissue classes in 1 864 (NMSS=1 645, NMSI=219) haematoxylin and eosin stained tissue slides from the deCODE-CRC cohort17 (Methods, Fig. 7c). We tested if quantitative mucin levels predict mutation status of any of the nine genes that associated with mucinous histology in a model which included binary mucinous status, as defined by a pathologist, as a covariate. Mucin levels associate with mutation status in KRAS (OR = 1.33 (1.17–1.50 95% CI), P = 7.3E-6), TP53 (0.71 (0.61–0.82, 95% CI), P = 6.3E-6), SMAD4 (OR = 1.50 (1.22–1.85, 95% CI), P = 6.7E-5) and SMAD2 (OR = 1.74 (1.14–2.64, 95% CI), P = 7.4E-3) (Fig. 7d). Out of the nine genes associated with mucinous histology, these four are among the most commonly mutated in the cohort and the genes for which we have the greatest statistical power to detect an association with mucin levels.
Driver mutations predict overall survival
Six of the cohorts report overall survival from diagnosis. We tested for associations between mutations and overall survival after correcting for tumour stage, sex, age and location of the tumour within the colon. In agreement with previous reports36–38, we found that among MSS tumours (N = 4 814), mutations in APC are consistently associated with improved prognosis (HR = 0.74, (0.67–0.82 95% CI), P = 1.6E-9) while mutations in BRAF associate with worse prognosis (HR = 1.41, (1.23–1.61, 95% CI), P = 1.1E-6) (Supplementary Figures S6 and S7). The association with BRAF was further strengthened when we tested BRAF V600E carriers against wild type (HR = 1.64 (1.40–1.92, 95% CI), P = 8.9E-10). In addition to APC and BRAF, mutations in RNF43 are associated with worse prognosis in MSS tumours in our dataset (HR = 1.66 (1.36–2.02, 95% CI), P = 4.6E-7, Fig. 8a and b). No gene reached significance in the MSI group after correction for multiple testing.
Discussion
Discussion
We combined somatic point-mutation data from a large Icelandic CRC cohort with publicly available mutation calls to derive a list of driver genes and test their association with clinical variables and prognosis. We compiled a list of 112 genes that show strong evidence of being CRC driver genes across multiple studies of patients of diverse ancestries and demographics and which used separate bioinformatic pipelines for mutation calling and filtering. Many genes reported as CRC drivers in COSMIC and/or IntOGen do not reach significance in any of the studies included herein or in the combined analysis (Fig. 3b and c). This includes several ‘Tier 1’ hallmark genes such as CUX1, EIF3E, HIF1A, GRIN2A, PTPRT, QKI and more. COSMIC is based partially on expert manual curation of the literature, which may include cell-line and mouse data11,39. Our data demonstrates that several of these genes do not show evidence of selection in humans in vivo based on point mutations, although some may be recurrently affected by copy-number variants. Our list of 112 genes is notably smaller than the list of 193 drivers reported in a recent publication by Cornish et al.40, a large analysis of 2 023 whole-exome sequenced CRCs, in which driver genes are identified by applying seven different driver detection methods. Using the union of genes detected by multiple methods is a good approach for maximizing sensitivity but we believe our list of genes showing evidence of selection across multiple independent studies may offer greater specificity. The usefulness of each approach depends on the downstream analysis.
Our comparisons of MSS and MSI tumours suggest that while mutations in immune escape genes and some transcriptional regulators are favoured only in MSI or MSS tumours, respectively, mutations affecting the WNT, TGFb, TP53 or RAS/MAPK signalling are usually selected for in both subtypes. However, each subtype shows preferences for particular genes within the pathway. Although the list of cancer genes overlaps between MSS and MSI tumours, we observed often contrasting relationships between mutation status and demographic variables in these CRC subtypes. Particularly striking is the behaviour of KRAS mutations that significantly correlate with sex, age and tumour site and in each case the direction of effect is different in MSI from that in MSS tumours (Fig. 6). These associations are driven by a handful of mutation hotspots and are highly unlikely to result from differences in mutation rate alone. More probably, they reflect differences in the magnitude of the fitness advantage conferred by these mutations under different physiological conditions.
Only a minority of CRCs present with mucinous histology and those that do are more often of the MSI subtype. As a result, individual studies often have few MSS mucinous tumours and are poorly powered to detect associations between mucinous histology and molecular features. By meta-analysing data from multiple studies, we could achieve sufficiently large sample size to detect a strong correlation with mutations in genes belonging to the TGFb pathway in MSS mucinous tumours. The mutations also associate with mucin levels after accounting for binary mucinous status. It remains unclear if this means that TGFb mutations promote mucin production or if cells carrying TGFb mutations enjoy a selective advantage in a highly mucinous environment. In contrast with past reports, we did not find an enrichment of BRAF mutations in mucinous tumours34. We hypothesise that this spurious association is driven by the pooling of MSI and MSS CRCs for analysis as both BRAF mutations and mucinous histology are found more commonly in MSI than MSS tumours.
MSI status and tumour stage are the strongest predictors of prognosis in CRC. We show that mutations in APC associate with better prognosis while BRAF and RNF43 associate with worse prognosis, only in MSS tumours. We caution that failure to stratify by MSI status can lead to associations between mutation status and prognosis that are driven by the different mutation frequencies in MSS and MSI tumours. For example, one of the studies included in this analysis reported an association between mutations in TP53 and poor prognosis in an unstratified cohort41. Our results suggest this is driven by a higher fraction of MSS cancers being TP53 mutant and MSS cancers generally having worse prognosis, as mutant TP53 does not associate with overall survival in MSS tumours.
Our study has several limitations. First, it focuses exclusively on point-mutations and does not cover structural variants or epigenetic modifications, which are common drivers in cancers30. Second, two of the studies have not made their “silent” mutations publicly available and therefore could not be used in the dN/dS based analyses. Third, the studies have each used their own bioinformatic pipeline and filtering strategy. That the data have not been homogenously processed is both a weakness and a strength of our study. Different filtering strategies can lead to artificial differences in mutation rate between studies but requiring broad support (P < 0.1 in at least three studies) helps to protect against study-specific artifacts in the dN/dS analysis. Finally, although we have taken several steps to reduce the impact of artefactual indel calls and different frequencies of indels in MSS and MSI tumours on our analyses, comparisons of genes commonly affected by indels between MSS and MSI tumours must be done with caution.
Many genes vary in their mutation frequency by demographic factors such as sex and age at diagnosis. These differences, which include codon-specific oncogenic alterations like BRAF V600E and KRAS G12X, are highly unlikely to result from differences in mutation rate alone. Clones of somatic cells participate in a Darwinian competition throughout life. Differences in driver mutation frequencies by site, sex, age etc. likely reflect, at least in part, variation in the magnitude of selective advantage conferred by functionally equivalent mutations under different physiological conditions. Furthering our understanding of the selection environment and mechanisms in normal colon, which today remains rudimentary42, may enable the development of novel cancer prevention interventions with the purpose of enabling normal cells to outcompete and eliminate early neoplastic lesions43 or otherwise steering the evolution of cell clones away from cancer development44.
We combined somatic point-mutation data from a large Icelandic CRC cohort with publicly available mutation calls to derive a list of driver genes and test their association with clinical variables and prognosis. We compiled a list of 112 genes that show strong evidence of being CRC driver genes across multiple studies of patients of diverse ancestries and demographics and which used separate bioinformatic pipelines for mutation calling and filtering. Many genes reported as CRC drivers in COSMIC and/or IntOGen do not reach significance in any of the studies included herein or in the combined analysis (Fig. 3b and c). This includes several ‘Tier 1’ hallmark genes such as CUX1, EIF3E, HIF1A, GRIN2A, PTPRT, QKI and more. COSMIC is based partially on expert manual curation of the literature, which may include cell-line and mouse data11,39. Our data demonstrates that several of these genes do not show evidence of selection in humans in vivo based on point mutations, although some may be recurrently affected by copy-number variants. Our list of 112 genes is notably smaller than the list of 193 drivers reported in a recent publication by Cornish et al.40, a large analysis of 2 023 whole-exome sequenced CRCs, in which driver genes are identified by applying seven different driver detection methods. Using the union of genes detected by multiple methods is a good approach for maximizing sensitivity but we believe our list of genes showing evidence of selection across multiple independent studies may offer greater specificity. The usefulness of each approach depends on the downstream analysis.
Our comparisons of MSS and MSI tumours suggest that while mutations in immune escape genes and some transcriptional regulators are favoured only in MSI or MSS tumours, respectively, mutations affecting the WNT, TGFb, TP53 or RAS/MAPK signalling are usually selected for in both subtypes. However, each subtype shows preferences for particular genes within the pathway. Although the list of cancer genes overlaps between MSS and MSI tumours, we observed often contrasting relationships between mutation status and demographic variables in these CRC subtypes. Particularly striking is the behaviour of KRAS mutations that significantly correlate with sex, age and tumour site and in each case the direction of effect is different in MSI from that in MSS tumours (Fig. 6). These associations are driven by a handful of mutation hotspots and are highly unlikely to result from differences in mutation rate alone. More probably, they reflect differences in the magnitude of the fitness advantage conferred by these mutations under different physiological conditions.
Only a minority of CRCs present with mucinous histology and those that do are more often of the MSI subtype. As a result, individual studies often have few MSS mucinous tumours and are poorly powered to detect associations between mucinous histology and molecular features. By meta-analysing data from multiple studies, we could achieve sufficiently large sample size to detect a strong correlation with mutations in genes belonging to the TGFb pathway in MSS mucinous tumours. The mutations also associate with mucin levels after accounting for binary mucinous status. It remains unclear if this means that TGFb mutations promote mucin production or if cells carrying TGFb mutations enjoy a selective advantage in a highly mucinous environment. In contrast with past reports, we did not find an enrichment of BRAF mutations in mucinous tumours34. We hypothesise that this spurious association is driven by the pooling of MSI and MSS CRCs for analysis as both BRAF mutations and mucinous histology are found more commonly in MSI than MSS tumours.
MSI status and tumour stage are the strongest predictors of prognosis in CRC. We show that mutations in APC associate with better prognosis while BRAF and RNF43 associate with worse prognosis, only in MSS tumours. We caution that failure to stratify by MSI status can lead to associations between mutation status and prognosis that are driven by the different mutation frequencies in MSS and MSI tumours. For example, one of the studies included in this analysis reported an association between mutations in TP53 and poor prognosis in an unstratified cohort41. Our results suggest this is driven by a higher fraction of MSS cancers being TP53 mutant and MSS cancers generally having worse prognosis, as mutant TP53 does not associate with overall survival in MSS tumours.
Our study has several limitations. First, it focuses exclusively on point-mutations and does not cover structural variants or epigenetic modifications, which are common drivers in cancers30. Second, two of the studies have not made their “silent” mutations publicly available and therefore could not be used in the dN/dS based analyses. Third, the studies have each used their own bioinformatic pipeline and filtering strategy. That the data have not been homogenously processed is both a weakness and a strength of our study. Different filtering strategies can lead to artificial differences in mutation rate between studies but requiring broad support (P < 0.1 in at least three studies) helps to protect against study-specific artifacts in the dN/dS analysis. Finally, although we have taken several steps to reduce the impact of artefactual indel calls and different frequencies of indels in MSS and MSI tumours on our analyses, comparisons of genes commonly affected by indels between MSS and MSI tumours must be done with caution.
Many genes vary in their mutation frequency by demographic factors such as sex and age at diagnosis. These differences, which include codon-specific oncogenic alterations like BRAF V600E and KRAS G12X, are highly unlikely to result from differences in mutation rate alone. Clones of somatic cells participate in a Darwinian competition throughout life. Differences in driver mutation frequencies by site, sex, age etc. likely reflect, at least in part, variation in the magnitude of selective advantage conferred by functionally equivalent mutations under different physiological conditions. Furthering our understanding of the selection environment and mechanisms in normal colon, which today remains rudimentary42, may enable the development of novel cancer prevention interventions with the purpose of enabling normal cells to outcompete and eliminate early neoplastic lesions43 or otherwise steering the evolution of cell clones away from cancer development44.
Supplementary Information
Supplementary Information
Below is the link to the electronic supplementary material.
Below is the link to the electronic supplementary material.
출처: PubMed Central (JATS). 라이선스는 원 publisher 정책을 따릅니다 — 인용 시 원문을 표기해 주세요.
🏷️ 같은 키워드 · 무료전문 — 이 논문 MeSH/keyword 기반
- A Phase I Study of Hydroxychloroquine and Suba-Itraconazole in Men with Biochemical Relapse of Prostate Cancer (HITMAN-PC): Dose Escalation Results.
- Self-management of male urinary symptoms: qualitative findings from a primary care trial.
- Clinical and Liquid Biomarkers of 20-Year Prostate Cancer Risk in Men Aged 45 to 70 Years.
- Diagnostic accuracy of Ga-PSMA PET/CT versus multiparametric MRI for preoperative pelvic invasion in the patients with prostate cancer.
- Association of patient health education with the postoperative health related quality of life in low- intermediate recurrence risk differentiated thyroid cancer patients.
- Early local immune activation following intra-operative radiotherapy in human breast tissue.