SubNExT: Towards accurate, efficient and robust gene expression classification for breast cancer subtyping.

Computational and Structural Biotechnology Journal, 2026, Vol. 31, pp. 412-421

Paygambar K, Jean-Marie R, Mziou-Sallami M, Meyer V


DOI: https://doi.org/10.1016/j.csbj.2025.12.027
PMID: 41608407

Abstract

Optimizing genomic-based molecular subtyping is key to promoting personalized medicine. While neural networks face setbacks in tabular data modeling, deep learning has undergone groundbreaking advances across multiple domains, catalyzing further breakthroughs across AI applications. New neural network architectures exhibit enhanced performance, efficiency, and robustness, which could benefit the genomic use-case. In this study, we introduce SubNExT, an optimized shallow CNN with a ConvNeXt backbone using t-SNE- and DeepInsight-based 2D-converted gene expression for breast cancer subtyping. It was compared with other modeling strategies for gene expression data, by optimizing a Transformer, an MLP and XGBoost on unconverted values, a 1D CNN (NeXt-TDNN) on ordered values, and a ViT as an alternative for 2D-converted expression. During evaluation, SubNExT obtains an accuracy of 87.12%, matching the state-of-the-art XGBoost and its 87.24% accuracy at the top of the benchmark. SubNExT achieves this performance with just 76k parameters and the shortest training time, as well as the best stability and robustness among all considered approaches. By providing accurate, efficient and robust molecular subtyping of breast cancer using gene expression data, SubNExT and its design principles catalyze deep learning adoption in oncogenomics.



1
Introduction
Omics are a key source of information for the description of tissue and cell functioning mechanisms [1]. With the increasing incidence of cancer, oncogenomics is positioned as a leading field of research [2]. For tabular data modeling, tree-based models tend to yield better performance, especially gradient-boosted trees which are less affected by uninformative features [3]. While neural networks have become a reference for multiple applications, they still struggle with the irregular patterns, heterogeneity and sparsity of tabular data [4]. Notably, omics datasets are subject to the curse of dimensionality, with generally more features than samples [5], posing a generalization and efficiency challenge. To tackle this issue, different strategies have been developed in genomics and epigenetics. This includes various types of deep learning models such as fully-connected, convolutional, or Transformer neural networks, for a wide variety of applications. While related research evaluates classification performance, we find that research work related to robustness and frugality evaluation remains limited in the context of genomics.
This article presents SubNExT, an accurate, efficient and robust method for breast cancer subtype prediction using gene expression tables. SubNExT employs ConvNeXt blocks [6], for t-SNE and DeepInsight-converted expression. This approach is evaluated against established methods, including Multi-Layer Perceptron (MLP) and XGBoost [7], as well as newer architectures such as a 1D Convolutional Neural Network (CNN) with a NeXt-TDNN backbone [8], Transformer-based models (TNNs) with MLP and Fourier embedding, and Vision Transformers (ViTs) [9]. All models are fine-tuned using automated Bayesian hyperparameter optimization (HPO) to ensure a fair evaluation. To the best of our knowledge, this work is among the first efforts to jointly evaluate different strategies for gene expression modeling, with alternatively raw, sequential and 2D-converted data. This study is also one of the first to benchmark models simultaneously on classification performance, stability, efficiency, and robustness in a genomic context.

2
Related works
Machine Learning (ML) [10] and Deep Learning (DL) [11] are active fields of research, driven by the increasing availability of data and the evolution of computer architectures. In oncogenomics, ML and neural networks have mainly targeted cancer classification and survival prediction, with omics such as gene expression or miRNAs [12] (Table 1).
Mazlan et al. [13] propose a survey showcasing models for different cancer-related objectives, giving insights into the advantages and drawbacks of each method. They find that the Support Vector Machine (SVM) delivers a high performance level but is rather slow to train. While Random Forest (RF) offers the ability to choose the number of trees, it tends to be computationally expensive as model complexity increases. Naive Bayes (NB) and K-Nearest Neighbors (KNN) classifiers are less preferable because of their lack of performance. In a study on pan-cancer staging, Ma et al. [14] compare XGBoost with other ML classifiers, including SVM and RF, as well as an MLP. XGBoost ranked as the best performing method overall, while also providing explainability insights on the most used features.
MLPs [23], historically the first type of NNs, use fully-connected layers of neurons and are powerful approximators [24]. However, they typically suffer from high computational costs and are agnostic to data structure. In one of the earliest deep learning works for genomics, Mateos et al. [25] evaluate MLPs as a proof-of-concept to determine gene functional classes with neural networks. Working on microarray expression, they highlight poor performance due to input noisiness. In another study, Ahn et al. [26] use multiple databases to compare MLPs with Support Vector Machines (SVMs) and regression methods. Neural networks provided better performance than classical ML alternatives.
Convolutional layers [27], developed for computer vision and signal processing, improve efficiency with locally shared weights. Working on breast cancer subtyping, using gene expression and copy number alteration data, Islam et al. [17] optimized a CNN to achieve the best result in their study.
Transformers introduced an attention-only multi-head design, enabling them to efficiently compute global dependencies [28]. Developed for natural language processing, their ability to capture multiple scales of complex relationships in parallel has also found usage in computer vision and signal processing [29]. Zhang et al. [20] adapted them for molecular breast cancer subtyping. They utilized their ability to capture long-range genomic interactions, enabling the model to achieve the best results in their benchmark. By using attention scores for interpretability, combined with pathway analysis, they were able to give biological relevance to the model’s decision-making. Building on CNN and TNN developments, Khan & Lee [21] proposed a hybrid architecture, using a CNN for efficient feature extraction and attention mechanisms for effective cancer subtype classification. Using an explainability library, they found that the model selects genes found to be involved in the pathology. In a comparative evaluation, their approach outperforms reference methods.
Omics tables can also be converted to 2 dimensions. For instance, DeepInsight [30] leverages dimensionality reduction methods to map input genes to pixels, enabling the use of computer vision networks. Gokhale et al. [22] modify DeepInsight with an autoencoder for dimensionality reduction [31], and adapt the number of channels to account for features mapped to the same coordinates. This approach yields the best performance in their cancer classification study when combined with a ViT. Working on breast cancer detection, Mohammed et al. [18] take another approach by reshaping the most expressed genes into an image. Their best result is achieved with a CNN coupled with the Ebola optimization search algorithm. Another alternative for 2D conversion is to exploit protein-protein interaction networks for cancer type prediction [19]. After mapping genes to corresponding proteins, the resulting graph can be converted to 2D using diagonal and adjacency matrices. However, they found limitations to this approach, mainly regarding interpretability.
Alharbi & Vakanski [32] analyze the landscape of gene expression modeling in a literature review. While confirming the wide usage and performance of SVMs, RFs and MLPs, the review highlights that CNNs' advantages can be leveraged through 2D conversion. They also identify Transformers as an efficient alternative to RNNs. Relying on network properties, Graph Neural Networks (GNNs) require additional feature engineering and data, which restricts their usage.
In this study, we optimize multiple recent neural network architectures for breast cancer subtyping using gene expression. This work includes SubNExT, a CNN with a ConvNeXt backbone, NeXt-TDNN, Transformers and ViTs, which are compared to XGBoost and MLP. Models are evaluated in terms of classification performance, efficiency, stability, and robustness. We highlight SubNExT for its accuracy, robustness and efficiency.

3
Materials and methods
3.1
Data
The dataset [33] used in this work originates from TCGA's cancer-related cohorts [34]. Given the prevalence of its usage in related works [35], we use gene expression data to predict breast cancer subtypes. In total, 20,531 expression values for 1247 breast cancer patient samples are available. The PAM50 [36] test serves as ground truth, describing 5 breast cancer subtypes based on 50 reference genes. 950 samples with available PAM50 subtypes and gene expression data are used in this study. To predict molecular subtypes, gene expression from all available genes is included in the pipeline. Data is scaled using either min-max normalization or z-score standardization, depending on the model's downstream performance (Table 4). Min-max normalization, based on each feature's maximum and minimum values, has the advantage of preserving feature relations but is sensitive to outliers' extreme values. Z-score standardization, using the mean and standard deviation, is less affected by outliers but is impaired by distribution changes on unseen data [37]. Finally, as breast cancer subtyping is an imbalanced classification problem, with respectively 45% Luminal A, 20% Luminal B, 15% Basal-like, 12% Normal-like and 7% HER2 samples, random oversampling, in its default imblearn [38] implementation, was added to mitigate this issue.
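As a minimal sketch of the preprocessing described above, the two scaling options and default random oversampling (duplicating minority-class samples up to the majority count, as imblearn's `RandomOverSampler` does) can be written in NumPy; the function names here are ours, not the study's:

```python
import numpy as np

def min_max_scale(X):
    """Per-feature min-max normalization to [0, 1]; preserves feature relations
    but is sensitive to outliers' extreme values."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / np.where(mx - mn == 0, 1, mx - mn)

def z_score(X):
    """Per-feature standardization; less affected by outliers."""
    sd = X.std(axis=0)
    return (X - X.mean(axis=0)) / np.where(sd == 0, 1, sd)

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until every class matches the majority
    count, mimicking imblearn's default RandomOverSampler behavior."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        idx.extend(members)
        idx.extend(rng.choice(members, target - members.size, replace=True))
    idx = np.asarray(idx)
    return X[idx], y[idx]
```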
Data is split 50/50 between the optimization and evaluation datasets to avoid leaking samples used in the optimization phase. The optimization set is further split 80/20 for XGBoost training and testing, and 70/10/20 into training, validation and testing for neural networks. The evaluation set is split with a 5×5-fold stratified cross-validation to gauge model performance when training with different sample configurations (Fig. S1).
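The overall split protocol can be sketched with scikit-learn; seeds, names, and the toy data below are illustrative, not the study's actual configuration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def split_protocol(X, y, seed=0):
    """50/50 optimization/evaluation split, then 5 repeats of stratified 5-fold
    cross-validation on the evaluation half (sketch of the protocol)."""
    X_opt, X_eval, y_opt, y_eval = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    folds = []
    for repeat in range(5):
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + repeat)
        folds.extend(skf.split(X_eval, y_eval))  # 5 x 5 = 25 (train, test) pairs
    return X_opt, (X_eval, y_eval), folds
```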

3.2
Feature engineering
To optimize model performance, adequate feature engineering has been considered depending on the architectural characteristics (Fig. S2).
Enabling the use of 2D models, such as ConvNeXt and ViT, DeepInsight [30] maps expression arrays to a 2D space (Fig. S3) using t-distributed stochastic neighbor embedding (t-SNE) [39] to preserve the distances between points. The method was parameterized closely to its original implementation (Table S1). As NeXt-TDNN is intended for sequential data, the Entrez [40] database enables ordering genes according to their locus.
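A simplified DeepInsight-style conversion can be sketched as follows. This is our approximation of the idea (genes embedded with t-SNE using their expression profiles, then rasterized onto a pixel grid, averaging genes that collide on the same pixel); it is not the parameterization of Table S1:

```python
import numpy as np
from sklearn.manifold import TSNE

def expression_to_image(X, side=32, seed=0):
    """Map an (n_samples, n_genes) expression table to (n_samples, side, side)
    images: t-SNE embeds each gene from its expression profile across samples,
    coordinates are scaled to the grid, and each sample's values are written
    into its genes' pixels (collisions averaged)."""
    coords = TSNE(n_components=2, init="random", perplexity=5,
                  random_state=seed).fit_transform(X.T)  # one point per gene
    coords -= coords.min(axis=0)
    coords /= coords.max(axis=0)
    px = np.minimum((coords * (side - 1)).round().astype(int), side - 1)
    images = np.zeros((X.shape[0], side, side))
    counts = np.zeros((side, side))
    np.add.at(counts, (px[:, 0], px[:, 1]), 1)
    for s in range(X.shape[0]):
        np.add.at(images[s], (px[:, 0], px[:, 1]), X[s])
        images[s] /= np.where(counts == 0, 1, counts)
    return images
```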
Additional feature engineering methods were implemented, aiming to further optimize XGBoost. Thus, Principal Component Analysis (PCA), a feature extractor based on correlation [41], and Independent Component Analysis (ICA), which separates inputs into statistically independent components [42], are integrated in the optimization process. Alternatively, we also included non-linear manifold learning techniques, expected to capture non-linear relationships between attributes [43], with Isometric Mapping (ISOMAP) [44] and Uniform Manifold Approximation and Projection (UMAP) [45].

3.3
Transformer and ConvNeXt Adaptation
A selection of neural network architectures has been implemented using TensorFlow [46] (Fig. 1).
Transformers have been leveraged for tabular data modeling, where TabTransformer [47] showed performance comparable to high-performing gradient-boosted trees. Following this development, we also implemented a Transformer based on the original architecture from Vaswani et al. [28], using Transformer encoder blocks with multi-head self-attention (MHSA) and a feed-forward network for classification. Gene expression arrays are composed only of continuous features, which are not completely independent; hence, unlike TabTransformer, we chose to process them in Transformer blocks. Two alternatives for neural feature embedding were implemented: a fully-connected layer as a baseline, and a Fourier embedding block, developed by Gorishniy et al. [48], as an alternative. Our implementation also uses a fully-connected layer with 1 neuron for channel reduction before classification, found to enhance convergence and lower computational costs.
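The tabular Transformer described above can be sketched in Keras: a dense embedding, encoder blocks (MHSA plus feed-forward), and a 1-neuron channel-reduction layer before the softmax head. All sizes here are illustrative, not the tuned values from HPO:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def transformer_classifier(n_genes, n_classes, d_model=64, heads=4, blocks=2):
    """Sketch of the tabular Transformer: dense embedding, encoder blocks with
    multi-head self-attention and a feed-forward network, then a 1-neuron
    channel reduction before classification."""
    inp = layers.Input((n_genes, 1))
    x = layers.Dense(d_model)(inp)  # baseline fully-connected embedding
    for _ in range(blocks):
        a = layers.MultiHeadAttention(heads, d_model // heads)(x, x)
        x = layers.LayerNormalization()(x + a)
        f = layers.Dense(d_model, activation="gelu")(x)
        x = layers.LayerNormalization()(x + layers.Dense(d_model)(f))
    x = layers.Dense(1)(x)          # channel reduction before classification
    x = layers.Flatten()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```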
ConvNeXt [6] aims to improve on previous CNN architectures by integrating Transformer design principles into its block design. First, it only has one activation per block instead of one for each convolutional layer. Then, kernel size strays from the usual 3×3 format: ConvNeXt blocks use both 7×7 kernels, acting as MHSA, and 1×1 kernels serving as the feed-forward network. For SubNExT, we found the best results by first processing the input through a stem [49] block, analogous to our Transformer's dimension reduction and embedding layers. In the stem, MaxPooling and strided convolutions perform dimensionality reduction with minimal computational cost. It is followed by 3 ConvNeXt blocks and a classification head. Previous architectures designed to enhance CNN efficiency were considered, namely Effnet [50] and EfficientNetV2 [51] (Fig. S4). However, when trained on 2D-converted expression data, both models exhibited limited convergence and suboptimal performance (Table S2). We suspect the smaller filter size, one of the main differences with ConvNeXt, is responsible for the performance drops. Thus, we chose the ConvNeXt backbone.
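A minimal SubNExT-like model can be sketched in Keras: a strided-conv + MaxPooling stem, 3 ConvNeXt blocks (7×7 depthwise convolution as the MHSA analogue, 1×1 convolutions as the feed-forward network, a single activation per block), and a classification head. Dimensions are illustrative, not the tuned 76k-parameter configuration:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def convnext_block(x, dim):
    """One ConvNeXt block: 7x7 depthwise conv, LayerNorm, then a 1x1
    inverted-bottleneck MLP with the block's only activation, plus a residual."""
    skip = x
    x = layers.DepthwiseConv2D(7, padding="same")(x)
    x = layers.LayerNormalization()(x)
    x = layers.Conv2D(4 * dim, 1, activation="gelu")(x)  # single activation
    x = layers.Conv2D(dim, 1)(x)
    return layers.Add()([skip, x])

def subnext_sketch(side=100, dim=32, n_classes=5):
    """SubNExT-like model: stem (strided conv + MaxPooling for cheap
    dimensionality reduction), 3 ConvNeXt blocks, classification head."""
    inp = layers.Input((side, side, 1))
    x = layers.Conv2D(dim, 4, strides=2, padding="same")(inp)  # stem
    x = layers.MaxPooling2D(2)(x)
    for _ in range(3):
        x = convnext_block(x, dim)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```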
With ViTs, Dosovitskiy et al. [9] took another approach to adapting the Transformer design to 2D inputs. Each image is divided into patches, which then pass through Transformer blocks alongside positional information. Thus, we also implemented a ViT for DeepInsight-converted data.
Finally, to leverage developments from signal processing, we chose to order genes according to their locus. This induces a positional bias, aiming for the sequential model to capture different interactions compared to other strategies. Ordered inputs are classified with NeXt-TDNN, the 1D adaptation of ConvNeXt from Heo et al. [8], which uses multi-sized filters. Similarly to ConvNeXt, feature extraction is performed with a 1D convolutional stem block.

3.4
Model optimization using Bayesian search and regularization techniques
For hyperparameter tuning, we chose Bayesian optimization as it can be more efficient than alternatives such as random search [52], especially when handling multiple hyperparameter combinations (Table S3). It is implemented with Optuna [53].
First, regarding network depth for Transformers and MLPs, Steck [54] showed that shallow networks outperform deeper alternatives on sparse datasets. Thus, we chose a range of lower values, as gene expression contains many uninformative features [55]. Then, the number of neurons in the dimension reduction and hidden layers is also optimized. These aim at selecting the most relevant features, which in turn should reduce overfitting. Additionally, for all Transformer models, the number and size of attention heads are also optimized. The Fourier embedding layer has specific hyperparameters for the input, hidden and output dimensions, as well as the standard deviation of the weight initialization; these are optimized to promote convergence. For ViTs, the input image is further divided into equal-sized patches, with the patch size as another tuned hyperparameter.
In all networks, regularization promotes sparsity, which has been found to improve performance in similar applications [56]. Lasso regularization (L1) shrinks minor coefficients to 0, while Ridge regularization (L2) constrains the square of the weights in favor of small values [57]. Regularization can further prevent overfitting [58], with dropout [59] randomly deactivating neurons, which has also been shown to improve robustness [60]. In all models, Bayesian optimization adjusts L1 and L2 coefficients as well as the dropout ratio.
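In Keras, the regularizers described above attach directly to a layer; a sketch of a dense head with HPO-tunable L1/L2 penalties and dropout (coefficient values here are illustrative defaults, not tuned ones):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def regularized_head(n_in, n_classes, l1=1e-5, l2=1e-4, dropout=0.3):
    """Dense classification head with the three regularizers adjusted by the
    Bayesian optimization: L1 and L2 weight penalties plus a dropout ratio."""
    inp = layers.Input((n_in,))
    x = layers.Dense(128, activation="gelu",
                     kernel_regularizer=regularizers.l1_l2(l1=l1, l2=l2))(inp)
    x = layers.Dropout(dropout)(x)  # randomly deactivates neurons in training
    out = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```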
Taking inspiration from recent neural network developments, we also tested alternative activation functions in the different architectures. Traditionally, the Rectified Linear Unit (ReLU) was widely used as an activation function. However, because it reduces any negative input to 0, a loss of information can occur that impairs convergence, a problem known as the dying ReLU. Thus, LeakyReLU was proposed as an alternative, where negative inputs are multiplied by a small coefficient instead. Other alternatives have been developed since. Notably used in Transformers and ConvNeXt, the Gaussian Error Linear Unit (GELU) takes a stochastic approach, weighting inputs by the Gaussian cumulative distribution of their value [61]. Alternatively, Swish also generates non-zero outputs for negative values, with a smooth, non-monotonic function. Used in Transformer networks, it has been shown to outperform ReLU in some setups [62]. Finally, while not as widespread as the aforementioned methods, the Parametric Rectified Linear Unit (PReLU) is similar to LeakyReLU, with a learnable parameter for negative inputs [63]. Activation functions were optimized separately from HPO, with choices made depending on model convergence behavior.
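The four fixed-form activations above have compact closed forms (PReLU is LeakyReLU with a learned slope, so it is omitted here):

```python
import numpy as np
from scipy.special import erf

def relu(x):
    """Zeroes all negative inputs (source of the dying-ReLU problem)."""
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):
    """Negative inputs scaled by a small coefficient instead of zeroed."""
    return np.where(x > 0, x, alpha * x)

def gelu(x):
    """Input weighted by the Gaussian cumulative distribution of its value."""
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def swish(x):
    """x * sigmoid(x): smooth, non-monotonic, non-zero for negative inputs."""
    return x / (1.0 + np.exp(-x))
```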
Regarding the training optimizer, different automated gradient descent algorithms are evaluated, namely Stochastic Gradient Descent (SGD), Adam and Root Mean Square Propagation (RMSProp). We also tested AdamW, a variant of Adam with decoupled weight decay [64]. Parameters relative to the optimizer are also managed by HPO, including learning rate, batch size, and weight decay for AdamW. To prevent overfitting, early stopping was also included [65]. The patience rate was defined as 1/5th of the maximum epochs, except for SubNExT where it was disabled due to the low number of epochs (50).
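In Keras this setup amounts to a callback and an optimizer object; the specific values below are illustrative, since the actual learning rate, batch size and decay come from HPO (here, patience = 500 / 5 = 100 epochs):

```python
import tensorflow as tf

max_epochs = 500
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=max_epochs // 5,  # 1/5th of the maximum epochs, as above
    restore_best_weights=True)

# AdamW: Adam with decoupled weight decay (values illustrative, tuned by HPO).
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4)
```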
For a fair model benchmark, we also implemented and optimized reference methods. Tabular data modeling usually involves tree-based methods, among which gradient-boosted decision trees are the best performers. Thus, XGBoost [7], implemented with its dedicated library and optimized with HPO, serves as the gold standard. Alternatively, for deep learning, MLPs are considered the most suited method for tabular data. The implemented reference MLP architecture originates from the work of Franco et al. [66] and was further optimized for improved performance.

3.5
Robustness
Robustness evaluation focuses on model performance when facing a perturbation [67]. Its wide definition can take into account both artificial perturbations, such as adversarial attacks on model prediction [68], and natural perturbations, which include low probability events or distribution shifts [69].
In genomics, adversarial robustness has been studied through random perturbations and adversarial training [70], as well as perturbations in the feature space [71]. Other studies have focused on natural robustness, by generating more realistic error models, for instance by shifting sequences [72], or replicating common errors in sequencing tools [73]. This work aims to reproduce variations during gene expression profiling.
Working on characterizing variability in gene expression analysis, Stupnikov et al. [74] evaluated different gene expression pipelines. They found that noise levels were similar across setups and, to an extent, proportional to expression levels. Following Tarazona et al.'s study [75] on quality assessment for gene expression analysis using a noise simulation, we use a noise level of 20% of the input value [76] (Fig. 2). Perturbations are drawn from a Gaussian distribution. With the standard deviation set to the noise level, the process is repeated 50 times per sample to simulate a range of variations (Fig. S5).
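This noise model is straightforward to sketch: Gaussian perturbations whose standard deviation is 20% of each expression value, drawn 50 times per sample (function name and seed are ours):

```python
import numpy as np

def noisy_replicates(x, noise_level=0.2, n_rep=50, seed=0):
    """Simulate profiling variability for one expression vector x: Gaussian
    noise with std = noise_level * |value|, repeated n_rep times."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, noise_level * np.abs(x),
                          size=(n_rep,) + x.shape)
```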
To also assess potential batch effects, we characterized the global structure of gene expression with a PCA using TCGA metadata. We focused on sample acquisition regions and corresponding subtypes, relying on principal components (PC) PC1 and PC2. First, subtypes and regions account for 10.4% and 8.1% of the total variance, respectively (Fig. S8, Fig. S9). Then, focusing on the regions, while they display different centroids, they also exhibit comparable dispersion within their respective sample distributions on PC1 and PC2 (Table 2). Looking at the PC distribution per region, the South is associated with a distinctly lower mean PC1 score than the other US regions. A Welch t-test comparing the South with the other US sites confirms a significant shift on PC1. Overall, Southern samples behave as a coherent shifted domain in PC space, characterizing a batch effect. Additionally, the South also consists of a more balanced proportion of white (38%), African-American (16%), Latino (16%), and Asian (30%) samples than the rest of the dataset (over 90% white) (Table S4). All these observations support using the South as a subset to illustrate a batch effect, and to further evaluate robustness. Thus, the batch effect is evaluated here by isolating Southern samples from training and testing them separately (Fig. S6).

3.6
Evaluation metrics
Performance is evaluated in regards to classification, stability, efficiency, and robustness.
To measure classification performance, three indicators are considered. First, accuracy (acc) is the proportion of correctly classified samples. Then, the receiver operating characteristic area under the curve (ROCAUC) assesses the model's prediction ranking, providing insights into its ability to distinguish between classes. Finally, the F-score rewards models with a good balance between precision, which measures confidence in predicting a positive sample relative to a class, and recall, which reflects the capacity to find all the positive cases. These metrics are computed with a weighted average [77] to account for class imbalance, implemented with scikit-learn [78].
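With scikit-learn, the three weighted metrics reduce to a few calls (the wrapper function is ours; `y_prob` holds per-class probabilities):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def weighted_metrics(y_true, y_prob):
    """Accuracy, weighted F-score, and weighted one-vs-rest ROCAUC, the three
    classification indicators used in this evaluation."""
    y_pred = y_prob.argmax(axis=1)
    return {
        "acc": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="weighted"),
        "rocauc": roc_auc_score(y_true, y_prob,
                                multi_class="ovr", average="weighted"),
    }
```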
Additionally, to estimate reproducibility over different training conditions, stability is measured as accuracy standard deviation (std) over the cross-validation. Model evaluation is also conducted based on efficiency, represented by training times as well as the parameter count for neural networks.
Model robustness is measured with noise robustness and batch robustness. Noise robustness is defined as the proportion of correctly predicted samples that retain the correct classification under perturbation, and batch robustness as the ratio between the accuracy on certain samples and the accuracy on the rest of the dataset.
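Both definitions are simple ratios and can be computed directly (a sketch; array and function names are ours):

```python
import numpy as np

def noise_robustness(correct_clean, correct_noisy):
    """Fraction of correctly predicted samples that keep the correct label
    under perturbation (boolean arrays over the same samples)."""
    correct_clean = np.asarray(correct_clean, bool)
    return np.asarray(correct_noisy, bool)[correct_clean].mean()

def batch_robustness(acc_batch, acc_rest):
    """Accuracy on the isolated batch over accuracy on the rest of the data."""
    return acc_batch / acc_rest
```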
Finally, the significance of our results is also evaluated. A Shapiro-Wilk test [79] was first performed to verify distribution normality, followed by a pairwise t-test [80] for normally distributed values, or a pairwise Wilcoxon signed-rank test [81] otherwise. To validate stability claims, a pairwise Fisher test [82] is used to evaluate the homogeneity of accuracy variance between models.
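The test-selection logic can be sketched with SciPy (our illustrative wrapper; the pairwise Fisher variance test for stability is not included in this sketch):

```python
import numpy as np
from scipy import stats

def compare_accuracies(a, b, alpha=0.05):
    """Shapiro-Wilk on each accuracy sample, then a paired t-test if both look
    normal, otherwise a Wilcoxon signed-rank test. Returns (test name, p)."""
    a, b = np.asarray(a), np.asarray(b)
    normal = (stats.shapiro(a).pvalue > alpha) and (stats.shapiro(b).pvalue > alpha)
    if normal:
        return "t-test", stats.ttest_rel(a, b).pvalue
    return "wilcoxon", stats.wilcoxon(a, b).pvalue
```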

4
Results
In this study, we optimized new-generation neural network architectures for breast cancer subtyping using gene expression. SubNExT, NeXt-TDNN, and Transformer variations are evaluated in terms of classification performance, stability, efficiency and robustness (Table 3). They are compared with fine-tuned, state-of-the-art reference methods, XGBoost and an MLP. All models train for a maximum of 500 epochs with 100 epochs of patience, except SubNExT, which displayed significantly faster convergence, requiring only 50 epochs. HPO is performed for 300 iterations on the optimization set. Finally, performance is evaluated in a 5×5-fold cross-validation on the rest of the dataset.
First, looking at classification accuracy, SubNExT with its DeepInsight-converted 2D features outperforms all other DNNs. It reaches 87.12% acc, 87.01% F-score and 97.69% ROCAUC. The best Transformer model in this context is the vanilla Transformer, without Fourier feature embedding. It achieves 83.31% acc, 83.33% F-score and 95.52% ROCAUC, against 77.39% acc, 77.37% F-score and 92.77% ROCAUC with Fourier embedding. In comparison, the optimized MLP manages 85.73% acc, 85.71% F-score and 97.01% ROCAUC. Regarding classification performance, only XGBoost slightly outperformed SubNExT by achieving 87.24% acc, 87.15% F-score and 98.07% ROCAUC. After confirming that accuracy values are normally distributed (Table S9), the t-test validates that SubNExT's accuracy results are significantly different from all other models', except XGBoost's (Table S7). Additionally, confusion matrices enable us to appreciate the performance of each model relative to the different classes, showing slightly lower performance on minority classes (Fig. S10).
Stability was evaluated by looking at the accuracy standard deviation in the cross-validation. Focusing on the most accurate models, SubNExT reaches the lowest accuracy standard deviation at 3.37, while the MLP is third to last with 4.12. XGBoost returns the worst value, a standard deviation of 4.5 (Fig. 3). In this case, the paired F-test does not confirm that this difference is significant (Table S8). Nonetheless, the analysis of additional dispersion metrics, namely the inter-quartile range and coefficient of variation, consistently shows marginally superior stability for SubNExT (Table S6).
In regards to efficiency, of all the considered methods SubNExT is the fastest to train, with 2 min 36 s on average. In comparison, the Transformer takes over 15 minutes, while XGBoost requires 10 min 30 s on average. Inference times are short for all models, between 3 and 13 ms. SubNExT's efficiency is also reflected in its parameter count, an order of magnitude smaller than non-convolutional models, with only 76k parameters. In contrast, the Transformer has 881k parameters, while the MLP is by far the largest neural network with 10.8M parameters (Table 4). Regarding optimization behavior, an advantage of using a proven CNN backbone is the optimization process: as most hyperparameters have already been fine-tuned in ConvNeXt, SubNExT ends up needing far fewer optimization steps than alternative solutions. With only 31 steps required, it is optimized faster than the Transformer with 217 steps. In comparison, the MLP and XGBoost require 241 and 150 steps respectively.
In this paper, we evaluate natural robustness with two tests. Regarding noise resistance, Gaussian noise proportional to expression was applied. SubNExT achieves the second best result with 98.61% noise robustness, just behind the Transformer with 99.11%, and followed by the MLP (97.3%). In comparison, XGBoost retains 95.17% of correctly-classified samples under perturbation, the second worst result. In this case, a Wilcoxon test was performed as the Transformer's robustness values are not normally distributed (Table S13). Results indicate that SubNExT is significantly more robust than the other methods in all paired tests (Table S12). They also highlight that neural networks' noise robustness is significantly better than XGBoost's, except for CNN1D. Additionally, we assess model resilience against batch effects by isolating Southern United States samples, which display more population heterogeneity. The models were trained on the general population, excluding the South, and then tested on both datasets. After isolating Southern United States samples, average model accuracy degraded from 84.73% to 78.71%. SubNExT retains the most of its performance with 95.06% batch robustness, translating to 83.58% accuracy on Southern samples. While the MLP reaches the second highest robustness at 94.48%, its accuracy of 79.64% on the South falls short of XGBoost's 81.92%, which has a higher performance level overall. This highlights how model accuracy can be affected by dataset heterogeneity, and shows SubNExT's superior ability to retain its performance level when facing perturbations.
Looking at the overall results, we find that the best compromise is offered by SubNExT. Besides offering the highest classification performance of all DNNs, it is also the most efficient, the most stable, and among the most robust methods. Although XGBoost, the gold standard for tabular data modeling, offers slightly higher classification performance, SubNExT achieves comparable results with fewer resources and a better ability to maintain its performance level under perturbations. Thus, for its scalability advantages and inherent robustness, SubNExT represents a promising candidate model.

5
Discussion
In this study, SubNExT, a 2D CNN with a ConvNeXt backbone trained on DeepInsight-converted data, outperforms other DNNs for breast cancer subtyping. SubNExT also matches XGBoost’s predictive performance and surpasses it in efficiency, stability, and robustness. However, despite a thorough optimization, no method surpasses XGBoost’s classification performance. We investigate factors that could lead to these results.
Overall, deep learning models tend to perform better with increasing amounts of data, while ML models reach a plateau [83]. With only 950 samples in this study, larger datasets may yield further improvements in favor of neural networks.
Delving deeper into model architectures, while convolutional layers have the ability to extract patterns using shared weights to capture local features, inspired by the visual cortex [84], we suspect this is not fully exploited for 2D-converted gene expression. Here, input images are obtained using t-SNE, by mapping genes according to their expression distribution. Thus, while DeepInsight attempts to replicate an image, the results do not display the complex patterns present in natural images that CNNs were designed to extract. Performance could also be impacted by the chosen image dimensions. However, in our tests we explored models with both 100×100 and 224×224 images, which offered marginal improvements outweighed by disproportionately higher computational costs. Regarding NeXt-TDNN, we argue that ordering expression values based on gene locus might not capture gene interactions, as they intervene at both an inter- and intra-chromosomal level [85]. These mechanisms of epistasis, however, require modeling gene interactions, which results in complex interaction networks [86] that are more suited to GNNs.
Focusing on Transformers, in related studies [87] ViTs are found to require more parameters than CNNs to achieve competitive performance, a characteristic shared by other Transformer-based networks [88], [89]. Thus, they could generally be undersized in this context. Transformers have also been shown to struggle on small datasets compared to CNNs [87]. They have also been associated with generalization issues [90], which were observed in MLPs as well [91].
On the subject of hyperparameter tuning, Bayesian optimization enabled efficient search space exploration. Looking at the loss curves, optimization leads to very different behaviors (Fig. S11). While SubNExT and NeXt-TDNN are relatively stable, MLP and Transformer show large loss variations. In contrast, ViT tends to suddenly converge after a plateau. While the search function was solely focused on minimizing loss, different HPO strategies could be explored to analyze their impact on performance. Raiaan et al. [92] reviewed HPO algorithms, covering statistical methods, including BO, but also multiple meta-heuristics, as well as numerical and sequential approaches. These optimization methods could be implemented as alternatives to analyze their behavior and the resulting models. Tailored strategies could also benefit efficiency, as neural networks are associated with challenges regarding monetary and environmental costs [93]. Hence, Menghani [94] highlighted different solutions exploiting HPO to simultaneously optimize performance and frugality. Moreover, neural architecture search (NAS) [95] could be implemented to further optimize model design, following these principles. These supplementary steps could, however, increase processing times and resource consumption, while additional performance or efficiency returns are not guaranteed.
Regarding robustness, proportional Gaussian noise was applied to evaluate noise robustness, and a performance test on southern US samples enabled batch robustness analysis. These approaches were respectively built upon a previous study focused on noise levels in gene expression data acquisition [75], and motivated by the population’s heterogeneity, for insights on model behavior in deployment scenarios. While these tests allowed for an exploratory analysis of model robustness, Taori et al. [96] suggest that they do not guarantee the same level of robustness in a real-world setting, which in this context would be a newly sequenced cohort. On the whole, due to the perturbations’ complexity, the authors emphasize that training on larger and more diverse datasets is currently the only effective strategy to improve natural robustness.
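A proportional noise perturbation of this kind can be sketched as follows: each expression value is multiplied by a Gaussian factor centered at 1, so the perturbation scales with the value's own magnitude. The toy matrix and the 10% noise level are assumptions for illustration; the exact scheme of [75] may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=50.0, size=(8, 5))  # toy expression matrix

def add_proportional_noise(X, level, rng):
    """Perturb each value with Gaussian noise scaled to its own magnitude."""
    return X * (1.0 + level * rng.normal(size=X.shape))

X_noisy = add_proportional_noise(X, level=0.1, rng=rng)  # ~10% relative noise
```

Evaluating a frozen model on `X_noisy` at increasing `level` values then yields a degradation curve, which is the kind of exploratory robustness measurement used here.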
With respect to the choice of input data, although the current study focused on gene expression because of its prevalence in related studies, other data sources could also be considered. Additional modalities such as methylation or copy number alterations could be integrated for multi-modal learning. Complementary data types, including images, could also be incorporated using models adapted to varying combinations of features, with different integration strategies [2], [97]. This would enable assessing the pertinence of each data modality with regard to its contribution to model performance.
Finally, as for future developments, although this study evaluated models in terms of classification performance, stability, resource usage and robustness, future work could address complementary aspects such as interpretability and explainability. Explainable AI (XAI) can intervene on multiple levels: by incorporating expert knowledge into input features, by being integrated into model construction, or after the model’s prediction through post-hoc analysis [98]. In related studies, different solutions have been explored to enhance interpretability in neural networks, in particular for multi-omic learning [99]. Regarding model-agnostic methods, SHAP [100] and LIME [101] are among the most popular approaches to assess feature importance in decision making. They offer a more efficient and scalable alternative to permutation-based methods. Regarding model-specific approaches, while decision trees are usually considered white-box models, aggregation methods such as XGBoost can reach such a level of complexity that they could also be considered black-box. Thus, Sagi et al. [102] proposed a method to extract a decision tree from a trained XGBoost model, enabling the decomposition of the decision-making process. The incorporation of SHAP could also be considered as an alternative for feature importance assessment [103]. These methods, combined with Gene Ontology and pathway analysis, can highlight new non-linear relationships between genes and provide useful biological meaning to the models’ outputs [104]. Transformers can benefit from their attention mechanisms to compute feature importance. However, recent studies question the reliability of attention-based explainability, as its results can differ from those of methods such as SHAP [105]. Focusing on CNNs, they can benefit from visualization techniques, for instance Grad-CAM [106]. These enable the identification of discriminative regions that have been linked with clinically relevant procedures [107]. In this context, SubNExT relies on 2D conversion, which hinders the direct association between feature map regions and the corresponding genes. Thus, DeepInsight’s authors proposed DeepFeature [108] and DeepInsight-3D [109] to enhance explainability by reverse mapping pixels to corresponding genes, combined with Grad-CAMs. This allows the identification of gene groups involved in the model’s decision-making process (Fig. S12). While these methods enable the identification of features important to a model, they require additional attention to feature redundancy among samples [110]. These features must then be thoroughly examined with biological analysis tools, which can lead to relevant pathways [111], among other insightful biological information to evaluate model biological relevance [112].
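As a library-free illustration of the permutation-based baseline that SHAP and LIME improve upon, the sketch below shuffles one feature at a time and records the resulting accuracy drop. The fixed scorer is a hypothetical stand-in for a trained model (e.g. an XGBoost classifier), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 5
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # only features 0 and 2 matter

def model_acc(X, y):
    """Fixed scorer standing in for a trained model (hypothetical)."""
    pred = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)
    return (pred == y).mean()

base = model_acc(X, y)
importance = []
for j in range(p):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break feature j's association with y
    importance.append(base - model_acc(Xp, y))
# importance[j] = accuracy drop when feature j is shuffled
```

Each feature requires a full re-evaluation of the model, which is the scalability limitation that SHAP's tree-specific algorithms address for gradient-boosted models; the interpretation of the importances, however, is analogous.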

Conclusion
This work introduced SubNExT, a 2D CNN with a ConvNeXt backbone using t-SNE and DeepInsight 2D-converted gene expression. SubNExT was evaluated against modern deep learning architectures, including a 1D CNN using NeXt-TDNN blocks and several Transformer variants, as well as reference algorithms, namely XGBoost and an MLP. Our goal was to propose a method outperforming the state of the art with respect to several criteria relevant to usability. To the best of our knowledge, this is a novel approach in the context of genomics, through the inclusion of both efficiency and robustness concerns in a single study. SubNExT reached state-of-the-art classification performance, on par with XGBoost and superior to other neural networks.
SubNExT also showed superior stability, lower computational constraints, and ranked among the most robust models. These results highlight the relevance of adapting architectural developments for genomics applications. Accordingly, the ConvNeXt backbone enabled us to build a model that achieves state-of-the-art classification performance with lower resource usage and better resistance to noise, promoting its adoption in clinical settings.
Building on this study, future work could explore the benefit of adding data modalities, such as other omics, clinical tables, or medical images. Additionally, multi-objective optimization strategies could be adopted to further balance classification, stability and frugality, while XAI may enhance interpretability and clinical relevance.

CRediT authorship contribution statement

Karl Paygambar: Writing – review & editing, Writing – original draft, Software, Methodology, Investigation, Conceptualization. Roude Jean-Marie: Writing – original draft, Software, Methodology, Investigation. Mallek Mziou-Sallami: Writing – review & editing, Validation, Supervision, Methodology. Vincent Meyer: Writing – review & editing, Validation, Supervision, Project administration.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original when reusing.
