
Instability and performance limits of convolutional neural networks on non-sequential medical tabular data: an empirical investigation.

Wang C, Elgendi M, Shin H


Scientific Reports, vol. 16, no. 1, 2026. DOI: 10.1038/s41598-026-39875-9 · PMID: 41775758

Abstract

Convolutional neural networks (CNNs) have shown outstanding performance in image recognition, but their application to non-sequential tabular data remains debatable. This study investigates the architectural sensitivity of CNNs when applied to non-sequential medical datasets and compares their performance with multi-layer perceptrons (MLPs) under various structural settings. Three publicly available medical tabular datasets were used: integrative clinical and CT feature dataset (iCTCF), Breast Cancer Wisconsin Diagnostic (BCWD), and UCI Heart Disease (UCI-HD). We systematically varied the number of kernels, kernel sizes, and fully connected (FC) nodes in a 1D-CNN architecture and compared the classification performance with that of MLP models, while conducting 1,000 feature-order permutation experiments to quantify order sensitivity under randomized structural settings. Effect-size statistics were computed to describe class separability; no feature filtering was performed. Across permutations, MLPs demonstrated superior stability with significantly tighter dispersion than CNNs across all datasets. While CNNs achieved peak AUROCs comparable to (BCWD: 0.987 vs. 0.986) or higher than (iCTCF: 0.739 vs. 0.681) MLPs in certain configurations, they exhibited greater performance variability and a distinct negative skew, reflecting high sensitivity to feature ordering. In UCI-HD, the peak AUROC favored MLP (0.878 vs. 0.829). Post-hoc analyses confirmed that CNN performance is highly contingent on structural hyperparameters—particularly kernel size—rather than robust feature learning. CNN performance on tabular data is heavily dependent on arbitrary feature ordering and structural design, posing risks of stochastic degradation. Clinical AI applications using such data must prioritize stability over peak performance and account for the lack of inherent spatial structure in tabular inputs.

Supplementary information: The online version contains supplementary material available at 10.1038/s41598-026-39875-9.


Introduction

Convolutional neural networks (CNNs) have achieved exceptional image classification performance by hierarchically capturing spatial features through convolutional layers1–3. With localized receptive fields that detect fine-grained spatial and temporal patterns, CNNs are particularly effective in vision or sequence tasks requiring precise recognition of local structure4,5. As network depth increases, CNNs learn progressively abstract representations6, enabling high-level concept extraction from raw inputs and driving major advances in image and speech recognition4,7. In medical imaging, CNN-based diagnoses have shown performance comparable to that of subspecialist radiologists, for example, detecting hip fractures on X-rays8, intracranial hemorrhage on CT9, scaphoid fractures on X-ray10, breast cancer on MRI11, stroke on non-contrast CT12, and pneumonia on chest X-rays13. Beyond images, several studies have also applied CNNs to non-sequential datasets that lack clear spatiotemporal structure14. Here, non-sequential data refers to data in which features are arranged without inherent order or adjacency. For instance, Liu et al. used CNNs on clinical variables including demographics, lifestyle, and laboratory tests to predict risks of brain, lung, and coronary artery infarction, outperforming rule-based methods15. Similarly, Yang applied CNNs to electronic medical history records to classify diseases such as hypertension, diabetes, chronic obstructive pulmonary disease (COPD), gout, arrhythmias, asthma, gastritis, and gastric polyps, with a 95% accuracy rate16. CNNs have also been used to elucidate the functional activity of DNA sequences17.
Despite these successes, in 2019 Ivan published a study demonstrating a significant reduction in CNN classification performance when spatial connections were randomized during image classification18. This highlights a potential obstacle when applying CNNs to non-sequential data: CNNs are designed to exploit local neighborhood structure, and when such structure is absent or ambiguous, they may fail to capture relevant context or may overfit spurious correlations. Medical tabular data, which reflect patients’ physiological and pathological states, further complicate this issue: features often carry domain-specific meaning and exhibit complex interdependencies that are not organized along a spatial or temporal axis.
Although some studies have shown improved performance when applying CNNs to non-sequential data compared to conventional methods15–17,19–21, the potential risks and limitations of such an approach have not been fully considered. Therefore, investigating the suitability of classifying non-sequential data using CNNs can provide useful guidelines for selecting and optimizing models used in data analysis based on artificial intelligence (AI), especially given the increasing demand for AI-based analytics on a variety of non-sequential data types and the widespread use of CNNs in the field.
Tree-based models such as XGBoost and Random Forest, which have demonstrated strong and consistent performance on tabular datasets, are permutation-invariant with respect to feature order22. That is, shuffling the column sequence does not affect their predictions because features are selected per split in a feature-agnostic manner. This robustness is advantageous in practice but limits their utility for probing how feature arrangement influences performance.
In contrast, CNNs apply spatially constrained operations that make them structurally sensitive to input ordering. While less common on tabular data, CNNs offer a unique lens for investigating how architectural components—such as kernel size, number of filters, and fully connected capacity—interact with non-sequential structures.
Accordingly, this study evaluates CNNs as classifiers for non-sequential medical tabular data and analyzes the conditions under which they can be applied. We examine performance across datasets with differing characteristics and identify architectural factors that most strongly influence outcomes. Our findings indicate that careful architectural design—covering kernel size, number of filters, and FC nodes—is essential to avoid performance degradation and instability when spatial/temporal connectivity is absent. Importantly, this work does not propose or evaluate a multimodal fusion architecture; rather, it isolates and analyzes the structural sensitivity of CNNs on non-sequential tabular inputs. Even with model tuning, unintended performance degradation can occur when CNNs are applied to data without inherent spatial structure, underscoring the need for configuration choices that respect the properties of the input domain.

Methods


Impact of randomizing data order on convolution results
A CNN is a neural network architecture that incorporates convolutional layers that apply a mathematical operation known as convolution23. Convolution involves overlaying a window or kernel on the data and calculating the dot product of the overlapping data and kernel weights. This process is repeated as the kernel slides along the data.
Figure 1 provides an example of the convolution operation process compared to the general weighting operation of the FC layer. Suppose the data input to the convolution layer is denoted by $x$, the output data by $y$, and the kernel weights of the convolution layer by $w$; then the $i$-th element of the output for the $f$-th kernel is given by:

$$y_i^f = \sum_{k=1}^{K} w_k^f \, x_{i+k-1}, \qquad i = 1, \ldots, N - K + 1, \quad f = 1, \ldots, F,$$

where N, K, and F represent the total number of input samples, kernel size, and the number of kernels, respectively. The value $w_k^f$ denotes the $k$-th kernel weight of the $f$-th kernel, and the number of weights varies depending on the size and number of kernels. The number of output nodes of the convolution layer is determined by the number of input data and the size of the kernel, becoming $N - K + 1$ when the kernel can slide over the input data with a stride of 1. In a CNN, the convolution results are calculated by convolving overlapping data with the kernel, which can vary depending on the adjacent data. Thus, the convolution results can reveal data adjacency or connectivity. However, this also means that the convolution results may differ depending on the ordering of the data. For instance, if the order of two adjacent inputs $x_i$ and $x_{i+1}$ is changed in Fig. 1a, every output whose kernel window includes these data will change as well.
In contrast, in the FC layer, which is a basic architecture in an artificial neural network, values are passed to the next layer without using a separate kernel, as shown in Fig. 1b24. The $i$-th output of the FC layer, $z_i$, is calculated as:

$$z_i = \sum_{k=1}^{N} v_{ik} \, x_k,$$

where $x_k$ is the $k$-th input data, and $v_{ik}$ is the weight of the $k$-th input data for the $i$-th output. In the FC layer, individual weights are assigned to each input data without using a kernel. Consequently, a change in the order of the input data does not affect the FC layer. A detailed explanation of the layer-by-layer computations for each model is provided in Supplementary Material 1.
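To make the order-sensitivity contrast concrete, the following NumPy sketch (our illustration, not code from the study; array sizes and names are arbitrary) implements the two formulas above. Permuting the features changes the convolution outputs themselves, because overlapping windows mix different neighbors, whereas the FC layer gives identical outputs once each input's dedicated weight is permuted along with it:

```python
import numpy as np

rng = np.random.default_rng(42)
N, K, F, M = 8, 3, 2, 4                  # inputs, kernel size, kernels, FC nodes
x = rng.normal(size=N)                   # one sample of tabular features
perm = rng.permutation(N)                # an arbitrary feature reordering

def conv1d(x, w):
    """Valid 1D convolution, stride 1: y[f, i] = sum_k w[f, k] * x[i + k]."""
    F, K = w.shape
    return np.array([[w[f] @ x[i:i + K] for i in range(len(x) - K + 1)]
                     for f in range(F)])

w = rng.normal(size=(F, K))
y, y_perm = conv1d(x, w), conv1d(x[perm], w)
# Convolution mixes neighbors, so permuting features changes the output
# values themselves, not merely their order:
print(np.allclose(np.sort(y.ravel()), np.sort(y_perm.ravel())))   # False

v = rng.normal(size=(M, N))              # FC: one weight per (output, input) pair
z, z_perm = v @ x, v[:, perm] @ x[perm]  # permute inputs and their weights together
# FC outputs are unchanged, so column order carries no information:
print(np.allclose(z, z_perm))            # True
```

In training terms, an FC model can absorb any column order by learning correspondingly permuted weights, while weight sharing in the convolution ties each output to whichever features happen to be adjacent.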

Datasets
This study utilized three datasets, each containing mutually independent non-sequential data. The datasets used in this paper are publicly available tabular datasets that have been frequently utilized in peer-reviewed journals. We selected non-sequential data with varying numbers of features to represent a variety of input types. The selected datasets are among the most frequently used in machine learning research and education. Table 1 provides details on the datasets used in this study, while Supplementary Material 2 provides additional information on the features included in each dataset and their effect sizes.

iCTCF dataset: the first dataset used was the integrative computed tomography (CT) images and clinical features (CFs) for COVID-19 (iCTCF) dataset, which was collected from two cohorts at Union Hospital (UH) and Liyuan Hospital (LH) in Wuhan, China25. The iCTCF dataset consists of data from 1,521 subjects: 662 were diagnosed and treated for COVID-19, 57 died, and 802 had unknown outcomes. The dataset’s 130 features include nine categories of clinical features (basic demographic information, routine blood test, inflammation, blood coagulation test, biochemical test, immune cell typing, cytokine profile test, autoimmune test, routine urine test), as well as SARS-CoV-2 laboratory test results, morbidity, and mortality. CT images were provided for 1,342 patients. In the iCTCF dataset, raw CT images were not used; where applicable, only CT-derived descriptors together with other clinical variables were included as tabular inputs. We used data from 57 deceased patients and 57 subjects randomly selected from the 662 treated patients with ages similar to those of the deceased to balance the number of samples per class. A chi-square test was conducted on the ages of the two groups and confirmed no significant difference at the 0.05 significance level. Mortality was then estimated using 87 of the 130 features, excluding those that were used as labels or were not measured. The collection, use, and retrospective analysis of the iCTCF dataset were approved by the institutional ethics committees of HUST-UH (IRB ID: [2020] IEC (A001)) and HUST-LH (IRB ID: [2020] IEC (A001)) in China.
BCWD dataset: the second dataset used was the Breast Cancer Wisconsin Diagnostic (BCWD) dataset, which consists of breast cancer diagnoses collected at the University of Wisconsin Hospital and includes cell nuclei features of breast cancer tissue26. The BCWD dataset is available as an open-access dataset from the University of California Irvine (UCI) machine learning repository27. The dataset comprises 569 records, each of which contains 32 features, including patient ID, 30 features of breast cancer cell nuclei (such as radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension), and a binary label indicating tumor type (malignant/benign). In this study, all 569 records were employed, and 31 features, excluding patient ID, were used to classify malignant/benign tumors. This retrospective study utilized a de-identified dataset. The University of California Irvine Institutional Review Board (IRB) deemed the protocol IRB exempt.
UCI-HD dataset: the UCI heart disease (UCI-HD) dataset is another publicly available dataset from the UCI machine learning repository. It is based on a Cleveland database published in 1988 and contains 303 records and 14 features to predict the occurrence of heart disease28,29. The features include age, sex, smoking, blood pressure, cholesterol, diabetes, stable angina, health condition, chest pain type, maximum heart rate, angina with exercise, type of slope of ST segment in relation to maximum heart rate, labels for blood vessels, and heart disease. In this study, all 303 records and 14 features were used to predict the occurrence of heart disease. This retrospective study utilized a de-identified dataset. The University of California Irvine Institutional Review Board deemed the protocol IRB exempt. These three datasets span heterogeneous clinical domains (infectious disease, oncology, and cardiology) and provide a diverse range of feature counts and sample sizes. The iCTCF dataset alone includes over 1,500 patient records with 130 clinical features, while the BCWD and UCI-HD datasets serve as standard machine learning benchmarks in diagnostic modeling. This selection ensures a balance between dataset complexity and interpretability, enabling a robust yet reproducible evaluation of model behavior.
Effect size30 is a quantitative measure of the difference between two groups, and it is commonly used in experimental research to determine the practical significance of a result. In this study, the effect size was used to evaluate the classification ability of features, serving as an indicator of how well each dataset included high-performing features for classification. A small effect size is typically considered to be around 0.2, a medium effect size around 0.5, and a large effect size around 0.8 or higher. However, the interpretation of effect sizes can vary depending on the context and field of study. Based on the analysis of the effect size (Cohen’s d) for each dataset, the iCTCF dataset had 49 features (56.98%) with effect sizes below the small threshold, 35 features (40.7%) between small and medium, and only 2 features (2.32%) between medium and large. In contrast, the BCWD dataset had 5 features (16.67%) below the small threshold, 9 features (30%) between small and medium, 6 features (20%) between medium and large, and 10 features (33.33%) above the large threshold. Finally, the UCI-HD dataset had 4 features (30.77%) below the small threshold, 8 features (61.54%) between small and medium, and 1 feature (7.69%) between medium and large. Neither the iCTCF dataset nor the UCI-HD dataset had features with very large effect sizes. The detailed features and effect sizes of each dataset are available in Supplementary Material 2.
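For reference, the sketch below shows how such values are computed for a single feature split by class label, using the usual pooled-standard-deviation form of Cohen's d (a minimal illustration of ours; variable names are hypothetical):

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d between two groups, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# e.g., effect size of feature column j for a binary outcome y:
# d_j = cohens_d(X[y == 1, j], X[y == 0, j])
```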

Data preprocessing and partitioning
To ensure reproducibility and consistency across experiments, we implemented a standardized preprocessing pipeline for all datasets. Each dataset—iCTCF, BCWD, and UCI-HD—was partitioned into training (60%), validation (20%), and test (20%) sets using stratified random sampling to preserve the class distribution across subsets. A fixed random seed (random_state = 42) was used for all data splitting and model initialization steps to enable exact replication of results. For the iCTCF dataset, which contained missing values, mean imputation was applied on a feature-wise basis using statistics derived from the training set alone. No missing values were present in the BCWD and UCI-HD datasets. Prior to model training, all continuous features were min-max normalized to a [0, 1] range using the minimum and maximum values calculated from the training set. These normalization parameters were subsequently applied to the validation and test sets to avoid information leakage. For each dataset, we computed per-feature t-values, p-values, and Cohen’s d solely for descriptive characterization of class separability; no effect-size thresholds were applied and no features were filtered. The effect size, t-value, and p-value for all features are reported in Supplementary Tables S4–S6. These statistics contextualize dataset heterogeneity and were not used for model training, tuning, or selection. Because our protocol involves 1,000 refits with randomized feature order and CNN hyperparameters, model-based attribution methods (e.g., SHAP/LIME) were not emphasized due to instability across permutations.
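A minimal sketch of this pipeline is shown below (our reconstruction based on the description above; the use of scikit-learn and the helper's name are assumptions). It performs the stated 60/20/20 stratified split with a fixed seed, then fits the mean imputer and min-max scaler on the training set only:

```python
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

def split_and_preprocess(X, y, seed=42):
    # 60% train; the remaining 40% is split evenly into validation and test,
    # with stratification so each subset preserves the class distribution.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    X_va, X_te, y_va, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)

    # Fit imputation and normalization statistics on the training set alone,
    # then apply the same transforms downstream to avoid information leakage.
    for step in (SimpleImputer(strategy="mean"), MinMaxScaler()):
        X_tr = step.fit_transform(X_tr)
        X_va = step.transform(X_va)
        X_te = step.transform(X_te)
    return (X_tr, y_tr), (X_va, y_va), (X_te, y_te)
```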

Permutation tests
To assess the impact of the non-sequential input data order, permutation tests were conducted in this study31. A permutation test is a statistical significance test that can determine whether an observed difference between groups is statistically significant or simply due to chance. The test involves randomly shuffling the data and calculating the test statistics repeatedly to create a distribution. In doing so, the occurrence of errors due to chance is reduced, resulting in more reliable results. To generate a diverse range of input data distributions, random selections were made for the kernel size (ranging from 10% to 100% of the number of input features, in 10% increments), the number of kernels (ranging from 1 to 256), and the number of FC nodes (ranging from 1 to 256). This approach was adopted to calculate the average performance in various input scenarios.

Model generation and validation
In this study, all datasets were stratified and split into training (80%) and validation (20%) sets using ‘train_test_split’ with a fixed random seed. This ensured reproducibility and label balance in both subsets. We utilized 1D Convolutional Neural Network (CNN) and Multilayer Perceptron (MLP) models to examine the differences in classification performance based on the connectivity of non-sequential input datasets. The 1D CNN model comprised 1D convolution, FC, and output layers. The number of kernels in the convolution layer and their size, as well as the number of dense layer nodes in the FC layer, can affect the classification performance of the CNN model. Therefore, to reduce the variance in performance due to these factors, we randomized the CNN model structure at each iteration of the permutation test. In practice, kernel sizes were designed to cover the entire feature length N by using nine proportional segments (including boundary cases near 1 and N), enabling like-for-like comparisons across datasets with different N. The kernel size was increased in 10% increments from 10% to 100% of the total number of input features of each dataset, while the number of kernels and FC nodes were chosen from powers of 2, ranging from 2^0 to 2^8, as shown in Fig. 2a. To isolate the effects of kernel size (K), number of filters (F), and FC capacity (M) without introducing layer-interaction confounders, we did not employ pooling, batch normalization, attention/fusion mechanisms, or residual connections.
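One way to instantiate this randomized single-convolution architecture is sketched below in Keras (our reconstruction under the stated ranges, not the authors' published code; the MLP counterpart described in the next paragraph is included for contrast, and the conv layer is left linear because the text specifies ReLU only for the FC layer):

```python
import tensorflow as tf

def build_cnn(n_features: int, K: int, F: int, M: int) -> tf.keras.Model:
    """Single Conv1D block: conv -> flatten -> dense(ReLU) -> softmax."""
    return tf.keras.Sequential([
        # tabular features arranged along the length axis, one input channel
        tf.keras.layers.Conv1D(filters=F, kernel_size=K, strides=1,
                               padding="valid", input_shape=(n_features, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(M, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])

def build_mlp(n_features: int, M: int) -> tf.keras.Model:
    """MLP counterpart: one hidden FC layer drawn from the same node range."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(M, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
```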

The MLP model consisted of an input layer, an FC layer, and an output layer. To facilitate a performance comparison with the CNN model, we set the number of nodes to the same range as in the CNN case, as illustrated in Fig. 2b. The numbers of parameters of the CNN and MLP models used in this study, organized by structure (kernel size, number of kernels, and nodes), are summarized in Supplementary Material 3. To ensure fairness, we compared the parameter counts of CNN and MLP models under all configurations (Supplementary Table S11). Notably, MLP models achieved competitive or superior AUROC scores despite having considerably fewer parameters than CNN models, indicating that performance differences were not attributable to model complexity alone.
During the permutation test, the CNN model was generated multiple times with randomly specified parameters, including the number of kernels, kernel size, and number of FC nodes. To evaluate the structural sensitivity of CNNs to input ordering, we conducted repeated experiments using 1,000 permutations of feature columns and CNN architectural parameters. Feeding tabular features to a 1D-CNN imposes an artificial order; the 1,000 feature permutations were designed to expose order-sensitivity rather than to assume latent spatial locality. For each permutation, we evaluated three complementary settings to avoid conditional bias: (i) Dense-fixed (M fixed; K and F randomly sampled from the sweep sets), (ii) Kernel-fixed (K fixed; F and M randomly sampled), and (iii) Filter-fixed (F fixed; K and M randomly sampled). Train/validation splits were held constant within each permutation instance. Although traditional k-fold cross-validation was not applied, this permutation-based evaluation served a similar role by assessing performance variability and model robustness under randomized conditions32. Concise pseudocode describing the structural sweeps and permutation protocol is provided in the Supplementary Material 6 (Algorithm S1).
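A condensed version of this protocol might look as follows (our paraphrase of the description above, not Algorithm S1 itself; the sampling sets and function names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
POW2 = [2 ** i for i in range(9)]        # 1 ... 256 for kernels and FC nodes

def kernel_sizes(n_features):
    # 10%, 20%, ..., 100% of the feature count (deduplicated for small N)
    return sorted({max(1, round(n_features * p / 10)) for p in range(1, 11)})

def run_permutations(X, y, train_eval_fn, n_permutations=1000):
    n_features = X.shape[1]
    aurocs = []
    for _ in range(n_permutations):
        order = rng.permutation(n_features)      # arbitrary column ordering
        K = int(rng.choice(kernel_sizes(n_features)))
        F = int(rng.choice(POW2))
        M = int(rng.choice(POW2))
        # train_eval_fn refits the model on the permuted columns and
        # returns the validation AUROC for this (K, F, M) configuration
        aurocs.append(train_eval_fn(X[:, order], y, K, F, M))
    return np.asarray(aurocs)
```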
The activation function for the FC layer of the CNN is a rectified linear unit (ReLU)33, and the output layer uses the Softmax function, which is commonly used for multi-class classification tasks. The CNN model was trained using the Adam optimizer34, with a learning rate of 0.0001 and a batch size of 256. We trained models for up to 300 epochs with early stopping on validation loss (patience = 10, min_delta = 0.01, restore_best_weights = True). Parameter counts assume stride = 1 and valid padding for Conv1D and a single input channel (tabular features arranged along the length dimension). Conv and FC layers use bias parameters in implementation; when reporting “weights” we indicate whether biases are included or excluded. Categorical cross-entropy was used as the loss function for training. To prevent overfitting, the learning process was set to stop after 10 consecutive non-improving validation losses. Meanwhile, an MLP model was also generated multiple times with a randomly specified number of neurons in the FC layers. The activation function of the MLP model is also a ReLU, and the output layer uses the Softmax function, similar to the CNN model. The MLP model was also trained using the Adam optimizer with a learning rate of 0.0001 and a batch size of 256. The same early stopping configuration (patience = 10, min_delta = 0.01, restore_best_weights = True) and maximum epochs (300) were used for the MLP to maintain parity with the CNN training protocol. To implement the models and evaluate their performance, we used Python (ver. 3.9.7), TensorFlow (ver. 2.7.0), and Keras (ver. 2.7.0).
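Under these settings, a training-and-scoring step could be written as below (a sketch under the stated hyperparameters, not the published code; it assumes integer class labels and, for the CNN, inputs with a trailing channel axis, e.g. X[..., None]):

```python
import tensorflow as tf
from sklearn.metrics import roc_auc_score

def train_and_score(model, X_tr, y_tr, X_va, y_va):
    # Adam, lr 1e-4, batch size 256, categorical cross-entropy over
    # integer labels (the sparse variant), as stated in the text.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy")
    stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, min_delta=0.01,
        restore_best_weights=True)
    model.fit(X_tr, y_tr, validation_data=(X_va, y_va),
              epochs=300, batch_size=256, callbacks=[stop], verbose=0)
    # Threshold-independent AUROC from the positive-class probability.
    return roc_auc_score(y_va, model.predict(X_va, verbose=0)[:, 1])
```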
The performance evaluation of the developed CNN model was conducted using the Area Under the Receiver Operating Characteristic (AUROC) curve35, a metric widely used in machine learning and statistics to evaluate binary classification models. The AUROC score ranges from 0 to 1, where a score of 0.5 indicates a random classifier and a score of 1.0 indicates a perfect classifier; that is, a higher AUROC score indicates better performance of the classification model. For brevity, we report AUROC as the primary metric in the main text; distributional summaries across permutations and structural configurations are provided in the Supplementary Material. To isolate structural effects under extensive permutation and architecture randomization, the threshold-independent AUROC was adopted as the principal endpoint. Threshold-dependent metrics (e.g., sensitivity, specificity, F1) can fluctuate with the decision threshold and class prevalence, introducing variability orthogonal to the structural factors under study; therefore, they were not emphasized in the analysis.

Results

Overall, our findings indicate that the classification performance and reliability of CNNs are influenced by various factors, such as data sequence, dataset characteristics, and CNN structure parameters. Examples of these parameters include the number of kernels, kernel size, and number of FC nodes. The results from the permutation tests reveal that the highest classification performance was achieved on the BCWD dataset26, which has the largest effect size, as indicated by the highest average AUROC and the least deviation due to changes in the CNN structure.
Conversely, the UCI-HD dataset28, which has the smallest effect size, showed a large variation in the average AUROC depending on the CNN structure, indicating a high sensitivity to it. In contrast, the iCTCF dataset25 was found to be less sensitive to CNN structural changes, but it was on this dataset that the model exhibited the lowest overall performance. The distribution of AUROCs obtained from the permutation tests conducted on non-sequential data incorporating CNN structural changes is presented in Fig. 3. Further details on how the performance varies with the CNN structure can be found in Supplementary Material 3.

Performance according to the size of the convolution layer kernel
For all datasets, we observed a decrease in the average AUROC as kernel size in the convolutional layer increased. The largest decrease was found for the UCI-HD dataset, followed by the iCTCF and BCWD datasets. Additionally, we found that increasing kernel size resulted in an increase in the deviation of AUROC, with the most notable change observed in the UCI-HD dataset, followed by the iCTCF and BCWD datasets. These results suggest that smaller kernel sizes may be more effective for classification tasks for these types of datasets, while larger kernel sizes may lead to increased variability in performance, as depicted in Fig. 3a.

Performance according to the number of convolution layer kernels
The results indicate that increasing the number of kernels in the convolutional layer generally increased the average AUROC for all datasets. This effect was most pronounced in the BCWD and UCI-HD datasets, for which clear trends were observed. In addition, as the number of kernels increased, the deviation of AUROC tended to decrease, with the BCWD dataset showing the smallest deviation. The iCTCF dataset exhibited the smallest decrease in AUROC deviation, while the UCI-HD dataset showed the largest decrease in AUROC deviation, as shown in Fig. 3b.

Performance according to the number of FC nodes after the Convolution layer
Figure 3c shows the distribution of AUROC for each dataset as the number of FC nodes changes. In addition, Fig. 4 shows the change in the relative AUROC of the CNN and MLP models for each dataset as the number of FC nodes increases. The shading represents the 95th percentile of the AUROC distribution, whereas the dashed line indicates its median value. The gap between the dotted and solid lines represents the AUROC deviation of the predicted output. When comparing the performance of the CNN and MLP models as the number of FC nodes increases, we observe that the CNN performs better with fewer nodes.

However, as the number of FC nodes increases, the performance of the FC network approaches that of the CNN on all datasets and even surpasses it on the iCTCF and UCI-HD datasets. In Fig. 4, it is evident that when the number of nodes exceeds 128 on these two datasets, the performance of the MLP models surpasses that of the CNN models. For the iCTCF dataset, the 95th percentiles of the AUROC distributions of the CNN and MLP models were similar. However, for the BCWD dataset, the CNN showed a higher bias than the FC network. The 95th percentile and bias of the distribution decreased as the number of nodes increased for both networks. On the UCI-HD dataset, the 95th percentile of the AUROC distribution of the MLP model decreased significantly, while that of the CNN increased as the number of FC nodes increased. Furthermore, a large negative bias was observed for the CNN, whereas the FC network did not show any particular bias. In all cases, the CNN showed greater skewness than the FC network. Unlike the MLP models, which displayed no significant alterations in skewness as the number of FC layer nodes increased, the CNN models showed significant changes in skewness with the number of nodes, suggesting that AUROC may, at times, take exceptionally low values.

Discussion and conclusion

The study found that for CNN models with non-sequential data inputs, higher effect sizes of the input data led to better classification performance, while datasets composed mostly of features with small effect sizes showed lower performance. In terms of CNN model structure, the average AUROC increased while performance deviation decreased as the number of kernels increased in all datasets. In contrast, increasing the kernel size led to a decrease in average AUROC. Furthermore, increasing the number of nodes in the FC layer resulted in higher average AUROC and lower performance deviation. The Bonferroni post-hoc heatmaps in Supplementary Material 4 indicate that the effect of CNN architecture on performance was most pronounced in the BCWD dataset, followed by the iCTCF and UCI-HD datasets, in that order, which is consistent with the discussion of effect size mentioned earlier. In all three datasets, the number of nodes in the FC layer was found to be the main contributor to the performance change. However, the BCWD and iCTCF datasets, which have large and small effect sizes, respectively, showed smaller statistical differences across the number and size of kernels than the UCI-HD dataset. This may be because the BCWD and iCTCF datasets include features with large and small effect sizes, respectively, and are therefore less affected by the combination of input features due to kernel operations. Compared to the FC models, the CNN models tended to perform better with a smaller number of dense layer nodes. However, the results of the permutation test of the CNN model showed negative skewness with a long tail toward 0 in most distributions. Because the analysis targeted structural sensitivity under randomized permutations, model-based attributions were not reported to avoid conflating refitting and ordering effects with true signal. This work is not a new method proposal but a cautionary case study that empirically documents the order- and structure-sensitivity of CNNs on non-sequential tabular inputs.
In summary, optimizing CNN model structures based on the input data’s effect size can lead to improved classification performance. Performance variations based on data characteristics: the study found that, when compared to the iCTCF and UCI-HD datasets, the BCWD dataset, which includes many features with large effect sizes, demonstrated overall high average AUROC and low performance variability, regardless of the number of nodes in the FC layer, number of kernels, and kernel size. A large number of features with large effect sizes implies a greater ability to separate classes more clearly. The research findings suggest that, even if input features are grouped by kernel, the presence of features with high effect sizes can lead to excellent classification performance and minimal performance deviation in various CNN models, even when there are changes in the CNN structure. Although effect size was not a study endpoint, we incidentally observed that datasets containing a higher proportion of large-effect features tended to yield higher average AUROC and lower variance across CNN configurations. This likely reflects inherent separability of the input space rather than an architectural advantage or any feature selection, consistent with our use of Cohen’s d as a descriptive (not filtering) measure (see Supplementary Tables S4–S6).
The iCTCF dataset, which has a relatively small effect size, demonstrated the lowest overall classification AUROC. However, the study found that the iCTCF dataset exhibited minimal change in its AUROC with respect to the number of nodes, number of kernels, and kernel size in the FC layer when compared to the UCI-HD dataset. For the iCTCF dataset, the majority of features (57%) have small effect sizes, resulting in low overall classification performance regardless of the number of weights. Increasing the number of weights only diversifies the combination of features with poor classification performance, thus the performance variation is unlikely to be significant. Conversely, the UCI-HD dataset demonstrated a relatively significant fluctuation in average AUROC and AUROC deviation range based on permutation, depending on the CNN configurations. The UCI-HD dataset comprises more than half (61.5%) of the features with small to medium effect sizes. Therefore, it is anticipated that the overall classification performance would surpass that of iCTCF, and there is a greater likelihood of performance improvement as the number of weights increases.
While the degradation of CNNs on permuted tabular data may be mathematically anticipated due to the loss of spatial locality, our findings serve as a critical empirical warning. In clinical informatics, CNN-based architectures are frequently adopted for tabular health records without sufficient regard for feature ordering. Our work quantifies this stochastic vulnerability, demonstrating that the performance of such models is often a lucky byproduct of an arbitrary feature sequence rather than a robust learning of medical features. Performance on non-sequential input data according to changes in the CNN structure: when a kernel of size K is applied to input data with N features, the outcome is a total of N − K + 1 output nodes as a result of kernel sliding. If there are F kernels, then the number of output nodes becomes F(N − K + 1). The number of weights in this case is K × F. If the number of output dense layer nodes is M, then the number of weights between the convolutional output and the dense layer is F(N − K + 1) × M. Therefore, the total number of weights in the network is KF + F(N − K + 1)M, where N is the number of input features, K the kernel size, F the number of filters, and M the number of fully connected (FC) nodes. (If biases are included, add F + M.) Further details can be found in Supplementary Material 1.
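Expressed directly in code (a small helper of ours that simply restates the formula above; the example configuration is illustrative):

```python
def cnn_weight_count(N: int, K: int, F: int, M: int,
                     include_biases: bool = False) -> int:
    """Total weights of the single-conv CNN: K*F kernel weights plus the
    dense weights linking the F*(N - K + 1) conv outputs to M FC nodes."""
    total = K * F + F * (N - K + 1) * M
    if include_biases:
        total += F + M              # one bias per kernel and per FC node
    return total

# e.g., N=87 features (iCTCF), K=9 (~10% of N), F=64 kernels, M=128 nodes:
# cnn_weight_count(87, 9, 64, 128) == 576 + 647_168 == 647_744
```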

Effect of the kernel size: as K increases, the number of weights in the convolutional layer increases, but the number of weights in the entire network may decrease. Conversely, as K becomes smaller relative to M and N, the number of weights in the entire network increases. Increasing the kernel size also increases the number of input samples that are combined through the kernel while simultaneously reducing the number of weights in the FC layer that can associate different weight combinations. As a result, the independence of the input features is reduced, resulting in poor classification performance and increased instability for non-sequential inputs. This decrease in performance with increasing kernel size is illustrated in Fig. 3a. For non-sequential data, neighboring data combinations do not have any particular meaning, and thus the kernel learns a random sequence instead of a specific pattern. This is in contrast to sequential data, such as images, where computing with a kernel can be thought of as learning a specific pattern of neighboring samples, which allows the network to extract features from the data automatically.

Effects of the number of kernels: the number of kernels in a CNN is directly proportional to the number of weights in the convolutional layer and the total number of weights. Increasing the number of kernels in a CNN involves applying multiple weight vectors to the same input data, resulting in greater independence of each input's influence on the output. Essentially, this parallels solving a system of equations with multiple unknowns, where the number of equations required equals the number of unknowns. Consequently, as the number of kernels increases, the CNN network becomes capable of more independently reflecting the influence of input data on the output, thus minimizing the impact of the order of the input data. Figure 3a shows that a kernel size of 1 indicates a lack of a convolution-based combination of input features, suggesting that the input data independently contributes to the output. This observation is also reflected in the results obtained by increasing the number of kernels in Fig. 3b. For example, increasing the independence of the input data by increasing the number of kernels leads to similar results to those obtained with a kernel size of 1, as shown in Fig. 3a. These results suggest that a sufficiently large number of kernels is required to produce stable results when applying non-sequential inputs to a CNN, and that this number depends on the number of input features.

Number of nodes in the FC layer: the number of nodes in an FC layer does not impact the number of weights in a convolutional layer. However, the number of nodes in an FC layer is directly proportional to the total number of weights in the layer. Increasing the number of FC layer nodes can improve the independence of the original inputs because it allows the multiple-input convolutional layer to be represented by multiple expressions. Therefore, increasing the number of FC layer nodes improves the expressiveness of the model owing to the increased number of weights. This simultaneously reduces the performance deviation caused by non-sequential inputs.

Comparing the effect of convolutional and FC layer weights: increasing the number of kernels and the number of nodes in the FC layer can improve classification accuracy and reduce performance variance, albeit in different aspects. The former can be viewed as generating multiple equations for the same sample data, while the latter decomposes already-combined input values through various equations to determine the expressiveness of the original input data. This effect is demonstrated in Figs. 3b and c. Both figures report performance improvement and deviation reduction when the number of kernels or the number of nodes in the FC layer is increased. However, unlike the moderate performance improvement from increasing the number of kernels (Fig. 3b), the performance improvement tends to be sharp at a certain number of nodes when increasing the number of FC layer nodes. Increasing the number of kernels is effective in extracting more sophisticated features of the input data within the kernel size, while increasing the number of FC layer nodes enhances the representation of the entire input. However, it appears that securing the minimum representation of the entire data is crucial for performance improvement, rather than finding features within a specific region (kernel) in data without spatial and temporal connectivity. Therefore, it is important to balance the use of kernels and FC layer nodes to achieve optimal performance on non-sequential datasets. Specifically, we used a single convolutional layer to ensure that the effects of kernel size and number of filters could be evaluated independently, without interference from downstream layers. This design choice was essential for isolating the contribution of each architectural component. Since spatial locality is not preserved in non-sequential tabular data, we systematically varied the kernel size based on the total number of features in each dataset, dividing the input dimension into proportional intervals. This approach allowed us to explore how receptive field size influences representation learning without assuming spatial continuity. Similarly, we tested a wide range of filter and FC node configurations to observe their individual effects under controlled structural permutations.
Comparing the performance of the FC and CNN models: the results of the performance comparison between FC and CNN models reveal that in the iCTCF dataset, which comprises features with small effect sizes, increasing the number of nodes does not significantly improve performance for either model. Moreover, the classification performances of the CNN and FC models do not exhibit a substantial difference. This suggests that altering the combination of features through changes in kernel size and number of kernels in the convolutional layer has a performance limit, as the dataset primarily comprises features with low effect sizes. In contrast, in the BCWD dataset, which contains numerous features with large effect sizes, the FC model demonstrates a significant performance improvement as the number of nodes increases, and has a low performance deviation. In the UCI-HD dataset, both FC and CNN models display notable performance enhancement beyond a certain number of nodes, which highlights the importance of a minimum number of weights for identifying key features within the dataset. It was observed that the performance of the CNN model improves with a smaller number of nodes in the FC layer compared to the FC model. This can be attributed to the larger number of weights in the CNN architecture, owing to the convolutional layer that precedes the FC layer. One notable limitation of CNNs highlighted in this study is the presence of negative skewness and long tails toward zero in all datasets and under all conditions. This could be attributed to the fact that CNNs use kernels to extract features from uncorrelated data, which could result in the extraction of random features. As a result, it is possible for CNNs to produce poor results in some cases, even when their structure has been optimized.
In this study, a single-layer CNN model and an FC model were used to analyze the effect of non-sequential inputs from the perspective of CNNs and FCs according to changes in the number of kernels, kernel size, and number of nodes. However, it is common for CNNs and FCs to be used as deep CNN (DCNN) or deep neural network (DNN) structures consisting of multiple layers. Therefore, it is necessary to consider and verify the results using generalized models of DCNN and DNN to ensure the generalizability of the findings. Moreover, recent years have seen the application of various techniques such as dropout and batch normalization to DCNN and DNN structures to improve performance. Note that while this study suggests that an appropriate CNN structure should be adopted when applying CNNs to non-sequential inputs, it does not provide guidance on selecting a specific structure. This is because the structure of CNN models can vary depending on the purpose of research and the characteristics of non-sequential input data. Additionally, to generalize the results, it can be useful to match the number of parameters in the CNN and MLP models and draw conclusions accordingly. We compared the number of parameters across the three datasets and found that MLP models generally outperformed CNN models despite having fewer parameters; thus, the MLPs' better performance cannot be attributed to parameterization differences (see Supplementary Table 5). However, as the number of layers in a CNN increases, the results may vary, so a comparative study considering the number of parameters should be conducted to obtain generalized results for multi-layer CNNs. Furthermore, medical tabular data is characterized by both inter-individual and intra-individual variability, and it features interrelated variations among different indicators. These characteristics constitute a distinctive type of model input, which is why we used real medical data in tabular format in this study. However, the model and medical tabular dataset used in this study do not represent all CNN models and medical tabular data. While the results of this research can provide insights within the scope of the analyzed datasets, a broader application of these findings necessitates the use of a more diverse range of medical tabular datasets. Although the combined sample size across the three datasets used in this study exceeds 1,500 patients, the data may not fully reflect the broad demographic, geographic, or clinical diversity seen in real-world populations. As such, the generalizability of the findings may be limited. Future studies that incorporate larger and more diverse cohorts could further improve the reliability and applicability of this architectural framework in clinical practice. Given the non-local nature of tabular inputs, broad superiority of CNNs is neither theoretically nor empirically expected; instances of near-parity are best viewed as data-specific exceptions. As the analysis targeted threshold-independent structural effects, sensitivity, specificity, and F1 were not comprehensively reported; deployment-oriented evaluations may incorporate these metrics at clinically motivated, pre-specified cutoffs.
Therefore, future research should address the minimum requirements necessary to ensure stable performance when applying CNNs to non-sequential inputs, as well as examine the applicability of generalized CNN models to non-sequential data while precisely matching the parameters between CNN and MLP models. Deployment-oriented interpretation may include SHAP on representative, fixed configurations (pre-specified architecture and feature order) using clinically grounded background data. In addition, this study focused on single-layer CNNs to enable isolated evaluation of structural components. While deeper or multi-layered CNNs may yield improved performance, they introduce complex layer interactions that hinder objective component-wise comparisons. Furthermore, performance gains are not always guaranteed. We expect that future studies applying deeper architectures in more generalized settings could provide complementary insights and support broader applicability of our findings. In particular, CNN models tend to be more location-dependent with shallower layers, while deeper layers lead to increased location independence, potentially reducing the impact of the order of tabular data and thus improving performance. In future studies, examining the performance changes in CNN models with varying layer depths could further substantiate the reliability of the results of this research. Furthermore, it is anticipated that by utilizing a multitude of medical tabular datasets and conducting comparative validations among various models such as RNN, LSTM, LASSO, ElasticNet, as well as CNN and MLP, a generalized model and results for non-sequential medical data could be obtained.

Supplementary Information

Below is the link to the electronic supplementary material.
