Transformer-enhanced deep ensemble for multi-class liver disease classification using computed tomography images.

Bhardwaj S; Aggarwal S; Kumar N; Prasad A; Mapari S; Kaushal RK

doi:10.1038/s41598-026-43256-7

← 뒤로

Transformer-enhanced deep ensemble for multi-class liver disease classification using computed tomography images.

1/5 보강

Scientific reports 📖 저널 OA 97.4% 2021~2026 2026 Vol.16(1)

Bhardwaj S, Aggarwal S, Kumar N, Prasad A, Mapari S, Kaushal RK

📖 무료 전문 🟢 PMC 전문 PMC13087263

PubMed ↗ DOI ↗ BibTeX ↓ RIS ↓

📝 환자 설명용 한 줄

이 논문을 인용하기

↓ .bib ↓ .ris

APA Bhardwaj S, Aggarwal S, et al. (2026). Transformer-enhanced deep ensemble for multi-class liver disease classification using computed tomography images.. Scientific reports, 16(1). https://doi.org/10.1038/s41598-026-43256-7

MLA Bhardwaj S, et al.. "Transformer-enhanced deep ensemble for multi-class liver disease classification using computed tomography images.." Scientific reports, vol. 16, no. 1, 2026.

PMID 41803329 ↗

DOI 10.1038/s41598-026-43256-7

Abstract

The liver related diseases such as cirrhosis, fatty liver disease, and hepatocellular carcinoma raise significant health challenges in the world due to their increasing prevalence and their complexity in detection. This study features a deep learning-powered computer-aided diagnostic system that includes computed tomography (CT) images in the classification of liver diseases into groups of several classes. The task is to improve diagnosis quality through the implementation of all three pre-trained convolutional neural networks (CNNs) (ResNet50V2, DenseNet121, and MobileNetV2) to make multi-scale and multi-class imaging classification. All CNN models were thoroughly fine-tuned and evaluated at first. Transformer blocks were then added to every backbone to form a hybrid model. The models were trained and assessed on a CT liver dataset and performance measured based on precision, recall, F1-score and Matthews Correlation Coefficient. Findings show that there are marked enhancements with the addition of transformers especially in diagnosis of complex conditions, e.g., cirrhosis and fatty liver. The ensemble model, which was improved using the transformer, had the best overall accuracy of 97% which was higher than any single model. This study addresses the clinical benefits of CNNs, and transformers used together to classify liver diseases and provides suggestions about its application in clinical practice.

🏷️ 키워드 / MeSH 📖 같은 키워드 OA만

같은 제1저자의 인용 많은 논문 (1)

A Comparative Analysis of HER2 Immunohistochemistry in Core Biopsy Versus Excision in the Era of HER2 ``Low'' Breast Cancers.
Clinical breast cancer 2026

📖 전문 본문 읽기 PMC JATS · ~136 KB · 영문

Introduction

Introduction
Liver is the largest parenchymal organ of human body. It makes about 2% of total body weight of an adult and is essential for metabolism, detoxification and synthesis of essential proteins. Liver is a complex organ that does a variety of physiological functions and, therefore, it is prone to various forms of threats like viral infections, toxic exposure, metabolic disorders, and autoimmune diseases. Any form of liver disease can trigger a cascade of events which can adversely affect nearly all of the organ systems1. Hepatic disorders rank as the eleventh major cause of mortality and have claimed 2 million lives annually2. Cirrhosis and Liver cancer together contribute 3.5% to the world mortality, hepatocellular carcinoma (HCC) is in the sixteenth place in the list of causes of mortality, whereas cirrhosis is in the eleventh place. Cirrhosis is also a leading cause of disease disability-adjusted life year (DALYs) and years of life lost. Liver Cancer and Cirrhosis each constitute 1.6% and 2.1% of the global disease burden respectively significantly contributing to DALYs3. In 2019, cirrhosis and other chronic liver disorders resulted in around 1.47 million fatalities, a notable rise from 1.01 million in 1990, despite a decrease in the age-standardized mortality rate throughout this timeframe4.
Liver conditions also have a severe impact on the health services, economic security, and the global quality of life. Chronic liver disease is estimated to account for 4% of global mortality, having another major contributing factor being non-alcoholic fatty liver disease, currently referred to as metabolic dysfunction associated steatotic liver disease (MASLD), which has an estimated global prevalence of 32.8%. MASLD significantly contributes to severe hepatic consequences, including cirrhosis and HCC5,6. The economic cost of liver diseases goes beyond the physical condition as there is also psychological discomfort and weakened functional capacity as well as other disorders are also linked with liver diseases such as type 2 diabetes mellitus7,8. These complex challenges are to be addressed with various strategies, including enhancing health services that are affordable, reducing health disparities, and investing in translational research.
The growing rates of the hepatic diseases and the metabolic disease, as well as sedentary lifestyle conditions, demonstrate the necessity of an in-depth explanation of the epidemiology, risk factors, and systemic implications. This understanding is crucial in coming up with specific prevention programmes, diagnostic procedures and therapeutic interventions with machine learning (ML) also contributing to precision strategies for various liver diseases9.
Diagnostic assessment of liver disease incorporates the spectrum of modalities, which include imaging technology, laboratory testing, and the new developing technologies. Traditional laboratory investigations are still the standard practice in the evaluation of hepatic function and diagnosing the etiology of the disease. Biochemical indicators, especially alanine aminotransferase (ALT) and aspartate aminotransferase (AST) are sensitive but not specific to hepatocellular injury biomarkers of individual liver diseases10. Advanced ML models are being developed to predict abnormal increases in such indicators, even in specific contexts like drug-induced liver injury11.
This means that a follow-up imaging is a common procedure to verify diagnoses and measure the severity of a disease especially when clinical tests have uncharacteristic results.
In the management and the non-invasive detection of liver conditions, imaging is essential. Similarly, novel deep learning (DL) systems utilizing multi-sequence cardiac magnetic resonance are showing promise for prognostic prediction in other critical medical areas12. Predominant modalities used in clinical practice include computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound. Ultrasound is majorly used in the imaging for diagnosing liver disease but has disadvantages of increased technical failure rates in obese conditions and reliance on operator skill13. However, recent advancements in DL, such as those enabling fast super-resolution ultrasound microvessel imaging, are addressing some of these limitations and expanding its diagnostic capabilities14. MRI is better in soft tissue contrast particularly in imaging of abdominal areas by use of intravenous contrast agents15,16 but is limited by its cost and the distortion of magnetic fields. CT provides fast acquisition rates, large spatial resolution and multiplanar reconstruction features making it very appropriate in diagnosis of diseases13. Diagnostic benefits of CT usually supersede the risks of ionising radiations thus justifying its use in routine practice17.
The diagnosis of the hepatic pathologies has reached a significant level with the usage of artificial intelligence, especially the Computer-Aided Diagnosis (CAD) systems which are based on the usage of DL, extending to complex tasks such as virtual stenting for aortic repair using graph DL and chest X-ray diagnosis using advanced filter and alignment networks18,19.
These systems have been quite successful in the diagnosis of the fatty liver disease, cirrhosis and HCC and even in many cases they have a higher sensitivity than the traditional operator-based methods which have a low sensitivity. A more sophisticated method that takes advantage of transfer learning is by using the already trained CNNs.In studies by Aboulwafa et al., and Reddy et al., the classification accuracy of of CNNs like DenseNet 201 and VGG 16 models has reached 95 per cent and 90 per cent respectively20,21, which has aided to the radiologists expertise and improved the consistency of diagnosis. Similar to CNNs, transformer-based architectures are effective at the detection of hepatic pathologies because they are able to capture contextual dependencies in medical images, with optimized pre-trained transformer models demonstrating efficiency in classification22. Novel DL architectures, such as inception models, are also being developed for precise interpretation in other medical imaging domains like mammography23.
A new ensemble model that was transformed through the use of transformers was developed by combining the benefits of all the three models, aligning with advancements in multi-source deep feature fusion for medical image analysis24,25.Compared to classic models, Hybrid Transformer Neural Networks (HTNNs) prove to be better at classifying focal liver lesions through contrast ultrasound, achieving a higher level of accuracy and precision26. Further innovations include novel Vision Transformer architectures with efficient blocks for medical image classification27.
Moreover, predictive modeling of liver disease progression has been done using ensemble models that use hyperparameter-optimised logistic regression. The integration of explainable AI with ensemble learning and transfer learning is also being explored for medical detection tasks28. The accuracy of these models has proven to be satisfactory with up to 83% in prediction of disease trajectory thus, enabling individualistic therapeutic planning and early intervention29. In spite of these developments, there are still challenges, including the issue of heterogeneity of data collections, model generalisability, and the need to do real-world clinical validation. Nevertheless, the application of CAD on the basis of transfer learning and transformer ensembles is a tremendous change in the field of diagnostics in the hepatology sphere. In health-care environments of different types, AI-based methods can significantly improve clinical decision making, earlier detection rates and patient outcomes.
Nevertheless, CAD based on the transfer learning and transformer ensembles represent a significant improvement in the domain of diagnostics in hepatology. In other healthcare conditions, artificial intelligence-based approaches can be of considerable help in enhancing the level of clinical decision-making and early detection, as well as patient outcomes.
This study aims to develop a robust, transformer-enhanced deep ensemble framework for the automated multi-class classification of liver diseases specifically cirrhosis, fatty liver, and HCC using CT images, thereby improving diagnostic accuracy and clinical applicability. Even though the effectiveness of these previous DL methods has been demonstrable in binary liver disease classification, there still exists a substantial void in the creation of common multi-class systems that utilize the hybrid CNN-Transformer frameworks. Several current models concentrate on one or binary disease classification, cannot be generalized to various clinical presentations, and rarely combine attention with ensemble learning to stratify liver pathology. Furthermore, low-level imaging properties of initial conditions, and heavy calculations are also a continued problem in practical applications. To address these gaps, this work proposes to add transformer modules to a multiplicity of pre-trained CNNs, namely ResNet50V2, DenseNet121, and MobileNetV2, in order to augment the representation of both local and global features, create a hybrid ensemble model that synergistically exploits the complementary benefits and perform a thorough performance analysis on a multi-class CT liver dataset.
The proposed approach will be likely to work with automated multi-classification of liver diseases. The contributions made by the author in this study effort are as outlined below:

Three pre-trained CNN models ResNet50V2, DenseNet121, and MobileNetV2 have been first used in liver pathology classification. These models were also evaluated independently following refinement and retraining on a CT image dataset with performance metrics like F1 -score, precision, and recall.

Transformer modules were incorporated into each of the aforementioned pre-trained models in the second stage to form hybrid pre-trained transformer architectures. The trained variants of the modified models were later trained and tested to determine the improvement in feature extraction and classification accuracy of the attention-based mechanism of transformers.

Hybrid ensemble transformer model was developed based on the performance of CNN and CNN-Transformer models. This framework incorporated the representations of features of ResNet50V2, DenseNet121, and MobileNetV2 integrated with transformer modules. The aim was to employ the synergistic benefits of every model to improve the classification accuracy for the liver diseases.

A comprehensive performance evaluation was conducted on all the pre-trained models, models enhanced with transformers and finally on the ensemble integrated with transformer model.

The remainder of this paper is organized as follows. Section 2 presents a detailed review of related work and recent advancements in DL-based predictive technologies for liver disease diagnosis. Section 3 describes the materials and methods, including dataset acquisition, preprocessing, baseline models, transformer-enhanced architectures, and the proposed hybrid ensemble framework. Section 4 outlines the experimental setup and evaluation protocol. Section 5 reports the quantitative results and provides a comprehensive analysis of baseline, transformer-enhanced, and ensemble models. Section 6 discusses the findings in depth, highlighting the impact of transformer integration and ensemble learning. Section 7 compares the proposed approach with state-of-the-art methods. Section 8 addresses the limitations and future research directions, followed by the concluding remarks in the final section.

Related work

Related work
The recent development of DL and ML in CAD of liver diseases has been increasing at a fast pace due to the need to have an accurate, non-invasive, and reproducible diagnostic instrument. Early works mostly employed CNNs through transfer learning to learn hierarchical features of medical imagery, which showed promising performance in fatty liver detection and the classification of focal lesions. As an example, the binary classification with CNN-based solutions based on ultrasound and CT images demonstrated moderate and high accuracy rates, which confirms the possibility of DL-assisted liver disease diagnosis19,20. Nevertheless, these early CNN schemes were to a large extent limited by their ability to deal with local receptive fields hence their inability to analyze global contextual information, which is of specific essentiality in diffuse liver pathologies like cirrhosis and early-stage steatosis.
To overcome these drawbacks, the recent literature has looked into the mechanisms of attention and transformer based architectures to improve the contextual feature learning. Research has demonstrated that attention modules can be used to enhance sensitivity and robustness of liver disease detection to a great extent. A study by Dai et al. (2025) showed that a DL-based system of localized liver lesion detection on contrast-enhanced MRI has a similar performance as a radiologist and higher sensitivity as an assistant tool30. Likewise, Peng et al. (2025) introduced a 3D Convolutional Block Attention Module to detect HCC on non-contrast CT, demonstrating performance in opportunistic screening31. In addition to classification, attention-augmented architectures have been used in liver segmentation, an essential step towards proper diagnosis. Guo et al. (2025) described promising results with the help of Mamba-based models to perform few-shot medical image segmentation, whereas Sattari et al. (2025) compared U-Net and Detectron2 to segment liver margins on CT images, noting the effects of segmentation accuracy on the downstream classification32,33.
In parallel with the development of network architectures or design systems, radiomics and hybrid approaches have developed as a formidable force in terms of measuring quantitative imaging biomarkers. The study conducted by Zhang et al. (2025) proved that radiomics-based models using non-contrast abdominal CT had better results than traditional quantitative CT metrics in the diagnosis of fatty liver disease34. Further refinements have been made to imaging by adding predictive value to imaging-based methods using pathological and clinical data. Yu et al. (2025) described a radiopathomics model (Swin transformers) as a method to identify tumor clusters in HCC through vascularity, revealing a significant increase in the prognostic power of MRI and histopathological phenotypes35. Similarly, Zheng et al. (2025) suggested a DL methodology of predicting microvascular invasion in HCC using topology, demonstrating that advanced feature representations could gain more diagnostic models36. They are indicative of a larger shift in traditional CNN-based pipelines to the context-aware, multimodal, and hybrid learning pipeline, including large-kernel adapters and multimodal masked autoencoders of medical image classification37,38.
Ensemble learning has become quite popular as well as a method to enhance robustness and generalizability through complementary model predictions. Famularo et al. used ensemble ML methods that include random forests and multilayer perceptrons to predict microvascular invasion using CT scans, whereas Mohamed et al. used a two-tier stacking ensemble using Indian Liver Patient Dataset, and it showed more reliable classification results. This paradigm has been furthered by multimodal ensemble methods whereby, Lei et al. (2024) indicated that using a single DL model which combined both CT and MRI data could predict microvascular invasion with much better accuracy than single mode models39. These results highlight the increased significance of multi-architecture and multi-source fusion approaches, such as enhanced spectral-spatial transformer-based fusion approaches40.
Ensemble learning has become quite popular as well as a method to enhance robustness and generalizability through complementary model predictions. Famularo et al. used ensemble ML methods that include random forests and multilayer perceptrons to predict microvascular invasion using CT scans, whereas Mohamed et al. used a two-tier stacking ensemble using Indian Liver Patient Dataset, and it showed more reliable classification results41,42. This paradigm has been furthered by multimodal ensemble methods whereby, Lei et al. (2024) indicated that using a single DL model which combined both CT and MRI data could predict microvascular invasion with much better accuracy than single mode models43. These results highlight the increased significance of multi-architecture and multi-source fusion approaches, such as enhanced spectral-spatial transformer-based fusion approaches.
Although these advances have been made, there are a number of serious challenges. Many of the current studies concentrate on binary classification or a single-disease detection that restricts its application in the real-world clinical setting that involves the need to create a differential diagnosis between two or more liver conditions. Even though there are the beginning efforts to predict multiclass liver disease with adaptive preprocessing and ensemble modeling the lack of unified frameworks able to jointly predict fatty liver disease, liver cirrhosis, and HCC are few44,45. Furthermore, although the transformer-based attention has demonstrated to be successful in several medical imaging applications, such as weakly supervised computational pathology, its methodical combination with several CNN backbones within a single ensemble architecture to classify liver disease has received limited research46. Also, most of the multimodal and hybrid models that perform well are associated with high computational costs, making their application in resource-limited clinical settings impossible.
In this context, the present study addresses these gaps by proposing a transformer-enhanced hybrid ensemble framework that integrates global attention modeling with complementary CNN feature extraction for multi-class liver disease classification using CT images. In contrast to the previous studies, which consider an individual modality or binary classification and/or individual attention-enhanced models, the research paper systematically compares the contributions of transformer modules to various CNN models and fuses them into a single ensemble. The proposed methodology enhances the deployment and generalizability of CAD in hepatology by ensuring that the clinically relevant multi-class diagnostic development is achieved with the trade-off of accuracy, robustness, and computational efficiency.

Materials and methods

Materials and methods
This section describes this comprehensive methodology in this research paper such as data collection, pre-processing, structural design of the proposed model, and the experimental environment. It outlines the characteristics of the training and test dataset, the elements and structure of the hybrid transformer ensemble model, and the methodologies that were applied to train, validate and measure the performance of the model.

Dataset description
The data in this study is a set of CT scans of liver diseases provided by multiple databases with a complete open access. The CT scan data of fatty liver was taken from the UNIFESP Chest CT Fatty Liver Competition in Kaggle47. The classification of liver cirrhosis was based on the CT scans sourced from the TCIA/TGCA-LIHC portal venous phase48. The data of the HCC was sourced from The Cancer Imaging Archive Hepatocellular Carcinoma-Multimodality Annotated HCC Cases that contains the non-segmented cases and advanced imaging segmentation [Data set]49. The sample size of the given study comprises of comprised a total of 681 patient cases, encompassing 1,008,731 DICOM slices. Specifically, it includes 156 cases of liver cirrhosis with 7,776 slices, 329 cases of fatty liver with 961,961 slices, and 196 cases of HCC containing 38,994 slices. To enhance classification, each of the classes was given labels: 0 was liver cirrhosis, 1 fatty liver and 2 HCC. The data was first split into 80% training and 20% testing data in order to train and test the models. The 80% training data were then split between 70% training and 10% validation. The sample precisely determined by each class is shown in Table 1 in the subgroups.

Data preprocessing and augmentation
Image processing of CT was done in prior to enhance the efficacy and consistency of models. To mitigate increased computational expenses resulting from the original pixel number, all images were scaled to 224 × 224 pixels to standardize dimensions while preserving predictability. The pixel values were normalized to the [0,1] range to bring equal distribution and Contrast Limited Adaptive Histogram Equalisation (CLAHE) was utilised to enhance local contrast and highlight minor features, especially in areas of limited visibility.
As has been shown in Table 1, there was considerable imbalance in the dataset as the number of samples of patient images in each of the three groups varied significantly. Data was manipulated to enhance the diversity of the datasets through the generation of modified versions of the existing images. Augmentation enhances the generalization of models by introducing variability in attributes, rather than just enlarging the dataset to increase model size. Various augmentation strategies such random translation, random rotation in a range of − 10° to + 10°, and random zoom were applied uniformly across all the classes of liver disease i.e. fatty liver, liver cirrhosis and HCC. These modifications enhanced robustness, reduced overfitting, and improved the model’s applicability to real-world situations. Figure 1 shows representative samples of augmented images.

Baseline Pre-trained models
Three pre-trained CNNs-ResNet50V2, DenseNet121, and MobileNetV2 were utilised as baseline models for the classification of liver diseases, encompassing HCC, cirrhosis, and fatty liver disease. ImageNet pre-trained weights were used to initialise the networks and the original classification layers were removed. In order to preserve generalizable feature representations, convolutional layers were frozen at training, and a lightweight classification head was introduced. This head comprised a Global Average Pooling 2D layer, which diminishes spatial features, a Dropout layer (rate = 0.5) for model regularisation, a Dense layer employing ReLU activation for non-linear feature extraction, and a concluding Dense layer utilising softmax activation to categorise features into three output classes. Figures 2 and 3, and 4 illustrate ResNet50V2, DenseNet121, and MobileNetV2, respectively.
Sparse categorical crossentropy was used as the loss function to train all the models, using the Adam optimizer, with early stopping used to control overfitting. The assessment and training were made on a patient level which ensured that the prediction was done on individuals and not on individual CT slices. Each baseline model offered distinct advantages:
ResNet50V2, with residual learning and pre-activation units, optimised gradient flow and facilitated the extraction of deep features. The mathematical representation of ResNet50V2 residual learning is provided in Eq. (1).
In Eq. (1), and denote the input and output feature maps of the -th residual block, respectively. The function represents the residual mapping learned by the block, where denotes the set of trainable weights associated with the convolutional layers in the -th residual unit. The identity mapping directly propagates to the output through a skip connection, while ReLU denotes the rectified linear unit activation function applied element-wise. This formulation enables effective gradient flow and mitigates the vanishing gradient problem in deep networks.
The intricate layer architecture of DenseNet121 facilitated the exclusive utilisation of features and enhanced detection with minimal fluctuations in greyscale. The intricate connections in DenseNet121 provide feature reuse by incorporating feature maps from all preceding layers, as outlined in Eq. (2) below.
where is the concatenation of feature maps obtained in layers , and is a compound technique of batch normalization, ReLU, and convolution operations and represents the output feature map of the -th layer in DenseNet121. MobileNetV2, utilising depth wise separable convolutions and inverted residuals, demonstrates effective and accurate classification, rendering it appropriate for real-time or resource-constrained clinical applications. MobileNetV2 is predicated on the concept of depth wise separable convolutions, to divide a standard convolution into pointwise and depth wise processes. The ratio of the computational expense of conventional convolution to that of depth wise separable convolution is defined in Eq. (3). where D_k is the kernel size, is the number of input channels, is the number of output channels, and D_F is the spatial dimension of the feature map.

Transformer enhanced pre-trained models
The proposed Transformer-Enhanced architectures aim to unify the most common CNN architectures with transformers to exploit their complementary strengths for better classification of CT images. The transformer block was combined with each of the pre-trained models, i.e., ResNet50V2, DenseNet121 and MobileNetV2 based on the architecture design. All backbones CNN (ResNet50V2, DenseNet121, and MobileNetV2) were initially employed as feature extractors on their own as described in the following sections of 3.4.1,3.4.2, 3.4.3 and 3.4.4. The last classification layers of the already trained networks were stripped off and only the convolutional feature maps are left in them. These feature maps are local and hierarchical texture information of liver CT images. Final convolutional layer output features were transformed into the compact representations of feature vectors by Global Average Pooling, and result in feature vectors of fixed length. The vectors are then rearranged into token embeddings and fed into transformer encoder blocks. The blocks of the transformer are multi-head self-attention and feed-forward layers, which allow the model to detect long-range dependencies and global contextual relationships, which traditional CNNs can ignore. It is especially useful in liver diseases like cirrhosis and fatty liver, which do not have local pathological changes but diffuse. The enhanced representation features of transformers are eventually fed on to a fully connected classification head using the SoftMax activation to make multi-class predictions.

Feature extraction via pre-trained CNN
The input images were sized to 224 × 224 × 3 during the preprocessing stage and trained using the same pre-trained CNNs: ResNet50V2, DenseNet121, or MobileNetV2. For feature extraction, the classification heads of the pre-trained models were removed, retaining only the convolutional backbone. ResNet50V2 incorporates residual connections, DenseNet121 features dense inter-layer connectivity, and MobileNetV2 possesses a highly efficient computational architecture. The last convolutional layer produced a 7 × 7xN tensor as the output of the CNN, where N is model-dependent (e.g., 2048 in ResNet50V2).

Spatial feature flattening
The CNN backbone created feature maps that were fed into a Global Average Pooling layer which reduced the feature maps into a one-dimensional feature vector. This greatly reduced the spatial complexity and preserved global semantic data. In natural language processing, the output feature map or a token is in the form of a patch embedding.

Transformer block integration and contextual feature refinement
The aggregated features were entered into a transformer module comprising of one or multiple encoder blocks. Both blocks contained a self-attention mechanism and multiple heads, and position-wise feed-forward neural network. To maintain the order of sequence and spatial hierarchy of the features, sinusoidal or learnable positional encodings were implemented before to attention on the input vector. The multi-head attention attained by the model at the same time analysed several representation subspaces in the entire image, revealing interdependencies. The basic element of transformer module is the multi-head self-attention mechanism, which allows the model to evaluate the importance of different variables in their connection to each other. Given a sequence of feature vectors as an input , with n being the number of features (or tokens) and d being the dimension of the feature, the self-attention is given as follows:First, the Query (Q), Key (K) and Value (V) matrices are generated using linear projections as shown by Eq. (4):

where are learnable weight matrices.
The scaled dot product attention is calculated like Eq. (5):
Where, computes pairwise similarity scores between queries and keys and denotes the dimensionality of the key vectors and is used as a scaling factor to stabilize gradients during training. Multi-head attention (MHA) involves repeating this process h-times (that is, with h number of heads) simultaneously with different learnt projections. The concatenation of the output of all heads is projected and made as outlined in Eq. (6):
where , and is a learnable output projection matrix.
MHA layer output is then forwarded through a Feed-Forward Network (FFN) which consists of two linear transformations interspersed with ReLU activation as illustrated in Eq. (7):
The feed-forward network is applied independently to each token and consists of two linear transformations with a ReLU activation in between, where and are learnable parameters. Each sub-layer is followed with a residual connection and a layer normalization to allow the stable training and propagate the gradient effectively.

Classification head
The subsequent stage in multi-class classification comprises of the modified feature vector using a dense layer utilizing SoftMax activation. The amount of output neurones corresponded to the classifications of liver disease, which included cirrhosis, fatty liver, and HCC. The model was trained with sparse categorical cross entropy loss and Adam optimizer, and early stopping was also added, based on the validation loss.
Transformer blocks integration allowed the model to learn long-range spatial dependencies and global contextual relationships that would otherwise be overlooked by convolutional layers. In complex conditions with diffuse or textural subtle pathological patterns (e.g., cirrhosis and fatty liver), the self-attention mechanism selected discriminatory areas, including the fibrotic bands in cirrhosis or heterogeneous fat deposition in steatosis. Such global attention helped to improve feature maps by weighting the clinically important regions, which increased the accuracy of the classification and the recall.

Hybrid transformer ensemble model
The hybrid ensemble model proposed combines three transformer-enhanced CNN backbones: ResNet50V2, DenseNet121, and MobileNetV2, which are run parallelly as observed in Fig. 5. Complementary feature representations are then extracted by each backbone on the same input CT image. Each backbone is further refined by a transformer, and the resulting feature vectors are then projected in a shared latent space as a result of fully connected layers that make the dimensionally compatible projection. These aligned feature vectors are then joined together to give a single feature representation which synthesizes multi-scale, multi-depth, and multi-context information. In order to model further interdependencies between features obtained by different CNN architectures, the resulting concatenated feature space is once again fed into another transformer encoder block. This global attention process enables the ensemble to focus on the most informative aspects of models selectively and ignore redundant or noisy representations. A fully connected classification head on the final transformer-refined ensemble representation is fed with the final probability distribution across liver disease classes.The implementation code for this Hybrid Transformer Ensemble Model has been archived in a secure DOI-linked repository and can be provided by the corresponding author for research use50.
Let represent the feature vectors extracted by the following backbones ResNet50V2, DenseNet121 and MobileNetV2 to be represented following Global Average Pooling and projection to a shared dimension. Their combination in the form of concatenation is known as the consolidated feature stack , which is illustrated in figure (8):
The composite feature vector is then fed into the multi-head self-attention block, to achieve cross -feature dependencies, as outlined in Eq. (1) to (4).
This is followed by layer normalization, and then a multi-head mechanism of self-attention is applied, which detects the relationship between heterogenous features and underlines its contextual importance. The Residual connections and two-layer feed-forward net (Dense-512 → Dropout-0.3 → Dense-256) are used to refine the learnt embeddings to gain more stability and better feature generalisation. The transformer enhanced representation is applied to the pooled CNNs to preserve the original discriminative information in them and to incorporate contextually enriched features. The joint embedding is processed using two additional layers (Dense layers density of 256 each, separated by Dropout-0.5 layers) before it reaches the final SoftMax classification head assigning probability of fatty liver, cirrhosis, and HCC. The hybrid ensemble gives a more detailed and robust analysis of liver CT images, both in terms of multi-scale CNN features and the ability of the transformer to encode inter-feature relationships, outperforming the performance of backbones or independent transformer-enhanced models.

Experimental setup

Experimental setup
In order to achieve model efficiency, a variety of hyperparameters were adjusted when training the deep neural network as illustrated in Table 2. At first, pre-trained models with frozen backbone layers were being utilised, but only the additional classification head was being trained. They gradual loaded the data to achieve patient-level representation and the slice level features aggregated by mean pooling. The model was then fine-tuned with the Adam optimizer having the default settings by unfreezing all the layers and training it over another 20 epochs with a reduced learning rate of 1 × 10⁻⁵. These predictions of each of the pre-trained models were combined to form a single feature vector which was fed into the meta-learner. Specifically, features extracted from ResNet50V2, MobileNetV2, and DenseNet121 were first projected into a common 512-dimensional embedding space and then fused using a Transformer-based attention mechanism. The meta-learner was then trained, which lasted 20 epochs and automatically the learning rate was 1 × 10⁻³, as the pre-trained networks were held in the frozen state.
To combine the deep features of the three CNN backbones, a single-layer Transformer encoder was used. The output of each CNN branch was embedded as a 512-dimensional feature, and it constituted a sequence of three tokens that represented three CNNs. The Transformer encoder was composed of a multi-head self-attention module, which had 4 attention heads, with each attention head having a key/query dimension of 64. This was succeeded by a position-wise feed-forward network (FFN) having a hidden dimension of 256 and ReLU activation. Both the attention sublayer and the feed-forward sublayer were surrounded by layer normalization and residual connection, and the dropout rate of 0.1 was implemented to enhance generalizations. Positional encoding was not utilized, because the Transformer works on unordered CNN streams of features and not sequential data.
Since, the CT data utilized in this paper was composed of several axial slices of each patient, the model predictions were initially created at the slice level, and then the combination of the slice predictions was made to create a final patient level diagnosis. The trained model independently processed each slice, and three probability scores were obtained as liver cirrhosis, fatty liver, and HCC. In order to get a patient-level prediction, the prediction probability of all slices of the same patient were combined together with the help of averaging strategy. In particular, given a class of disease, the probabilities that were forecasted on all slices of a patient were averaged to achieve a single representative confidence value per class. The last patient-level label was then given out as the highest averaged probability class. This aggregation methodology makes sure that the diagnosis can capture data about the entire volumetric CT scan as opposed to isolated or noisy slices. It is especially appropriate to liver diseases (cirrhosis, fatty liver, HCC) in which pathological patterns do not always have a focal point and are not limited to one slice.
Early stopping with a patience of four epochs was employed during both the pre-trained model and meta-learner training phases to mitigate overfitting. Given the multi-class classification nature of the challenge, sparse categorical cross-entropy was chosen as the loss function. The sparse categorical cross-entropy loss L for a single sample in a multi-class classification task with C classes is defined in Eq. (9):
sp is the model’s output score for the true class p, whereas the denominator represents the sum of the exponentials of all output scores, where denotes the unnormalized output score predicted for class , and is the total number of classes. The loss is minimised over the entire training dataset throughout the optimisation procedure.
The value of a batch size was 32, which was set based on the limits of the GPU memory, and all the tests were run on the Kaggle platform using the acceleration of the GPU.

Results and discussion

Results and discussion

Performance metrics
Accuracy, Precision, recall, F1-score and Matthews Correlation Coefficient (MCC) were the measures of the efficacy of the proposed model of classification. These measures are appropriate for a multi-class classification51,52. Classification accuracy is an important performance measure, and it gives an intuitive measure of the overall model accuracy. Accuracy is defined as the ratio of correctly classified instances to the total number of instances, expressed as in equation (10):
where , , , and denote true positives, true negatives, false positives, and false negatives, respectively.
Precision is a measurement that determines how accurately the model is predicting the true positives as compared to the number of instances that the model has given out to be positive and therefore assesses the effectiveness of the model in the process of classifying occurrences in a particular class. Precision can be defined quantitatively as in Eq. 11:
In contrast, recall measures the capability of the model to identify all the relevant cases within a specified class based on the proportion of correctly classifying real positive cases. It is statistically defined (in Eq. 12):
FN depicts false negatives. As expressed in the F1-score in Eq. 13, the F1-score serves as a harmonic mean of accuracy and recall, which is a dual evaluative measure that generates both false positive and false negative predictions simultaneously:
In order to have a more detailed analysis of classification performance, particularly with the class imbalance of our dataset, we also obtained the MCC. In contrast to accuracy, which may be deceptive using misbalanced data, MCC takes into account all four categories of the confusion matrix (true positives, true negatives, false positives, false negatives) and represents a score between −1 and +1, with +1 meaning that it was perfectly predicted, 0 random guessing, and −1 complete disagreement. In the case of multi-class classification, MCC is determined as in Equation (14):
These metrics when used in multi-class classification are computed correspondingly to each label of the classes. The asymmetry in the nature of classes of the dataset make it important to pay attention to the metrical interpretation. Unweighted measurements can give false performance measures in case the distribution of classes is lopsided53,54. Weighted versions of accuracy, recall and F1-score were employed in order to counter this limitation. Instead of using absolute frequencies as a weighting factor as in traditional weighted methods, these weighted measurements make use of a relative frequency of each class as a weighting value, therefore, giving a more accurate measure of the performance of the overall model across the entire label space. The assessment process was enhanced by the use of confusion matrices to systematically offer the important classification outcome namely the true positives (TP), the true negatives (TN), the false positives (FP), and the false negatives (FN). The elements of the confusion matrix are required to calculate the above performance measures as well as to allow the complete assessment of the model and in-depth study of the errors.

Baseline model performance
Three specifically tuned pre-trained transfer learning models (ResNet50V2, DenseNet121, and MobileNetV2) were used to detect and classify liver pathologies (fatty liver, liver cirrhosis, and HCC) and their basic performance was evaluated with the help of accuracy, precision, recall, and F1-score. The overview of assessment results is presented in Table 3 that contains all the relevant factors.

Individual model results
The study revealed that the trained model ResNet50V2 had the highest overall accuracy of classification at 0.82 with a high sensitivity of HCC (recall = 0.95) and an F1-Score of 0.97 in detecting malignant patterns in liver CT images. The model achieved the perfect precision of 1.00 for HCC and fatty liver. However, it showed a lower performance in diagnosing liver cirrhosis with precision of 0.56 and F1-score of 0.72. This indicates possible misdiagnosis, presumably because of overlapping features with other liver diseases. The model’s strong discriminative capability is further supported by a Matthews Correlation Coefficient (MCC) of 0.773, reflecting good overall class-wise correlation. The DenseNet121 had an equalizing efficacy in classifying liver diseases with an accuracy of 0.74. The model showed F1 -scores of 0.73 (liver cirrhosis), 0.64 (fatty liver) and 0.87 (HCC). Both liver cirrhosis and HCC had recall of 1.00 indicating its effectiveness in reducing false negatives in either classes. The recall achieved for liver cirrhosis and fatty liver was quite low at 0.47, indicating that the model is not that sensitive in identifying steatosis changes that at times show subtle changes in the CT scan. The MCC for DenseNet121 was 0.68, indicating moderate correlation across classes. In this study, MobileNetV2 was applied, with its characteristic of a lightweight architecture and computational efficiency, reaching an accuracy of 0.69. Its total performance was relatively weak as compared to the other two models, but it still scored F1-scores of 0.67 in cirrhosis, 0.52 in fatty liver and 0.86 in HCC. Like DenseNet121, MobileNetV2 possessed an excellent recall of 1.00 for cirrhosis and HCC. Nonetheless, its fatty liver recall was the lowest at 0.35 among all models, indicating a tendency to overlook occurrences of steatosis. Nonetheless, the method demonstrates relevance in systems with constrained computational capacity. The MCC of 0.62 reflects its relatively lower but still informative classification consistency. Despite its limitations, MobileNetV2 remains relevant for deployment in resource-constrained environments.

Performance analysis
The three models were comparatively analysed where it was observed that ResNet50V2 is most proficient in the classification of liver diseases within the present study framework. The improved classification accuracy, along with significant recall and F1-score of HCC and fatty liver, indicates the presence of high discrimination ability. The suboptimal performance of ResNet50V2 in the identification and classification of liver cirrhosis suggests that the distinct histopathological or morphological characteristics of cirrhotic tissues may be more challenging to differentiate from other liver diseases, indicating a need for further research or the potential application of enhanced feature extraction techniques.
DenseNet121 exhibited no notable shortcomings as it was consistently categorized accurately. Nonetheless, it was not inferior to the classification that did not exceed ResNet50V2. The elevated recall rate of cirrhosis and HCC indicates it is the most effective means of reducing false negatives, which is crucial in clinical practice since undiagnosed cases may lead to delayed treatment. The diminished capacity to accurately identify fatty liver patients may stem from the prevalence and sometimes low appearance of steatosis in CT images. MobileNetV2 exhibited the poorest performance in diagnosing diseases of the liver due to its reduced model complexity. Nonetheless, it had superior performance in diagnosing HCC relative to other models. The minimal recall score suggests that the difficulties in identifying fatty liver stem from the limitations of superficial models in discerning subtle features. This architecture is a feasible choice for integration into a mobile health application or an embedded device with restricted processing capabilities.
The baseline evaluation indicated that deeper networks, such as ResNet50V2, exhibit superior capabilities in executing specific tasks that need heightened precision and sensitivity in classification. While DenseNet121 demonstrated consistent and traditionally reliable performance, MobileNetV2 may exhibit superior computational attributes within the constraints of its diagnostic capabilities. Figures 6, 7 and 8 illustrate the precision, recall, and F1-Score for each model in the classification of liver diseases.
These trends were also demonstrated by the confusion matrices presented in Fig. 9 with ResNet50V2 having high true positive values of HCC and fatty liver and higher false identifications among cirrhosis and others. The patterns of confusion of DenseNet121 and MobileNetV2 were also similar, as the true positive of fatty liver was lower. According to the training plots shown in Figs. 10, 11 and 12, all models had converged steadily and ResNet50V2 had the lowest validation loss, which matched its high accuracy. DenseNet121 and MobileNetV2 exhibited a marginally but still steady loss curve, an indicator of sufficient but less efficient learning dynamics.

Transformer enhanced model performance
Transformer modules were incorporated into the basic architectures of ResNet50V2, DenseNet121, and MobileNetV2 to enhance the performance of the underlying foundational models to produce three different Transformer-Improved pipelines. Each of the configurations was evaluated on the identical dataset across three target categories: liver cirrhosis, fatty liver, and HCC. All CNN backbones contributed independently in the diagnostic accuracy of the ensemble. ResNet50V2 was the best at hierarchical deep features especially on HCC as its residual learning scheme maintained gradient flow. DenseNet121 was able to feature more reuse and correlate better on cirrhosis due to dense connectivity and had lower false negative rates. Although light, MobileNetV2 offered effective spatial feature extraction which supplemented the deep networks, particularly in the differentiation of subtle steatotic features in fatty liver. The combination of the models created a complementary set of features that mitigated class-specific weaknesses: ResNet50V2 was used to enhance HCC detection, DenseNet121 was used to enhance the sensitivity of cirrhosis, and MobileNetV2 was used to add computational efficiency without a significant decrease in performance. The accuracy of this ensemble is 97%, 15–28% higher than that of individual baselines.
Transformer-Enhanced ResNet50V2 achieved highest accuracy of 93% which implies a significant improvement in both sensitivity and precision. The model was characterized by a great recall of liver cirrhosis and HCC, achieving F1-scores of 0.86 of cirrhosis, 0.92 of fatty liver and 1.00 of HCC, alongside a strong MCC of 0.895, reflecting excellent class-wise correlation and overall model reliability. Transformer -Enhanced DenseNet121 attained an impressive accuracy of 88%. The F1-scores of 1.00, 0.86, and 0.83 for cirrhosis, fatty liver, and HCC, respectively, indicate similar performance across categories. Nonetheless, there was minor imprecision in the steatotic patterns, evidenced by the relatively low recall rate of fatty liver, which was 0.76. Despite promising results, its MCC of 0.61 indicates moderate inter-class consistency, with some confusion observed particularly in distinguishing subtle steatotic patterns, as reflected in the corresponding confusion matrix. Transformer-Enhanced MobileNetV2 obtained computations with an accuracy of 87.00 and F1-scores of 1.00 for cirrhosis, 0.84 for fatty liver, and 0.82 for HCC using a computationally efficient model. Despite its lightweight construction, it demonstrated a lower recall of 0.73 for fatty liver while achieving full recollection for cirrhosis and HCC, in accordance with their baseline limitations. Table 4 indicates that all pre-trained models were effective following transformation and the integration of transformers. The model’s MCC of 0.561, though improved from baseline, suggests persistent challenges in consistently identifying fatty liver cases, as also visible in its confusion matrix.

Transformer impact analysis
After incorporation of Transformers, there was significance improvement observed across all the models. Transformer-Enhanced MobileNetV2 had the highest increment followed by an accuracy increasing from 69% to 87% with an improvement of 18%. The MCC also rose from 0.62 to 0.561, indicating better class alignment despite persistent challenges with fatty liver classification. This demonstrates the significant advantages of integrating context modeling of attention into lightweight systems. The previously established robust baseline accuracy of the Transformer-Enhanced ResNet50V2 exhibited improved accuracy and class equilibrium, with the top F1-score increasing from 0.72 to 0.86 and MCC reaching 0.895 which was the highest among all models. The Transformer-Enhanced ResNet50V2 effectively integrates global context modeling with proficient local feature extraction, enhancing the sensitivity for fatty liver diagnosis and ensuring high precision in identifying HCC. Transformer-Enhanced DenseNet121 exhibited superior consistency across classes, demonstrating elevated accuracy and recall for cirrhosis and HCC, though its MCC of 0.61 suggestssome inter-class disparity concerning fatty liver. The accompanying confusion matrices shown in Fig. 16 corroborate these trends, showing reduced false positives and improved true positive rates for cirrhosis and HCC, while fatty liver continues to pose diagnostic challenges across all architectures. The loss plot illustrated in Figs. 17, 18 and 19 shows smooth and stable training, with no signs of overfitting, indicating effective integration of Transformer modules. Transformer-Enhanced MobileNetV2 utilised its architectural efficiency and partially compensated for its low depth through attention-based global context awareness. Figures 13, 14 and 15 comparatively present the accuracy, recall, and F1-Score of different models.

Transformer-enhanced ensemble performance
An integrated architecture utilizing Transformer-based enhancements was developed, incorporating the models ResNet50V2, DenseNet121, and MobileNetV2, predicated on the performance enhancements observed in the individual models. The ensemble was developed in order to maximize the strengths of all models and reduce their respective weaknesses in individual classes. The integration of attention mechanisms into each backbone enhanced global feature modeling and improved classification performance. The ensemble exhibited an impressive accuracy of 97% and notably exceeded all baseline and individual Transformer models. This remarkable outcome demonstrates the ensemble’s capacity to generalize across intricate and diverse liver diseases depicted in CT images.
An overall class-wise performance analysis reinforces further the efficacy of the ensemble. Table 5 shows that the model achieved a precision of 0.94, a recall of 1.00 and an F1-score of 0.97. The model is clinically relevant since it accurately classified all cases of liver cirrhosis without any false negatives, which is essential in clinical diagnosis since false negatives can severely compromise treatment outcomes. Conversely, the ensemble model recorded a perfect precision of 1.00 and recall of 0.94 of fatty liver indicating that all the likelihoods of fatty liver being predicted as accurate, which eliminates the possibility of misclassification. HCC had a recall of 1.00, precision of 0.95, and an optimum of F1- Score of 0.98 whole within all classes. Further, MCC of 0.955 which is the highest observed in this study reflects outstanding class-wise agreement and model stability, reinforcing the ensemble’s suitability for clinical deployment. This conclusion holds therapeutic significance, since it ensures that no HCC instances were missed by the model, hence making it preferable for early identification and intervention in cancer.
The ensemble’s architecture enhances its exceptional performance and effectively addresses the limitations of individual models by utilizing the global attention mechanisms of Transformers and the hierarchical feature representations required by the backbone models. ResNet50V2 shown strong discriminative learning capabilities and accuracy in diagnosing HCC and fatty liver disease. DenseNet121 improved the reliability of recall performance for HCC and cirrhosis. Although lightweight, Mobile Net V2 demonstrated computational efficiency, and when enhanced with the context-expanding capabilities of the Transformer, it became more adept in memory and attuned to subtle details. The integration of features produced a balanced and precise model competent at handling inter-class variance, image noise, and complex pathological patterns. The confusion matrix in Fig. 20 illustrates these results in detail, showing minimal misclassification across all three disease categories. Notably, the model produced zero false negatives for cirrhosis and HCC which is a critical advantage in clinical settings where missed diagnoses can lead to delayed treatment. For fatty liver, the confusion matrix indicates a low false positive rate, confirming the model’s high specificity.
The ensemble’s strong performance is further supported by its loss curve shown in Fig. 21, which demonstrates stable convergence with no signs of overfitting. The training and validation losses decrease smoothly and remain closely aligned, indicating effective learning and strong generalization.
To provide an efficient automated tool for liver disease diagnosis, the Transformer-enhanced ensemble outperformed all models by addressing the limitations of each architecture and improving overall outcomes through the utilization of complementary functions, as shown in Fig. 22.

Performance ablation study of the proposed architecture
An ablation study was done to evaluate the contribution of every architectural component in a systematic order by sequentially adding transformer modules and ensemble learning to the baseline CNN framework. The summarized quantitative results are presented in Table 6. The default CNN model, which is ResNet50V2 without integrating transformers, had an accuracy of 82% that had a weighted precision of 0.85, recall of 0.87, F1-score of 0.72, and MCC of 0.773. In spite of the fact that this model has shown to be highly sensitive to the cases of HCC and fatty liver, its relatively low F1-score and MCC suggest that it may be weak in terms of dealing with the cases of inter-class ambiguity, especially with respect to diffuse cases like cirrhosis. When transformer blocks were added to the CNN backbone, it was noticed that all the evaluation metrics significantly improved. Transformer-Enhanced ResNet50V2 was found to have an accuracy of 93, weighted precision and recall of 0.92 and 0.94 respectively, and F1-score and MCC of 0.86 and 0.895 respectively. This advancement makes it clear that the transformer module significantly advances the modeling of global contextual features, which enables the network to be more successful in modeling long-range dependencies and subtle variations of texture that cannot be effectively modeled with convolutional layers alone. The combined transformer enhancement and ensemble learning framework gave the best performance with an accuracy of 97, a weighted precision and recall of 0.94 each, an F1-score of 0.97 and an MCC of 0.955. This clearly shows that the ultimate performance improvement is achieved due to the synergistic combination of transformer-based global attention and diversity of multi-model features. Comprehensively, the analysis of ablation proves that, although both transformer integration and ensemble learning are useful in enhancing classification performance, the combination of both approaches is necessary to obtain strong and clinically valid multi-class liver disease classification performance on the basis of CT images.

Statistical significance analysis
Although the given performance indicators (Accuracy, F1-Score, MCC) reflect the effectiveness of our proposed transformer enhanced ensemble model, it is necessary to prove that these values have a statistically significant difference and cannot be explained by random differences in the data splits or training procedure. In this regard, we conducted an McNemar test, a non-parametric test of nominal paired data, which is suitable to compare the classification performance of two models when applied to the same test set. The test is conceptualised on a contingency table of disagreements among two classifiers. The null hypothesis (H0) is that the error rate in both the models is equal. A small p-value (usually less than 0.05) enables us to reject H0 and we conclude that the difference in performance is statistically significant. We compared our final Transformer-Enhanced Ensemble Model to the baseline model with the highest performance (ResNet50V2) and the best-performing single Transformer-Enhanced model (ResNet50V2 + Transformer). Table 7 demonstrates the contingency tables and results.
Comparison between the ensemble and the base ResNet50V2 as shown in Table 7 gave a p-value of 0.0036 which is significantly smaller than the standard significance level of 0.05 thus demonstrating a statistically significant increase in performance. Further, in comparison to the Transformer-Enhanced ResNet50V2 model, the ensemble also had an extra statistically significant improvement, with a p-value of 0.0076. These findings enable the null hypothesis to be rejected with high certainty and prove the observed performance increases do not arise as a result of variation. Altogether, the analysis shows that the suggested hybrid ensemble structure introduces a strong and statistically significant improvement in the accuracy of classification and not a marginal or accidental one.

Model complexity and efficiency analysis
The overall analysis of DL models to be used in clinical settings should take into account not the diagnostic accuracy alone, but also the computational and parametric complexity, which directly affect the feasibility in real-world and resource-constrained systems. As such, three efficiency-related measures have been reported in all models under consideration: (i) model size and memory needs, the number of parameters; (ii) the number of floating point operations per inference (FLOPs), the cost of computation; and (iii) the average time per image inferred by a standard NVIDIA Tesla T4. The relative findings are outlined in Table 8. Transformer block integration provided a comparably small parameter overhead of around 1–2 million parameters per backbone, proving to be an efficient design of attention mechanism. The total parameter count of the hybrid ensemble is expected to be about the sum of three transformer-enhanced backbones, as it would be in a late-fusion architecture. Model complexity was proportional to computational cost as well as inference latency. Interestingly, the Transformer-Enhanced MobileNetV2 had a good tradeoff between efficiency and performance, it required 1.4 GFLOPs and 10.5 ms to run a single inference with an accuracy of 87, which is suitable in the implementation of edges and mobile health applications. Conversely, the Hybrid Ensemble model achieved the highest diagnostic accuracy (97) at 9.6 GFLOPs and 48.9 ms/inference, which is also the highest computational load of the models. This setup is a high-performance level suitable to server-based clinical workstations where the highest diagnostic surety is of primary importance compared to a low latency. Notably, because of the practical limits of batch processing in clinical PACS setting, inference time is sufficiently low to provide support to the viability of the model in the real-world setting.

Discussion

Discussion
In this paper, the assessment of DL models that classify liver diseases with more than 3 classes, fatty liver, liver cirrhosis, and HCC is thoroughly performed. The primary novel contribution is the utilization of transformer modules which significantly enhanced diagnostic accuracy among a variety of pre-trained models. After integration of the transformers, the performance metrics- recall and F1-score showed variations between the models especially in challenging classifications like fatty liver and cirrhosis. High classification accuracy of 97% of ensemble learning approach were achieved by combining three models transformed with the help of transformers, indicating the effectiveness of diverse model representations and attention strategies.

Transformer enhanced resNet50V2
The integration of a transformer module significantly improved ResNet50V2 performance compared to previous models, resulting in exceptional recall for HCC and a notable increase in the F1-Score for cirrhosis (from 0.72 to 0.86). This upgrade is seen in Figs. Figs. 14, 15 and 17, as the model continuously outperforms in all metrics following the Transformer enhancement. The integration of the attention mechanism enhances global contextual understanding while maintaining uninterrupted gradient flow to task layers, hence augmenting the model’s skip connections. Consequently, in scenarios with possible overlapping traits, this framework enhances the distinction of intricate tissue patterns. The increased depth and dual-module architecture result in heightened computing demands, potentially impacting real-time clinical implementation.

Transformer enhanced denseNet121
DenseNet121 provided the optimal equilibrium concerning F1-scores, notable sensitivity, and consistently superior outcomes across all coursework using the Transformer. In addition to cirrhosis and fatty liver sensitivity, the Transformer module markedly improved group classification, as evidenced in Figs. Figs. 14, 15 and 17. The DenseNet121 architecture, with its feature reuse structure, achieved a stable recall rate of 1.00 for cirrhosis and demonstrated exceptional classification of fatty liver and HCC, hence enhancing the overall feature augmentation of the Transformer architecture. Despite its extensive array of integrated capabilities, DenseNet121 utilized minimum memory resources to balance the depth of ResNet with the speed of MobileNet.

Transformer enhanced mobileNetV2
After the introduction of the Transformer, MobileNetV2 showed the greatest relative performance increase, which is apparent in Figs. 14, 15 and 17. Transformer-enhanced model proved to have the greatest change in overall performance implying possible to a significant improvement in attention processes to enhance shallow models with global context modeling at limited depths. However, it is a challenge to maintain efficiency and speed during the process of integrating attention layers. This balance is suitable in case of mobile deployment or edge computing but not in the use of extensive diagnostic pipelines not optimized further.

Ensemble learning analysis
Transformer-enhanced ensemble model was offered with both strengthening qualities of the three models, producing the highest overall accuracy and per-class F1 scores. Important contributions of the model were model diversity: MobileNetV2 demonstrated that small changes did not affect its efficiency, DenseNet121 increased the recall rate of cirrhosis, and ResNet50V2 demonstrated the high performance of its classification. Figures 23, 24 and 25 depict that the ensemble was more accurate and measured using metrics of assessment in comparison to any of the individual models. Combinational approach was able to overcome the weaknesses of models in individual classes. However, it was clear that, whereas the ensemble outperformed the individual parts, the addition of more models beyond this tri-model set-up would not yield proportional performance enhancements to its addition and would increase the processing demands significantly.
All models were tested on the basis of accuracy, precision, recall, F1-score, and Matthews Correlation Coefficient (MCC), which were already documented in Sect. 5. Whereas accuracy is a general index of correctness, precision and recall reflect on clinically useful data about false-positive and false-negative behavior, and MCC offers a balanced approach in the case of the imbalance between classes. As it can be seen based on the reported findings, the baseline CNN models demonstrate moderate accuracy with pronounced heterogeneity among disease categories, especially in those that are diffuse, such as fatty liver and cirrhosis. Transformer block integration continues to increase recall and F1-scores in all backbones, which suggests increased sensitivity and improved global contextual understanding. The hybrid transformer-enhanced ensemble model exhibits the most balanced and better performance in all measures scoring the highest total accuracy and MCC in support of better robustness and the agreement per-class. The increase in architectural complexity in the form of multiple parallel backbones and attention layers come at the expense of this performance gain. The trade-off is, however, reasonable in clinical cases when reliability in diagnosing and minimization of false negatives are paramount. Moreover, the findings suggest that lightweight models with transformers provide an optimal trade-off between computational efficiency and diagnostic performance, which allows deployment flexibility based on the available resources.

Comparison with the state of art

Comparison with the state of art
Comparison and contrast with the latest state-of-the-art techniques as they are summarized in Table 8 show that the specified Transformer-Enhanced Ensemble model can be used to attain a competitive and clinically viable performance in the context of CT-based liver disease classification. Other studies that have been done in the past concentrated on binary or disease specific classification. As an example, Zhang et al.55,57 demonstrated high diagnostic accuracy of fatty liver disease with the help of both CT-based radiomics and deep learning with an AUC of up to 0.973, whereas Gupta et al.58 obtained a classification accuracy of 97% with CNNs on focal liver lesions. Equally, Kilic et al.56 introduced two step RCNN-CNN architecture to distinguish between liver tumor and healthy cases with an accuracy of 0.936 in binary and 0.863 in three way classification, respectively. A more intricate literature has added in clinical variables and multimodal inputs; Guo et al.60 trained a deep learning radiomics model with clinical data on early prediction of HCC in cirrhotic patients which reported AUCs of over 0.92 in discovery and validation models, and Xia et al.61 used a multi-modal CRNN model to predict survival risk in HCC patients with AUCs of 0.777 and 0.704 in their validation and test groups, respectively. Wei et al.62 and Midya et al.63 have also applied large-scale lesion classification, with LiLNet model giving an AUC of 97.2% benign and malignant tumor discrimination, and an overall accuracy of 96% with a modified Inception-v3 network in the case of four-class liver tumor classification. In spite of their good results, most of the existing methods are confined in binary classification, single pathology classification or lesion-level classification. By comparison, the proposed model deals with a multi-class scenario that is more clinically realistic and challenging to address with the simultaneous classification of fatty liver, liver cirrhosis, and hepatocellular carcinoma of CT images in a single framework. The proposed Transformer-Enhanced Ensemble model with an accuracy of 97% and a precision and recall of 0.94 and an F1-score of 0.97 outperforms the traditional CNN-only and radiomics-only methods as shown in Table 9. Moreover, attention that has been applied with transformers allows to effectively represent the information concerning the global context, which is especially useful in differentiating diffuse and overlapping liver pathologies, therefore, improving the quality of the diagnosis and clinical relevance.

Limitations
Although the proposed transformer-enhanced ensemble model shows high performance, there are a few limitations of the study that should be considered when interpreting the results. The primary constraints concern the dataset; despite using established public repositories, the total sample size remains limited and imbalanced across classes, with a predominance of fatty liver cases over cirrhosis and HCC. Moreover, the use of retrospective, publicly available CT scans may also introduce possible selection bias and may not fully reflect the diversity in imaging protocols, scanner models and patient demographics that may be observed in clinical practice, which constrains the external validity and generalizability of the model. Technically, the ensemble method, particularly when extended to transformer modules, is computationally costly, both in training and inference, a fact that may be a barrier to its use in resource-constrained clinical settings. Lastly, as in most DL research, the interpretability of the model decisions, a critical component of clinical trust, was not a concern of this study, and no multi-center dataset external validation was performed. It will be essential to address these points in future studies to develop the model as a solid experimental framework to a clinical decision-support tool that can be implemented.

Future scope

Future scope
This study explains how pre-trained transfer training models with Transformer modules can be used to classify liver diseases with high accuracy and robustness when using CT images. The proposed Transformer-enhanced ensemble architecture proved to show outstanding performance in all the parameters evaluated and highlighted the potential of DL in clinical decision-making of liver disease classification.
The results are encouraging; however, there are technical and practical dimensions that necessitate additional investigation. The ensemble model, because to its multi-class architecture and attention layers, necessitates heightened processing demands for both training and inference. The trade-off is warranted by a significantly enhanced performance. Lightweight Transformer-augmented models, such as MobileNetV2, provide an efficient option with considerable accuracy in resource-constrained environments, making them suitable for mobile implementation or integration into a diagnostic framework.
The study relied on CT scans on one dataset, which, though sufficient as evidence-of-concept, may not reflect the complexity that is observed in clinical practice. The increased training and validation to cover multi-center datasets and other imaging modalities (MRI or ultrasound) could further improve the generalizability. The study focused on the concept of classification; however, subsequent studies could improve disease modeling and longitudinal prediction through the use of multi-modal and temporal data. Future studies may explore explainable AI strategies to improve the interpretability and reliability of models within therapeutic contexts64,65. Techniques such as Grad-CAM and SHAP facilitate the identification of significant regions within an image that influence predictions, hence enhancing transparency. Additionally, model pruning, quantification, and information distillation are architectural changes that might be investigated to facilitate application across diverse healthcare environments66.
Although the proposed ensemble model had a high accuracy of 97% against the tested CT datasets, future research will consider in ensuring the robustness and generalizability of the model across multi-center CT datasets taken using different scanners, imaging protocols, and image quality features. External dataset testing and cross-validation will be considered to test the stability of the model when the resolution, noise, and contrast vary as it usually occurs in the clinical practice. Also, future research will involve examining model performance on clinically challenging cases at edges, such as patients with mixed liver pathologies or early-stage liver diseases, whose visual manifestation on CT scans is subtle. The sensitivity and reliability of these complex diagnostic situations could be further enhanced by incorporating more data sources, finer-grained annotations, or uncertainty-aware predictions.

Conclusion

Conclusion
This study presents, a novel deep-learning framework that applies the CT scan in the process of identifying liver-based diseases, including fatty liver disease, liver cirrhosis, and HCC. By the systematic use of attention-based Transformer modules to model into three popular pre-trained CNN’s (ResNet50V2, DenseNet121, and MobileNetV2), we have shown a substantial and sustained enhancement in all the evaluation measures of the baseline models. Based on these personal improvements, we introduced an improved hybrid ensemble framework, Transformer-based, that had a spectacular overall performance of 97%, and equalized precisions, recalls, and F1-scores in the three disease classes. This group succeeded in exploiting the complementary strengths of every backbone architecture and reducing their respective shortcomings with the contextual model provided by Transformers across the entire world.
The high sensitivity and specificity of the model, especially in the diagnosis of difficult to diagnose problems like cirrhosis and early fatty liver, make this work clinically relevant. These findings indicate that there is a high possibility of integration as a computer-aided diagnostic (CAD) tool in clinical workflows where it can assist radiologists by enhancing diagnostic consistency, lowering false negatives, and providing earlier intervention. Technically, this work is empirical evidence that Transformer-CNN hybrid networks can be useful in capturing both local texture features and global contextual relationships in medical images. It is worth noting that the best relative improvement of 18% in accuracy was achieved when attention mechanisms were integrated into lightweight models such as MobileNetV2, which emphasize the potential of efficient but accurate models to be deployed in resource-constrained systems such as mobile-health applications or edge devices. This study methodologically advances the field through the introduction of an ensemble framework, which can be reproduced, and integrate several Transformer-enhanced backbones into a single diagnostic pipeline. This is not only an improvement in the robustness and generalizability of models, but it also provides a generalized template that is applicable to other multi-class medical imaging tasks beyond hepatology. Although the proposed ensemble model has better diagnostic performance, its computer requirements should be considered when deployment to clinics in real-time. Transformer-enhanced MobileNetV2 provides a viable trade-off between speed and accuracy in environments with a high importance on efficiency, and where it is applicable in point-of-care or telemedicine applications. Subsequent work can consider model compression, quantization, and explainable AI to help make the use of AI more accessible to clinics and more accepted. Overall, the paper adds to the current trends in the field of AI-assisted radiology, showing that CNN’s and Transformers can be synergistically used to analyse medical images. The developed framework not only sets a new standard of classification of liver diseases but also creates a basis of a further study on multimodal integration, the prediction of multiple diseases over time, and clinical validation in actual practice.

출처: PubMed Central (JATS). 라이선스는 원 publisher 정책을 따릅니다 — 인용 시 원문을 표기해 주세요.

🏷️ 같은 키워드 · 무료전문 — 이 논문 MeSH/keyword 기반

PIBAdb: a public cohort of multimodal colonoscopy videos and images including polyps with histological information.
Computer methods and programs in biomedicine 2026 Nogueira-Rodríguez A 외 📖 OA
Automated RECOMIA AI-based total metabolic tumor volume in lymphoma - a retrospective study.
EJNMMI research 2026 Sadik M 외 📖 unpaywall
Dual-stage pulmonary nodule detection in CT scans via cross-layer attention and adaptive multi-scale 3D CNN.
Digital health 2026 Wang L 외 📖 OA
GAN-based bone suppression using a combined loss function.
Frontiers in artificial intelligence 2026 Jochymek L 외 📖 OA
Size-Dependent Performance of Abnormal-Focused U-Net Segmentation for Mammographic Lesion Detection: A Two-Stage Hybrid Training Approach.
Cureus 2026 Huang KA 외 📖 OA
Classification of Pancreatic Cancer and Normal Tissue in 2D and 3D Optical Coherence Tomography Images Using Convolutional Neural Networks: A Comparative Study.
Cancers 2026 Druzenko M 외 📖 unpaywall