Development of a hybrid deep learning-based framework for liver fibrosis classification using ultrasound images.
Adesina AF, Olorunfemi BO, et al. (2026). Development of a hybrid deep learning-based framework for liver fibrosis classification using ultrasound images. iLIVER, 5(1), 100225. https://doi.org/10.1016/j.iliver.2026.100225
PMID: 41756166
Abstract
[BACKGROUND AND AIMS] Liver fibrosis is a progressive accumulation of extracellular matrix proteins with distortion of hepatic architecture and can progress to cirrhosis or hepatocellular carcinoma. Biopsy remains the diagnostic gold standard, however, its invasive nature, sampling error, and cost limit routine use. Ultrasound imaging provides a safer, more accessible option but depends on operator expertise and subjective interpretation. Existing deep learning approaches for fibrosis assessment often rely on small datasets or perform only binary classification. This study aimed to develop a hybrid deep learning model combining ResNet50 and VGG16 for automated multi-class classification (F0-F4), enhancing diagnostic accuracy, reducing biopsy reliance, and supporting affordable, interpretable liver disease assessment.
[METHODS] A total of 6323 ultrasound image samples with METAVIR system labels ranging from F0 to F4 were downloaded from Kaggle. After data preprocessing, the data were split 80:20 into training and testing sets. A hybrid model consisting of fine-tuned ResNet50 and VGG16 was used for classification of fibrosis stages. Model performance was statistically evaluated using sensitivity, specificity, and area under the ROC curve (AUC) for each fibrosis stage, averaging across classes to address imbalance. Robustness and reproducibility were assessed by calculating 95% confidence intervals (CI) for all performance metrics through bootstrap resampling. Grad-CAM was used to interpret the model's predictions.
[RESULTS] The hybrid model achieved the highest testing accuracy at 86.64%, compared with 55.26% for ResNet50 and 72.73% for VGG16. It also achieved the highest macro AUC and weighted AUC, at 96.79% and 97.79%, respectively. The highest predicted probabilities were seen for the F0 and F4 stages, which were correctly classified, with Grad-CAM heatmaps showing strong focus on the fibrotic regions.
[CONCLUSION] The hybrid model achieved strong diagnostic performance with high sensitivity, specificity, and confidence. The Grad-CAM heatmaps confirmed that the model focused on clinically relevant regions, demonstrating its potential as a non-invasive, accurate, and interpretable tool for automated liver fibrosis staging from ultrasound images.
1 Introduction
Liver fibrosis is a common condition that emerges from chronic liver injury caused by viral hepatitis, alcoholic liver disease, non-alcoholic fatty liver disease and parasitic infections 1, 2, 3. It manifests as abnormal deposition of extracellular matrix components such as collagen, which distorts the liver parenchyma, alters hepatic function and, when diagnosis is delayed and treatment is not instituted, progresses to cirrhosis and hepatocellular carcinoma 4, 5, 6. As liver diseases are a growing contributor to global mortality, early and accurate diagnosis of fibrosis is imperative as a major component of clinical hepatology.1,4
In histological studies, liver fibrosis is graded and staged with standardized systems.7 The METAVIR scoring system is one of the most widely adopted systems for grading fibrosis.8 METAVIR is an acronym of “meta-analysis of histological data in viral hepatitis”.9 The system classifies fibrosis into five stages: F0 (no fibrosis), F1 (portal fibrosis without septa), F2 (portal fibrosis with few septa), F3 (numerous septa without cirrhosis), and F4 (cirrhosis).10 Accurate staging is essential for effective therapeutic treatment and monitoring. Liver biopsy has been the traditional standard for assessing the extent of liver fibrosis.11,12 However, its invasiveness, possible complications such as bleeding and infection, sampling error and inter-observer differences have limited its routine use for diagnosis and prognosis in clinical practice, especially in developing nations.13,14
Due to these recurring challenges with traditional methods of diagnosing fibrosis, non-invasive alternatives have gained increasing attention in recent years.13 Image-based techniques, including the most widely used B-mode ultrasound, are preferred because they are safe, available, affordable and provide real-time observation.15,16 However, standard ultrasound remains operator-dependent, which limits sensitivity and specificity in early-stage fibrosis.8,17 Subtle changes in the texture of the hepatic parenchyma can be difficult to detect visually, even for well-experienced radiologists.18 Moreover, interpretation of ultrasound images is subjective, leading to inconsistencies in diagnostic outcomes and constituting a major barrier in fibrosis assessment.8
Artificial intelligence (AI) and deep learning (DL) have emerged as powerful tools in medical image analysis 19, 20, 21. Convolutional neural networks (CNNs) in particular have demonstrated significant success owing to their ability to learn hierarchical feature representations directly from image data without manual feature engineering.22 CNNs have been applied successfully in medical diagnosis, including brain tumor detection, classification of lung disease, diabetic neuropathy grading and liver pathology assessment 23, 24, 25.
In liver disease assessment, CNN methods have been explored for analyzing ultrasound images for fibrosis staging.26,27 Their diagnostic accuracy has been comparable to that of human readers in differentiating fibrosis stages.26 Despite these advances, single-model CNN architectures for fibrosis staging remain susceptible to class imbalance and to the complex textural features characteristic of the intermediate fibrosis stages (F1 to F3), which can lead to misclassification and reduced diagnostic utility.28
To address these challenges, researchers have begun to explore hybrid CNN models in health care.29,30 These methods integrate multiple CNN architectures to enhance feature extraction and model robustness. Hybrid models exploit the strengths of the constituent networks, improving generalization and capturing diverse spatial and contextual features that individual models may overlook.31 The ResNet family of CNNs, for example, is known for residual connections that protect against vanishing gradients in deep networks and is very successful at learning complex representations from large datasets.32 The VGGNet architecture, on the other hand, has a simple, uniform design of stacked small convolutional filters that provides strong low- and mid-level feature extraction.33 Combining these architectures in a hybrid model can therefore improve performance in distinguishing fibrosis stages with subtle differences.34
Recent advances in deep learning have explored transformer models and CNNs for liver disease imaging. Huang et al. proposed a spatio-temporal multiple-stream transformer network that integrates multi-sequence MRI to classify liver lesions with state-of-the-art accuracy.35 In ultrasound-based fibrosis assessment, CNNs such as VGGNet and ResNet have shown strong performance for multi-stage METAVIR classification.36 Zhang et al. developed a high-frequency ultrasound deep learning model that effectively discriminates advanced fibrosis and cirrhosis stages.37 However, few studies address automated multi-class fibrosis staging on ultrasound using hybrid architectures. Our study develops a hybrid ResNet50–VGG16 model for F0 to F4 classification, providing an accessible, non-invasive, and interpretable tool that enhances diagnostic accuracy and reduces reliance on biopsy.
It is becoming increasingly feasible to train and validate models at scale using publicly available ultrasound image datasets labeled with standardized scoring systems such as METAVIR.38 Such datasets enable reproducibility and benchmarking across studies, supporting the development of robust diagnostic tools that can be integrated into real-time clinical practice.39 A summary of related works highlighting advances and gaps in liver fibrosis staging is provided in Table 1.
Despite the significant advances reported with ML and DL methods for non-invasive liver fibrosis staging, several major limitations remain in the existing literature. First, most research has relied on classical ML methods or single CNN models, which are suboptimal in their ability to fully exploit the complexity of ultrasound imaging data and to attain optimal diagnostic accuracy. In addition, much of the existing research has used small-scale institutional data, limiting reproducibility and generalization. The majority of studies have also relied on private image data rather than publicly available data, limiting large-scale validation.
Although previous methods have shown promise in identifying patients with advanced fibrosis, achieving high sensitivity and specificity in patients with early to intermediate disease has proven difficult. Furthermore, these methods have relied heavily on human interpretation and manual feature engineering.
To address these limitations, our study applies a feature-fusion hybrid architecture combining fine-tuned ResNet50 and VGG16 networks for automated classification of liver fibrosis stages from ultrasound images. While hybrid CNNs have been explored in medical imaging, the specific combination of fine-tuned ResNet50 and VGG16 with feature-level fusion for multi-class liver fibrosis staging has not been widely reported. Using a large publicly available Kaggle dataset of 6323 labeled images, we systematically evaluate this approach across all METAVIR stages, with interpretability analysis via Grad-CAM. Our results demonstrate robust performance, particularly in detecting early and advanced fibrosis, providing a validated and interpretable framework for automated liver fibrosis staging.
2 Methods
This paper proposes a hybrid deep learning model for the classification of liver fibrosis stages based on ultrasound images. The proposed model leverages the power of two of the most popular convolutional neural networks, namely ResNet50 and VGG16, such that the network can learn both deep and fine texture patterns of the images, thus providing a more accurate diagnostic system. Before being fed into the networks, the ultrasound images are preprocessed to resize them to 224 × 224 pixels, convert them to grayscale, normalize them, and perform histogram equalization. These operations ensure that the images are of the same size and intensity, thus highlighting the minute textural information necessary for the identification of the stages of liver fibrosis. The images are then split into training and testing sets to enable the training of the model and its subsequent evaluation. The preprocessed images are then fed into both ResNet50 and VGG16, which have been fine-tuned for the analysis of liver ultrasound images. ResNet50 learns deep high-level features via its residual connections, while VGG16 learns fine and mid-level texture information through its simple stacked convolutional layers.
The training process utilizes the Adam optimizer along with a sparse categorical cross-entropy loss function, and early stopping is also employed based on validation loss. Data augmentation is applied only to the training set; augmenting the validation or test data would bias the evaluation and risk data leakage. By combining macro-level feature extraction from ResNet50 with micro-level texture analysis from VGG16, a more informative feature space is created, enabling more effective discrimination between METAVIR stages. Fig. 1 depicts the overall workflow from preprocessing to feature extraction, fusion, and classification.
2.1 Dataset description
The current study utilizes the Liver Histopathology (Fibrosis) Ultrasound Images dataset, obtained from the Kaggle platform (https://www.kaggle.com/code/houssameddinebhe/liver-histopathology-fibrosis-ultrasound-images/input). The dataset was originally collected in clinical research conducted at Seoul St. Mary’s Hospital and Eunpyeong St. Mary’s Hospital, where ultrasound imaging was carried out as part of routine clinical evaluation of liver fibrosis. The rationale for using ultrasound imaging in that research rests on the modality’s safety, non-invasiveness, and accessibility, coupled with the absence of radiation, which makes it particularly appropriate for long-term monitoring of patients with liver disease. The dataset includes high-resolution images obtained during routine clinical examination using B-mode ultrasound. The images are annotated according to the METAVIR scoring system, a widely accepted system for assessing the degree of liver fibrosis, which classifies liver fibrosis into five grades: F0 (no fibrosis), F1 (portal fibrosis), F2 (periportal fibrosis), F3 (bridging fibrosis), and F4 (cirrhosis), as depicted in Fig. 2(A–E). According to the dataset description, the image labels are derived from clinical data in conformity with the METAVIR scoring system; however, details of the annotation protocol are not provided. The publicly available dataset includes only the class labels for the images; no patient information is provided.
The dataset has a hierarchical structure that includes image-level data, and the fibrosis stages are imbalanced. The extreme stages, F0 (healthy) and F4 (cirrhosis), are the best represented; stage F4 alone accounts for 27% of the dataset. The intermediate stages F1, F2, and F3 are less well represented, contributing about 13% each. This imbalance reflects real-world clinical practice, since cirrhosis is easier to diagnose, whereas early and intermediate fibrosis is hard to identify on ultrasound imaging. Although the imbalance complicates classification of the subtle variations between consecutive fibrosis stages, it increases the clinical value of the dataset.
2.2 Preprocessing
The preprocessing steps are illustrated in Fig. 1. All steps were carried out within the Anaconda environment (Version 2023.03, Anaconda Inc., USA) using Jupyter Notebook and Python (Version 3.10, Python Software Foundation, USA). The images were first resized to a fixed resolution of 224 × 224 pixels to ensure uniform input dimensions. Noise reduction was then applied by filtering the images, since noise could adversely affect the performance of the CNN model. To increase the number of training samples, data augmentation was employed, applying operations such as rotation, flipping, and zooming. These augmentations were performed online during training, and the augmented images were then provided as input to the CNN model.
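The preprocessing steps above (resizing to 224 × 224, grayscale handling, histogram equalization, and normalization, as described in the Methods overview) can be sketched in plain numpy. This is an illustrative reconstruction, not the authors' code; the function names, the nearest-neighbour resize, and the synthetic test image are our own assumptions.

```python
import numpy as np

def histogram_equalize(gray):
    """Equalize an 8-bit grayscale image (H, W) via its cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    # Map intensities so the output histogram is approximately uniform.
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255), 0, 255)
    return lut.astype(np.uint8)[gray]

def preprocess(image, size=224):
    """Resize (crude nearest-neighbour), equalize, and normalize to [0, 1]."""
    h, w = image.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows][:, cols]
    equalized = histogram_equalize(resized)
    return equalized.astype(np.float32) / 255.0

# Synthetic low-contrast "ultrasound" frame for demonstration.
rng = np.random.default_rng(0)
img = rng.integers(100, 156, size=(480, 640), dtype=np.uint8)  # narrow intensity range
out = preprocess(img)
print(out.shape, out.min(), out.max())  # equalization stretches intensities to [0, 1]
```

In production one would typically use library routines (e.g. OpenCV's resizing and equalization) rather than this hand-rolled version; the sketch only makes the intensity mapping explicit.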
2.3 Handling class imbalance
The dataset exhibits class imbalance, with the extreme fibrosis stages (F0 and F4) being overrepresented, while intermediate stages (F1–F3) are less frequent. Specifically, it includes 2114 images of F0, 861 images of F1, 793 images of F2, 857 images of F3, and 1698 images of F4. To mitigate this during training, class weights were applied in the loss function, calculated inversely proportional to class frequencies.
The weight for each class was computed as shown in Equation (1) below:

w_c = N / (K · n_c)        (1)

where:
• w_c is the weight assigned to class c,
• N is the total number of training samples,
• K is the total number of classes,
• n_c is the number of samples in class c.
These class weights were incorporated into the loss function during model training, ensuring that misclassification of underrepresented classes was penalized more heavily.
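The weighting scheme above can be computed directly from the reported image counts. The sketch below uses the full dataset counts for simplicity (the paper applies the weights to the training split); the variable names are our own.

```python
# Class weights inversely proportional to class frequency: w_c = N / (K * n_c).
counts = {"F0": 2114, "F1": 861, "F2": 793, "F3": 857, "F4": 1698}
N = sum(counts.values())   # 6323 images in total
K = len(counts)            # 5 METAVIR stages

weights = {c: N / (K * n) for c, n in counts.items()}
for stage, w in weights.items():
    print(f"{stage}: n={counts[stage]}, weight={w:.3f}")
# Underrepresented stages (F1-F3) receive weights > 1, so their
# misclassifications contribute more to the weighted loss.
```

Frameworks offer equivalent helpers (e.g. scikit-learn's "balanced" class-weight heuristic uses the same formula); such a dictionary can be passed to the training loop as per-class loss weights.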
2.4 Model architecture
The proposed framework utilizes two state-of-the-art convolutional neural networks (CNNs), namely ResNet50 and VGG16, to obtain discriminative features from liver ultrasound images. The selection of these networks is based on the fact that VGG16 is capable of learning fine textures through sequential convolutional layers, while ResNet50 utilizes residual connections to enable the learning of deeper abstract features by avoiding the gradient vanishing problem. Instead of proposing a novel feature fusion technique, this paper rigorously applies and evaluates the feature-level fusion technique to combine the complementary features learned from the two networks, with the aim of enhancing the classification robustness and performance metrics such as accuracy and AUC, which are essential in liver fibrosis diagnosis. Feature-level concatenation was adopted as a fusion strategy because it preserves the full representational capacity of each backbone network, enabling the model to jointly exploit complementary high-level semantic features from ResNet50 and fine-grained texture representations from VGG16, which is particularly important for ultrasound-based tissue characterization.
2.4.1 ResNet50
As a first step, transfer learning with a ResNet50 model was performed to obtain a baseline for the classification task. All ultrasound images were resized to 224 × 224 pixels. The ResNet50 weights were initialized from the ImageNet dataset, and the top layers were replaced by a custom head consisting of a Global Average Pooling layer, fully connected layers with dropout to prevent overfitting, and a Softmax layer for classification into the classes F0 to F4. During training, the base layers of ResNet50 were frozen, as they encode general visual features learned from ImageNet, and only the new top layers were trained. Inputs were normalized, and training used the Adam optimizer with a sparse categorical cross-entropy loss function. Early stopping based on the validation loss was used to prevent overfitting.
2.4.2 VGG16
In addition to ResNet50, the VGG16 architecture was also used for evaluation to further test the transfer learning approach for the classification of liver fibrosis. As was done in the previous experiment, all ultrasound images were resized to 224 × 224 pixels, and the pre-trained VGG16 model was employed, initialized with weights from the ImageNet dataset. The convolutional layers were set to be frozen, while new custom layers were added, including the flatten layer, fully connected layers with dropout regularization, and the Softmax classifier for classifying the different fibrosis stages F0, F1, F2, F3, and F4. The Adam optimizer was used for training, while sparse categorical cross-entropy was used as the loss function, together with early stopping to prevent overfitting during training by monitoring the validation loss.
2.4.3 Hybridization of ResNet50 and VGG16
To enhance the classification accuracy of liver fibrosis stages, a hybrid CNN model was proposed that integrates the two pre-trained networks, ResNet50 and VGG16, in a single architecture. ResNet50 was employed to extract macro-structural features, while VGG16 extracted micro-textural features from the ultrasound images. The preprocessed images were provided as input to both networks concurrently. The ResNet50 output was reduced in dimension by global average pooling, while the VGG16 output was flattened to retain textural information. The outputs of the two networks were concatenated and passed to dense layers with dropout regularization, and final classification over the five fibrosis stages, F0 to F4, was performed by a Softmax classifier. The model was trained for a maximum of 10 epochs using the Adam optimizer with early stopping to prevent overfitting. The resulting system combines both macro and micro information from the input image; the hybrid architecture merging ResNet50 and VGG16 is shown in Fig. 3.
2.4.4 Feature fusion and hybrid model details
A late feature fusion strategy is used to leverage the complementary strengths of both networks.
(1) ResNet50 branch: produces a feature map that is condensed via global average pooling into a 2048-dimensional vector.
(2) VGG16 branch: outputs a flattened 4096-dimensional vector, preserving detailed texture information.
(3) Fusion: the two vectors are concatenated into a 6144-dimensional hybrid feature vector, creating a richer representation combining global and local features.
(4) Fully connected layers: the fused vector passes through two dense layers with 1024 and 512 neurons, each followed by ReLU activation and a dropout rate of 0.5 to prevent overfitting.
(5) Output layer: a Softmax layer classifies the image into one of five fibrosis stages (F0–F4).
Unless otherwise specified, ResNet50, VGG16, and the hybrid model were trained for up to 10 epochs with a batch size of 32, using the Adam optimizer with an initial learning rate of 1 × 10−4 and sparse categorical cross-entropy loss. Data augmentation (random rotations ± 15°, horizontal flips, zoom 0.9–1.1, and width/height shifts ± 10%) was applied to the training set only. Early stopping (patience = 10 epochs) based on validation loss was used, and class weights were applied to alleviate class imbalance between fibrosis stages. The architecture of the proposed model is illustrated in Fig. 4.
2.5
Evaluation metrics
In order to effectively assess the performance of the proposed model, some of the commonly used classification evaluation metrics have been utilized. These evaluation metrics include accuracy, class-wise sensitivity/recall as seen in Equation (2), specificity, precision, F1-score, and area under the receiver operating characteristic curve (AUC). Accuracy is defined as the ratio of correctly classified instances to the total number of instances, which effectively represents the model’s predictive capability. Class-wise sensitivity/recall is defined as the ratio of correctly predicted positive instances to the total number of actual positive instances, effectively indicating the model’s predictive capability for each stage of fibrosis:
Equation (3) was utilized for computing the specificity as the proportion of true negative predictions among all actual negative cases, quantifying the model’s capacity to correctly exclude non-target classes and reduce false positive assignments:
Precision represents the proportion of true positive predictions among all instances classified as positive, indicating the reliability of the model’s positive predictions and was calculated by using Equation (4) below:
The F1-score, defined as the harmonic mean of precision and recall, was used to provide a balanced evaluation of the model’s performance, particularly in the presence of class imbalance as shown in Equation (5):
The area under the curve of the receiver operating characteristic was used for evaluating discriminative capability of the model. For the multi-class classification of fibrosis stages ranging from F0 to F4, macro-averaged AUC as well as weighted AUC was reported for a robust evaluation of the model in imbalanced class problems.
2.6
Gradient-weighted Class Activation Mapping (Grad-CAM)
The Gradient-weighted Class Activation Mapping (Grad-CAM) technique was used for the explanation of the predictions obtained by the ResNet50, VGG16, and Hybrid models. The Grad-CAM heat maps were extracted from the last convolutional layer and superimposed on the original images of the histology dataset using a standard color map and transparency level. This technique helped in including the correctly classified and incorrectly classified images of the test dataset, thus providing information about the performance of the models on both types of images. The comparison between the actual class labels and the prediction outcomes helped in identifying the correctness of the predictions. The publication-ready class labels with information about the actual class, predicted class, and correctness of classification were automatically generated. The normalization of the Grad-CAM heat maps was done using a standard scale factor.
2.7
Statistical Methods
For each model and fibrosis stage, sensitivity and specificity were calculated from test set confusion matrices using scikit-learn. To estimate statistical robustness, 95% confidence intervalsFor each model and fibrosis stage, sensitivity and specificity were calculated from test set confusion matrices using scikit-learn. To estimate statistical robustness, 95% confidence intervals (CI) were derived for these metrics via bootstrap resampling: the test set was resampled with replacement 1,000 times, and sensitivity and specificity computed for each iteration. The results are reported as mean ± standard deviation (SD), approximating the empirical CI from the bootstrap distributions. Statistical significance of differences in sensitivity across fibrosis stages and models was assessed using one-way analysis of variance (ANOVA) performed in Python with scipy. All statistical analyses, including metric derivation, bootstrap resampling, and ANOVA were conducted using scikit-learn, numpy, and scipy libraries to ensure reproducibility and transparency.
Methods
This paper proposes a hybrid deep learning model for the classification of liver fibrosis stages based on ultrasound images. The proposed model leverages the power of two of the most popular convolutional neural networks, namely ResNet50 and VGG16, such that the network can learn both deep and fine texture patterns of the images, thus providing a more accurate diagnostic system. Before being fed into the networks, the ultrasound images are preprocessed to resize them to 224 × 224 pixels, convert them to grayscale, normalize them, and perform histogram equalization. These operations ensure that the images are of the same size and intensity, thus highlighting the minute textural information necessary for the identification of the stages of liver fibrosis. The images are then split into training and testing sets to enable the training of the model and its subsequent evaluation. The preprocessed images are then fed into both ResNet50 and VGG16, which have been fine-tuned for the analysis of liver ultrasound images. ResNet50 learns deep high-level features via its residual connections, while VGG16 learns fine and mid-level texture information through its simple stacked convolutional layers.
The training process utilizes the Adam optimizer along with a sparse categorical cross-entropy loss function, and early stopping is also employed based on validation loss. Data augmentation is applied only to the training set, since augmenting the validation or test sets would cause data leakage and inflate performance estimates. By combining macro-level feature extraction from ResNet50 with micro-level texture analysis from VGG16, a more informative feature space is created, enabling more effective discrimination between the METAVIR stages. Fig. 1 depicts the overall workflow from preprocessing to feature extraction, fusion, and classification.
2.1
Dataset description
The current study utilizes the Liver Histopathology (Fibrosis) Ultrasound Images dataset, obtained from the Kaggle platform (https://www.kaggle.com/code/houssameddinebhe/liver-histopathology-fibrosis-ultrasound-images/input). The dataset originates from clinical research conducted at Seoul St. Mary’s Hospital and Eunpyeong St. Mary’s Hospital, where ultrasound imaging was performed as part of the routine clinical evaluation of liver fibrosis. Ultrasound was chosen for its safety, non-invasiveness, and accessibility, and because it involves no ionizing radiation, which makes it particularly appropriate for long-term monitoring of patients with liver disease. The dataset comprises high-resolution images acquired during routine clinical examination using B-mode ultrasound. The images are annotated according to the METAVIR scoring system, a widely accepted system for grading the degree of liver fibrosis into five stages: F0 (no fibrosis), F1 (portal fibrosis), F2 (periportal fibrosis), F3 (bridging fibrosis), and F4 (cirrhosis), as depicted in Fig. 2(A–E). According to the dataset description, the labels are derived from clinical data in conformity with the METAVIR scoring system; however, details of the annotation protocol are not provided. The publicly available dataset includes only the class labels for the images; no patient-level information is provided.
The dataset has a hierarchical structure that includes image-level data, with a pronounced sample imbalance across fibrosis stages. The extreme stages, F0 (healthy) and F4 (cirrhosis), are the best represented; F4 alone accounts for 27% of the dataset, whereas the intermediate stages F1, F2, and F3 contribute about 13% each. This imbalance reflects real-world clinical practice, since cirrhosis is comparatively easy to diagnose, whereas early and intermediate fibrosis is hard to identify using ultrasound imaging. Although the imbalance, combined with the subtle differences between consecutive stages, complicates classification, it increases the clinical value of this dataset.
2.2
Preprocessing
The preprocessing steps are illustrated in Fig. 1. All steps were carried out within the Anaconda environment (Version 2023.03, Anaconda Inc., USA) using Jupyter Notebook and Python (Version 3.10, Python Software Foundation, USA). First, all images were resized to a fixed resolution of 224 × 224 pixels to ensure uniform input dimensions. Next, noise reduction filtering was applied, since noise can adversely affect the performance of the CNN model. Finally, to enlarge the effective training set, online data augmentation was employed, applying random rotations, flips, and zooms to the training images before they were fed to the CNN model.
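As an illustration of the intensity-related steps, the following is a minimal numpy sketch of histogram equalization followed by [0, 1] normalization (the resizing and the unspecified noise filter would precede this; the exact implementation used in the study is not given):

```python
import numpy as np

def equalize_histogram(img: np.ndarray) -> np.ndarray:
    """Histogram-equalize an 8-bit grayscale image (H, W) -> (H, W) uint8."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                     # first non-zero CDF value
    # Map each grey level through the normalized CDF (clipped for safety).
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[img]

def preprocess(img: np.ndarray) -> np.ndarray:
    """Equalize, then scale intensities to [0, 1] as float32."""
    return equalize_histogram(img).astype(np.float32) / 255.0

# Low-contrast toy "ultrasound" patch: values confined to [100, 140].
rng = np.random.default_rng(0)
patch = rng.integers(100, 141, size=(224, 224), dtype=np.uint8)
out = preprocess(patch)
```

After equalization the narrow intensity band is stretched to the full [0, 1] range, which is what makes subtle texture differences between fibrosis stages more visible to the network.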
2.3
Handling class imbalance
The dataset exhibits class imbalance, with the extreme fibrosis stages (F0 and F4) being overrepresented, while intermediate stages (F1–F3) are less frequent. Specifically, it includes 2114 images of F0, 861 images of F1, 793 images of F2, 857 images of F3, and 1698 images of F4. To mitigate this during training, class weights were applied in the loss function, calculated inversely proportional to class frequencies.
The weight for each class was computed as shown in Equation (1) below:

w_c = N / (K × n_c)  (1)

where:
• w_c is the weight assigned to class c,
• N is the total number of training samples,
• K is the total number of classes,
• n_c is the number of samples in class c.
These class weights were incorporated into the loss function during model training, ensuring that misclassification of underrepresented classes was penalized more heavily.
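Equation (1) corresponds to the standard inverse-frequency ("balanced") weighting heuristic. A sketch using the class counts listed above (note that these are whole-dataset counts, whereas the paper computes the weights on the training split):

```python
import numpy as np

# Per-class image counts reported in Section 2.3 (F0..F4).
counts = np.array([2114, 861, 793, 857, 1698])
N, K = counts.sum(), len(counts)        # total samples, number of classes

# Equation (1): w_c = N / (K * n_c) — inversely proportional to frequency.
weights = N / (K * counts)
class_weight = {c: float(w) for c, w in enumerate(weights)}
```

Underrepresented stages (e.g., F2) receive weights above 1, so their misclassification contributes more to the loss, while the abundant F0 and F4 classes are down-weighted below 1.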
2.4
Model architecture
The proposed framework utilizes two state-of-the-art convolutional neural networks (CNNs), namely ResNet50 and VGG16, to obtain discriminative features from liver ultrasound images. The selection of these networks is based on the fact that VGG16 is capable of learning fine textures through sequential convolutional layers, while ResNet50 utilizes residual connections to enable the learning of deeper abstract features by avoiding the gradient vanishing problem. Instead of proposing a novel feature fusion technique, this paper rigorously applies and evaluates the feature-level fusion technique to combine the complementary features learned from the two networks, with the aim of enhancing the classification robustness and performance metrics such as accuracy and AUC, which are essential in liver fibrosis diagnosis. Feature-level concatenation was adopted as a fusion strategy because it preserves the full representational capacity of each backbone network, enabling the model to jointly exploit complementary high-level semantic features from ResNet50 and fine-grained texture representations from VGG16, which is particularly important for ultrasound-based tissue characterization.
2.4.1
ResNet50
As a first step, transfer learning using a ResNet50 model was performed to obtain a baseline for the classification task. To use it, all ultrasound images were resized to 224 × 224 pixels. The weights from the ImageNet dataset were used to initialize the weights of the ResNet50 model, and the top layers were replaced by a custom network. It consists of a Global Average Pooling layer, fully connected layers that include dropout to prevent overfitting, and a Softmax layer for classification into different classes, from F0 to F4. During training, the base layers of the ResNet50 model were frozen, as they contain a lot of information from the ImageNet dataset, and only the top layers were allowed to learn. Normalization was performed, and training was done using the Adam optimizer and a sparse categorical cross-entropy loss function. To prevent overfitting, early stopping was performed based on the validation loss.
2.4.2
VGG16
In addition to ResNet50, the VGG16 architecture was also used for evaluation to further test the transfer learning approach for the classification of liver fibrosis. As was done in the previous experiment, all ultrasound images were resized to 224 × 224 pixels, and the pre-trained VGG16 model was employed, initialized with weights from the ImageNet dataset. The convolutional layers were set to be frozen, while new custom layers were added, including the flatten layer, fully connected layers with dropout regularization, and the Softmax classifier for classifying the different fibrosis stages F0, F1, F2, F3, and F4. The Adam optimizer was used for training, while sparse categorical cross-entropy was used as the loss function, together with early stopping to prevent overfitting during training by monitoring the validation loss.
2.4.3
Hybridization of ResNet50 and VGG16
In order to enhance the classification accuracy of liver fibrosis stages, a hybrid CNN model was proposed by integrating two pre-trained networks, ResNet50 and VGG16, within a single framework. Leveraging the strengths of both networks, ResNet50 was employed to extract macro-structural features, while VGG16 extracted micro-textural features from the ultrasound images. The pre-processed images were provided as input to both networks concurrently. The output of ResNet50 was reduced in dimension using global average pooling, while the output of VGG16 was flattened to retain textural information. The outputs of both networks were combined and fed into dense layers with dropout regularization, and the final classification over the five fibrosis stages, F0 to F4, was carried out using a Softmax classifier. The model was trained for a maximum of 10 epochs using the Adam optimizer with early stopping to prevent overfitting. This system thus combines both macro and micro information from the input image. The hybrid architecture merging ResNet50 and VGG16 is shown in Fig. 3.
2.4.4
Feature fusion and hybrid model details
A late feature fusion strategy is used to leverage the complementary strengths of both networks.
(1) ResNet50 branch: Produces a feature map that is condensed via global average pooling into a 2048-dimensional vector.
(2) VGG16 branch: Outputs a flattened 4096-dimensional vector, preserving detailed texture information.
(3) Fusion: The two vectors are concatenated into a 6144-dimensional hybrid feature vector, creating a richer representation combining global and local features.
(4) Fully connected layers: The fused vector passes through two dense layers with 1024 and 512 neurons, each followed by ReLU activation and a dropout rate of 0.5 to prevent overfitting.
(5) Output layer: A Softmax layer classifies the image into one of five fibrosis stages (F0–F4).
Unless otherwise specified, ResNet50, VGG16, and the hybrid model were trained for up to 10 epochs with a batch size of 32, using the Adam optimizer with an initial learning rate of 1 × 10−4 and sparse categorical cross-entropy loss. Data augmentation (random rotations ± 15°, horizontal flips, zoom 0.9–1.1, and width/height shifts ± 10%) was applied to the training set only. Early stopping (patience = 10 epochs) based on validation loss was used, and class weights were applied to alleviate class imbalance between fibrosis stages. The architecture of the proposed model is illustrated in Fig. 4.
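The fusion head described in steps (1)–(5) can be sketched as a pure-numpy forward pass; random arrays stand in for the trained backbone outputs and dense-layer weights, so this illustrates only the shapes and operations, not the learned behavior:

```python
import numpy as np

rng = np.random.default_rng(42)

def dense(x, w, b):
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    z = np.exp(x - x.max())           # subtract max for numerical stability
    return z / z.sum()

# Stand-ins for the two backbone outputs on one image.
resnet_fmap = rng.standard_normal((7, 7, 2048))  # ResNet50 final feature map
vgg_vec = rng.standard_normal(4096)              # VGG16 flattened texture vector

# (1) Global average pooling -> 2048-d; (2)-(3) concatenate -> 6144-d fused vector.
gap = resnet_fmap.mean(axis=(0, 1))
fused = np.concatenate([gap, vgg_vec])           # shape (6144,)

# (4)-(5) Two ReLU dense layers (dropout is active only at training time),
# then a 5-way Softmax over stages F0-F4.
W1, b1 = rng.standard_normal((6144, 1024)) * 0.01, np.zeros(1024)
W2, b2 = rng.standard_normal((1024, 512)) * 0.01, np.zeros(512)
W3, b3 = rng.standard_normal((512, 5)) * 0.01, np.zeros(5)
probs = softmax(dense(relu(dense(relu(dense(fused, W1, b1)), W2, b2)), W3, b3))
```

Concatenation (rather than averaging or attention-based mixing) keeps each backbone's representation intact, leaving it to the dense layers to learn how to weight global versus textural evidence.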
2.5
Evaluation metrics
To assess the performance of the proposed model, commonly used classification evaluation metrics were utilized: accuracy, class-wise sensitivity/recall as seen in Equation (2), specificity, precision, F1-score, and the area under the receiver operating characteristic curve (AUC). Accuracy is the ratio of correctly classified instances to the total number of instances, representing the model’s overall predictive capability. Class-wise sensitivity/recall is the ratio of correctly predicted positive instances to the total number of actual positive instances, indicating the model’s predictive capability for each stage of fibrosis:

Sensitivity/Recall = TP / (TP + FN)  (2)

Equation (3) was used to compute specificity as the proportion of true negative predictions among all actual negative cases, quantifying the model’s capacity to correctly exclude non-target classes and reduce false positive assignments:

Specificity = TN / (TN + FP)  (3)

Precision represents the proportion of true positive predictions among all instances classified as positive, indicating the reliability of the model’s positive predictions, and was calculated using Equation (4):

Precision = TP / (TP + FP)  (4)

The F1-score, defined as the harmonic mean of precision and recall, provides a balanced evaluation of the model’s performance, particularly in the presence of class imbalance, as shown in Equation (5):

F1-score = 2 × (Precision × Recall) / (Precision + Recall)  (5)

The area under the receiver operating characteristic curve was used to evaluate the discriminative capability of the model. For the multi-class classification of fibrosis stages F0 to F4, both macro-averaged and weighted AUC were reported for a robust evaluation under class imbalance.
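Equations (2)–(5) can all be computed directly from a multi-class confusion matrix by treating each stage one-vs-rest. A small sketch with a hypothetical 3-class matrix:

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """Sensitivity, specificity, precision, F1 per class from a KxK confusion
    matrix (rows = true stage, columns = predicted stage)."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = cm.sum() - tp - fn - fp
    sens = tp / (tp + fn)                  # Eq. (2): recall per stage
    spec = tn / (tn + fp)                  # Eq. (3)
    prec = tp / (tp + fp)                  # Eq. (4)
    f1 = 2 * prec * sens / (prec + sens)   # Eq. (5)
    return sens, spec, prec, f1

# Hypothetical 3-class confusion matrix for illustration.
cm = np.array([[50, 5, 0],
               [4, 40, 6],
               [1, 3, 46]])
sens, spec, prec, f1 = per_class_metrics(cm)
acc = np.diag(cm).sum() / cm.sum()         # overall accuracy
```

Macro-averaging then takes the unweighted mean of the per-class values, while weighted averaging weights each class by its support, matching the two AUC summaries reported in the paper.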
2.6
Gradient-weighted Class Activation Mapping (Grad-CAM)
The Gradient-weighted Class Activation Mapping (Grad-CAM) technique was used to explain the predictions of the ResNet50, VGG16, and Hybrid models. Grad-CAM heat maps were extracted from the last convolutional layer of each model and superimposed on the original ultrasound images using a standard color map and transparency level. Both correctly and incorrectly classified test images were visualized, providing insight into model behavior in each case. Comparing the actual class labels with the predicted labels identified the correctness of each prediction, and publication-ready labels stating the actual class, predicted class, and correctness of classification were generated automatically. The Grad-CAM heat maps were normalized using a standard scale factor.
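A minimal numpy sketch of the Grad-CAM computation itself (in practice the activations and gradients come from the trained network's last convolutional layer; random arrays stand in here for illustration):

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heat map from last-conv-layer activations (H, W, C) and the
    gradients of the target class score w.r.t. those activations."""
    # Channel importance weights: global-average-pool the gradients.
    alpha = gradients.mean(axis=(0, 1))                  # shape (C,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum((feature_maps * alpha).sum(axis=-1), 0.0)
    # Normalize to [0, 1] for overlay on the original image.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

rng = np.random.default_rng(7)
fmap = rng.random((7, 7, 512))               # stand-in activations
grads = rng.standard_normal((7, 7, 512))     # stand-in gradients
heatmap = grad_cam(fmap, grads)              # coarse 7x7 map, upsampled
                                             # to image size before overlay
```

The resulting coarse map is bilinearly upsampled to 224 × 224 and blended with the ultrasound image, which is what produces the overlays shown in the results.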
2.7
Statistical Methods
For each model and fibrosis stage, sensitivity and specificity were calculated from test set confusion matrices using scikit-learn. To estimate statistical robustness, 95% confidence intervals (CI) were derived for these metrics via bootstrap resampling: the test set was resampled with replacement 1000 times, and sensitivity and specificity were computed for each iteration. Results are reported as mean ± standard deviation (SD), approximating the empirical CI from the bootstrap distributions. Statistical significance of differences in sensitivity across fibrosis stages and models was assessed using one-way analysis of variance (ANOVA) performed in Python with scipy. All statistical analyses, including metric derivation, bootstrap resampling, and ANOVA, were conducted using the scikit-learn, numpy, and scipy libraries to ensure reproducibility and transparency.
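The bootstrap procedure can be sketched as follows; the toy labels are hypothetical, and the ANOVA step described above would apply scipy.stats.f_oneway to the per-stage bootstrap samples:

```python
import numpy as np

def bootstrap_sensitivity(y_true, y_pred, stage, n_boot=1000, seed=0):
    """Bootstrap the per-stage sensitivity by resampling the test set with
    replacement; returns the bootstrap mean, SD, and percentile 95% CI."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample indices with replacement
        t, p = y_true[idx], y_pred[idx]
        pos = t == stage
        if pos.any():                        # skip resamples with no positives
            stats.append((p[pos] == stage).mean())
    stats = np.array(stats)
    ci = np.percentile(stats, [2.5, 97.5])
    return stats.mean(), stats.std(), ci

# Toy test set: stage 2 is detected in 8 of 10 true stage-2 cases.
y_true = [2] * 10 + [0] * 40
y_pred = [2] * 8 + [0] * 2 + [0] * 40
mean, sd, ci = bootstrap_sensitivity(y_true, y_pred, stage=2)
```

Reporting mean ± SD alongside the percentile interval, as done in the paper, gives both a point estimate and an empirical spread for each stage's sensitivity.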
3
Results
To ensure uniformity of the model inputs, preprocessing was applied to all images. As depicted in Fig. 5, raw ultrasound scans vary in size and appearance. The images were therefore resized to a uniform resolution, normalized so that pixel intensities lie within [0, 1] to aid model convergence, and histogram-equalized to make the relevant features more visible.
The dataset contained images annotated according to fibrosis stages ranging from F0 to F4. Fig. 6 illustrates the distribution of samples across these classes; as described in Section 2.3, this distribution is imbalanced, which was addressed through class weighting during training.
3.1
ResNet50 baseline results
Fig. 7 illustrates the training and validation performance curves for the baseline ResNet50. During the initial epoch, the model achieved a training accuracy of 36.69% and a validation accuracy of 33.44%. With successive epochs, model performance improved steadily; by the tenth epoch, training accuracy reached 51.25%, and validation accuracy peaked at 55.26%. The validation loss also decreased consistently, dropping from 11.19 at the first epoch to 1.91 by the final epoch.
3.2
VGG16 results
Fig. 8 illustrates the training and validation performance curves for VGG16. In the first epoch, VGG16 exhibited a training accuracy of 46.78% and a validation accuracy of 58.58%, with a corresponding training loss of 1.24. Both accuracy and loss improved as training progressed; by the tenth epoch, training accuracy increased to 60.32% and validation accuracy rose to 71.54%, with validation loss falling from 0.91 to 0.64. These improvements indicate that VGG16 demonstrated greater stability and discriminative capability on this dataset than ResNet50, motivating further investigation with more advanced frameworks.
3.3
Hybrid CNN results
Fig. 9 presents the training and validation curves. In the first epoch, the model achieved a training accuracy of 53.90% and a validation accuracy of 65.40%. Performance improved steadily across epochs, reaching a peak validation accuracy of 86.30% at epoch 12, while validation loss decreased from 0.76 in the first epoch to 0.37. The confusion matrix in Fig. 10 illustrates the classifier’s performance across fibrosis stages F0 to F4. Most predictions fall on the true labels, with F0 having the highest number of correct predictions (420), followed by F4 (326). However, some confusion occurs between consecutive stages; for example, F1 is predicted as F2 for 89 samples, while F3 is predicted as F2 for 32 samples.
3.4
Model performance comparison
The performance of the ResNet50, VGG16, and Hybrid models is presented in Table 2, Table 3. Table 2 reports classification performance in terms of accuracy, macro- and weighted-average area under the ROC curve (AUC), 95% CI for accuracy, and the associated p-values. The Hybrid model was superior in accuracy (86.00%; 95% CI: 84.10–87.90) and achieved the best macro- and weighted-AUC values (96.79% and 97.79%, respectively). All models performed significantly above chance level (p < 0.001).
Table 3 reports class-wise sensitivity and specificity for all models with their 95% CI and the associated p-values for sensitivity comparisons. All models performed significantly above chance level for every fibrosis stage (F0 to F4). Whereas ResNet50 and VGG16 showed variable sensitivity for the intermediate fibrosis stages, the Hybrid model achieved consistently high sensitivity and specificity with narrower CIs across all stages.
3.5
Grad-CAM visualization of model predictions
Grad-CAM heatmaps were generated for three representative correct and misclassified images per model to visualize areas of attention during classification (Fig. 11). For ResNet50, well-classified F4 cases showed high intensity activation localized around the fibrotic areas, while misclassified F4 cases showed slightly diffused attention. For VGG16, a more diffused pattern of activation, which was also slightly off-target, was noted, particularly for misclassified F4 cases, suggesting a lack of precision in targeting the relevant liver areas. For the Hybrid model, highly focused and high-intensity activation was noted for both well-classified and misclassified F4 cases, targeting critical areas of fibrosis. These findings correlate well with the overall accuracy results, where the Hybrid model performed best, followed by VGG16 and ResNet50.
4
Discussion
The results of the study emphasize the efficacy of the hybrid CNN model, which integrates ResNet50 and VGG16, in improving the accuracy of liver fibrosis staging from ultrasound images. The strategy was designed to address a key limitation of existing CNN models for this task: the misclassification of intermediate fibrosis stages, as reported in several prior studies. In previous work, CNN architectures such as VGG16 and ResNet50 achieved only moderate accuracy and struggled to balance sensitivity and specificity across fibrosis stages; in particular, intermediate stages such as F2 and F3 were frequently misclassified owing to the subtle texture differences between them on ultrasound. In contrast, the proposed hybrid architecture showed a significant improvement in AUC (macro: 96.79%, weighted: 97.79%).
Previous studies regarding the assessment of liver fibrosis using ultrasound imaging techniques were mainly focused on B-mode ultrasound images of the rat liver, where texture features were used with logistic regression analysis. The methods were highly successful, with an AUC value reaching up to 95.00%, sensitivity reaching 96.80%, and specificity reaching 93.70% for the classification task.40 However, the methods were restricted to animal images, with the aim of distinguishing early and late fibrosis. In comparison, the proposed hybrid CNN model was trained and tested directly on the ultrasound images of the human liver, with an AUC value reaching 97.79% for all METAVIR stages (F0–F4). This shows the better generalization ability of the proposed method in comparison with the earlier methods.
In another study, Destrempes et al.41 employed the technique of human ultrasound in combination with shear wave elastography and the Random Forest classifier. The results showed an AUC of 77.00% for the staging of fibrosis. However, the results of the model were constrained due to the small number of samples, i.e., only 82 patients. The performance of our hybrid CNN was significantly higher than the results reported in the study, with a macro AUC of 96.79%, thus demonstrating the robust performance of the model for all the stages of fibrosis.
In another study, Chen et al.42 employed the technique of real-time tissue elastography (RTE) in combination with different classifiers like SVM, RF, and KNN. The Random Forest classifier reported the maximum accuracy of about 85.00%, although the results reported in the study were constrained due to the dependency on the classifiers. Our hybrid CNN model has overcome the constraints of the study due to the automatic feature learning capability of the model, thus reporting a high accuracy of 86.00%.
Moreover, Li et al.43 combined ultrasonomics features, also known as radiomics, with various machine learning algorithms, such as AdaBoost, RF, and SVM. It was found that the mean AUC was 85.00% for 144 patients. Even though it was computationally costly, using a small dataset, it proved that machine learning algorithms can be powerful tools in predicting liver fibrosis. When comparing our results, the hybrid deep learning model performed significantly better, obtaining a high AUC of 97.79% using only ultrasound images, which is a larger dataset than theirs, comprising 6323 ultrasound images.
Moreover, when visualizing Grad CAM, it was noted that ResNet50 was focusing more on clinically important fibrotic areas, whereas VGG16’s attention was more scattered. These results are in line with previous studies that proved that Grad CAM can effectively highlight discriminative regions in medical imaging models.49 Moreover, the hybrid model’s results were more concentrated, as proven by other studies that showed that combining architectures can significantly improve classification results as well as attention quality,50 as shown in Fig. 5. These results prove that visualizations, such as Grad CAM, can significantly contribute to the confidence in liver fibrosis classification using AI models.
On the whole, the hybrid CNN model, which integrates the performance of the two architectures, namely, ResNet50 and VGG16, has shown promising results in terms of the accuracy rate (86.00%), macro AUC (96.79%), weighted AUC (97.79%), as well as sensitivity (95.88%). Although the misclassification between stages F1, F2 still needs to be addressed as an area of concern, the hybrid CNN model has shown promising performance in terms of sensitivity, specificity, as well as the overall generalization of the proposed framework, thereby indicating the potential of the proposed framework as an effective diagnostic tool for the evaluation of liver fibrosis.
It has been observed that the proposed Hybrid CNN has shown promising performance in terms of sensitivity for the early stages of fibrosis, namely, F1, F2, as well as specificity for the advanced stages, namely, F3, F4. This has highlighted the robustness of the proposed framework as an effective diagnostic tool.
Discussion
The results of this study demonstrate the efficacy of the hybrid CNN model, which integrates ResNet50 and VGG16, in improving the classification of liver fibrosis stage from ultrasound images. The strategy was designed to address a known weakness of existing CNN models, namely the misclassification of intermediate fibrosis stages, and its efficacy was tested by comparison with established architectures. Previous studies that applied CNN architectures such as VGG16 and ResNet50 to fibrosis staging from ultrasound reported moderate accuracy and difficulty balancing sensitivity and specificity across stages; in particular, these models were observed to misclassify intermediate stages (F2, F3) because of the subtle texture differences in the corresponding ultrasound images. In contrast, the proposed hybrid architecture showed a marked improvement in AUC (macro: 96.79%, weighted: 97.79%).
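The macro versus weighted AUC distinction used throughout this comparison can be made concrete. The sketch below (illustrative helper names, not the authors' code) computes a one-vs-rest AUC per METAVIR stage via the Mann–Whitney rank formulation, then averages once unweighted (macro: every stage counts equally) and once weighted by class support (weighted: dominated by the more frequent stages):

```python
import numpy as np

def binary_auc(scores, labels):
    # Mann-Whitney U formulation of ROC AUC: fraction of
    # (positive, negative) pairs ranked correctly, ties count half.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def macro_weighted_auc(probs, y):
    # probs: (n_samples, n_classes) predicted probabilities
    # y: integer stage labels 0..n_classes-1
    n_classes = probs.shape[1]
    aucs, support = [], []
    for k in range(n_classes):
        one_vs_rest = (y == k).astype(int)
        aucs.append(binary_auc(probs[:, k], one_vs_rest))
        support.append(one_vs_rest.sum())
    aucs, support = np.array(aucs), np.array(support)
    macro = aucs.mean()                                 # stages weighted equally
    weighted = (aucs * support).sum() / support.sum()   # weighted by class size
    return macro, weighted
```

With imbalanced stage distributions, as in this dataset, the weighted average exceeds the macro average whenever the majority stages are classified more reliably, which matches the 97.79% versus 96.79% gap reported here.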
Previous studies on ultrasound-based assessment of liver fibrosis focused mainly on B-mode images of rat liver, in which texture features were combined with logistic regression analysis. These methods performed well, with AUC up to 95.00%, sensitivity up to 96.80%, and specificity up to 93.70%.40 However, they were restricted to animal images and aimed only to distinguish early from late fibrosis. In comparison, the proposed hybrid CNN model was trained and tested directly on human liver ultrasound images, reaching a weighted AUC of 97.79% across all METAVIR stages (F0–F4), demonstrating stronger generalization than these earlier methods.
In another study, Destrempes et al.41 combined human ultrasound with shear wave elastography and a Random Forest classifier, reporting an AUC of 77.00% for fibrosis staging. Those results, however, were constrained by a small sample of only 82 patients. Our hybrid CNN performed substantially better, with a macro AUC of 96.79%, demonstrating robust performance across all fibrosis stages.
Similarly, Chen et al.42 applied real-time tissue elastography (RTE) with classifiers such as SVM, RF, and KNN; the Random Forest achieved the highest accuracy, about 85.00%, but performance depended heavily on the choice of classifier. Our hybrid CNN avoids this dependency through automatic feature learning, achieving an accuracy of 86.00%.
Moreover, Li et al.43 combined ultrasonomics (radiomics) features with machine learning algorithms such as AdaBoost, RF, and SVM, obtaining a mean AUC of 85.00% on 144 patients. Although computationally costly and based on a small dataset, that work showed that machine learning can be a powerful tool for predicting liver fibrosis. By comparison, our hybrid deep learning model performed markedly better, reaching a weighted AUC of 97.79% using ultrasound images alone on a considerably larger dataset of 6323 images.
Moreover, the Grad-CAM visualizations showed that ResNet50 focused more on clinically important fibrotic areas, whereas VGG16's attention was more scattered. These observations are consistent with previous studies showing that Grad-CAM can effectively highlight discriminative regions in medical imaging models.49 The hybrid model's attention was more concentrated, in line with reports that combining architectures can improve both classification performance and attention quality,50 as shown in Fig. 5. These findings suggest that visualizations such as Grad-CAM can strengthen confidence in AI-based liver fibrosis classification.
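The Grad-CAM maps discussed above reduce to a short computation once a convolutional layer's activations and the gradients of the predicted stage's score with respect to them are available. A framework-agnostic numpy sketch (the study itself would obtain the gradients from a deep learning framework's autograd, which is not shown here):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one convolutional layer.

    activations: (C, H, W) feature maps A^k from the chosen layer
    gradients:   (C, H, W) d(class score)/dA^k from backpropagation
    """
    # alpha_k: global-average-pool the gradients per channel
    alphas = gradients.mean(axis=(1, 2))                        # (C,)
    # weighted sum of feature maps, ReLU keeps only positive evidence
    cam = np.maximum((alphas[:, None, None] * activations).sum(axis=0), 0.0)
    # normalize to [0, 1] so the map can be overlaid on the ultrasound image
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

A concentrated heatmap over fibrotic parenchyma, as reported for the hybrid model, corresponds to a few channels with large positive alphas whose feature maps coincide spatially.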
On the whole, the hybrid CNN model, which fuses the ResNet50 and VGG16 architectures, achieved promising results: an accuracy of 86.00%, macro AUC of 96.79%, weighted AUC of 97.79%, and sensitivity of 95.88%. Although misclassification between stages F1 and F2 remains a concern, the model's sensitivity, specificity, and overall generalization indicate its potential as an effective diagnostic tool for evaluating liver fibrosis.
The proposed hybrid CNN also showed strong sensitivity for the early fibrosis stages (F1, F2) and strong specificity for the advanced stages (F3, F4), underscoring the robustness of the framework as a diagnostic tool.
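The fusion of the two backbones summarized above is commonly implemented as concatenation of globally pooled feature vectors followed by a shared classification head over the five METAVIR stages. A shape-level numpy sketch, where the 2048-d (ResNet50) and 512-d (VGG16) dimensions are assumed from those backbones' standard pooled outputs and the weights are random placeholders rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Pooled feature vectors as each backbone would emit after global
# average pooling: 2048-d for ResNet50, 512-d for VGG16 (batch of 4).
resnet_feats = rng.standard_normal((4, 2048))
vgg_feats = rng.standard_normal((4, 512))

# Late fusion: concatenate, then one dense softmax head over F0-F4.
fused = np.concatenate([resnet_feats, vgg_feats], axis=1)   # (4, 2560)
W = rng.standard_normal((2560, 5)) * 0.01                   # placeholder weights
b = np.zeros(5)
stage_probs = softmax(fused @ W + b)                        # rows sum to 1
```

Concatenation lets the head weigh ResNet50's residual features and VGG16's texture-oriented features jointly, which is one plausible mechanism for the complementary behavior the Grad-CAM comparison suggests.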
Limitations and future directions
While the findings are encouraging, this study has some limitations. First, the dataset was not extensive and may not capture the variability that would be observed in a larger clinical setting, particularly because the distribution of samples across fibrosis stages was imbalanced. Second, the study used only ultrasound imaging, which, while feasible and widely available, offers lower resolution than other imaging modalities. A further limitation is model interpretability: although CNN-based models achieve high classification accuracy, they remain essentially “black-box” systems, which can hinder acceptance in clinical settings where transparency is essential.
Another limitation is that the publicly available dataset provided only image-level class labels, with no patient-level information. Consequently, no patient-level grouping or splitting was possible, and training and evaluation were performed solely at the image level, which may introduce hidden correlations between the training and test sets. Moreover, no information on label reliability, such as inter-observer agreement, was available. These limitations should be kept in mind when interpreting the results.
Conclusion
This paper proposed a feature-fusion hybrid CNN model for classifying liver fibrosis from ultrasound images. The model achieved high sensitivity, specificity, and AUC, indicating excellent performance, particularly in diagnosing advanced fibrosis. Grad-CAM analysis further verified that the hybrid model consistently focused on clinically meaningful regions, supporting its effectiveness in improving the diagnosis of liver fibrosis.
CRediT authorship contribution statement
Adedotun F. Adesina: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Blessing O. Olorunfemi: Writing – original draft, Visualization, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Adenike T. Adeniji-Sofoluwe: Writing – review & editing, Supervision, Methodology, Conceptualization. Chiagoziem A. Otuechere: Visualization, Supervision, Formal analysis, Conceptualization. Funmilayo Olopade: Supervision, Conceptualization. Benjamin Aribisala: Writing – original draft, Visualization, Validation, Supervision, Methodology, Investigation, Data curation, Conceptualization.
Informed consent
Not applicable. The study used a publicly available dataset from the Kaggle platform which does not contain identifiable human subjects or patient information. Therefore, informed consent was not required.
Ethics statement
Not applicable.
Data availability statement
The current study utilizes the Liver Histopathology (Fibrosis) Ultrasound Images dataset, which was collected from the Kaggle platform: https://www.kaggle.com/code/houssameddinebhe/liver-histopathology-fibrosis-ultrasound-images/input.
Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of this work, the author(s) used Microsoft Copilot to assist with language refinement and organization in the related work section before reviewing and editing the content further. After using this tool, the author(s) reviewed and edited the content as needed and take full responsibility for the content of the publication.
Funding
This research received no external funding.
Declaration of competing interest
The authors declare no conflict of interest.