
Enhancing lung cancer classification through a double attention hybrid CNN-HiFuse approach.

M D A, B V (2026). Enhancing lung cancer classification through a double attention hybrid CNN-HiFuse approach. Scientific Reports, 16(1). https://doi.org/10.1038/s41598-026-42290-9
PMID: 41813800

Abstract

Lung cancer continues to be a predominant cause of cancer-related mortality globally. In 2022, lung cancer accounted for around 2.5 million new cases and around 1.8 million fatalities, highlighting the necessity for precise and effective computer-aided diagnostics. Timely detection is especially vital for non-small cell lung cancer (NSCLC), which constitutes roughly 80–85% of all lung cancer instances, but continues to pose difficulties in standard clinical practice. This paper presents a Double Attention Hybrid CNN-HiFuse architecture for categorizing lung cancer into three classes (normal, benign, malignant) using chest Computed Tomography (CT) images. The model is trained and evaluated on the publicly accessible IQ-OTH/NCCD dataset, which comprises 1,190 CT slices from 110 patients, using a defined preprocessing and augmentation protocol to address class imbalance. The proposed Hybrid CNN-HiFuse, which combines multi-scale feature fusion with channel and spatial attention methods, is evaluated against a bespoke CNN and transfer-learning benchmarks (VGG16, ResNet50). The model attains an overall classification accuracy of 98.12% on the stated test split, with precision, recall, and F1-score of 98.17%, 98.12%, and 98.13%, respectively, surpassing the baseline designs. The confusion matrix and ROC studies demonstrate minimal misclassification rates, especially for malignant nodules, while the attention maps emphasize clinically significant areas, hence improving interpretability. The present assessment, confined to a modest single-centre CT dataset, indicates that the Double Attention Hybrid CNN-HiFuse framework is a viable candidate for incorporation into clinical decision-support systems, potentially enhancing radiologists’ efficacy in early lung cancer detection.

Supplementary information: The online version contains supplementary material available at 10.1038/s41598-026-42290-9.


Introduction

Lung cancer continues to rank among the most frequently diagnosed cancers and remains a leading cause of cancer-related deaths globally. Recent global statistics for 2022 estimate approximately 2.5 million new lung cancer cases and about 1.8 million deaths, highlighting its substantial public health burden1. Early lung nodule detection, essential for improving patients’ survival rates, remains challenging despite advancements in medical imaging and diagnostic technology2. The subtle appearance of early nodules and the time-consuming procedure required for radiologists to identify them contribute to this problem. About 80–85% of all instances of lung cancer are NSCLC, comprising subtypes such as squamous cell carcinoma, large cell carcinoma, and adenocarcinoma3. Nevertheless, the current study did not seek to differentiate these particular categories. It emphasizes a triadic radiological classification of normal lung tissue, benign nodules, and malignant nodules, in accordance with the designations in the IQ-OTH/NCCD CT dataset. Timely and precise diagnosis of these three categories is essential to mitigate disease development and enhance patient outcomes4. The primary issue with lung cancer is that symptoms frequently manifest only in advanced stages, rendering early detection crucial for decreasing the prevalence of advanced lung cancer and enhancing patients’ life expectancy. Several methods are available for detecting lung cancer, including Positron Emission Tomography (PET), Chest X-rays (CXRs), Computed Tomography (CT), chest radiography, sputum cytology, Magnetic Resonance Imaging (MRI), and breath analysis5. The radiologist’s ability to identify aberrant lung nodules is crucial to radiographic screening performance, and manually detecting small lung nodules is quite difficult. As a result, it requires considerable time and effort for a radiologist or doctor to examine each slice.
Manual detection can result in missed or inaccurate tumor classifications, necessitating a nodule classifier to categorize the lumps according to their characteristics.
CAD (Computer-aided diagnosis) tools, which use automated image classification methods to identify lung tumors, could help radiologists improve diagnostic confidence, expedite the diagnostic process, and maximize classification accuracy6. Small nodules can be difficult to detect due to their small size and subtle appearance on CT images, making it challenging for standard computer-aided diagnosis methods to segment and classify them effectively. Research has shown that this segmentation constraint might result in incorrect classifications, making it easy to overlook small nodules7. When detecting nodules, traditional CAD technology exhibits an elevated percentage of false positives, reducing its ability to distinguish between benign and malignant ones. A more complex approach requires additional image processing modules, including feature extraction, CT image enhancement, and lung nodule segmentation. Lung CT images are categorized as large cell carcinoma, adenocarcinoma, squamous cell carcinoma, or normal using CNN (Convolutional Neural Network) and SVM (Support Vector Machine) algorithms.
The Thorax CT scan image dataset, a widely utilized and easily obtainable collection of scans, was used to assess the method. 5,103 images were used to test the proposed hybrid CNN-SVM model, demonstrating its advantages and potential uses, although only a limited number of test images from a readily available dataset were utilized. With an accuracy of over 91%, Da Silva et al.8 presented a CNN-based CAD system for diagnosing lung tumors on CT scans, highlighting the removal of feature extraction procedures to increase efficiency. Zhang et al.9 developed a CAD system that utilizes a 3D-DCNN (Deep CNN) for nodule recognition, underscoring the need for further improvement, particularly in detecting small nodules. Li et al.10 presented a CAD system for lung nodule detection utilizing CT images, focused on improving static image detectors to extract pertinent data while preserving information regarding microscopic tumors.
A significant number of methods use binary classification to determine whether suspected lung nodules are benign or malignant. Many models cannot map malignant characteristics or categorize tumours into stages11. Particularly in clinical situations, a Deep Learning (DL) model that can quickly and effectively diagnose multi-class nodules is required. Decision-making and insights into disease pathology are enhanced when clinical diagnostic data are incorporated into these models12. Researchers can better comprehend the complicated malignancy of nodules by using soft attention techniques, which increase interpretability and transparency. The model’s accuracy also depends on custom weight values, particularly in multi-class categorization. The effectiveness and usefulness of these developments are demonstrated by the performance of cutting-edge models, which promise simple integration into standard clinical settings13. A specially designed Neural Network (NN) model has been incorporated to classify lung nodules in CT scans. Unlike other approaches, this model incorporates unique convolutional layers, batch normalization, dropout, and a soft attention mechanism. By focusing on the most relevant areas of the image, this layout increases the accuracy of feature extraction and categorization. These architectural enhancements perform better than traditional methods because they more effectively handle issues like class imbalance and data scarcity that arise during lung nodule diagnosis.
In addition to lung cancer, artificial intelligence has significantly influenced several sectors of healthcare, showcasing its extensive applicability. The study14 utilizes gradient-boosted tree models alongside explainable AI methodologies to assess the risk of type 2 diabetes based on standard clinical variables, offering interpretable feature-attribution insights for healthcare professionals. Likewise, study15 utilizes a self-organizing map-based neural network with block processing to enhance cancer zone localization in noisy MRI scans. In hematology, study16 emphasizes the role of machine-learning techniques in diagnosing and treating complex blood diseases. Collectively, these studies illustrate the variety of AI applications in healthcare and encourage the advancement of resilient, interpretable Deep-Learning (DL) frameworks for CT-based lung cancer classification, such as the Hybrid CNN-HiFuse model introduced in this research.
Building upon these advancements, this study introduces a Hybrid CNN-HiFuse model with a double attention mechanism tailored to address the challenges of early lung cancer detection and classification. The model leverages the strengths of CNNs for robust feature extraction and incorporates HiFuse for efficient fusion of multi-scale features, enabling better representation of small nodules. The double attention mechanism further enhances the model’s ability to focus on critical image regions while minimizing irrelevant noise, improving interpretability and accuracy. By addressing limitations in traditional CAD systems, such as high false-positive rates and inadequate handling of small nodules17, the proposed approach aims to provide a reliable tool for multi-class classification of lung nodules, including differentiating between various cancer stages. This work enhances diagnostic precision and offers a framework that can be seamlessly integrated into clinical workflows, supporting radiologists in making more informed decisions.

Related works


Deep learning for medical image analysis
DL algorithms have recently shown remarkable outcomes in medical imaging analysis18. Convolutional neural networks have frequently been reported to outperform many traditional machine-learning algorithms in terms of classification accuracy and robustness, making them one of the most widely used architectures for medical image processing19. Building on CNN’s contributions to medical image analysis20, researchers have investigated various deep neural network architectures: CNN21, Recurrent Neural Network (RNN)22, and Long Short-Term Memory (LSTM)23,24 models are utilized for specific tasks, such as lymph node classification25. Although LSTM and CNN classifiers have surpassed RNN models, issues with their clinical applicability persist due to their suboptimal efficiency. Additionally, technologies like Deep Lung26 and attention mechanisms27 have emerged for recognizing and classifying lung nodules, showcasing novel network architectures and attention techniques to enhance diagnostic accuracy. Designs like MED-GAN have also been developed; intended to produce superior 3D bone forms from 2D X-ray images, they demonstrate significant advancement. Despite challenges posed by small dataset sizes and noisy images, they show promising results for enhancing diagnostic imaging quality28.

CNN-based CAD systems for lung nodule classification
Several CNN-based CAD systems have been proposed for lung CT analysis. For instance, MVSA-CNN, built on the LIDC-IDRI dataset, was introduced for both binary and ternary nodule classification. While it showed good results for binary classification (error rate: 5.41%), its performance for ternary classification was comparatively lower, with an error rate of 13.91%, corresponding to an accuracy of 86.09%29. The model performed well across various evaluation criteria on two datasets. Before invasive operations, the Multi Crop-CNN (MC-CNN) model made a noninvasive suspiciousness assessment easier, as it included a comprehensive nodule and its information using the MC-CNN pooling method30.
A CNN-based multi-group patch training system8 outperformed a traditional CAD pipeline for lung nodule analysis; however, although the false-positive rate decreased, the overall diagnostic performance did not improve substantially.

Advanced architectures
A 3D-DCNN was developed31 for nodule recognition, underscoring the need for improved sensitivity to small nodules, while32 focused on enhancing static image detectors to better capture subtle tumour information in CT scans. Given the growing number of medical radiological sequences, automatic nodule recognition is crucial for improving inspection productivity. Using automated feature extraction and classification techniques, a 2D deep CNN architecture, LdcNet, was employed to identify lung carcinoma cases in CT data. Further exploration of 3D-CNNs and algorithms for lung volume segmentation could enhance the model, highlighting methods to boost the accuracy of lung carcinoma categorization and inspection33.
The LungNet model, a 22-layer hybrid CNN architecture that fuses CT imaging and MIoT sensor data, reported impressive classification performance. On a dataset of 525,000 images, it achieved 96.81% accuracy for five-class lung cancer detection and 91.6% accuracy in sub-staging stage-1 and stage-2 cancers, with a notably low false-positive rate of 3.35%34. It differentiates between benign and malignant lung nodules without requiring much data annotation. The MV-KBC (Multi-view Knowledge-based Collaborative) framework also addressed the challenge of limited data by training submodels on fixed-view patches of 3D nodules. This model attained 91.60% accuracy with an AUC of 95.70%, highlighting its robustness in extracting voxel-level and shape-specific features across heterogeneous nodules using adaptive weighting35. A semi-supervised adversarial classification technique36 classified suspicious lesions with excellent accuracy.
The Multiview-CNN21 uses a network design with three CNN branches to capture nodule-sensitive characteristics in CT images from three viewpoints. Study37 applied conventional machine learning with Local Binary Pattern (LBP) and Discrete Cosine Transform (DCT) for feature extraction, obtaining acceptable accuracy but struggling with some cancer types because of feature overlap; it also necessitates a large amount of data and computing power38. In a more recent advancement, Lung-EffNet, based on the EfficientNet architecture and tested on the IQ-OTH/NCCD dataset, achieved 99.10% accuracy, with ROC-AUC scores ranging between 0.97 and 0.99. This model demonstrates strong generalization across benign, malignant, and normal classes, aided by fine-tuning and augmentation to counter class imbalance39.

Research gap and aim of the study
Prior studies indicate that DL-based CAD systems, encompassing 2D and 3D CNNs, multi-view architectures, and IoT-enabled models, significantly enhance the detection and classification of lung nodules in CT data. Nonetheless, numerous constraints persist. Many methodologies depend on single-scale or restricted multi-scale feature extraction and may encounter difficulties with small nodules and heterogeneous appearances. Some do not clearly integrate attention mechanisms to concentrate on the most informative areas of the lungs, resulting in restricted interpretability from a clinical standpoint. Moreover, only a limited number of models tackle three-class categorization of normal, benign, and malignant nodules using publicly accessible datasets like IQ-OTH/NCCD.
The current study seeks to establish and assess a Double Attention Hybrid CNN–HiFuse architecture that (i) conducts three-class classification of lung nodules (normal, benign, malignant) from CT slices, (ii) incorporates hierarchical multi-scale feature fusion with both channel and spatial attention to enhance the detection of small and heterogeneous nodules, and (iii) offers an interpretable, computationally efficient framework that can function as a practical CAD tool for early lung cancer detection.

Background


HIFUSE: hierarchical multi-scale feature fusion network
A novel approach for classifying clinical images, the HiFuse model, is proposed to efficiently extract global semantic representations and local spatial information from medical images of varying sizes. HiFuse employs a parallel architecture in which a local feature block and a global feature block extract regional and global information, respectively; their outputs are combined in hierarchical feature fusion (HFF) blocks, followed by down-sampling stages that ultimately feed the classification output.

The global branch (top) applies global average and max pooling followed by a shared MLP and sigmoid to generate a channel-attention map, producing the recalibrated features $\tilde{G}^i$. The local branch $L^i$ (bottom) uses a 7 × 7 convolution and sigmoid to compute a spatial-attention map, yielding $\tilde{L}^i$. The fused feature from the previous hierarchy $F^{i-1}$ is projected by a 1 × 1 convolution and average pooling to form $\hat{F}^{i-1}$, which is concatenated with the current global and local features and processed by a GELU-activated inverted residual MLP. Element-wise sums, multiplications, and concatenations integrate these paths to produce the final fused output $F^i$.
As seen in Fig. 1, the flexible hierarchical feature fusion block can integrate global representations, regional features from various levels, and contextual information derived from the prior hierarchy based on the input features. Among these feature matrices, $G^i$ is obtained from the global feature block, $L^i$ from the local feature block, $F^{i-1}$ from the preceding HFF stage, and $F^i$ results from HFF fusion at this stage. The HFF block feeds the incoming global features into the channel attention (CA) mechanism, which exploits the interdependence between channel maps to enhance the feature representation of particular semantics; the self-attention in the global feature block can partially capture global spatial and temporal information.
The spatial attention (SA) mechanism uses the local features as input to suppress irrelevant regions and highlight local details. Finally, an inverted residual MLP (IRMLP) is applied to the fused output of the attention and fusion paths. This mitigates gradient vanishing, gradient explosion, and network degradation to some extent, allowing global and local feature information to be captured at each hierarchy, as is evident in the structural comparison of ResNet, Swin Transformer, ConvNeXt, and HFF blocks.
Channel attention and spatial attention within the hierarchical feature fusion (HFF) block are defined as follows. Let $x \in \mathbb{R}^{H \times W \times C}$ denote an input feature map with height H, width W, and C channels.

The channel-attention map $CA(x)$ is computed as

$$CA(x) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(x)) + \mathrm{MLP}(\mathrm{MaxPool}(x))\big) \qquad (1)$$

where $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ perform global average and global max pooling over the spatial dimensions, respectively, $\mathrm{MLP}$ is a small multi-layer perceptron that projects pooled descriptors back to the channel dimension, and $\sigma$ denotes the element-wise sigmoid function. Equation (1) therefore learns a channel-wise importance vector that highlights informative feature channels and suppresses less relevant ones.

The spatial-attention map $SA(x)$ is defined as

$$SA(x) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(x);\, \mathrm{MaxPool}(x)])\big) \qquad (2)$$

where $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ now denote pooling operations applied along the channel dimension (producing two single-channel maps), $[\cdot\,;\cdot]$ concatenates these maps, and $f^{7\times 7}$ is a convolution with a 7 × 7 kernel. Equation (2) produces a spatial attention mask that emphasizes salient locations in the feature map while down-weighting background regions.

The inverted residual MLP block is defined as

$$\mathrm{IRMLP}(x) = W_2\,\mathrm{GELU}\big(W_1\big(f^{3\times 3}(\mathrm{LN}(x)) + x\big)\big) \qquad (3)$$

where $\mathrm{LN}$ denotes layer normalisation, $f^{3\times 3}(\cdot)$ is a 3 × 3 convolution, and $W_1, W_2$ are 1 × 1 pointwise convolutions. This block combines local convolutional processing with residual connections to mitigate vanishing gradients and improve feature expressiveness.

Within the HFF block at hierarchy $i$, we denote by $G^i$ the “global” feature branch and by $L^i$ the “local” feature branch. Channel and spatial attention are applied to these branches as

$$\tilde{G}^i = CA(G^i) \otimes G^i, \qquad \tilde{L}^i = SA(L^i) \otimes L^i \qquad (4)$$

where $\otimes$ denotes element-wise multiplication. Thus, $\tilde{G}^i$ is the channel-recalibrated version of the global features, and $\tilde{L}^i$ is the spatially refocused version of the local features.

To incorporate information from the previous hierarchy, we use the fused feature $F^{i-1}$ from level $i-1$. A down-sampled summary of this previous fusion is computed as

$$\hat{F}^{i-1} = \mathrm{AvgPool}\big(f^{1\times 1}(F^{i-1})\big) \qquad (5)$$

where $f^{1\times 1}$ reduces the channel dimension and $\mathrm{AvgPool}$ performs spatial average pooling, yielding a compact context vector.

The current-level global, local, and previous-hierarchy features are then combined as

$$U^i = f^{3\times 3}\big([\tilde{G}^i;\, \tilde{L}^i;\, \hat{F}^{i-1}]\big) \qquad (6)$$

where $[\cdot\,;\cdot]$ concatenates along the channel dimension and $f^{3\times 3}(\cdot)$ applies a 3 × 3 convolution to integrate these inputs into a unified feature map. The concatenation of $\tilde{G}^i$, $\tilde{L}^i$, and $\hat{F}^{i-1}$ is thus combined with the residual context $\hat{F}^{i-1}$ and passed through the IRMLP block defined in Eqn. (3):

$$F^i = \mathrm{IRMLP}\big(U^i + \hat{F}^{i-1}\big) \qquad (7)$$

The resulting $F^i$ is a hierarchically fused representation that jointly encodes global context, local detail, and information from preceding levels, and is passed to the next stage of the network.
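To make these operations concrete, the following is a minimal NumPy sketch of the channel-attention map in Eqn. (1) and the spatial-attention map in Eqn. (2) applied to a toy feature map. The layer sizes, random weights, and function names are illustrative assumptions, not the trained parameters or the exact implementation of the HiFuse network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Eqn (1): sigmoid(MLP(avgpool(x)) + MLP(maxpool(x))).
    x has shape (H, W, C); the shared MLP is two linear maps w1, w2."""
    avg = x.mean(axis=(0, 1))                       # (C,) spatial average pool
    mx = x.max(axis=(0, 1))                         # (C,) spatial max pool
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # shared MLP, ReLU hidden layer
    return sigmoid(mlp(avg) + mlp(mx))              # (C,) channel weights

def spatial_attention(x, kernel):
    """Eqn (2): sigmoid(conv7x7([avgpool_c(x); maxpool_c(x)]))."""
    avg = x.mean(axis=2)                            # (H, W) pooled along channels
    mx = x.max(axis=2)
    stacked = np.stack([avg, mx], axis=2)           # (H, W, 2)
    H, W = avg.shape
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(stacked, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W))
    for i in range(H):                              # naive 7x7 sliding window
        for j in range(W):
            out[i, j] = np.sum(padded[i:i+k, j:j+k, :] * kernel)
    return sigmoid(out)                             # (H, W) spatial mask

# Toy feature map and random parameters (illustration only)
H, W, C, hidden = 16, 16, 8, 4
x = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((hidden, C)) * 0.1
w2 = rng.standard_normal((C, hidden)) * 0.1
kernel = rng.standard_normal((7, 7, 2)) * 0.1

ca = channel_attention(x, w1, w2)       # per-channel weights in (0, 1)
sa = spatial_attention(x, kernel)       # per-pixel weights in (0, 1)
refined = (x * ca) * sa[..., None]      # Eqn (4)-style recalibration
```

The naive 7 × 7 loop stands in for a learned convolution; in practice both attention maps are produced by trained layers inside the HFF block.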

Double attention mechanism
The impetus for integrating double attention in our model arises from recognising that both channel-wise and spatial dependencies matter for the effective representation of feature tensors. Channel attention allows the model to selectively concentrate on the most informative feature maps, enhancing comprehension of the structures inside medical images. Conversely, spatial attention prioritizes spatial links among features, enabling the model to acquire essential contextual information and dependencies across different image regions. Through the integration of dual attention, our model combines the advantages of both channel and spatial attention, improving its efficacy in medical image classification tasks.
This methodology facilitates the creation of a more resilient and efficient network adept at effectively expressing feature tensors, resulting in enhanced classification results. Figure 2 illustrates the double attention mechanism, demonstrating the collaborative function of the channel and spatial attention components in enhancing the model’s representational capacity. Our system employs the attention processes sequentially rather than in parallel, improving performance.
To leverage the benefits of the double attention mechanism while mitigating its inherent complexity, we employ an efficient attention module for channel attention and an advanced transformer block for spatial attention. The conventional self-attention technique is constrained by its quadratic computational complexity, O(N²), as shown in Eqn. (9), which limits the architecture’s usefulness for high-resolution medical images:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V \qquad (9)$$

In Eqn. (9), Q, K, and V denote the query, key, and value matrices, while d signifies the embedding dimension. By implementing the efficient attention mechanism, we diminish the computational burden while preserving the advantages provided by the channel attention method. This enables our model to handle feature maps more efficiently and achieve improved performance in medical image classification tasks. Moreover, the efficient attention mechanism supports the model’s scalability, facilitating its application to various use cases and datasets. We employ the efficient attention approach introduced by Shen et al.40, as delineated in Eqn. (10):

$$E(Q, K, V) = \rho_q(Q)\big(\rho_k(K)^{\top} V\big) \qquad (10)$$

where $\rho_q$ and $\rho_k$ denote the normalization functions for queries and keys, respectively. Shen et al.40 proved that choosing softmax normalization functions for $\rho_q$ and $\rho_k$ renders the module output similar to dot-product attention. Thus, efficient attention first normalizes keys and queries, then multiplies the keys with the values, and finally multiplies the resultant global context vectors by the queries to produce a new representation.
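A small NumPy sketch illustrates why the reordering in Eqn. (10) matters: multiplying the normalized keys with the values first produces d × d global context vectors, so the cost grows linearly in sequence length N rather than quadratically. The matrix sizes and normalization choices below are illustrative, not the exact configuration used in the model.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Eqn (9): softmax(QK^T / sqrt(d)) V, O(N^2) in sequence length."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=1) @ V

def efficient_attention(Q, K, V):
    """Eqn (10): rho_q(Q) (rho_k(K)^T V). Aggregating keys with values
    first yields d x d context vectors, giving O(N d^2) cost (Shen et al.)."""
    q = softmax(Q, axis=1)   # normalise each query over the feature dimension
    k = softmax(K, axis=0)   # normalise each key position over the sequence
    return q @ (k.T @ V)     # global context vectors, then per-query mixing

N, d = 64, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Associativity check: (rho_q(Q) rho_k(K)^T) V gives the same output but
# materialises an N x N matrix; the reordering is what makes it efficient.
q, k = softmax(Q, axis=1), softmax(K, axis=0)
quadratic_order = (q @ k.T) @ V
```
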

The left block applies depthwise convolution, layer normalisation, fully connected (FC) layers with GELU activation, and residual connections to refine the input feature map. The central part illustrates the efficient self-attention operation, where query (Q), key (Kᵀ), value (V), and a global context vector (G) are computed and fused. The right block applies a second sequence of layer normalisation, depthwise convolution, and FC layers to produce the final attention-enhanced feature representation, which is added back to the residual path.

Methodology


Dataset preparation
This study utilises the Iraq Oncology Teaching Hospital/National Centre for Cancer Diseases (IQ-OTH/NCCD) lung cancer dataset41, collected during a three-month duration in the autumn of 2019 and revised in May 2023. The dataset comprises 1,190 axial chest CT slices derived from 110 cases, categorized into three classes: normal, benign, and malignant. Among these, 55 cases are categorized as normal, 15 as benign, and 40 as malignant. We adopt the class definitions set forth by the IQ-OTH/NCCD dataset. Normal cases denote CT scans from patients devoid of radiologically evident lung nodules or concerning parenchymal irregularities. Benign cases refer to patients with lung nodules that radiologists and oncologists have identified as non-malignant lesions (e.g., granulomas, inflammatory conditions, or other benign entities) without signs of invasive tumor behavior. Malignant cases correspond to patients exhibiting CT results indicative of primary lung cancer, wherein the nodules or masses have been clinically or radiologically validated as malignant. All CT slices pertaining to a given patient receive the corresponding patient-level classification (normal, benign, or malignant).
The CT scans were initially obtained in DICOM format utilizing a Siemens SOMATOM scanner, operating at a tube voltage of 120 kV and a slice thickness of 1 mm. Window width and center values of 350–1200 HU and 50–600 HU, respectively, were employed for visual interpretation, with breath-holding at full inspiration. Each scan comprises around 80–200 axial slices covering the thorax from various elevations and angles. All images were anonymized before processing. Oncologists and radiologists from the collaborating centers annotated the data, and the study received approval from the pertinent institutional review boards, which waived the requirement for written consent.
In this study, we exclusively used the CT component of the IQ-OTH/NCCD dataset; no PET images are included in our experiments. Various preprocessing procedures were applied to prepare the dataset, ensuring consistent intensity normalisation, spatial resolution, and cropping so that the images are uniform for model training. The dataset is publicly available at:
https://data.mendeley.com/datasets/bhmdr45bh2/4.

Image loading and preprocessing
To ensure consistency and comparability across the dataset, a series of standardization procedures were applied. As illustrated in Fig. 3, the raw dataset contained images of varying formats and resolutions. First, all images were converted from RGB to grayscale, preserving essential structural details while reducing computational complexity. Subsequently, the original 512 × 512 CT images were resized to a fixed resolution of 128 × 128 pixels to maintain uniform input dimensions across the dataset; when required for ImageNet-pretrained architectures, the single grayscale channel was replicated to form pseudo-RGB inputs. Finally, the categorical class labels were encoded into integer and one-hot formats suitable for model training: benign cases were mapped to label 0 with one-hot vector [1, 0, 0], malignant cases to label 1 with [0, 1, 0], and normal cases to label 2 with [0, 0, 1].
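A minimal NumPy sketch of this preprocessing pipeline follows; the block-mean downsampling and luminance weights are simplified stand-ins for the interpolation-based resize and library conversion routines actually used in practice.

```python
import numpy as np

def to_grayscale(rgb):
    """Luminance-weighted conversion of an RGB-stored slice to one channel."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def resize_by_block_mean(img, out=128):
    """Simple 512 -> 128 downsample by averaging 4x4 blocks (a stand-in
    for the interpolation-based resize used in practice)."""
    f = img.shape[0] // out
    return img.reshape(out, f, out, f).mean(axis=(1, 3))

# Label encoding as described in the text
LABELS = {"benign": 0, "malignant": 1, "normal": 2}

def one_hot(name):
    v = np.zeros(3)
    v[LABELS[name]] = 1.0
    return v

rgb_slice = np.random.default_rng(2).random((512, 512, 3))
gray = to_grayscale(rgb_slice)                        # (512, 512)
small = resize_by_block_mean(gray)                    # (128, 128)
pseudo_rgb = np.repeat(small[..., None], 3, axis=2)   # for pretrained nets
```
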

Train-test split
To prevent data leakage, we perform the train-test split at the patient level instead of the slice level. Each of the 110 cases yields several axial CT slices, so arbitrarily partitioning individual slices could place slices from the same patient in both the training and test sets. Consequently, we divide the dataset so that 80% of the patients (88 cases) are allocated for training and validation, while the remaining 20% (22 cases) are reserved for testing. Maintaining the original case-level class distribution (55 normal, 15 benign, 40 malignant), this 80–20 division results in 44 normal, 12 benign, and 32 malignant cases in the training/validation subset, and 11 normal, 3 benign, and 8 malignant cases in the held-out test subset. Because different patients contribute different numbers of slices (about 80–200 per scan), this patient-level division results in 977 slices for the training/validation set and 213 slices for the test set. All slices from a given patient are allocated to the same subset, guaranteeing that no patient contributes slices to both the training/validation and test sets.
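The patient-level split can be sketched as follows; the per-patient slice counts here are randomly generated placeholders, since the real per-case slice distribution is not published slice-by-slice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical bookkeeping: each slice records the patient it came from.
patients = np.arange(110)                        # 110 cases
slices_per_patient = rng.integers(80, 201, 110)  # roughly 80-200 slices each
slice_patient = np.repeat(patients, slices_per_patient)

# Split at the PATIENT level, then gather each patient's slices.
shuffled = rng.permutation(patients)
n_train = int(0.8 * len(patients))               # 88 train/val, 22 test
train_patients = set(shuffled[:n_train].tolist())
test_patients = set(shuffled[n_train:].tolist())

train_idx = [i for i, p in enumerate(slice_patient) if p in train_patients]
test_idx = [i for i, p in enumerate(slice_patient) if p in test_patients]

# Leakage check: no patient may contribute slices to both subsets.
assert train_patients.isdisjoint(test_patients)
```
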

Summary of preprocessing
The images were preprocessed to ensure consistent quality, resolution, and format. The labels were encoded, and the split was stratified to maintain class balance. Table 1 provides a summary of the preprocessing pipeline.

Data augmentation
Data augmentation serves as an essential method for artificially increasing a dataset’s size and diversity by applying random transformations to the current data. In this study, data augmentation was used to address the class imbalances in the lung cancer dataset and increase the diversity of the training data. This is particularly important for DL frameworks because they are often prone to overfitting when trained on small or unbalanced datasets.

Class imbalance mitigation
The IQ-OTH/NCCD lung cancer dataset includes images categorized into benign, malignant, and normal cases. However, the distribution of images across these classes is imbalanced, which can lead the framework to favour the dominant class. Because the dataset contains more normal images than malignant or benign images, the model is more likely to classify images as normal, resulting in poor performance for the rare classes. To address this imbalance, random sampling and augmentation were applied to increase the diversity and number of instances for underrepresented classes. By generating additional variations of existing images, the model is exposed to a broader range of features, aiding the acquisition of robust representations for every class.

Data augmentation techniques applied
Rotation was applied by randomly rotating images within a range (e.g., ± 30°) to mimic variations in orientation; this aids the model in being invariant to rotation and enhances generalization. Horizontal and vertical flipping were applied to various images, simulating different perspectives and enabling the model to learn features without being affected by the image’s orientation. Random scaling/zooming was also utilized. Examples of these transformations include rotating a sample CT slice by ± 30°, applying horizontal flips, and performing random zoom-in operations; these augmentations visually demonstrate how the model is exposed to variations in orientation and scale during training.
Random translation of images, both horizontally and vertically, was implemented, mimicking images captured from various angles or distances. Brightness adjustment applies random brightness variations to mimic multiple lighting conditions in imaging, helping the model become resilient to the intensity fluctuations encountered in real-world situations. Random shearing, which slants the image along one axis, was employed to create variations in the appearance of lung regions from different perspectives. Noise injection adds salt-and-pepper or Gaussian noise to the images, making the model robust against the noisy images common in medical datasets. Random cropping simulates zooming into different areas of an image, compelling the model to concentrate on various lung regions.
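A few of the listed augmentations can be sketched in NumPy as follows; the shift range, noise level, and brightness delta are illustrative values, not the exact settings used in training.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_flip(img):
    """Horizontal and/or vertical flipping, each with probability 0.5."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    if rng.random() < 0.5:
        img = img[::-1, :]
    return img

def random_translate(img, max_shift=10):
    """Shift horizontally and vertically, padding vacated pixels with zeros."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, 2)
    out = np.zeros_like(img)
    src = img[max(0, -dy):img.shape[0] - max(0, dy),
              max(0, -dx):img.shape[1] - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def add_gaussian_noise(img, sigma=0.02):
    """Inject Gaussian noise, clipping back to the valid intensity range."""
    return np.clip(img + rng.normal(0, sigma, img.shape), 0.0, 1.0)

def adjust_brightness(img, max_delta=0.2):
    """Apply a random global brightness offset."""
    return np.clip(img + rng.uniform(-max_delta, max_delta), 0.0, 1.0)

img = rng.random((128, 128))
augmented = adjust_brightness(add_gaussian_noise(random_translate(random_flip(img))))
```
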

Benefits of data augmentation
Increasing dataset size: augmentation artificially expands the number of training samples, which enhances the architecture's robustness and allows it to generalize better. The augmentation benefits are listed in Table 2. Reducing overfitting: with a larger and more diverse dataset, the model is less likely to memorize the training data and instead learns generalizable features that transfer to novel, unseen data. Enhancing class balance: applying augmentation to the less represented classes (malignant and benign cases) gives the model a more balanced sample distribution, reducing bias toward the majority class (normal cases). Simulating variations in orientation, lighting, and scale trains the model to generalize across real-world scenarios, enhancing its performance when deployed in clinical environments.

Model design
The model design for lung cancer detection in this study is based on advanced DL techniques to effectively extract features from the images and classify them into three categories: benign, malignant, and normal. The architecture incorporates several advanced mechanisms, such as feature extraction, attention mechanisms, and robust regularization, which help improve the model’s performance.

Convolutional neural network (CNN)
The backbone of the model is a CNN, a DL framework widely used for image classification tasks. CNNs are especially effective for image data as they automatically learn spatial hierarchies of features (edges, textures, shapes, etc.) through convolutional layers, which extract progressively more complex features from the input images. The CNN begins with several convolutional layers that apply multiple filters (kernels) to the input image; these filters learn features such as edges, corners, and textures, which are crucial for identifying different types of lung cancer. The model employs the ReLU (Rectified Linear Unit) activation function in each convolutional layer to introduce non-linearity, enabling the model to learn complex patterns. Batch Normalization (BN) is applied after each convolutional layer to stabilize the learning process and accelerate convergence. BN standardizes the output of each layer by shifting and scaling activations, alleviating problems such as vanishing gradients and reducing the likelihood of overfitting. Normalizing each mini-batch of inputs speeds up training by keeping activations consistent.
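The per-mini-batch standardization that BN performs can be sketched as follows; `gamma` and `beta` stand in for the learned scale and shift parameters:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of activations per channel, then scale/shift.
    x has shape (batch, height, width, channels)."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per channel
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
# Raw activations with an arbitrary offset and spread, as a conv layer might emit
acts = rng.normal(loc=5.0, scale=3.0, size=(32, 16, 16, 8))
normed = batch_norm(acts)
```

After normalization the activations of each channel are centred and unit-scaled, which is what keeps gradients well-conditioned during training.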

HiFuse module
The HiFuse module facilitates multi-scale feature fusion and corresponds to the hierarchical feature fusion (HFF) block detailed in our method. HiFuse operates on the intermediate feature maps generated by the initial convolutional stage and consolidates information across various receptive fields via two parallel convolutional branches. The first branch employs a 1 × 1 convolution to reduce channel dimensionality and highlight the most informative channels, functioning as a learned channel projector. The second branch employs a 3 × 3 convolution to capture broader spatial patterns and contextual information surrounding nodules.
Let Gi represent the "global" feature branch and Li the "local" feature branch at hierarchy i. Channel attention (CA) is applied in the global branch, while spatial attention (SA) is applied in the local branch, as delineated in Eq. (1) to (3) in Sect. 3.1. The outputs of the attention modules, the attended features G′i and L′i, are subsequently integrated with the down-sampled features from the preceding hierarchy i − 1 to create the fused representation Fi, as delineated in Eq. (4) to (8). This fusion is executed using a series of 1 × 1 and 3 × 3 convolutions, concatenation, and an inverted residual MLP (IRMLP), resulting in a single augmented feature map that simultaneously captures local features and global context. The HiFuse feature map is subsequently passed to the deeper convolutional and attention layers of the network.

Overall double attention hybrid CNN–HiFuse architecture
Figure 4 presents an overview of the proposed Double Attention Hybrid CNN–HiFuse architecture for the categorization of lung nodules into three categories: normal, benign, and malignant. Each input is a resized 128 × 128 single-channel computed tomography slice. The network commences with a preliminary convolutional stage (Conv1) featuring a 3 × 3 convolution utilizing 32 filters, succeeded by batch normalization, ReLU activation, and 2 × 2 max pooling, as delineated in Table 6. This phase extracts fundamental features, including edges and basic textures.
The outputs of Conv1 and MaxPool1 are subsequently fed into the HiFuse module. This module processes the feature map through parallel 1 × 1 and 3 × 3 convolution branches to generate local (Li) and global (Gi) feature representations at the current hierarchy. Channel attention is applied to Gi and spatial attention to Li, and these attended features are integrated with the down-sampled features from the preceding hierarchy i − 1 using the hierarchical feature fusion processes delineated in Eq. (4) to (8). The resultant fused representation Fi incorporates both fine local details and broad global context surrounding the nodules.
The HiFuse output is then processed by a second convolutional stage (Conv2), which includes a 3 × 3 convolution with 64 filters, batch normalization, ReLU activation, and 2 × 2 max pooling, refining the representation and reducing the spatial resolution. We apply the double attention mechanism to the refined feature map, consisting of a channel attention module followed by a spatial attention module, as detailed in Sect. 3.2. Channel attention adjusts the significance of each feature channel according to global statistics, whereas spatial attention emphasizes the most informative spatial areas within the lungs, such as potential nodule sites, and suppresses irrelevant background.
Following dual attention, global average pooling (GAP) condenses each channel into a singular scalar by averaging across the spatial dimensions, resulting in a concise one-dimensional feature vector. This vector is transmitted through a fully connected layer including 128 units with ReLU activation, succeeded by dropout for regularization, and ultimately through a softmax output layer with three units representing the benign, malignant, and normal classifications. The Hybrid CNN–HiFuse architecture integrates a lightweight CNN backbone, hierarchical multi-scale feature fusion, and dual attention to provide an end-to-end model that directly maps 128 × 128 CT slices to three-class probability outputs.
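As a sanity check on the layer dimensions described above, the spatial resolution can be traced through the two conv-pool stages; the 'same' padding on the 3 × 3 convolutions is an assumption, since the text does not state the padding explicitly:

```python
def conv_pool_out(size, pool=2):
    """'Same'-padded 3x3 conv keeps the spatial size; 2x2 max pooling halves it."""
    return size // pool

# Trace the spatial resolution through the stages described above.
s = 128                 # input CT slice is 128 x 128
s = conv_pool_out(s)    # after Conv1 (32 filters) + MaxPool1 -> 64 x 64 x 32
s = conv_pool_out(s)    # after Conv2 (64 filters) + MaxPool2 -> 32 x 32 x 64
gap_features = 64       # GAP collapses 32 x 32 x 64 into a 64-dim vector
```

The 64-dimensional GAP vector then feeds the 128-unit dense layer and the three-way softmax, matching the head of the network described above.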

Double attention mechanism
The Double Attention Mechanism is designed to direct the model's attention toward the most salient areas of the input images. This mechanism includes two types of attention: channel attention and spatial attention. Channel attention works by assigning different importance to the channels of the feature map. It generates a weight for each channel based on the global average of the activations in that channel, allowing the model to highlight the channels most critical for detecting lung cancer features while suppressing less relevant ones. Spatial attention focuses on specific spatial locations (or regions) within the image. It creates a spatial attention map indicating which parts of the image are more critical for detecting cancerous tissues. The attention map is generated using a convolutional operation, enabling the model to concentrate on the pertinent spatial regions of the image, such as nodules or tumors in the lung. Together, the channel and spatial attention mechanisms enable the model to focus its processing power on the significant parts of the image, enhancing its ability to differentiate between benign, malignant, and normal images.
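A minimal NumPy sketch of the two attention stages described above; the random projection `w` and the scalar `k` stand in for the real trained layers (a learned channel MLP and a spatial convolution), so this shows the data flow rather than the paper's exact design:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w):
    """Per-channel weights from the global average of each channel.
    x: (H, W, C); w: (C, C) projection (random here, learned in practice)."""
    gap = x.mean(axis=(0, 1))      # (C,) global average per channel
    weights = sigmoid(gap @ w)     # (C,) importance of each channel, in (0, 1)
    return x * weights             # broadcast over H and W

def spatial_attention(x, k):
    """Spatial map from the channel-wise mean, passed through a scalar 'conv'
    and a sigmoid; real implementations use a learned spatial kernel."""
    pooled = x.mean(axis=2)        # (H, W) channel-wise mean
    amap = sigmoid(k * pooled)     # (H, W) attention map, in (0, 1)
    return x * amap[..., None]

rng = np.random.default_rng(2)
feat = rng.random((32, 32, 64))    # stand-in for the post-Conv2 feature map
out = spatial_attention(channel_attention(feat, rng.normal(size=(64, 64))), k=1.0)
```

Because both attention maps lie in (0, 1), the mechanism can only rescale features downward, suppressing uninformative channels and regions while preserving the salient ones.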

Global average pooling (GAP)
Instead of using standard fully connected (FC) layers, the model employs Global Average Pooling (GAP). GAP is a down-sampling technique that computes the average of each feature map, resulting in a single value for each channel. This helps aggregate spatial features more effectively, leading to a compact representation of the most important information in the image. Unlike fully connected layers, GAP eliminates the need for a large number of parameters, which helps reduce overfitting and improves generalisation to unseen data.
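The parameter savings that motivate GAP can be made concrete. The 32 × 32 × 64 pre-pooling shape below is an assumption consistent with two 2 × 2 poolings of a 128 × 128 input and a 64-filter second stage:

```python
h, w, c, dense_units = 32, 32, 64, 128   # assumed feature-map shape and head width

# Fully connected layer on the flattened map vs. on the GAP vector
flatten_params = (h * w * c) * dense_units + dense_units   # weights + biases
gap_params     = c * dense_units + dense_units
```

Under these assumptions the flattened route needs roughly 8.4 million parameters in a single layer, while the GAP route needs about 8 thousand, which is the overfitting-reduction argument made above in numbers.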

Fully connected layers
After applying the convolutional layers, the HiFuse module, and the Double Attention Mechanism, the feature maps are aggregated using GAP. The output is subsequently processed through FC layers for classification. A dense layer with ReLU activation is applied to map the extracted features to a high dimensional space. This layer allows the model to learn complex non-linear mappings between the features and the output classes (benign, malignant, or normal). Dropout regularization is implemented in FC layers to prevent overfitting and improve model generalization. It disables a portion of neurons during training, compelling the model to rely on different sets of neurons, thereby reducing the probability of overfitting the training data. The final output layer is a softmax layer, which provides a probability distribution across the three classes (benign, malignant, normal). This layer allows the model to classify an image into three categories based on the learned features.
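The dropout and softmax behaviour described above can be sketched as follows; inverted dropout (rescaling the surviving units at training time) is the usual convention, though the text does not name the variant:

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest
    so the expected activation is unchanged; identity at inference time."""
    if not training:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def softmax(logits):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(3)
features = rng.random(128)                   # output of the 128-unit dense layer
dropped = dropout(features, 0.4, rng)        # training-time pass at rate 0.4
probs = softmax(rng.normal(size=3))          # benign / malignant / normal
```

The softmax output is a proper probability distribution over the three classes, and at inference time dropout becomes the identity, so predictions are deterministic.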

Architecture summary
The Architecture Summary is listed in Table 3.

Hyperparameter optimization
To enhance the model's efficacy and ensure a fair comparison across architectures, we utilized Optuna, a contemporary framework for automatic hyperparameter optimization. The aim was to determine the hyperparameter combination that maximized validation accuracy on the training/validation subset. Hyperparameters are essential in regulating training dynamics and model capacity, hence significantly impacting final generalization performance.

Hyperparameters tuned
This study involved the tuning of the following hyperparameters:

The quantity of convolutional filters in the primary CNN backbone, which regulates the depth and capability of feature extraction.

The number of units in the dense (fully connected) layer, which determines the complexity of the learned decision boundary.

The dropout rate serves as a regularization parameter to alleviate overfitting.

The learning rate dictates the magnitude of weight adjustments in the gradient descent algorithm.

Optimizer type influences convergence velocity and stability.

The search intervals and candidate values for each hyperparameter are delineated in Table 4.

Optuna setup and trials
Hyperparameter optimization was conducted via Optuna's Tree-structured Parzen Estimator (TPE) sampler. For each trial, Optuna selected a set of hyperparameters from the specified search space (Table 4), instantiated the Hybrid CNN–HiFuse model with those parameters, and trained it on the training/validation subset, with validation accuracy serving as the objective criterion. A total of 20 trials were executed, with Optuna progressively concentrating the search on promising regions of the hyperparameter space informed by the outcomes of preceding trials.

Example of the hyperparameter space
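A sketch of such a search space follows. Table 4 is not reproduced here, so the candidate values are illustrative assumptions consistent with the best configuration reported below, and plain random sampling is used as a stand-in for Optuna's TPE sampler:

```python
import random

# Candidate values are assumptions, not Table 4; Optuna's TPE sampler would
# draw from such a space, with random sampling here as a lightweight stand-in.
search_space = {
    "conv1_filters": [32, 64, 128],
    "conv2_filters": [64, 128, 256],
    "dense_units":   [128, 256],
    "dropout_rate":  [0.3, 0.4, 0.5],
    "learning_rate": [1e-3, 1e-4, 1e-5],
    "optimizer":     ["adam", "rmsprop"],
}

def sample_trial(rng):
    """Draw one hyperparameter configuration, as one optimization trial would."""
    return {name: rng.choice(values) for name, values in search_space.items()}

rng = random.Random(0)
trial = sample_trial(rng)
```

In Optuna itself each entry would become a `trial.suggest_categorical` (or `suggest_float`) call inside the objective function, and the sampler would bias later draws toward configurations that scored well.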

Results of hyperparameter optimization
Optuna identified the configuration that produced the maximum validation accuracy, which was then used for the final model training and comparative studies. The chosen parameters included 128 filters in the initial convolutional layer, 256 filters in the subsequent layer, a dense layer of 256 units, a dropout rate of 0.4, and a learning rate of 1e-4 with the Adam optimizer. This setup achieved a good balance among model capacity, regularization, and stable convergence.

Impact of hyperparameter tuning
The Optuna-driven tuning procedure resulted in a significant enhancement in validation performance relative to manually selected default configurations. The chosen hyperparameters improved validation accuracy, mitigated overfitting via an effective dropout strategy, and produced a learning rate and optimizer combination that facilitated stable and efficient training. The ultimate configuration was thereafter employed for both the 5-fold cross-validation studies and the assessment of the reserved 213-slice test set.

Training and evaluation
Once the model architecture was defined and hyperparameters were optimized, the next step involved training the model and evaluating its performance. The training process was carefully designed to ensure efficient learning while minimizing overfitting. Key components of this stage included the use of advanced optimizers, loss functions, learning rate schedules, and regularization techniques.

Training setup
Employing the optimal hyperparameters identified by Optuna, all models were trained using the Adam optimizer (learning rate 1e-4) and categorical cross-entropy loss, suitable for the three-class classification (benign, malignant, normal). A batch size of 32 was used, with training allowed for a maximum of 50 epochs, subject to the early stopping described below. To improve the reliability and robustness of the reported performance, the identical training setup was used within a 5-fold cross-validation framework on the training/validation subset (88 patients, 977 slices). In each fold, the model was trained on 80% of the training/validation data and assessed on the remaining 20%. After all five folds, the mean and standard deviation of cross-validation accuracy were calculated and reported alongside the findings from the independent 213-slice test set.
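The categorical cross-entropy loss used here can be written out in a few lines; the two predictions below are toy values for illustration:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)   # guard against log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Two slices: one malignant (class 1) predicted confidently, one benign less so.
y_true = np.array([[0, 1, 0], [1, 0, 0]], dtype=float)
y_pred = np.array([[0.05, 0.90, 0.05], [0.60, 0.25, 0.15]])
loss = categorical_cross_entropy(y_true, y_pred)
```

Only the probability assigned to the true class enters the loss, so confident correct predictions (0.90) contribute far less than hesitant ones (0.60).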

Regularization techniques
Various regularization procedures were utilized throughout training to enhance generalization.

Early stopping: Validation accuracy was checked, and training was terminated if no enhancement was seen for 5 consecutive epochs, therefore averting overfitting and superfluous computation.

Reduce-on-plateau learning rate scheduling: A “Reduce LR on Plateau” callback was employed to decrease the learning rate by a factor of 0.5 if the validation loss failed to improve for three consecutive epochs, hence assisting the optimizer in refining the solution during subsequent training phases.

Dropout: According to Optuna’s findings, a dropout rate of 0.4 was used in the dense layer, promoting the network’s dependence on several complementing feature paths instead of memorising particular training instances.

The implementation of these metrics, along with the optimized hyperparameters, facilitated stable training dynamics and enhanced generalization performance throughout the cross-validation folds and the held-out test set.
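The interaction of the two callbacks can be sketched as a pure-Python replay of a validation-loss history. For simplicity a single no-improvement counter drives both rules, whereas Keras tracks each callback's patience separately:

```python
def schedule(val_losses, patience_lr=3, patience_stop=5, factor=0.5, lr=1e-4):
    """Replay a history of validation losses, halving the LR after `patience_lr`
    epochs without improvement and stopping after `patience_stop`."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best % patience_lr == 0:   # reduce-on-plateau trigger
                lr *= factor
            if since_best >= patience_stop:     # early-stopping trigger
                return epoch + 1, lr            # epochs actually run, final LR
    return len(val_losses), lr

# Loss improves for three epochs, then plateaus: LR halves at the third
# stagnant epoch and training stops at the fifth.
epochs_run, final_lr = schedule([0.9, 0.7, 0.6, 0.65, 0.64, 0.63, 0.66, 0.67])
```

With the parameters above, training halts after 8 of the 8 logged epochs with the learning rate halved once, mirroring the 5-epoch stopping and factor-0.5 plateau rules described in the text.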

System requirements
The lung cancer classification experiments were conducted in Python 3.12 using TensorFlow 2.4 on a workstation featuring an Intel Core i7 processor, 16 GB of RAM, and a 512 GB SSD. The resulting Double Attention Hybrid CNN–HiFuse model has a lightweight architecture, comprising around 1.1 million trainable parameters, equivalent to about 4.4 MB of weights in 32-bit floating-point representation. A 128 × 128 CT slice requires around 0.25 GFLOPs of computation for a single forward pass, making inference computationally economical. On this CPU-only system, a model of this size yields an average per-slice inference time of roughly 7 ms, equating to approximately 140 CT slices processed per second in batch mode using TensorFlow. This indicates that a complete CT volume of approximately 300 slices may be analyzed in approximately 2 s on typical clinical hardware, with significantly higher speed when a mid-range GPU is used. The proposed Hybrid CNN–HiFuse model thus has a computational footprint suitable for near real-time application in radiology workflows and can be deployed on standard hospital workstations without specialized high-end infrastructure.
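The footprint figures quoted above are internally consistent, which a few lines of arithmetic confirm:

```python
params = 1.1e6                  # trainable parameters reported above
weight_mb = params * 4 / 1e6    # float32 is 4 bytes/weight -> ~4.4 MB

per_slice_s = 0.007             # ~7 ms per slice on the CPU-only workstation
volume_s = 300 * per_slice_s    # ~2.1 s for a ~300-slice CT volume
throughput = 1.0 / per_slice_s  # ~143 slices per second
```

So the quoted 4.4 MB, ~2 s per volume, and ~140 slices/s all follow from the parameter count and the 7 ms per-slice latency.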

Results and discussion

Figure 5 and Table 5 summarize the classification performance of all comparison models on the held-out test set, together with the mean accuracy derived from 5-fold cross-validation on the training/validation subset (88 patients, 977 slices). On the independent test set, the baseline CNN attains an accuracy of 90.61% (F1-score 90.75%), which is surpassed by the deeper transfer-learning models: VGG16 achieves 92.96% accuracy (F1-score 93.18%), while ResNet50 further raises performance to 94.84% accuracy (F1-score 94.85%). The custom CNN with attention achieves 95.31% accuracy and a 95.35% F1-score, demonstrating that integrating attention improves the representation of lung nodule patterns. The Hybrid CNN–HiFuse model performs best, attaining 98.12% accuracy, 98.17% precision, 98.12% recall, and a 98.13% F1-score on the test set. The 5-fold cross-validation results substantiate these findings: mean accuracy rises from 89.87 ± 0.42% (baseline CNN) and 94.98 ± 0.27% (custom CNN with attention) to 97.86 ± 0.19% for the Hybrid CNN–HiFuse, with the reduced standard deviation indicating enhanced stability across training/validation splits. The cross-validation findings, derived only from the training/validation subset, complement the held-out test performance and demonstrate that the gains of the Hybrid CNN–HiFuse model are consistent rather than artifacts of a specific split. However, since all experiments are performed on a single-centre dataset, these findings should be regarded as preliminary and require external validation on separate cohorts.

Confusion matrix
Figure 6(a)–(e) displays the confusion matrices for the baseline CNN, VGG16, ResNet50, custom CNN with attention, and the proposed Hybrid CNN–HiFuse model, evaluated on the IQ-OTH/NCCD test set comprising 23 benign, 109 malignant, and 81 normal CT slices. The baseline CNN correctly categorized 20 of 23 benign, 102 of 109 malignant, and 71 of 81 normal slices, resulting in 20 misclassifications spread across all three categories. VGG16 reduced this to 15 misclassifications, attaining perfect identification of benign cases (23/23) while still confusing some malignant slices with benign and normal tissue. ResNet50 reduced the total errors to 11, primarily misclassifying a portion of malignant slices as normal. The custom CNN with attention further improved robustness, correctly recognizing 22 of 23 benign, 105 of 109 malignant, and 76 of 81 normal slices, for a total of 10 misclassifications. The Hybrid CNN–HiFuse model performed best, correctly categorizing 22 of 23 benign, 108 of 109 malignant, and 79 of 81 normal slices, with only 4 misclassifications in total. These results demonstrate that the Hybrid CNN–HiFuse design enhances overall accuracy and achieves more balanced performance across all three classes, notably decreasing false negatives in the malignant and normal categories.
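The headline metrics can be recomputed from the per-class counts stated above. Only the diagonal and the row totals are given in the text, so the placement of the 4 off-diagonal errors below is an illustrative assumption:

```python
import numpy as np

# Hybrid CNN-HiFuse on the 213-slice test set; rows = true class,
# columns = predicted class (benign, malignant, normal). Off-diagonal
# placement of the 4 errors is assumed; the diagonal matches the text.
cm = np.array([[22,   1,  0],
               [ 0, 108,  1],
               [ 0,   2, 79]])

accuracy = np.trace(cm) / cm.sum()               # overall accuracy
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # per-class sensitivity
```

The diagonal counts alone already fix the overall accuracy at 209/213 ≈ 98.12%, matching the figure reported in the text, with malignant sensitivity of 108/109 ≈ 99.1%.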

ROC curve
Figure 7 depicts the one-vs-rest Receiver Operating Characteristic (ROC) curves for all evaluated models across the three categories: benign, malignant, and normal. Each panel, Fig. 7(a)–(e), illustrates the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR) for its corresponding model. The baseline CNN (AUC = 0.91–0.93) shows adequate class separability but displays slight overlap between the benign and malignant regions. VGG16 and ResNet50 demonstrate enhanced discrimination, achieving mean AUC values between 0.94 and 1.00, signifying superior feature generalization. The custom CNN with attention improves the model's capacity to discern nuanced patterns, with per-class AUCs of 0.95–0.97. The Hybrid CNN–HiFuse model (Fig. 7e) demonstrates the highest separability, achieving AUC values of 0.97 for benign, 0.99 for malignant, and 0.99 for normal cases, affirming its diagnostic reliability and resilience against false positives. The near-optimal ROC profiles of the Hybrid CNN–HiFuse confirm its robust classification ability and underline its suitability for early identification of lung anomalies in CT scans.
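A one-vs-rest AUC such as those reported above equals the Mann-Whitney rank statistic, so it can be computed without plotting a curve; the scores below are toy values, not model outputs:

```python
def auc_ovr(scores_pos, scores_neg):
    """One-vs-rest AUC as the probability that a positive (e.g. malignant)
    slice scores higher than a negative one (Mann-Whitney formulation);
    ties count as half a win."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Toy scores: a well-separated class yields AUC near 1.0
pos = [0.95, 0.90, 0.85, 0.40]
neg = [0.30, 0.20, 0.55, 0.10]
```

With these toy scores one positive-negative pair is misranked out of sixteen, giving AUC = 15/16 = 0.9375; an AUC of 0.99, as reported for the malignant class, means almost every malignant slice is scored above every non-malignant one.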

Table 6 presents a comparative analysis of the classification efficacy of the proposed Hybrid CNN–HiFuse model relative to several deep learning models documented in the literature, all assessed on the IQ-OTH/NCCD dataset. Models such as VGG16, AlexNet, Swin Base, and GoogLeNet have exhibited robust performance, with accuracy rates between 94.38% and 96.83%. The Hybrid CNN–HiFuse model surpasses all current methodologies, attaining the highest accuracy (98.12%), precision (98.17%), recall (98.12%), and F1-score (98.13%). This underscores the efficacy of the dual-attention fusion method and its enhanced generalization capacity in multi-class lung cancer classification tasks.

Discussion

This research assessed a Double Attention Hybrid CNN–HiFuse framework for the categorization of lung cancer into three categories (benign, malignant, normal) utilizing the IQ-OTH/NCCD CT dataset. The Hybrid CNN–HiFuse model outperformed a baseline CNN, two transfer-learning models (VGG16, ResNet50), and a custom CNN with attention, achieving superior test performance and fewer misclassifications on the 213-slice patient-level test set, with notably high sensitivity for malignant cases. The confusion matrices demonstrate a more balanced performance across all three classes, suggesting that the hybrid architecture enhances both overall accuracy and class-specific reliability. The ROC analysis corroborates these findings: although the baseline and transfer-learning models offer adequate separability, the Hybrid CNN–HiFuse achieves the greatest per-class AUC values, nearing 0.97–0.99 for the benign, malignant, and normal categories.
The findings indicate that the integration of multi-scale feature fusion and dual channel–spatial attention enhances the network’s ability to discern nuanced distinctions among benign nodules, malignant lesions, and normal parenchyma. Systematic preprocessing, class-specific augmentation, Optuna-driven hyperparameter optimization, and patient-level partitioning enhance stable training and equitable evaluation. Five-fold cross-validation on the training/validation subset, resulting in a mean accuracy of 97.86 ± 0.19%, demonstrates that the observed enhancements are consistent across multiple partitions and not confined to a singular split. Compared to other models previously applied to the IQ-OTH/NCCD dataset, the proposed Hybrid CNN–HiFuse achieves superior accuracy and overall performance across all key classification metrics.
From a pragmatic standpoint, the low malignant false-negative rate, the attention-based visual explanations, and the minimal processing requirements make the architecture a viable choice for incorporation into computer-aided diagnostic workflows as a decision-support instrument. Nonetheless, the study has clear limitations. The dataset is single-centre and limited in size, with all scans obtained under comparable imaging settings, and the current implementation operates on 2D slices rather than complete 3D volumes. The class taxonomy is limited to three general categories, and exceptionally high performance metrics, such as AUCs approaching 1.0, should be interpreted with caution as they may indicate residual overfitting. Subsequent research should emphasize external validation across multi-centre cohorts, expansion to three-dimensional or volumetric methodologies, integration of supplementary modalities or clinical factors, and prospective studies to assess the influence on radiologist performance and patient care.

Conclusion

This study presented a Double Attention Hybrid CNN–HiFuse model for three-class lung cancer categorization using axial CT slices. The proposed framework, which integrates convolutional feature extraction with a hierarchical multi-scale fusion block and dual channel-spatial attention, outperformed a baseline CNN, transfer-learning models (VGG16, ResNet50), and a custom CNN with attention on a patient-level split of the IQ-OTH/NCCD dataset. On the 213-slice held-out test set, the Hybrid CNN–HiFuse attained an accuracy of 98.12% and an F1-score of 98.13%, while 5-fold cross-validation on the training/validation subset produced a mean accuracy of 97.86 ± 0.19%, corroborating the robustness of the reported improvements.
The findings remain preliminary. The investigations depend on a single-centre dataset, utilize 2D slice-based inputs, and lack external validation; thus, any near-perfect ROC-AUC values should be interpreted cautiously. Prior to routine clinical use, the model must be tested on larger and more diverse multi-centre cohorts, assessed under different acquisition protocols, and extended to utilize complete 3D volumes and richer clinical data. Notwithstanding these limitations, the Hybrid CNN–HiFuse architecture presents a promising framework for efficient, attention-driven multi-scale models in lung cancer computer-aided diagnosis, with the potential to assist radiologists in early detection when integrated with explainable AI methodologies and incorporated into practical imaging workflows. Ultimately, the continued development of this study has the potential to impact broader medical fields, contributing to the advancement of innovative healthcare solutions.

Supplementary Information

Below is the link to the electronic supplementary material.
