Lightweight deep learning model with spatial attention for accurate and efficient breast cancer prediction.

Jaafari J; Ezzine H; Douzi K; Douzi S

doi:10.1038/s41598-025-34311-w

← 뒤로

Lightweight deep learning model with spatial attention for accurate and efficient breast cancer prediction.

1/5 보강

Scientific reports 📖 저널 OA 97.3% 2021~2026 2026 Vol.16(1) p. 4180

Jaafari J, Ezzine H, Douzi K, Douzi S

📖 무료 전문 🟢 PMC 전문 PMC12858957

PubMed ↗ DOI ↗ BibTeX ↓ RIS ↓

📝 환자 설명용 한 줄

Breast cancer is a worldwide health crisis that affects a large number of women.

이 논문을 인용하기

↓ .bib ↓ .ris

APA Jaafari J, Ezzine H, et al. (2026). Lightweight deep learning model with spatial attention for accurate and efficient breast cancer prediction.. Scientific reports, 16(1), 4180. https://doi.org/10.1038/s41598-025-34311-w

MLA Jaafari J, et al.. "Lightweight deep learning model with spatial attention for accurate and efficient breast cancer prediction.." Scientific reports, vol. 16, no. 1, 2026, pp. 4180.

PMID 41530223 ↗

DOI 10.1038/s41598-025-34311-w

Abstract

Breast cancer is a worldwide health crisis that affects a large number of women. Early detection of this disease is critical for determining effective treatment options and improving the chances of positive patient outcomes. In our study, we introduce a new method for detecting breast cancer in its earliest stages using thermographic images and a compact model that can be easily implemented on smartphones. This method is especially useful in areas where medical resources are scarce. We used multiple edge detection techniques like Canny, Roberts and Sobel, and evaluated their effectiveness to improve the accuracy of our model. Our model, which combines MobileNet V2 with a spatial attention mechanism, outperformed other deep learning networks like Inception ResNet and DenseNet121. Furthermore, with an accuracy rate of 98.88%, our proposed model outperformed current state-of-the-art algorithms. These findings point to the potential of our approach for early breast cancer detection and its practical application in resource-limited settings.

📖 전문 본문 읽기 PMC JATS · ~77 KB · 영문

Introduction

Introduction
According to the American Cancer Society, one in eight American women will be diagnosed with breast cancer at some point in their lives, resulting in over 40,000 deaths annually1. Regular breast screenings are essential for detecting breast cancer in its earliest stages, as this can significantly increase the 5-year survival rate to 99%2. Not only does early detection reduce mortality, but it also reduces morbidity and, in some cases, allows patients to avoid surgery3. When the tumor volume is less than 10 mm, there is an 85% chance of complete healing, according to research4. These statistics demonstrate conclusively the significance of early detection in enhancing the prognosis of breast cancer patients.
There are numerous subtypes of breast cancer, each with its own distinct characteristics, size, and behavior. By accurately identifying the type of cancer a patient has, physicians can devise the most effective treatment plan for each individual. Invasive and non-invasive breast cancers are the two primary classifications.
Invasive breast cancer: also known as infiltrating breast cancer, is caused by the spread of cancer cells from the milk ducts or lobules to the surrounding breast tissue. These cancer cells can also spread via the bloodstream or lymphatic system to other organs.
There are several subtypes of invasive breast cancer based on the location of their growth:

Invasive ductal carcinoma (IDC) is the most prevalent form of breast cancer, accounting for roughly 80% of all cases. It develops when abnormal cells in the lining of the milk ducts transform into cancerous cells and invade the surrounding breast tissue.

Invasive lobular carcinoma (ILC) is the second most prevalent form of invasive breast cancer, accounting for ten to fifteen% of all cases. It begins in the milk-producing glands of the lobules and spreads to other areas of the breast.

Noninvasive breast cancer: also known as in situ breast cancer, is characterized by the presence of cancer cells within the breast tissue without metastasis to lymph nodes or surrounding tissue. It should be noted, however, that this type of cancer may progress to an invasive form over time. Noninvasive breast cancer, like invasive breast cancer, has two principal subtypes: Ductal Carcinoma In Situ (DCIS) and Lobular Carcinoma In Situ (LCIS).

DCIS, also known as intraductal carcinoma, affects the cells of the breast’s milk ducts. These cells become cancerous but do not spread to the lymph nodes or bloodstream; they remain confined to the duct walls. This subtype accounts for approximately 16% of all cases of breast cancer and is commonly referred to as stage 0 of the disease.

LCIS, on the other hand, is a less common condition that occurs when abnormal cells grow within the breast milk-producing lobules. These cells do not, however, spread beyond their origin. Although LCIS is not considered to be a form of cancer, it may indicate a higher risk of developing invasive breast cancer in the future.

Various diagnostic procedures, including mammography, ultrasound, breast MRI, and thermography, are utilized by medical professionals to detect potential breast tissue anomalies.
Mammography is a common diagnostic technique. Mammography is a diagnostic imaging technique that uses low-dose X-rays to generate detailed images of breast tissue. It has long been considered the most common method of breast cancer detection4,5. Recent studies, however, have demonstrated that regular mammography increases the risk of breast cancer in healthy women6. In addition, research indicates that radiologists may interpret mammograms differently, resulting in false-positive results7,8.
Ultrasound is an imaging technique that uses high-frequency sound waves to produce images of a specific area, enabling the detection of tumors. Radiation-free, this method is considered safer for pregnant and younger women. However, ultrasound may not be able to detect microcalcifications or tumors in deep areas, and the technique’s effectiveness is dependent on the physician’s ability to interpret the images3.
Breast MRI (magnetic resonance imaging) is a non-invasive technique that uses a powerful magnetic field and radio waves to create detailed images of the breast. This method can detect the smallest lesions that are otherwise undetectable, but it is expensive and frequently produces false positive results. In addition, it is incapable of detecting microcalcifications and is rarely used on pregnant women due to the possibility of allergic reactions to the magnetic field’s strength9.
Thermography is predicated on the theory that cancer cells necessitate increased blood flow and metabolism, resulting in an increase in skin temperature in affected areas. Infrared thermography (IRT) is a non-invasive technique that uses a specialized camera to measure the temperature of the skin in a specific area by capturing long-infrared radiation from the electromagnetic spectrum (9000–14,000 nanometers or 9–14 m) and generating thermograms that display patterns of heat and blood flow on or near the surface of the body. Thermography is a less well-known method for detecting breast cancer than mammography, which is more widely accepted among medical professionals. Thermography has the potential to detect breast cancer at an earlier stage, thereby reducing patient risk. In addition, it is a more cost-effective method of detection than other methods.
While individual components of our methodology, such as MobileNetV2, edge detection algorithms, and attention mechanisms, have been used in other contexts, their combined application for thermographic breast cancer detection, optimized for smartphone deployment in low-resource environments, has not been previously explored to our knowledge. Our work proposes an original architecture integrating these components to deliver a cost-effective, lightweight, and accurate diagnostic tool tailored for real-world constraints.
This study aims to develop a deep learning model that can be implemented on smartphones using a thermal camera for early breast cancer detection, particularly in regions with limited access to advanced medical equipment. The objective is to develop a solution that is both accessible and affordable, thereby making it possible to increase breast cancer detection rates in underprivileged communities and ultimately save lives. This model has the potential to have a significant impact on breast cancer detection in these areas by employing a non-invasive and cost-effective thermal camera (Fig. 1).

Literature survey

Literature survey
The detection of breast cancer at an early stage is of the utmost importance, as it leads to improved treatment outcomes and higher survival rates. Several imaging modalities have been utilized to detect breast cancer, each with its own advantages and disadvantages. Ultrasound, Breast MRI, and Thermography are among the most widely used imaging modalities for detecting breast cancer. Ultrasound, also known as sonography, is a noninvasive imaging technique that creates images of the breast using high-frequency sound waves. It is widely used for breast cancer detection due to its low cost and lack of ionizing radiation. However, it is dependent on the operator and may not offer the same level of detail as other imaging modalities10. Breast MRI, also known as magnetic resonance imaging, creates detailed images of the breast using radio waves and a magnetic field. It is the gold standard for breast cancer detection, especially for women with dense breast tissue, because it can detect small lesions that may not be visible on mammography or ultrasound. However, it is an expensive imaging modality and not all patients may have access to it11. Thermography, also known as infrared imaging, detects changes in breast temperature using infrared cameras12. It has the potential to detect cancer at an early stage and is non-invasive and radiation-free. It has been extensively studied for the early detection of breast cancer because it is non-invasive, does not use ionizing radiation, and has the potential to do so. The thermographic images are analyzed using specific algorithms and software that extract pertinent features, classify the images, and detect any abnormal regions.
In recent years, classical methods for detecting breast cancer using thermographic images have been extensively studied in the scientific literature. These techniques include thresholding, edge detection, and segmentation, which are based on conventional image analysis methods and rely on human expertise to extract relevant image features. The effectiveness of these techniques in detecting breast cancer has been demonstrated, but they may have limitations in terms of precision and applicability. The authors of Ref.13 proposed an automated breast cancer detection system based on thermal images of the breast. The method entails recording images for various breast orientations, extracting healthy and DCIS image patches, processing the patches with image processing techniques, extracting features, optimizing features with the Marine-Predators-Algorithm, and classifying the images using a two-class classifier. The authors of another recent study14 examine the use of thermography and computer-assisted diagnosis in breast cancer screening and analysis. They investigate the use of machine learning techniques such as segmentation, feature extraction, dimensionality reduction, and various classification schemes in thermogram-based computer-assisted diagnostic systems developed over the past several decades. In order to inform researchers and clinicians about the current state of the field and to plan for future developments, they also discuss the limitations of current techniques and future needs for improvement.
In Ref.15, a self-adaptive gray-level histogram equalization method for enhancing the color of infrared (IR) images for thermography-based early detection of breast tumors is proposed. The classification is accomplished using a support vector machine (SVM). The authors o Ref.f16 provided a thermography-based method for the early detection of breast cancer. The method combines a Genetic Algorithm (GA) classifier for model selection with a Support Vector Machine (SVM) classifier for feature selection. Utilizing the temperature difference between cancerous and healthy tissues, the method aims to improve the diagnostic precision and efficacy of breast cancer. The authors evaluate the proposed method and demonstrate that, combining GA and SVM, makes a significant contribution to the field of breast cancer diagnosis. In Ref.17, the authors provided a summary of the various techniques and methods used in the field of Computer-Aided Diagnosis (CAD) for breast cancer. The authors discuss the early detection and diagnosis of breast cancer using image processing, machine learning, and deep learning. They provide an in-depth analysis of various techniques and methods, including their application and evaluation of performance. The article concludes that the improvement of these methods is crucial for the survival of breast cancer patients and that there is still a need for more advanced tools and techniques for precise diagnosis and classification. Researchers have identified the segmentation and classification phases as particularly challenging.
In recent years, there has been significant interest in the use of deep learning techniques for the detection of breast cancer in thermographic images. These techniques use diverse neural network architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs)18, and generative models (GANs)12, to automatically extract and classify image features. Deep learning has the potential to achieve greater precision and generalization than conventional methods, but it also necessitates larger data sets and more computational power. Another study19 proposed a Computer-assisted Diagnosis (CAD) technique for identifying and diagnosing breast cancer patients. A combination of machine learning algorithms, including Convolutional Neural Networks (CNNs), Support Vector Machines (SVMs), and Random Forest (RF) classifiers, are utilized in this method. The objective is to categorize patients into three groups: cancer, no cancer, and non-cancerous. In Ref.20, the authors present a summary of current research on the use of thermography and convolutional neural networks (CNNs) for the early detection of breast cancer. Observing breast temperature distribution, the authors discuss the potential of thermography as a noninvasive and noncontact method for detecting tumors and precancerous conditions. In addition, they examine the available datasets for breast thermal images, the features of breast thermograms, and the distinctions between healthy and cancerous patterns. The paper explains how CNNs can be used to classify breast thermograms, provides a simulation example, and summarizes most of the research on the application of deep neural networks for breast thermogram classification. In addition, the authors suggest future research directions for developing representative datasets, feeding segmented images, selecting appropriate kernels, and developing lightweight CNN models to enhance performance.
The authors of Ref.21 propose a fully automatic breast cancer detection system based on thermography imaging. The system employs a U-Net network to extract and isolate the breast region, followed by a two-class deep learning model trained from scratch to classify normal and abnormal breast tissue. The proposed system aims to increase the efficiency and precision of breast cancer detection and is anticipated to be a valuable clinical tool for physicians. In Ref.22, authors proposed a convolutional neural network (CNNs) with attention mechanisms (AMs) for early detection of breast cancer using infrared thermal images. Thermal imaging has shown promising results for early detection compared to mammograms and is non-invasive, safe, and inexpensive. The authors intend to examine the efficacy of CNNs with AMs for breast cancer detection using thermal images from the Database for Research Mastology with Infrared Image (DMR-IR). The proposed model is evaluated in terms of accuracy, sensitivity and specificity, and is compared to state-of-the-art breast cancer detection methods. Recent research published in Ref.23 proposed thermography as a non-invasive and inexpensive alternative to conventional breast cancer detection techniques such as ultrasound, mammography, and MRI. Multiple AI techniques, including transfer learning with pre-trained models and the use of Bayesian Networks for the interpretability of the diagnosis, are incorporated into the method. Utilizing a novel feature extraction method, temperature-related features are extracted from thermograms and fed into BNs. The results of this method are compared to those of other pre-trained models, and it is discovered that the approach demonstrates high performance with enhanced interpretability compared to previous works, making it a promising candidate for the Breast Self-Examination (BSE) recommended by WHO for mass screening.

Computer vision and CNNs

Computer vision and CNNs
Computer vision is a scientific field and subfield of artificial intelligence that examines how computers can acquire a high level of comprehension from digital images or videos. From a technological standpoint, it seeks to comprehend and automate the tasks that the human visual system can perform. Computer vision functions similarly to human vision, with the exception that humans are one step ahead. The advantage of human vision is that it can be trained to distinguish objects, determine their distance, determine if they are in motion, and determine if an image is flawed. In this context, understanding refers to the transformation of visual images (retinal input) into descriptions of the world that make sense to cognitive processes and can elicit the appropriate response. This image comprehension can be viewed as the acquisition of symbolic information from image data using models constructed with geometry, physics, statistics, and the theory of learning.
Various industries, including energy and utilities, manufacturing, and automotive, use computer vision, and the market continues to expand. By 2030, it is anticipated to reach $41.1 billion24. One of the most significant application areas is medical computer vision or medical image processing, which is characterized by the extraction of diagnostic information from image data. It can be used to detect tumors, arteriosclerosis, and other malignant changes; it can also be used to measure organ dimensions, blood flow, etc. Among the medical applications of computer vision is the enhancement of human-interpreted images, such as ultrasound or X-ray images, to reduce the influence of noise.
In machine learning, a convolutional neural network (CNN or ConvNets) is a type of acyclic artificial neural network (feed-forward) in which the pattern of connections between neurons is inspired by the visual cortex of animals. Their operation is inspired by biological processes, and they consist of a multilayer stack of perceptron designed to preprocess small amounts of data. In CNNs, the “neurons” are arranged similarly to those in the frontal lobe, which is where humans and other animals process visual stimuli. To avoid the fragmented image processing problem of conventional neural networks, the neuron layers are arranged to cover the entire visual field. A synthetic neuron consists of a series of mathematical operations. Initially, a weight and a bias are affinely applied to an input value, which in image analysis is the pixel value. The intermediate result is then subjected to an activation function to represent the data in this function’s data space. This activation function is typically nonlinear because it permits the representation of complex data in situations where linear combination fails.
The architecture of the Convolutional Neural Network includes an upstream convolutional component and thus has two very distinct components:

A convolutional portion whose goal is to extract image-specific features by compressing them to reduce their initial size. In conclusion, the input image passes through a series of filters, simultaneously generating new images known as convolution maps. The obtained convolution maps are then concatenated into a CNN code feature vector.

A classification section: The CNN code obtained at the output of the convolutional section is provided as input to a classification section comprised of fully connected layers called multilayer perceptron (MLP for Multi Layers Perceptron). This section is responsible for combining the CNN code’s characteristics to classify the image.

Utilizing a single weight for signals entering all neurons of the same convolution kernel is a major advantage of convolutional networks. This method reduces memory usage, improves performance25, and permits translation-invariant processing. This is the primary advantage of the convolutional neural network over the multilayer perception, which considers each neuron to be independent and, as a result, assigns a different weight to each incoming signal. In addition, convolutional neural networks necessitate minimal preprocessing. In contrast to conventional algorithms, the network is responsible for evolving its own filters (unsupervised learning). Therefore, the absence of initial configuration and human intervention is a major advantage of CNNs. Moreover, convolutional neural networks require minimal preprocessing. This means that, unlike more conventional algorithms, the network is responsible for evolving its own filters on its own (unsupervised learning). The absence of initial configuration and human intervention is therefore a significant advantage of CNNs.

Training CNNs
Training convolutional neural networks (CNNs) is a crucial step in the development of effective models for tasks like object classification. There are two prevalent methods for training CNNs:

Training a convolutional neural network (CNN) from scratch involves utilizing a randomly initialized network and a large, labeled image dataset. This involves employing techniques such as data augmentation and regularization to enhance the performance of the network. Training a CNN from scratch enables the network to learn task-specific features. This is especially advantageous when the dataset is large and diverse, as the network can learn to recognize a wide variety of objects and patterns. Using techniques such as data augmentation can also improve the network’s generalization and performance on unseen data. However, training CNN from scratch can be a computationally and time-intensive process. A large dataset of labeled images may also be required for optimal performance. In addition, the performance of the network may be constrained by the quality and diversity of the training data.

Transfer learning is a technique for machine learning in which a model trained on one task is used as a starting point for a model trained on a different task. In the case of convolutional neural networks (CNNs), this entails fine-tuning a previously trained network on a new dataset. Transfer learning can significantly reduce the time and computational resources required to train a CNN, which is one of its many benefits. Due to the fact that the pre-trained network has already learned to extract useful image features, it can be used as a starting point for the new task. The network can adapt to the specific characteristics of the new task by being fine-tuned using the new dataset. Transfer learning has the additional benefit of enhancing the model’s performance on the new task. Due to the fact that the pre-trained network has already learned to recognize a vast array of objects and patterns, it can serve as a solid foundation for the new task. The network’s performance is enhanced by fine-tuning it on the new dataset so that it can adapt to the specific characteristics of the new task.

The objective of this study was to develop a neural network model that could be easily executed on mobile devices. To accomplish this, we selected MobileNet architecture, which is renowned for its efficiency and speed. As mentioned previously, this model enables the possibility of early breast cancer detection in settings with limited resources and limited access to traditional medical equipment. In addition, we employed transfer learning to optimize the MobileNet architecture for our task of breast cancer prediction.
To ensure the model generalizes well and avoids overfitting, we applied k-fold cross-validation, early stopping, and dropout regularization.

Mobile net v1 and v2
Mobile Net v126 is a convolutional neural network (CNN) developed specifically for mobile devices. It was created by Google in 2017 and has since been utilized extensively in a variety of applications, including image classification, object detection, and segmentation. Mobile Net v1 is distinguished by its efficient architecture. It employs depthwise separable convolutions, which are computationally less expensive than standard convolutions, to reduce the number of network parameters and their computational complexity. This enables Mobile Net v1 to operate on mobile devices with constrained computational resources, such as smartphones and tablets. Mobile Net v1 is distinguished by its low latency. It has a small number of layers and parameters, allowing it to process an input image quickly. This permits the use of Mobile Net v1 in real-time applications such as video streaming and object detection in live video feeds. Mobile Net v1 has also demonstrated excellent performance across a range of tasks. It has achieved state-of-the-art results on a number of benchmarks, such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and has been successfully implemented in a number of applications.
Mobile Net v1’s architecture is founded on depthwise separable convolutions, a type of convolution that is computationally cheaper than conventional convolutions. First, the input image is convolved with a set of filters that are independently applied to each channel in a depthwise separable convolution (depthwise convolution). The resultant feature maps are then concatenated and convoluted with a set of filters that combine information from all channels (pointwise convolution). Mobile Net v1 is composed of convolutional layers, batch normalization, and ReLU activation. The output of each convolutional layer is concatenated with the output of the previous layer after passing through a 1 × 1 convolution (pointwise convolution). This allows the network to combine information from different layers, thereby enhancing its performance. Mobile Net v1 also employs residual connections, which allow the network to bypass layers and combine the input and output of certain layers directly. This improves the network’s performance by permitting it to learn residual functions, which are the difference between the desired output and the network’s output.
MobileNet V227 is an enhancement to MobileNet V1, it has been developed to be more efficient, accurate, and quicker than its predecessor. MobileNet V2 encompasses several significant modifications, such as the introduction of new residual block modules known as “linear bottlenecks,” which reduce the number of model parameters while maintaining high accuracy. The linear bottleneck is comprised of a 1 × 1 convolution layer, a 3 × 3 convolution layer, and an additional 1 × 1 convolution layer. MobileNet V2 also employs a technique known as “shortcut connection,” which directly connects the input and output layers of the residual block. This shortcut connection shortens the distance between the two layers, which expedites convergence and facilitates gradient propagation during training. In addition, MobileNet V2 employs a Channel Splitting layer that divides the input data into two channels: one for channel information and the other for spatial information. This method enhances the model’s accuracy by facilitating the separation of spatial and channel information. In comparison to MobileNet V1, MobileNet V2 contains additional layers and inverted linear residual blocks. Furthermore, the learning stability is enhanced by the application of the bulk normalization function prior to each convolution layer. Consequently, MobileNet V2 is more precise. The inverted linear residual block formula is denoted as: where x is the input, f1 and f2 are convolutional layers, and + is the residual sum operation.
Finally, it is crucial to emphasize that MobileNet V2 has a more profound architecture than MobileNet V1, with 53 layers in contrast to 28. This enables the model to capture more intricate characteristics and achieve a higher level of precision while remaining lightweight enough to be executed on mobile devices.

Attention
The attention mechanism was introduced in the context of machine translation28, where it was used to enable the model to prioritize certain parts of the input sentence when producing the output translation. Since then, attention has been incorporated into a variety of natural language processing tasks, including language modeling, machine translation, summarization, and image captioning. The attention mechanism learns a function that maps the input sequence and context to a set of weights, which are then used to calculate the weighted sum of the input elements. The attention function is typically implemented as a neural network layer that accepts the input sequence and context as inputs and outputs the weights for each element of the input sequence. Typically, the attention layer is followed by a subsequent layer or module that uses the weighted sum as input, such as a decoder in a model for machine translation.
In the context of computer vision, attention mechanisms can be used to direct the model’s processing attention to particular regions of an input image. This is particularly useful when the model must simultaneously attend to multiple objects or regions of the image, or when the objects of interest are small or difficult to distinguish from the background.
There are multiple ways to incorporate an attention mechanism into a computer vision model. Here are some strategies we might consider:

Attention in the decoder: In this circumstance, the attention mechanism can be used to weigh the input features at each time step of the decoder, allowing the model to focus on different portions of the input image while generating the output. Attention in the decoder is unsuitable for breast cancer detection using thermal images because this type of attention mechanism is typically used when the model is generating some output based on the input image, such as a caption or a label, whereas in this study we are only processing the input image and generating a prediction (indicating whether or not breast cancer is present.

Spatial attention29: By assigning different weights to the pixels of the input image, this type of attention mechanism enables the model to focus on particular regions of the image. To implement this, we can use a convolutional neural network (CNN) layer that learns to weight the input pixels according to their significance for the current task. For instance, we could train a CNN layer with a small kernel size (e.g., 3 × 3) to learn the weights for each pixel in the input image.

Feature-based attention30: This type of attention mechanism enables the model to focus on particular features or channels within the input image. To implement this, we can use a neural network layer that learns to weight the input features according to their significance for the task. For instance, we could use a fully connected layer with a small number of units (such as 32 units) and train it to discover the weights for each feature in the input image.

The level of detail at which attention is directed is a critical difference between focusing on individual pixels (spatial attention) and focusing on specific features (feature-level attention). When dealing with models that need to consider the complex arrangement of input images, like identifying detailed patterns or abnormalities in medical imaging, it is more suitable to have a focus on individual pixels. On the other hand, it is more advantageous to concentrate on the feature level when examining the more conceptual characteristics of the input image, such as texture, color, or shape. For all these factors, we will utilize a spatial attention mechanism (Figure.2) to allow the model to concentrate on particular areas of the input image while analyzing thermal images for breast cancer diagnosis.

Combining MobileNet and the attention mechanism

Combining MobileNet and the attention mechanism
The utilization of MobileNet technology in conjunction with the Attention Mechanism aims to enhance the precision of early breast cancer detection. To be more precise, the inclusion of the attention mechanism is intended to improve classification accuracy by directing attention to the most relevant characteristics of thermal images.
There are many options for combining the MobileNet network with the attention mechanism. The available options include:

Parallel fusion refers to the parallel and independent operation of MobileNet and the attention mechanism on the same inputs.

Sequential fusion refers to the process of utilizing the outputs of the MobileNet as inputs for the attention mechanism.

Our proposed approach involves incorporating a spatial attention layer immediately following the MobileNet architecture. The spatial attention mechanism operates by computing activation maps for each output channel, employing the SoftMax activation function. Subsequently, the activation maps are utilized to assign weights to the network outputs, so giving greater importance to the most significant channels. The formula for computing the activation maps S is as follows: where Fi is the output of channel i of the previous layer and c is the total number of output channels.
The weighted outputs are subsequently combined using a pooling function, such as average or maximum, to create a global representation of the image. A formula for calculating the weighted output Ui is as follows: where Fi, j is the output of channel j of the previous layer, weighted by the channel weight Sj.

Experiments and results

Experiments and results

Dataset
To aid in the early detection of breast cancer, Silva et al. compiled and distributed a dataset of thermogram images to the research community31. This dataset was created by capturing 20 images at 15-s intervals while the breasts were at the same temperature as the surrounding environment using a FLIR (Forward Looking Infrared) camera. Before the images were captured, a cooling air stream was used to accomplish this. A reference temperature of 36.5 °C was utilized, derived from established literature on average human body temperature and clinical observations of thermal variations in breast cancer patients. The dataset, which can be downloaded via a link in32, has already been separated into train and test sets. The train set includes thermograms of 29 breast cancer patients and 15 healthy individuals, whereas the test set contains thermograms of 8 breast cancer patients and 4 healthy individuals. Each case is represented by 20 thermograms and corresponding ROI images. In Fig. 3, images (a) and (b) display thermographic images of healthy breasts. These images are distinct and reveal a uniform temperature across the entire breast. Figures c and d, in contrast, depict breasts with cancer. These images display regions of elevated temperature, which indicates the presence of cancerous cells. Cancer-related thermographic images are less consistent and display a wider temperature range, indicating the presence of abnormal cells.

Data preprocessing
Several factors may lead us to choose feature extraction over other preprocessing techniques:

Feature extraction can be more effective at preserving the image’s essential information. By extracting features such as edges or corners, we can preserve important details about the shape and structure of the objects in an image that may be lost when using other techniques such as smoothing or blurring.

Feature extraction can be more robust to noise and other image variations. By extracting features that are invariant to specific types of noise or variations, we can enhance the accuracy and robustness of our analysis.

Computationally, feature extraction can be made more efficient. Some feature extraction techniques, such as edge detection and corner detection, can be implemented with simple, fast, and computationally resource-light algorithms.

The interpretability of feature extraction can be increased. By extracting features such as edges and corners, we can gain a better understanding of the structure and composition of the image’s objects, which is useful for tasks such as object recognition and classification.

Before using feature extraction techniques, it is often recommended to convert the image to grayscale33. This can be a useful preprocessing step for many types of monochromatic images, including thermal images. An image is gray-scaled by converting it to a single channel in which each pixel is represented by a single intensity value (Fig. 4). Gray scaling an image before applying feature extraction techniques has several advantages. One advantage is that it reduces the image’s dimension, making the edge detection process more computationally efficient. Additionally, Gray scaling can reduce image noise, thereby enhancing the precision of edge detection. Edge detection algorithms typically rely on detecting changes in pixel intensity; therefore, converting the image to grayscale ensures that the algorithm only considers intensity changes and is unaffected by color differences. In addition, Gray scaling can simplify the interpretation of the edge map, as it is easier to see edges in a grayscale image than in a color image.

Edge detectors
Edge detection is a common image processing technique that identifies and emphasizes the boundaries between distinct regions or objects in an image. Over the years, numerous edge detection algorithms have been developed, each with its own advantages and disadvantages. The following are some of the most used edge detectors:

Canny edge detector
This algorithm34, developed by John Canny in 1986, is widely used for edge detection due to its high performance and reliability. It employs a multi-step procedure to detect edges, which includes smoothing the image, locating the intensity gradient, and applying hysteresis thresholding to eliminate weak and irrelevant edges. These steps are described in greater detail below.

Smoothing: The first step of the Canny edge detector is to use a Gaussian filter to smooth the image. This helps to reduce noise and enhance the image’s signal-to-noise ratio. The smoothed image is denoted by the letter G(x, y).

The following step is to calculate the magnitude and direction of the image’s gradient. One kernel represents the first derivative in the x-direction, while the other kernel represents the first derivative in the y-direction. Using the following formulas, one can determine the gradient magnitude and direction.Gradient magnitude:

Gradient direction:
where Gx(x, y) and Gy(x, y) represent the convolved images of the first derivative in the x and y directions, respectively.

Non-maximum suppression: During this step, the Canny edge detector eliminates pixels that are not considered local maxima along the gradient’s direction. This aids in thinning the edges and removing non-edge pixels.

The final step of the Canny edge detector is to apply hysteresis thresholding to suppress weak and irrelevant edges. This is achieved by setting low and high threshold values. Pixels with a gradient magnitude below low are not considered edges, whereas pixels with a gradient magnitude above high are edges. Pixels with a gradient magnitude between the low and the thigh are considered edges if they are connected to a pixel whose gradient magnitude is greater than high. This helps to eliminate noise and improve edge detection precision.

Sobel operator
The Sobel operator35 is a widely used algorithm for edge detection that is based on the convolution of the image with a small, separable kernel that approximates the image’s first derivative. It is named after the 1968 developer of the algorithm, Irwin Sobel. The Sobel operator approximates the first derivative in each direction using two 3 × 3 kernels, one for the horizontal direction and one for the vertical direction. These kernels are characterized by:
To apply the Sobel operator to an image, the image is convolved separately with each of these kernels to obtain the gradient in the x and y directions. The gradient magnitude is then calculated using the following formula:
Additionally, the gradient direction can be determined using the following formula:
The Sobel operator can be applied to both grayscale and color images and is easy to implement. It is frequently paired with other edge detection algorithms to enhance performance. The Sobel operator’s sensitivity to noise, which can result in the detection of false edges, is one of its primary limitations. Before applying the Sobel operator, it is common practice to apply additional smoothing to the image to mitigate this issue.
Several variants of the Sobel operator, such as the modified Sobel operator and the normalized Sobel operator, have been proposed to address some of its limitations. These modifications typically involve modifying the kernel weights or normalizing the gradient magnitude in order to reduce noise sensitivity and improve edge detection robustness.

Robert’s operator
The Roberts operator36 is a simple algorithm for edge detection that is based on the cross-correlation of the image with a small, fixed kernel. It was created by Lawrence Roberts in 1963 and is one of the earliest algorithms for edge detection. To approximate the first derivative of the image, the Roberts operator employs a 3 × 3 kernel. For the Roberts operator to be applied to an image, the image must be convolved with the kernel to obtain the gradient magnitude. The magnitude of the gradient is then computed using the following formula: where Ix, y represents the pixel intensity at position (x, y) in the image. It is impossible to determine the gradient’s direction using Roberts’ method, as it fails to estimate the gradient in the x and y dimensions.
The Roberts operator can be applied to both grayscale and color images and is relatively easy to implement. Due to its sensitivity to noise, which can lead to the detection of false edges, it is not as widely used as other algorithms for edge detection. Before applying the Roberts operator, it is common practice to apply additional smoothing to the image to mitigate this issue. Figure 5 depicts the thermographic image in greyscale and the edge detection algorithms Sobel, Roberts, and Canny applied to it.

Resizing
To guarantee compatibility with Mobile Net V2, the DMR-IR dataset was preprocessed by resizing each image to 224 × 224 pixels. The Python Pillow library was used for this, with the Image.Open() function reading the images, the Image.resize() function resizing the images, and the Image.save() function saving the resized images. A script was implemented to ensure that this method was applied uniformly across the entire dataset. This allowed for the efficient preparation of the DMR-IR dataset for Mobile Net V2 use.

Architecture of classification models
After the DMR-IR dataset has been prepared and the necessary preprocessing steps have been implemented, we can proceed to the model’s proposed architecture. This section describes the architecture and implementation of the neural network used to analyze the dataset.
Before presenting the proposed architecture of our model (Fig. 6), we wished to evaluate the effect of various edge detection algorithms on the model’s performance. To accomplish this, we applied three distinct algorithms—Sobel, Roberts, and Canny—to the DMR-IR dataset and analyzed the resulting images using MobileNet V2 as a classifier. Then, we compared the performance of the model using these preprocessed datasets to the performance of the model using the original dataset to determine which approach yielded the best performance. This allowed us to determine the most efficient method for extracting relevant features from the images in the DMR-IR dataset and ensured that our proposed architecture was suitable for the task at hand. In the following section, the results of these experiments are presented along with a comparison to the results obtained when using the original DMR-IR dataset without any edge detection applied.
The proposed architecture for DMR-IR dataset analysis combines attention layer and MobileNet V2. As stated previously, MobileNet V2 is a deep convolutional neural network that was trained on the ImageNet37 dataset and has demonstrated excellent performance on a variety of image classification tasks. By adding an attention layer, we aimed to improve the model’s ability to focus on relevant image features and to represent the spatial relationships more accurately between various scene objects.
To enhance the model’s ability to focus on relevant image features, we incorporated a self-attention mechanism after the final convolutional layer of MobileNet V2. This mechanism computes attention weights for each feature map position, allowing the model to selectively emphasize the most discriminative regions of the segmented images. The weighted feature map is then passed through subsequent convolutional layers and a fully connected layer to produce the final classification output.
In our analysis of the DMR-IR dataset, we decided to use spatial attention after analyzing the characteristics of the dataset and the objectives of our study. This decision was made since spatial attention enables the model to independently focus on distinct regions of the input image, which can be especially advantageous for images with complex or cluttered backgrounds. By attending to different parts of the image independently, the model can distinguish between different objects and ignore distracting features more effectively. Compared to channel attention, which is more focused on learning relationships between different channels in an image, this particular task may not be as effective with channel attention.

We used the self-attention mechanism proposed in38 to implement spatial attention in MobileNet V2 using the self-attention mechanism. The method employs a self-attention module that receives the MobileNet V2 network’s features and computes the attention weights for each feature map position. The attention weights are then utilized to weigh the features at each position and generate a weighted feature map, which is then passed through a series of convolutional layers to generate the final output. To implement this approach, we added the self-attention module to the MobileNet V2 network after the final convolutional layer. The network was then trained using the DMR-IR dataset and supervised learning.
Our model was developed and trained using Google Colab, a cloud-based platform that provided the necessary computational resources. We employed the MobileNetV2 architecture, pre-trained on ImageNet, and resized input images to 224 × 224 pixels. The training objective was binary classification, and we used the Adam optimizer with a learning rate of 0.0002.
To accomplish this, the MobileNetV2 model was loaded, and the top layer was omitted. Next, we implemented a custom spatial attention layer that takes the output of the base model as input and generates an attention map that enables the model to separately attend to distinct regions of the input image. This enables the model to concentrate on specific image features that are relevant to the classification task, which can be especially useful for images with complex or cluttered backgrounds. To further improve the model’s performance, we decided to use the MobileNetV2 model as a pre-trained model and fine-tune it using the target dataset. We used a binary cross-entropy loss function and the Adam optimizer to train the model. We then added a dense layer to produce the final prediction.
During model training, we employed a combination of batch normalization, dropout, and L2 regularization to prevent overfitting and enhance the model’s generalization capabilities. In addition, we utilized early stopping to monitor the model’s performance on a validation set and to ensure that the training process was not terminated too early or continued for too long.
To assess the performance of our model, we considered several metrics. The initial metric is accuracy, which measures the model’s overall performance. The accuracy formula is:
Next, we used precision as a metric to determine the accuracy of the model’s positive forecasts. The precision equation is:
In addition, we utilized recall as a metric to evaluate the model’s ability to correctly identify all positive instances in the dataset. It is determined by:
Finally, we employed the F1 score as a metric for balancing the model’s precision and recall. The definition of this metric is the harmonic mean of precision and recall. It can be mathematically expressed as: where True positives (TP) are the number of instances in which the model predicted the positive class correctly. True negatives (TN) are the number of instances in which the model predicted the negative class correctly. False positives (FP) refer to the instances in which the model incorrectly predicted the positive class. False negatives (FN) represent the instances in which the model incorrectly predicted the negative class. Consideration of these metrics enabled us to gain a comprehensive understanding of the model’s performance and helped us identify improvement opportunities.

Results and discussion
In this section, we compare different edge detection algorithms using the DMR-IR dataset. We then present the outcomes of our proposed model and compare it to other well-known and cutting-edge models.
First, we present the outcomes of our experiments using various edge detection algorithms and MobileNet V2 on the DMR-IR dataset (Fig. 7). Experiments revealed that the Sobel algorithm had the highest accuracy of 97.77%, outperforming all other methods and the original dataset without any preprocessing, which had an accuracy of 87.77%. The Roberts and Canny algorithms, on the other hand, yielded accuracy rates of 75.55% and 61.11%, respectively. These results demonstrate the efficacy of the Sobel algorithm in extracting relevant features from the images in the DMR-IR dataset and enhancing MobileNet V2’s performance.

Second, we present the outcomes of our proposed architecture, which combines MobileNet V2 with an attention mechanism and employs the Sobel edge detection algorithm (Fig. 8), chosen for its superior performance in our earlier experiments. In addition, the performance of our model is compared to that of several well-known pre-trained models, such as MobileNet, Inception Resnet, and DenseNet121. All models were trained on the preprocessed DMR-IR dataset using transfer learning.
The proposed architecture, which combines MobileNet V2 with an attention mechanism and the Sobel edge detection algorithm, outperforms all other evaluated models, according to our findings. It achieves 98.88% accuracy, which is significantly higher than the accuracy of other models. The accuracy of the second-best performing model, MobileNet V2 without an attention mechanism, was merely 97.77%. The accuracy of the well-known, cutting-edge Inception Resnet and DenseNet121 models was 87.77% and 95.55%, respectively. These results illustrate the effectiveness of combining the pre-trained MobileNet V2 model with the attention mechanism, particularly when coupled with the Sobel edge detection algorithm. The attention mechanism enables the model to focus on the most relevant image features, resulting in improved performance, and using the Sobel algorithm to extract these features further improves the model’s performance.

Figure 8 illustrates the extremely promising performance of our model in predicting breast cancers, with a high true positive rate of 97.8% and a low false negative rate of only 3.0%. This means that the model can accurately identify affected breasts, which is crucial for early breast cancer diagnosis and treatment. In addition, with a true negative rate of 100%, the model performed exceptionally well in identifying healthy breasts. This is significant because correctly identifying healthy breasts reduces the need for unnecessary testing and treatment, resulting in significant cost savings and a reduced burden on the healthcare system.
The performance differences among the compared models can be attributed to their respective architectural and computational characteristics. For instance, Inception ResNet achieves 100% precision for the “Affected” class but suffers from a lower recall of 81.67%, indicating a higher rate of false negatives for the “Healthy” class. This can be attributed to its complex architecture, which may overfit to specific features in the training data, reducing its ability to generalize.
In contrast, DenseNet121 demonstrates a more balanced performance but achieves slightly lower accuracy (95.55%) due to its dense connections, which increase computational overhead while providing moderate improvements. MobileNet V2, as a lightweight architecture, offers competitive accuracy (97.77%) and precision but lacks the fine-grained feature extraction capabilities needed for higher class-specific performance.
Our proposed approach incorporates a lightweight design with a spatial attention mechanism that enhances the model’s ability to focus on critical regions of thermographic images. This leads to a precision of 100% for the “Affected” class and a recall of 100% for the “Healthy” class, achieving the highest overall accuracy (98.88%). The balanced performance across both classes demonstrates the robustness and efficiency of our model, making it particularly suitable for real-time diagnostic applications in resource-constrained environments.

Figure 9 shows the ROC curves for the four evaluated models, illustrating how each balances the true positive rate (TPR) against the false positive rate (FPR) over a range of classification thresholds. As evidenced by its curve, the proposed approach (MobileNet V2 + Attention + Sobel) achieves the highest overall discriminative power, consistently ranking positive instances above negative ones. MobileNet V2 (Sobel) and DenseNet121 follow closely, both exhibiting robust yet slightly lower performance. In contrast, Inception ResNet demonstrates a more moderate trajectory, suggesting it is more prone to misclassifying positives or negatives at certain thresholds compared to the other models.
In addition, the confusion matrix depicted in Fig. 10 illustrates the performance of our proposed model, which incorporates MobileNet v2, an attention mechanism, and the Sobel edge detector. This combination was selected based on its superior performance compared to other evaluated methods. This figure outlines a potential limitation is the model’s relatively high false positive rate of 3.2% for healthy breasts. This means that 3.2% of healthy breasts were incorrectly identified as cancerous by the model. The 3.2% false positive rate reported in our study is comparable to those observed in other non-invasive breast cancer screening methods, such as ultrasound, where false positive rates range from 2% to 10% depending on the population and study conditions. While this rate is relatively low, it still presents a risk of unnecessary anxiety and additional testing for healthy individuals. To address this, we propose refining the model’s decision threshold to optimize the trade-off between sensitivity and specificity. Additionally, combining thermography with other diagnostic parameters, such as patient age, hormonal status, and medical history, could further reduce false positives and enhance the overall reliability of the screening process.

To ensure reproducibility and clarify the impact of our choices, we have detailed all hyperparameters, data splits, and preprocessing steps in the Methodology section. Furthermore, the confusion matrix, precision, recall, and F1-scores are discussed to offer a balanced evaluation beyond overall accuracy. For example, the high recall of 100% for the healthy class ensures no false negatives, while the F1-score of 99.16% reflects both sensitivity and precision for affected cases. These metrics, combined with ROC analysis (Fig. 9), support the model’s discriminative capacity and practical relevance.
In conclusion, we compare the precision of our proposed approach (98.88%) to the most advanced models in the field of breast cancer diagnosis. We evaluate the advanced models proposed by Nasser39, Schaefer40, Silva41, Fernandez42, Tello43, and Ekici44. (Fig. 11).

Our proposed solution outperforms these cutting-edge models when compared to them. The accuracy of Nasser’s model, for instance, was 95.8%, while our approach was 98.88%. Similarly, Schaefer’s model achieved an accuracy of only 80%, which was significantly lower than that of our method. The accuracy of Silva’s model was 90.9%, which was lower than that of our proposed method. The accuracy of Fernandez’s model was 71.4%, which was again lower than our method. The accuracy of Tello’s model was 95.4%, while Ekici’s was 97.5%. In each instance, our proposed method outperformed the dominant models.
While thermography has inherent limitations, such as reduced sensitivity to small tumors and microcalcifications compared to mammography, it remains a valuable tool in early breast cancer detection due to its non-invasive and radiation-free nature. This study addresses these limitations by leveraging advanced deep learning techniques, including a spatial attention mechanism, to enhance the detection of subtle temperature variations that may correspond to early pathological changes. By focusing on thermal pattern analysis, our approach complements traditional imaging methods like mammography and ultrasound, offering a cost-effective and accessible screening option, particularly in resource-limited settings where advanced imaging technologies may not be readily available. Furthermore, the integration of thermography with other modalities has the potential to create a robust diagnostic framework, improving overall detection rates and reducing the burden on healthcare systems.
Overall, these results demonstrate that our proposed method accurately predicts breast cancer. The high true positive and true negative rates of our model make it a valuable tool for early breast cancer diagnosis and treatment, as well as for reducing unwarranted testing and treatment of healthy breasts. The superior performance of our proposed method relative to state-of-the-art models highlights its potential to significantly impact breast cancer diagnosis.
In addition, the model achieves an average processing time of 25 milliseconds per image, which is suitable for real-time diagnostic applications. This performance ensures that the model can provide timely results without compromising accuracy, making it effective for early breast cancer detection in practical scenarios.

Conclusion and future work

Conclusion and future work
Our proposed breast cancer prediction model demonstrates outstanding performance, achieving an accuracy rate of 98.88%, significantly outperforming prior methods by Nasser (95.8%), Schaefer (80%), Silva (90.9%), Fernandez (71.4%), Tello (95.4%), and Ekici (97%). This performance is attributed to the effective use of transfer learning with MobileNetV2, enhanced by the integration of a spatial attention mechanism and the Sobel edge detection algorithm, which together enable precise extraction of relevant features from thermal images.
The model holds considerable promise for improving early diagnosis and treatment, particularly in resource-limited settings. Accurately identifying abnormal thermal patterns it enables timely intervention, potentially improving patient outcomes. In many low-income regions where mammography is either cost-prohibitive or unavailable due to a shortage of radiologists, this lightweight, non-invasive, radiation-free approach offers a viable screening alternative—even in remote areas. Moreover, it is particularly beneficial for women with dense breast tissue, for whom traditional mammography is often less effective.
Despite its strengths, the model exhibits a 3.2% false positive rate, likely due to the absence of clinical metadata such as patient age and family history. To address this, future work will focus on threshold tuning and the integration of clinical parameters to improve specificity without sacrificing sensitivity. It is important to note that this tool is intended for preliminary screening; final diagnostic decisions should remain based on gold-standard clinical methods.
Thermography itself presents inherent challenges, as it is sensitive to environmental and physiological factors—including ambient temperature, physical activity, and hormonal variations—which can affect temperature readings. Although data collection was standardized, individual physiological variability may still introduce noise. Future work should investigate advanced normalization techniques or personalized baselines to improve consistency.
Another limitation of this study lies in the restricted size and demographic diversity of the dataset, due to the limited availability of public thermographic breast cancer data. While k-fold cross-validation was used to enhance result reliability, broader validation on larger and more diverse datasets is necessary to confirm generalizability.
The current research was conducted under controlled environmental and imaging conditions. However, real-world deployment will require validation under variable conditions, such as fluctuating ambient temperatures and lighting. Future work will involve testing model robustness using clinical datasets in diverse real-world scenarios to ensure consistent performance.
A preliminary cost analysis suggests that smartphone-based thermography can significantly reduce infrastructure and equipment costs. Although a comprehensive economic evaluation is beyond the scope of this paper, future research will assess total implementation costs, including thermal camera pricing, training time, and the potential healthcare benefits of earlier detection.
To further advance this approach, future directions include integrating clinical parameters such as patient age and family history, exploring transformer-based models tailored for thermal imaging, and developing a fully integrated end-to-end pipeline for real-time deployment in both clinical and field conditions. Enhancing the model with more sophisticated imaging hardware and expanding the dataset with broader demographic representation could further improve its ability to detect smaller and deeper tumors. Additionally, comparative studies with mammography and the establishment of user protocols will be essential to validate and standardize this technology for widespread clinical adoption.

출처: PubMed Central (JATS). 라이선스는 원 publisher 정책을 따릅니다 — 인용 시 원문을 표기해 주세요.