Weakly supervised colorectal gland segmentation through self-supervised learning and attention-based pseudo-labeling.
Wen H, Wu Y, et al. (2026). Weakly supervised colorectal gland segmentation through self-supervised learning and attention-based pseudo-labeling. Scientific Reports, 16(1), 5771. https://doi.org/10.1038/s41598-026-36256-0
PMID: 41554951
Abstract
Accurate gland segmentation in colorectal cancer histopathology is crucial, but the scarcity of pixel-level annotations limits robust model development. This study aims to develop a highly accurate gland segmentation method that leverages weakly labeled data, specifically image-level labels, to overcome the need for extensive pixel-level annotations. We propose a novel three-stage framework that uniquely combines self-supervised fine-tuning of the DINOv2 vision transformer, attention-based pseudo-label generation, and a boundary-aware loss function. Initially, an off-the-shelf DINOv2 encoder is fine-tuned on a large unlabeled dataset of histopathology images. This fine-tuned encoder is then integrated into a classification network equipped with an attention mechanism, which is trained using image-level labels to generate initial pseudo-labels via attention maps. These maps are refined through blending, thresholding, and Conditional Random Field (CRF) post-processing. Finally, a segmentation network, employing the same fine-tuned encoder and a lightweight decoder, is trained using these refined pseudo-labels and a boundary-aware loss. Ablation studies demonstrated the significant benefit of the fine-tuned encoder and the comprehensive post-processing steps for pseudo-label generation. Further studies confirmed the effectiveness of the boundary-aware loss in improving segmentation accuracy. Our method achieved superior performance on the GlaS dataset compared to several state-of-the-art methods, including both fully supervised and weakly supervised approaches, demonstrating higher F1-score, Object Dice, and lower Object Hausdorff distance. This approach effectively addresses the challenge of limited pixel-level annotations by utilizing more readily available image-level data, offering a promising solution for improved colorectal cancer diagnosis. The proposed framework shows potential for generalization to other histopathology image analysis tasks.
Introduction
Accurate and efficient segmentation of glandular structures in histopathology images is crucial for the diagnosis and grading of colorectal cancer (CRC), the third most common and second deadliest cancer worldwide1. Precise gland boundary delineation allows pathologists to assess morphological features, distinguish benign from malignant tissue, and determine cancer progression stage2. As illustrated in Fig. 1, glandular morphology varies dramatically across CRC grades, from well-formed tubular structures in healthy and low-grade tissue to highly irregular, poorly differentiated forms in aggressive malignancies. Automating gland segmentation can significantly improve diagnostic reproducibility and efficiency while enabling quantitative morphological analysis.
Deep learning has revolutionized medical image segmentation3–6. Fully supervised approaches have dominated colorectal gland segmentation, with early encoder-decoder networks incorporating contour-aware objectives (DCAN7), multi-channel feature fusion8, and minimal-information-loss dilation (MILD-Net9). More recent methods leverage transformers (VENet10) and diffusion processes11 to achieve state-of-the-art performance. However, these approaches universally require large volumes of pixel-level annotations, which are extremely time-consuming and expensive to acquire in histopathology due to the need for expert pathologist input. As shown in Figure 2, coarse-grained labels (such as image-level labels indicating the presence or absence of glands) are far more readily available in clinical settings than detailed pixel-level annotations.
Weakly supervised semantic segmentation (WSSS) using only image-level labels offers a promising solution to the annotation bottleneck. In the general computer vision domain, classic class activation mapping (CAM)12 and its extensions, such as affinity learning13, inter-pixel relation networks (IRNet14), puzzle-based consistency (Puzzle-CAM15), and anti-adversarial manipulation (AdvCAM16), have progressively expanded activation coverage. Transformer-based affinity learning17 and containment modeling18 have further improved contextual reasoning.
In histopathology, domain-specific WSSS methods have emerged to handle staining variation and touching glandular instances. CAMEL19 pioneered ensemble CAMs for pseudo-label generation in pathology. Subsequent works introduced sub-category exploration (SC-CAM20), online easy-example mining (OEEM21), and enhanced prompting of the Segment Anything Model (EP-SAM22). Despite these advances, most rely on backbones pretrained on natural images and produce coarse pseudo-labels that particularly fail at complex, adherent, or poorly differentiated gland boundaries.
Self-supervised learning (SSL) has recently transformed representation learning in histopathology by exploiting vast unlabeled whole-slide image collections. Recent advancements have largely been driven by two primary paradigms: Masked Image Modeling (MIM), such as MAE23, which reconstructs missing pixel information, and contrastive learning frameworks like DINO24 and iBOT25, which learn invariance to views. These methodologies have enabled the training of pathology models, as seen in recent works like UNI26, Virchow27, and others28–30. Among these architectures, DINOv231 has demonstrated exceptional feature robustness and transferability by combining patch-level and image-level objectives. However, few studies have explored the specific benefits of targeted fine-tuning of such SSL backbones on large, domain-specific unlabeled datasets (e.g., exclusively colorectal tissue) as a precursor to weakly supervised gland segmentation.
Attention-based multiple instance learning (MIL)32, dense CRF refinement33, and boundary-aware losses34 are well-established tools for improving pseudo-label quality and final segmentation fidelity, yet they are rarely combined with histopathology-specialized SSL pretraining in a unified pipeline.
To the best of our knowledge, no prior work has systematically integrated (1) large-scale self-supervised fine-tuning of DINOv2 on unlabeled colorectal histopathology WSIs, (2) attention-driven MIL localization followed by comprehensive post-processing (blending, thresholding, and dense CRF), and (3) boundary-aware segmentation training into a single weakly supervised framework for gland segmentation.
We propose a novel three-stage approach that addresses these gaps: (1) domain-specific fine-tuning of DINOv2 on the large unlabeled IMP-CRS-2024 dataset35–37; (2) attention-based pseudo-label generation with extensive refinement; and (3) lightweight decoder training using a boundary-aware composite loss. Ablation studies confirm the critical contribution of each component. On the benchmark GlaS dataset38, our method achieves state-of-the-art weakly supervised performance and surpasses or matches several fully supervised baselines. Furthermore, evaluation on the challenging independent Colorectal Adenocarcinoma Gland (CRAG) dataset9, which contains more malignant and morphologically irregular glands, demonstrates generalization capability.
Methods
This section details our proposed three-stage framework for weakly supervised colorectal gland segmentation. We aim to develop a segmentation network that accurately predicts gland masks from histopathology images, using only image-level labels. We begin by formally defining the problem, followed by a detailed explanation of each stage: self-supervised feature encoding, pseudo-label generation, and segmentation network training.
Problem formulation
Our objective is to develop a segmentation network, denoted S, that accurately predicts gland masks M from Hematoxylin and Eosin (H&E) stained histopathology image tiles $x \in \mathbb{R}^{H \times W \times 3}$, where H and W represent the height and width of the image, respectively. The challenge lies in the fact that we only have access to image-level labels $y \in \{0, 1\}$, indicating the presence (1) or absence (0) of glands in an image, and lack pixel-level annotation data for supervised training. We can express this as learning a mapping function $S: x \mapsto M$ that outputs a segmentation mask M given an input image x.
Overview of the proposed framework
To overcome the challenge of lacking pixel-level labels, our proposed framework, depicted in Fig. 3, operates in three stages. First, in Self-Supervised Feature Encoding, we begin with an off-the-shelf, pre-trained DINOv2 model and fine-tune it on a large, unlabeled dataset of H&E images to create a feature encoder E. This encoder learns a mapping from the input image space to a feature space, $E: \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{h \times w \times c}$, where h and w are the height and width of the feature map and c is the number of channels. This fine-tuning stage is crucial for adapting the general features learned by DINOv2 to the specific domain of histopathology data, as detailed later. Second, for Pseudo-Label Generation using Weak Supervision, we leverage the fine-tuned encoder E and weakly labeled data to generate pseudo-segmentation masks. We train a classification network C, which also uses E, to predict the image-level label y from the encoded features, with prediction C(E(x)). Importantly, the classification network has an attention mechanism, implemented as an MLP+softmax block, that weights the contribution of each patch token and yields attention maps highlighting the regions most relevant for the classification task. These maps are then post-processed to create the pseudo-labels $\hat{M}^{*}$, as elaborated later. Finally, in Segmentation Network Training, we train a segmentation network S using the generated pseudo-labels $\hat{M}^{*}$. This network also uses the fine-tuned encoder E and a decoder D to produce the final segmentation mask S(x), as described later.
Stage 1: self-supervised feature encoding
Fine-tuning DINOv2 on unlabeled histopathology data
We begin with an off-the-shelf, pre-trained DINOv2 model, released by Meta, and fine-tune it using self-supervision on the large, unlabeled IMP-CRS-2024 dataset. This fine-tuning adapts the DINOv2 model, which is pre-trained on natural images, to the specific characteristics of H&E-stained histopathology images. We use the ViT-G/14 architecture as the backbone within the DINOv2 framework; ViT-G/14 denotes a Vision Transformer of “Giant” model size with a 14×14 patch size. Let $x$ be an image from this dataset. Fine-tuning follows the DINOv2 training framework with a student network $f_s$ and a teacher network $f_t$. The student receives different views of the input image, including global crops $x_g$ and local crops $x_l$. The teacher network is a momentum-updated copy of the student network and receives global crops only. The student network is trained to match the output of the teacher network for corresponding views, learning robust, domain-specific feature representations. The fine-tuning loss is computed as

$$\mathcal{L}_{\text{ft}} = \mathcal{L}_{\text{DINO}} + \lambda\, \mathcal{L}_{\text{iBOT}},$$

where $\mathcal{L}_{\text{DINO}}$ is the cross-entropy loss, which encourages local-to-global feature correspondence, $\mathcal{L}_{\text{iBOT}}$ is the iBOT loss, computed on a masked version $\hat{x}$ of $x$, which promotes invariance to masking, and $\lambda$ is a hyperparameter balancing the two losses. This process results in a feature encoder E, derived from the fine-tuned student network, which maps input images to domain-specific feature representations $E(x) \in \mathbb{R}^{h \times w \times c}$.
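To make the student-teacher procedure concrete, the sketch below outlines one fine-tuning step. It is a minimal illustration, not the authors' code: `student`, `teacher`, `dino_loss`, and `ibot_loss` are assumed stand-ins for the ViT-G/14 backbone with its projection heads and for the two objectives, and the momentum value is a typical default rather than a reported hyperparameter.

```python
import torch

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """Teacher weights track the student as an exponential moving average."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.data.mul_(momentum).add_(ps.data, alpha=1.0 - momentum)

def finetune_step(student, teacher, global_crops, local_crops, masked_crops,
                  dino_loss, ibot_loss, optimizer, lam=1.0):
    """One student-teacher update: L_ft = L_DINO + lam * L_iBOT."""
    with torch.no_grad():
        t_out = teacher(global_crops)            # teacher sees global crops only
    s_global = student(global_crops)             # student sees all views
    s_local = student(local_crops)
    s_masked = student(masked_crops)             # masked version of the input

    loss = dino_loss(s_global, s_local, t_out) + lam * ibot_loss(s_masked, t_out)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(student, teacher)                 # momentum-update the teacher
    return loss.item()
```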
Stage 2: pseudo-label generation using weak supervision
Classification network with attention mechanism
After fine-tuning the encoder E, we integrate it into a classification network C. The encoder outputs patch tokens $F(x) = \{f_1, \ldots, f_n\}$, $f_i \in \mathbb{R}^{d}$, where n is the number of patch tokens and d is the dimension of each token. Since we only have image-level labels, and some patch tokens may correspond to background or non-gland tissue, we weight the contribution of each patch token using an MLP+softmax attention block. This effectively implements a form of Multiple Instance Learning (MIL).
The attention mechanism works as follows. Each patch token $f_i$ from F(x) is passed through a Multi-Layer Perceptron (MLP) with one hidden layer, resulting in an attention score $a_i$:

$$a_i = \mathbf{w}_2^{\top} \tanh(\mathbf{W}_1 f_i + \mathbf{b}_1) + b_2,$$

where $\mathbf{W}_1 \in \mathbb{R}^{d_h \times d}$, $\mathbf{b}_1 \in \mathbb{R}^{d_h}$, $\mathbf{w}_2 \in \mathbb{R}^{d_h}$, and $b_2 \in \mathbb{R}$ are learnable parameters, and $d_h$ is the dimension of the hidden layer. The attention scores are then normalized using a softmax function to generate attention weights $\alpha_i$:

$$\alpha_i = \frac{\exp(a_i)}{\sum_{j=1}^{n} \exp(a_j)}.$$

The weighted patch tokens are then summed to generate an image feature vector $z$:

$$z = \sum_{i=1}^{n} \alpha_i f_i.$$

Finally, the image feature vector $z$ is fed into a linear classification layer to predict the image-level label y, and the predicted probability is denoted C(E(x)). The classification network is trained to minimize the cross-entropy loss between the prediction C(E(x)) and the ground-truth image-level label y:

$$\mathcal{L}_{\text{cls}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log C(E(x_i)) + (1 - y_i) \log\bigl(1 - C(E(x_i))\bigr) \right],$$

where N is the number of training images and $y_i$ is the ground-truth label for image $x_i$. The attention map A(x) is then derived from the attention weights $\alpha_i$ by upsampling the weights to the original image size.
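A compact PyTorch sketch of this attention-pooled head is given below. The module is illustrative rather than the authors' code; the tanh scorer follows the attention-based MIL formulation32, and the hidden size of 2048 matches the implementation details reported later.

```python
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    """Attention-pooled image classifier over encoder patch tokens."""

    def __init__(self, d: int, hidden: int = 2048):
        super().__init__()
        # One-hidden-layer MLP scoring each token: a_i = w2^T tanh(W1 f_i + b1) + b2
        self.score = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(d, 1)        # linear classifier on the pooled feature

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, n, d) patch tokens F(x) from the fine-tuned encoder E
        a = self.score(tokens).squeeze(-1)               # (B, n) attention scores a_i
        alpha = torch.softmax(a, dim=1)                  # (B, n) attention weights
        z = torch.einsum("bn,bnd->bd", alpha, tokens)    # weighted sum -> image feature z
        logit = self.classifier(z).squeeze(-1)           # image-level prediction
        return logit, alpha
```

Training minimizes binary cross-entropy (e.g. `nn.BCEWithLogitsLoss`) between `logit` and the image-level label y, and `alpha`, reshaped to the h × w patch grid, yields the attention map A(x).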
Attention-based pseudo-label derivation
The attention maps A(x), which have the same resolution as the feature maps (i.e., $h \times w$), are upsampled to the original input image resolution $H \times W$, denoted $A_{\uparrow}(x)$. Since $A_{\uparrow}(x)$ typically does not cover the entire gland region effectively (often yielding a shrunken version of the actual gland mask), we first overlay it with the original image x to create a blended image B(x). The blending is performed as

$$B(x) = \alpha\, A_{\uparrow}(x) + (1 - \alpha)\, x,$$

where $\alpha \in [0, 1]$ is a blending factor controlling the contribution of the attention map to the blended image. We then threshold this blended image to generate a binary mask $\hat{M}$, where $\tau$ is a threshold value and the threshold operation is element-wise:

$$\hat{M}(i) = \begin{cases} 1 & \text{if } B(x)(i) \geq \tau, \\ 0 & \text{otherwise.} \end{cases}$$

The resulting binary mask $\hat{M}$ typically has noisy boundaries.
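Concretely, the derivation reduces to a few tensor operations, as in this minimal sketch; min-max normalizing the attention map and blending against a grayscale copy of the image are our assumptions, while the defaults for α and τ follow the implementation details reported later.

```python
import torch
import torch.nn.functional as F

def derive_initial_mask(attn: torch.Tensor, image_gray: torch.Tensor,
                        alpha: float = 0.75, tau: float = 0.5) -> torch.Tensor:
    """Blend the upsampled attention map with the image and threshold it.

    attn:       (h, w) attention weights reshaped to the patch grid
    image_gray: (H, W) image in [0, 1]; using a grayscale copy is our assumption
    """
    H, W = image_gray.shape
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)   # normalize to [0, 1]
    attn_up = F.interpolate(attn[None, None], size=(H, W),
                            mode="bilinear", align_corners=False)[0, 0]
    blended = alpha * attn_up + (1.0 - alpha) * image_gray          # B(x)
    return (blended >= tau).float()                                 # element-wise threshold
```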
Attention map post-processing (thresholding and CRF)
To refine the pseudo-labels $\hat{M}$, we apply a Conditional Random Field (CRF). The CRF is a probabilistic graphical model that finds the most likely labeling of the image given the observed data. We use a fully connected CRF, which considers all pixel pairs in the image. The CRF energy function is defined as

$$E(\mathbf{m}) = \sum_i \psi_u(m_i) + \sum_{i < j} \psi_p(m_i, m_j),$$

where $\mathbf{m}$ represents the pseudo-label map, a binary segmentation, and $m_i$ is the label (0 or 1) for pixel i. The unary potential $\psi_u(m_i)$ represents the cost of assigning label $m_i$ to pixel i based on our initial pseudo-label prediction; it is derived from the value of $\hat{M}$ before the CRF. The pairwise potential $\psi_p(m_i, m_j)$ represents the cost of assigning labels $m_i$ and $m_j$ to pixels i and j, respectively, considering their spatial relationship. It is defined as

$$\psi_p(m_i, m_j) = \mu(m_i, m_j) \sum_{m=1}^{K} w_m\, k_m(\mathbf{f}_i, \mathbf{f}_j).$$

Here, $\mu(m_i, m_j) = 1$ if $m_i \neq m_j$ and 0 otherwise, penalizing label differences; K is the number of Gaussian kernels, $w_m$ is the weight for kernel m, and $k_m$ is a Gaussian kernel, defined as

$$k_m(\mathbf{f}_i, \mathbf{f}_j) = \exp\left( -\tfrac{1}{2} (\mathbf{f}_i - \mathbf{f}_j)^{\top} \Lambda_m^{-1} (\mathbf{f}_i - \mathbf{f}_j) \right),$$

where $\mathbf{f}_i$ and $\mathbf{f}_j$ are the feature vectors of pixels i and j (pixel coordinates and pixel intensities in our case), and $\Lambda_m$ is the covariance matrix. The CRF seeks the labeling that minimizes the total energy $E(\mathbf{m})$ through an iterative process that updates each label based on the current state of all other labels until convergence. Thus, the final refined pseudo-label is

$$\hat{M}^{*} = \arg\min_{\mathbf{m}} E(\mathbf{m}).$$
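Since the implementation details later name the pydensecrf library, a minimal refinement call might look like the following; the kernel parameters (sxy, srgb, compat) and the unary confidence gt_prob are illustrative defaults, not values reported in the paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_labels

def crf_refine(image_rgb: np.ndarray, mask: np.ndarray,
               n_iter: int = 5, gt_prob: float = 0.9) -> np.ndarray:
    """Dense-CRF refinement of a binary pseudo-mask via mean-field inference."""
    H, W = mask.shape
    d = dcrf.DenseCRF2D(W, H, 2)                                  # two labels: bg / gland
    # Unary potentials from the initial pseudo-label prediction.
    U = unary_from_labels(mask.astype(np.int32), 2,
                          gt_prob=gt_prob, zero_unsure=False)
    d.setUnaryEnergy(U)
    # Pairwise potentials: spatial smoothness plus appearance (position + RGB).
    d.addPairwiseGaussian(sxy=3, compat=3)
    d.addPairwiseBilateral(sxy=60, srgb=13,
                           rgbim=np.ascontiguousarray(image_rgb), compat=10)
    Q = d.inference(n_iter)                                       # iterative mean-field updates
    return np.argmax(Q, axis=0).reshape(H, W).astype(np.uint8)    # refined pseudo-label
```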
Stage 3: segmentation network training
Segmentation network architecture
The segmentation network S also uses the fine-tuned encoder E to extract features. The decoder D then upsamples these encoded features to predict the final segmentation mask $S(x) \in [0, 1]^{H \times W}$. The decoder is designed to be lightweight to minimize the risk of overfitting to the pseudo-labels.
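The paper specifies only that the decoder D is lightweight; the sketch below is one plausible realization, a small convolutional upsampling head whose channel widths and depth are our illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightDecoder(nn.Module):
    """Small convolutional upsampling head over the encoder features."""

    def __init__(self, c_in: int, c_mid: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 3, padding=1), nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(c_mid, c_mid // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid // 2, 1, 1),                 # per-pixel gland logit
        )

    def forward(self, feats: torch.Tensor, out_size) -> torch.Tensor:
        # feats: (B, c, h, w) features from the fine-tuned encoder E
        logits = self.net(feats)
        # Upsample logits to the full image resolution (H, W).
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)
```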
Boundary-aware loss function
To enhance the accuracy of gland boundary delineation, the segmentation network is trained using a boundary-aware loss function that combines the Dice loss with a distance transform-weighted boundary loss, emphasizing correct boundary prediction, which is essential for precise segmentation:

$$\mathcal{L}_{\text{seg}} = \mathcal{L}_{\text{Dice}} + \beta\, \mathcal{L}_{\text{boundary}},$$

where $\beta$ controls the relative importance of the boundary loss.
The Dice loss $\mathcal{L}_{\text{Dice}}$ measures the overlap between the predicted segmentation S(x) and the refined pseudo ground-truth segmentation $\hat{M}^{*}$:

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2 \sum_{i} S(x)_i\, \hat{M}^{*}_i}{\sum_{i} S(x)_i + \sum_{i} \hat{M}^{*}_i}.$$

The boundary loss $\mathcal{L}_{\text{boundary}}$ penalizes discrepancies between the predicted and refined pseudo ground-truth gland boundaries. It utilizes the distance transform of the refined pseudo ground-truth boundaries to emphasize the loss at gland boundaries:

$$\mathcal{L}_{\text{boundary}} = \frac{1}{|\Omega|} \sum_{i \in \Omega} \phi_{\hat{M}^{*}}(i)\, S(x)_i,$$

where $\phi_{\hat{M}^{*}}$ is the distance transform of the refined pseudo-label $\hat{M}^{*}$ and $\Omega$ is the set of image pixels.
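A minimal sketch of the composite loss follows, under a signed-distance reading of the boundary term in line with boundary-aware losses34; the sign convention (negative inside glands, positive outside) and SciPy's Euclidean distance transform are our assumptions rather than details given in the paper.

```python
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask: np.ndarray) -> np.ndarray:
    """Signed distance to the pseudo-label boundary: negative inside glands,
    positive outside (a common convention for boundary losses)."""
    inside = distance_transform_edt(mask)
    outside = distance_transform_edt(1 - mask)
    return outside - inside

def boundary_aware_loss(pred: torch.Tensor, pseudo_mask: np.ndarray,
                        beta: float) -> torch.Tensor:
    """L_seg = L_Dice + beta * L_boundary on a single (H, W) prediction."""
    target = torch.from_numpy(pseudo_mask).to(pred.device).float()
    inter = (pred * target).sum()
    dice = 1.0 - (2.0 * inter + 1e-6) / (pred.sum() + target.sum() + 1e-6)
    phi = torch.from_numpy(signed_distance_map(pseudo_mask)).to(pred.device).float()
    boundary = (phi * pred).mean()        # penalizes predicted mass outside the glands
    return dice + beta * boundary
```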
Results
Datasets and preprocessing
We evaluate the proposed method on two publicly available colorectal gland segmentation benchmarks with different characteristics and difficulty levels.
GlaS
The Gland Segmentation (GlaS) challenge dataset38 from MICCAI 2015 consists of 165 H&E-stained image tiles extracted from 16 whole-slide images of colorectal adenocarcinoma (stages T3/T4). It contains 85 training images (37 benign, 48 malignant) and 80 test images split into Test Part A (60 images: 37 benign, 23 malignant) and Test Part B (20 images: 4 benign, 16 malignant). Images are provided at 20× magnification with an average size of approximately 775×522 pixels. The dataset supplies both pixel-level gland instance annotations and image-level benign/malignant labels for the training set, enabling weakly supervised training using only the latter.
CRAG
The Colorectal Adenocarcinoma Gland (CRAG) dataset9 is a more challenging benchmark comprising 213 H&E-stained images cropped from 38 whole-slide images of colorectal cancer patients, acquired at 20× magnification using an Omnyx VL120 scanner. The images (average size 1512×1516 pixels) are deliberately selected to include complex malignant cases with poorly differentiated, irregularly shaped, and densely packed or overlapping glands. The official split provides 173 images for training and 40 for testing, with only pixel-level instance annotations released for evaluation (no official image-level labels). The CRAG dataset is widely regarded as significantly harder than GlaS due to greater morphological variability and gland adhesion.
IMP-CRS-2024
The IMP-CRS-2024 dataset35–37 comprises 5,333 digitized whole-slide images (WSIs) of colorectal biopsies and polypectomies, acquired from the IMP Diagnostics laboratory, Portugal. The WSIs were digitized at 40× magnification using two Leica GT450 scanners. Each case is classified as a non-neoplastic, low-grade, or high-grade lesion. This large unlabeled dataset serves as the basis for self-supervised fine-tuning.
Data preprocessing
To prepare the IMP-CRS-2024 dataset for self-supervised pretraining, we first performed tissue extraction on the thumbnail images to exclude background regions. We then extracted 256 µm × 256 µm tiles and resized them to a uniform resolution of 512×512 pixels. Tiles with less than 80% tissue coverage or exhibiting blurriness were discarded to ensure the quality of the training data, as sketched below.
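As an illustration, a tile quality filter of this kind can be written as follows; the saturation threshold and the Laplacian-variance blur heuristic are our assumptions, since the paper states only the 80% coverage criterion and that blurry tiles were discarded.

```python
import cv2
import numpy as np

def keep_tile(tile_rgb: np.ndarray, min_tissue: float = 0.80,
              blur_thresh: float = 100.0) -> bool:
    """Return True if a tile passes the tissue-coverage and sharpness checks."""
    hsv = cv2.cvtColor(tile_rgb, cv2.COLOR_RGB2HSV)
    tissue_frac = float((hsv[..., 1] > 30).mean())        # saturated pixels ~ tissue
    gray = cv2.cvtColor(tile_rgb, cv2.COLOR_RGB2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()     # low variance ~ blurry
    return tissue_frac >= min_tissue and sharpness >= blur_thresh
```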
For both GlaS and CRAG in Stages 2 and 3, we crop the original images into non-overlapping 512×512 tiles. Tiles with a gland area ratio below 20% are discarded to reduce near-background samples that could destabilize weakly supervised training. Color normalization using Macenko’s method39 is applied across datasets to mitigate staining variation. Standard augmentations (random horizontal/vertical flips, rotation, color jitter) are used during all training stages.
Implementation details
Training parameters
We conducted a three-stage training process, each with its specific parameters.
For DINOv2 Fine-tuning, the off-the-shelf DINOv2 model with a ViT-G/14 backbone was fine-tuned on the IMP-CRS-2024 dataset for 100 epochs using a batch size of 16. We employed the AdamW optimizer with an initial learning rate of 5e-5 and a weight decay of 1e-5. The learning rate was scheduled using a cosine annealing strategy. We used DINOv2’s data augmentations, including random resized cropping, horizontal flipping, color jittering, Gaussian blurring, and solarization. The loss was a combination of the cross-entropy and iBOT losses as detailed previously, with $\lambda$ set to 1.0.
For Classification Network Training, the classification network was trained for 50 epochs on the GlaS dataset, using a batch size of 16. We used the AdamW optimizer with an initial learning rate of 1e-4 and a weight decay of 1e-5. The learning rate was scheduled using a cosine annealing strategy. The MLP in the attention mechanism had a hidden layer dimension of 2048. Standard data augmentations were applied, including random resized cropping, horizontal flipping, and color jittering. The attention map threshold $\tau$ was set to 0.5 and the blending factor $\alpha$ to 0.75, based on a grid search.
For Segmentation Network Training, we trained the segmentation model for 50 epochs on the GlaS dataset, using a batch size of 8. We used the AdamW optimizer with an initial learning rate of 1e-4 and a weight decay of 1e-5. The learning rate was scheduled using a cosine annealing strategy. For the combined loss function, we empirically determined the boundary loss weight $\beta$ via a grid search with step size 0.1 over the range [0.0, 0.5], balancing the contributions of the Dice Loss and Boundary Loss. Data augmentations included random resized cropping, horizontal flipping, and color jittering.
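For reference, the stated optimization settings map directly onto standard PyTorch components; this sketch covers Stage 3, with `model` standing in for the encoder-decoder segmentation network.

```python
import torch

def make_stage3_optimizer(model: torch.nn.Module, epochs: int = 50):
    """AdamW (lr 1e-4, weight decay 1e-5) with cosine annealing, per the text."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```

Calling `scheduler.step()` once per epoch anneals the learning rate over the 50 training epochs; Stages 1 and 2 differ only in learning rate, batch size, and epoch count.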
Hardware and software setup
We implemented our model using PyTorch version 1.12.040 with CUDA version 11.7. All experiments were conducted on a single NVIDIA RTX 4090 GPU with 24 GB of memory. The code was written in Python 3.8. For the Conditional Random Field (CRF) we used the pydensecrf library, version 1.0.0.
Evaluation metrics
We evaluated segmentation performance using the following metrics: F1 score, object Dice (obj-Dice), and object Hausdorff distance (obj-HD), calculated using the official GlaS challenge evaluation scripts.
The F1 score is used to measure the detection of individual glandular objects. A segmented glandular object that intersects with at least 50% of its ground-truth object is counted as a true positive (TP); otherwise it is counted as a false positive (FP). The number of false negatives (FN) is the difference between the number of ground-truth objects and the number of true positives. Given these definitions, the F1 score is defined by

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$

where Precision = TP/(TP + FP) and Recall = TP/(TP + FN), with TP, FP, and FN counted over all images in the dataset.
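The matching rule translates into a short greedy procedure, sketched below under the assumption that instance masks are available as boolean arrays; the official GlaS evaluation scripts remain the authoritative implementation.

```python
import numpy as np

def object_f1(pred_objects, gt_objects):
    """Object-level F1 under the GlaS matching rule.

    pred_objects / gt_objects: lists of boolean (H, W) instance masks.
    A segmented object counts as a true positive if it covers at least 50%
    of some (still unmatched) ground-truth object.
    """
    matched, tp = set(), 0
    for pm in pred_objects:
        for j, gm in enumerate(gt_objects):
            if j not in matched and (pm & gm).sum() >= 0.5 * gm.sum():
                tp += 1
                matched.add(j)
                break
    fp = len(pred_objects) - tp
    fn = len(gt_objects) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```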
For gland segmentation accuracy, we employ obj-Dice and obj-HD between the ground truth Y and the segmented objects X. The obj-Dice index is defined as

$$\text{obj-Dice}(X, Y) = \frac{1}{2}\left[\sum_{i=1}^{n_X} \omega_i\,\text{Dice}(\tilde{Y}_i, X_i) + \sum_{j=1}^{n_Y} \tilde{\omega}_j\,\text{Dice}(Y_j, \tilde{X}_j)\right],$$

where $\text{Dice}(A, B) = \frac{2|A \cap B|}{|A| + |B|}$, and $\omega_i = |X_i| / \sum_{p=1}^{n_X} |X_p|$, $\tilde{\omega}_j = |Y_j| / \sum_{q=1}^{n_Y} |Y_q|$. Here $X_i$ denotes the ith segmented object, $\tilde{Y}_i$ denotes the ground-truth object that maximally overlaps $X_i$, $Y_j$ denotes the jth ground-truth object, $\tilde{X}_j$ denotes the segmented object that maximally overlaps $Y_j$, and $n_X$ and $n_Y$ denote the total numbers of segmented and ground-truth objects, respectively. Similarly, obj-HD is defined as

$$\text{obj-HD}(X, Y) = \frac{1}{2}\left[\sum_{i=1}^{n_X} \omega_i\,H(\tilde{Y}_i, X_i) + \sum_{j=1}^{n_Y} \tilde{\omega}_j\,H(Y_j, \tilde{X}_j)\right],$$

where $H(A, B) = \max\left\{\sup_{a \in A}\inf_{b \in B} \|a - b\|, \sup_{b \in B}\inf_{a \in A} \|a - b\|\right\}$ is the Hausdorff distance and the weighting coefficients $\omega_i$, $\tilde{\omega}_j$ are defined as above.
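The two object-level metrics can likewise be sketched in NumPy/SciPy; this mirrors the definitions above but is illustrative only (it uses all object pixels as a proxy for the boundary point sets, and the reported numbers come from the official challenge scripts).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def _dice(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-8)

def _hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    pa, pb = np.argwhere(a), np.argwhere(b)   # object pixels as boundary proxy
    return max(directed_hausdorff(pa, pb)[0], directed_hausdorff(pb, pa)[0])

def _one_sided(objects, others, metric):
    """Area-weighted average of metric(X_i, maximally overlapping object in others)."""
    total = sum(o.sum() for o in objects)
    score = 0.0
    for x in objects:
        y = max(others, key=lambda g: np.logical_and(x, g).sum())
        score += (x.sum() / total) * metric(x, y)
    return score

def obj_dice(pred_objects, gt_objects) -> float:
    return 0.5 * (_one_sided(pred_objects, gt_objects, _dice)
                  + _one_sided(gt_objects, pred_objects, _dice))

def obj_hd(pred_objects, gt_objects) -> float:
    return 0.5 * (_one_sided(pred_objects, gt_objects, _hausdorff)
                  + _one_sided(gt_objects, pred_objects, _hausdorff))
```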
Ablation study
Impact of pseudo-label generation
To assess the impact of our proposed pseudo-label generation method, we conducted an ablation study by evaluating the quality of the generated pseudo-masks against the ground truth (GT) masks on the GlaS dataset. We computed the Intersection over Union (IoU), Dice score, Precision, Recall, and Hausdorff Distance (HD) between the generated pseudo-masks and the corresponding ground-truth masks for several configurations. These included using patch features from the off-the-shelf DINOv2 ViT-G/14 model without fine-tuning to derive attention maps for pseudo-labels (with blending and thresholding); the same off-the-shelf DINOv2 approach but with full post-processing including CRF; using patch features from our fine-tuned DINOv2 model with only blending and thresholding for pseudo-labels; and finally, our full proposed method using the fine-tuned DINOv2 with comprehensive post-processing (blending, thresholding, and CRF).
The quantitative results of the pseudo-label generation ablation study are presented in Table 1 and clearly demonstrate the effectiveness of our proposed method. The “Off-the-shelf DINOv2” configuration, without fine-tuning on histopathology data, shows the worst performance, highlighting the need for domain-specific feature adaptation. Fine-tuning the DINOv2 model on the unlabeled IMP-CRS-2024 dataset (“Fine-tuned DINOv2”) increases performance substantially in terms of IoU, Dice, Precision, and Recall. The fine-tuned model benefits further from CRF post-processing (“Fine-tuned DINOv2 + Post-processing”), which improves performance again and reduces the Hausdorff Distance, confirming the value of the blending, thresholding, and CRF steps. The best results are obtained by combining fine-tuned DINOv2 features with the full post-processing, which is superior on every metric. Together, these results demonstrate the benefits of both self-supervised fine-tuning and attention-map post-processing for generating high-quality pseudo-labels.
To further illustrate the impact of fine-tuning and post-processing on the quality of the pseudo-labels, we present a qualitative analysis in Figs. 4 and 5.
Figure 4 visualizes the feature maps from the off-the-shelf and fine-tuned DINOv2 models. The first column shows the original H&E image tiles. The second column shows the feature maps obtained using the off-the-shelf DINOv2 model. These maps show a more diffused activation pattern, with less clear boundaries between glands and background. In contrast, the third column shows the feature maps from the fine-tuned DINOv2 model. The fine-tuned feature maps show stronger and more concentrated activation within the gland regions, demonstrating that the fine-tuning process helps the model learn more domain-specific features.
Figure 5 provides a visual comparison of the pseudo-masks generated by different methods. The first column shows the original H&E images. The second and third columns visualize the results of the off-the-shelf DINOv2 without and with post-processing, respectively. The fourth and fifth columns display the pseudo-masks generated by the fine-tuned DINOv2 without and with post-processing, respectively. The last column shows the corresponding ground truth (GT) masks. The pseudo-masks generated by the off-the-shelf DINOv2 model (second and third columns) show noisier, more diffuse regions and miss some gland areas. The fine-tuned DINOv2 model with blending and thresholding (fourth column) produces significantly better segmentation results. The addition of CRF post-processing (fifth column) further refines the mask boundaries and generates high-quality pseudo-masks that align well visually with the ground truth. These qualitative results confirm the quantitative findings and highlight the effectiveness of our approach for generating high-quality pseudo-labels.
Impact of boundary-aware loss
To evaluate the impact of the boundary-aware loss function on segmentation performance, we conducted an ablation study by training two weakly-supervised segmentation models on the GlaS dataset. The first model was trained without the boundary loss component (i.e., only using the Dice loss), and the second model was trained with the boundary-aware loss function, as described previously. We evaluated both models on two GlaS test sets (testA and testB). The performance was measured using the F1-score, Dice score, and Hausdorff Distance.
Table 2 shows the results of this ablation study. The model trained with the boundary-aware loss function (Ours) achieves a significantly higher F1-score and Dice score on both test sets compared to the model trained without it. Furthermore, the Hausdorff Distance, which measures the distance between the boundaries of the predicted segmentation and the ground truth, is significantly lower for our model. This demonstrates that the boundary-aware loss helps the network learn more accurate segmentations, especially at the boundaries of the glands. These results confirm the effectiveness of the boundary-aware loss in improving the overall segmentation quality, both quantitatively and in terms of boundary delineation.
Comparison with State-of-the-Art (SOTA)
We compare our method with both fully supervised and weakly supervised state-of-the-art approaches on the GlaS dataset (Test A and Test B) and the more challenging CRAG dataset. All weakly supervised baselines (CAMEL19, SC-CAM20, OEEM21, EP-SAM22) are evaluated under the identical image-level supervision setting. Results are reported as mean ± standard deviation over five independent runs with different random seeds.
As shown in Table 3, the fully supervised methods generally exhibit strong performance on the GlaS dataset, benefiting from direct pixel-level guidance during training. For instance, the Diffusion Model11 achieves the highest Object Dice on Test A due to its ability to model complex boundary distributions through iterative denoising, while VENet10 leverages variational energy minimization for robust boundary handling, resulting in low Object HD. Classic baselines like U-Net6 and U-Node41 provide solid but less optimized results.
Among weakly supervised methods, earlier approaches such as CAMEL19 and SC-CAM20 suffer from coarse pseudo-labels derived from ensemble or sub-category CAMs, leading to lower F1-scores and poorer boundary delineation. OEEM21 improves upon this through iterative mining of easy examples, yielding better overlap metrics but still elevated HD due to limited boundary refinement. EP-SAM22 further advances performance by incorporating SAM prompting, achieving competitive F1-scores, though it remains suboptimal on boundaries in malignant subsets. Our method sets a new benchmark for weakly supervised segmentation, with superior F1-scores and the lowest Object HD among all weakly supervised approaches, approaching fully supervised levels. This is attributable to the domain-adapted DINOv2 features, attention-based localization, and boundary-aware training, which collectively enhance pseudo-label quality and model robustness.
On the considerably more difficult CRAG dataset (Table 4), which features a higher proportion of poorly differentiated, irregularly shaped, and adherent glands, performance across all methods declines, reflecting the greater morphological complexity and domain shift from GlaS. Fully supervised methods maintain relative advantages: VENet excels with high Object Dice and low HD, leveraging its energy-based boundary optimization, while the Diffusion Model shows moderate degradation. U-Net and U-Node exhibit larger drops in F1-score and higher HD, highlighting limitations in handling extreme irregularity without specialized architectures.
Weakly supervised baselines struggle more pronouncedly on CRAG: CAMEL and SC-CAM yield the lowest scores, as their CAM-derived pseudo-labels fail to capture fragmented or fused glands, leading to poor instance separation and elevated HD. OEEM and EP-SAM perform better, with EP-SAM’s SAM integration aiding in boundary refinement, but both remain inferior to our approach. Our method demonstrates the strongest generalization among weakly supervised approaches, achieving the highest F1-score and Object Dice with the lowest HD.
Overall, these comparisons underscore the practical advantages of our framework: it not only outperforms prior weakly supervised methods but also narrows the annotation-efficiency gap to fully supervised ones, making it suitable for clinical deployment where labeled data is scarce.
Figure 6 displays the segmentation results of our method on a few examples. The first column shows the original H&E images and the second column the corresponding ground truth masks. Columns (c) and (d) show the masks obtained by VENet and EP-SAM, respectively. The segmentation masks produced by our proposed method are shown in the last column.
These results show that our method accurately delineates gland boundaries, closely aligning with the ground truth masks and further demonstrating the effectiveness of our approach for high quality gland segmentation.
Datasets and preprocessing
We evaluate the proposed method on two publicly available colorectal gland segmentation benchmarks with different characteristics and difficulty levels.
GlaS
The Gland Segmentation (GlaS) challenge dataset38 from MICCAI 2015 consists of 165 H&E-stained image tiles extracted from 16 whole-slide images of colorectal adenocarcinoma (stages T3/T4). It contains 85 training images (37 benign, 48 malignant) and 80 test images split into Test Part A (60 images: 37 benign, 23 malignant) and Test Part B (20 images: 4 benign, 16 malignant). Images are provided at 20 magnification with an average size of approximately 775522 pixels. The dataset supplies both pixel-level gland instance annotations and image-level benign/malignant labels for the training set, enabling weakly supervised training using only the latter.
CRAG
The Colorectal Adenocarcinoma Gland (CRAG) dataset9 is a more challenging benchmark comprising 213 H&E-stained images cropped from 38 whole-slide images of colorectal cancer patients, acquired at 20 magnification using an Omnyx VL120 scanner. The images (average size 15121516 pixels) are deliberately selected to include complex malignant cases with poorly differentiated, irregularly shaped, and densely packed or overlapping glands. The official split provides 173 images for training and 40 for testing, with only pixel-level instance annotations released for evaluation (no official image-level labels). The CRAG dataset is widely regarded as significantly harder than GlaS due to greater morphological variability and gland adhesion.
IMP-CRS-2024
The IMP-CRS-2024 dataset35–37 comprises 5,333 digitized whole slide images (WSI) of colorectal biopsies and polypectomies, acquired from the IMP Diagnostics laboratory, Portugal. The WSIs were digitized at a magnification of 40x using two Leica GT450 scanners. Each case is classified as nonneoplastic, low-grade, or high-grade lesion. This large unlabeled dataset serves as the basis for self-supervised fine-tuning.
Data preprocessing
To prepare the IMP-CRS-2024 dataset for self-supervised pretraining, we first performed tissue extraction on the thumbnail images to exclude background regions. We then extracted 256m 256m tiles and resized them to a uniform resolution of 512x512 pixels. Tiles with less than 80% tissue coverage or exhibiting blurriness were discarded to ensure the quality of training data.
For both GlaS and CRAG in Stages 2 and 3, we crop the original images into non-overlapping 512512 tiles. Tiles with gland area ratio below 20% are discarded to reduce near-background samples that could destabilize weakly supervised training. Color normalization using Macenko’s method39 is applied across datasets to mitigate staining variation. Standard augmentations (random horizontal/vertical flips, rotation, color jitter) are used during all training stages.
Implementation details
Training parameters
We conducted a three-stage training process, each with its specific parameters.
For DINOv2 Fine-tuning, the off-the-shelf DINOv2 model with a ViT-G/14 backbone was fine-tuned on the IMP-CRS-2024 dataset for 100 epochs using a batch size of 16. We employed the AdamW optimizer with an initial learning rate of 5e-5 and a weight decay of 1e-5. The learning rate was scheduled using a cosine annealing strategy. We used DINOv2’s data augmentations, including random resized cropping, horizontal flipping, color jittering, Gaussian blurring, and solarization. The loss was a combination of the cross-entropy and iBOT losses as detailed previously, with set to 1.0.
For Classification Network Training, the classification network was trained for 50 epochs on the GlaS dataset, using a batch size of 16. We used the AdamW optimizer with an initial learning rate of 1e-4 and a weight decay of 1e-5. The learning rate was scheduled using a cosine annealing strategy. The MLP in the attention mechanism had a hidden layer dimension of 2048. Standard data augmentations were applied, including random resized cropping, horizontal flipping, and color jittering. The attention map threshold was set to 0.5 and the blending factor was set to 0.75 based on a grid search.
For Segmentation Network Training, we trained the segmentation model for 50 epochs on the GlaS dataset, using a batch size of 8. We used the AdamW optimizer with an initial learning rate of 1e-4 and a weight decay of 1e-5. The learning rate was scheduled using a cosine annealing strategy. For the combined loss function, we empirically determined a value of for the boundary loss weight via a grid search with step size 0.1 over the range [0.0, 0.5], balancing the contributions of the Dice Loss and Boundary Loss. Data augmentations included random resized cropping, horizontal flipping, and color jittering.
Hardware and software setup
We implemented our model using PyTorch version 1.12.040 with CUDA version 11.7. All experiments were conducted on a single NVIDIA RTX 4090 GPU with 24GB of memory. The code was written in Python 3.8. The Conditional Random Field (CRF) implementation used was the ‘pydensecrf‘ library version 1.0.0.
Evaluation metrics
We evaluated segmentation performance using the following metrics: F1 score, object Dice (obj-Dice), and object Hausdorff distance (obj-HD), calculated using the official GlaS challenge evaluation scripts.
The F1 score is used to measure the detection precision of individual glandular objects. A segmented glandular object that intersects with at least 50% of its ground truth object is counted as true positive (TP), otherwise it is counted as false positive (FP). The number of false negatives (FN) is calculated as the difference between the number of ground truth objects and the number of true positives. Given these definitions, the F1 score is defined by:where Precision = TP/(TP + FP) and Recall = TP/(TP + FN), and TP, FP, and FN denote respectively the number of true positives, false positives, and false negatives from all images in the dataset.
For gland segmentation accuracy, we employ obj-Dice and obj-HD between ground truth Y and segmented object X. The obj-Dice index is defined as follows:where Dice = , and , . denotes the ith segmented object, denotes a ground truth object that maximally overlaps , denotes the jth ground object, denotes a segmented object that maximally overlaps . and denote the total number of X and Y, respectively. Similarly, obj-HD is defined as follows:where HD. The weighted coefficients , are defined as above.
Ablation study
Impact of pseudo-label generation
To assess the impact of our proposed pseudo-label generation method, we conducted an ablation study by evaluating the quality of the generated pseudo-masks against the ground truth (GT) masks on the GlaS dataset. We computed the Intersection over Union (IoU), Dice score, Precision, Recall, and Hausdorff Distance (HD) between the generated pseudo-masks and the corresponding ground-truth masks for several configurations. These included using patch features from the off-the-shelf DINOv2 ViT-G/14 model without fine-tuning to derive attention maps for pseudo-labels (with blending and thresholding); the same off-the-shelf DINOv2 approach but with full post-processing including CRF; using patch features from our fine-tuned DINOv2 model with only blending and thresholding for pseudo-labels; and finally, our full proposed method using the fine-tuned DINOv2 with comprehensive post-processing (blending, thresholding, and CRF).
The quantitative results of the pseudo-label generation ablation study are presented in Table 1, which clearly demonstrate the effectiveness of our proposed method. The “Off-the-shelf DINOv2” method, without fine-tuning on histopathology data, shows the worst performance, highlighting the need for domain-specific feature adaptation. By fine-tuning the DINOv2 model on the unlabeled IMP-CRS-2024 dataset, the performance substantially increases by a significant margin (“Fine-tuned DINOv2”), in terms of IoU, Dice, Precision and Recall. The fine-tuned DINOv2 model benefits from post-processing with CRF (“Fine-tuned DINOv2 + Post-processing”), which further improves the performance and reduces the Hausdorff Distance. This confirms the effectiveness of post-processing steps including blending, thresholding, and CRF. The best result is achieved by the approach of combining fine-tuned DINOv2 features with post-processing showing superior performance in all metrics. These results demonstrate the benefits of both self-supervised fine-tuning and attention map post-processing in generating high-quality pseudo-labels.
To further illustrate the impact of fine-tuning and post-processing on the quality of the pseudo-labels, we present a qualitative analysis in Figs. 4 and 5.
Figure 4 visualizes the feature maps from the off-the-shelf and fine-tuned DINOv2 models. The first column shows the original H&E image tiles. The second column shows the feature maps obtained using the off-the-shelf DINOv2 model. These maps show a more diffused activation pattern, with less clear boundaries between glands and background. In contrast, the third column shows the feature maps from the fine-tuned DINOv2 model. The fine-tuned feature maps show stronger and more concentrated activation within the gland regions, demonstrating that the fine-tuning process helps the model learn more domain-specific features.
Figure 5 provides a visual comparison of the pseudo-masks generated by different methods. The first column shows the original H&E images. The second and the third columns visualize the results of the off-the-shelf DINOv2 with and without post processing respectively. The forth and fifth columns displays the pseudo-masks generated by the fine-tuned DINOv2 with and without post-processing, respectively. The last column shows the corresponding ground truth (GT) masks. It can be observed that the pseudo-masks generated by the off-the-shelf DINOv2 model (the second and third columns) shows more noisy and diffused regions and miss some gland regions. The fine-tuned DINOv2 model, with blending and thresholding (the forth column) produces significantly better segmentation results. The addition of CRF post-processing (the fifth column) further refines the mask boundaries and generates high-quality pseudo-masks which visually aligns well with the ground truth masks. These qualitative results visually confirm the quantitative findings and highlight the effectiveness of our proposed approach for generating high-quality pseudo labels.
Impact of boundary-aware loss
To evaluate the impact of the boundary-aware loss function on segmentation performance, we conducted an ablation study by training two weakly supervised segmentation models on the GlaS dataset. The first model was trained without the boundary loss component (i.e., using only the Dice loss), and the second was trained with the boundary-aware loss function, as described previously. We evaluated both models on the two GlaS test sets (Test A and Test B), measuring performance with the F1-score, Dice score, and Hausdorff Distance.
Table 2 shows the results of this ablation study. The model trained with the boundary-aware loss function (Ours) achieves a significantly higher F1-score and Dice score on both test sets compared to the model trained without it. Furthermore, the Hausdorff Distance, which measures the distance between the boundaries of the predicted segmentation and the ground truth, is significantly lower for our model. This demonstrates that the boundary-aware loss helps the network learn more accurate segmentations, especially at the boundaries of the glands. These results confirm the effectiveness of the boundary-aware loss in improving the overall segmentation quality, both quantitatively and in terms of boundary delineation.
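A minimal PyTorch sketch of such a combined objective appears below. It pairs a region Dice term with a boundary Dice term computed on morphological-gradient edge maps; the paper's exact boundary-aware formulation, the kernel size, and the weight `lam` are assumptions here.

```python
# Sketch of a boundary-aware objective: region Dice + boundary Dice, where
# boundaries come from a morphological gradient (dilation minus erosion).
# The weighting `lam` and kernel size are illustrative assumptions.
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    inter = (pred * target).sum(dim=(-2, -1))
    denom = pred.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return 1.0 - (2 * inter + eps) / (denom + eps)

def soft_boundary(mask: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Morphological gradient via max-pooling; works on soft probabilities."""
    pad = k // 2
    dilated = F.max_pool2d(mask, k, stride=1, padding=pad)
    eroded = -F.max_pool2d(-mask, k, stride=1, padding=pad)
    return (dilated - eroded).clamp(0, 1)

def boundary_aware_loss(pred, target, lam: float = 0.5):
    """pred, target: (B, 1, H, W); pred holds foreground probabilities."""
    region = dice_loss(pred, target)
    edge = dice_loss(soft_boundary(pred), soft_boundary(target))
    return (region + lam * edge).mean()
```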
Comparison with State-of-the-Art (SOTA)
We compare our method with both fully supervised and weakly supervised state-of-the-art approaches on the GlaS dataset (Test A and Test B) and the more challenging CRAG dataset. All weakly supervised baselines (CAMEL19, SC-CAM20, OEEM21, EP-SAM22) are evaluated under the identical image-level supervision setting. Results are reported as mean ± standard deviation over five independent runs with different random seeds.
As shown in Table 3, the fully supervised methods generally exhibit strong performance on the GlaS dataset, benefiting from direct pixel-level guidance during training. For instance, the Diffusion Model11 achieves the highest Object Dice on Test A due to its ability to model complex boundary distributions through iterative denoising, while VENet10 leverages variational energy minimization for robust boundary handling, resulting in low Object HD. Classic baselines like U-Net6 and U-Node41 provide solid but less optimized results.
Among weakly supervised methods, earlier approaches such as CAMEL19 and SC-CAM20 suffer from coarse pseudo-labels derived from ensemble or sub-category CAMs, leading to lower F1-scores and poorer boundary delineation. OEEM21 improves upon this through iterative mining of easy examples, yielding better overlap metrics but still elevated HD due to limited boundary refinement. EP-SAM22 further advances performance by incorporating SAM prompting, achieving competitive F1-scores, though it remains suboptimal on boundaries in malignant subsets. Our method sets a new benchmark for weakly supervised segmentation, with superior F1-scores and the lowest Object HD among all weakly supervised approaches, approaching fully supervised levels. This is attributable to the domain-adapted DINOv2 features, attention-based localization, and boundary-aware training, which collectively enhance pseudo-label quality and model robustness.
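For readers unfamiliar with the object-level metrics reported in Tables 3 and 4, the sketch below implements an area-weighted, best-overlap object Dice in the style popularized by the GlaS challenge; the matching details of the paper's own evaluation script may differ.

```python
# GlaS-challenge-style object Dice (illustrative; evaluation scripts differ
# in matching details). Glands are taken as connected components.
import numpy as np
from scipy import ndimage

def _dice(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + eps)

def _one_side(src_lab, n_src, dst_lab) -> float:
    """Area-weighted Dice of each source object against its best match."""
    total = (src_lab > 0).sum()
    if total == 0:
        return 0.0
    score = 0.0
    for i in range(1, n_src + 1):
        obj = src_lab == i
        overlap = np.bincount(dst_lab[obj])
        overlap[0] = 0  # background is never a valid match
        if overlap.max() > 0:
            score += (obj.sum() / total) * _dice(obj, dst_lab == overlap.argmax())
        # objects with no overlapping counterpart contribute zero
    return score

def object_dice(pred: np.ndarray, gt: np.ndarray) -> float:
    p_lab, n_p = ndimage.label(pred)
    g_lab, n_g = ndimage.label(gt)
    return 0.5 * (_one_side(p_lab, n_p, g_lab) + _one_side(g_lab, n_g, p_lab))
```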
On the considerably more difficult CRAG dataset (Table 4), which features a higher proportion of poorly differentiated, irregularly shaped, and adherent glands, performance across all methods declines, reflecting the greater morphological complexity and domain shift from GlaS. Fully supervised methods maintain relative advantages: VENet excels with high Object Dice and low HD, leveraging its energy-based boundary optimization, while the Diffusion Model shows moderate degradation. U-Net and U-Node exhibit larger drops in F1-score and higher HD, highlighting limitations in handling extreme irregularity without specialized architectures.
Weakly supervised baselines struggle more noticeably on CRAG: CAMEL and SC-CAM yield the lowest scores, as their CAM-derived pseudo-labels fail to capture fragmented or fused glands, leading to poor instance separation and elevated HD. OEEM and EP-SAM perform better, with EP-SAM’s SAM integration aiding boundary refinement, but both remain inferior to our approach. Our method demonstrates the strongest generalization among weakly supervised approaches, achieving the highest F1-score and Object Dice with the lowest HD.
Overall, these comparisons underscore the practical advantages of our framework: it not only outperforms prior weakly supervised methods but also narrows the annotation-efficiency gap to fully supervised ones, making it suitable for clinical deployment where labeled data is scarce.
Figure 6 displays segmentation results of our method on a few examples. The first column shows the original H&E images and the second column the corresponding ground truth masks. Columns (c) and (d) show the masks obtained by VENet and EP-SAM, respectively, and the masks produced by our proposed method are shown in the last column.
These results show that our method accurately delineates gland boundaries, closely aligning with the ground truth masks and further demonstrating the effectiveness of our approach for high-quality gland segmentation.
Discussion
This study introduced a novel three-stage framework designed to address the significant challenge of limited pixel-level annotated data for gland segmentation in colorectal cancer histopathology images. Our approach successfully integrates self-supervised fine-tuning of a DINOv2 vision transformer, an innovative attention-based pseudo-label generation pipeline, and a boundary-aware loss function for training the final segmentation network. The synergistic effect of these components allows our method to effectively leverage readily available image-level labels, thereby circumventing the need for laborious and expensive pixel-wise annotations.
The core strength of our framework lies in its multi-faceted strategy. The initial self-supervised fine-tuning of DINOv2 on a large, unlabeled histopathology dataset (IMP-CRS-2024) empowers the encoder to learn domain-specific visual features that are highly relevant to glandular structures, a crucial step as demonstrated by our ablation studies where fine-tuned DINOv2 significantly outperformed its off-the-shelf counterpart. We chose DINOv2 specifically due to its proven effectiveness in pathology foundation models, such as UNI and Virchow, which are built on similar distillation frameworks, enabling superior patch-level representations for histopathology tasks. Subsequently, the attention mechanism within the classification network plays a pivotal role in localizing relevant gland regions from image-level cues, forming the basis for pseudo-label generation. The comprehensive post-processing of these attention maps, including blending, thresholding, and CRF refinement, proved essential in converting coarse attention signals into high-quality pseudo-segmentation masks, closely approximating ground truth annotations. Finally, training the segmentation network with these refined pseudo-labels, particularly with the inclusion of a boundary-aware loss, ensured precise delineation of gland boundaries, a critical aspect for accurate morphological assessment in pathology.
Our experimental results on the GlaS dataset are highly encouraging. The proposed method not only demonstrates superior performance compared to existing weakly supervised techniques but also achieves results comparable to, and in some metrics exceeding, several fully supervised methods. Furthermore, evaluation on the independent CRAG dataset, which includes more malignant and morphologically irregular glands, confirms the method’s strong generalization, with our framework outperforming all weakly supervised baselines and narrowing the gap to fully supervised approaches despite the increased difficulty. This is a significant finding, as it suggests that high-accuracy segmentation models can be developed with substantially reduced annotation effort, making advanced computational pathology tools more accessible. The improved F1-scores and particularly the lower Object Hausdorff distances highlight our method’s capability in accurately identifying individual glands and precisely capturing their complex boundaries. This precision is vital for downstream clinical applications, such as quantitative analysis of gland morphology for cancer grading and prognosis.
Despite the promising results, this study has certain limitations. Firstly, while the GlaS and CRAG datasets serve as standard benchmarks, further validation on larger and more diverse multi-center datasets, encompassing a wider range of staining variations and tissue appearances, is necessary to fully establish the generalizability of our approach. Secondly, the quality of pseudo-labels, although significantly improved by our pipeline, inherently remains an approximation of true pixel-level annotations. Any residual inaccuracies in these pseudo-labels could potentially propagate to the final segmentation model, although the boundary-aware loss aims to mitigate some of this. Thirdly, the three-stage nature of the framework, while effective, involves sequential training steps which might be computationally more intensive than end-to-end solutions. The initial self-supervised fine-tuning of DINOv2, in particular, requires significant computational resources, which might be a barrier for laboratories with limited infrastructure. Lastly, the performance of the self-supervised fine-tuning stage is also dependent on the size and diversity of the unlabeled dataset used; while IMP-CRS-2024 is substantial, performance might vary with different unlabeled sources. Additionally, although DINOv2 was selected for its established success in pathology, newer variants like DINOv3 could potentially offer further improvements but were not available during development.
Future work will aim to address these limitations and build upon the current findings. We plan to extend the evaluation of our framework to other challenging histopathology image analysis tasks, such as cell nucleus segmentation or tumor microenvironment characterization, and across different cancer types to assess its broader applicability. Investigating more advanced pseudo-label refinement techniques, potentially incorporating uncertainty estimation or iterative self-training strategies, could further enhance the quality of the supervision signal. There is also scope for exploring methods to streamline the multi-stage pipeline, possibly moving towards an end-to-end weakly supervised framework that retains the benefits of each component while reducing complexity. Furthermore, integrating other forms of weak supervision, such as bounding boxes or scribbles, if available, could offer a flexible trade-off between annotation effort and performance. We also intend to incorporate newer self-supervised models like DINOv3 or MAE variants to potentially boost feature robustness. Finally, prospective studies evaluating the clinical utility of segmentations generated by our method in real-world diagnostic workflows would be invaluable.
In conclusion, our proposed three-stage weakly supervised learning framework presents a significant step towards alleviating the annotation bottleneck in computational pathology. By effectively combining self-supervised learning, attention-based pseudo-labeling, and boundary-sensitive training, we have demonstrated that high-accuracy gland segmentation can be achieved using only image-level labels. This work paves the way for the development of more robust, accurate, and accessible AI-powered tools for colorectal cancer diagnosis and research.