ADPv2: A hierarchical histological tissue type-annotated dataset for potential biomarker discovery of colorectal disease.
APA
Yang Z, Li K, et al. (2026). ADPv2: A hierarchical histological tissue type-annotated dataset for potential biomarker discovery of colorectal disease. Journal of pathology informatics, 20, 100537. https://doi.org/10.1016/j.jpi.2025.100537
MLA
Yang Z, et al. "ADPv2: A hierarchical histological tissue type-annotated dataset for potential biomarker discovery of colorectal disease." Journal of pathology informatics, vol. 20, 2026, pp. 100537.
PMID
41658283
Abstract
Computational pathology (CPath) leverages histopathology images to enhance diagnostic precision and reproducibility in clinical pathology. However, publicly available CPath datasets annotated with extensive, granular histological tissue type (HTT) taxonomies remain scarce because of the expertise and annotation cost required. Existing datasets, such as the Atlas of Digital Pathology (ADP), address this by offering diverse HTT annotations generalized to multiple organs, but this generality limits in-depth studies of specific organ diseases. Building upon this foundation, we introduce ADPv2, a novel dataset focused on gastrointestinal histopathology. Our dataset comprises 20,004 image patches derived from healthy colon biopsy slides, annotated according to a hierarchical taxonomy of 32 distinct HTTs across 3 levels. Furthermore, we train a multilabel representation learning model following a two-stage training procedure on our ADPv2 dataset. By leveraging the VMamba model architecture, we achieve a mean average precision of 0.88 in multilabel colon HTT classification. Finally, we show that our dataset supports organ-specific, in-depth study for potential biomarker discovery by analyzing the model's prediction behavior on tissues affected by different colon diseases, which reveals statistical patterns that confirm the two pathological pathways of colon cancer development. Our dataset is publicly available here: Part 1, Part 2, and Part 3.
Introduction
In recent years, colorectal cancer has become the third most common cancer worldwide and a leading cause of cancer death in individuals under the age of 50.1,2 Colorectal carcinomas typically originate from local precursor polyps or lesions through well-described pathways, including the classical adenoma–carcinoma sequence and the alternate serrated pathway.3 Lower gastrointestinal endoscopy is generally employed for screening in the general population and allows direct visualization and instrumented retrieval of polyps and other suspicious lesions for further pathological evaluation.4 Histopathological image analysis by pathologists is considered the gold-standard for diagnosis of both colorectal cancer and its precursors, and for assessment of tumor staging and grading, surgical margin status, and other relevant pathological features.5,6
Computational pathology (CPath) leverages deep learning (DL) to analyze hematoxylin and eosin (H&E) whole-slide images (WSIs) at scale.7, 8, 9, 10, 11 However, WSIs are gigapixel images with substantial staining variability, preparation artifacts, and heterogeneous tissue composition.9, 10, 11 Moreover, precise annotations are expensive to obtain and are often available only at the slide level or for a small subset of regions of interest (RoIs).7,12 Consequently, existing public datasets frequently lack organ-specific, fine-grained histological tissue type (HTT) labels that would support multilabel learning and downstream biological interrogation within a single organ system.
To address the annotation bottleneck, self-supervised learning (SSL)13, 14, 15 has emerged as a compelling strategy because it reduces reliance on exhaustive labels while retaining strong representational capacity for pathology tasks.16 Building on this direction, we introduce the Atlas of Digital Pathology V2 (ADPv2), a colon-focused dataset with a 32-label hierarchical HTT taxonomy curated from healthy tissue. In addition, we develop a two-stage pipeline that first pretrains on unlabeled tiles using SSL and then fine-tunes a multilabel classifier for HTT recognition. Finally, to illustrate the scientific utility of such organ-specific representations beyond conventional accuracy metrics, we analyze confidence score distributions over RoIs from common colorectal diagnostic contexts (e.g., hyperplastic (HP)/serrated and adenomatous lesions) and examine their alignment with the two canonical CRC pathways.
In short, our contributions are as follows:
• Development of the ADPv2 dataset, a hierarchical multilabel dataset focused on gastrointestinal histopathology. The latest version of our dataset is available for download at the following links:
1. ADPv2 Dataset – Part 1
2. ADPv2 Dataset – Part 2
3. ADPv2 Dataset – Part 3
Patch collection for our dataset is ongoing, and the dataset will be continually updated.
• Development of a multilabel tissue type classification baseline model, following a two-stage procedure of SSL pretraining and supervised fine-tuning on our ADPv2 dataset.
• Demonstration of the potential for future biomarker discovery in colon cancers using our baseline model, through a statistical comparison of model predictions that confirms the two pathological pathways of colon cancer development.
Related works
Deep learning in computational pathology
Ongoing developments in advanced machine learning techniques have introduced novel DL frameworks for pathomics, the analysis of histopathological digital images with the aim of extracting quantitative features for clinical decision-making.17 These techniques can uncover previously unknown relationships and features that provide biological insights and improve disease characterization.18 In one study, Romo-Bucheli et al.19 employed DL methods to learn granular features within the nuclei, discovering that nuclei tubule prominence, intensity, multicentricity, shape, and texture may predict breast cancer recurrence. Bejnordi and colleagues20 highlighted the rich insights that can be extracted from the stroma in breast tissue slides: three CNNs were trained in hierarchical fashion to segment stroma, identify tumor–stromal content, and then derive a score for the likelihood of tumor malignancy. Colorectal cancer biomarker discovery has focused on analyzing genetic events in key genes to model tumor progression and on observing differences in protein expression to differentiate between healthy and tumor tissues.21, 22, 23 Some recent studies leveraged histology slides for DL-based detection of microsatellite instability (MSI), a key biomarker for tumor mutation and immunogenicity. Gustav et al.24 used a transformer-based DL model on colorectal tissue images, identifying both MSI and polymerase epsilon mutations in tumors. Despite extensive research in colorectal biomarker discovery, to the best of our knowledge, only a few studies apply DL approaches to histopathological slides from colorectal samples to derive biomarkers for differentiating between various types of colorectal polyps.25, 26, 27, 28
Datasets for CPath
Although publicly available pathology datasets leveraging WSIs and patch-level annotations have grown significantly in recent years, crucial gaps still exist. Recent datasets primarily address multiclass or multilabel classification, incorporating region-level annotations, yet few feature extensive taxonomies (10+ labels). For instance, the BACH dataset29 provides multiclass breast cancer annotations but only includes four classes. PANDA30 features detailed prostate cancer annotations across several grades, but does not meet the multilabel criterion. The Lizard,31 PanNuke,32 and NCT-CRC-HE-100K33 datasets provide multiclass region-level annotations, but their label taxonomies remain limited to fewer than 10 classes each, despite their diversity. ADPv134 notably offers an extensive taxonomy (57 labels), but covers various organs and integrates both normal and diseased tissues, not exclusively focusing on healthy samples of one organ type at a granular level.
Self-supervised learning in CPath
SSL has become indispensable in CPath because exhaustive annotation of WSIs is prohibitively expensive. By pretraining on millions of unlabeled tissue tiles, SSL models acquire generic morphological features that can be fine-tuned with the limited expert labels available, consistently narrowing the performance gap between pathology-specific networks and those initialized from natural image datasets such as ImageNet.
Early SSL success in CPath came from contrastive objectives that pull two augmented views of the same tile together while pushing different tiles apart. SimCLR13 delivers strong features but demands very large batches to expose sufficient negatives; MoCo15 sidesteps that requirement with a momentum-updated memory bank, trading batch size for extra bookkeeping. A second family removes explicit negatives and learns by feature prediction or distillation. BYOL35 trains a “student” network to regress the embeddings of a teacher that updates via an exponential moving average, whereas DINO36 aligns the output distributions of two networks under diverse augmentations. Both methods cut GPU memory nearly in half relative to SimCLR and have shown stable convergence on imbalanced tissue datasets, though they can become sensitive to tile sampling bias if the teacher lags too far behind the student.
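To make the contrastive objective concrete, the following is a minimal pure-Python sketch of an NT-Xent-style loss of the kind used by SimCLR. It is an illustration only, not the authors' training code: the function names, the temperature of 0.5, and the toy embeddings are assumptions introduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(views_a, views_b, temperature=0.5):
    """Mean NT-Xent loss over a batch.

    views_a[i] and views_b[i] are embeddings of two augmentations of tile i.
    The temperature value is an illustrative assumption.
    """
    z = views_a + views_b              # 2N embeddings in one list
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)        # index of the other view of the same tile
        # Every embedding except z[i] itself competes in the denominator,
        # so a larger batch exposes more negatives.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)

# Toy batch of two tiles with 2-D embeddings (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]     # second views agree with the anchors
swapped = [[0.1, 0.9], [0.9, 0.1]]     # second views agree with the wrong tile
print(nt_xent(anchors, aligned), nt_xent(anchors, swapped))
```

The loss is lower when the two views of each tile agree, which is the behavior the contrastive family relies on; the dependence on the number of negatives in the denominator is what drives SimCLR's appetite for large batches.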
More recently, redundancy reduction objectives such as Barlow Twins14 and VICReg37 decorrelate features across views instead of contrasting samples. These objectives work with moderate batch sizes, require no memory bank, and encourage diverse texture cues that are particularly valuable for rare glandular patterns in gastrointestinal (GI) slides.
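A redundancy reduction objective can be sketched in a few lines. The example below follows the general form of the Barlow Twins loss (drive the cross-correlation matrix of the two views toward the identity); it is a simplified illustration under stated assumptions, not the pipeline used in this paper. The off-diagonal weight and the toy inputs are hypothetical, and every feature is assumed to vary across the batch so that its standard deviation is non-zero.

```python
import math

def standardize(column):
    """Zero-mean, unit-variance copy of one feature across the batch.

    Assumes the feature is not constant (non-zero standard deviation).
    """
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

def barlow_twins_loss(views_a, views_b, off_diag_weight=0.005):
    """Sum of (1 - C_ii)^2 plus weighted C_ij^2 for i != j.

    C is the cross-correlation matrix between the features of the two views.
    off_diag_weight is an illustrative value.
    """
    n = len(views_a)
    dim = len(views_a[0])
    cols_a = [standardize([row[d] for row in views_a]) for d in range(dim)]
    cols_b = [standardize([row[d] for row in views_b]) for d in range(dim)]
    loss = 0.0
    for i in range(dim):
        for j in range(dim):
            c_ij = sum(x * y for x, y in zip(cols_a[i], cols_b[j])) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2            # invariance term
            else:
                loss += off_diag_weight * c_ij ** 2  # redundancy term
    return loss

# Hypothetical batches: decorrelated features vs. two copies of one feature.
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
print(barlow_twins_loss(decorrelated, decorrelated))
print(barlow_twins_loss(redundant, redundant))
```

No negatives or memory bank appear anywhere in the computation: the penalty comes entirely from correlated feature dimensions, which is why these objectives tolerate moderate batch sizes.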
Multilabel classification
Multilabel classification underpins many pathology tasks because a single tissue tile can express several histological attributes that sit at different depths of a diagnostic taxonomy. Two obstacles dominate. (i) Extreme label imbalance: rare lesions or gland patterns may appear in fewer than 1% of patches, so naive training is swamped by negatives. (ii) Strong label dependence and hierarchy: the presence of a high-level “inflammatory cells” tag, for example, constrains the probabilities of its child classes (lymphocytes, eosinophils, etc.). Standard accuracy metrics overlook these subtleties, forcing researchers to rely on set-based scores like micro- or macro-F1.
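The difference between micro- and macro-averaged F1 under label imbalance can be seen in a small worked example. The sketch below is illustrative only; the helper names and the toy label matrix are assumptions, and it is not the evaluation code behind the results reported in this paper.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """y_true and y_pred are lists of per-patch binary label vectors."""
    num_labels = len(y_true[0])
    counts = []
    for j in range(num_labels):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in pairs if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in pairs if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label F1 scores, so rare labels weigh equally.
    macro = sum(f1(*c) for c in counts) / num_labels
    # Micro: pool the counts first, so frequent labels dominate.
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    return micro, macro

# Hypothetical case: label 0 is common, label 1 is rare and always missed.
y_true = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

In this toy case the micro score stays near 0.89 while the macro score drops to 0.5, which is why both are usually reported when rare tissue types matter.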
The usual starting point is independent binary cross-entropy (BCE) with a sigmoid per label, but BCE amplifies the positive–negative skew: every patch contributes dozens of easy negative terms, drowning the rare positives in gradient noise. Re-weighting schemes seek to restore balance. Focal loss38 down-weights well-classified negatives; the asymmetric loss (ASL) of Ridnik et al.39 goes further by giving separate, tunable exponents to positives and negatives and by introducing a hard negative threshold, yielding consistent gains on long-tailed medical datasets. Class-balanced BCE and re-sampling strategies offer similar relief, but at the cost of extra passes through the data or fragile weighting heuristics. However, none of these methods captures label correlations.
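The three per-label losses discussed above can be compared side by side. The following is a minimal sketch for a single label with predicted probability p and target y; the exponents and the margin are illustrative values, not settings taken from this paper.

```python
import math

EPS = 1e-8  # guards against log(0)

def bce(p, y):
    """Binary cross-entropy for one label."""
    return -math.log(p + EPS) if y == 1 else -math.log(1.0 - p + EPS)

def focal(p, y, gamma=2.0):
    """Focal loss: scales BCE by (1 - p_t)^gamma so easy examples fade."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t + EPS)

def asymmetric(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """Asymmetric loss: separate exponents for positives and negatives,
    plus a probability margin that zeroes out very easy negatives."""
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(p + EPS)
    p_m = max(p - margin, 0.0)  # hard threshold on negatives
    return -(p_m ** gamma_neg) * math.log(1.0 - p_m + EPS)

# An easy negative (p = 0.1, y = 0) contributes less and less.
print(bce(0.1, 0), focal(0.1, 0), asymmetric(0.1, 0))
# A negative below the margin is discarded entirely.
print(asymmetric(0.03, 0))
```

With gamma_pos set to 0, positives keep their full BCE gradient while negatives are suppressed, which is the asymmetry the name refers to.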
Dependencies can be introduced explicitly. Classifier chains40 pass predicted labels down a sequence of binary heads so that each head conditions on the previous ones; performance rises, but inference cost grows linearly in the label count and results depend on chain order. More scalable are graph-based or hierarchical networks that mirror the label tree in their architecture. HMCN41 attaches shared low-level convolutional features to branch-specific classifiers, propagating information from coarse to fine labels. HiMulConE42 adds a supervised-contrastive objective43 that pulls embeddings of tiles sharing any ancestor closer together while enforcing a monotonicity constraint: child confidences can never exceed those of their ancestors. Similar hierarchy-aware ideas now appear in graph-neural-network heads and conditional probability models for WSIs.
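The monotonicity constraint can be illustrated with a simple post-hoc clipping step. Note that this is a swapped-in technique for illustration: the methods cited above build the constraint into training, whereas the sketch below only clips predicted confidences after the fact. The label names are placeholders, not entries from any real taxonomy.

```python
def enforce_monotonicity(scores, parent):
    """Clip each label's confidence so it never exceeds any ancestor's.

    scores: dict mapping label -> confidence in [0, 1]
    parent: dict mapping label -> parent label, or None for a root
    """
    def clipped(label):
        p = parent[label]
        if p is None:
            return scores[label]
        # Recursing to the parent first propagates the bound down the tree.
        return min(scores[label], clipped(p))

    return {label: clipped(label) for label in scores}

# Hypothetical three-level branch: A -> A.1 -> A.1.a
parent = {"A": None, "A.1": "A", "A.1.a": "A.1"}
raw = {"A": 0.8, "A.1": 0.9, "A.1.a": 0.3}
fixed = enforce_monotonicity(raw, parent)
print(fixed)
```

Here the mid-level score of 0.9 is reduced to its parent's 0.8, while the leaf score of 0.3 already satisfies the constraint and is left unchanged.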
Dataset
In this section, we detail the process of developing our digital pathology database, ADPv2. Fig. 1 illustrates our data collection procedure. We first collect WSIs from multiple hospitals and preprocess them using our online ADP annotation platform to extract patches for annotation. Pathologists inspect each patch and select all HTTs associated with it, creating a multilabel annotation. The final complete dataset is split into an annotated split and an unannotated split, both of which will be released to the public for research and development use.
WSI collection
The 461 WSIs used in this project originate from multiple sources, including “The Cancer Genome Atlas (TCGA)” provided by the National Cancer Institute (NCI); Kingston General Hospital (KGH; Canada); St. Michael's Hospital (SMH; Canada); and Sunnybrook Research Institute (SRI; Canada). Each slide used in our project is derived from healthy colon biopsies. KGH slides are acquired using a TissueScope LE brightfield scanner, whereas SMH and SRI slides are acquired using a TissueScope IQ brightfield scanner. The staining techniques, image magnifications, and resolutions vary across institutions, as shown in Table 1.
Patch extraction and preprocessing
We evenly extract non-overlapping image patches of fixed size 544 × 544 μm from each of our slides. To account for variations in image resolutions of WSIs, we adjust the digital resolution of the patches according to their microns per pixel (mpp) such that each image patch represents the same physical patch of 544 × 544 μm. Table 1 shows the converted pixel resolutions of each type of slide. A filtering algorithm based on image color and contrast is deployed to filter out background patches. As a result, we extracted a total of 125,808 non-background image patches. Among these patches, 20,004 image patches are selected by our annotator for labeling. Only patches deemed the most informative and containing sufficient numbers of tissue types from our taxonomy were selected.
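The conversion from a fixed physical patch size to a slide-dependent pixel size is simple arithmetic, sketched below. The 544 μm side length comes from the text; the mpp values, slide dimensions, and function names are hypothetical and do not reproduce Table 1 or the authors' extraction code.

```python
PATCH_SIZE_UM = 544.0  # physical side length of a patch, as stated in the text

def patch_side_pixels(mpp):
    """Pixels needed to cover 544 um at a given microns-per-pixel value."""
    return round(PATCH_SIZE_UM / mpp)

def patch_origins(width_px, height_px, mpp):
    """Top-left corners of a non-overlapping grid of full patches."""
    side = patch_side_pixels(mpp)
    return [(x, y)
            for y in range(0, height_px - side + 1, side)
            for x in range(0, width_px - side + 1, side)]

# Hypothetical scanners: the same tissue area needs different pixel sizes.
print(patch_side_pixels(0.25))  # finer resolution, more pixels per side
print(patch_side_pixels(0.5))   # coarser resolution, fewer pixels per side

# Hypothetical 5000 x 3000 pixel region scanned at 0.5 mpp.
origins = patch_origins(5000, 3000, 0.5)
print(len(origins), origins[0])
```

Partial patches at the right and bottom edges are dropped in this sketch; whether the original pipeline pads or discards them is not stated in the text.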
Annotation process
The annotation of this dataset was conducted on colon polyp WSIs at patch level using the ADP annotation tool. The annotator was trained under close supervision of an expert pathologist to specifically recognize GI tract tissues under different staining conditions on WSIs. The annotation criteria were designed to be highly precise, ensuring maximum specificity. The detection of even a single cell within a patch warranted its labeling, regardless of the number or quantity of cells present. All annotated patches are subsequently evaluated by the expert pathologist for accuracy.
All annotations follow a predefined hierarchical taxonomy of HTTs, as shown in Table 2. Although our dataset only contains patches extracted from colon polyp WSIs, the taxonomy includes the entire GI tract tissues for possible future dataset expansion. Developed by expert pathologists, this taxonomy carefully selected 32 HTTs belonging to 3 levels of hierarchy based on their critical role in both normal physiology and pathological transformations. These HTTs generally represent the key sites where precancerous and cancerous changes occur. For instance, the HTTs under the Surface Epithelium branch are often examined for the shift from stratified squamous to columnar epithelium, which is a well-documented precancerous change. Similarly, glandular structures (under the Glands branch), including crypts, are common sites of dysplasia and neoplastic transformation. Inflammatory cells were included due to their role in the tumor microenvironment and chronic inflammation-driven carcinogenesis. Additionally, connective tissue, stromal, neural, and vascular components were incorporated due to their role in tumor progression, as remodeling, perineural invasion, and lymphovascular invasion are key markers of cancer aggressiveness.
Dataset statistics
As previously described in the Patch Extraction and Preprocessing section, our dataset comprises a total of 20,004 multilabel annotated image patches. Each image patch is associated with at least one HTT label up to a maximum of 22 labels, with an average of 10 HTTs associated with each annotated patch in our dataset, as shown in Fig. 2. It can be observed that 90% of the image patches in the dataset have more than 9 labels, and 75% of them contain between 5 and 15 labels. This pattern reflects the inherent heterogeneity of GI tissues, where multiple structures coexist within a given area. Furthermore, patch selection was intentionally designed to prioritize histologically rich and informative regions. Instead of choosing entirely homogeneous areas, patches were selected based on the presence of multiple HTTs to maximize information capture.
Method
In this section, we provide an overview of our multilabel HTT representation learning pipeline (Fig. 3), which is designed to train a model efficiently, and describe how we use the model's predicted confidence scores to uncover statistical patterns that confirm the two pathological development pathways of colon cancers (Fig. 4).
Multilabel representation learning
Our multilabel representation learning procedure follows three steps: (1) choice of the representation learning encoder, (2) self-supervised pretraining on unlabeled image patches, and (3) model fine-tuning on the multilabel HTT classification task using labeled ADPv2 patches. The learned model is further used for potential biomarker discovery, which will be detailed in Section 4.2.
Encoder for representation learning
We adopt VMamba44 as the feature encoder for HTT classification. VMamba is a Selective State-Space Model that scans image tokens sequentially, storing their latent “state” in a lightweight gating mechanism. This design delivers transformer-level accuracy while keeping time and memory strictly linear in the number of tokens. This is an important property when a single pathology patch may contain many thousands of 16 × 16 pixel tokens.
Linear scaling lets us present the network with a large field of view (multiple millimeters of tissue) without down-sampling, so VMamba simultaneously sees architectural patterns (e.g., crypt orientation, glandular shape, and spacing) and cellular details (e.g., nuclear atypia) that often co-occur in colon polyps. The recurrent state captures short-range context naturally, whereas the long convolutional kernel implicit in the SSM propagates information across the entire patch, yielding robust representations for both coarse and fine labels in our hierarchy.
Clinically, this token-by-token sweep echoes how pathologists move a microscope stage: first surveying broad regions, then zooming in on suspicious microstructures. That behavioral analogy, together with VMamba's favorable compute profile, makes it a more apt backbone for ADPv2 than quadratic Vision Transformers45 or pure CNN alternatives.
Self-supervised learning of tissues using Barlow Twins
In the pretraining stage, we select Barlow Twins14 as our SSL method, as it efficiently reduces feature redundancy, that is, the presence of repetitive or duplicate information in the learned representation of the data. By reducing redundancy, it extracts diverse and meaningful features crucial for accurately capturing the subtle variations in pathology images, which is essential for tasks such as cancer diagnosis. Barlow Twins utilizes an objective function that measures the cross-correlation between the output features computed from two distorted versions of each image sample and tries to make this matrix as close to the identity as possible, reducing redundancy between components of these vectors. The Barlow Twins objective function is given as:
$$\mathcal{L}_{BT} = \sum_{i} (1 - C_{ii})^2 + \lambda \sum_{i} \sum_{j \neq i} C_{ij}^2$$
where $\lambda$ is a positive constant trading off the importance of the first and second terms of the loss, and $C$ is the cross-correlation matrix computed between the two outputs of the network along the batch dimension:
$$C_{ij} = \frac{\sum_{b} z^{A}_{b,i}\, z^{B}_{b,j}}{\sqrt{\sum_{b} \left(z^{A}_{b,i}\right)^2}\, \sqrt{\sum_{b} \left(z^{B}_{b,j}\right)^2}}$$
Here $b$ indexes the samples in a batch, $i$ and $j$ index the dimensions of the network outputs, and $z^{A}$ and $z^{B}$ are the outputs for the two distorted views, mean-centered along the batch dimension.
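To make the objective concrete, the following is a minimal pure-Python sketch of the Barlow Twins loss for a small batch. It is an illustration under stated assumptions, not the training code used for ADPv2: the function name `barlow_twins_loss` is hypothetical, and the default `lam=0.005` is only a placeholder for the trade-off constant (the values actually used are those in the appendix tables).

```python
import math

def barlow_twins_loss(z_a, z_b, lam=0.005):
    """Barlow Twins loss for two views.

    z_a, z_b: lists of N embeddings (each a list of D floats), one per view.
    lam: trade-off constant (placeholder value, an assumption of this sketch).
    """
    n, d = len(z_a), len(z_a[0])

    def standardize(z):
        # Zero mean and unit variance per feature, along the batch dimension.
        out = [[0.0] * d for _ in range(n)]
        for j in range(d):
            col = [row[j] for row in z]
            mean = sum(col) / n
            std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
            for i in range(n):
                out[i][j] = (z[i][j] - mean) / std
        return out

    za, zb = standardize(z_a), standardize(z_b)
    # Cross-correlation matrix C (D x D) between the two views.
    c = [[sum(za[b][i] * zb[b][j] for b in range(n)) / n for j in range(d)]
         for i in range(d)]
    # Invariance term: diagonal entries should be 1.
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    # Redundancy-reduction term: off-diagonal entries should be 0.
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lam * off_diag
```

When the two views agree and the features are decorrelated, the cross-correlation matrix is the identity and the loss is zero; duplicated features are penalized only through the off-diagonal term.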
During training, we follow the standard Barlow Twins data augmentation settings, with an extra RandStainNA46 before color normalization. RandStainNA is a hybrid framework combining stain normalization and stain augmentation while incorporating variations in color space to produce realistic stains. Using RandStainNA, we can exclude the default color jitter used in the standard Barlow Twins implementation while gaining increased performance. Training setup details can be found in Table B.1 and Table B.2 in the appendix.
Finetuning model on ADPv2 for multilabel HTT classification
After pretraining, we fine-tune the pretrained encoder for multilabel HTT classification. During finetuning, we remove the projection head of the pretrained VMamba encoder and add a classification head of a single linear layer. Before training, we discard all images originating from the 110 TCGA slides due to their heterogeneity in staining and inconsistent quality. We split the dataset into training, validation, and test using an 80–10–10 ratio. To prevent data leakage, we create dataset splits so that no two patches from different splits come from the same slide. The training split covers 364 slides and the testing split covers 86 slides. For data augmentation, we applied RandomResizedCrop, RandomResizedHorizontalFlip, RandomRotation, RandomGrayscale, Solarization, RandStainNA, and color normalization.
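The leakage-free split described above can be sketched as follows. This is a minimal illustration, assuming that each patch carries the identifier of its source slide and that the 80-10-10 ratio is applied at the slide level; the function name `split_by_slide` and the input format are hypothetical and not taken from the ADPv2 code.

```python
import random
from collections import defaultdict

def split_by_slide(patch_to_slide, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole slides to train/val/test so no slide spans two splits.

    patch_to_slide: dict mapping patch id -> slide id (assumed input format).
    Returns a dict mapping split name -> list of patch ids.
    """
    slides = sorted(set(patch_to_slide.values()))
    random.Random(seed).shuffle(slides)  # deterministic shuffle of slides
    n_train = round(ratios[0] * len(slides))
    n_val = round(ratios[1] * len(slides))

    assignment = {}
    for i, slide in enumerate(slides):
        if i < n_train:
            assignment[slide] = "train"
        elif i < n_train + n_val:
            assignment[slide] = "val"
        else:
            assignment[slide] = "test"

    # Every patch inherits the split of its source slide.
    splits = defaultdict(list)
    for patch, slide in patch_to_slide.items():
        splits[assignment[slide]].append(patch)
    return dict(splits)
```

Because slides contribute different numbers of patches, a slide-level ratio only approximates the patch-level ratio; a production split would typically check the resulting patch counts and label frequencies per split.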
Before training, we perform label pruning to address dataset imbalance and reduce label noise. First, we exclude all HTTs that are non-colon or occur very sparsely in our dataset, which are denoted in red font in Table 2. Then, we remove connective tissues because their extreme abundance (∼95% of the image patches) would strongly bias the model toward predicting this class. Furthermore, we discard HTTs whose annotations are noisy because they are difficult for pathologists to annotate. We remove inflammatory cells, polymorphonuclear cells, and mononuclear cells from training, as these are deemed “low-confidence” labels: they are selected when the presence of their child HTTs is ambiguous, and may therefore introduce inaccuracies in our dataset. Finally, we merge surface epithelium and glands into a single class, GD, because they are the same biological structure cut at different angles. All pruned HTTs are denoted in standard black font color. As a result, we reduce the total number of HTTs for training from 32 labels to 14 labels. These 14 labels are denoted in green in Table 2.
We use the asymmetric loss (ASL)47 as our objective function, which is described in the following equation:
$$\mathcal{L}_{ASL} = -\sum_{k=1}^{K} \left[ y_k \,(1 - p_k)^{\gamma_+} \log(p_k) + (1 - y_k)\, p_{m,k}^{\gamma_-} \log(1 - p_{m,k}) \right], \qquad p_{m,k} = \max(p_k - m,\, 0)$$
where $K$ is the number of labels, $y_k$ is the ground-truth label for the $k$-th class, $p_k$ is the predicted probability for the $k$-th label, $p_{m,k}$ is the shifted probability with the margin $m$ applied, and $\gamma_+$ and $\gamma_-$ are tunable focusing parameters for the positive and negative samples, respectively. This objective function enhances multilabel classification by applying different penalties to false positives and false negatives, allowing the model to focus more on correctly identifying rare but critical labels. This approach addresses label imbalance and asymmetry in error importance, leading to improved performance in scenarios where certain misclassifications carry higher consequences, especially in digital pathology. All hyperparameter settings can be found in Table B.1 in the appendix.
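As an illustration of how the asymmetric loss treats positives and negatives differently, here is a minimal pure-Python sketch for a single sample. The function name `asymmetric_loss` is hypothetical, and the default values for `gamma_pos`, `gamma_neg`, and `margin` are placeholders chosen for the example, not the settings reported in Table B.1.

```python
import math

def asymmetric_loss(y, p, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Asymmetric loss for one sample.

    y: list of K ground-truth labels (0 or 1).
    p: list of K predicted probabilities.
    gamma_pos, gamma_neg, margin: placeholder hyperparameters (assumptions).
    """
    total = 0.0
    for y_k, p_k in zip(y, p):
        # Shifted probability: easy negatives below the margin contribute nothing.
        p_m = max(p_k - margin, 0.0)
        pos_term = (1.0 - p_k) ** gamma_pos * math.log(max(p_k, eps))
        neg_term = p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
        total -= y_k * pos_term + (1 - y_k) * neg_term
    return total
```

Setting `gamma_pos = gamma_neg = 0` and `margin = 0` recovers the ordinary binary cross-entropy summed over labels, which is a convenient sanity check for any implementation.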
Confidence score distribution analysis
Tissues affected by disease change in morphology, structure, and even color. These changes may be subtle, but will affect the model's output confidence scores if the model has only seen healthy tissue patches during training.48 Therefore, the distribution of the model's predicted confidence scores on patches from the RoIs of diseased slides will shift from the distribution of normal slides on certain HTTs. The shift pattern on different sets of HTTs can be potentially used as a marker for different diseases.
Fig. 4 shows an overall illustration of the confidence score distribution analysis workflow. To find the distribution shift, we first additionally collected colon polyp WSIs belonging to four different diseases: hyperplastic polyp (HP), sessile serrated lesion (SSL), tubular adenoma (TA), and tubulovillous adenoma (TVA). All slides are annotated by pathologists at RoI level to have areas of disease identified. Traditional serrated adenomas (TSAs) and pure villous adenomas (VA) were not included due to insufficient numbers of RoI-annotated cases in our cohort at the time of analysis; as additional cases accrue, we plan to extend the analysis to these morphologies. For our healthy slides in hand, pathologists simply annotate areas where these diseases are most likely to develop as RoIs. We extracted non-overlapping patches corresponding to a FOV of 544 × 544 μm from the RoI annotation areas on all slides (both healthy and diseased). The statistics of the extracted patches are shown in Table B.4. Then, we applied our VMamba model to all RoI patches and collected the multilabel confidence scores for all HTTs for each patch, giving us a 2D tensor of shape [N, C], where N is the total number of image patches extracted from all WSI RoIs and C is the number of HTTs. We divide this collection of confidence scores into M non-overlapping subsets [N1, C] … [NM, C], depending on the disease group to which each patch belongs. After that, we visualize the distribution shifts using normalized histograms. We generate a histogram for each HTT using the M subsets of predicted confidence scores, giving us a total of C histogram plots with M distribution histograms on each plot (M − 1 diseased distributions and one healthy distribution). Furthermore, we quantify the distribution shifts using Welch's two-sample t-test (two-tailed). The significance level α was set at 0.05 for the tests described.
Because the confidence scores are heavily right-skewed, we apply a logit transformation to all confidence values to improve normality. The logit transform is defined as $\operatorname{logit}(p) = \log\frac{p + \epsilon}{1 - p + \epsilon}$, where $p$ is the confidence score and $\epsilon$ is a small constant added to avoid infinity when $p$ is exactly 0 or 1. We compute p-values with the Holm-Bonferroni correction to control the family-wise error rate (adjustment for multiple comparisons).
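The statistical steps above can be sketched in a few lines of standard-library Python. This is a hedged illustration rather than the analysis code: the function names are hypothetical, the value `eps=1e-6` and the exact placement of the constant inside the logit are assumptions, and the sketch stops at the Welch statistic and degrees of freedom. Converting these to a two-tailed p-value requires the Student t distribution, which a statistics library provides (for example `scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math

def logit(p, eps=1e-6):
    # eps keeps the transform finite at p = 0 and p = 1 (value is an assumption).
    return math.log((p + eps) / (1.0 - p + eps))

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)  # sample variances
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    se2 = var_a / na + var_b / nb
    t = (mean_a - mean_b) / math.sqrt(se2)
    df = se2 ** 2 / ((var_a / na) ** 2 / (na - 1) + (var_b / nb) ** 2 / (nb - 1))
    return t, df

def holm_bonferroni(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted
```

In use, the confidence scores of each group would first be passed through `logit`, then compared pairwise with `welch_t`, and the resulting p-values adjusted together with `holm_bonferroni` before comparison against the 0.05 threshold.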
When grouping diseases for confidence score analysis, we treat SSL and HP as the first group, TA and TVA as the second group, and healthy slides as the last group. Finally, pathologists interpret the distribution shifts from the medical perspective to confirm the two cancer development pathways.
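The grouping and histogram step can be illustrated with a short sketch. It assumes the confidence scores are available as an N × C nested list with one diagnosis string per patch; the names `GROUP_OF`, `group_scores`, and `normalized_histogram`, as well as the bin count, are hypothetical choices for this example. The group labels SH (SSL + HP) and TT (TA + TVA) follow the abbreviations used in the results.

```python
# Map each slide-level diagnosis to its analysis group (assumed string keys).
GROUP_OF = {"SSL": "SH", "HP": "SH", "TA": "TT", "TVA": "TT", "healthy": "healthy"}

def group_scores(scores, diagnoses, htt_index):
    """Collect the confidence scores of one HTT, split by disease group.

    scores: N x C nested list of confidence scores in [0, 1].
    diagnoses: length-N list with the diagnosis of each patch's slide.
    htt_index: column index of the HTT of interest.
    """
    grouped = {}
    for row, diagnosis in zip(scores, diagnoses):
        grouped.setdefault(GROUP_OF[diagnosis], []).append(row[htt_index])
    return grouped

def normalized_histogram(values, bins=10):
    """Histogram over [0, 1] whose bin heights sum to 1."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1  # v == 1.0 goes in the last bin
    return [c / len(values) for c in counts]
```

Normalizing each histogram by its group size makes the distributions comparable even when the groups contain very different numbers of patches.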
Experiments and results
Dataset qualitative analysis
In this section, we show the quality, diversity and richness of our dataset. First, we randomly sample a few patches and label HTTs within them for visualization. Second, we analyze the interactions between HTTs through co-occurrence networks.
HTT visualizations
We visualize four example patches in Fig. 5. The regions corresponding to HTTs in each patch are highlighted with black arrows. As observed, the HTTs on these patches exhibit specific and identifiable morphological characteristics. Notably, the patches tend to be multiplex, often containing multiple HTTs with diverse structural features. Specifically, the surface epithelium or brush border with numerous goblet cells appears as a monolayer of goblet cells, whereas large bowel crypts present as round or oval structures arranged in clusters in cross-sectional cuts. Embedded between these crypts is the lamina propria, which mainly consists of loose connective tissue and various inflammatory cells. Among these inflammatory cells, eosinophils are easily distinguishable due to their red granules and multilobed nuclei. Additionally, plasma cells, recognized by their round, dense, and eccentrically positioned nuclei, are also present within the lamina propria. Within the lamina propria, small lymphovascular channels—thinner than large blood vessels and lacking elastic fibers—can be detected, often containing numerous red blood cells. The smooth muscle cells of the muscularis mucosae, stained pink, can be abundant near the bases of the crypts, at the junction with the submucosa. Moreover, this patch also contains hollow-appearing white lipid vacuoles of adipose tissue, further emphasizing the complexity and heterogeneity of the visualized regions. To provide a more detailed reference for these HTTs, Fig. 6 presents representative examples of Level 3 HTTs. Each panel in the figure displays a cropped region centered on a specific HTT, with the relevant structure outlined or highlighted. These visualizations serve as examples of the diverse morphological patterns captured at this classification level. Further details can be found in the appendix, where Table C.1 provides a detailed description of all HTTs.
HTT co-occurrence and interactions
HTTs exhibit a network of coexistence by nature, with certain types frequently co-occurring. As shown in Fig. 7, we visualize the coexistence network on the HTTs in the dataset. In the figure, the varying thickness of the edges in the network graph indicates differences in co-occurrence strength, suggesting that some HTTs are more closely associated than others. Level 1 (Fig. 7a) shows four broad compartments—surface epithelium, glands, neural tissue, and inflammatory cells. The thickest edge links glands to inflammatory cells, reflecting the well-known propensity of colonic glands to attract inflammatory infiltrates when epithelial integrity is perturbed (e.g., glandular distortion, crypt abscesses). A similarly prominent SE–IC edge mirrors the fact that the luminal epithelium is the first barrier breached by pathogens or mechanical injury, again driving immune cell recruitment. In comparison, NT is more isolated, consistent with the relative scarcity of enteric neurons in routine mucosal biopsies and their limited exposure to inflammatory exudates. Level 2 (Fig. 7b) differentiates epithelium into simple columnar versus glandular phenotypes and splits the immune compartment into polymorphonuclear (POC, acute) and mononuclear (MOC, chronic) cells. The GL–MOC and POC–MOC edges dominate the graph: gland-associated chronic inflammation (e.g., lymphoplasmacytic infiltrates in long-standing colitis) explains the first, whereas the second captures the typical succession of neutrophil-led acute responses that transition into mononuclear-cell predominance during tissue repair. A thinner SCL–POC connection suggests that the surface epithelium is more often attacked by neutrophils during acute flares, whereas deeper glands harbor mixed inflammatory infiltrates. Level 3 (Fig. 7c) explodes the network into 16 fine-grained HTTs, revealing a dense mesh of anatomical and pathological relationships. 
Notable thick edges include LC–SMC (lymphovascular channels enwrapped by muscularis mucosa), CT–LY (lymphoid aggregates embedded in connective stroma), and RBC–V/RBC–NTR (erythrocytes pooling within damaged vessels and neutrophil-rich exudates). Strong SMC–EF and V–AT links echo the structural coupling of smooth muscle, elastic fibers, vessels, and perivascular adipose in the submucosa, whereas the cluster of NTR–ES–MA–PC interactions typifies mixed acute-on-chronic inflammation where neutrophils, eosinophils, macrophages, and plasma cells co-localize around sites of persistent injury. Collectively, the graphs recapitulate the expected microarchitectural hierarchy of the colonic wall and align with classical pathophysiological sequences, from epithelial damage to gland-centred chronicity and, finally, to the complex multicellular milieu observed in advanced lesions or reparative fibrosis.
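A co-occurrence network of this kind can be derived directly from the multilabel annotations. The sketch below counts, for every pair of HTTs, the number of patches in which both labels appear; using raw pair counts as edge weights is an assumption of this example, since the exact weighting behind Fig. 7 is not specified here, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(patch_labels):
    """Count how many patches contain each unordered pair of HTT labels.

    patch_labels: iterable of label collections, one per patch.
    Returns a Counter keyed by alphabetically ordered (label_a, label_b) tuples.
    """
    counts = Counter()
    for labels in patch_labels:
        # Deduplicate and sort so each pair has a single canonical key.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts
```

The resulting counts can be passed to any graph-drawing tool, with edge thickness proportional to the count or to a normalized variant such as the Jaccard index.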
Multilabel representation learning results
In this section, we detail both quantitative and qualitative results achieved by our VMamba model. For quantitative results, we show the model's overall multilabel classification performance and per-class performance using multiple metrics on the test split of our ADPv2 dataset. For qualitative results, we first visualize the spread of the multilabel representations learned by our model in the latent space using t-SNE on 500 randomly sampled image patches from our ADPv2 dataset. Then, we provide heatmaps on in-house test slides to check the quality of HTT identification on WSIs. Finally, we select the most representative image patch for each HTT in our dataset and apply GradCAM on the model's second VSSM block to visualize the pixel-level regions the model attends to when making predictions.
Quantitative results
Our finetuned VMamba model achieves a test mean average precision (mAP) of 0.879. As shown in Table 3, for each HTT used in training, we compute the true-positive rate (TPR), true-negative rate (TNR), false-positive rate (FPR), false-negative rate (FNR), accuracy (ACC), F1-score, precision, and recall. Notably, the model performs strongly on most HTTs and, importantly, on the HTTs that pathologists consider diagnostically relevant. We also find that the worst performances primarily stem from four HTTs: LC, ES, MA, and RBC. LC and RBC are both present in a large majority of our dataset, with LC occurring in 16,418 image patches and RBC occurring in 16,543 image patches, meaning each of these HTTs is found in about 82% of all image patches in our dataset. Furthermore, LC and RBC exhibit similar performance profiles: exceptionally high TPR and FPR, and abnormally low TNR and FNR. One interpretation is that the lack of negative samples makes it difficult for the model to discern these HTTs properly. In addition, our objective function incentivizes predicting positive labels correctly via asymmetric weighting of positive losses, so the model neglects to “learn” negative samples when so few exist. Conversely, for macrophages (MA), we observe low TPR, high FNR, low precision, and low F1, indicating that the model cannot recognize positive samples. This can be attributed to the low number of macrophage-positive patches in our dataset, at just 1020 samples, or 6% of the dataset. Lastly, ES has a high FPR, low TNR, and low accuracy, indicating that the model is quite poor at correctly identifying negatives. This can be attributed to the general difficulty of identifying this HTT in our images, as is the case with neutrophils (refer to Section 4.1.2), which is the sibling of ES under POC. Thus, ES is susceptible to mislabeling by the pathologists themselves and, as a result, the “real” FPR is likely to be slightly lower than reported.
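For reference, the per-class metrics listed above follow from the confusion counts of each HTT treated as a binary problem. The sketch below is a minimal illustration with a hypothetical function name; it assumes the predictions have already been thresholded to 0 or 1, and it returns 0.0 for any ratio with a zero denominator.

```python
def per_class_metrics(y_true, y_pred):
    """Binary metrics for one HTT from 0/1 ground truth and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty classes

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # identical to TPR
    return {
        "TPR": recall,
        "TNR": ratio(tn, tn + fp),
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, fn + tp),
        "ACC": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "F1": ratio(2 * precision * recall, precision + recall),
    }
```

A label that is positive in most patches and always predicted positive yields TPR = 1 and FPR = 1 with TNR = 0, which is the pattern described above for the most abundant HTTs.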
Qualitative results
In this section, we first visualize the representation of our VMamba model learned on ADPv2 using t-SNE, then we show the heatmaps on unseen slides using the model's predictions. Because our dataset is multilabeled, we use a color blending technique to color-label the t-SNE plot: each HTT is assigned a color, and the colors are blended into one if a sample has multiple labels associated with it. For visualization clarity, we only consider the parent HTTs for color blending, reducing the total number of colors to blend from 16 to 11. As shown in Fig. 8, we observe a clear clustering effect in the blended colors. This suggests that the VMamba model effectively captures high-level morphological differences, e.g., separating glandular crypt structures from surface epithelium or adipose tissue. Where colors blend, the patches exhibit multiple co-occurring tissue/cell types, reflecting the biological reality that colon subregions consist of overlapping structures. For instance, smooth muscle cells and elastic fibers coexist as key components of blood vessel walls in the submucosa. From a histopathological point of view, this clustering and partial overlap align with the preserved colon wall architecture, e.g., the epithelial lining is found at the mucosal surface, and lymphatic and vascular channels can be seen within the submucosa.
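One simple way to realize such color blending is to average the RGB colors of all labels attached to a sample. The sketch below assumes this plain averaging scheme and an illustrative palette; the blending rule and colors actually used for Fig. 8 are not specified here, and the function name is hypothetical.

```python
def blend_colors(labels, palette):
    """Average the RGB colors of a sample's labels into a single color.

    labels: non-empty list of HTT codes present in the sample.
    palette: dict mapping HTT code -> (R, G, B) with integer values 0-255.
    """
    colors = [palette[label] for label in labels]
    n = len(colors)
    # Channel-wise mean, rounded back to an integer RGB value.
    return tuple(round(sum(c[i] for c in colors) / n) for i in range(3))
```

A sample with a single label keeps that label's color, whereas multilabel samples land between the colors of their constituent HTTs, so mixed regions of the t-SNE plot appear as intermediate hues.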
We also visualize the model's prediction results for each HTT as heatmaps on slides in the test split. Each slide is tiled into patches corresponding to a FOV of 544 × 544 μm at 20× magnification. The patches are resized to 512 × 512 pixels and fed into our VMamba model to get multilabel confidence scores. For each HTT, we visualize the patch confidence scores as an overlay on the original slides. Fig. 9 shows the heatmaps for a healthy slide and a TVA slide on five diagnostically relevant HTTs. On TVA slides, the detection activates mostly on glands and lamina propria, matching key diagnostic architectural features of TVAs, with increased crowding of dysplastic glands and the presence of both tubular and villous components (the latter representing >25% but <75% of the TVA). The observed detection of immune cells, such as lymphocytes and plasma cells, within the adenoma's mucosa could be related to local responses to dysplastic changes, especially if increased, but such cells are also seen within the lamina propria of the normal colorectal mucosa.
We also leverage GradCAM to identify the salient features used by our model during prediction. GradCAM is performed at the patch level to show how granular tissue morphologies contribute to the model's prediction of particular HTTs. For demonstration, we visualize the class activation maps of all 14 HTTs used in training on test image patches in our ADPv2 dataset, as shown in Fig. 10.
Whereas for the majority of HTTs, activation maps as seen with GradCAM have mostly focused directly on the HTT of interest (e.g., LA, lymphocytes; GD, glands; SMC, smooth muscle cells), in other instances, salient features most used by the model as visualized by the activation maps include surrounding non-HTT tissues (e.g., for AT, adipose tissue) or interfaces/limits between HTT-positive and non-HTT regions (e.g., for V, large vessels).
Confidence score distribution analysis
In Fig. 11, we present the distribution of the predicted confidence score on patches extracted from gland areas on slides of five categories. From Fig. 11a, we observe a clear separation between the diseased distribution and the healthy distribution, where the diseased confidence scores are shifted to the left, which could be related to the subtle glandular distortions as interpreted by the model. Further, in Fig. 11b, there is a separation between TT (TA + TVA) and SH (SSL + HP), suggesting that the model's confidence is affected by underlying morphological glandular differences between categories of colorectal precursor lesions as they diverge from normal mucosa (e.g., SSL + HP: luminal epithelial serrations, and, in most cases, crypt elongation, small basally located nuclei; TA + TVA: dysplasia, by definition, with prominent pseudostratification in most cases).49, 50, 51 Of note, a decrease in inter-rater agreement regarding colorectal polyp classification can be often similarly seen among both general and expert pathologists, highlighting the complexity of “diseased” glandular interpretation and potential decrease in confidence scores.51,52
We quantify the distribution shifts using t-tests. As summarized in Table 4, SSL + HP polyps elicit a pronounced reduction in model confidence relative to normal tissue under Welch's t-test, whereas the effect is more modest for traditional TA + TVA. The TT and SH distributions also differ significantly from each other, with a shift in classifier confidence scores within the same tissue compartment; the corresponding statistics for all three comparisons are reported in Table 4. These pathway-specific confidence signatures indicate that the network is recognizing subtle, biologically meaningful glandular alterations, highlighting their potential as computational biomarkers for precursor lesion characterization in colorectal cancer screening.
To further understand distribution shifts in terms of the model's prediction behavior, we visualize the model's attention on diseased patches using GradCAM heatmaps, as shown in Fig. 12. Through the heatmaps, we observe a predominant mucosal epithelial contribution with a minor component from the lamina propria surrounding the ‘diseased’ glands in cases of HP, SSL, TA, and TVA. In healthy tissues, the “gland” (GD) activation map is relatively uniform and diffuse across the glandular epithelium component, with slight local variations from contributions from epithelial cell nuclei and cytoplasm. In the HP and SSL cases, we note strong activations along the glandular epithelium, but most specifically at the sawtooth luminal borders, and branching crypt outlines, with increased contribution from cell cytoplasm, whereas TA and TVA cases elicit model activation hotspots mostly on atypical nuclei (instead of cytoplasm), which can be elongated and pseudostratified, within the dysplastic epithelium, and, to a lesser extent, the lamina propria surrounding the basal aspect of the crypts.
The heatmaps suggest that our model classifier focuses on diagnostically relevant features, reflecting biologically meaningful deviations from normal glandular architecture. Our model appears to attend to disease-specific glandular alterations rather than background artifacts to visually anchor its grouping of serrated versus adenomatous lesions. These saliency maps could therefore provide an interpretable bridge between DL predictions and diagnostic pathology, pending further validation.
Dataset qualitative analysis
In this section, we show the quality, diversity and richness of our dataset. First, we randomly sample a few patches and label HTTs within them for visualization. Second, we analyze the interactions between HTTs through co-occurrence networks.
HTT visualizations
We visualize four example patches in Fig. 5. The regions corresponding to HTTs in each patch are highlighted with black arrows. As observed, the HTTs on these patches exhibit specific and identifiable morphological characteristics. Notably, the patches tend to be multiplex, often containing multiple HTTs with diverse structural features. Specifically, the surface epithelium or brush border with numerous goblet cells appears as a monolayer of goblet cells, whereas large bowel crypts present as round or oval structures arranged in clusters in cross-sectional cuts. Embedded between these crypts is the lamina propria, which mainly consists of loose connective tissue and various inflammatory cells. Among these inflammatory cells, eosinophils are easily distinguishable due to their red granules and multilobed nuclei. Additionally, plasma cells, recognized by their round, dense, and eccentrically positioned nuclei, are also present within the lamina propria. Within the lamina propria, small lymphovascular channels—thinner than large blood vessels and lacking elastic fibers—can be detected, often containing numerous red blood cells. The smooth muscle cells of the muscularis mucosae, stained pink, can be abundant near the bases of the crypts, at the junction with the submucosa. Moreover, this patch also contains hollow-appearing white lipid vacuoles of adipose tissue, further emphasizing the complexity and heterogeneity of the visualized regions. To provide a more detailed reference for these HTTs, Fig. 6 presents representative examples of Level 3 HTTs. Each panel in the figure displays a cropped region centered on a specific HTT, with the relevant structure outlined or highlighted. These visualizations serve as examples of the diverse morphological patterns captured at this classification level. Further details can be found in the appendix, where Table C.1 provides a detailed description of all HTTs.
HTT co-occurrence and interactions
HTTs exhibit a network of coexistence by nature, with certain types frequently co-occurring. As shown in Fig. 7, we visualize the coexistence network on the HTTs in the dataset. In the figure, the varying thickness of the edges in the network graph indicates differences in co-occurrence strength, suggesting that some HTTs are more closely associated than others. Level 1 (Fig. 7a) shows four broad compartments—surface epithelium, glands, neural tissue, and inflammatory cells. The thickest edge links glands to inflammatory cells, reflecting the well-known propensity of colonic glands to attract inflammatory infiltrates when epithelial integrity is perturbed (e.g., glandular distortion, crypt abscesses). A similarly prominent SE–IC edge mirrors the fact that the luminal epithelium is the first barrier breached by pathogens or mechanical injury, again driving immune cell recruitment. In comparison, NT is more isolated, consistent with the relative scarcity of enteric neurons in routine mucosal biopsies and their limited exposure to inflammatory exudates. Level 2 (Fig. 7b) differentiates epithelium into simple columnar versus glandular phenotypes and splits the immune compartment into polymorphonuclear (POC, acute) and mononuclear (MOC, chronic) cells. The GL–MOC and POC–MOC edges dominate the graph: gland-associated chronic inflammation (e.g., lymphoplasmacytic infiltrates in long-standing colitis) explains the first, whereas the second captures the typical succession of neutrophil-led acute responses that transition into mononuclear-cell predominance during tissue repair. A thinner SCL–POC connection suggests that the surface epithelium is more often attacked by neutrophils during acute flares, whereas deeper glands harbor mixed inflammatory infiltrates. Level 3 (Fig. 7c) explodes the network into 16 fine-grained HTTs, revealing a dense mesh of anatomical and pathological relationships. 
Notable thick edges include LC–SMC (lymphovascular channels enwrapped by muscularis mucosa), CT–LY (lymphoid aggregates embedded in connective stroma), and RBC–V/RBC–NTR (erythrocytes pooling within damaged vessels and neutrophil-rich exudates). Strong SMC–EF and V–AT links echo the structural coupling of smooth muscle, elastic fibers, vessels, and perivascular adipose in the submucosa, whereas the cluster of NTR–ES–MA–PC interactions typifies mixed acute-on-chronic inflammation where neutrophils, eosinophils, macrophages, and plasma cells co-localize around sites of persistent injury. Collectively, the graphs recapitulate the expected microarchitectural hierarchy of the colonic wall and align with classical pathophysiological sequences, from epithelial damage to gland-centred chronicity and, finally, to the complex multicellular milieu observed in advanced lesions or reparative fibrosis.
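As a rough illustration of how the edge weights in such a co-occurrence network could be derived, the sketch below counts, for every pair of HTTs, the number of patches in which both are annotated. The helper name `cooccurrence_edges` and the four example patches are hypothetical; the normalization and layout actually used for Fig. 7 are not specified here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(patch_labels):
    """Count, for every unordered HTT pair, the patches containing both."""
    edges = Counter()
    for labels in patch_labels:
        # Sort so that (A, B) and (B, A) map to the same edge key.
        for pair in combinations(sorted(set(labels)), 2):
            edges[pair] += 1
    return edges


# Hypothetical Level 3 annotations for four patches.
patches = [
    {"SMC", "EF", "V"},
    {"NTR", "ES", "PC"},
    {"SMC", "EF"},
    {"V", "AT", "RBC"},
]

edges = cooccurrence_edges(patches)
for (a, b), weight in edges.most_common(3):
    print(f"{a}-{b}: {weight}")
```

Each count can then be mapped to an edge thickness, so that frequently co-occurring pairs stand out in the graph.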
Multilabel representation learning results
In this section, we detail both quantitative and qualitative results achieved by our VMamba model. For quantitative results, we report the model's overall multilabel classification performance and its per-class performance using multiple metrics on the test split of our ADPv2 dataset. For qualitative results, we first visualize the spread of the multilabel representations learned by our model in the latent space using t-SNE on 500 randomly sampled image patches from our ADPv2 dataset. Then, we provide heatmaps on in-house test slides to check the quality of HTT identification on WSIs. Finally, we select the most representative image patch for each HTT in our dataset and apply GradCAM on the model's second VSSM block to visualize the pixel-level regions the model attends to when making predictions.
Quantitative results
Our finetuned VMamba model achieves a test mean average precision (mAP) of 0.879. As shown in Table 3, for each HTT used in training, we compute the true-positive rate (TPR), true-negative rate (TNR), false-positive rate (FPR), false-negative rate (FNR), accuracy (ACC), F1-score, precision, and recall. The model performs strongly on most HTTs, in particular on the HTTs that pathologists consider diagnostically relevant. The worst performances stem primarily from four HTTs: LC, ES, MA, and RBC. LC and RBC are both present in a large majority of our dataset, with LC occurring in 16,418 image patches and RBC in 16,543, i.e., each in about 82% of all image patches. LC and RBC also show similar performance profiles: exceptionally high TPR and FPR, and abnormally low TNR and FNR. One interpretation is that the lack of negative samples makes it difficult for the model to discern these HTTs properly. In addition, our objective function incentivizes predicting positive labels correctly via asymmetric weighting of positive losses, so the model neglects to “learn” negative samples when so few exist. Conversely, for macrophages (MA), we observe low TPR, high FNR, low precision, and low F1, indicating that the model fails to recognize positive samples. This can be attributed to the scarcity of positive macrophage samples in our dataset: just 1020 samples, or 6% of the dataset. Lastly, ES has a high FPR, low TNR, and low accuracy, indicating that the model is quite poor at correctly identifying negatives. This can be attributed to the general difficulty of identifying this HTT in our images, as is the case with neutrophils (refer to Section 4.1.2), the sibling of ES under POC. Thus, ES is susceptible to mislabeling by the pathologists themselves and, as a result, the “real” FPR is likely to be slightly lower than reported.
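To make the per-class metrics concrete, the sketch below computes the confusion-matrix rates for a single HTT and a rank-based average precision. It is a minimal illustration with hypothetical labels, and it assumes a fixed decision threshold has already been applied to obtain binary predictions; the exact thresholding and AP definition behind Table 3 are not restated here.

```python
def confusion_rates(y_true, y_pred):
    """Per-class rates from binary labels and thresholded predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined rates (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "TPR": ratio(tp, tp + fn),  # identical to recall
        "TNR": ratio(tn, tn + fp),
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, fn + tp),
        "ACC": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def average_precision(y_true, scores):
    """Mean of the precision values measured at each positive's rank."""
    ranked = sorted(zip(scores, y_true), key=lambda item: -item[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")


# Hypothetical class present in 8 of 10 patches and always predicted positive.
frequent_true = [1] * 8 + [0] * 2
frequent_pred = [1] * 10
rates = confusion_rates(frequent_true, frequent_pred)
print(rates["TPR"], rates["FPR"], rates["TNR"], rates["FNR"])
```

The toy class reproduces the pattern described for LC and RBC: when a label is nearly always present and always predicted, TPR and FPR are both high while TNR and FNR collapse. Averaging `average_precision` over all classes gives a mAP-style summary.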
Qualitative results
In this section, we first visualize the representation of our VMamba model learned on ADPv2 using t-SNE, then we show the heatmaps on unseen slides using the model's predictions. Because our dataset is multilabeled, we use a color blending technique to color-label the t-SNE plot, assigning each HTT a color and blending the colors into one if a sample has multiple labels associated with it. For visualization clarity, we only consider the parent HTTs for color blending, reducing the total number of colors to blend from 16 to 11. As shown in Fig. 8, we observe a clear clustering effect in the blended colors, suggesting that the VMamba model effectively captures high-level morphological differences, e.g., separating glandular crypt structures from surface epithelium or adipose tissue. Where colors blend, the patches exhibit multiple co-occurring tissue/cell types, reflecting the biological reality that colon subregions consist of overlapping structures. For instance, smooth muscle cells and elastic fibers coexist as key components of blood vessel walls in the submucosa. From a histopathological point of view, this clustering and partial overlap align with the preserved colon wall architecture, e.g., the epithelial lining is found at the mucosal surface, and lymphatic and vascular channels can be seen within the submucosa.
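One simple way to realize such color blending is to map each label to its parent HTT and average the parents' RGB colors. The sketch below assumes plain averaging, which may differ from the blending actually used for Fig. 8; the palette, the `parent_of` mapping, and the function name are hypothetical.

```python
def blend_colors(labels, palette, parent_of=None):
    """Average the RGB colors of a sample's parent-level labels."""
    parent_of = parent_of or {}
    # Collapse child labels onto their parents and drop duplicates.
    parents = sorted({parent_of.get(label, label) for label in labels})
    if not parents:
        raise ValueError("a sample needs at least one label")
    channels = zip(*(palette[p] for p in parents))
    return tuple(sum(channel) / len(parents) for channel in channels)


# Hypothetical palette (RGB in [0, 1]) and child-to-parent mapping.
palette = {"GD": (1.0, 0.0, 0.0), "POC": (0.0, 0.0, 1.0), "AT": (0.0, 1.0, 0.0)}
parent_of = {"NTR": "POC", "ES": "POC"}

# A patch with glands, neutrophils, and eosinophils blends red and blue.
color = blend_colors(["GD", "NTR", "ES"], palette, parent_of)
print(color)
```

Because sibling labels collapse onto one parent before averaging, a patch with several polymorphonuclear cell types is not pulled further toward the parent color than a patch with only one.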
We also visualize the model's prediction results for each HTT as heatmaps on slides in the test split. Each slide is tiled into patches corresponding to a FOV of 544 × 544 microns at 20× magnification. The patches are resized to 512 × 512 pixels and fed into our VMamba model to get multilabel confidence scores. For each HTT, we visualize the patch confidence scores as an overlay on the original slides. Fig. 9 shows the heatmaps for a healthy slide and a TVA slide on five diagnostically relevant HTTs. On TVA slides, the detection activates mostly on glands and lamina propria, matching key diagnostic architectural features of TVAs, with increased crowding of dysplastic glands and presence of both tubular and villous components (the latter representing >25% but <75% of the TVA). The observed detection of immune cells, such as lymphocytes and plasma cells, within the adenoma's mucosa could be related to local responses to dysplastic changes, especially if increased, but such cells are also seen within the lamina propria of the normal colorectal mucosa.
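The tiling and overlay steps can be sketched as follows. The resolution of 0.5 microns per pixel is an assumption chosen for illustration (the scanner resolution is not stated in this passage), and the function names, slide dimensions, and scores are hypothetical; reading the slide, resizing tiles to 512 × 512 pixels, and model inference are omitted.

```python
def tile_origins(width_px, height_px, fov_um=544.0, mpp=0.5):
    """Tile side in pixels and top-left corners of non-overlapping tiles."""
    tile_px = round(fov_um / mpp)  # microns divided by microns-per-pixel
    origins = [
        (x, y)
        for y in range(0, height_px - tile_px + 1, tile_px)
        for x in range(0, width_px - tile_px + 1, tile_px)
    ]
    return tile_px, origins


def score_grid(origins, scores, tile_px):
    """Arrange per-tile confidence scores into a row-major 2D grid."""
    cells = {(y // tile_px, x // tile_px): s for (x, y), s in zip(origins, scores)}
    rows = max(r for r, _ in cells) + 1
    cols = max(c for _, c in cells) + 1
    return [[cells.get((r, c), 0.0) for c in range(cols)] for r in range(rows)]


# Hypothetical slide region of 3264 x 2176 pixels and per-tile scores for one HTT.
tile_px, origins = tile_origins(3264, 2176)
scores = [0.1, 0.8, 0.9, 0.2, 0.7, 0.95]
grid = score_grid(origins, scores, tile_px)
print(tile_px, len(origins), grid)
```

The resulting grid has one cell per tile and can be upsampled and alpha-blended over a slide thumbnail to produce a per-HTT heatmap.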
We also leverage GradCAM to identify the salient features used by our model during prediction. GradCAM is performed at the patch level to show how granular tissue morphologies contribute to the model's prediction of particular HTTs. For demonstration, we visualize the class activation maps of all 14 HTTs used in training on test image patches in our ADPv2 dataset, as in Fig. 10.
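For readers unfamiliar with GradCAM, the core computation is small: each feature-map channel is weighted by the spatial average of the class-score gradient for that channel, the weighted channels are summed, and negative values are clipped. The sketch below shows only this step on toy nested lists; extracting activations and gradients from a specific VMamba block requires framework hooks that are not shown, and the function name and inputs are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from K feature maps and their class-score gradients.

    Both arguments are lists of K channels, each an H x W nested list.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial average of that channel's gradient.
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    # ReLU keeps only evidence in favour of the class; then scale to [0, 1].
    cam = [[max(value, 0.0) for value in row] for row in cam]
    peak = max(max(row) for row in cam)
    return [[value / peak for value in row] for row in cam] if peak > 0 else cam


# Toy example: channel 0 supports the class, channel 1 argues against it.
activations = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heat = grad_cam(activations, gradients)
print(heat)
```

In practice the low-resolution map is upsampled to the patch size and overlaid on the image, which is how activation maps such as those in Fig. 10 are typically rendered.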
Discussion
In this work, we introduced the ADPv2 dataset, a carefully curated repository of 20,004 annotated image patches from healthy colon biopsies, enriched with a 32-label hierarchical taxonomy. Beyond the dataset, we developed and fine-tuned a DL pipeline that combines Barlow Twins self-supervised pretraining with the VMamba architecture, demonstrating both the robustness of our data and the efficacy of this novel model design.
Furthermore, our extensive 32-label taxonomy, annotated exclusively on healthy colon slides, is distinct from existing datasets. The BACH dataset provides multiclass breast cancer annotations but is limited to four classes. PANDA features diverse prostate cancer annotations, but does not meet the multilabel criterion. The Lizard, PanNuke, and NCT-CRC-HE-100K datasets provide multiclass region-level annotations, but are limited to fewer than 10 labels. While ADPv1 offers an extensive taxonomy, it covers various organs and integrates both normal and diseased tissues, not exclusively focusing on healthy samples of one organ type.
In addition, we presented an in-depth confidence distribution analysis, revealing characteristic shifts in the model's predicted confidence scores when encountering various disease groups. These shifts—lower peak sharpness, leftward displacement, and wider spread—could signal underlying structural abnormalities interpreted by the model and, specifically in the case of glandular analysis, might reflect known pathological changes in colorectal precursor lesions.
Regarding limitations, our RoI analysis focused on HP, SSL, TA, and TVA. Rarer morphologies such as TSA and VA were excluded due to limited RoI-annotated cases. Another limitation is class imbalance, which necessitates the pruning of certain labels (i.e., connective tissue). In the future, we intend to incorporate the aforementioned categories and achieve a more balanced representation across all class labels. For the RoI analysis, we also hope to incorporate further external validation and deeper investigation to translate these insights into robust biomarkers for cancer detection.
Conclusion
In this work, we addressed the need for organ-specific, fine-grained histological labeling in colon pathology, a gap that limits multilabel learning and downstream biological interrogation. We proposed three contributions: ADPv2, a multilabel dataset; a VMamba baseline model for multilabel HTT classification; and a confidence score distribution analysis demonstrating the potential of ADPv2 for biomarker discovery.
ADPv2 comprises 20,004 healthy colon patches annotated with a 32-label hierarchical taxonomy and enables training a two-stage VMamba pipeline that achieves mAP = 0.879 on multilabel HTT classification. Confidence score analyses on RoIs show pathway-specific shifts (e.g., SH vs Normal: t = −6.98, p < 0.001; TT vs Normal: t = −1.99, p = 0.047; TT vs SH: t = 3.88, p < 0.001), supporting the use of learned representations of healthy tissues as sensitive probes for disease-related morphology. Together, these findings significantly advance the field of CPath by providing a high-quality dataset, a powerful modeling strategy, and actionable biological insights.
Limitations include the focus on healthy-tissue training, label pruning for rare or ambiguous HTTs, and the absence of explicit TSA/VA RoIs in the current analysis. Future work will expand RoI annotations (including TSA/VA), perform external multisite validation, and explore hierarchy-aware heads and slide-level integration for prospective, clinically oriented evaluation.
Overall, this work demonstrates how data-centric curation and modern DL can be combined to enable organ-specific, explainable analysis of colon histology. We expect these advances to facilitate decision-support tools that improve workflow efficiency and reduce repetitive review, ultimately benefiting patients and care providers.
Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of this work, the authors used ChatGPT for English writing consistency and fluency. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.
Funding
This study was funded by the Government of Ontario under the ORF-RE grant (ORF-Re10_026), and by the NSERC-Discovery Grant Program (RGPIN-2022-05378).
Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:
Mahdi S Hosseini reports financial support was provided by the Government of Ontario. Mahdi S Hosseini reports financial support was provided by the NSERC-Discovery Grant Program. If there are other authors, they declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.
Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article when quoting.