Self-Supervised Transformer-Based Pipeline for Liver Tumor Segmentation and Type Classification.
APA
Mojtahedi R, Hamghalam M, et al. (2026). Self-Supervised Transformer-Based Pipeline for Liver Tumor Segmentation and Type Classification. JCO Clinical Cancer Informatics, 10, e2500135. https://doi.org/10.1200/CCI-25-00135
MLA
Mojtahedi R, et al. "Self-Supervised Transformer-Based Pipeline for Liver Tumor Segmentation and Type Classification." JCO Clinical Cancer Informatics, vol. 10, 2026, pp. e2500135.
PMID
41616240
Abstract
[PURPOSE] It is essential to detect and segment liver tumors to guide treatment and track disease progression. To reduce the need for large annotated data sets, we present an end-to-end pipeline that uses self-supervised pretraining to improve segmentation and then classifies tumor types with a separate pretrained classifier applied to the segmented tumor regions.
[METHODS] First, we pretrained the encoder of a transformer-based network using a self-supervised approach on unlabeled abdominal computed tomography images. Subsequently, we fine-tuned the segmentation network to segment the liver and tumors, and the tumor regions were classified using a pretrained convolutional neural network (Inception-v3 architecture) as intrahepatic cholangiocarcinoma (ICC), hepatocellular carcinoma (HCC), or colorectal liver metastases (CRLMs). We evaluated 459 images (155 HCC, 107 ICC, 197 CRLM). For external testing, we used an independent public data set (n = 40).
[RESULTS] Averaged across HCC, ICC, and CRLM, in comparison with a supervised baseline (no pretraining), self-supervised pretraining improved the liver Dice similarity coefficient (DSC) by 6.4 percentage points and reduced the 95th-percentile Hausdorff distance (HD) by 32.97 mm. For tumors, the DSC increased by 6.0 percentage points and the HD decreased by 3.2 mm. Tumor type classification achieved AUC 0.98 (95% CI, 0.96 to 1.00) and accuracy 96% (95% CI, 92% to 99%). Segmentation performance on the external data was close to the internal cohort with tumor DSC 0.73, intersection over union (IoU) 0.60, and HD 30.98 mm and liver DSC 0.91, IoU 0.83, and HD 29.67 mm.
[CONCLUSION] The proposed self-supervised, end-to-end pipeline improves liver tumor segmentation and provides accurate tumor type classification, supporting reliable radiologic assessment, treatment planning, and improved prognostication for patients with liver cancer.
1
Introduction
Background:
Liver cancer is the sixth most common cancer diagnosed globally, and the third leading cause of cancer-related deaths.1 Hepatocellular carcinoma (HCC) accounts for more than 80% of primary liver cancers and is closely linked to chronic hepatitis infection and cirrhosis.2 Intrahepatic cholangiocarcinoma (ICC) is the second most common primary liver cancer, accounting for approximately 10–15% of all primary liver cancers.3 Colorectal cancer is the second most common cause of cancer-related deaths and the third most common type of cancer globally,1 with liver metastases occurring in approximately 25–50% of patients.4 Accurate liver and tumor assessment is critical for treatment decision-making including radiotherapy, surgical resection, arterial embolization, and systemic chemotherapy.5 Abdominal computed tomography (CT) is essential for this evaluation, providing detection, diagnosis, and staging.6 Consequently, the Response Evaluation Criteria in Solid Tumors mandate measurement of the dominant tumor,7 emphasizing CT’s key importance in standardized evaluations. Manual segmentation of liver tumors is time consuming and lacks reproducibility.5
Technical Challenges:
Despite the clinical significance of precise liver tumor segmentation, deep-learning-based methods continue to struggle with limited annotated data, class imbalance, and the intrinsic three-dimensional complexity of CT scans.8,9 Specifically, U-shaped convolutional neural network (CNN) architectures, inspired by U-Net, have traditionally been adopted in medical image segmentation to leverage encoder-decoder structures with skip connections for multi-scale feature extraction.10–12 Subsequent enhancements, such as three-dimensional (3D) U-Net and V-Net, employ volumetric convolutions, residual connections, and Dice similarity coefficient (DSC)-based losses to better capture spatial context in 3D data.13,14 In recent years, transformer-based architectures have been introduced, mainly based on Vision Transformers and later enhanced by the Swin Transformer, which employs a shifted-window mechanism for feature extraction and leverages global self-attention to capture long-range dependencies.15,16 However, supervised learning with these models typically requires large amounts of annotated data, which is a major limitation in medical image analysis.17 To tackle this limitation, self-supervised learning (SSL) has been introduced to exploit large volumes of unlabeled data for learning effective representations, primarily by leveraging pseudo-masks and contrastive features to enhance segmentation performance.18 In addition to segmentation, accurate classification of liver tumors is essential for guiding clinical decision-making and treatment planning.19 CNN architectures have been widely leveraged for multi-scale feature extraction from imaging data for such problems.20–22 In particular, the fusion modules of Inception-v3 can mitigate parameter burden without compromising accuracy, making it an appealing option in data-scarce scenarios.23
Contributions:
Recently developed technologies have not fully addressed the issues of data scarcity, tumor heterogeneity, or the absence of a unified pipeline for simultaneous segmentation and classification of liver cancer tumors. Our study aims to address these gaps by proposing the following:
To the best of our knowledge, we present the first self-supervised, transformer-based, end-to-end pipeline that segments liver tumors and then classifies primary and secondary liver cancers in a single workflow.
The pipeline tackles heterogeneous tumor shapes through a hybrid masked-autoencoding plus geometric-contrastive SSL schedule, boosting segmentation DSC by 7.4% and reducing 95th-percentile Hausdorff distance (HD95) by 32.9 mm compared with a supervised baseline.
We validate the pipeline on primary and secondary liver cancer data, including ICC, HCC, and colorectal liver metastases (CRLM).
2
Methods
The pipeline proceeds in two stages: a transformer-based 3D model segments the liver and tumors on contrast-enhanced CT, and an Inception-v3 model classifies auto-segmented tumor slices as HCC, ICC, or CRLM. Figure 1 shows the overall workflow; the details are discussed below.
2.1
Data Description
Our dataset consists of abdominal CT images of patients who underwent hepatic resection with pathologically confirmed diagnoses of HCC (155 patients), CRLM (197 patients), and ICC (107 patients). The images were collected in accordance with the standard clinical imaging protocol for contrast-enhanced portal venous phase CT. The voxel spacing ranged from 0.57 to 0.98 mm in the x- and y-axes and from 0.8 to 8 mm in the z-axis. All segmentations were reviewed by a board-certified radiologist and confirmed for data integrity and quality. These datasets have been previously utilized in other studies.24–28 Patient demographics are provided in supplementary Table S1.
Figure 2 provides a hierarchical structure of the data analysis, while supplementary Figure S1 shows representative samples. In addition, 40 contrast-enhanced abdominal CT scans from the public Liver Tumor Segmentation (LiTS) dataset were used exclusively for external hold-out validation (n = 40); model weights, thresholds, and post-processing were frozen, and identical preprocessing and per-case 3D metrics (DSC, IoU, HD95) were applied.5
2.2
Segmentation
2.2.1
Image Preprocessing
CT scans were preprocessed using the Medical Open Network for AI framework for transformations, dataset preparation, and training.29 Intensities were adjusted to the range of [−175, 250], and background regions were removed to highlight the tumors and liver. To maintain a consistent scale, the voxel spacing was standardized across all CT images. The training set was augmented by flipping across three axes, rotating by 90 degrees, zooming, and applying intensity normalization. Finally, the dataset was divided into 70% for training and 30% for validation.
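The intensity windowing and flip augmentation described above can be sketched with plain NumPy (the study itself uses the MONAI framework; the function names `preprocess_ct` and `random_flip` are illustrative, while the [−175, 250] HU bounds are the ones stated in the text):

```python
import numpy as np

def preprocess_ct(volume, lo=-175.0, hi=250.0):
    """Clip CT intensities to the stated HU window and rescale to [0, 1]."""
    v = np.clip(volume.astype(np.float32), lo, hi)
    return (v - lo) / (hi - lo)

def random_flip(volume, rng):
    """Randomly mirror the volume across each of the three spatial axes."""
    for axis in range(3):
        if rng.random() < 0.5:
            volume = np.flip(volume, axis=axis)
    return volume

rng = np.random.default_rng(0)
vol = rng.normal(40, 200, size=(8, 8, 8))   # toy volume in rough HU range
out = random_flip(preprocess_ct(vol), rng)  # values now lie in [0, 1]
```

Resampling to a common voxel spacing and the 90-degree rotation are omitted here, since both need interpolation choices the text does not specify.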
2.2.2
Segmentation Model Development
For our segmentation model, we employ the Swin UNETR architecture, one of the state-of-the-art models for abdominal segmentation, as the basis of our experiments. We selected Swin UNETR for abdominal 3D CT segmentation because, on similar datasets (e.g., BTCV), it achieved higher overall DSC than TransUNet and UNETR, as shown in the supplementary Table S2.30–33 This model combines the Swin Transformer with the UNETR framework for volumetric medical image segmentation. It treats segmentation as a sequence-to-sequence task and transforms the input images through a patch embedding layer into a sequence of embeddings. These embedded vectors pass through a hierarchical Swin Transformer encoder, which uses shifted windows and window-based self-attention to reduce computational load while preserving a global field of view.34,35 The encoder gathers features at multiple scales, capturing both detailed elements and broader context. A decoder based on a fully convolutional neural network36 integrates skip connections from various encoder stages.
Encoder:
The encoder generates patch embeddings of dimension $C$ from volumetric CT data using a single-channel input of size $H \times W \times D$. Let $\mathcal{X} \in \mathbb{R}^{H \times W \times D \times S}$ denote the input image, where $S = 1$ is the single channel of the input CT image. The encoder processes the 3D tokens using a shifted-window technique. The architecture consists of four stages, each containing two transformer blocks, for a total of eight layers.35 At the end of each stage, a patch-merging layer halves the resolution of the feature map. The first stage begins with a linear embedding layer that generates the tokens.
Let $z^{l} \in \mathbb{R}^{N \times d}$ be the output at layer $l$, where $N$ is the number of tokens and $d$ is the token dimension. The attention in consecutive blocks is given by:

$$
\begin{aligned}
\hat{z}^{l} &= \mathrm{WMSA}\left(\mathrm{LN}\left(z^{l-1}\right)\right) + z^{l-1},\\
z^{l} &= \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l},\\
\hat{z}^{l+1} &= \mathrm{SWMSA}\left(\mathrm{LN}\left(z^{l}\right)\right) + z^{l},\\
z^{l+1} &= \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1},
\end{aligned}
$$

where SWMSA is the shifted window-based multi-head self-attention module, WMSA is the window-based multi-head self-attention module, LN refers to layer normalization, and MLP is a multi-layer perceptron. Furthermore, the attention function is defined as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
$$

where $Q, K, V \in \mathbb{R}^{N \times d_k}$ are the query, key, and value matrices for one head, $N$ is the number of tokens, and $d_k$ is the embedding dimensionality of each key (and query) vector.37
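The attention function above is ordinary scaled dot-product attention applied within each local window; a minimal single-head NumPy sketch (names are illustrative, not taken from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for one head: Softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n, n) token-to-token similarities
    return softmax(scores, axis=-1) @ V    # weighted mixture of value vectors

rng = np.random.default_rng(0)
n, d_k = 4, 8                              # n tokens in one window, key dim d_k
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out = attention(Q, K, V)                   # shape (n, d_k)
```

The shifted-window variant simply re-partitions the tokens between consecutive blocks; the attention computation inside each window is identical to this.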
Decoder:
Each decoder stage doubles the resolution of the feature maps using a deconvolution layer, and the upsampled features are concatenated with those from the corresponding encoder stage. Finally, a convolutional layer assigns each voxel to one of three classes (background, liver, or tumor), and a sigmoid activation function provides a probability for each voxel's class.35
2.2.3
Self-Supervised Pretraining Protocol
SSL was performed transductively, once across HCC, ICC, and CRLM, on the full unlabeled target-domain CT corpus (N = 459), with no access to labels, patient identifiers, or dataset statistics. On the volumetric images, we applied three proxy objectives: inpainting, flipping, and contrastive learning. Encoder weights were initialized from the SSL-pretrained models and then fine-tuned on labeled data. This in-domain SSL setup has been widely adopted and is supported by prior work demonstrating improved label efficiency and downstream performance in 3D medical imaging.38–43
2.2.4
Self-supervised Learning Approach for Liver Tumor Segmentation
We used the following three self-supervised proxy tasks to cope with limited annotated data: inpainting helps the encoder fill in missing data, flipping helps it learn spatial orientations, and contrastive learning separates dissimilar sub-volumes and groups similar ones.
Inpainting Task:
We masked certain regions and extracted random sub-volumes. The encoder was trained to recover the areas hidden under the mask. Equation (4) defines the inpainting loss:

$$
\mathcal{L}_{\mathrm{inpaint}} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^{2} \tag{4}
$$

where $x_i$ is the original pixel value and $\hat{x}_i$ is the inpainted pixel value.44
Flipping Task:
Each 3D sub-volume was mirror-flipped about the axial (z) plane ($z \to -z$), the coronal (y) plane ($y \to -y$), or both planes simultaneously, yielding four possible orientations (no flip, z-flip, y-flip, yz-flip). The encoder predicts the flip label with an MLP head trained via the categorical cross-entropy loss:

$$
\mathcal{L}_{\mathrm{flip}} = -\sum_{i} y_i \log \hat{y}_i \tag{5}
$$

where $y_i$ denotes the one-hot ground-truth vector and $\hat{y}_i$ the predicted probabilities.45
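The four orientations can be generated deterministically from an integer label, which is how a flip-prediction pretext batch is typically built. A toy NumPy sketch, under the assumed axis convention z = axis 0 and y = axis 1 (the paper does not specify its array layout):

```python
import numpy as np

# Orientation labels: 0 = no flip, 1 = z-flip, 2 = y-flip, 3 = yz-flip.
FLIPS = [(), (0,), (1,), (0, 1)]

def make_flip_example(volume, label):
    """Apply the orientation indexed by `label`; return (input, target label)."""
    flipped = np.flip(volume, axis=FLIPS[label]) if FLIPS[label] else volume
    return flipped, label

vol = np.arange(27.0).reshape(3, 3, 3)   # toy sub-volume
x, y = make_flip_example(vol, label=3)   # flipped about both planes
```

During pretraining, labels are drawn at random and the encoder's MLP head is trained to recover them with the cross-entropy loss in Equation (5).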
Contrastive Learning Task:
We projected sub-volumes into a latent space: dissimilar sub-volumes were pushed apart, and similar sub-volumes were pulled closer. Equation (6) shows the contrastive loss:

$$
\mathcal{L}_{\mathrm{contrast}} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1,\, k \neq i}^{K} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \tag{6}
$$

where $\mathrm{sim}(z_i, z_j)$ is the cosine similarity, $\tau$ is a temperature parameter, and $K$ is the number of sub-volumes.46
Total Self-supervised Loss:
We combine these losses as a weighted sum, shown in Equation (7):

$$
\mathcal{L}_{\mathrm{SSL}} = \lambda_{1}\mathcal{L}_{\mathrm{inpaint}} + \lambda_{2}\mathcal{L}_{\mathrm{flip}} + \lambda_{3}\mathcal{L}_{\mathrm{contrast}} \tag{7}
$$

where the weights $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ balance the three proxy objectives.
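A NumPy sketch of the three proxy losses in Equations (4)-(7) and their weighted combination; the function names and the unit λ weights are placeholders, since the paper does not publish its loss code or weight values:

```python
import numpy as np

def inpainting_loss(x, x_hat, mask):
    """Eq. (4)-style reconstruction error, averaged over the masked voxels."""
    return float(np.mean((x[mask] - x_hat[mask]) ** 2))

def flip_loss(y_onehot, y_prob):
    """Eq. (5): categorical cross-entropy over the flip labels, batch-averaged."""
    return float(-np.sum(y_onehot * np.log(y_prob + 1e-12)) / len(y_onehot))

def contrastive_loss(z, pairs, tau=0.1):
    """Eq. (6)-style loss: pull positive pairs together, push the rest apart."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors -> cosine sim
    sim = z @ z.T / tau
    loss = 0.0
    for i, j in pairs:
        denom = np.sum(np.exp(sim[i])) - np.exp(sim[i, i])  # sum over k != i
        loss += -np.log(np.exp(sim[i, j]) / denom)
    return float(loss / len(pairs))

def total_ssl_loss(l_inp, l_flip, l_con, lambdas=(1.0, 1.0, 1.0)):
    """Eq. (7): weighted sum; these lambda values are placeholders."""
    return lambdas[0] * l_inp + lambdas[1] * l_flip + lambdas[2] * l_con

# Toy demo on synthetic data
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))                                    # 4 sub-volume embeddings
l_inp = inpainting_loss(np.ones((2, 2)), np.zeros((2, 2)), np.ones((2, 2), bool))
l_flp = flip_loss(np.eye(4)[[0]], np.full((1, 4), 0.25))       # uniform prediction
l_con = contrastive_loss(z, [(0, 1), (2, 3)])
l_tot = total_ssl_loss(l_inp, l_flp, l_con)
```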
Ablation Protocol:
The SSL process was ablated by selectively enabling the three proxy tasks above: individually, in all pairwise combinations, and in the full combined setting. The Swin UNETR encoder was fine-tuned after each pretraining variant using the same training/validation splits, hyperparameters, loss functions, and post-processing. The average segmentation results of the ablation study across all cancer types are reported in the Results section.
2.3
Classification
We used our segmentation outputs (auto-segmented tumors) for the training and evaluation of a liver tumor classifier. The auto-segmented images were processed, and two-dimensional (2D) slices containing tumors were extracted. These slices were then fed into an Inception-v3 model to classify HCC, ICC, and CRLM.
2.3.1
Image Preprocessing
Segmented abdominal CT volumes were exported as 16-bit, lossless PNG images. For each patient, the 15 axial slices intersecting the largest tumor were retained, given its prognostic relevance.47 Window level and width were optimized per study through intensity-histogram analyses. Each slice was cropped to the tumor bounding box, resized to 299 × 299 pixels, and pixel intensities were linearly rescaled to the range [0, 1]. Class imbalance was mitigated through slice-level random oversampling.
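The bounding-box crop and [0, 1] rescaling can be sketched as follows (the 299 × 299 resize is omitted because it requires an interpolation library; the function name is illustrative):

```python
import numpy as np

def crop_and_rescale(slice_2d, mask_2d):
    """Crop a 2D slice to the tumor bounding box and linearly rescale to [0, 1]."""
    ys, xs = np.nonzero(mask_2d)                       # tumor voxel coordinates
    crop = slice_2d[ys.min():ys.max() + 1,
                    xs.min():xs.max() + 1].astype(np.float32)
    lo, hi = crop.min(), crop.max()
    return (crop - lo) / (hi - lo) if hi > lo else np.zeros_like(crop)

img = np.arange(100.0).reshape(10, 10)                 # toy slice
mask = np.zeros((10, 10), bool)
mask[2:6, 3:8] = True                                  # toy tumor region
roi = crop_and_rescale(img, mask)                      # 4 x 5 crop in [0, 1]
```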
2.3.2
Classification Model Development
To classify the tumor types, an ImageNet-pretrained Inception-v3 encoder served as the feature extractor.23 The model was fine-tuned for 80 epochs with AdamW (weight decay = 1 × 10−5)48 and a cosine learning-rate schedule with a 5% warm-up to provide smooth and accelerated convergence.
The final classification head was replaced by a three-layer multilayer perceptron (1024→256→3) with ReLU activations and a dropout layer. Training used cross-entropy loss with label smoothing, and stochastic weight averaging was activated to improve generalization. At inference, the slice-level class probabilities were aggregated to generate the patient-level tumor type. Throughout, we performed five-fold patient-level cross-validation stratified by tumor type: in each fold, 80% of patients were used for training and 20% were held out for testing. No patient or slice appeared in both sets, and tumor ROIs were generated once.
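A simplified stand-in for the patient-level stratified five-fold split described above (the study likely uses a library routine; this round-robin version only illustrates the key invariant that no patient appears in more than one fold):

```python
import numpy as np

def stratified_patient_folds(labels, k=5, seed=0):
    """Deal shuffled patient indices of each tumor type round-robin into k folds."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = np.flatnonzero(np.asarray(labels) == cls)
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(int(i))
    return folds

# Cohort sizes from the text: 155 HCC, 107 ICC, 197 CRLM patients.
labels = ["HCC"] * 155 + ["ICC"] * 107 + ["CRLM"] * 197
folds = stratified_patient_folds(labels)
```

Each held-out fold then plays the 20% test role while the remaining four folds are used for training, so slices from a test patient never leak into training.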
2.4
Evaluation Metrics
For segmentation, we used the DSC, the HD95, and the intersection over union (IoU). For classification, we measured accuracy, precision, recall, F1-score, and area under the receiver operating characteristic (ROC) curve (AUC).49–52
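DSC and IoU are simple voxel-overlap ratios; a per-case NumPy sketch (HD95 is omitted because it requires surface-distance computation, and the function name is illustrative):

```python
import numpy as np

def dice_iou(pred, gt):
    """Per-case Dice similarity coefficient and intersection over union."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    dsc = 2.0 * inter / total if total else 1.0   # empty-vs-empty counts as perfect
    iou = inter / union if union else 1.0
    return float(dsc), float(iou)

pred = np.zeros((4, 4, 4), bool); pred[:2] = True   # 32 predicted voxels
gt = np.zeros((4, 4, 4), bool); gt[1:3] = True      # 32 true voxels, 16 overlap
dsc, iou = dice_iou(pred, gt)                       # -> DSC 0.5, IoU ~ 0.333
```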
Code availability
All custom scripts used in this study will be released in a public GitHub repository upon final acceptance of this manuscript.
3
Results
3.1
Segmentation Performance
The SSL-pretrained model outperformed the baseline model across all evaluation metrics. Averaged over all cancer types, liver DSC increased by 7.6 percentage points (from 83.0% to 90.6%), and tumor DSC by 8.9 points (from 68.7% to 77.6%), while liver HD95 decreased by 32.97 mm and tumor HD95 by 3.2 mm. The greatest absolute gain was observed in the CRLM cohort, where liver DSC reached 96% and HD95 fell by 46 mm. Detailed results are provided in Table 1. To assess generalization, we evaluated the trained model on an independent public dataset using identical preprocessing and metrics. Performance was consistent with our internal cohort, with liver DSC 91.0%, IoU 83.0%, HD95 29.67 mm, and tumor DSC 73.0%, IoU 60.0%, HD95 30.98 mm.
3.1.1
Ablation of Self-Supervised Proxy Objectives
Table 2 shows the effect of each proxy objective on segmentation performance. The combined SSL yielded the best overall performance averaged across liver cancer types relative to a supervised baseline without pretraining (liver: DSC 89.7%, IoU 83.3%, HD95 19.92 mm; tumor: DSC 74.7%, IoU 62.7%, HD95 24.92 mm). The pairing of contrastive learning and inpainting provided most of the overlap gains (liver: DSC 88.6%, IoU 82.0%, HD95 36.84 mm; tumor: DSC 74.1%, IoU 62.0%, HD95 25.73 mm). The flipping objective strongly reduced tumor boundary error when used alone (tumor HD95 19.31 mm, an 8.80 mm improvement over the baseline), consistent with improved orientation and edge sensitivity; it also reduced tumor HD95 when combined with inpainting. In contrast, contrastive-only pretraining underperformed the baseline on tumor overlap (DSC 63.1%, IoU 50.4%) and increased tumor HD95 (30.97 mm).
3.2
Classification Performance
Five-fold cross-validation (stratified 80% training, 20% validation per fold) on our internal cohort produced an overall patient-level accuracy of 0.96 (95% CI, 0.92 to 0.99) and an AUC of 0.98 (95% CI, 0.96 to 1.00). Performance remained uniformly high across tumor types: ICC achieved the best aggregate metrics, closely followed by HCC, while CRLM showed slightly lower recall despite comparable precision, consistent with its more heterogeneous imaging appearance. Full results are shown in Table 1.
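The stratified five-fold protocol can be sketched as follows (a simplified illustration; the exact shuffling and any per-patient grouping used by the authors are assumptions):

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    # Deal each class's shuffled indices round-robin across the k folds.
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val
```

With k = 5 each fold holds roughly 20% of every class for validation, matching the stratified 80/20 split described above.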
Figure 3 presents the row-normalized confusion matrix. The pronounced diagonal and sparse off-diagonal elements indicate that misclassifications are rare and that class overlap is minimal. Figure 4 overlays the one-versus-rest ROC curves, where all curves closely track the upper-left boundary and remain well above the chance line.
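Row normalization of the kind used in Figure 3 simply divides each row of raw counts by its true-class total, so the diagonal reads directly as per-class recall; a minimal sketch:

```python
def row_normalized_confusion(y_true, y_pred, classes):
    """Confusion matrix with rows = true class, columns = predicted class;
    each nonempty row is scaled to sum to 1."""
    idx = {c: i for i, c in enumerate(classes)}
    n = len(classes)
    m = [[0.0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    for row in m:
        s = sum(row)
        if s:
            for j in range(n):
                row[j] /= s
    return m
```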
4
Discussion and Conclusion
This work proposes an end-to-end pipeline for liver tumor segmentation and classification in CT images. The pipeline adapts learned representations to the segmentation task by pretraining on unlabeled data, thereby decreasing dependence on large annotated datasets. Compared with the standard supervised baseline, the proposed pipeline consistently improves DSC and IoU while generally reducing HD95. However, the occasional rise in HD95 for ICC highlights that tissue heterogeneity, partial-volume effects, and limited annotations can still impede accurate boundary detection in lesions with irregular shapes. Consistent with this pattern, external hold-out testing achieved comparable DSC and IoU, while HD95 increased by 9.75 mm for liver and 6.06 mm for tumor. The loss ablation of pretraining objectives further showed that contrastive and inpainting together recover most of the DSC and IoU gains of the full model, while flipping alone yields the largest reduction in tumor boundary error (an 8.80 mm decrease in average tumor HD95 relative to the baseline).
For classification, five-fold cross-validation yields an overall patient-level accuracy of 0.96 (95% CI, 0.92 to 0.99) and an AUC of 0.98 (95% CI, 0.96 to 1.00), as shown in Table 1. HCC and ICC both achieve high accuracies, potentially because their inherent imaging signatures create distinctive attenuation and texture patterns on the largest tumor slice, which the classifier can reliably distinguish. By contrast, CRLM shows the highest precision (0.94; 95% CI, 0.84 to 1.00) but the lowest recall (0.86; 95% CI, 0.74 to 0.97), suggesting that the model applies the metastatic label conservatively and therefore misses atypical presentations, such as very small micronodular or necrotic deposits.
Fold-to-fold variability is greatest for CRLM recall and ICC precision as both correlate with changes in patient-wise HD95. This confirms that segmentation quality directly conditions downstream classification confidence. Boundary-aware losses, such as surface DSC or Hausdorff penalties, combined with focal loss in the classifier could therefore increase sensitivity to metastases without sacrificing high specificity for primary tumors.
Figure 4 shows that the ROC curves for HCC and ICC hug the upper-left boundary across almost the entire threshold range (AUC = 0.986 and 0.996, respectively). The CRLM curve (AUC = 0.963) matches this pattern until the false-positive rate drops below 0.02, where a steeper descent and a visibly broader 95% bootstrap band emerge. These features point to a small subset of borderline metastatic cases in which the classifier sacrifices sensitivity to preserve very high specificity.
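Bootstrap bands of the kind shown in Figure 4 come from resampling patient-level predictions with replacement and recomputing the metric on each replicate; a percentile-bootstrap sketch for accuracy (the resampling scheme and replicate count here are assumptions, not the authors' exact procedure):

```python
import random

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for accuracy from per-patient 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    # Accuracy of each bootstrap replicate, sorted for percentile lookup.
    stats = sorted(
        sum(rng.choice(correct) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The same resampling loop applied to per-point AUC estimates yields the per-threshold confidence band around an ROC curve.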
Supplementary Figure S2 shows that two CRLMs were incorrectly predicted as HCC. Class-activation map (CAM)53 overlays concentrate saliency on the bright peripheral rim visible in the portal-venous phase, while largely ignoring the heterogeneous core and irregular contour that typify CRLM. At strict decision thresholds, this single-cue bias leads to HCC false positives. This error may be mitigated by providing multiphase inputs, adding shape-aware losses, or regularizing attention to include intra-tumoral texture.
Compared to similar studies, the averaged AUC exceeds the 0.93–0.95 range reported by Midya et al.21 Their Inception-v3 model used 814 CT scans with radiologist-drawn regions of interest, while our study used 459 scans whose tumors were automatically segmented by an SSL-enhanced transformer. These findings show that self-supervised pretraining can offset a smaller dataset and eliminate the need for manual contours, supporting scalable, operator-independent decision support.
The use of automated segmentation and classification can expedite treatment workflows by reducing the reliance on time-intensive manual annotation. High segmentation fidelity allows accurate tumor-volume estimation, while reliable classification guides personalized therapy and follow-ups. If a misclassification does occur it can be flagged for expert review or refined through post-processing. Using large unlabeled datasets, the approach captures generalizable liver and tumor representations, improving cross-site consistency.
This study has limitations. Tumor-type classification was evaluated with internal five-fold cross-validation; segmentation accuracy was evaluated on an external LiTS cohort, but classification was not, because the LiTS data lack tumor-type labels. The generalizability of the classifier therefore needs to be evaluated with external data. Comparisons with other segmentation architectures, such as nnU-Net and TransUNet, were not conducted in our experiments, as we chose our backbone based on the reported superiority of Swin UNETR.30,31,35 In addition, the self-supervised encoder was pretrained transductively: although no labels were exposed during SSL and only image intensities were used, prior in-domain exposure to all cases could lead to slightly optimistic internal performance estimates compared with an externally SSL-pretrained model. Future work should therefore explore external SSL pretraining. Collectively, by bridging the gap between fully supervised approaches and real-world clinical needs, this pipeline facilitates faster, more reliable liver tumor characterization and could be adapted for diverse oncologic applications.
Supplementary Material
PV Data Supplement