Deep multimodal state-space fusion of endoscopic-radiomic and clinical data for survival prediction in colorectal cancer.

Wang N, Lin J, Li W, Lyu Y, Jiang Y, Ni Z

Cite this paper

APA Wang N, Lin J, et al. (2025). Deep multimodal state-space fusion of endoscopic-radiomic and clinical data for survival prediction in colorectal cancer. NPJ Digital Medicine, 8(1), 801. https://doi.org/10.1038/s41746-025-02236-3
MLA Wang N, et al. "Deep multimodal state-space fusion of endoscopic-radiomic and clinical data for survival prediction in colorectal cancer." NPJ Digital Medicine, vol. 8, no. 1, 2025, pp. 801.
PMID 41476131 ↗

Abstract

Integrating complementary surface and cross sectional cues is central to preoperative assessment of colorectal cancer, but technically challenging because endoscopic images and pelvic CT encode anatomy at different scales. Here we present HydraMamba, a multimodal selective state space framework that fuses endoscopy and CT for joint lesion segmentation, lesion detection, and survival prediction. The model couples a shared state space backbone with two lightweight modules. Across the endoscopic dataset and the CT dataset, HydraMamba achieved state-of-the-art lesion analysis (endoscopy: Dice 0.856, F1 0.918; CT: Dice 0.812, F1 0.888) and delivered calibrated survival modeling on the CT dataset (Harrell's C index 0.832, Uno's C@1y 0.853, integrated Brier score 0.161, calibration slope ≈1.01). By unifying endoscopic and CT information in a single coherent architecture, HydraMamba provides an accurate and well-calibrated foundation for lesion analysis and prognostication in colorectal cancer.

Introduction

Colorectal cancer remains a significant health burden, with rectal cancer as a major contributor to morbidity and mortality1. Effective management of colorectal tumors relies on accurate lesion characterization and staging, which increasingly benefit from advanced medical imaging and artificial intelligence. In clinical practice, endoscopy (colonoscopy) provides high resolution visualization of mucosal lesions, whereas cross-sectional imaging such as computed tomography (CT) delineates deeper anatomical context and distant spread. These modalities offer complementary information; however, traditionally they have been analyzed in isolation. There is a growing consensus that multimodal learning can unlock synergistic value from heterogeneous data in oncology2. For instance, integrating endoscopic and radiologic data has been shown to improve diagnostic accuracy and prognostic modeling. In gastric cancer, a deep learning model that fuses CT scans with endoscopy images achieved an area under the ROC curve (AUC) of 0.93 in predicting pathological stage, significantly outperforming models using either modality alone3. Similarly, combining radiologic features with digitized pathology has yielded robust predictors of patient outcomes; a recent multimodal model integrating CT-based and histopathology-based deep features (along with clinical factors) attained high concordance indices (0.74) in survival prediction for head and neck cancer4–6.
At the same time, deep learning has revolutionized image-based detection and segmentation of lesions in gastroenterology and oncology. Convolutional neural networks and their variants now approach expert level performance in localizing tumors or precancerous polyps on medical images. In colonoscopy, numerous studies have demonstrated automated polyp detection and delineation systems powered by deep learning, which can assist endoscopists by reducing miss rates and providing real-time decision support7. On radiologic scans, AI algorithms excel at highlighting tumors and metastases that might be subtle to human observers8. These advances are not merely academic—they translate into practical tools that improve diagnostic precision and workflow efficiency. Nevertheless, challenges remain in generalizing these models across diverse patient populations and imaging devices7. Robust performance in lesion detection/segmentation requires large, diverse datasets and architectures capable of capturing both fine grained local details and global context.
Beyond diagnosis, there is an urgent need for methods that can leverage imaging data to predict clinical outcomes such as treatment response and survival. Traditional risk stratification in colorectal cancer hinges on clinicopathological factors, but imaging derived biomarkers offer a noninvasive window into tumor biology and patient prognosis. Recent work demonstrates that deep learning models can extract prognostically relevant features from medical images that sometimes rival or even surpass classical risk factors in predictive power. In gastric cancer, for example, a CT-based deep learning survival score was shown to stratify patients into high vs. low risk groups with predictive performance comparable to postoperative pathological staging (5-year disease free survival AUC ≈ 0.71)9. In bladder cancer, an interpretable deep neural network trained on preoperative CT scans significantly outperformed standard clinical models in forecasting overall survival, and its risk predictions remained independently prognostic in multivariate analysis10.
Multimodal Learning in Medical Imaging: Harnessing multiple data sources for more informed decision making has been a prominent theme in recent medical AI research. Multimodal learning approaches seek to combine modalities such as medical images, clinical data, and omics in a unified model. In the context of gastrointestinal cancers, researchers have explored fusing endoscopic images with radiological scans to compensate for the limitations of each modality. The study by Zhang et al.3 on gastric cancer is a representative example: the authors developed an AI model that inputs both CT images and gastroscopy photographs to predict pathological staging. Their integrated model significantly outperformed single modality models, confirming that cross modal synergies can boost diagnostic accuracy. Beyond imaging, integration of pathology and radiology has shown promise. Qian et al. combined deep features from CT scans and whole slide histology images using a Cox proportional hazards model, improving prognostic predictions in head and neck cancer4. In colorectal cancer, there is increasing interest in merging colonoscopy findings with imaging and even genomics to refine risk assessment11. These works collectively illustrate that multimodal fusion can capture aspects of disease biology that might be missed by unimodal analysis alone. However, designing architectures that effectively align and fuse heterogeneous data remains challenging. Simple late fusion may not fully exploit inter-modal correlations, whereas early fusion (concatenating raw inputs) can be intractable due to differing data dimensions. This has motivated research into attention mechanisms and cross modal transformers that learn shared representations. Our proposed method contributes to this line of work by using a state space model backbone to naturally accommodate sequential endoscopic data alongside volumetric imaging, with an explicit token matching mechanism to align anatomical regions across modalities.
Deep Learning for Lesion Detection and Segmentation: Automated detection and segmentation of lesions in medical images is a well-established application of deep learning, with numerous successes reported in recent years. In gastrointestinal endoscopy, AI-driven polyp detection systems have attracted particular attention as they can alert endoscopists to subtle lesions in real time and reduce miss rates. Ali et al. (2024) organized a large-scale colonoscopy computer vision challenge to assess the generalizability of polyp detection/segmentation models7. Top-performing algorithms, often based on convolutional neural networks or transformer hybrids, achieved high accuracy on the challenge dataset, though the study underscored the need for better robustness across different centers and imaging conditions. The general trend is that deep learning models can now outperform traditional computer-aided detection methods by a substantial margin, given sufficient training data. For CT and MRI, organ and tumor segmentation has been revolutionized by 3D CNNs and variants of the U-Net architecture. For example, Chen et al. demonstrated that incorporating masked image modeling pre-training significantly improves 3D tumor segmentation in volumetric scans8.
Foundation models like the Segment Anything Model (SAM)12 have also been adapted to medical images, showing encouraging results in zero-shot or few-shot segmentation of various lesion types (as evidenced by recent Nature Communications reports on MedSAM13). Nevertheless, challenges such as class imbalance, small lesion size, and boundary ambiguity persist. Our work addresses some of these issues through token anatomy-aware interpolation that emphasizes consistent localization of lesions across modalities, potentially improving both detection and segmentation. By leveraging endoscopy, which directly visualizes mucosal lesions, alongside CT, which provides context for external invasion and lymph nodes, our model aims for more comprehensive lesion delineation in colorectal cancer. This could aid surgeons and oncologists in treatment planning (e.g., identifying all tumor foci preoperatively).
Survival Modeling and Outcome Prediction: Predicting patient outcomes using imaging data has become an important research area, especially with the rise of radiomics and deep learning prognostic models. Conventional radiomics involves handcrafting quantitative features from images and correlating them with outcomes, whereas deep learning can automatically learn complex feature representations optimized for outcome prediction. Several high-profile studies have validated the power of imaging-based predictors. In a 2023 Nature Communications study, Jiang et al. developed a multitask deep learning model for gastric cancer that simultaneously classified the tumor microenvironment (TME) subtype and predicted survival from CT images9. In bladder cancer, a recent multicenter study in NPJ Precision Oncology trained a deep neural network on contrast-enhanced CT to predict overall survival; the model's performance (validated across four cohorts) demonstrated that deep learning prognostic models can generalize and provide interpretability via techniques like SHAP (highlighting image regions predictive of poor outcome)10. These advances suggest that noninvasive imaging, when coupled with artificial intelligence, can uncover subtle prognostic indicators that human observers might overlook. For colorectal cancer, outcome prediction is particularly relevant in the context of neoadjuvant therapy: identifying which patients will achieve a complete response to chemoradiation could spare them unnecessary surgery. Prior works have explored MRI-based radiomics for predicting pathological complete response in colorectal cancer, and more recent efforts use deep learning on MRI or CT to improve upon those models. Yet, one limitation in many studies is that they rely on single modalities. Our approach, by integrating endoscopic imagery with CT, has the potential to enhance outcome modeling, since endoscopy might capture tumor phenotype associated with aggressiveness, while CT captures anatomical stage. We incorporate a survival analysis component in our framework (through a Cox proportional hazards layer on top of learned features), and we will demonstrate that the fused feature representations from our multimodal state space model yield superior risk stratification compared to using either modality alone.
State Space and Sequence Models in Vision: The backbone of our proposed architecture aligns with a growing movement in computer vision towards sequence modeling architectures, including transformers and state space models. While convolutional neural networks have long dominated imaging tasks, their local receptive fields and inductive biases can be suboptimal for modeling long-range dependencies or globally contextual information. Transformers14, with their self-attention mechanism, were introduced to vision (Vision Transformers, ViT15) to directly model relationships between distant patches, and quickly proved effective for image recognition and segmentation. However, as noted earlier, transformers incur high computational cost for large images or long sequences, and can be data hungry. Structured state space models (SSMs)16 emerged as an appealing alternative for certain vision tasks, drawing from sequence modeling successes in NLP. SSMs use linear recurrent layers parameterized by state matrices that can be efficiently computed via convolution, enabling them to handle very long sequences with linear scaling17. Our design of a global context injection module is partly inspired by this idea of hybridizing attention with state space kernels. We apply a similar principle not for super resolution, but for the multimodal fusion problem, ensuring our model can grasp global correspondences.
Masked Image Modeling and Self-Supervised Pretraining: Self-supervised learning via masked image modeling (MIM) has become a cornerstone of modern computer vision, following the success of BERT style masked language modeling in NLP. The idea is to mask out parts of the input image and train a model to reconstruct the missing content, thereby forcing the network to learn rich internal representations of the visual data. Methods like MAE (Masked Autoencoder) demonstrated that MIM pretraining can substantially improve downstream task performance by capturing high-level semantic features. Until recently, however, MIM had been relatively underexplored in the medical imaging domain8. This is now rapidly changing. Chen et al. (WACV 2023) showed that MIM pretraining on 3D medical images boosts performance on tasks like tumor segmentation, especially when annotated data are limited8. Moreover, researchers have begun tailoring MIM strategies to domain-specific architectures. The work MambaMIM (Tang et al., MedIA 2025) is particularly relevant to us: it was the first attempt to pretrain a Mamba state space model using masked image modeling18. MambaMIM introduced a novel token interpolation strategy (TOKI) to maintain consistency across the state dimensions during pretraining, and employed a hierarchical decoder to learn multi-scale features. This approach proved effective, yielding notable improvements on multiple medical segmentation benchmarks18.
Collectively, the progress in multimodal data integration, image-based lesion analysis, and prognostic modeling sets the stage for next-generation AI systems in colorectal cancer care. However, realizing this vision demands new architectures that can seamlessly handle multimodal, high-dimensional medical data and capture both the spatial context of images and the sequential nature of, say, endoscopic video frames or longitudinal scans. Convolutional networks have dominated medical image analysis to date, but they may struggle to model long-range dependencies or cross-modal relationships explicitly. Transformer-based architectures offer an alternative with global attention, yet vanilla self-attention has limitations: quadratic complexity with respect to sequence length and a restricted input size, making it less ideal for extremely high resolution images or long image sequences17. In response, researchers have begun exploring state space models (SSMs) and other sequential modeling approaches in vision. SSMs, inspired by classical state space systems, combine aspects of recurrent neural networks and convolutional filters to achieve linear or near-linear scaling in sequence length while maintaining the ability to learn long-range interactions17. A notable example is the Mamba architecture, a selective state space model that has achieved state-of-the-art results across modalities, including text, audio, and images17.
We introduce HydraMamba, a unified multimodal state space framework for comprehensive colorectal cancer analysis. HydraMamba aggregates endoscopic frames and pelvic CT volumes within a shared selective Mamba backbone and adds two targeted modules that supply complementary inductive biases: Anatomy-Aware Token Interpolation reconstructs masked anatomical tokens via geometry- and uncertainty-aware interpolation, while Anatomical Prototype State Injection derives patient-specific prototypes from the shared anatomical space and injects global context through a structured low-rank update. A disentangled dual-stream design separates modality-invariant anatomy from modality-specific style, followed by late fusion to drive task heads for segmentation, detection, and survival risk modeling.

Results

Dataset
We conducted our study using six publicly available datasets: PolypGen, CVC-ColonDB, StageII-Colorectal-CT, CT COLONOGRAPHY (ACRIN 6664), TCGA-READ, and TCGA-COAD. For each dataset, we report only details provided by the dataset custodians.
PolypGen: PolypGen19 is a multi-centre colonoscopy dataset released on 16 November 2022. Data were collected from six medical centres and include both still image frames and short video sequence data from more than 300 unique patients. The released set comprises 1537 single frames and 2225 sequence frames. Pixel-level polyp segmentation annotations are provided and were verified by six senior gastroenterologists. The dataset also includes negative sequences (normal mucosa) and describes an annotation protocol designed to minimise heterogeneity across centres. The current public release does not define train/test splits. Access is provided via Synapse under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.
CVC-ColonDB: CVC-ColonDB20 is a white-light colonoscopy still-image dataset curated by the Computer Vision Center (CVC) as part of its CVC-Colon resources. It provides 300 images at 574 × 500 resolution with pixel-level polyp masks, extracted from 13 polyp video sequences acquired from 13 patients; background (mucosa/lumen) masks are also included.
StageII-Colorectal-CT: This TCIA collection contains abdominal or pelvic enhanced CT images acquired within 10 days before surgery from 230 patients with stage II colorectal cancer21. Inclusion criteria require radical surgery for colorectal cancer with stage II confirmed by histology and pathology, and completion of abdominal or pelvic contrast-enhanced CT within 10 days pre-operatively; exclusion criteria include preoperative therapy, synchronous malignancy, or death within one month from surgical complications. CT acquisitions were performed on Sensation 64 (Siemens Healthcare) or Brilliance (Philips Healthcare) scanners with 120 kV tube voltage, 200 mA tube current, 5 mm slice thickness, pitch 1.4 or 0.9, and a 512 × 512 matrix. Iodinated contrast (80–100 mL ioprolamine) was injected at 2–3 mL/s; enhanced images were collected 65–75 s after injection. Images are distributed as CT DICOM and de-identified. The collection comprises 230 subjects, 230 studies, 230 series, and 13850 images. The current version (v2) repaired a byteswap error in pixels; the dataset is released under CC BY 4.0. Download requires the NBIA Data Retriever.
CT COLONOGRAPHY (ACRIN 6664): The National CT Colonography Trial (ACRIN 6664)22 collection contains 825 screening CT colonography cases with accompanying spreadsheets that provide polyp descriptions and their locations within colon segments. The primary objective of the trial was to clinically validate the use of CT colonography in a screening population for detecting colorectal neoplasia. The TCIA distribution includes prone and supine DICOM images; supporting spreadsheets list 35 cases with at least one polyp ≥10 mm and 69 cases with 6–9 mm polyps, and identify 243 same-day validated negative cases. Data are provided as CT DICOM and include 825 subjects, 836 studies, 3451 series, and 941,771 images. The collection is released under a CC BY 3.0 licence. Download requires the NBIA Data Retriever.
TCGA-READ: The TCGA Rectum Adenocarcinoma (READ)23 collection on TCIA provides radiological images for TCGA subjects, enabling linkage of imaging phenotypes with clinical, genetic, and pathological data available through the Genomic Data Commons. Images were generally acquired as part of routine care. The TCIA distribution contains MR and CT DICOM images for 3 subjects, spanning 4 studies, 34 series, and 1796 images. The current version notes updated links to clinical and biomedical spreadsheets in the GDC. The collection is released under CC BY 3.0; download requires the NBIA Data Retriever.
TCGA-COAD: The TCGA Colon Adenocarcinoma (COAD)24 collection on TCIA similarly links radiological imaging with TCGA identifiers. The TCIA distribution provides CT and OT DICOM images for 25 subjects (32 studies, 93 series, 8387 images). The current version notes updated links to clinical and biomedical spreadsheets in the GDC. The collection is released under CC BY 3.0; download requires the NBIA Data Retriever.

Cohort design and data alignment
All experiments used only the public datasets enumerated in the "Dataset" section: PolypGen and CVC-ColonDB for endoscopy, and StageII-Colorectal-CT, ACRIN 6664 CT colonography, and the TCGA-COAD/READ CT subsets for cross-sectional imaging and clinical linkage. Table 1 summarises modality, subject counts, and available annotations for each source. Endoscopic images and pelvic CT volumes were treated as unpaired. Subjects were never linked across modalities, and no attempt was made to match an endoscopic case to a CT case. Cross-modal fusion in HydraMamba therefore relies on representation alignment rather than subject pairing, and survival supervision is drawn only from CT cohorts with linked outcomes.
For endoscopic polyp segmentation and detection, the data sources were PolypGen and CVC-ColonDB, which provide still frames and sequence frames with pixel-level polyp masks; negative frames from PolypGen were included. The unit of analysis was individual frames. Ground truth consisted of the pixel masks released by the dataset custodians; detection boxes were computed as the tight axis-aligned bounding box enclosing each mask. For split construction, all frames from a given patient and, when applicable, from the same video sequence were grouped and assigned to a single split to prevent leakage across partitions. The same endoscopy splits were used for both segmentation and detection heads.
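As a concrete illustration of the box-derivation step, the following minimal sketch (not the authors' code) computes the tight axis-aligned bounding box enclosing a binary lesion mask:

# Illustrative sketch: deriving a tight axis-aligned bounding box from a
# binary lesion mask, as used here to create detection targets from masks.
import numpy as np

def mask_to_box(mask: np.ndarray):
    """Return (x_min, y_min, x_max, y_max) of the tight box enclosing a 2D binary mask,
    or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 3:7] = 1
print(mask_to_box(mask))  # (3, 2, 6, 4)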
For CT tumor segmentation and lesion detection, the data sources were Stage II-Colorectal-CT as the primary CT cohort, TCGA-COAD/READ for additional CT series, and ACRIN 6664, used only for CT colonography detection experiments. The unit of analysis was 3D CT studies for training and evaluation, with tokenisation performed per axial slice; metrics are reported at the case level after aggregating slice predictions. Ground truth for segmentation used the subset of CT studies with available tumor annotations; detection boxes were either derived from tumor masks (minimal enclosing boxes) or, for ACRIN 6664, constructed from the official spreadsheets that list polyp presence and colon segment location, which were mapped to imaging coordinates. For split construction, CT studies were split at the patient level, and all series from the same subject reside in a single split. The identical CT subject partitions were reused across CT segmentation and CT detection so that a subject cannot appear in training for one CT task and testing for another.
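The patient-level grouping used for all splits can be realised with a standard grouped splitter; the sketch below is illustrative only (not the authors' code), with hypothetical frame names and patient identifiers:

# Illustrative sketch: building patient-level, leakage-safe splits with
# scikit-learn's GroupKFold so that all frames/slices of one subject stay
# in a single partition. Frame paths and patient IDs are hypothetical.
from sklearn.model_selection import GroupKFold

frames = ["p01_seq1_f001.png", "p01_seq1_f002.png", "p02_f001.png",
          "p03_f001.png", "p03_f002.png", "p04_f001.png"]
patients = ["p01", "p01", "p02", "p03", "p03", "p04"]  # group key per frame

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(frames, groups=patients)):
    train_patients = {patients[i] for i in train_idx}
    test_patients = {patients[i] for i in test_idx}
    assert train_patients.isdisjoint(test_patients)  # no patient appears in both splits
    print(f"fold {fold}: train={sorted(train_patients)}, test={sorted(test_patients)}")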
For survival prediction, the data sources were CT subjects with linked clinical follow-up drawn from Stage II-Colorectal-CT and from TCGA-COAD/READ through the GDC clinical spreadsheets referenced by TCIA; ACRIN 6664 was not used because the trial distribution does not include longitudinal outcomes. The unit of analysis was patient-level time-to-event with right censoring, and the time origin was the date of the index CT. Survival supervision was applied only to patients with imaging and outcome metadata. Split construction and alignment followed patient-level k-fold cross-validation on the CT cohort, as described in “Training protocol and hyperparameters,” with the same CT folds shared with CT segmentation and detection; within each training fold, a validation subset was carved out at the subject level for hyperparameter selection. Endoscopic images did not carry survival labels and were never used to define survival targets.
Subject overlap and leakage safeguards were enforced across all tasks, with partitions subject-disjoint. A subject never contributed data to more than one of train, validation, and test for any task. Because modalities are unpaired and originate from independent repositories with distinct anonymised identifiers, no individual contributed to both the endoscopy and CT cohorts. Within the CT cohort, a subject could contribute to multiple task heads only when the required labels existed, and then always within the same fold and split to avoid cross-task leakage. The shared backbone was trained with multi-task losses on the union of available labels, while the survival head consumed only the CT patient embedding and its own time-to-event label.
Practical implications for reproducibility are as follows. The policy ensures that endoscopy and CT are fused through aligned representations rather than subject pairing; reported segmentation and detection metrics are free of frame, slice, or patient leakage; and survival estimates are conditioned solely on CT information from patients with outcome metadata, with no direct or indirect transfer of subject identity across splits or modalities. The splitting protocol used for survival, patient-level k-fold on CT, is stated in “Training protocol and hyperparameters” and was held fixed for all survival analyses.

Comprehensive SOTA comparisons
We compare HydraMamba to recent state-of-the-art models on all three task dimensions: lesion segmentation, lesion detection, and survival outcome prediction. Despite the diversity of modalities and tasks, HydraMamba achieves top-tier performance across the board, often with modest but consistent improvements over previous methods.
For both CT and endoscopy, we constructed a pooled dataset by combining all samples from the datasets enumerated in Table 1; no additional sources were used.
Segmentation: The quantitative evaluation of segmentation performance was conducted on test sets derived from both endoscopic and computed tomography (CT) data, with comprehensive results presented in Tables 2 and 3, respectively. On the endoscopic dataset, HydraMamba achieved a Dice Similarity Coefficient (DSC) of 0.856 and an Intersection-over-Union (IoU) of 0.748. This level of accuracy surpasses that of numerous contemporary baselines, including specialized segmentation architectures such as Viewpoint-Aware (2024) and PraNet-V2 (2025), which recorded DSC scores of 0.835 and 0.831, respectively. The proposed model also demonstrated superior performance compared to joint segmentation-detection frameworks, outperforming MedSAM-2 (2024), which obtained a DSC of 0.841. For the task of tumor segmentation on the CT dataset, HydraMamba yielded a DSC of 0.812 and an IoU of 0.683. This result indicates a clear performance advantage over highly competitive benchmarks, including the widely adopted nnUNet framework (DSC 0.805) and another recent state space model, SegMamba (2024), which achieved a DSC of 0.785. The model’s segmentation accuracy on CT data was also higher than that of joint models like MedSAM-2 (DSC 0.801). Collectively, these findings affirm the robust and superior segmentation capabilities of the proposed architecture across fundamentally different imaging modalities.
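For reference, the two overlap metrics reported above are deterministically related (Dice = 2·IoU/(1 + IoU)); a minimal sketch of their computation on binary masks:

# Illustrative sketch of the reported overlap metrics: Dice similarity
# coefficient (DSC) and intersection-over-union (IoU) for binary masks.
import numpy as np

def dice_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou

# Note the fixed relationship Dice = 2*IoU / (1 + IoU): a Dice of 0.856
# corresponds to an IoU of about 0.748, matching the paired values in the text.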
Detection: For the lesion detection task, the model’s performance was evaluated on both endoscopic and CT test sets, with quantitative results presented in Table 2 and Table 3, respectively. On the endoscopic data, the proposed model achieved a precision of 0.905, a recall of 0.931, and an F1 score of 0.918. This result exceeds the performance of several specialized detection baselines, including the top entry from the PolypGen Challenge (F1 score 0.873) and general object detectors like YOLOv8 (F1 score 0.868). The model also demonstrated superior performance compared to joint segmentation-detection frameworks such as MedSAM-2, which obtained an F1 score of 0.875.
On the CT test set, the model attained a precision of 0.885, a recall of 0.891, and an F1 score of 0.888. This level of accuracy is higher than that of other modern detectors, including the YOLOv13 baseline (F1 score 0.870) and the specialized medical imaging model CCDNet (F1 score 0.850). Furthermore, the model’s detection capability on CT data surpassed that of joint models like MedSAM-2, which recorded an F1 score of 0.865.

Survival prediction
On the CT internal test set, HydraMamba achieved the highest discriminative performance, with Harrell’s concordance of C = 0.832 and Uno’s time-dependent concordance at 1 year of C(t) = 0.853 (Table 4). Absolute risk estimates were accurate, with the lowest integrated Brier score (IBS = 0.161) and a calibration slope closest to unity (1.01), indicating near-perfect agreement between predicted and observed risks. Compared with classical Cox models built on handcrafted CT radiomics (C = 0.707, IBS = 0.197), HydraMamba improved concordance by +0.125 and reduced prediction error by ≈18% (IBS), underscoring the value of end-to-end feature learning over hand-engineered descriptors. Relative to a strong deep learning baseline, DeepSurv (C = 0.726, IBS = 0.188)25, HydraMamba delivered a substantial gain in discrimination (+0.106) and an ≈14% reduction in IBS, with markedly better calibration (slope 1.01 vs. 1.14).
A naïve transformer baseline that pools CT and endoscopy features (ViT pooling) reached C = 0.780 and IBS = 0.174, whereas a Vision-Mamba backbone without our modules attained C = 0.777 and IBS = 0.175. HydraMamba surpassed both while yielding the best calibration, supporting two design hypotheses: cross-modal fusion conditioned on patient-specific prototypes improves survival modeling, and state-space encoders with anatomy-aware tokenization capture long-range, disease-extent cues that are not fully exploited by late pooling alone. In the context of colorectal cancer, these findings align with prior reports that imaging-driven deep models can match or exceed clinicopathologic baselines for outcome prediction26.
Across recent transformer/Mamba survival models spanning diverse modalities (WSI+omics, MRI+clinical, pathology+genomics), reported concordance values generally fall in the 0.75–0.81 range with IBS of 0.167–0.182 (Table 4). HydraMamba’s C = 0.832 and IBS = 0.161 are therefore competitive with, and in several cases exceed, contemporary imaging and multimodal architectures, while maintaining superior calibration. Results are also consistent with large-cohort studies demonstrating the prognostic value of CT-based deep models10.
The combination of higher concordance, lower IBS, and a calibration slope near 1 suggests that HydraMamba can both rank patients by risk and provide reliable absolute risk estimates at clinically actionable horizons. In practice, these features facilitate stratifying patients for adjuvant therapy and tailoring surveillance intensity. Kaplan–Meier curves stratified by model-predicted risk quartiles show clear separation (see Supplement), consistent with the improvements observed in C and IBS relative to all baselines.
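For readers less familiar with the discrimination metric, the following simplified sketch (ignoring ties in time, and not the authors' evaluation code) computes Harrell's concordance index for right-censored data:

# Illustrative sketch of Harrell's concordance index: among usable pairs in
# which the patient with the shorter observed time had an event, count how
# often that patient also received the higher predicted risk.
import numpy as np

def harrell_c(time, event, risk):
    """time: follow-up times; event: 1 if event observed, 0 if censored;
    risk: higher value = higher predicted risk (shorter expected survival)."""
    time, event, risk = map(np.asarray, (time, event, risk))
    concordant, tied, usable = 0.0, 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i] == 1:  # pair is comparable
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / usable

print(harrell_c([5, 8, 3, 10], [1, 0, 1, 1], [0.9, 0.2, 0.7, 0.1]))  # 0.8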

Segmentation boundary quality
Figure 1 illustrates qualitative results on colonoscopic frames. The upper row shows pixelwise segmentations overlaid in blue for ten representative cases. Across varied lumen orientations, illumination levels, and mucosal textures, the predicted masks adhere closely to visible lesion margins, following the subtle transition between polyp head and surrounding mucosa rather than expanding into colonic folds or specular highlights. The model preserves narrow necks in pedunculated polyps and maintains smooth, anatomically plausible edges on sessile and flat lesions, with minimal bleeding into adjacent shadows. The lower row presents the corresponding detections. Bounding boxes are centered on the lesion with tight extent, and the detector remains selective in challenging scenes containing valves, bubbles, and instrument tips. Qualitatively, missed detections are rare in frames with clear visual access; remaining imperfections tend to occur when severe glare partially obscures boundaries or when extensive debris compresses the apparent lesion footprint.
Figure 2 provides multi-slice overviews on four CT cases, showing axial stacks from cranial to caudal. In each case, predicted masks taper gradually at the superior and inferior poles of the tumor and reach maximal cross-section near the center of the stack, matching typical morphology. Boundaries follow expected tissue interfaces: the segmentation respects fat planes and muscular walls without spilling into lumen air or adjacent organs. When the lesion abuts heterogeneous regions, such as regions with streak artifacts or partial volume near the bowel wall, the delineation remains smooth rather than fragmented, and small satellite components are consistently represented across neighboring slices rather than appearing as isolated, flickering islands. The slice-to-slice continuity reveals stable three-dimensional behavior, avoiding abrupt jumps in area or topology and preserving the expected shrink-to-vanish pattern at the ends of the stack. Taken together, these examples highlight the model’s ability to provide clean, contiguous masks on both endoscopic and CT data, with boundaries that align with visually interpretable anatomical cues.

Robustness and sensitivity analysis
Sensitivity to APSI hyperparameters is visualised in Fig. 3. HydraMamba peaks at K = 64 and r = 16, where the C-index reaches 0.760 and mAP@0.5:0.95 reaches 0.433. These scores represent gains of roughly +0.02 C-index and +0.02 mAP over the lightest settings, while neighbouring configurations remain within 0.006 of the optimum, confirming that the default delivers the strongest accuracy without sacrificing robustness.
Robustness to scan order is summarised in Fig. 4. Geometry-preserving Hilbert scanning, the default in HydraMamba, retains the best metrics (C-index 0.743, mAP@0.5:0.95 0.422), with Z-order and raster mappings trailing by less than 0.005. The tight band across orders shows that APSI maintains consistent performance even when the token traversal pattern changes.

Ablation studies
The ablation results in Tables 5 and 6 show that each proposed module makes a measurable and complementary contribution per modality.
Removing AnatoTI primarily affected boundary fidelity, with the Dice Similarity Coefficient (DSC) decreasing from 0.856 to 0.826 and the Intersection-over-Union (IoU) from 0.748 to 0.708. Detection performance also declined, as the F1 score fell from 0.918 to 0.897, driven by reductions in precision (from 0.905 to 0.894) and recall (from 0.931 to 0.900). Excluding APSI most strongly influenced sensitivity, as recall decreased from 0.931 to 0.882, accompanied by a reduction in F1 to 0.894. Segmentation quality also declined, with DSC and IoU dropping to 0.842 and 0.726, respectively. Removing the dual-stream anatomy-style pathway resulted in moderate, consistent performance losses (DSC 0.834; F1 0.892). The variant without both modules exhibited the weakest results overall, with DSC 0.814, IoU 0.690, and F1 0.880.
A similar pattern was observed in the CT experiments. Removing AnatoTI caused the most pronounced degradation in boundary accuracy, with DSC decreasing from 0.812 to 0.780 and IoU from 0.683 to 0.645, and the F1 score falling to 0.872. The absence of APSI predominantly impaired detection sensitivity, reducing recall from 0.891 to 0.842 and F1 from 0.888 to 0.867. Precision showed a slight increase (0.885 to 0.892), suggesting that APSI enhances global context at the expense of marginally higher false positives. Eliminating the dual-stream pathway again reduced overall performance (DSC 0.788; F1 0.870). The configuration without both AnatoTI and APSI produced the lowest metrics across all criteria, with DSC 0.766, IoU 0.629, and F1 0.857.
On the combined modality survival task (Table 7), the full HydraMamba achieves C = 0.832 and C@1y = 0.853 with IBS = 0.161 and a calibration slope of 1.01. Removing AnatoTI lowers discrimination to C = 0.816 and C@1y = 0.838 and increases error to IBS = 0.168. Removing APSI produces the largest degradation (C = 0.794, C@1y = 0.815, IBS = 0.178, slope 1.07). Disabling both modules yields the weakest performance (C = 0.776, C@1y = 0.798, IBS = 0.186, slope 1.16). These results indicate complementary roles: AnatoTI improves local representation with modest calibration gains, whereas APSI provides global context that chiefly drives risk discrimination and calibration; used together, they deliver the most accurate and best-calibrated survival predictions.

Generalization across modalities and devices
To make domain shift assumptions explicit and to quantify robustness beyond pooled results, we stratified evaluation by endoscopy acquisition centre (PolypGen), scanner/device factors in CT (StageII-Colorectal-CT), and acquisition protocol in CT colonography (ACRIN 6664). PolypGen aggregates frames from six medical centres with heterogeneous image characteristics (illumination, colour calibration, still vs. short sequences, presence of negative frames), while CVC-ColonDB contributes still images at a distinct native resolution (574 × 500), both reflecting typical multi-centre endoscopic variation. The StageII-Colorectal-CT cohort spans two scanner families, Sensation 64 (Siemens; SS64) and Brilliance (Philips; PB), and different table pitches (1.4 and 0.9). ACRIN 6664 contains prone and supine acquisitions with case-level polyp size labels (6–9 mm vs. ≥10 mm) and validated negatives, enabling protocol- and size-aware analysis of CT colonography detection. Unless noted, the backbone and hyperparameters were unchanged from the main experiments; metrics were recomputed on held-out test folds within each stratum. Confidence intervals (95%) were estimated by case-level bootstrapping (1000 resamples).
Across the six PolypGen centres, HydraMamba maintained stable performance for both segmentation and detection (Table 8), with between-centre coefficients of variation < 1% for Dice and F1. Macro-averaged Dice/IoU and detection F1 (0.856/0.747 and 0.917) closely matched the pooled endoscopy results (Dice 0.856, IoU 0.748; F1 0.918), indicating limited sensitivity to centre-specific appearance changes.
On StageII-Colorectal-CT, device- and protocol-stratified results (Table 9) remained tightly clustered around the pooled CT metrics (Dice 0.812, IoU 0.683; detection F1 0.888; survival C 0.832, IBS 0.161; calibration slope ≈1.01). Siemens vs. Philips differences were small and not statistically significant given overlapping CIs; similarly, pitches of 1.4 and 0.9 showed only marginal changes. Calibration by stratum stayed near unity, indicating stable absolute risk predictions.
On ACRIN 6664, detection performance (Table 10) was similar for prone vs. supine acquisitions; polyps ≥10 mm were detected more reliably than 6–9 mm lesions.

Model interpretability and visualization
To address the need for clinical interpretability, we generated attention map visualizations to understand which image regions HydraMamba utilizes for its predictions. We compared our model’s attention focus against several state-of-the-art baseline methods on representative endoscopic cases from the test set.
As illustrated in Fig. 5, the results provide clear qualitative evidence of our model’s learned focus. For both cases shown, HydraMamba’s attention (visualized as a heatmap) is consistently and tightly concentrated on the true lesion area, aligning closely with the ground truth bounding boxes. This localized attention indicates that our model successfully learns to identify and prioritize salient pathological features. In contrast, other baseline methods exhibit more diffuse or misplaced attention patterns. For instance, some models show attention spreading to non-lesion areas of the mucosa or highlighting only a fraction of the polyp. This analysis supports that our model’s decision-making process is grounded in the correct anatomical structures, enhancing its trustworthiness and potential for clinical interpretation.

Discussion

HydraMamba unifies endoscopic and CT information within a selective state-space framework and delivers consistent gains across tasks. On endoscopy, the model attains Dice 0.856 and detection F1 0.918; on CT, Dice is 0.812 with F1 0.888. For survival prediction, HydraMamba reaches C = 0.832 with IBS = 0.161 and a calibration slope of 1.01, outperforming Cox+radiomics and DeepSurv, and exceeding transformer/Mamba baselines while maintaining superior calibration. Ablations clarify complementary roles. Removing AnatoTI reduces boundary accuracy on both modalities; removing APSI primarily lowers detection recall and degrades survival discrimination and calibration. The full configuration yields the strongest survival performance and the lowest prediction error, indicating that anatomy-aware token interpolation and prototype-driven context injection together are necessary to realize the model’s performance envelope. The results show that the proposed modules, coupled with a state-space backbone, provide an effective and calibrated multimodal solution for colorectal cancer segmentation, detection, and survival prediction.
The current work has a number of shortcomings. The retrospective public datasets on which all analyses are based have variable annotation quality, small sample sizes for certain tasks (particularly survival prediction), and no paired endoscopy-CT data at the subject level, which limits the evaluation of true cross-modal complementarity. External validation is limited to a small number of centres and acquisition protocols, and comparisons with prior survival models rely on results reported in separate cohorts rather than direct head-to-head evaluation. Prospective, multi-center validation, the addition of other modalities like MRI, pathology, and molecular data, the creation of paired multimodal datasets to better analyze cross-modal interactions, and research into workflow integration and practical clinical impact are all important areas for future work.

Methods

Problem setup and overview
Given a corpus of unpaired endoscopic RGB images and pelvic CT volumes, we introduce HydraMamba (Fig. 6), a unified architecture for unpaired multi-modal learning. The model is designed to produce (i) a tumor segmentation mask M on a reference imaging space (CT slice space by default), (ii) lesion bounding boxes, and (iii) a survival risk representation r used to estimate the patient hazard h(t∣r). The framework is built upon the principle of representation disentanglement, decomposing inputs into a modality-invariant anatomical representation and a modality-specific style representation. These are processed by a shared selective state space backbone (Mamba)17 augmented with two modules: Anatomy-Aware Token Interpolation (AnatoTI), which performs masked token reconstruction within the shared anatomical space, and Anatomical Prototype State Injection (APSI), which injects global anatomical context derived from the shared representation. A late-fusion mechanism combines the anatomical and style representations to feed task-specific "hydra" heads for segmentation, detection, and survival.

Preprocessing and tokenization
Offline Feature Extraction: All input images are first processed by a frozen, pre-trained MedSigLIP image encoder. For each 2D CT slice or endoscopic frame, the encoder produces a spatial feature map. These feature maps, which represent high-level visual information, are saved and used as the direct input for all subsequent model training and evaluation. This step is performed only once.
Tokenization: The trainable part of our model begins by taking these pre-computed feature maps. Tokens for the Mamba backbone are formed by flattening patches from these feature maps. For CT, we use 2D per-slice patching with slice index encodings. For endoscopy, we use non-overlapping patches from the feature map.
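The sketch below illustrates this two-stage design under stated assumptions: a small frozen convolutional stub stands in for the MedSigLIP image encoder, and patch flattening is shown with torch's Unfold; the actual checkpoint, feature dimensions, and patch sizes are not specified here.

# Illustrative sketch: a frozen 2D encoder produces a spatial feature map once
# per slice/frame (offline), and the trainable model later flattens patches of
# that map into tokens. A convolutional stub stands in for MedSigLIP.
import torch
import torch.nn as nn

class FrozenEncoderStub(nn.Module):
    """Placeholder for the frozen MedSigLIP image encoder (hypothetical stand-in)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # image -> spatial feature map
        for p in self.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, x):                       # x: (B, 3, H, W)
        return self.net(x)                      # (B, C, H/16, W/16)

def tokenize(feat_map: torch.Tensor, patch: int = 1) -> torch.Tensor:
    """Flatten non-overlapping patches of a feature map into a token sequence."""
    unfold = nn.Unfold(kernel_size=patch, stride=patch)
    return unfold(feat_map).transpose(1, 2)     # (B, num_patches, C*patch*patch)

frame = torch.randn(1, 3, 256, 256)             # one endoscopic frame or CT slice
feat = FrozenEncoderStub()(frame)               # computed offline and saved once
print(tokenize(feat).shape)                     # torch.Size([1, 256, 256])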

State space encoder and representation disentanglement
We adopt the discretized state space recurrence used in modern Mamba-style vision encoders17,27. For input tokens x_i with a learned, data-dependent time scale Δ_i, the hidden state and output evolve as h_i = Ā_i h_{i−1} + B̄_i x_i and y_i = C h_i + D x_i, where Ā_i = exp(Δ_i A) and B̄_i = (Δ_i A)⁻¹(exp(Δ_i A) − I) Δ_i B, for learned A, B, C, D. This causal formulation provides linear-time long-range modeling but suffers from long-range decay when interactions must traverse many steps; alleviating this limitation requires non-causal augmentation of the output mapping C or multi-directional scanning.
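A minimal, unoptimized reference of this recurrence (assuming a diagonal A, zero-order-hold discretization of Ā and an Euler-style B̄; not the authors' kernel) is:

# Minimal reference implementation of the discretized selective-scan recurrence
# with a data-dependent step Δ_i: h_i = Ā_i h_{i-1} + B̄_i x_i, y_i = C h_i + D x_i.
import torch

def selective_scan(x, delta, A, B, C, D):
    """x: (L, d_in) tokens; delta: (L,) data-dependent steps; A: (d_state,) diagonal
    state matrix; B: (d_state, d_in); C: (d_in, d_state); D: (d_in,) skip term."""
    L, d_in = x.shape
    h = torch.zeros(A.shape[0])
    ys = []
    for i in range(L):
        A_bar = torch.exp(delta[i] * A)          # ZOH discretization of diagonal A
        B_bar = delta[i] * B                      # Euler-style approximation of B̄
        h = A_bar * h + B_bar @ x[i]              # causal state update
        ys.append(C @ h + D * x[i])               # output mapping with skip term
    return torch.stack(ys)

L, d_in, d_state = 8, 4, 16
y = selective_scan(torch.randn(L, d_in), torch.rand(L) * 0.1,
                   -torch.rand(d_state), torch.randn(d_state, d_in),
                   torch.randn(d_in, d_state), torch.randn(d_in))
print(y.shape)  # torch.Size([8, 4])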
Shared Backbone and Disentangled Encoders: To enable learning from unpaired data, we employ a shared backbone architecture that processes inputs from either modality using the same set of Mamba block weights. To handle the significant statistical shifts between endoscopy and CT data, we incorporate modality-specific LayerNorm parameters within the shared backbone. The architecture uses two parallel encoders:
A deep, shared Mamba-based backbone that encodes the input image into a modality-invariant anatomical representation s, which captures the underlying anatomical content. A shallow, modality-specific convolutional encoder that produces a compact modality representation z, which encodes the appearance and style characteristics of the source modality (e.g., CT or endoscopy).
This explicit separation of content and style is fundamental to preventing information leakage and enabling robust unpaired learning.
Tokenization and Scanning: Endoscopy is embedded by a 3 × 3 stride-2 convolutional stem followed by non-overlapping P × P patch flattening and positional encodings. CT is embedded with either (i) a 3 × 3 × 3 stem and P × P × S cuboid tokens or (ii) 2D per-slice patching with slice index encodings. For a given input, a scan π (Hilbert or Morton) defines a 1D token order that preserves local adjacency.
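As an illustration of locality-preserving scans, the sketch below orders a patch grid along a Morton (Z-order) curve; the Hilbert scan used by default follows the same principle with better locality and is omitted here for brevity.

# Illustrative sketch: order the tokens of an H×W patch grid by a Morton
# (Z-order) curve so that spatially adjacent patches tend to stay adjacent
# in the 1D scan fed to the state space backbone.
def interleave_bits(y: int, x: int, bits: int = 16) -> int:
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)       # x bits go to even positions
        z |= ((y >> b) & 1) << (2 * b + 1)   # y bits go to odd positions
    return z

def morton_order(height: int, width: int):
    """Return flat patch indices (row-major) sorted along a Z-order curve."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    coords.sort(key=lambda rc: interleave_bits(rc[0], rc[1]))
    return [r * width + c for r, c in coords]

print(morton_order(4, 4))
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]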

Anatomy aware token interpolation (AnatoTI)
Motivation: Replacing missing or low-confidence tokens with a generic learnable vector disrupts the causal, input-dependent selective-scan property of SSMs and degrades representation quality. Mamba-specific masked modeling succeeds when masked tokens are generated via state-space-consistent interpolation rather than free parameters. Our AnatoTI module reinforces the learning of a robust, modality-invariant anatomical representation. It operates on the anatomical representations to perform masked token reconstruction. This acts as a self-supervised objective that encourages the shared backbone to learn universal anatomical features, consistent with the causal dynamics of SSMs, without being influenced by modality-specific style (Fig. 7).
Geometry- and uncertainty-aware interpolation: Let π denote the scan over the anatomical tokens S, and suppose indices i < j are valid tokens with a gap G(i, j) = {i + 1, …, j − 1}. For α = 1, …, j − i − 1, define the geodesic progress u = α/(j − i) and construct the interpolated token as a gated blend of forward and backward estimates, where ϕθ maps local image descriptors in the flanking neighborhoods (e.g., color gradients for endoscopy; HU histograms for CT), together with u, to a token in the shared anatomical space. The blend gate γi+α is uncertainty-aware, where ςi+α estimates aleatoric uncertainty.
Causality at deployment via dual-pass training: During training we use a symmetric, teacher-forcing blend of the forward and backward estimates. At inference, to preserve strict causality, we freeze γi+α = 1 and disable the backward term, yielding a purely forward interpolation that depends only on past tokens.
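A schematic, assumed form of this gap filling is sketched below; the exact interpolation formula is not reproduced here, so the descriptor network ϕθ, the gate, and the blend are simplified stand-ins rather than the authors' implementation.

# Schematic sketch of AnatoTI-style gap filling (assumed form). A masked token
# at offset α inside a gap between valid tokens s_i and s_j is filled by blending
# forward and backward estimates with an uncertainty-aware gate γ; at inference
# the gate is frozen to 1 so only past tokens are used.
import torch
import torch.nn as nn

class ToyAnatoTI(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.phi = nn.Linear(dim + 1, dim)   # stand-in for ϕθ(local descriptors, u)
        self.gate = nn.Linear(dim, 1)        # stand-in for the uncertainty-aware gate

    def forward(self, s_i, s_j, alpha: int, gap_len: int, causal: bool):
        u = torch.tensor([alpha / gap_len], dtype=s_i.dtype)
        forward_est = self.phi(torch.cat([s_i, u]))          # depends only on the past token
        backward_est = self.phi(torch.cat([s_j, 1.0 - u]))   # peeks at the future token
        if causal:
            gamma = torch.ones(1)                            # deployment: forward-only
        else:
            gamma = torch.sigmoid(self.gate(s_i))            # training: learned blend
        return gamma * forward_est + (1.0 - gamma) * backward_est

toy = ToyAnatoTI(dim=8)
s_i, s_j = torch.randn(8), torch.randn(8)
print(toy(s_i, s_j, alpha=1, gap_len=3, causal=True).shape)  # torch.Size([8])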
Where AnatoTI applies: We invoke AnatoTI during pretraining-style consistency regularization by replacing a fraction of anatomical tokens si with masked placeholders and training the model to reconstruct them.

Anatomical prototype state injection (APSI)
Motivation: The causal SSM recurrence constrains each token to “see” only its predecessors along π, which limits global perception. Non-causal augmentation that modulates the output pathway enables one-pass global context without multi-directional scans. The core mechanism derives patient-specific anatomical prototypes from the shared content space S and injects this context via a structured, token-conditioned low-rank update. Modality-specific style information is fused at a late stage to inform the final task-specific predictions (Fig. 8).
Prototype construction: Given the anatomical representation tokens (optionally whitened), we learn K prototypes p_k via temperature-controlled soft clustering with trainable queries and projection W_a, yielding soft assignments c_ik. We encourage diversity via an entropy regularizer on the assignments.
Low-rank operator injection and late fusion: For token i, we form an anatomical context vector g_i = ∑_k c_ik p_k and modulate the SSM output mapping by a rank-r update on the anatomical weights, producing a context-aware anatomical representation. The final representation for the task heads is produced by a late-fusion module (a gated MLP) that combines the enhanced anatomical features with the modality-style vector z. This ensures that the deep backbone learns a pure anatomical representation, while allowing modality-specific characteristics to inform the final prediction.
Stability refinements: We apply (i) a nuclear-norm penalty λ_* ∑_i ‖U diag(Γ g_i) V^⊤‖_* to prevent rank inflation; (ii) context drop with small probability to avoid over-reliance on APSI; and (iii) prototype EMA updates for smooth dynamics.
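A schematic sketch of APSI under simplifying assumptions (softmax soft clustering, prototypes pooled from the current patient's tokens, and a rank-r modulation of the output path) is given below; it is an illustration, not the authors' implementation.

# Schematic sketch of APSI (assumed form): tokens are softly assigned to K
# prototypes, each token pools a patient-specific context vector g_i, and the
# output projection is modulated by a rank-r update conditioned on g_i.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAPSI(nn.Module):
    def __init__(self, dim: int, K: int = 64, r: int = 16, tau: float = 0.1):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(K, dim))     # trainable prototype queries
        self.proj = nn.Linear(dim, dim, bias=False)          # stand-in for W_a
        self.U = nn.Parameter(torch.randn(dim, r) * 0.02)    # low-rank factors of the update
        self.V = nn.Parameter(torch.randn(dim, r) * 0.02)
        self.gamma = nn.Linear(dim, r)                        # token-conditioned scaling Γ g_i
        self.tau = tau

    def forward(self, s: torch.Tensor) -> torch.Tensor:      # s: (L, dim) anatomical tokens
        a = self.proj(s)
        c = F.softmax(a @ self.queries.t() / self.tau, dim=-1)        # soft assignments c_ik
        prototypes = (c.t() @ s) / (c.sum(dim=0).unsqueeze(1) + 1e-6)  # patient prototypes p_k
        g = c @ prototypes                                    # per-token context g_i
        scale = self.gamma(g)                                 # Γ g_i, shape (L, r)
        delta = (s @ self.V) * scale                          # per-rank scaled projection
        return s + delta @ self.U.t()                         # rank-r update of the output path

apsi = ToyAPSI(dim=32, K=8, r=4)
print(apsi(torch.randn(10, 32)).shape)  # torch.Size([10, 32])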

Backbone stack and multi-scale features
We interleave shared SSM blocks with depthwise separable convolutions for local mixing. Each stage uses modality-specific LayerNorm → SSM (with AnatoTI at inputs) → modality-specific LayerNorm → pointwise MLP, with residual connections. APSI is applied in every other block to the anatomical representations. We build a feature pyramid by downsampling token density across stages, exposing high-resolution features to segmentation and mid/low-resolution features to detection and survival18.

Task-specific decoders
Segmentation: A U-shaped decoder upsamples multi-scale SSM features to the CT slice space. Each up block performs 2× upsampling, depthwise convolution, and gated MLP fusion with lateral skips from early stems. The final 1 × 1 projection yields logits ℓseg; loss combines soft Dice and focal cross entropy. To sharpen boundaries, we add a surface loss on the signed distance transform and a topology-preserving penalty on Euler characteristic in a narrow band.
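A minimal sketch of the core segmentation objective (soft Dice plus focal cross-entropy; the surface and topology terms are omitted, and the equal weighting is an assumption):

# Illustrative sketch of the segmentation loss: soft Dice + focal BCE on logits.
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, eps=1e-6):
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def focal_bce_loss(logits, target, gamma=2.0, alpha=0.25):
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = torch.exp(-bce)                                  # probability of the true class
    weight = alpha * target + (1 - alpha) * (1 - target)
    return (weight * (1 - p_t) ** gamma * bce).mean()

def seg_loss(logits, target, w_dice=1.0, w_focal=1.0):     # weights are assumed
    return w_dice * soft_dice_loss(logits, target) + w_focal * focal_bce_loss(logits, target)

logits = torch.randn(1, 1, 64, 64)
target = (torch.rand(1, 1, 64, 64) > 0.8).float()
print(seg_loss(logits, target).item())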
Detection (bounding boxes): An anchor-free, center-based head operates on a 1/4–1/8 scale feature map and predicts objectness o(u, v) and box parameters (l, t, r, b) with centerness weighting. Training uses focal loss for classification and a mixture of IoU/GIoU and smooth-ℓ1 for regression.
Survival: We form a patient embedding r via attention pooling concentrated on predicted tumor regions: tokens receive attention weights derived from the upsampled mask logits. We fit either (i) a Cox head that outputs a log risk optimized by the negative partial log-likelihood, or (ii) a discrete-time head that outputs hazards over T bins and is trained with the corresponding discrete-time likelihood.
To encourage clinically meaningful cues, a weak radiomics prior u (mask area, elongation, detector counts) is fused via r ← r + W_u u.
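For the Cox variant, a minimal sketch of the negative partial log-likelihood (ties and the discrete-time head are omitted; not the authors' code):

# Illustrative sketch of the Cox head's training objective: the negative partial
# log-likelihood over a batch of predicted log-risks and (time, event) pairs.
import torch

def cox_npll(log_risk: torch.Tensor, time: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
    """log_risk: (N,) predicted log hazards; time: (N,) follow-up; event: (N,) 1=event, 0=censored."""
    order = torch.argsort(time, descending=True)              # risk sets become cumulative prefixes
    log_risk, event = log_risk[order], event[order]
    log_cumsum = torch.logcumsumexp(log_risk, dim=0)           # log sum of exp(risk) over each risk set
    return -((log_risk - log_cumsum) * event).sum() / event.sum().clamp(min=1)

log_risk = torch.tensor([0.4, -0.3, 1.1, 0.0])
time = torch.tensor([5.0, 8.0, 3.0, 10.0])
event = torch.tensor([1.0, 0.0, 1.0, 1.0])
print(cox_npll(log_risk, time, event).item())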

Training objective and optimization
The total loss combines the task-specific, disentanglement, and semantic alignment terms described below.
Disentanglement Loss: To enforce a clean separation of anatomical (s) and modality (z) representations, we use a margin-based similarity loss defined over pairs of subjects p, q and modalities i, j. It ensures that anatomical representations of similar structures are close regardless of modality, while modality representations are close for the same modality regardless of anatomical content.
Semantic Alignment Loss: To align the anatomical spaces of the two modalities, we use a knowledge-guided contrastive loss inspired by MedCLIP[5]. For unpaired batches from each modality, we construct a semantic similarity matrix S based on weak labels (presence of tumor, organ labels). The loss then minimizes the divergence between the computed cross-modal representation similarity and S, encouraging anatomically similar (but unpaired) images to have similar representations.
Task-Specific Loss: The standard supervised losses for segmentation, detection, and survival, applied to samples with available ground-truth labels.
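As an illustrative, assumed form of the margin-based disentanglement term (the exact loss may differ), a triplet-style sketch:

# Schematic sketch (assumed form): pull anatomical embeddings of the same
# structure together across modalities, and modality embeddings of the same
# modality together across subjects, each with a margin against negatives.
import torch
import torch.nn.functional as F

def margin_disentangle_loss(s_anchor, s_pos_other_modality, s_neg,
                            z_anchor, z_pos_same_modality, z_neg, margin: float = 0.2):
    # anatomy stream: same anatomy should match regardless of modality
    l_anat = F.triplet_margin_loss(s_anchor, s_pos_other_modality, s_neg, margin=margin)
    # modality stream: same modality should match regardless of anatomy
    l_mod = F.triplet_margin_loss(z_anchor, z_pos_same_modality, z_neg, margin=margin)
    return l_anat + l_mod

s = [torch.randn(4, 128) for _ in range(3)]
z = [torch.randn(4, 32) for _ in range(3)]
print(margin_disentangle_loss(*s, *z).item())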

Training protocol and hyperparameters
We use patient-level k-fold cross-validation on the CT cohort. In each fold, the model is trained on the CT training set features combined with all available endoscopy features. Evaluation is performed on the held-out CT test set features.
The entire HydraMamba model, including the shared Mamba backbone, AnatoTI, APSI, and task-specific heads, is trained from scratch on the pre-extracted MedSigLIP features. We use the AdamW optimizer with a cosine learning rate decay and mixed precision. The total loss combines task-specific losses with the cross-modal disentanglement and alignment losses that enforce a coherent shared representation space.
To eliminate any risk of information leakage, all preprocessing and any statistic-fitting operations were confined to the training data of each outer fold. Feature extraction with the MedSigLIP encoder is a fixed, frozen mapping from pixels to features and does not learn from our data; nevertheless, any downstream transformations that require data-driven parameters were estimated on the training partition only and then applied unchanged to validation/test partitions. Splits were enforced at the patient level across all tasks so that a subject never appears in more than one of train/validation/test, and the survival head consumed only CT patient embeddings with linked outcomes.
Survival modeling used nested, patient-level resampling. In the outer loop we performed k = 5-fold cross-validation with folds stratified by event indicator and approximate follow-up quantiles. Within each outer training split, 20% of patients were held out as an inner validation set for model selection and early stopping; the inner set was never used to compute the reported test metrics. Hyperparameters were tuned strictly inside this inner loop (learning rate/weight decay/dropout and APSI design), using a grid that included K ∈ {16, 32, 64} prototypes and ranks r ∈ {4, 8, 16}. The configuration that maximized Uno’s C(t) at 1 year while minimizing IBS on the inner validation set was then refit on the full outer-training fold and evaluated once on the untouched outer-test fold. Uncertainty was quantified with patient-level nonparametric bootstrapping. After aggregating out-of-fold predictions from the five outer test folds, we computed 95% confidence intervals for Harrell’s C, Uno’s C(t) at 1 year, the integrated Brier score (IBS), and the calibration slope using B = 1000 bootstrap replicates, resampling patients (keeping all of a patient’s slices/frames together) to preserve within-subject dependence.
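A minimal sketch of the patient-level bootstrap used for the confidence intervals (shown with a toy metric in place of Harrell's C, Uno's C(t), the IBS, or the calibration slope):

# Illustrative sketch: resample patients with replacement from pooled
# out-of-fold predictions and recompute the metric on each replicate.
import numpy as np

def bootstrap_ci(metric_fn, patient_ids, *arrays, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(patient_ids)
    stats = []
    for _ in range(n_boot):
        sampled = rng.choice(unique_ids, size=len(unique_ids), replace=True)
        idx = np.concatenate([np.flatnonzero(patient_ids == pid) for pid in sampled])
        stats.append(metric_fn(*[a[idx] for a in arrays]))
    return np.percentile(stats, [2.5, 97.5])

ids = np.array([1, 1, 2, 3, 3, 4])
risk = np.array([0.9, 0.8, 0.2, 0.5, 0.4, 0.7])
print(bootstrap_ci(lambda r: r.mean(), ids, risk, n_boot=200))  # toy metric: mean risk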

Ethics and consent to participate
This study was conducted using publicly available, fully de-identified datasets obtained from open-access repositories (TCGA-COAD, TCGA-READ, StageII-Colorectal-CT, PolypGen, CVC-ColonDB, and ACRIN 6664). As no human participants or animals were directly involved and all data were previously anonymized, ethics approval and consent to participate were not required.
