
StructSAM: structure-aware prompt adaptation for robust lung cancer lesion segmentation in CT.


Liu M, Yao Y, Jia J, Yao J, Huang Z, Zeng Z


Liu M, Yao Y, Jia J, Yao J, Huang Z, Zeng Z (2026). StructSAM: structure-aware prompt adaptation for robust lung cancer lesion segmentation in CT. NPJ Digital Medicine, 9(1), 127. https://doi.org/10.1038/s41746-025-02306-6
PMID: 41634130

Abstract

Accurate delineation of lung lesions in computed tomography (CT) scans is critical for diagnosis, staging, and treatment planning, yet remains a challenging task. While foundation models like the Segment Anything Model (SAM) excel in natural images, they often falter in medical imaging due to low contrast, ambiguous boundaries, and a lack of 3D context. To address these limitations, we propose StructSAM, a structure-aware prompt adaptation framework designed for robust volumetric segmentation, with a primary focus on lung cancer. StructSAM injects anatomical priors into the prompt pathway, employs a 3D inter-slice aggregator for volumetric consistency, and leverages parameter-efficient fine-tuning (PEFT) for scalability. Experiments on the LIDC-IDRI dataset demonstrate that StructSAM achieves state-of-the-art accuracy on lung nodule segmentation, outperforming both classical architectures and SAM-based adaptations. Crucially, extended cross-organ evaluations on KiTS19 and MSD Pancreas datasets reveal that StructSAM effectively generalizes to other anatomical structures, highlighting its robustness to domain shifts. These findings suggest that embedding structural priors into foundation models is a promising strategy toward generic, clinically reliable, and efficient medical image segmentation.



Introduction

Lung cancer remains the leading cause of cancer-related mortality worldwide, accounting for approximately one in five cancer deaths1. Accurate delineation of lung lesions from computed tomography (CT) scans plays a pivotal role in the clinical workflow, underpinning early detection, precise staging, treatment planning, and longitudinal response assessment. In practice, however, automated segmentation of lung nodules and tumors remains highly challenging. Lesions often present with subtle intensity differences compared to the surrounding parenchyma, frequently abut vessels or pleural boundaries, and exhibit substantial variability in size, shape, and growth patterns across patients. Moreover, the inherently volumetric nature of CT scans requires consistent lesion representation across slices, which conventional 2D models frequently fail to achieve. These challenges hinder the robust deployment of computer-aided systems in real-world oncology practice2.
Deep learning methods have driven significant progress in medical image segmentation over the past decade. Classical architectures such as U-Net and its variants3,4 have established strong baselines in thoracic imaging. Nevertheless, their generalization capabilities remain limited when confronted with domain shifts caused by varying scanner vendors, acquisition protocols, and heterogeneous clinical cohorts. Consequently, models trained on a single dataset often degrade substantially when applied to external cohorts, limiting their clinical utility. This disparity between benchmark performance and deployment robustness is particularly pronounced in lung cancer, where reliable lesion delineation is indispensable for precision oncology.
The recent advent of large-scale foundation models has introduced a new paradigm for vision tasks. The Segment Anything Model (SAM)5, trained on over one billion masks, achieves remarkable zero-shot segmentation in natural image domains by leveraging prompt-driven interaction. Its success has sparked considerable interest in medical imaging applications6,7. However, applying SAM directly to volumetric medical data encounters fundamental hurdles. First, low-contrast and ambiguous lesion boundaries make prompt attention unstable, often leading to under- or over-segmentation. Second, SAM lacks an intrinsic understanding of anatomical context, producing masks that may be structurally inconsistent with biological plausibility. Third, SAM is natively 2D, making it unsuitable for volumetric CT where inter-slice continuity is crucial. In the context of lung cancer, these limitations translate into clinically unacceptable errors in nodule margins or tumor extent estimation.
Several recent works have attempted to adapt SAM for the medical domain. MedSAM6 introduces large-scale medical pretraining, while other strategies explore specialized fine-tuning8 or domain-specific prompt engineering. Despite these advances, existing adaptations remain limited. Most efforts focus primarily on data-driven transfer without explicitly encoding structural priors necessary for anatomical validity, and they often remain constrained to 2D slice-level adaptation, neglecting volumetric coherence. As a result, their performance remains suboptimal, particularly in cross-center or cross-organ settings.
To overcome these challenges, we propose StructSAM, a novel structure-aware prompt adaptation framework designed for robust volumetric segmentation, utilizing lung cancer as a challenging primary case study. StructSAM introduces anatomical priors into the prompt generation process, guiding SAM with shape- and topology-based cues (e.g., organ masks, vesselness) to stabilize predictions at low-contrast boundaries. We further design a lightweight 3D-aware adapter that aggregates inter-slice contextual information, ensuring volumetric continuity and reducing slice-wise inconsistencies. Finally, we adopt a domain-aware parameter-efficient fine-tuning (PEFT) strategy to enable generalization across datasets and institutions, yielding a practical framework for robust deployment.
In summary, our contributions are: We systematically analyze the limitations of applying SAM to complex volumetric lesions, highlighting issues of low-contrast ambiguity, lack of anatomical priors, and absence of volumetric modeling. We propose StructSAM, a structure-aware prompt adaptation framework that integrates domain-specific priors into SAM, enabling more biologically consistent lesion delineation. We introduce a 3D-aware adapter that effectively incorporates inter-slice contextual information, improving volumetric coherence in CT-based lesion segmentation. We conduct extensive experiments on the LIDC-IDRI lung cancer dataset as well as cross-organ benchmarks (KiTS19 and MSD Pancreas), demonstrating that StructSAM outperforms SAM, MedSAM, and state-of-the-art medical segmentation models in both accuracy and generalization capabilities.
Large-scale foundation models have fundamentally reshaped visual segmentation by coupling promptable interfaces with broad pretraining. SAM demonstrated unprecedented versatility on natural images but encountered significant domain gaps in medical scenarios. Recent releases of SAM 2 extend this paradigm to video processing via streaming memory mechanisms9, further widening the scope of promptable segmentation. In medical imaging, efforts to bridge the gap between SAM-style pretraining and clinical distributions generally fall into three categories: (i) scaling medical pretraining with millions of image-mask pairs (e.g., MedSAM, SAM-Med2D)6,10, (ii) transitioning from 2D to volumetric settings via specialized architectures (e.g., SAM-Med3D)11, and (iii) empirical benchmarking of SAM 2 on medical modalities7. While these efforts validate the utility of foundation models, they often struggle to enforce anatomical consistency under low contrast and ambiguous boundary conditions.
A rapidly expanding body of literature focuses on adapting SAM to clinical images via parameter-efficient updates. Modality-agnostic strategies (e.g., MA-SAM) preserve 2D backbones while injecting 3D or temporal cues during fine-tuning, yielding consistent gains across CT, MRI, and ultrasound12. Beyond general adapters, concurrent works have proposed lightweight encoders and automatic prompting mechanisms (e.g., EMedSAM, SAM-AutoMed)10,13 to achieve fully automated segmentation. For volumetric data, approaches like mixture-of-experts aggregation have been explored to mitigate catastrophic forgetting when specializing foundation models14. Collectively, these studies suggest that structured adaptation, rather than end-to-end retraining, represents a more pragmatic pathway toward robust medical performance.
Prior to the advent of foundation models, lung lesion segmentation was a mature field dominated by task-specific deep learning architectures. Early approaches relied on multi-view CNNs or 3D U-Net variants to capture volumetric context. More recent state-of-the-art methods, such as NoduleNet15 and nnDetection16, integrate detection and segmentation into unified frameworks to reduce false positives and improve boundary delineation. While these specialized models achieve high performance on specific benchmarks like LIDC-IDRI, they are typically trained from scratch on limited datasets and lack the semantic richness of large-scale vision models. Moreover, their distinct architectural designs often limit their transferability to other organs or tasks without extensive retraining. In contrast, StructSAM aims to retain the generalization power of foundation models while matching the precision of these specialized tools through structure-aware adaptation.
Parameter-efficient strategies, such as prompt tuning and Low-Rank Adaptation (LoRA), have become standard for adapting large vision models to healthcare17,18. Visual prompt learning at test-time can stabilize adaptation without altering backbone weights19. However, clinical usability hinges on anatomical plausibility, not just pixel-level accuracy. Recent work revisits shape priors and category-specific constraints20,21, highlighting the limitations of purely data-driven transfer. These studies motivate our structure-aware prompting approach: by explicitly exposing the model to geometric and regional priors (e.g., vesselness, organ boundaries), we can regularize masks toward biologically plausible configurations, addressing a critical gap in current SAM adaptations.
Distribution shift across scanners and protocols remains a major barrier in clinical deployment22,23. Methodologically, single-image or on-the-fly Test-Time Adaptation (TTA) has emerged as a realistic setting for segmentation24. Prompt-based TTA relaxes source dependence by adapting style or shape at inference (e.g., PASS)25, while cascaded pipelines enable zero-shot transfer26. For SAM specifically, recent works target the semantic gap between natural and medical domains27. Our work complements these efforts by demonstrating that structural priors serve as robust, domain-invariant anchors, facilitating generalization across heterogeneous datasets.
Compared with prior SAM adaptations that primarily scale medical pretraining6 or inject volumetric cues11, our framework uniquely targets the synergy between structure-aware prompting and 3D-aware aggregation. Unlike specialized lung models (e.g., NoduleNet) that are confined to specific tasks, StructSAM leverages the semantic power of foundation models for broader generalization. Simultaneously, unlike general-purpose adapters (e.g., MedSAM) that neglect anatomical constraints, we explicitly incorporate topology-informed prompts. This design allows StructSAM to stabilize predictions in low-contrast, volumetric settings while preserving the parameter efficiency required for clinical deployment.

Results

This section provides a comprehensive evaluation of the proposed StructSAM framework. We begin by outlining the experimental setup, including datasets, preprocessing protocols, implementation details, and evaluation metrics. We then benchmark StructSAM against a broad spectrum of state-of-the-art segmentation methods to highlight its advantages on lung lesion analysis as well as cross-organ tasks. Beyond quantitative comparisons, we conduct extensive ablation studies to disentangle the contributions of structure-aware prompting, 3D inter-slice aggregation, and parameter-efficient adaptation. To assess robustness and transferability, we further examine StructSAM under cross-dataset generalization and TTA scenarios, simulating domain shifts commonly encountered in clinical practice. We complement these results with qualitative case studies that illustrate how structural priors improve boundary delineation and topological consistency. Finally, we analyze model complexity, parameter efficiency, and inference speed, demonstrating that StructSAM achieves strong accuracy-efficiency trade-offs suitable for real-world deployment.

Experimental setup
We evaluate StructSAM on two widely used, publicly available medical image segmentation benchmarks. (1) The Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI)2 dataset, consisting of 1018 thoracic CT scans with annotated lung nodules, serves as our primary benchmark for lung lesion segmentation. (2) To assess cross-organ generalization, we include the Kidney Tumor Segmentation Challenge (KiTS19) dataset28, which contains 210 abdominal CT scans with kidney and tumor masks. These datasets represent diverse anatomical sites, imaging protocols, and lesion morphologies, allowing us to test both in-domain accuracy and cross-domain robustness.
For fair comparison, all models are trained and evaluated under the same preprocessing pipeline: CT volumes are resampled to 1 × 1 × 1 mm³ voxel spacing, and intensities are clipped to [−1000, 400] Hounsfield units and normalized to zero mean and unit variance. Training and evaluation follow official dataset splits when provided; otherwise, we adopt a 70/10/20 split by patient. Models are implemented in PyTorch and trained with the AdamW optimizer and cosine learning-rate scheduling. All experiments are conducted on NVIDIA A100 GPUs with 40 GB memory. Evaluation metrics include the Dice similarity coefficient (DSC), Intersection-over-Union (IoU), 95th-percentile Hausdorff distance (HD95), and average surface distance (ASD). Statistical significance of improvements is assessed via paired two-tailed t-tests with p < 0.05. To ensure fair comparison, all baseline models (e.g., MedSAM, SwinUNETR) were fine-tuned using their official open-source implementations. We unified the training protocols with the same batch size (2), optimization schedule (100 epochs), and data augmentation pipeline (intensity scaling, rotation, and elastic deformation) to rule out hyperparameter-driven gains.
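As an illustration, the preprocessing protocol and the Dice metric above can be sketched in a few lines of NumPy. This is a simplified sketch, not the paper's implementation: the nearest-neighbour resampling and the global z-normalization are our assumptions.

```python
import numpy as np

HU_WINDOW = (-1000.0, 400.0)  # clipping range from the paper's protocol

def preprocess(volume_hu, spacing, target_spacing=(1.0, 1.0, 1.0)):
    """Resample to target voxel spacing (nearest-neighbour, an assumption),
    clip to the HU window, and normalize to zero mean / unit variance."""
    zoom = [s / t for s, t in zip(spacing, target_spacing)]
    idx = [np.clip(np.round(np.arange(round(n * z)) / z).astype(int), 0, n - 1)
           for n, z in zip(volume_hu.shape, zoom)]
    vol = volume_hu[np.ix_(*idx)].astype(np.float32)
    vol = np.clip(vol, *HU_WINDOW)
    return (vol - vol.mean()) / (vol.std() + 1e-8)

def dice(pred, gt, eps=1e-8):
    """Dice similarity coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return (2.0 * np.logical_and(pred, gt).sum() + eps) / (pred.sum() + gt.sum() + eps)
```

In practice, production pipelines would use trilinear interpolation (e.g., `scipy.ndimage.zoom`) for resampling; nearest-neighbour keeps the sketch dependency-free.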

Comparison with state-of-the-art
Table 1 presents a comprehensive benchmarking on lung nodule segmentation. Standard medical baselines, such as the self-configuring nnU-Net3, establish a high-performance bar with a Dice score of 84.7%, reflecting the maturity of CNN-based approaches. Crucially, we also compare against task-specific models optimized for lung nodules, including NoduleNet15 and nnDetection16. While these specialized systems achieve impressive accuracy (Dice > 86%) by tailoring architectures to nodule characteristics, they require training from scratch and lack semantic versatility. StructSAM surpasses both categories, achieving a state-of-the-art Dice score of 88.6% and significantly reducing the Hausdorff Distance (HD95) to 7.8 mm. This superiority suggests that StructSAM effectively combines the best of both worlds: it leverages the rich, robust semantic features from the SAM foundation model while injecting the precise, boundary-aware guidance typically found in specialized detectors.

Ablation studies
Table 2 presents a systematic ablation study on the LIDC-IDRI dataset. Starting from the vanilla SAM baseline, which achieves a Dice score of 74.8%, we observe limited performance due to its lack of domain adaptation and structural priors. Incorporating the Structure-Aware Prompt Generator (SAPG) yields a substantial improvement of nearly 7 percentage points in Dice, confirming that anatomical priors provide crucial guidance in low-contrast and ambiguous regions. Adding the 3D-Aware Inter-slice Aggregator (3D-AIA) further boosts performance to 85.7% Dice and reduces HD95 by almost 3 mm, indicating that cross-slice contextual reasoning significantly enhances volumetric consistency. Finally, equipping the framework with Domain-Aware Parameter-Efficient Fine-Tuning (PEFT) produces the best results, reaching 88.6% Dice and 7.8 mm HD95. These results validate that each component contributes complementary benefits: SAPG addresses boundary ambiguity, 3D-AIA enforces volumetric coherence, and PEFT adapts the decoder to clinical data distributions, together forming a robust and generalizable segmentation framework.

Robustness analysis and choice of priors
A critical concern for clinical deployment is whether StructSAM relies too heavily on perfect anatomical priors, which may not be available in cases of severe pathology (e.g., consolidation, large tumors) or imaging artifacts. To rigorously assess this, we simulated prior degradation by applying random morphological deformations to the input lung masks (erosion/dilation of k pixels) and adding Gaussian noise to the vesselness maps. Table 3 summarizes the results. Remarkably, even under “High” degradation levels (10-pixel deformation, mimicking severe segmentation failure), the performance drop is minimal (< 2.6%). This resilience suggests that the Structure-Aware Prompt Generator (SAPG) does not treat priors as ground truth but rather as soft guidance tokens. The attention mechanism in the decoder learns to effectively filter out noise and cross-reference the priors with the rich semantic features from the image encoder, thereby maintaining robust performance even when external inputs are imperfect.
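For concreteness, the degradation protocol can be sketched as follows. The cross-shaped structuring element, the 50/50 erosion-vs-dilation choice, and the clipping of the noisy vesselness map to [0, 1] are illustrative assumptions, not the exact experimental settings.

```python
import numpy as np

def _dilate(mask, k):
    """Binary dilation by k pixels via shifted unions (cross-shaped element).
    Note: np.roll wraps at the border, so this assumes the mask does not
    touch the volume edge within k pixels."""
    out = mask.copy()
    for _ in range(k):
        grown = out.copy()
        for axis in range(out.ndim):
            for shift in (-1, 1):
                grown |= np.roll(out, shift, axis=axis)
        out = grown
    return out

def degrade_priors(lung_mask, vesselness, k=10, noise_sigma=0.2, seed=0):
    """Simulate imperfect priors: random erosion or dilation of the organ
    mask plus Gaussian noise on the vesselness map, clipped back to [0, 1]."""
    rng = np.random.default_rng(seed)
    if rng.random() < 0.5:
        mask = _dilate(lung_mask.astype(bool), k)                 # dilation
    else:
        mask = ~_dilate(~lung_mask.astype(bool), k)               # erosion
    noisy = np.clip(vesselness + rng.normal(0.0, noise_sigma, vesselness.shape), 0.0, 1.0)
    return mask, noisy
```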
We further investigated the optimal representation for boundary cues. While binary edge detectors like the Canny filter are commonly used in computer vision, medical lesions often exhibit fuzzy, gradual transitions rather than sharp step-edges. Table 4 compares our soft Gradient Magnitude prior against a standard Canny edge prior. The results confirm that the soft gradient representation outperforms the binary approach by 1.8% in Dice and reduces HD95 by 1.3 mm. We attribute this to the fact that soft gradient maps are fully differentiable and preserve the magnitude of intensity changes, providing the network with continuous confidence scores regarding boundary existence. In contrast, binary edges discard subtle gradient information, leading to information loss in low-contrast regions typical of ground-glass opacities.
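A minimal NumPy sketch of such a soft gradient-magnitude prior is shown below; the max-normalization to [0, 1] is our assumption, since the paper does not specify the scaling.

```python
import numpy as np

def gradient_magnitude_prior(slice_hu):
    """Soft boundary prior: normalized gradient magnitude of a CT slice.
    Unlike a binary Canny map, every pixel keeps a continuous confidence
    value, so fuzzy lesion boundaries (e.g., ground-glass opacities) are
    not discarded by a hard threshold."""
    gy, gx = np.gradient(slice_hu.astype(np.float32))
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-8)   # scale to [0, 1] (assumed normalization)
```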

Cross-organ generalization
We evaluated the transferability of StructSAM by training on lung nodules (LIDC-IDRI) and directly testing on unseen anatomical structures (Kidney and Pancreas) without fine-tuning. This rigorous setting assesses whether the learned structural priors are generic.
Table 5 reports the performance of models trained on LIDC-IDRI lung nodules when applied to kidney tumor segmentation. This represents a severe domain shift involving different organ shapes and tissue textures. While standard baselines struggle (e.g., SAM drops to 55.4% Dice), StructSAM maintains robust performance with a Dice score of 70.5%. This indicates that the learned structural priors—such as the correlation between gradient boundaries and mask edges—are transferable logic that applies beyond the thoracic cavity.
To further validate universality beyond lung and kidney, we extended our evaluation to the MSD Pancreas dataset (Table 5). The pancreas is notoriously difficult to segment due to its irregular shape, small size, and low contrast against retroperitoneal fat. To adapt StructSAM, we simply replaced the lung-specific prior with a coarse organ mask generated via TotalSegmentator, leaving the rest of the architecture unchanged. StructSAM adapted seamlessly to this new domain, achieving a Dice score of 83.1%, significantly outperforming the foundation model baseline MedSAM (79.8%) and surpassing the strong medical baseline nnU-Net. These results provide compelling evidence that StructSAM is not merely a lung-specific tool but a generalized structure-aware framework capable of handling diverse soft-tissue organs by leveraging appropriate anatomical priors.

Test-time adaptation and robustness
Table 6 evaluates the robustness of StructSAM when directly transferred to unseen domains without retraining. Enabling test-time prompt refinement (TPR) improves Dice to 76.8% and reduces HD95 by over 2 mm on the LIDC-IDRI transfer task, demonstrating that lightweight normalization updates can effectively align feature statistics across domains.
A similar trend is observed in the more challenging KiTS19 setting, where cross-organ generalization is required. As illustrated in Fig. 1, StructSAM with TPR consistently outperforms baselines and the non-adapted version. Crucially, the violin plots in Fig. 2 reveal that TPR significantly reduces the variance in HD95. While baseline distributions exhibit heavy tails—indicating frequent catastrophic failures—the TPR-adapted model yields a tighter, more compact distribution, ensuring more reliable boundaries in clinical practice.
We further analyzed the stability of this adaptation. Figure 3 presents the sensitivity of the model to the entropy weight λ. The results show that performance remains stable across a broad range of weights (e.g., λ ∈ [0.1, 0.4]), confirming that StructSAM is not hypersensitive to hyperparameter selection and can be robustly deployed without extensive tuning. These quantitative and qualitative results confirm that TTA complements the design of StructSAM by mitigating domain shift at deployment.
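The role of the entropy weight λ can be illustrated with a minimal sketch of the entropy-minimization objective commonly used in test-time adaptation. The binary-entropy form below is our assumption; the paper does not print the TPR objective.

```python
import numpy as np

def entropy_tta_loss(prob_fg, lam=0.2):
    """Test-time adaptation objective sketch: mean Shannon entropy of the
    per-pixel foreground probabilities, scaled by the entropy weight lam.
    Minimizing it (with respect to normalization/prompt parameters only)
    sharpens predictions on the unlabeled target domain."""
    p = np.clip(prob_fg, 1e-7, 1.0 - 1e-7)      # avoid log(0)
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return lam * entropy.mean()
```

Confident predictions (probabilities near 0 or 1) yield a small loss, while uncertain ones (near 0.5) are penalized; λ then trades this sharpening term off against prediction stability, consistent with the flat sensitivity region reported above.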

Qualitative results and case studies
Figure 4 illustrates representative lung nodule segmentation results. We present two challenging scenarios involving small nodules located near the chest wall/pleura. As observed in the second row, MedSAM completely fails to detect the lesion (false negative), producing an empty mask despite the presence of the nodule. Standard SAM (first row), while detecting the target, generates a distorted, hook-like shape that lacks anatomical plausibility. TransUNet successfully localizes the lesions but tends to significantly under-segment the nodule area. In contrast, StructSAM achieves precise alignment with the nodule boundaries in both cases. By leveraging structural priors, our method avoids the shape distortion seen in SAM and the detection failures observed in MedSAM, demonstrating superior robustness in low-contrast settings.
Across all datasets, the qualitative evidence corroborates the quantitative findings. StructSAM consistently provides masks that are closer to the ground truth, particularly in challenging scenarios where other methods struggle with boundary ambiguity or domain mismatch. These visual comparisons highlight the practical value of the proposed design: the framework not only improves segmentation accuracy but also enhances robustness to clinical variability, a crucial property for real-world deployment in multi-center and multi-organ settings.

Complexity and efficiency analysis
An important consideration for clinical deployment is whether accuracy gains are achieved at the expense of efficiency. To this end, we systematically analyze the computational complexity and runtime behavior of StructSAM, comparing it with representative baselines. We report results in terms of trainable parameter counts, FLOPs per forward pass, GPU memory usage, inference latency, and throughput (volumes per second). We further evaluate how key hyperparameters (e.g., token pooling size, axial window length, and LoRA rank) influence the trade-off between accuracy and efficiency.
We visualize these efficiency metrics in Figs. 5–11. First, as shown in Fig. 5, the number of trainable parameters and FLOPs scale linearly with LoRA rank, yet StructSAM remains significantly lighter than full fine-tuning. Second, regarding runtime latency, Fig. 6 illustrates that inference time grows only marginally with larger pooling sizes, demonstrating that the 3D-Aware Inter-slice Aggregator is efficient in practice. Third, in terms of memory consumption (Fig. 7), GPU usage increases moderately with axial window length, but the model still fits comfortably within a standard 40GB GPU.
In terms of runtime performance, Fig. 8 confirms that throughput remains above 20 volumes per second across all tested configurations, verifying the model’s real-time capability. Crucially, StructSAM achieves a superior Pareto frontier compared to baseline methods. As visualized in Fig. 8 (and FLOPs analysis in Fig. 9), our method consistently offers higher Dice scores for similar or lower computational budgets. Furthermore, as noted in Section 2.6 (see Fig. 10), the optional test-time prompt refinement adds negligible overhead while preserving this efficiency advantage. Finally, the sensitivity analysis in Fig. 11 indicates that StructSAM maintains stable performance on cross-dataset evaluations under varying parameter settings, underscoring its robustness for real-world usage.
Overall, the results demonstrate that StructSAM is not only more accurate but also parameter- and runtime-efficient, making it a practical candidate for deployment in clinical pipelines where both precision and speed are critical.

Embedding analysis and score distributions
To complement quantitative metrics, we further investigate the distributional properties of case-wise performance and the geometry of feature embeddings. Figure 12 provides a complementary perspective: a violin-box overlay of Dice scores across methods. This analysis serves to answer two questions: how consistent are different methods across individual cases, and how well do their learned features align with ground-truth structures?
The violin plots in Fig. 12 highlight not only the average performance but also its variability across cases. SAM and MedSAM display wide distributions with heavy tails, reflecting inconsistent predictions that vary strongly depending on lesion size, location, and contrast. SwinUNETR narrows the variance but still exhibits outliers where performance deteriorates sharply. StructSAM without test-time refinement achieves higher median Dice with reduced spread, while the full StructSAM with TPR yields the tightest distribution and highest median. This indicates that our design not only improves accuracy but also enhances reliability, which is crucial in clinical practice where case-to-case variability is inevitable.

Can structure-aware prompts solve domain shift in medical segmentation?
One of the central motivations behind StructSAM is the hypothesis that explicitly injecting anatomical priors can mitigate the performance degradation observed under domain shift. Our cross-dataset experiments suggest that structural prompts indeed provide a more stable anchor compared to purely data-driven adaptation. For example, when transferring from LIDC-IDRI to an unseen target cohort, StructSAM maintains a Dice score above 74%, whereas SAM and MedSAM drop below 67%. This indicates that prompts encoding shape and boundary information are less sensitive to changes in intensity distribution or scanner-specific variations.
However, the question remains whether structural prompting alone is sufficient to handle all forms of domain shift. Distribution shifts in texture, noise characteristics, or even annotation styles may still confound the model. Moreover, the choice of priors matters: while lung lobes or vesselness maps provide useful guidance for thoracic CT, analogous priors may be less obvious for other organs or modalities such as MRI. Thus, while our findings highlight the promise of structure-aware prompting, they also underscore the need for complementary strategies, including self-supervised domain adaptation and continual test-time refinement, to achieve reliable out-of-distribution performance.

Is parameter efficiency enough for clinical deployment?
A recurring theme in foundation model adaptation is the pursuit of parameter-efficient fine-tuning. StructSAM follows this trend by updating less than 5% of SAM’s parameters while still achieving state-of-the-art results. This design yields tangible benefits: lower memory consumption, faster training, and easier deployment across institutions with limited compute. Our efficiency analysis confirms that StructSAM offers a favorable accuracy-efficiency trade-off compared to both traditional CNNs and full fine-tuning approaches.
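The parameter arithmetic behind this trade-off is easy to verify: for a single d_in × d_out linear layer, a rank-r LoRA adapter trains only r·(d_in + d_out) extra weights. The 1024-dimensional layer and rank 8 below are hypothetical values chosen for illustration, not StructSAM's actual configuration.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters a rank-r LoRA adapter adds to one linear layer:
    a (d_in x r) down-projection plus an (r x d_out) up-projection."""
    return rank * (d_in + d_out)

# Hypothetical ViT-style 1024 x 1024 projection with rank-8 LoRA.
full_weight = 1024 * 1024               # 1,048,576 frozen parameters
adapter = lora_params(1024, 1024, 8)    # 16,384 trainable parameters
fraction = adapter / full_weight        # well under a 5% trainable budget
```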
Yet, parameter efficiency alone may not be sufficient for real-world clinical integration. Hospitals often face constraints not only in computational resources but also in reliability, interpretability, and integration into existing workflows. For example, a model that is efficient but occasionally fails catastrophically on rare cases may still be impractical in high-stakes environments. Furthermore, regulatory approval processes require extensive validation that goes beyond computational efficiency. Thus, the question should not be whether a method is parameter-efficient, but whether it balances efficiency with robustness, interpretability, and clinical usability. Our findings indicate that StructSAM makes progress toward this balance, but a complete solution will require broader system-level considerations.

Discussion

The findings of this study shed light on several critical aspects of adapting large-scale foundation models to medical imaging, with a particular emphasis on robust volumetric segmentation. Our results demonstrate that SAM, while remarkably versatile in natural image domains, encounters substantial difficulties when applied directly to thoracic CT. These shortcomings are not merely numerical but clinically significant: failure to capture subtle nodule margins or tumor extensions can directly affect staging accuracy and treatment planning in lung cancer management. By integrating structure-aware prompting and 3D aggregation, StructSAM consistently mitigates these limitations, achieving more reliable delineation of both small pulmonary nodules and complex soft-tissue structures.
A crucial contribution of this work is the translation of technical metric improvements into tangible clinical utility. Reviewers often question whether a few percentage points in Dice truly matter. Our results show that StructSAM significantly reduces the 95% Hausdorff Distance (HD95) to 7.8 mm and the Average Surface Distance (ASD) to 2.41 mm. In the context of Stereotactic Body Radiation Therapy (SBRT) for lung cancer, reducing HD95 is directly linked to the precision of the Clinical Target Volume. A lower HD95 allows clinicians to define tighter planning margins, thereby maximizing the dose to the tumor while sparing healthy lung parenchyma and reducing toxicity. Furthermore, the improved volumetric consistency evidenced by lower ASD ensures more reliable longitudinal tracking of tumor burden, which is essential for accurate response assessment under RECIST criteria.
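For readers less familiar with the boundary metrics discussed here, HD95 between two binary masks can be sketched in NumPy as follows. This brute-force version measures distances between all mask voxels rather than extracted surfaces, a simplification of the standard surface-based definition, and reports the result in voxel units (multiply by the voxel spacing for millimetres).

```python
import numpy as np

def hd95(mask_a, mask_b):
    """95th-percentile symmetric Hausdorff distance between two binary
    masks. Brute-force over all mask voxels: suitable only for small
    masks, not an optimized surface-distance implementation."""
    pts_a = np.argwhere(mask_a)
    pts_b = np.argwhere(mask_b)
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    d_ab = d.min(axis=1)   # each point in A to its nearest point in B
    d_ba = d.min(axis=0)   # and vice versa
    return max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))
```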
A key observation is that StructSAM improves not only in-domain performance but also cross-dataset and cross-organ generalization. In clinical practice, models are frequently confronted with domain shifts caused by differences in scanners, acquisition protocols, and patient demographics. Baseline methods, including both conventional CNNs and transformer-based architectures, often degrade sharply in such settings. In contrast, StructSAM maintains robust performance across LIDC-IDRI, KiTS19, and the newly added MSD Pancreas dataset. The successful adaptation to the pancreas—an organ with distinct morphology and lower contrast than the lung—validates that explicit structural cues (e.g., gradient boundaries and organ masks) act as stable anchors even when pixel-level distributions shift. This finding highlights an important conceptual point: foundation model adaptation should not be regarded solely as data alignment, but as a process of structural alignment, where anatomical plausibility provides the necessary constraints for clinical reliability.
Efficiency is another crucial factor. Traditional fine-tuning approaches demand large-scale retraining and substantial computational budgets, which can be prohibitive in healthcare settings with limited resources. StructSAM, by contrast, leverages PEFT and lightweight architectural modules. Our complexity analysis demonstrates that these modifications incur only modest overhead while delivering significant accuracy gains. This balance between efficiency and accuracy is particularly relevant in large-scale screening programs, where rapid throughput and low latency are necessary to process high volumes of CT scans in a cost-effective manner.
Despite these advances, limitations remain. StructSAM may still underperform in extremely low-contrast regions or in cases where lesions abut major vessels or pleural boundaries, leading to ambiguous delineations. While structure-aware prompting alleviates part of this issue, residual errors persist. Furthermore, although the method shows promising cross-organ generalization, transferability is not yet fully equivalent to that of models explicitly trained from scratch on the target anatomy. Another limitation is the reliance on handcrafted priors such as edge maps or coarse masks, which may depend on the availability of robust pre-processing tools (e.g., TotalSegmentator) across different institutions.
Looking ahead, several directions merit further investigation. Incorporating multimodal priors, such as radiology reports or pathological data, could enrich the prompt space and reduce reliance on handcrafted features. Self-supervised learning of structural priors represents another promising avenue to achieve greater autonomy from external pre-processing. Moreover, integrating StructSAM with continual TTA could enable models to evolve dynamically in longitudinal monitoring scenarios, such as follow-up of lung cancer patients undergoing therapy. Finally, extending the framework beyond CT to modalities such as MRI or PET, and applying it to broader oncology tasks including treatment response prediction, could further validate the general utility of structure-aware prompt adaptation.
In this work, we introduced StructSAM, a framework tailored to the challenges of volumetric medical segmentation. By embedding anatomical priors into the prompt pathway and incorporating a lightweight 3D inter-slice aggregator, StructSAM addresses the core limitations of foundation models in low-contrast CT imaging. Extensive evaluations on LIDC-IDRI, KiTS19, and MSD Pancreas datasets demonstrate that our approach not only achieves state-of-the-art accuracy but also exhibits robust cross-organ generalization. These findings highlight that incorporating structural plausibility—rather than relying solely on pixel-level data alignment—is a critical step toward deployable, trustworthy medical AI systems.

Methods
We introduce StructSAM, a structure-aware prompt adaptation framework that injects anatomical priors into the prompt pathway of SAM while enabling volumetric reasoning for CT. The method is lightweight (parameter-efficient), modular, and generalizable to diverse lesion types. Figure 13 provides an overview.

Problem definition and notation
Given a CT volume with slice index t ∈ {1, …, T}, our goal is to predict a volumetric lesion mask Y ∈ {0, 1}^{H×W×T}. SAM consists of an image encoder Eθ, a prompt encoder Pθ, and a mask decoder Dθ. StructSAM augments SAM with three modules: (i) a Structure-Aware Prompt Generator Gϕ producing dense/sparse prompts from anatomical priors; (ii) a 3D-Aware Inter-slice Aggregator Aψ that fuses contextual features across neighboring slices; and (iii) a Domain-Aware Prompt Optimizer for parameter-efficient adaptation.
For a target slice t and its neighborhood N(t) = {t − ℓ, …, t + ℓ}, SAM yields 2D embeddings Xk = Eθ(Ik) for k ∈ N(t), while Gϕ produces prompt tokens P^s_t (sparse) and P^d_t (dense). The aggregator outputs Zt = Aψ({Xk : k ∈ N(t)}). The final prediction is Ŷt = Dθ(Zt, Pθ(P^s_t, P^d_t)).

Structure-Aware Prompt Generator (SAPG)
Medical CT presents low contrast, fuzzy boundaries, and strong anatomical regularities. To capture these structural cues, we construct multi-channel prior maps using generic, reproducible operators:

Lung/Organ Region (Lt): For lung experiments, we employ a pre-trained U-Net (R231) from the `lungmask` library to generate binary lung masks. For pancreas generalization, we utilize TotalSegmentator to obtain coarse organ masks.

Vesselness (St): We compute the multi-scale Frangi vesselness filter with scales σ∈ [1, 2, 3] to highlight vascular structures, which often serve as critical exclusion zones for lesion delineation.

Edge Map (Et): We apply the gradient magnitude operator on Gaussian-smoothed images (σ = 1.0) to capture soft boundary evidence, which we found superior to binary Canny edges (see Ablation).

These maps are channel-wise standardized to form the input tensor Qt.
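The prior construction above can be sketched with standard operators — `frangi` from scikit-image for the vesselness channel and a Gaussian gradient magnitude for the soft edge channel. This is a minimal sketch, not the authors' code; the function name and the 1e-6 stabilizer are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_gradient_magnitude
from skimage.filters import frangi

def build_prior_tensor(ct_slice, lung_mask):
    """Stack the three prior channels (organ region, vesselness, soft
    edges) into the input tensor Q_t, then standardize channel-wise.

    ct_slice: 2D float array; lung_mask: binary 2D array of same shape.
    """
    vessel = frangi(ct_slice, sigmas=[1, 2, 3])        # vesselness S_t
    edge = gaussian_gradient_magnitude(ct_slice, 1.0)  # soft edges E_t
    q = np.stack([lung_mask.astype(np.float32), vessel, edge], axis=0)
    # channel-wise standardization (zero mean, unit variance per channel)
    mu = q.mean(axis=(1, 2), keepdims=True)
    sd = q.std(axis=(1, 2), keepdims=True) + 1e-6
    return (q - mu) / sd
```

In practice the lung mask would come from `lungmask` (R231) as described above; here any binary array stands in for it.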
SAM’s prompt pathway supports a dense mask embedding. We convert Qt to a soft prior mask Mt ∈ [0, 1]^{H×W} via a shallow CNN hϕ (three 3 × 3 conv layers with GroupNorm and GELU): Mt = σ(hϕ(Qt)), where σ(⋅) denotes the logistic function. The dense prompt token is then obtained by SAM’s built-in prompt encoder on Mt and concatenated to the prompt tokens.
To stabilize localization, we synthesize K plausible positive points and one enclosing box directly from Mt. Let 𝒞t be the set of connected components of the thresholded mask {Mt > τ}. For the largest component, we place K points by k-means over high-confidence pixels (centroids as positive prompts) and set a tight bounding box around it. This yields the sparse prompt set P^s_t, fully automatic and differentiable through Mt (the selection is treated with straight-through estimation during backprop).
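A minimal numpy sketch of this prompt synthesis (threshold τ, largest connected component, k-means centroids, tight box). The helper name is hypothetical and the straight-through gradient trick is omitted:

```python
import numpy as np
from scipy.ndimage import label

def synthesize_prompts(m, tau=0.5, k=4, iters=10, rng=None):
    """Derive K positive point prompts and one enclosing box from a
    soft prior mask M_t, as described in the text (a sketch)."""
    rng = rng or np.random.default_rng(0)
    comps, n = label(m > tau)                      # connected components
    if n == 0:
        return None, None
    sizes = np.bincount(comps.ravel())[1:]
    largest = comps == (np.argmax(sizes) + 1)
    # high-confidence pixels of the largest component (top-5% values)
    pts = np.argwhere(largest & (m >= np.quantile(m[largest], 0.95)))
    k = min(k, len(pts))
    # Lloyd's k-means over pixel coordinates -> K positive points
    centers = pts[rng.choice(len(pts), k, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((pts[:, None] - centers[None]) ** 2).sum(-1)
        assign = np.argmin(d2, axis=1)
        new = []
        for j in range(k):
            sel = pts[assign == j]
            new.append(sel.mean(axis=0) if len(sel) else centers[j])
        centers = np.array(new)
    ys, xs = np.where(largest)
    box = (ys.min(), xs.min(), ys.max(), xs.max())  # tight bounding box
    return centers, box
```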
All dense and sparse prompts are embedded by Pθ into token space. We further append a small learnable structure token s_t = MLPϕ(stats(Mt)), where stats(⋅) collects global descriptors (area ratio, compactness, eccentricity). The final prompt sequence is P_t = [P^s_t; P^d_t; s_t].

3D-Aware Inter-slice Aggregator (3D-AIA)
To limit cost, we mean-pool Xk into a compact grid of n tokens per slice, forming a token sequence T of w·n tokens across the window. We add learnable axial position embeddings rk depending on the slice offset k − t.
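The per-slice pooling can be illustrated as an adaptive block mean (a sketch; SAM ViT-B produces 64 × 64 feature grids, which are reduced to n = 14 × 14 tokens here, though any input size works):

```python
import numpy as np

def pool_tokens(x, grid=14):
    """Adaptively mean-pool a (H, W, d) feature map into grid*grid
    tokens, mirroring the per-slice pooling used before 3D-AIA."""
    tokens = []
    for r in np.array_split(x, grid, axis=0):      # split rows
        for c in np.array_split(r, grid, axis=1):  # split columns
            tokens.append(c.mean(axis=(0, 1)))     # block mean -> (d,)
    return np.stack(tokens)                        # (grid*grid, d)
```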
We employ a 2-layer lightweight Transformer with cross-slice multi-head self-attention (8 heads, embedding dimension d = 256) and a 1D relative bias along the axial index. Let MSAψ denote attention and FFNψ the feed-forward network. One block is T′ = T + MSAψ(LN(T) + R), T″ = T′ + FFNψ(LN(T′)), where R collects the repeated rk for each token. We slice back the tokens for the center slice and upsample by a light deconvolution Up to match the encoder resolution. A gating map gt = σ(Up(CLS(T″))) modulates the original feature Xt: Zt = (1 − α) Xt + α (gt ⊙ Xt), where ⊙ is element-wise product and α ∈ [0, 1] controls the fusion strength.
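For intuition, a single-head version of one cross-slice attention step with the 1D relative axial bias can be written as below. This is a simplified sketch with assumed weight shapes; the actual block uses 8 heads, LayerNorm, and an FFN:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention_block(tokens, slice_ids, rel_bias, wq, wk, wv):
    """One simplified cross-slice self-attention step of 3D-AIA.

    tokens: (N, d) pooled tokens from the whole axial window;
    slice_ids: (N,) axial index of each token;
    rel_bias: (2L+1,) learnable bias table indexed by axial offset.
    """
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # relative axial bias: offset (k - t) indexes the bias table
    off = slice_ids[:, None] - slice_ids[None, :]
    scores = scores + rel_bias[off + rel_bias.shape[0] // 2]
    return tokens + softmax(scores) @ v  # residual connection
```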

Domain-aware prompt PEFT
We keep the Image Encoder Eθ and Prompt Encoder Pθ frozen to preserve foundation knowledge. To efficiently adapt the Mask Decoder Dθ to medical distributions, we introduce Low-Rank Adaptation (LoRA) modules. Specifically, we freeze the original weights of the decoder’s self-attention layers but inject trainable rank decomposition matrices into the Query (Wq) and Value (Wv) projections. The forward pass becomes W′x = Wx + BAx, where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×d} are low-rank matrices with r = 8, initialized so that BA = 0 at the start of training to ensure stability. The Key projections and FFN layers remain frozen. This strategy retains high parameter efficiency (< 5% trainable parameters) while allowing the decoder to realign its attention mechanisms based on the structural prompts provided by SAPG.
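The LoRA update can be sketched as a frozen weight plus a zero-initialized low-rank path, so the adapted layer reproduces the frozen layer exactly at the start of training (a numpy sketch; the class name and init scale are assumptions):

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update:
    W'x = Wx + B(Ax), with B zero-initialized so the adapted layer
    starts out identical to the frozen one."""

    def __init__(self, w, r=8, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = w.shape
        self.w = w                                       # frozen
        self.a = rng.standard_normal((r, d_in)) * 0.01   # trainable
        self.b = np.zeros((d_out, r))                    # trainable, zero init

    def __call__(self, x):
        return self.w @ x + self.b @ (self.a @ x)
```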

Training objectives
We use the standard Dice loss and binary cross-entropy. For brevity, sums are over all voxels of a mini-batch volume.
To discourage fragmented or implausible shapes, we adopt a soft skeleton overlap term (clDice-style) and a boundary length penalty computed on the signed-distance transform (SDT). Let S(⋅) be a differentiable soft-skeleton operator and TP(⋅, ⋅) a soft true-positive overlap; the skeleton term is L_skel = 1 − 2 · TP(S(Ŷ), Y) · TP(S(Y), Ŷ) / [TP(S(Ŷ), Y) + TP(S(Y), Ŷ)]. Let Φ(Ŷ) be the SDT of the prediction with zero level-set at the boundary. A curvature-aware perimeter penalty is L_perim = Σx δε(Φ(x)) |∇Φ(x)|, where δε is a smoothed Dirac delta concentrating the sum near the predicted boundary.
Adjacent predictions should vary smoothly along the axial dimension after accounting for small deformations. We estimate a simple 2D displacement field Δt→t+1 via a frozen, light-weight registration head on adjacent slices (no ground truth required) and minimize the warped difference L_axial = Σt ‖Ŷt+1 − W(Ŷt; Δt→t+1)‖₁, where W(⋅; Δ) warps a slice by the estimated displacement field.
To mitigate cross-center shift, we impose a maximum mean discrepancy (MMD) term between the prompt-conditioned decoder queries extracted from different domains (mini-batch level): L_mmd = MMD²(Q_A, Q_B) = E[k(q, q′)] over Q_A + E[k(q, q′)] over Q_B − 2 E[k(q, q′)] across Q_A × Q_B, where k is a characteristic kernel and Q_A, Q_B denote query sets from two domains in the mini-batch.
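A biased estimator of the squared MMD can be computed as the difference of mean kernel similarities within and across the two query sets. The RBF kernel choice is an assumption, since the text does not specify one:

```python
import numpy as np

def mmd2_rbf(x, y, gamma=1.0):
    """Biased estimator of squared MMD between two token sets
    x: (n, d) and y: (m, d), using an RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None] - b[None]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```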
The overall objective is L = λ1 L_Dice + λ2 L_BCE + λ3 L_skel + λ4 L_perim + λ5 L_axial + λ6 L_mmd. We set λ1 = 1.0, λ2 = 0.5, λ3 = 0.5, λ4 = 0.1, λ5 = 0.1, λ6 = 0.1 by default and tune them in ablation.
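The two primary terms can be sketched as follows; the mapping of the first two weights to Dice (1.0) and BCE (0.5) follows the order in which the losses are introduced and is therefore an assumption, and the structural, axial, and MMD terms are omitted:

```python
import numpy as np

def dice_bce(pred, target, eps=1e-6):
    """Soft Dice loss plus weighted binary cross-entropy, i.e.
    lambda1 * L_Dice + lambda2 * L_BCE with lambda1=1.0, lambda2=0.5."""
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    p = np.clip(pred, eps, 1 - eps)                 # numerical safety
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()
    return 1.0 * dice + 0.5 * bce
```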

Inference and test-time prompt refinement
At inference, SAPG generates prompts automatically from priors, hence the pipeline is fully automatic. For robustness, we optionally enable a single-epoch test-time refinement that updates only the affine scale/shift of LayerNorm in Aψ and the bias of gϕ using entropy minimization per case: min Σx H(Ŷ(x)), with H(p) = −p log p − (1 − p) log(1 − p). No backbone weight is changed; the procedure is fast and memory-friendly.
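The refinement objective is just the mean binary entropy of the predicted foreground probabilities; a sketch of the objective only (the small parameter-update loop is omitted):

```python
import numpy as np

def entropy_objective(prob, eps=1e-6):
    """Mean binary entropy of per-voxel foreground probabilities,
    minimized during the optional test-time refinement. Confident
    predictions (near 0 or 1) yield low entropy."""
    p = np.clip(prob, eps, 1 - eps)
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)).mean())
```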

Complexity and parameter efficiency
Let n be the number of pooled tokens per slice, d the channel dimension, and w = 2ℓ + 1 the axial window. The self-attention cost of 3D-AIA scales as O(w²n²d), but with small n (e.g., 196 after pooling) and w ≤ 7, the overhead remains modest. LoRA rank r contributes O(rd) parameters per adapted projection. In our default configuration, the total trainable parameters are < 5% of SAM, while throughput remains close to the frozen baseline thanks to pooled tokens and shallow adapters.
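These counts are easy to verify numerically. The helpers below assume d = 256, r = 8, and Q/V adaptation in two decoder attention layers (4 projections total), matching the configuration stated elsewhere in this section:

```python
def lora_params(d=256, r=8, n_proj=4):
    """Trainable LoRA parameters: each adapted projection adds two
    rank-r matrices (B: d x r and A: r x d), i.e. O(rd) each."""
    return n_proj * 2 * r * d

def aia_attention_cost(n=196, w=5, d=256):
    """Self-attention multiply count of 3D-AIA, O(w^2 n^2 d): each of
    the w*n tokens attends to all w*n tokens in d dimensions."""
    return (w * n) ** 2 * d
```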

Implementation details
We use the SAM ViT-B image encoder, kept frozen. SAPG uses three 3 × 3 conv layers (channels C→32→32→1) to produce Mt. For sparse prompts, K = 4 points are placed by k-means over the top-5% pixels of Mt; the box is the tight bounding rectangle of the largest component. 3D-AIA adopts two Transformer blocks (d = 256, n = 14 × 14 pooled tokens per slice, w = 5).
LoRA is applied with rank r = 8 exclusively on the Query and Value projections of the first two attention layers in the Mask Decoder. Training uses the AdamW optimizer with learning rate 2 × 10⁻⁴ for the new modules {ϕ, ψ} and 5 × 10⁻⁵ for LoRA parameters, with a weight decay of 1 × 10⁻⁴ and a cosine annealing schedule. We use a batch size of 2 volumes with random axial crops of T = 48 slices. Data augmentation includes intensity jitter, small rotations (±7°), and elastic deformations respecting lung boundaries. All experiments were conducted on NVIDIA A100 GPUs using PyTorch 2.0.
When transferring to other organs (e.g., Pancreas), only the channels of Qt are redefined (e.g., using a generic body mask instead of a lung mask), while the remainder of the StructSAM architecture remains unchanged. This preserves the cross-domain design philosophy and facilitates reuse across diverse medical segmentation tasks.

Design rationale and insights
The design of StructSAM is guided by both the clinical challenges of lung cancer CT imaging and the limitations of current foundation model adaptations. The structure-aware prompt generator directly addresses the ambiguity of low-contrast nodules and tumors by injecting anatomical priors, thereby providing stable guidance for SAM’s mask decoder. The 3D inter-slice aggregator ensures volumetric continuity, a critical requirement in thoracic CT where slice-by-slice inconsistencies may lead to clinically misleading lesion volumes. Finally, PEFT balances performance with scalability, enabling deployment across institutions with heterogeneous computational resources.
Taken together, these design choices reflect a general principle: adapting foundation models for clinical use requires not only data alignment but also structural alignment. By embedding domain knowledge into the prompt space and integrating volumetric reasoning, StructSAM achieves a synergy between general-purpose representation power and disease-specific constraints. This rationale underpins the experimental results that follow, where we demonstrate the impact of each component on both accuracy and robustness.

Methodological discussion
While StructSAM substantially improves lung cancer lesion segmentation, several methodological considerations warrant discussion. First, the reliance on handcrafted anatomical priors such as edge or vesselness maps may limit portability across imaging centers that use different preprocessing pipelines. Future work could explore learning such priors in a self-supervised manner directly from large-scale unlabeled CT data. Second, although the 3D inter-slice aggregator enhances volumetric consistency, more advanced mechanisms such as hybrid transformer-convolutional designs or spatio-temporal state space models may further improve contextual reasoning. Finally, PEFT, though effective, does not fully replace the need for long-term continual adaptation. Integrating StructSAM with online test-time refinement could provide a more complete solution for deployment in dynamic clinical environments.
To sum up, this methodological discussion emphasizes that StructSAM should be viewed not as an endpoint but as a step toward a broader paradigm: embedding structural plausibility into foundation model adaptation. This perspective will inform both the evaluation presented in the next section and future research directions.

Ethics approval and consent to participate
This study makes use of publicly available datasets that have received prior ethical approval and informed consent from the respective institutions. No new human or animal data were collected.

Source: PubMed Central (JATS). Licensing follows the original publisher's policy — please cite the original article when quoting.
