Leveraging pretrained vision-language model for enhanced breast cancer diagnosis with multi-view mammography.
Automatic PICO extraction (heuristic, confidence 2/4)
P · Population (target patients / population)
This study highlights the potential of applying fine-tuned vision-language models for developing multi-view, image-text-based CAD schemes for breast cancer.
I · Intervention (intervention / procedure)
Not extracted
C · Comparison (control / comparison)
Not extracted
O · Outcome (results / conclusions)
[CONCLUSIONS] The proposed Mammo-CLIP demonstrates superior breast cancer diagnosis performance compared to SOTA methods. This study highlights the potential of applying fine-tuned vision-language models for developing multi-view, image-text-based CAD schemes for breast cancer.
APA
Chen, X., Li, Y., et al. (2026). Leveraging pretrained vision-language model for enhanced breast cancer diagnosis with multi-view mammography. Medical Physics, 53(1), e70261. https://doi.org/10.1002/mp.70261
MLA
Chen, X., et al. "Leveraging Pretrained Vision-Language Model for Enhanced Breast Cancer Diagnosis with Multi-View Mammography." Medical Physics, vol. 53, no. 1, 2026, p. e70261.
PMID
41532302
DOI
10.1002/mp.70261
Abstract
[BACKGROUND] Although fusing information from multiple mammographic views plays an important role in increasing the accuracy of breast cancer detection, developing multi-view mammogram-based computer-aided diagnosis (CAD) schemes still faces major challenges, and no such scheme has been used in clinical practice.
[PURPOSE] To overcome these challenges, we investigate a new approach based on contrastive language-image pre-training (CLIP), which has sparked interest across various medical imaging tasks. The aim is to solve two challenges: (1) effectively adapting the single-view CLIP for multi-view feature fusion, and (2) efficiently fine-tuning this parameter-dense model with limited samples and computational resources.
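For readers unfamiliar with CLIP, its pre-training objective is a symmetric contrastive loss over matched image-text pairs in a batch. The following is a minimal PyTorch sketch of that generic objective, not code from the paper; the function name, the temperature value, and the batch-diagonal pairing convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (batch, dim) embeddings."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```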
[METHODS] We introduce Mammo-CLIP, the first multi-modal framework to process multi-view mammograms and corresponding simple texts. Mammo-CLIP uses an early feature-fusion strategy to learn multi-view relationships among four mammograms acquired from the craniocaudal (CC) and mediolateral oblique (MLO) views of the left and right breasts. To enhance learning efficiency, plug-and-play adapters are added to CLIP's image and text encoders, fine-tuning the model efficiently while limiting updates to about 1% of its parameters. For evaluation, we retrospectively assembled two datasets. The first, comprising 470 malignant and 479 benign cases, was used for few-shot fine-tuning and internal evaluation of Mammo-CLIP via 5-fold cross-validation. The second, including 60 malignant and 294 benign cases, was used to test the generalizability of Mammo-CLIP.
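A hedged sketch of the two components named above, written as generic patterns rather than Mammo-CLIP's actual layers: a residual bottleneck adapter (the frozen backbone stays untouched; only the small adapter trains) and a simple concatenation-based fusion of the four standard views. The class names, the bottleneck width of 64, and fusion-by-concatenation are all assumptions; the paper does not publish this code.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: the only trainable block inside a frozen encoder."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen encoder's features.
        return x + self.up(self.act(self.down(x)))

class MultiViewFusion(nn.Module):
    """Early fusion of four views (L-CC, R-CC, L-MLO, R-MLO) into one case embedding."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(4 * dim, dim)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, 4, dim) per-view features -> (batch, dim) case feature.
        return self.fuse(views.flatten(start_dim=1))
```

Freezing every backbone parameter (`p.requires_grad = False`) and training only the adapters and fusion head is what keeps the trainable budget near the roughly 1% of parameters the abstract reports.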
[RESULTS] Mammo-CLIP outperforms the state-of-the-art (SOTA) cross-view transformer, evaluated using areas under ROC curves (AUC = 0.841 ± 0.017 vs. 0.817 ± 0.012 and 0.837 ± 0.034 vs. 0.807 ± 0.036) on the two datasets, respectively. It also surpasses two previous CLIP-based methods by 20.3% and 14.3% in AUC.
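To make the reported "mean ± std" AUC figures concrete, they can be reproduced from a 5-fold protocol as sketched below. `fit_predict` is a hypothetical callable standing in for fine-tuning a model on the training fold and scoring the test fold; the paper does not describe its evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cross_validated_auc(features: np.ndarray, labels: np.ndarray,
                        fit_predict, n_splits: int = 5) -> str:
    """Report AUC as mean +/- std over stratified folds, as in the abstract."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    aucs = []
    for train_idx, test_idx in skf.split(features, labels):
        # Train on one fold split, score the held-out cases.
        scores = fit_predict(features[train_idx], labels[train_idx],
                             features[test_idx])
        aucs.append(roc_auc_score(labels[test_idx], scores))
    return f"AUC = {np.mean(aucs):.3f} ± {np.std(aucs):.3f}"
```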
[CONCLUSIONS] The proposed Mammo-CLIP demonstrates superior breast cancer diagnosis performance compared to SOTA methods. This study highlights the potential of applying fine-tuned vision-language models for developing multi-view, image-text-based CAD schemes for breast cancer.
Highly cited papers by the same first author (5)
- Rare fusion transcript in a refractory adult T-cell lymphoblastic lymphoma.
- Rabdosin B suppresses proliferation of nonsmall cell lung cancer by regulating the SRC/PI3K/AKT signaling pathway.
- Development of a chemiluminescence immunoassay for proGRP in human serum.
- Genetically encoded biosensors in microbes for Tumor targeting.
- Analysis of discordant results in multi-technique platform-based MRD detection in multiple myeloma and the clinical decision-making dilemma.