
Dataset of artificial breast cancer MRIs produced from unpaired mammograms.

Data in Brief, 2026, Vol. 64, 112406

Muthukumar A, Zapata I, Brooks A


Cite this article

APA: Muthukumar A, Zapata I, Brooks A (2026). Dataset of artificial breast cancer MRIs produced from unpaired mammograms. Data in Brief, 64, 112406. https://doi.org/10.1016/j.dib.2025.112406
PMID: 41561909

Abstract

The development of machine learning models for medical imaging is often constrained by the scarcity of large, paired datasets, particularly in breast cancer diagnostics where mammography and MRI modalities offer complementary diagnostic information. We introduce a dataset of artificial breast cancer MRIs generated through a CycleGAN architecture trained on unpaired cancer mammograms and breast MRI data. This approach addresses the critical gap in open-source paired mammography-MRI datasets by leveraging adversarial learning to establish cross-modal relationships without requiring direct image correspondence. Our methodology builds upon previous multi-modal imaging efforts, employing unpaired translation to establish data pairings and create large-scale training datasets. The generated artificial MRIs offer substantial benefits including enhanced patient privacy protection through synthetic data generation and significant cost reduction potential for MRI acquisition in resource-limited settings. We comprehensively evaluate the fidelity of our artificial MRI dataset against pre-existing tumor detection models. This dataset ultimately supports the development of more generalizable machine learning models for cancer diagnosis and treatment planning, through the use of artificial data augmentation.


1 Value of the Data

•These data provide a curated collection of paired synthetic and clinical breast MRI images, generated using multimodal imaging frameworks, enabling reproducible benchmarking for studies in medical image synthesis and validation.

•Researchers developing or testing foundation models for medical imaging can reuse these data to assess generalization performance, zero-shot segmentation capability, and cross-modality consistency.

•The dataset can serve as a reference standard for future AI-driven diagnostic validation, facilitating transparent comparison between generated synthetic imaging outputs and clinical MRI data.

•The inclusion of artificially paired datasets allows future researchers to develop and train machine learning models that require paired input-output mappings, even in scenarios where true clinical pairs are unavailable.

2 Background
Large and interoperable imaging datasets are foundational for robust medical imaging AI but remain difficult to assemble due to privacy concerns and heterogeneity across institutions and devices [1]. This is especially true for cross-modality tasks, such as mammography and MRI, where open-access paired datasets are not available. To address this problem, generative adversarial networks (GANs), including Cycle-Consistent GANs (CycleGANs), have demonstrated utility for unpaired image-to-image translation [[2], [3], [4], [5]], vendor normalization, contrast synthesis, and high-resolution reconstruction in medical imaging, while preserving anatomy through tailored losses and discriminators [6]. In mammography, GAN-based synthesis has enabled anomaly detection using normal-only training, highlighting the role of synthetic data in augmenting scarce lesion phenotypes and supporting unsupervised paradigms [7,8]. Recent reviews underscore both opportunities and risks of synthetic medical data—privacy protection via de-identification by design, cost efficiency, and expanded training corpora—while cautioning that realism, diversity, and rigorous evaluation are essential for safe deployment [1,[9], [10], [11]].
Recent transformer-based multi-modal MRI segmentation models [12,13] demonstrate the substantial performance gains achievable when complementary imaging modalities are jointly modeled. These advances highlight the value of multi-modal MRI and simultaneously underscore a major practical limitation: high-quality paired datasets are rare, particularly for modalities like mammography and MRI. Synthetic cross-modality generation therefore offers a scalable way to supply missing modalities and enable pipelines in settings where paired acquisitions are not feasible.
This paper introduces a dataset of artificial breast cancer MRIs generated by a CycleGAN trained on unpaired cancer mammograms and breast MRI, operationalizing modality translation to create MRI-like images from mammographic inputs in the absence of paired data. Architectural choices are motivated by prior evidence that CycleGAN with anatomy-preserving objectives (e.g., mutual information loss, localized discriminators) can maintain breast morphology and tissue structures across domains and vendors in unpaired settings [7,8].
By generating artificial MRIs from widely available mammograms, the dataset aims to mitigate contrast and scan-time burdens suggested by synthetic-contrast paradigms and expand access in resource-limited settings. Synthetic imaging can reduce reliance on sharing identifiable clinical data and support FAIR-aligned data growth when paired with transparent documentation and governance, addressing known gaps in mammography datasets [10]. The work frames synthetic imaging as a component of a circular bioeconomy approach [14]—transforming existing clinical imaging streams into training assets while preserving privacy—consistent with the emerging view that responsibly governed synthetic data can catalyze computer vision advances in radiography and MRI, provided rigorous technical and ethical safeguards are in place.

3 Data Description
The CycleGAN architecture design, training procedures, and model optimization protocols have been detailed extensively in our previous work [15]. The trained machine learning model from this prior study was applied to generate artificial breast MRIs from mammographic inputs.
The main tasks this dataset is designed to support are highlighted in the following three goals:

1. To generate a synthetic breast MRI dataset that can be quantitatively validated using both an established foundation model (SAM-Med) and a clinically oriented segmentation framework (MONAI), ensuring that the synthesized images retain anatomically consistent tumor information.

2. To enable image-to-image translation between mammograms and MRIs, using the synthetic MRIs as aligned targets, so that multimodal generative models can be trained despite the absence of naturally paired data in clinical practice.

3. To provide artificial mammogram–MRI “pairs” for developing new multimodal learning frameworks, allowing future models to infer what a patient’s MRI could look like directly from their mammogram. This supports applications such as risk stratification and improved diagnostic decision-support.

3.1 Input data
The source dataset for synthetic MRI generation was the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) [16], publicly available through The Cancer Imaging Archive. CBIS-DDSM comprises 2620 scanned film mammography studies with cases categorized as normal, benign, or malignant, and includes verified pathology data. The dataset contains 753 calcification cases and 891 mass cases, curated for computer-aided detection and diagnosis research [16]. The original DDSM dataset [17], from which CBIS-DDSM is derived, includes women aged 18 to 96 years, with the majority between 40 and 79 years. CBIS-DDSM maintains a similar age distribution, but its exact age range is not explicitly specified in the medical literature. The dataset provides pathologic diagnosis (benign vs. malignant) but does not include detailed tumor staging information such as TNM classification. Our dataset uses only the malignant cases from CBIS-DDSM. The medical literature does not specify race or ethnicity data for subjects in CBIS-DDSM. The images were collected between 1990 and 1999 as part of the original DDSM project; CBIS-DDSM is a curated subset released in 2017. The original data were collected from four U.S. institutions: Massachusetts General Hospital, Wake Forest University, Sacred Heart Hospital, and Washington University in St. Louis. Images were digitized using LUMISYS 85 and LUMISYS 150 laser film scanners [16,17].
The dataset in this paper is a derivative work from CBIS-DDSM and does not constitute an original clinical collection. It uses only the malignant cases from the original CBIS-DDSM and is therefore representative of that portion of the original set. Each mammogram from the CBIS-DDSM dataset was processed through our trained CycleGAN model, with the output being a PNG-formatted artificial MRI image corresponding to the input mammogram. Details on this CycleGAN model are available in our previous paper on the model architecture [15]. The generation process maintained the original CBIS-DDSM naming conventions and organizational structure to facilitate cross-referencing between source mammograms and generated MRIs.
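As a rough sketch of how such a translation step is typically applied, the following minimal example shows the usual normalize-translate-denormalize flow for a grayscale mammogram. This is not the authors' actual pipeline: the trained CycleGAN generator from [15] is not bundled here, so a stand-in identity function takes its place, and the value ranges are assumptions.

```python
import numpy as np

def mammogram_to_artificial_mri(image, generator):
    """Normalize a uint8 mammogram to [-1, 1], translate, map back to [0, 255]."""
    x = image.astype(np.float32) / 127.5 - 1.0
    y = generator(x)                      # CycleGAN G: mammogram -> MRI domain
    return np.clip(np.rint((y + 1.0) * 127.5), 0, 255).astype(np.uint8)

identity = lambda x: x                    # stand-in for the real trained generator
mammo = np.full((4, 4), 200, dtype=np.uint8)
art_mri = mammogram_to_artificial_mri(mammo, identity)
print(art_mri[0, 0])  # 200 (identity round-trips the pixel value)
```

With the real generator substituted for `identity`, the output array would be saved as the PNG artificial MRI under the source mammogram's CBIS-DDSM name.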

3.2 Data records
The synthetic MRI dataset is organized following the hierarchical folder structure of the source CBIS-DDSM dataset, ensuring direct correspondence and traceability between original mammograms and generated artificial MRIs.
The generated dataset maintains the CBIS-DDSM naming convention for seamless cross-referencing. Each patient folder contains:

•Artificial MRI image (PNG format)

•SAM-Med generated tumor highlighting overlay (grayscale)

•SAM-Med generated tumor highlighting overlay (color highlight)

•Bounding box annotation corresponding to the original CBIS-DDSM tumor localization

Images in each patient folder are named following the original CBIS-DDSM naming convention.
Individual folders are structured identically to CBIS-DDSM, with each folder containing the complete set of validation and annotation files for the corresponding synthetic MRI.
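Because the generated dataset mirrors the CBIS-DDSM folder names, pairing a synthetic MRI with its source mammogram reduces to matching same-named case directories. A minimal illustration follows; the case names below are made up for the demo, while real case IDs follow the CBIS-DDSM convention.

```python
import tempfile
from pathlib import Path

def pair_records(mammo_root, synth_root):
    """Map each synthetic-MRI case folder to its same-named source folder."""
    pairs = {}
    for case_dir in sorted(Path(synth_root).iterdir()):
        source = Path(mammo_root) / case_dir.name
        if case_dir.is_dir() and source.is_dir():
            pairs[case_dir.name] = (source, case_dir)
    return pairs

# Tiny self-contained demo with illustrative case names.
root = Path(tempfile.mkdtemp())
for case in ("Mass-Training_P_00001", "Mass-Training_P_00002"):
    (root / "cbis" / case).mkdir(parents=True)
    (root / "synthetic" / case).mkdir(parents=True)
(root / "synthetic" / "Mass-Training_P_00003").mkdir()  # no source -> unpaired

pairs = pair_records(root / "cbis", root / "synthetic")
print(sorted(pairs))  # only the two cases present in both trees
```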
The dataset also contains a CSV file of comparison statistics with the percentage difference between the SAM-Med highlighted difference and the actual bounding region of the tumor. A distribution of the dataset is shown in Fig. 1 in the next section.
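The comparison CSV can be summarized with standard-library tools alone. The column name used below is hypothetical, since the actual header is defined by the released file; the inline sample stands in for the real CSV.

```python
import csv, io, statistics

# Hypothetical header and values; the released CSV defines the real ones.
SAMPLE = """case_id,pct_overlay_diff
Mass-Training_P_00001,0.12
Mass-Training_P_00002,0.05
Mass-Training_P_00003,3.10
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
diffs = [float(r["pct_overlay_diff"]) for r in rows]
print(statistics.median(diffs), max(diffs))  # 0.12 3.1
```

Replacing `io.StringIO(SAMPLE)` with an `open()` call on the released CSV yields the same summary over the full dataset.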

4 Experimental Design, Materials and Methods
4.1 Technical validation
Validation of this dataset uses pre-trained, publicly available models in a zero-shot configuration, not models trained on this dataset. This is intentional: our goal is to show that standard clinical AI tooling can consume the dataset and produce validated outputs, not to optimize task performance. The inference code used for both validation models, MONAI and SAM-Med, is provided in our GitHub repository, referenced in the Code Availability section of this paper.
Validation of the generated artificial MRIs was performed using SAM-Med (Segment Anything Model for Medical Imaging), a foundation model specifically adapted for medical image segmentation tasks [18]. SAM-Med employs a transformer-based architecture trained on diverse medical imaging datasets to perform zero-shot segmentation of anatomical structures and pathological regions. More clinically relevant validation was performed using MONAI (Medical Open Network for AI). MONAI is often used in practice for the clinical development of AI models for medical image analysis [19]. MONAI evaluated the quantitative accuracy of the SAM-Med–generated tumor highlight overlays rather than their visual appearance.
Validation Protocol: For SAM-Med, each synthetic MRI was processed to identify suspected tumor regions, with the same procedure applied to the corresponding original mammogram. The resulting segmentation masks were compared for pixel-wise similarity. For MONAI, tumor masks were generated on the synthetic MRI, and the pixel-wise distance from each predicted mask to the ground-truth annotation bounding box was calculated. Distributions of these comparison metrics for both SAM-Med and MONAI are presented in Fig. 1.
For each synthetic MRI and its corresponding original mammogram, we used SAM-Med in its default configuration to segment suspected tumor regions. We then compared the resulting masks to verified tumor locations, computing a percentage overlay difference between the SAM-Med mask on the synthetic MRI and the reference tumor region. As shown in Fig. 1 (left) and summarized in Table 1, overlay differences ranged from ∼0 % to ∼3.2 %, with a median of ∼0.2 %, and most values tightly clustered near zero, indicating high overlap accuracy.
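One plausible way to compute such a percentage overlay difference, assuming it is defined as the fraction of pixels on which two binary masks disagree (the exact formula is not spelled out here), is:

```python
import numpy as np

def overlay_pct_diff(mask_a, mask_b):
    """Percentage of pixels where two same-shaped binary masks disagree."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    return 100.0 * np.count_nonzero(a ^ b) / a.size

# Toy 4x4 masks: a 2x2 reference region vs. a 2x1 predicted region inside it.
ref = np.zeros((4, 4), dtype=bool); ref[1:3, 1:3] = True   # 4 px
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:2] = True  # 2 px, subset of ref
print(overlay_pct_diff(ref, pred))  # 2 disagreeing px / 16 px = 12.5
```

A value near 0 % means the predicted mask and the reference region coincide almost everywhere, matching the tight clustering near zero reported above.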
To provide a complementary, clinically oriented validation, we used a reference MONAI segmentation pipeline (3D UNet) with default parameters to generate tumor masks on the synthetic MRIs, again without additional training on this dataset. For each case, we computed the Euclidean distance (in pixels) between the centroid of the MONAI mask and the center of the ground-truth tumor bounding box. The distribution of distances (Fig. 1, right) shows that most segmentations lie within approximately 20–40 pixels of the ground truth, with a small number of larger deviations up to ∼70 pixels. This wider distribution reflects the intrinsic difficulty of the task but confirms that synthetic MRIs preserve tumor location sufficiently well for off-the-shelf models to localize the lesion with modest error.
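The centroid-to-box-center distance described above can be computed, for example, as follows; the box layout `(y_min, x_min, y_max, x_max)` is an assumed convention for this sketch.

```python
import numpy as np

def centroid_to_box_center(mask, box):
    """Euclidean distance (pixels) from a binary mask's centroid to a box center.

    box is assumed to be (y_min, x_min, y_max, x_max).
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    by = (box[0] + box[2]) / 2.0
    bx = (box[1] + box[3]) / 2.0
    return float(np.hypot(cy - by, cx - bx))

mask = np.zeros((100, 100), dtype=bool)
mask[40:50, 40:50] = True        # predicted lesion, centroid at (44.5, 44.5)
box = (30, 30, 50, 50)           # ground-truth box, center at (40, 40)
print(centroid_to_box_center(mask, box))  # hypot(4.5, 4.5) ≈ 6.36 px
```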
While we do not train new models in this manuscript, we provide ready-to-use configuration files in our GitHub/Zenodo repository so that other researchers can run inference using the same MONAI or SAM-Med models we used on top of our dataset.

4.2 Code availability
The code for calculating the metrics reported in this paper is provided in our GitHub repository, archived at Zenodo: https://doi.org/10.5281/zenodo.17444052

Limitations
The primary limitation of this dataset is that the artificial breast MRIs were generated from unpaired mammograms, meaning there is no direct one-to-one correspondence between real MRI and mammogram images. As a result, while the CycleGAN model preserves general breast morphology, fine-grained spatial correspondence cannot be validated against true paired MRI data. Additionally, the source dataset (CBIS-DDSM) consists primarily of 2D mammograms from a specific demographic and imaging era, which may limit generalizability to contemporary or diverse populations. The synthetic data are derived from a single GAN model trained on this dataset, and may therefore reflect inherent data biases or artifacts from both the original mammograms and model training process. No raw MRI data were collected directly for this work.
Human evaluation is an important future direction for assessing the perceptual and clinical realism of synthetic MRIs. In this initial release, our aim is to provide a quantitatively validated dataset that the broader community, specifically clinical experts, can subsequently evaluate using diverse task-specific human review protocols.

Ethics Statement
The authors have read and follow the ethical requirements for publication in Data in Brief. This work does not involve human subjects, animal experiments, or data collected from social media platforms. All mammographic data used were obtained from the publicly available Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) hosted on The Cancer Imaging Archive (TCIA), which provides fully de-identified images for research use.

CRediT Author Statement
Aarthi Muthukumar: Conceptualization, Methodology, Software, Data Curation, Writing – Original Draft, Investigation, Validation. Isain Zapata: Resources, Formal Analysis, Validation, Supervision, Writing – Review and Editing. Amanda Brooks: Resources, Formal Analysis, Validation, Supervision, Writing – Review and Editing.

Source: PubMed Central (JATS). Licensing follows the original publisher's policy; please cite the original article.
