Comparison of artificial intelligence (AI) services for Breast Imaging-Reporting and Data System (BI-RADS) classification on mammograms.
APA
Vasilev Y, Mayorova A, et al. (2026). Comparison of artificial intelligence (AI) services for Breast Imaging-Reporting and Data System (BI-RADS) classification on mammograms. Quantitative Imaging in Medicine and Surgery, 16(4), 311. https://doi.org/10.21037/qims-2025-1658
MLA
Vasilev Y, et al. "Comparison of artificial intelligence (AI) services for Breast Imaging-Reporting and Data System (BI-RADS) classification on mammograms." Quantitative Imaging in Medicine and Surgery, vol. 16, no. 4, 2026, p. 311.
PMID
41972034
Abstract
[BACKGROUND] Existing literature primarily focuses on the ability of artificial intelligence (AI) to detect malignant breast tumors, often neglecting Breast Imaging-Reporting and Data System (BI-RADS) categories altogether or limiting the analysis to categories 4 and 5. The diagnostic performance of AI for other BI-RADS categories remains understudied. This study aimed to compare the diagnostic accuracy of three mammographic AI services in predicting individual BI-RADS categories and to identify opportunities for integrating AI into routine clinical practice.
[METHODS] Anonymized mammograms were obtained from the Unified Radiological Information Service of Moscow. Inclusion criteria: screening mammogram, radiology reports from both an AI service and a human radiologist, patient age 40-75 years. Exclusion criteria: mammograms without a BI-RADS category, BI-RADS categories 0 and 6. The performance of the AI services was assessed by calculating diagnostic accuracy metrics with the radiologists' opinion as the ground truth, together with calibration tests.
[RESULTS] The study sample consisted of 81,895 mammograms. Median accuracy was 76.9%, with a positive predictive value (PPV) of 11.8%. The highest negative predictive value (NPV) was observed for BI-RADS 2 (78.5-83.4%). The second highest NPVs were observed for BI-RADS 1, 3, 4, and 5 (over 84.7%). Binary classification yielded median accuracy and PPV values of 80.5% and 98.6% respectively, compared to the calibration testing (76.0% and 84.7%).
[CONCLUSIONS] Most AI service metrics were suboptimal for individual BI-RADS prediction, potentially due to reliance on variable radiologist conclusions and lack of histological calibration. Binary classification demonstrated higher performance metrics, and no significant differences in NPV were observed across AI applications, which means they can be recommended to confirm the absence of pathology. Successful integration of AI into routine clinical practice requires consideration of various diagnostic accuracy assessment methods, tailored to specific use cases.
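The per-category and binary metrics reported above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy labels, and in particular the binary split (BI-RADS 1-2 treated as negative, 3-5 as positive) are assumptions made for illustration only, since the abstract does not state how the binary classes were defined. Each BI-RADS category is scored one-vs-rest against the radiologist's label.

```python
from typing import Dict, List, Sequence


def one_vs_rest_metrics(truth: Sequence[int], pred: Sequence[int],
                        category: int) -> Dict[str, float]:
    """Accuracy, PPV and NPV for one category scored against all others."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, pred):
        if p == category and t == category:
            tp += 1
        elif p == category:
            fp += 1
        elif t == category:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "ppv": tp / (tp + fp) if (tp + fp) else nan,
        "npv": tn / (tn + fn) if (tn + fn) else nan,
    }


def binarize(labels: Sequence[int], positive_from: int = 3) -> List[int]:
    """Collapse BI-RADS 1-5 to 0/1. The threshold is a hypothetical choice."""
    return [1 if label >= positive_from else 0 for label in labels]


# Invented toy labels, not study data.
radiologist = [1, 2, 2, 3, 4, 5, 1, 2]  # ground truth (radiologist opinion)
ai_service = [1, 2, 1, 3, 2, 5, 1, 2]   # AI-assigned BI-RADS category

per_category = {c: one_vs_rest_metrics(radiologist, ai_service, c)
                for c in range(1, 6)}
binary = one_vs_rest_metrics(binarize(radiologist), binarize(ai_service), 1)

print(per_category[2])
print(binary)
```

With many negatives and few positives, as in a screening population, NPV can remain high even when PPV is low, which is the pattern the abstract reports.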