
Automating the segmentation, date extraction, and classification of multi-report PDFs in outside medical records using optical character recognition and generative artificial intelligence.

JAMIA Open, 2026, Vol. 9(2), ooag027

Damani S, Hinton B, Hunt T, Lawrence N, Miller K, Rice M, Peterson K, McLaughlin S, Ryu A


Cite this article

APA Damani S, Hinton B, et al. (2026). Automating the segmentation, date extraction, and classification of multi-report PDFs in outside medical records using optical character recognition and generative artificial intelligence. JAMIA Open, 9(2), ooag027. https://doi.org/10.1093/jamiaopen/ooag027
MLA Damani S, et al. "Automating the segmentation, date extraction, and classification of multi-report PDFs in outside medical records using optical character recognition and generative artificial intelligence." JAMIA Open, vol. 9, no. 2, 2026, pp. ooag027.
PMID 41924015

Abstract

[OBJECTIVES] Patients referred for specialized care often arrive with outside medical records (OMRs) compiled into multi-report PDFs that include imaging, pathology, and clinical notes in unstructured formats. Reviewing these records is time-consuming and mentally taxing, increasing the risk of delayed care, clinician frustration, and missed information that affects quality of care. This study aimed to automate the segmentation, classification, and date extraction of scanned OMRs, with a focus on records relevant to breast cancer care.

[MATERIALS AND METHODS] We used optical character recognition (OCR) to extract machine-readable text from 1303 scanned PDF documents from 116 distinct external institutions. Gemini 1.5, a large language model (LLM), was then used to segment multi-report files into individual documents, classify them into clinically meaningful categories such as mammograms and pathology reports, and extract study dates to build diagnostic timelines. Document categories were informed by clinical workflows in a breast cancer center.
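The paper does not publish its implementation, but the described pipeline (LLM-driven segmentation, classification, and date extraction over OCR text, followed by timeline construction) can be sketched roughly as below. The function names, the prompt wording, and the `stub_llm` stand-in for a Gemini 1.5 call are all assumptions for illustration, not the authors' code:

```python
import json

def segment_and_classify(ocr_text, llm):
    """Ask an LLM to split OCR'd multi-report text into individual
    documents, each tagged with a category and a study date.

    `llm` is any callable mapping a prompt string to a JSON string:
    [{"category": ..., "study_date": ..., "text": ...}, ...]
    """
    prompt = (
        "Split the following OCR text into individual medical reports. "
        "Return a JSON list of objects with keys 'category' "
        "(e.g., 'mammogram', 'pathology report'), 'study_date' "
        "(YYYY-MM-DD), and 'text'.\n\n" + ocr_text
    )
    return json.loads(llm(prompt))

def build_timeline(reports):
    """Order segmented reports by study date to form a diagnostic timeline."""
    return sorted(reports, key=lambda r: r["study_date"])

# Hypothetical stub standing in for a real Gemini 1.5 API call.
def stub_llm(prompt):
    return json.dumps([
        {"category": "pathology report", "study_date": "2024-03-01", "text": "..."},
        {"category": "mammogram", "study_date": "2023-11-15", "text": "..."},
    ])

timeline = build_timeline(segment_and_classify("(OCR text)", stub_llm))
```

In practice the `llm` callable would wrap the model API and the OCR text would come from an engine such as Tesseract; the point of the sketch is only the shape of the pipeline: OCR text in, structured per-report records out, sorted into a timeline.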

[RESULTS] The system achieved an F1 score of 0.95 for segmentation, 0.96 for classification, and 0.90 for date extraction. In a pilot of 45 records reviewed by clinicians, only 2 classification errors and 1 date error were reported. Clinicians estimated that the tool reduced OMR review time by 40%, improved workflow efficiency, and increased satisfaction.
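For reference, the F1 scores reported above are the harmonic mean of precision and recall. A minimal sketch of the computation, using illustrative counts rather than the study's actual confusion counts:

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Illustrative only: 95 correct boundaries, 5 spurious, 5 missed -> F1 = 0.95
print(round(f1_score(95, 5, 5), 2))
```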

[DISCUSSION] Our findings demonstrate that combining OCR with LLMs can significantly enhance the processing of unstructured medical records, reducing manual burden and supporting timely clinical decision-making.

[CONCLUSION] This study demonstrates the successful application of OCR and LLMs for organizing scanned OMRs within a specialty clinic. By automating a previously manual process, the approach supports scalable review of incoming outside records and has potential for adaptation to other clinical workflows. Future work will focus on evaluating the system across additional specialties and institutions.