Can Large Language Models Reduce the Cost of Extracting Data from Electronic Health Records for Research?

Hagler S; Adibuzzaman M; McWeeney SK; Cohen A

doi:10.64898/2026.01.09.26343792


medRxiv: the preprint server for health sciences, 2026


Free full text in PMC: PMC12803400


Cite this paper

APA Hagler, S., Adibuzzaman, M., McWeeney, S. K., & Cohen, A. (2026). Can Large Language Models Reduce the Cost of Extracting Data from Electronic Health Records for Research? medRxiv: the preprint server for health sciences. https://doi.org/10.64898/2026.01.09.26343792

MLA Hagler, S., et al. "Can Large Language Models Reduce the Cost of Extracting Data from Electronic Health Records for Research?" medRxiv: the preprint server for health sciences, 2026.

PMID 41542661

DOI 10.64898/2026.01.09.26343792

Abstract

[OBJECTIVE] Much medical data is available only in unstructured electronic health records (EHRs). These data can be obtained through manual (human) extraction or programmatic natural language processing (NLP) methods. We estimate that NLP becomes economically competitive with manual extraction only when there are ~6500 EHR records. We have found that clinicians and researchers are interested in using NLP on projects with fewer records. We examine whether a large language model (LLM) can reduce the cost of NLP enough to make it economically competitive for such projects, and we study whether such a framework is feasible in terms of accuracy.

[MATERIALS AND METHODS] We developed an NLP pipeline using an off-the-shelf open LLM to extract breast cancer ER, PR, and HER2 biomarker data. Pipeline development stopped when the performance of the prompts was competitive with manual extraction. The development time and extraction performance were compared with those of an existing rule-based (RB) NLP pipeline. The code for the extraction portion of the LLM pipeline is available at https://github.com/sehagler/llm_biomarker_extraction.

[RESULTS] The LLM pipeline achieved performance competitive with manual data extraction, with a hands-on development time that was ~38% of that of the RB pipeline.

[DISCUSSION] LLMs exhibit lower hands-on development costs than standard NLP techniques, but require significant and potentially costly computational resources.

[CONCLUSION] LLMs may allow the economically competitive application of NLP to smaller projects if computation costs can be managed.
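
The trade-off described in the abstract can be illustrated with a simple break-even calculation. The sketch below is not the authors' cost model: it assumes a one-time development cost plus a constant cost per record, and every dollar figure in it is hypothetical. Only the ~6500-record break-even point and the ~38% development-time ratio come from the abstract; the rule-based development cost is chosen so that the illustration reproduces the 6500 figure, and the function name `break_even_records` is introduced here for illustration.

```python
import math


def break_even_records(dev_cost, manual_cost_per_record, auto_cost_per_record):
    """Smallest record count at which automated extraction costs no more
    than manual extraction, under an assumed linear cost model:
        manual:    n * manual_cost_per_record
        automated: dev_cost + n * auto_cost_per_record
    Returns None if automation never catches up."""
    saving_per_record = manual_cost_per_record - auto_cost_per_record
    if saving_per_record <= 0:
        return None
    return math.ceil(dev_cost / saving_per_record)


# Hypothetical figures (NOT from the paper), picked so the rule-based
# pipeline breaks even at the ~6500 records quoted in the abstract.
manual_per_record = 10.0      # assumed cost of manually abstracting one record
rb_dev_cost = 65_000.0        # assumed rule-based development cost
llm_dev_cost = rb_dev_cost * 38 / 100  # ~38% hands-on development time

# Rule-based pipeline, assuming negligible per-record compute cost.
rb_n = break_even_records(rb_dev_cost, manual_per_record, 0.0)

# LLM pipeline if compute were free, and with an assumed compute cost
# of 5.0 per record to show how compute erodes the advantage.
llm_n_free = break_even_records(llm_dev_cost, manual_per_record, 0.0)
llm_n_costly = break_even_records(llm_dev_cost, manual_per_record, 5.0)

print(rb_n, llm_n_free, llm_n_costly)
```

Under these assumed numbers, the lower development cost pulls the break-even point well below the rule-based one, while a substantial per-record compute cost pushes it back up. This is the sense in which the conclusion is conditional on managing computation costs.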
