Application of a Natural Language Processing Framework for Data Extraction From Pathology Reports Across Multiple Cancer Types.

Park P; Choi Y; Han N; Park S; Park YL; Hwang J; Choi KS; Yoo CW; Kim HJ

doi:10.3346/jkms.2026.41.e79


Journal of Korean Medical Science 2026 Vol. 41(8) p. e79



Cite this article

APA Park P, Choi Y, et al. (2026). Application of a Natural Language Processing Framework for Data Extraction From Pathology Reports Across Multiple Cancer Types. Journal of Korean Medical Science, 41(8), e79. https://doi.org/10.3346/jkms.2026.41.e79

MLA Park P, et al. "Application of a Natural Language Processing Framework for Data Extraction From Pathology Reports Across Multiple Cancer Types." Journal of Korean Medical Science, vol. 41, no. 8, 2026, p. e79.

PMID 41775279

DOI 10.3346/jkms.2026.41.e79

Abstract

[BACKGROUND] Pathology reports provide comprehensive insights into the clinical and pathological features of different cancer types. However, extracting these semi-structured data for research is challenging. To make better use of pathology reports in cancer studies, we developed an efficient natural language processing (NLP) system that automates the extraction of items from pathology reports, facilitating streamlined storage, retrieval, and analysis of clinical data in a centralized database.

[METHODS] To determine the optimal model for our study, we compared several deep learning architectures: long short-term memory, convolutional neural network, and transformer-based models, namely bidirectional encoder representations from transformers (BERT), BioBERT, and ClinicalBERT. The ClinicalBERT model's proficiency in medical terminology and context significantly enhanced the accuracy and efficiency of data extraction from these reports.

[RESULTS] Among the models compared, ClinicalBERT exhibited the best performance and was selected as the base model. The ClinicalBERT model demonstrated exceptional performance in accurately classifying variables across multiple cancer types. For stomach cancer, perfect F1 scores (F1 = 1.0) were achieved for variables such as angiolymphatic invasion and operation name; however, lower performance was observed for distant metastasis (F1 = 0.3889). For liver cancer, high performance was consistently observed for most variables, with F1 scores above 0.99. For colorectal cancer, perfect F1 scores (F1 = 1.0) were achieved for variables such as Dworak's grade, lymph node, operation name, and total mesorectal excision, while slightly lower but acceptable performance was noted for surgical margin (F1 = 0.9259). For breast cancer, perfect F1 scores (F1 = 1.0) were achieved for several variables, including nipple margin, organ, and superficial margin, while strong performance was noted for lateral and medial margins (F1 > 0.94).
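
For readers less familiar with the metric, the F1 scores reported above are the harmonic mean of precision and recall for a given variable. The following is a minimal sketch of that calculation, not the authors' evaluation code: the function name `f1_score`, the label values, and the toy data are all hypothetical, and the abstract does not state which averaging scheme (per-label, micro, or macro) was used.

```python
from typing import Sequence


def f1_score(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> float:
    """F1 for a single label: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        # No correct positive predictions: F1 is 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels for one extracted variable (not data from the study).
gold = ["present", "absent", "absent", "present", "absent"]
pred = ["present", "absent", "present", "absent", "absent"]

score = f1_score(gold, pred, positive="present")
perfect = f1_score(gold, gold, positive="present")
```

Because F1 ignores true negatives, a variable with very few positive cases can score low after only a handful of errors, which is one possible reading of the gap between the low-scoring and perfect-scoring variables; the abstract itself does not give the cause.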

[CONCLUSION] This study underscores the efficacy of NLP systems, specifically the ClinicalBERT model, in automating the extraction of important clinical data from pathology reports across various cancer types. This approach can both simplify the extraction process and enhance the accuracy of the extracted information.

MeSH Terms

Humans; Natural Language Processing; Neoplasms; Deep Learning; Neural Networks, Computer; Databases, Factual; Stomach Neoplasms; Data Mining; Liver Neoplasms
