Extraction of Treatments and Responses From Non-Small Cell Lung Cancer Clinical Notes Using Natural Language Processing.
APA
Sivarajkumar S, Edupuganti S, et al. (2026). Extraction of Treatments and Responses From Non-Small Cell Lung Cancer Clinical Notes Using Natural Language Processing. JCO Clinical Cancer Informatics, 10, e2500138. https://doi.org/10.1200/CCI-25-00138
MLA
Sivarajkumar S, et al. "Extraction of Treatments and Responses From Non-Small Cell Lung Cancer Clinical Notes Using Natural Language Processing." JCO Clinical Cancer Informatics, vol. 10, 2026, pp. e2500138.
PMID
41505664
Abstract
[PURPOSE] Manual extraction of treatment outcomes from unstructured oncology clinical notes is a significant challenge for real-world evidence (RWE) generation. This study aimed to develop and evaluate a robust natural language processing (NLP) system to automatically extract cancer treatments and their associated RECIST-based response categories (complete response, partial response, stable disease, and progressive disease) from non-small cell lung cancer (NSCLC) clinical notes.
[METHODS] This retrospective NLP development and validation study used a corpus of 250 NSCLC oncology notes from University of Pittsburgh Medical Center (UPMC) Hillman Cancer Center, annotated by physician experts. An end-to-end NLP pipeline was designed, integrating a rule-based module for entity extraction (treatments and responses) and a machine learning module using biomedical clinical bidirectional encoder representations from transformers for relation classification. The system's performance was evaluated on a held-out test set, with partial external validation for relation extraction on a Mayo Clinic data set.
[RESULTS] The NLP system achieved high overall accuracy. On the UPMC test set (64 notes), the relation classification model attained an area under the receiver operating characteristic curve of 0.94 and an F1 score of 0.92 for linking treatments with documented responses. The rule-based entity extraction demonstrated a macro-averaged F1 score of 0.87 (precision 0.98, recall 0.81). Although precision was high for chemotherapy and most response types (1.00), recall for cancer surgery was 0.45. External validation at Mayo Clinic showed moderate relation extraction F1 scores (range: 0.51-0.64).
[CONCLUSION] The proposed NLP system can reliably extract structured treatment and response information from unstructured NSCLC oncology notes with high accuracy. This automated approach can assist in abstracting critical cancer treatment outcomes from clinical narrative text, thereby streamlining real-world data analysis and supporting the generation of RWE in oncology.
Introduction
Lung cancer treatment is highly complex, often involving surgery, radiotherapy, and systemic therapies, such as chemotherapy, targeted therapy, and immunotherapy. While clinical trials utilize standardized protocols and response criteria, real-world cancer care frequently deviates from trial protocols due to patient heterogeneity and practical constraints1. In routine practice, clinicians document treatment details and outcomes in narrative electronic health record (EHR) notes. Extracting information about treatment response, especially disease progression, from these unstructured notes is crucial for generating real-world evidence (RWE) from broader patient populations and guiding adjustments to therapy outside of controlled trial settings2,3.
Manual abstraction from notes is labor-intensive, inconsistent, and often impractical at scale4. Clinical trials use the Response Evaluation Criteria in Solid Tumors (RECIST) to provide standardized definitions for tumor response categories: complete response (CR), partial response (PR), stable disease (SD), and progressive disease (PD)5. However, in everyday practice, such responses may be described in varied ways (or not explicitly labeled as CR/PR/SD/PD), making them difficult to identify consistently (see Figure 1). For example, an oncologist’s note might describe “tumor shrinkage” or “no new lesions” instead of explicitly stating “partial response” or “stable disease”6. Differences in documentation style, use of abbreviations, and inclusion of unrelated information all complicate the extraction of true response outcomes from narrative text.
Natural language processing (NLP) has emerged as a promising solution to automate information extraction from clinical narratives7–9. Previous studies have demonstrated the potential of NLP to retrieve meaningful clinical data from oncology records10, and even more broadly across other medical domains such as rehabilitation and neurology11. Early rule-based systems and classical machine learning models (often using techniques like term frequency-inverse document frequency [TF-IDF] for text features) have achieved some success in extracting information from clinical text12,13. Rule-based systems use predefined patterns and lexicons to identify treatment and response entities with high precision, and TF-IDF, a statistical measure of a word's importance within a document collection, can be used to generate features for machine learning models. However, these approaches struggle with the rich variability and context sensitivity of clinical language. More recent advances leverage deep learning and contextual language models to better interpret clinical text; indeed, studies have shown that modern NLP methods can extract complex clinical information with increasing accuracy14–17. For instance, Zuo et al (2024) developed an NLP pipeline using BioBERT18/BioClinicalBERT19 language models to identify systemic anticancer therapies and response outcomes from notes20,21. However, by design their annotation schema collapsed individual agents into broader therapy courses (e.g., annotating “etoposide and carboplatin” as a single SACT mention), so the system did not capture agent-level granularity or complex regimens. Hence, we introduce a fine-grained entity schema that distinguishes individual chemotherapy agents and multi-agent regimens, and replace extensive manual rule curation for relation linking with a streamlined, BERT-based classifier optimized for these granular links.
Our approach integrates two components: (1) a rule-based entity extraction module to recognize and normalize mentions of cancer treatments and responses (with attention to negations and hypothetical phrases), and (2) a machine learning-based relation classifier that uses BioClinicalBERT contextual embeddings to determine which treatments are linked to which response outcomes in the text. Figure 2 shows the end-to-end NLP system for extracting treatments and responses from clinical notes. We also created a carefully annotated corpus of lung cancer clinical notes to train and evaluate the system. We hypothesized that this combined approach would achieve high accuracy in identifying correct treatment-response pairs, exceeding what either rules or conventional machine learning alone could accomplish. Although LLM-based extraction methods have shown recent promise, they typically require vast annotated corpora and extensive fine-tuning, demand far greater computing power, and can lack the interpretability and compliance guarantees needed for clinical deployment, making our targeted BioClinicalBERT framework a more practical choice. Ultimately, by translating unstructured oncology notes into structured data on treatment outcomes, this work aims to enhance real-world evidence generation and support oncology decision-making for patient populations beyond the scope of clinical trials.
Methods
Data Collection and Annotation
We randomly sampled a retrospective cohort of 250 lung cancer clinical notes for method development. These notes were sourced from NSCLC patients treated at UPMC Hillman Cancer Center (Pittsburgh, PA) and included a mix of initial consultation notes and progress notes (Epic EHR, Jan 2017–Dec 2021). All sampled notes were authored by board-certified medical oncologists. The University of Pittsburgh’s Institutional Review Board (IRB) reviewed and approved this study’s protocol (#STUDY21040204).
The notes spanned typical outpatient oncology documentation, with an average length of ~4,200 characters per note (approximately 1.5–2 pages of text). Each note was manually annotated by domain experts (two physicians) to mark mentions of five treatment entities (surgery, radiation therapy, chemotherapy, targeted therapy, immunotherapy) and four response entities corresponding to outcomes (complete response, partial response, stable disease, progressive disease). If two or more treatment entities were part of one treatment regimen, these were annotated as a ‘combination.’ Additionally, whenever a treatment and a response were related (i.e., the response was the outcome of that treatment), an annotated relation link was established between those entities.
The annotation process followed a rigorous protocol to ensure high quality and consistency. First, two physician annotators independently annotated the same initial set of 20 notes. We measured inter-annotator agreement (IAA) on this set, obtaining a macro-averaged F1 score of 0.58, which indicated moderate agreement given the complexity of the task. The team then met to reconcile differences and refine the annotation guidelines. Key clarifications were made on tricky cases (for example, how to handle ambiguous wording or multiple treatments with one response). After consensus, the annotators jointly annotated 30 additional notes, periodically computing IAA and updating guidelines until consistency improved (macro-averaged F1 score of 0.68). Once the annotation schema was well understood, the remaining notes were divided, and each annotator independently labeled roughly 100 unique notes (for a total corpus of 250 annotated notes). The final annotation guideline document included explicit definitions for each entity and relation type, illustrative examples, and rules for handling edge cases and ambiguous wording, ensuring annotator alignment prior to independent labeling. Table 1 summarizes the entity categories with examples and specific instructions from the annotation guidelines.
Rule-Based NLP Module
Given that many healthcare systems still lack extensive computing resources and remain cautious about the black-box nature of advanced deep learning methods, our initial approach employs a rule-based information extraction (IE) module to accurately identify the occurrences of treatments and responses in the text. This module addressed three core challenges in processing clinical notes: (1) recognition of diverse terminology for cancer treatments and outcomes, (2) handling negations or hypothetical statements, and (3) distinguishing when multiple treatments and responses appear in proximity (ensuring each entity is correctly identified and not conflated).
Entity recognition using regular expressions:
We compiled domain-specific lexicons and patterns to capture the variety of ways each treatment or response could be mentioned. For example, the chemotherapy lexicon included generic terms (“chemo”), specific drug names (e.g., carboplatin, paclitaxel), and contextual phrases like “adjuvant therapy.” For responses, patterns included full phrases (“complete response”), abbreviations (“CR”), and colloquial equivalents (“no evidence of disease” for complete response). We implemented these as regular expression (regex) patterns with word boundaries (\b) to avoid partial matches, and applied case insensitivity so that variations in capitalization (e.g., “PD” vs “pd”) would be recognized equally. Appendix Table A1 provides the key regex patterns for each entity type along with example text snippets they cover.
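To illustrate the pattern-matching approach, the minimal Python sketch below applies word-bounded, case-insensitive regex patterns for a few entity types; the lexicon entries shown here are illustrative examples only, not the full patterns listed in Appendix Table A1.

```python
import re

# Illustrative lexicon fragments (not the full patterns from Appendix Table A1).
ENTITY_PATTERNS = {
    "chemotherapy": re.compile(r"\b(chemo(therapy)?|carboplatin|paclitaxel)\b", re.IGNORECASE),
    "immunotherapy": re.compile(r"\b(immunotherapy|pembrolizumab|nivolumab)\b", re.IGNORECASE),
    "partial_response": re.compile(r"\b(partial response|PR)\b", re.IGNORECASE),
    "complete_response": re.compile(r"\b(complete response|CR|no evidence of disease)\b", re.IGNORECASE),
}

def extract_entities(text):
    """Return (category, matched text, start, end) for every pattern match."""
    hits = []
    for category, pattern in ENTITY_PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((category, m.group(0), m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])

print(extract_entities("Started carboplatin; restaging imaging suggests partial response."))
```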
Negation and context handling:
It is common for notes to state what is not present (e.g., “no evidence of progression”) or to discuss hypothetical scenarios. To avoid falsely extracting such instances as positive findings, we implemented a two-tier negation detection. First, a lexical negation filter scans for negation phrases immediately preceding an entity mention (for instance, “no evidence of…” or “without progression”) and temporarily flags those entities. Second, we applied a dependency parsing approach using spaCy’s NegSpaCy extension, which identifies grammatical negation relationships (e.g., a negation token “no” linked to “progression” in the dependency tree). If an entity mention was found to be negated by either method, the IE module would label it as negated and we would exclude it from downstream relation analysis (since it does not represent an actual treatment given or response observed). Similarly, we handled hypothetical language by ignoring entities in phrases like “if progression is noted…” where the context is conditional or future-oriented rather than a statement of fact.
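As a sketch of the first (lexical) tier of negation handling, the function below flags an entity mention when a negation or hypothetical cue appears in a short window of text immediately before it; the cue list and window size are illustrative assumptions, and the second tier (NegSpaCy-based detection) is not shown.

```python
import re

# Illustrative negation/hypothetical cues and window size (assumptions, not the exact lists used in the module).
NEGATION_CUES = ["no evidence of", "without", "denies", "negative for", "if"]
NEGATION_WINDOW = 40  # characters scanned immediately before the entity mention

def is_negated_or_hypothetical(text, ent_start):
    """Lexical first-tier filter: flag an entity preceded by a negation/hypothetical cue."""
    preceding = text[max(0, ent_start - NEGATION_WINDOW):ent_start].lower()
    return any(re.search(r"\b" + re.escape(cue) + r"\b", preceding) for cue in NEGATION_CUES)

note = "No evidence of progression on the latest CT scan."
start = note.lower().find("progression")
print(is_negated_or_hypothetical(note, start))  # True -> excluded from relation analysis
```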
Attribute validation:
The module also checked that each extracted response truly pertained to cancer status (for example, the word “stable” could describe a patient’s symptoms or weight, not only “stable disease” in tumor response). Simple heuristics, such as looking at surrounding words or requiring certain keywords (like “disease” following “stable”), were used to validate that the extracted entity fit the intended category. This helped improve precision by filtering out spurious matches.
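The snippet below sketches one such heuristic, assuming a simple check that the word “disease” follows a “stable” match within a short window; the exact rules used in the module may differ.

```python
def looks_like_tumor_response(text, match_end, window=20):
    """Illustrative heuristic: keep a 'stable' match only if 'disease' follows closely,
    filtering out uses such as 'weight stable' or 'vitals stable'."""
    return "disease" in text[match_end:match_end + window].lower()

note = "Weight stable; imaging shows stable disease."
print(looks_like_tumor_response(note, 13))  # False: this 'stable' refers to weight
print(looks_like_tumor_response(note, 35))  # True: 'stable disease' is a tumor response
```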
After applying these patterns (and the negation/context filters) to a note, the rule-based module produces a list of detected treatment and response entities, each tagged with its category. These serve as candidate inputs to the relation extraction stage. By using conservative patterns and thorough negation handling, this module prioritizes precision (aiming to avoid false positives), since downstream machine learning can help recover some missed cases (improving recall).
Relation Extraction Pipeline
The second major component of our system is a machine learning pipeline that determines which treatments and responses mentioned in the same note are actually related. In other words, if a note contains multiple treatments and a particular response, we need to identify the correct pairing (e.g., chemotherapy X led to partial response, vs. radiation led to stable disease, etc.). This is framed as a binary relation classification problem: for each candidate Treatment-Response pair in a note (as identified by the IE module), predict whether that pair is a true relation (the treatment resulted in the response) or not.
Feature representation with BioClinicalBERT:
For each candidate pair, we extracted the text segment spanning from the treatment mention to the response mention (including a window of context words around them). We limited the length of this segment to a maximum of 512 tokens to meet the input size constraints of BERT models. If the distance between the two entities in the note was larger than this (which was rare given how clinical notes are structured), we truncated less informative parts of the text while preserving the immediate context around each entity. We then used BioClinicalBERT, a transformer-based language model pre-trained on biomedical and clinical text, to convert the text segment into a numerical representation. In particular, after tokenizing the segment using the model’s WordPiece tokenizer, we obtained the embedding of the special [CLS] token from BioClinicalBERT’s output. This [CLS] embedding (a 768-dimensional vector) serves as a summary representation of the entire text segment, effectively encoding the contextual information linking the treatment and response mention. We chose BioClinicalBERT because prior research has shown it excels at understanding clinical narrative context and jargon compared to general BERT models22.
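A minimal sketch of this step, assuming the publicly released emilyalsentzer/Bio_ClinicalBERT checkpoint and the Hugging Face transformers library (the exact segment construction and preprocessing used in the pipeline may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"  # public BioClinicalBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def cls_embedding(segment):
    """Encode the treatment-to-response text segment and return the 768-d [CLS] vector."""
    inputs = tokenizer(segment, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # embedding of the [CLS] token

vec = cls_embedding("Patient received carboplatin; restaging CT shows partial response.")
print(vec.shape)  # torch.Size([768])
```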
Classifier training:
Using these BioClinicalBERT embeddings as input features, we trained several machine learning classifiers to predict Treatment_Response vs No_Relation. We experimented with five different classifier algorithms: logistic regression, support vector machine (SVM), gradient boosting (using XGBoost and LightGBM implementations), and a standalone scikit-learn gradient boosting classifier. These models were selected based on their performance in similar text classification tasks and their ability to handle high-dimensional feature vectors. We randomly split the 250 annotated notes into a training set (158 notes), validation set (28 notes), and test set (64 notes). During training, each model learned to classify pairs using the known relations in the training set. Hyperparameter tuning was performed via grid search, optimizing for the highest F1-score on the validation set to balance precision and recall. All model development was conducted using the scikit-learn machine learning library in Python.
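As an illustrative sketch of classifier training on the [CLS] embeddings, the code below fits a logistic regression with a small grid search optimized for F1. Note that the study tunes hyperparameters against a separate validation set; cross-validated GridSearchCV is used here only as a compact stand-in, and the grid values and placeholder data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# X_train: one 768-d [CLS] embedding per candidate Treatment-Response pair;
# y_train: 1 = Treatment_Response, 0 = No_Relation (random placeholders here).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 768))
y_train = rng.integers(0, 2, size=200)

param_grid = {"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]}  # illustrative grid
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```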
Threshold optimization:
The classifiers initially used a default probability threshold of 0.5 to decide whether a given pair was a true relation. Because the prevalence of true Treatment-Response relations in the data is relatively low compared to all possible pairs (most treatments in a note will not correspond to a particular response mention and vice versa), we adjusted this decision threshold. Using the validation set, we varied the threshold from 0.01 to 0.99 and selected the threshold that maximized the F1-score for each model. This calibration helped ensure that models were neither too conservative nor too liberal in labeling relations, given our goal of a high F1-score.
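A minimal sketch of this threshold sweep (the 0.01 step size is an assumption consistent with the 0.01–0.99 range described above):

```python
import numpy as np
from sklearn.metrics import f1_score

def pick_threshold(model, X_val, y_val):
    """Sweep decision thresholds over 0.01-0.99 and keep the one maximizing validation F1."""
    probs = model.predict_proba(X_val)[:, 1]
    thresholds = np.arange(0.01, 1.00, 0.01)
    f1s = [f1_score(y_val, (probs >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(f1s))
    return thresholds[best], f1s[best]

# Usage with a fitted classifier and validation-set embeddings/labels:
# best_threshold, best_f1 = pick_threshold(search.best_estimator_, X_val, y_val)
```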
Finally, we evaluated the best-performing model on the held-out test set of 64 notes, using standard metrics described below. The entire pipeline (rule-based extraction followed by relation classification) was run on each test note to simulate how the system would function on new unseen data.
All algorithms and source code are publicly available (https://github.com/PittNAIL/nlp-nsclc).
External Validation
We performed a partial external validation using a Mayo Clinic corpus of deidentified oncology notes (Li et al, 2020; Zuo et al, 2024), supported by our collaborators at Mayo Clinic. The dataset contained 117 de-identified NSCLC notes randomly sampled from the Mayo Clinic Epidemiology and Genetics of Lung Cancer Database (2015–2019) and annotated for chemotherapy treatments and RECIST response mentions by clinical experts. Because their annotation schema for treatment and response entities differed substantially from ours, we did not evaluate the rule-based entity extraction component externally. Instead, we applied our relation extraction models directly to the Mayo dataset’s gold-standard entity spans, assessing only relation classification performance for chemotherapy entities linked to outcomes (CR, PR, SD, PD).
Evaluation Metrics
We assessed the system’s performance at two levels: entity extraction and relation extraction. For each, we report precision, recall, and F1-score. Precision is the proportion of items identified by the system that were correct; recall is the proportion of true items (per gold-standard annotation) that were identified; and F1-score is the harmonic mean of precision and recall. In addition, for the relation classification task, we calculated the area under the ROC curve (AUC-ROC) and area under the precision-recall curve (AUC-PR) to capture performance across all classification thresholds. The AUC-ROC reflects the model’s ability to discriminate between true and false relations across varying sensitivity-specificity tradeoffs, whereas the AUC-PR is particularly informative in our context of class imbalance. All results are reported on the held-out test set unless otherwise noted. We also compare certain results to previously published methods when applicable.
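For concreteness, the helper below shows how these metrics could be computed with scikit-learn from gold-standard relation labels and predicted probabilities; it is a sketch whose keys map to the quantities reported in the Results.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def relation_metrics(y_true, y_prob, threshold):
    """Precision/recall/F1 at the chosen threshold, plus threshold-free AUC-ROC and AUC-PR."""
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc_roc": roc_auc_score(y_true, y_prob),
        "auc_pr": average_precision_score(y_true, y_prob),  # area under the precision-recall curve
    }
```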
Results
Rule-Based Entity Extraction Performance
The rule-based entity extraction module achieved high precision across all treatment and response categories, with slightly variable recall depending on the entity type (Table 2). Overall, the system obtained a macro-averaged precision of 0.98, recall of 0.81, and F1-score of 0.87 for extracting the predefined treatments and responses from the clinical notes. This indicates that almost all entities it identified were correct (very few false positives), and it retrieved a large majority of the true entities annotated by experts.
Breaking down performance by category, chemotherapy and immunotherapy mentions were recognized extremely well (F1 = 0.96 and 0.98, respectively), each with precision ~1.00 and recall in the high 0.90s. These high scores reflect that chemotherapy and immunotherapy regimens tend to use consistent drug names or terms (which our lexicons covered well) and are frequently documented. Radiotherapy had near-perfect precision (0.98) but more moderate recall (0.69), likely because radiation can be described in varied ways. Targeted therapy mentions (e.g., TKIs like erlotinib) also showed strong performance (F1 = 0.98). Cancer surgery had the most notable gap: while precision was 1.00, recall was only 0.45, indicating the system missed more than half of surgical procedures. For the response categories, precision was very high (0.90–1.00 for all four outcomes). Complete response (CR) had a recall of 0.56 (F1 0.71); partial response (PR) recall was 0.64 (F1 0.78); stable disease (SD) had recall 0.86 (F1 0.92); and progressive disease (PD) recall was 0.72 (F1 0.80). These results are summarized in Table 2.
Treatment-Response Relation Extraction Performance
The machine learning relation extraction pipeline successfully learned to identify true treatment-response pairs with high accuracy. Table 3 summarizes the performance of each classifier evaluated on the test set. All five classifiers achieved strong results, with AUC-ROC values around 0.92 or higher and F1-scores around 0.90 or above, indicating that the BioClinicalBERT embedding features provided an informative representation for this task.
The logistic regression classifier performed the best, with an AUC-ROC of 0.938 and an F1-score of 0.92, balancing precision (0.93) and recall (0.90) effectively. The gradient boosting ensemble models, XGBoost and LightGBM, were very competitive (AUC-ROC of 0.938 and 0.941, respectively, with F1-scores of 0.93 and 0.92). The SVM classifier performed well (AUC-ROC 0.921, F1 0.90) but had a slightly lower precision-recall tradeoff, and the generic Gradient Boosting (scikit-learn) achieved an AUC-ROC of 0.920 and F1 of 0.91. These performance metrics are presented in Table 3 and Figure 3. For detailed error analysis, please refer to Appendix A Figure A1.
External Validation Analysis:
On the Mayo Clinic dataset, relation extraction performance was lower than on our internal test set, reflecting both domain shifts in documentation and annotation standards. Logistic regression achieved precision 0.55, recall 0.64, and F1-score 0.51; XGBoost achieved 0.60/0.68/0.64; LightGBM 0.61/0.70/0.63; while the SVM and scikit-learn gradient boosting classifiers yielded F1-scores of 0.49 and 0.54, respectively. These findings underscore the feasibility of our approach beyond a single institution, but also highlight the need for harmonized annotation guidelines for full external validation.
Discussion
In this study, we developed an end-to-end NLP system that extracts treatment types and RECIST-based response outcomes from unstructured lung cancer clinic notes. The system combines the reliability of rule-based entity extraction with the nuance of a BioClinicalBERT-based machine learning model for relation classification. Our results demonstrate that this approach addresses several key challenges in processing clinical text and achieves a level of performance that is, to our knowledge, higher than previously reported systems for similar tasks23. High precision in entity detection ensures that the critical pieces of information (therapies administered and their outcomes) are accurately captured, while the learned model effectively links those pieces even in the presence of multiple treatments or complex narrative structures.
Comparison with prior work:
Our findings build upon and extend earlier research on extracting oncology information from EHR notes. Zuo et al (2024) reported an NLP pipeline for systemic therapy and response extraction using BERT-based models, achieving good precision but requiring extensive manual rule-based post-processing. In our study, leveraging a focused rule-based component and BioClinicalBERT for contextual embeddings improved recall for varied entity types (e.g., capturing immunotherapies and targeted therapies comprehensively) and yielded a higher F1-score (0.92) for treatment-response linking compared with mid-0.6 scores in prior work. This improvement is likely due to BioClinicalBERT’s ability to interpret subtle linguistic cues in clinical text and our threshold optimization strategy24. Furthermore, exploring multiple classifiers revealed that simpler logistic regression, when combined with robust embedding features, can slightly outperform more complex models. This suggests that the choice of textual representation may be more crucial than the classifier itself.
Clinical relevance and potential applications:
From a clinical perspective, an automated system that reliably extracts a patient’s treatment history and response can be extremely valuable. Oncologists routinely review a patient’s course of treatment to determine treatment regimens and clinical trial eligibility25–28. Currently, manual extraction of this information from free-text notes is labor-intensive and prone to errors. Our NLP tool can substantially alleviate this burden by offering a structured summary of treatment responses that encompasses crucial outcomes like progression, which can guide future treatments. In research settings, automating the extraction of real-world treatment responses can accelerate observational studies and enable large-scale population-level analysis29. This approach thus supports both improved clinical care and the generation of real-world evidence that can inform future trials.
Applicability:
Our system was developed on outpatient oncology notes, such as initial consultation and follow-up progress notes, where treatment and response mentions are both prominently featured. In these documents, sentences such as “The patient received four cycles of carboplatin” or “Imaging shows no new lesions” plainly link therapies to outcomes, which our span-detection and relation models can readily capture. By contrast, other note types, such as inpatient progress notes, discharge summaries, and survivorship summaries, often fold oncology observations into broader multisystem narratives, embedding treatment and response information within longer, temporally complex passages that can obscure the relationship between therapies and outcomes and make extraction more challenging.
The Mayo Clinic corpus used for external validation employed a distinct annotation schema and definition for chemotherapy and response entities, which likely contributed to the observed F1-score decline (0.51–0.64). Aligning annotation standards or fine-tuning on a small set of similarly annotated local notes should help mitigate these discrepancies and restore high performance across diverse documentation settings.
Limitations:
Our study has the following limitations. First, the system was developed and evaluated on notes from NSCLC patients; therefore, its effectiveness in other cancer types is unknown. Second, while inter-annotator agreement was initially moderate (macro-averaged F1 of 0.58, improving to 0.68 after guideline refinement), inherent ambiguities in clinical language remain. Third, the system currently processes only narrative text and does not integrate structured data such as radiology measurements, which might further enhance accuracy. Our study was also conducted on unstructured oncology notes authored by NSCLC specialists; we did not explicitly evaluate performance on notes of differing quality or narratives from other cancer care providers.
Although clinician-documented impressions often correlate with formal RECIST assessments, these narrative descriptions may reflect subjective clinical interpretation. We chose this approach initially to leverage readily available narrative content across sites, given the variability and limited accessibility of structured radiology measurements in our EHR. Future work will focus on linking extracted response mentions to structured radiology measurements or on developing modules to automatically extract longest-diameter measurements from radiology reports, thereby enabling standardized RECIST computation.
Furthermore, due to annotation guideline discrepancies between our UPMC corpus and the Mayo Clinic dataset, we were unable to externally evaluate the rule-based entity extraction module; only relation extraction was validated, yielding F1-scores of 0.51–0.64 (Table 3). Harmonization of annotation standards will be essential for comprehensive multicenter validation. Finally, there is provider variability in real-world documentation of treatment response, and it is often difficult to distinguish between stable disease, partial response, and complete response. Prior work has demonstrated that narrative RECIST mentions agree with formal RECIST-measured categories in only 70–85% of cases (Li et al), and our empirical chart review showed that narrative descriptions of ‘stable disease’ corresponded to formal RECIST-defined SD. However, its relevance in real-world settings is nuanced, especially when the intent of treatment is palliative. In these scenarios, treatment changes often occur in the setting of progressive symptoms or disease but remain unchanged if patients have stable disease, partial response, or complete response without progressive symptoms. Thus, our NLP system can be a useful tool in specific patient populations.
Future directions:
We have externally validated our relation extraction component for chemotherapy and the four response entities using a Mayo Clinic dataset. To further improve generalizability, a key priority is external validation using datasets from other institutions. To achieve end-to-end RECIST automation, we are developing a radiology-linking module that combines NLP and rule-based methods to extract target-lesion longest-diameter measurements from the ‘Impression’ section of radiology reports. These measurements will then be programmatically linked to their corresponding narrative response mentions in clinical notes, enabling automatic computation of formal RECIST categories.
Additional future directions include incorporating large language models (LLMs) to potentially consolidate the extraction and relation linking tasks in a single model and integrating related clinical documents (e.g., radiology and pathology reports) to improve accuracy further. Some studies have attempted to extract cancer treatment progression using LLMs30, but the substantial size of these models limits their integration into actual hospital settings.
In this study, our goal was to evaluate the system’s generalizability on a proprietary external corpus; we therefore did not perform Mayo-specific fine-tuning. Although techniques such as few-shot fine-tuning, unsupervised domain adaptation, and semi-supervised lexicon transfer have been shown to improve cross-institution NLP performance, these were beyond the scope of the current work and will be explored in follow-on studies.
Embedding this tool within live EHR systems could facilitate real-time decision support via automated updates of structured treatment response summaries, which may streamline oncologist workflow and reduce manual chart review burdens. This tool can also be adapted to extract other key endpoints that are not typically identified in structured data, such as toxicity patterns and symptom clusters over time in patients in relation to specific treatment regimens and co-morbidities.
In conclusion, our study demonstrates that an end-to-end NLP approach can effectively extract meaningful oncology outcomes from free-text clinical notes with accuracy approaching that of human abstraction. By structuring the unstructured, our methodology unlocks valuable real-world treatment response data at scale, potentially informing both clinical decision-making and large-scale observational research. Future multi-center validations and workflow integrations will be key to realizing the full clinical utility of this approach.
In this study, we developed an end-to-end NLP system that extracts treatment types and RECIST-based response outcomes from unstructured lung cancer clinic notes. The system combines the reliability of rule-based entity extraction with the nuance of a BioClinicalBERT-based machine learning model for relation classification. Our results demonstrate that this approach addresses several key challenges in processing clinical text and achieves a level of performance that is, to our knowledge, higher than previously reported systems for similar tasks23. High precision in entity detection ensures that the critical pieces of information (therapies administered and their outcomes) are accurately captured, while the learned model effectively links those pieces even in the presence of multiple treatments or complex narrative structures.
Comparison with prior work:
Our findings build upon and extend earlier research on extracting oncology information from EHR notes. Zuo et al (2024) reported an NLP pipeline for systemic therapy and response extraction using BERT-based models, achieving good precision but requiring extensive manual rule-based post-processing. In our study, leveraging a focused rule-based component and BioClinicalBERT for contextual embeddings improved recall for varied entity types (e.g., capturing immunotherapies and targeted therapies comprehensively) and yielded a higher F1-score (0.92) for treatment-response linking compared with mid-0.6 scores in prior work. This improvement is likely due to BioClinicalBERT’s ability to interpret subtle linguistic cues in clinical text and our threshold optimization strategy24. Furthermore, exploring multiple classifiers revealed that simpler logistic regression, when combined with robust embedding features, can slightly outperform more complex models. This suggests that the choice of textual representation may be more crucial than the classifier itself.
Clinical relevance and potential applications:
From a clinical perspective, an automated system that reliably extracts a patient's treatment history and response can be extremely valuable. Oncologists routinely review a patient's course of treatment to determine treatment regimens and clinical trial eligibility.25–28 Currently, manual extraction of this information from free-text notes is labor-intensive and error-prone. Our NLP tool can substantially alleviate this burden by offering a structured summary of treatment responses that encompasses crucial outcomes such as progression, which can guide future treatments. In research settings, automating the extraction of real-world treatment responses can accelerate observational studies and enable large-scale, population-level analysis.29 This approach thus supports both improved clinical care and the generation of real-world evidence that can inform future trials.
Applicability:
Our system was developed on outpatient oncology notes, such as initial consultations and follow-up progress notes, where treatment and response mentions are both prominently featured. In these documents, sentences such as "The patient received four cycles of carboplatin" or "Imaging shows no new lesions" plainly link therapies to outcomes, which our span-detection and relation models can readily capture. By contrast, other note types, such as inpatient progress notes, discharge summaries, and survivorship summaries, often fold oncology observations into broader multisystem narratives, embedding treatment and response information within longer, temporally complex passages that can obscure the relationship between therapies and outcomes and make extraction more challenging.
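As a simple illustration of how such explicit mentions can be flagged, the sketch below applies a small, hypothetical lexicon and a few regular expressions to mark treatment and response spans; the published system's dictionaries and rules are considerably more extensive.

```python
# Minimal sketch of rule-based entity spotting; lexicon and patterns are
# illustrative placeholders, not the system's actual rule set.
import re

TREATMENT_TERMS = {
    "carboplatin": "chemotherapy",
    "pemetrexed": "chemotherapy",
    "pembrolizumab": "immunotherapy",
    "osimertinib": "targeted therapy",
    "lobectomy": "cancer surgery",
}
RESPONSE_PATTERNS = {
    "complete response": r"\bcomplete response\b|\bno evidence of disease\b",
    "partial response": r"\bpartial response\b|\b(decrease|shrinkage) in (tumor|lesion)",
    "stable disease": r"\bstable disease\b|\bno new lesions\b",
    "progressive disease": r"\bprogressi(on|ve disease)\b|\bnew (lesion|metastas)",
}

def extract_entities(text: str):
    """Return (label, matched_text, char_span) tuples for treatment and response mentions."""
    entities = []
    lowered = text.lower()
    for term, category in TREATMENT_TERMS.items():
        for m in re.finditer(rf"\b{re.escape(term)}\b", lowered):
            entities.append(("treatment:" + category, text[m.start():m.end()], m.span()))
    for label, pattern in RESPONSE_PATTERNS.items():
        for m in re.finditer(pattern, lowered):
            entities.append(("response:" + label, text[m.start():m.end()], m.span()))
    return entities

print(extract_entities(
    "The patient received four cycles of carboplatin; imaging shows no new lesions."
))
```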
The Mayo Clinic corpus used for external validation employed a distinct annotation schema and definition for chemotherapy and response entities, which likely contributed to the observed decline in F1 scores (0.51–0.64). Aligning annotation standards or fine-tuning on a small set of similarly annotated local notes should help mitigate these discrepancies and restore high performance across diverse documentation settings.
Limitations:
Our study has the following limitations. First, the system was developed and evaluated on notes from patients with NSCLC; its effectiveness in other cancer types is therefore unknown. Second, although the initial inter-annotator agreement was moderate (F1 = 0.68), ongoing refinement of the guidelines helped improve consistency; however, inherent ambiguities in clinical language remain. Third, the system currently processes only narrative text and does not integrate structured data such as radiology measurements, which may further enhance accuracy. Fourth, our study was conducted on unstructured oncology notes authored by NSCLC specialists; we did not explicitly evaluate performance on notes of differing quality or on narratives from other cancer care providers.
Although clinician-documented impressions often correlate with formal RECIST assessments, these narrative descriptions may reflect subjective clinical interpretation. We chose this approach initially to leverage readily available narrative content across sites, given the variability and limited accessibility of structured radiology measurements in our EHR. Future work will focus on linking extracted response mentions to structured radiology measurements or on developing modules to automatically extract longest-diameter measurements from radiology reports, thereby enabling standardized RECIST computation.
Furthermore, because of annotation guideline discrepancies between our UPMC corpus and the Mayo Clinic dataset, we were unable to externally evaluate the rule-based entity extraction module; only relation extraction was validated, yielding F1 scores of 0.51–0.64 (Table 3). Harmonization of annotation standards will be essential for comprehensive multicenter validation. Finally, there is provider variability in real-world documentation of treatment response, and it is often difficult to distinguish between stable disease, partial response, and complete response. Prior work has demonstrated that narrative RECIST mentions agree with formal RECIST-measured categories in only 70–85% of cases (Li et al), and our empirical chart review showed that narrative descriptions of "stable disease" corresponded to formal RECIST-defined SD. However, the relevance of these categories in real-world settings is nuanced, especially when the intent of treatment is palliative: in these scenarios, treatment often changes in the setting of progressive symptoms or disease but remains unchanged when patients have stable disease, partial response, or complete response without progressive symptoms. Thus, our NLP system may be most useful in specific patient populations.
Future directions:
We have validated our relation extraction component for chemotherapy and the four response entities using a Mayo Clinic dataset. To further improve generalizability, a key priority is external validation using datasets from other institutions. To achieve end-to-end RECIST automation, we are developing a radiology-linking module that combines NLP and rule-based methods to extract target-lesion longest-diameter measurements from the 'Impression' section of radiology reports. These measurements will then be programmatically linked to their corresponding narrative response mentions in clinical notes, enabling automatic computation of formal RECIST categories.
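As a rough sketch of the intended workflow, the example below pulls millimeter measurements out of an Impression section with a simple pattern and applies the standard RECIST 1.1 target-lesion thresholds to sums of longest diameters. The pattern, section handling, and example values are illustrative assumptions rather than the module under development.

```python
# Hypothetical sketch: extract longest-diameter measurements from an Impression
# section and map sums of diameters to RECIST 1.1 target-lesion categories.
import re

def extract_longest_diameters_mm(impression: str) -> list:
    """Very simplified: capture measurements like '2.1 cm' or '14 mm' and convert to mm."""
    diameters = []
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(mm|cm)", impression, flags=re.I):
        mm = float(value) * (10.0 if unit.lower() == "cm" else 1.0)
        diameters.append(mm)
    return diameters

def recist_category(baseline_sum_mm: float, current_sum_mm: float, nadir_sum_mm: float) -> str:
    """RECIST 1.1 target-lesion response from sums of longest diameters."""
    if current_sum_mm == 0:
        return "complete response"
    if current_sum_mm <= 0.7 * baseline_sum_mm:           # >=30% decrease from baseline
        return "partial response"
    if current_sum_mm >= 1.2 * nadir_sum_mm and current_sum_mm - nadir_sum_mm >= 5:
        return "progressive disease"                       # >=20% and >=5 mm increase from nadir
    return "stable disease"

current = sum(extract_longest_diameters_mm("Impression: RLL mass 2.1 cm; LUL nodule 8 mm."))
print(current, recist_category(baseline_sum_mm=40.0, current_sum_mm=current, nadir_sum_mm=29.0))
```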
Additional future directions include incorporating large language models (LLMs) to potentially consolidate the extraction and relation-linking tasks into a single model and integrating related clinical documents (e.g., radiology and pathology reports) to further improve accuracy. Some studies have attempted to extract cancer treatment progression using LLMs,30 but the substantial size of these models limits their integration into actual hospital settings.
In this study, our goal was to evaluate the system’s generalizability on a proprietary external corpus; we therefore did not perform Mayo-specific fine-tuning. Although techniques such as few-shot fine-tuning, unsupervised domain adaptation, and semi-supervised lexicon transfer have been shown to improve cross-institution NLP performance, these were beyond the scope of the current work and will be explored in follow-on studies.
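For context, few-shot fine-tuning of the relation classifier on a handful of locally annotated examples could look roughly like the sketch below; the checkpoint, hyperparameters, and toy data are assumptions for illustration, not an implemented part of this study.

```python
# Minimal sketch of few-shot fine-tuning for relation classification on a small
# set of locally annotated sentence/pair examples (hypothetical data and settings).
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class PairDataset(Dataset):
    """Tiny dataset of 'sentence [SEP] treatment [SEP] response' strings with 0/1 labels."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=2)

# Hypothetical locally annotated examples from the target institution.
local_texts = ["Osimertinib started; scans show partial response. [SEP] osimertinib [SEP] partial response"]
local_labels = [1]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft_out", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=PairDataset(local_texts, local_labels, tokenizer),
)
trainer.train()
```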
Embedding this tool within live EHR systems could facilitate real-time decision support via automated updates of structured treatment response summaries, which may streamline oncologist workflow and reduce the burden of manual chart review. The tool can also be adapted to extract other key endpoints that are not typically captured in structured data, such as toxicity patterns and symptom clusters over time in relation to specific treatment regimens and comorbidities.
In conclusion, our study demonstrates that an end-to-end NLP approach can effectively extract meaningful oncology outcomes from free-text clinical notes with accuracy approaching that of human abstraction. By structuring the unstructured, our methodology unlocks valuable real-world treatment response data at scale, potentially informing both clinical decision-making and large-scale observational research. Future multi-center validations and workflow integrations will be key to realizing the full clinical utility of this approach.
Supplementary Material
Appendix Figure Legends
Figure A1. Confusion matrix illustrating entity-level performance across the UPMC test set.
Appendix Table A1