JMIR Preprints #65984: Large language model applications for health information extraction in oncology: a scoping review

Large language model applications for health information extraction in oncology: a scoping review

David Chen;
Saif Alnassar;
Kate Avison;
Ryan Huang;
Srinivas Raman

ABSTRACT

Background:

Natural language processing systems for data extraction from unstructured clinical text require expert-driven input for labelled annotations and model training. The natural language processing competency of large language models (LLMs) can enable automated extraction of important patient characteristics from electronic health records, which is useful for accelerating cancer clinical research and informing oncology care.

Objective:

This scoping review will map the current landscape, including definitions, frameworks, and future directions, of LLMs applied to data extraction from clinical text in oncology.

Methods:

On June 2, 2024, we queried Ovid MEDLINE for primary, peer-reviewed research studies published since 2000 using oncology and LLM-related keywords. This scoping review included studies that evaluated the performance of an LLM applied to data extraction from clinical text in oncology contexts. Study attributes and main outcomes were extracted to outline key trends of research in LLMs for data extraction.

Results:

The literature search yielded 24 studies for inclusion. Most studies assessed original and fine-tuned variants of the BERT LLM (n=18, 75%), followed by the ChatGPT conversational LLM (n=6, 25%). LLMs for data extraction were most commonly applied in pan-cancer clinical settings (46%), followed by breast (n=4, 17%) and lung (n=4, 17%) cancer contexts. Most studies evaluated LLM performance on multi-institution datasets (n=18, 75%). From 2019-2021 to 2022-2024, the total number of studies, the number of studies using fine-tuning, and the number of studies using prompt engineering all increased. Reported advantages of LLMs included positive data extraction performance and reduced manual workload.

Conclusions:

LLMs applied to data extraction in oncology can serve as a useful automated tool to reduce the administrative review of patient health records and increase time for patient-facing care. Recent advances in prompt engineering, fine-tuning methods, and multi-modal data extraction are promising directions for future research. Future research is needed to evaluate the performance of LLM-enabled data extraction in clinical domains outside of the training dataset and to assess the scope and integration of LLMs in real-world clinical environments.

Citation

Please cite as:

Chen D, Alnassar S, Avison K, Huang R, Raman S

Large Language Model Applications for Health Information Extraction in Oncology: Scoping Review

JMIR Cancer 2025;11:e65984

DOI: 10.2196/65984

PMID: 40153782

PMCID: 11970800

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Cancer

Date Submitted: Aug 30, 2024

Date Accepted: Jan 27, 2025
