JMIR Preprints #68707: Performance of Natural Language Processing for Information Extraction from Electronic Health Records within Cancer: A Systematic Review

Performance of Natural Language Processing for Information Extraction from Electronic Health Records within Cancer: A Systematic Review

Simon Dahl;
Martin Bøgsted;
Tomer Sagi;
Charles Vesteghem

ABSTRACT

Background:

Over the last decade, natural language processing (NLP) has provided various solutions for information extraction (IE) from textual clinical data. In recent years, the use of NLP in cancer research has gained considerable attention, with numerous studies exploring the effectiveness of various NLP techniques for identifying and extracting cancer-related entities from clinical text data.

Objective:

We aimed to summarize the performance differences between various NLP models for IE within the context of cancer to provide an overview of the relative performance of existing models.

Methods:

This systematic literature review searched three databases (PubMed, Scopus, and Web of Science) for articles extracting cancer-related entities from clinical texts. In total, 33 articles were eligible for inclusion. We extracted the NLP models and their performance as measured by F1 scores. Each model was assigned to one of the following categories: Rule-based, Traditional Machine Learning, CRF-based, Neural Network, and Bidirectional transformer. For each pair of categories, the average performance difference was calculated across all articles.
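The pairwise comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each article contributes one F1 score per category it evaluates, and that a pair of categories is averaged only over articles reporting both. The function name `mean_pairwise_differences` and the F1 values are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def mean_pairwise_differences(articles):
    """Average F1 difference for each ordered pair of model categories.

    `articles` is a list of dicts mapping category name -> F1 score.
    Assumption: one F1 score per category per article, and a pair is
    only compared within articles that report both categories.
    """
    diffs = defaultdict(list)
    for scores in articles:
        # Every ordered pair (a, b) of categories present in this article.
        for a, b in permutations(scores, 2):
            diffs[(a, b)].append(scores[a] - scores[b])
    # Average the within-article differences across articles.
    return {pair: sum(values) / len(values) for pair, values in diffs.items()}


# Made-up F1 scores for illustration only; not taken from the review.
articles = [
    {"Rule-based": 0.70, "CRF-based": 0.80, "Bidirectional transformer": 0.90},
    {"CRF-based": 0.75, "Bidirectional transformer": 0.85},
]
result = mean_pairwise_differences(articles)
```

A positive value for the pair `("Bidirectional transformer", "CRF-based")` means the first category scored higher on average in the articles that evaluated both. Comparing within articles keeps the dataset and task fixed for each difference, which is why the sketch does not simply average each category over all articles.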

Results:

The articles covered various scenarios, with the best F1 score per article ranging from 0.355 to 0.985. In terms of overall relative performance, the bidirectional transformer category outperformed every other category, by between 0.0439 and 0.2335 in average F1 score. The percentage of articles implementing bidirectional transformers has increased over the years.

Conclusions:

NLP has demonstrated the ability to identify and extract cancer-related entities from unstructured textual data. Generally, more advanced models outperform less advanced ones. The bidirectional transformer category performed the best.

Citation

Please cite as:

Dahl S, Bøgsted M, Sagi T, Vesteghem C

Performance of Natural Language Processing for Information Extraction From Electronic Health Records Within Cancer: Systematic Review

JMIR Med Inform 2025;13:e68707

DOI: 10.2196/68707

PMID: 40939201

PMCID: 12431712

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 12, 2024

Date Accepted: Jun 17, 2025
