JMIR Preprints #63767: Deep learning models to screen electronic health records for breast and colorectal cancer progression

Deep learning models to screen electronic health records for breast and colorectal cancer progression

Pascal Lambert;
Rayyan Khan;
Marshall Pitz;
Harminder Singh;
Helen Chen;
Kathleen Decker

ABSTRACT

Background:

Cancer progression is an important outcome in cancer research. However, it is often documented in electronic health records (EHRs) only as unstructured text, which requires lengthy and costly chart reviews to extract for retrospective studies.

Objective:

The objective of this study is to evaluate the performance of three deep learning language models in determining breast and colorectal cancer progression from EHRs.

Methods:

Electronic health records for individuals diagnosed with stage IV breast or colorectal cancer between 2004 and 2020 in Manitoba, Canada were extracted. A chart review was conducted to identify cancer progression for each EHR. Data were analyzed with pre-trained deep learning language models (Bio+ClinicalBERT, Clinical-BigBird, and Clinical-Longformer). Sensitivity, positive predictive value, area under the curve, and scaled Brier scores were used to evaluate performance. Influential tokens were identified by removing and adding tokens to EHRs and examining changes in predicted probabilities.
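The performance measures named above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' evaluation code: the function name `screening_metrics`, the 0.5 decision threshold, and the use of the common definition of the scaled Brier score (1 minus the Brier score divided by the Brier score of a model that always predicts the observed prevalence) are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def screening_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, PPV, AUC, and scaled Brier score for binary predictions.

    Assumes both classes are present and at least one chart is flagged;
    the threshold of 0.5 is a hypothetical choice.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    y_pred = y_prob >= threshold

    tp = np.sum(y_pred & (y_true == 1))   # progression correctly flagged
    fp = np.sum(y_pred & (y_true == 0))   # flagged, but no progression
    fn = np.sum(~y_pred & (y_true == 1))  # progression missed

    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)

    # AUC as the share of (positive, negative) pairs ranked correctly,
    # counting ties as half.
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

    # Brier score, then scaled against a prevalence-only reference model.
    brier = np.mean((y_prob - y_true) ** 2)
    prevalence = y_true.mean()
    brier_ref = prevalence * (1 - prevalence)
    scaled_brier = 1 - brier / brier_ref

    return {
        "sensitivity": float(sensitivity),
        "ppv": float(ppv),
        "auc": float(auc),
        "scaled_brier": float(scaled_brier),
    }


# Toy labels and probabilities, invented purely for illustration.
metrics = screening_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

Under this definition, a scaled Brier score of 0 means the model is no better than always predicting the prevalence, and 1 means perfect probabilistic predictions.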

Results:

Clinical-BigBird and Clinical-Longformer models for breast and colorectal cancer cohorts demonstrated higher accuracy than Bio+ClinicalBERT models (scaled Brier scores for breast cancer models: 0.71 – 0.79 versus 0.49 – 0.71; scaled Brier scores for colorectal cancer models: 0.61 – 0.65 versus 0.49 – 0.61). The same models also demonstrated higher sensitivity (breast cancer models: 86.6 – 94.3% versus 77.8 – 87.1%; colorectal cancer models: 73.1 – 78.9% versus 62.8 – 78.2%) and positive predictive value (breast cancer models: 77.9 – 92.3% versus 68.0 – 85.5%; colorectal cancer models: 81.6 – 86.3% versus 72.7 – 84.1%) compared to Bio+ClinicalBERT models. All models could remove more than 84% of charts from the chart review process. The most influential token was the word “progression”, whose influence depended on the presence of other tokens and on its position within an EHR.
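The token-influence analysis described in the Methods can be sketched as a simple removal (occlusion) loop. This is an illustrative sketch only: `token_removal_influence` and `toy_predict` are hypothetical names, whitespace splitting stands in for the subword tokenizers that BERT-style models actually use, and the toy scorer is a stand-in rule rather than any of the study's models. Only the removal half of the "removing and adding tokens" procedure is shown.

```python
from typing import Callable, Dict


def token_removal_influence(
    text: str, predict: Callable[[str], float]
) -> Dict[str, float]:
    """Change in predicted probability when each distinct token is removed.

    A positive value means the token pushed the prediction towards
    "progression"; `predict` is any function mapping text to a probability.
    """
    tokens = text.split()  # assumption: whitespace tokens, not subwords
    baseline = predict(text)
    influence = {}
    for tok in set(tokens):
        reduced = " ".join(t for t in tokens if t != tok)
        influence[tok] = baseline - predict(reduced)
    return influence


def toy_predict(text: str) -> float:
    """Stand-in scorer for illustration; NOT one of the study's models."""
    return 0.9 if "progression" in text.lower().split() else 0.1


# Invented example note, used only to exercise the function.
influence = token_removal_influence(
    "CT shows progression of disease", toy_predict
)
```

With a real model, `predict` would wrap the fine-tuned classifier, and the same loop could be repeated with the token moved to different positions to probe the position effect noted above.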

Conclusions:

The deep learning language models could help to identify breast and colorectal cancer progression in EHRs and remove a majority of charts from the chart review process. A limited number of tokens may influence model predictions. Improvements in model performance could be obtained by increasing the training dataset size and analyzing EHRs at the sentence level rather than the EHR level.

Citation

Please cite as:

Lambert P, Khan R, Pitz M, Singh H, Chen H, Decker K

Deep Learning Models to Screen Electronic Health Records for Breast and Colorectal Cancer Progression: Performance Evaluation Study

JMIR AI 2025;4:e63767

DOI: 10.2196/63767

PMID: 41082723

PMCID: 12559821

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR AI

Date Submitted: Jun 28, 2024

Open Peer Review Period: Jun 28, 2024 - Aug 23, 2024

Date Accepted: Aug 2, 2025
