JMIR Preprints #23930: Using Machine Learning to extract Rheumatoid Arthritis patients from Electronic Health Records: Algorithm Development and Validation study

Using Machine Learning to extract Rheumatoid Arthritis patients from Electronic Health Records: Algorithm Development and Validation study

Tjardo D. Maarseveen;
Timo Meinderink;
Marcel J.T. Reinders;
Johannes Knitza;
Tom W.J. Huizinga;
Arnd Kleyer;
David Simon;
Erik B. van den Akker;
Rachel Knevel

ABSTRACT

Background:

Financial codes are often used to extract diagnoses from electronic health records (EHRs), but this approach is prone to false positives. Alternatively, queries can be constructed, but these are highly center- and language-specific. A tantalizing alternative is to identify patients automatically by applying machine learning to format-free text entries.

Objective:

To develop an easily implementable workflow that builds a machine learning method (MLM) capable of accurately identifying patients with rheumatoid arthritis (RA) from format-free text fields in EHRs.

Methods:

For this study, two datasets were employed: Leiden (N=3,000) and Erlangen (N=4,771). Using the Leiden dataset, we first compared the performance of six different MLMs and a naive word-matching algorithm in a 10-fold cross-validation setting. Performance was compared using the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PRC). We used the F1-score (the harmonic mean of precision and recall) as the primary criterion for selecting the best MLM. Each MLM builds a classifier that outputs the probability of a patient being an RA case. We selected the optimal threshold for case identification (defined by a high positive predictive value [PPV]) on the output of the best MLM in the training data. The developed workflow was subsequently applied to the Erlangen data, where we trained and validated on 4,293 Erlangen (Germany) patient records. In the final test phase, the best-performing MLMs were applied to unseen test data (Leiden N=1,000; Erlangen N=478) for an unbiased evaluation.
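The threshold-selection step described above can be illustrated with a minimal sketch. The abstract does not state the exact selection rule, so the rule used here is an assumption: take the lowest probability threshold whose PPV on the training data reaches a target value, which keeps recall as high as possible among the qualifying thresholds. The function names, the toy labels and scores, and the 0.95 target are likewise hypothetical.

```python
from typing import List, Optional, Tuple


def precision_recall_f1(
    labels: List[int], scores: List[float], threshold: float
) -> Tuple[float, float, float]:
    """Precision (PPV), recall and F1 when every score >= threshold is called a case."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def lowest_threshold_meeting_ppv(
    labels: List[int], scores: List[float], target_ppv: float
) -> Optional[float]:
    """Assumed rule: scan candidate thresholds in ascending order and return the
    first one whose PPV reaches the target; None if no threshold qualifies."""
    for threshold in sorted(set(scores)):
        precision, _, _ = precision_recall_f1(labels, scores, threshold)
        if precision >= target_ppv:
            return threshold
    return None


# Toy training data (hypothetical): 1 = RA case, 0 = non-case,
# scores = classifier probabilities of being an RA case.
labels = [0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]

chosen = lowest_threshold_meeting_ppv(labels, scores, target_ppv=0.95)
print("chosen threshold:", chosen)
print("precision, recall, F1:", precision_recall_f1(labels, scores, chosen))
```

Because recall can only fall as the threshold rises, the first threshold that meets the PPV target is also the one with the highest recall among all qualifying thresholds. In practice the chosen threshold would be fixed on the training data and then applied unchanged to the held-out test set.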

Results:

In the Leiden data, the word-matching algorithm achieved a good AUC-ROC (0.90) but a poor AUC-PRC (0.33). Four out of six MLMs significantly outperformed the word-matching algorithm, with the support vector machine (SVM) performing best (AUC-ROC=0.98 vs. 0.90; AUC-PRC=0.88 vs. 0.33; F1-score=0.83 vs. 0.55). Applying this SVM classifier to the independent 1,000 patients resulted in similarly high performance (F1=0.81; PPV=0.94). With this method, we could identify 2,873 RA patients among all 23,300 patients in the Leiden EHR in less than 7 seconds. In Erlangen, gradient boosting (GBM) performed best (AUC-ROC=0.94; AUC-PRC=0.85; F1=0.82) in the training set. Applying the settings of the first phase to the untouched test data again yielded good results (F1=0.67; PPV=0.97).
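Because F1 is the harmonic mean of precision (PPV) and recall, the recall implied by a reported F1 and PPV pair can be recovered by rearranging F1 = 2PR/(P+R) into R = F1*P/(2P - F1). The sketch below applies that rearrangement to the test-set values reported above. The function name is hypothetical, and because the reported values are rounded, the outputs are approximations derived here, not figures reported by the authors.

```python
def implied_recall(f1: float, ppv: float) -> float:
    """Solve F1 = 2*P*R / (P + R) for recall R, given F1 and precision P (PPV)."""
    denominator = 2 * ppv - f1
    # Guard against a zero or negative denominator, which no valid
    # (precision, recall) pair can produce.
    if denominator <= 0:
        raise ValueError("F1 must be smaller than 2 * PPV")
    return f1 * ppv / denominator


# Rounded test-set values taken from the Results paragraph.
leiden_recall = implied_recall(f1=0.81, ppv=0.94)
erlangen_recall = implied_recall(f1=0.67, ppv=0.97)

print(f"Leiden test set:   implied recall ~ {leiden_recall:.2f}")
print(f"Erlangen test set: implied recall ~ {erlangen_recall:.2f}")
```

Under these assumptions the implied recall is roughly 0.71 for Leiden and 0.51 for Erlangen, which is consistent with a threshold chosen for high PPV trading some sensitivity for precision.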

Conclusions:

We demonstrate that MLMs can extract RA cases from EHR data with high precision, allowing research on very large populations at limited cost. Our approach is language- and center-independent and can be applied to any other diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow that equips centers to obtain their own high-performing algorithm. This allows the creation of observational studies of unprecedented size, covering different countries, at low cost from data already available in EHR systems.

Citation

Please cite as:

Maarseveen TD, Meinderink T, Reinders MJ, Knitza J, Huizinga TW, Kleyer A, Simon D, van den Akker EB, Knevel R

Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study

JMIR Med Inform 2020;8(11):e23930

DOI: 10.2196/23930

PMID: 33252349

PMCID: 7735897

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Aug 28, 2020

Date Accepted: Oct 24, 2020
