Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Aug 28, 2020
Date Accepted: Oct 24, 2020

The final, peer-reviewed published version of this preprint can be found here:

Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study

Maarseveen TD, Meinderink T, Reinders MJ, Knitza J, Huizinga TW, Kleyer A, Simon D, van den Akker EB, Knevel R

JMIR Med Inform 2020;8(11):e23930

DOI: 10.2196/23930

PMID: 33252349

PMCID: 7735897

Using Machine Learning to extract Rheumatoid Arthritis patients from Electronic Health Records: Algorithm Development and Validation study

  • Tjardo D. Maarseveen
  • Timo Meinderink
  • Marcel J.T. Reinders
  • Johannes Knitza
  • Tom W.J. Huizinga
  • Arnd Kleyer
  • David Simon
  • Erik B. van den Akker
  • Rachel Knevel

ABSTRACT

Background:

Financial codes are often used to extract diagnoses from electronic health records (EHR), but this approach is prone to false positives. Alternatively, queries can be constructed, but these are highly center- and language-specific. A tantalizing alternative is the automatic identification of patients by applying machine learning to format-free text entries.
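To make the limitation of hand-built queries concrete, here is a minimal sketch of a keyword query over free-text notes. The term list, the `word_match` helper, and the example notes are all hypothetical illustrations, not the queries or data used in the study.

```python
import re

# Hypothetical term list; the study's actual query terms are not given in this abstract.
RA_TERMS = ("rheumatoid arthritis", "reumatoïde artritis", "rheumatoide arthritis")


def word_match(note, terms=RA_TERMS):
    """Return True if any term occurs in the note as a whole phrase (case-insensitive)."""
    text = note.lower()
    return any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms)


# Made-up notes for illustration only.
notes = [
    "Patient has rheumatoid arthritis, started methotrexate.",
    "No signs of rheumatoid arthritis; osteoarthritis likely.",  # negated mention
    "Verdenking reumatoïde artritis.",                           # Dutch note
]

flags = [word_match(note) for note in notes]

# The same Dutch note is missed when only the English term is configured.
english_only = word_match(notes[2], terms=("rheumatoid arthritis",))
```

In this toy example the negated second note is still flagged, which is one way such queries produce false positives, and the Dutch note is only found because a Dutch term was added by hand.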

Objective:

To develop an easily implementable workflow that builds a machine learning method (MLM) capable of accurately identifying patients with rheumatoid arthritis (RA) from format-free text fields in EHRs.

Methods:

For this study, two datasets were employed: Leiden (N = 3,000) and Erlangen (N = 4,771). Using the Leiden dataset, we first compared the performance of six different MLMs and a naive word-matching algorithm in a 10-fold cross-validation setting. Performance was compared using the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PRC). We used the F1-score (the harmonic mean of precision and recall) as the primary criterion for selecting the best MLM. Each MLM builds a classifier that outputs the probability of a patient being an RA case. We selected the optimal threshold (defined by a high positive predictive value (PPV)) for case identification on the output of the best MLM in the training data. The developed workflow was subsequently applied to the Erlangen (Germany) data, where we trained and validated on 4,293 patient records. In the final test phase, the best-performing MLMs were applied to unseen test data (Leiden N = 1,000; Erlangen N = 478) for an unbiased evaluation.
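The threshold step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract only says the threshold is defined by a high PPV, so the `min_ppv=0.95` cutoff, the function names, and the toy labels and probabilities are all hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision (PPV), recall, and F1 from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def pick_threshold(y_true, probs, min_ppv=0.95):
    """Return the lowest probability cutoff whose PPV reaches min_ppv.

    Recall can only fall as the cutoff rises, so the lowest qualifying
    cutoff keeps as many true cases as the PPV requirement allows.
    min_ppv is an assumed value, not one reported in the abstract.
    """
    for thr in sorted(set(probs)):
        y_pred = [1 if p >= thr else 0 for p in probs]
        ppv, recall, f1 = precision_recall_f1(y_true, y_pred)
        if ppv >= min_ppv:
            return thr, ppv, recall, f1
    return None  # no cutoff reaches the requested PPV


# Toy training labels (1 = RA case) and classifier probabilities, made up for illustration.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.20, 0.10]

thr, ppv, recall, f1 = pick_threshold(y_true, probs)
```

On this toy data the selected cutoff is 0.80, which gives a PPV of 1.0 at a recall of 0.75. The cutoff would be chosen on training data and then applied unchanged to the held-out test set.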

Results:

In the Leiden data, the word-matching algorithm achieved a good AUC-ROC (0.90) but a poor AUC-PRC (0.33). Four of the six MLMs significantly outperformed the word-matching algorithm, with the support vector machine (SVM) performing best (AUC-ROC=0.98 vs. 0.90; AUC-PRC=0.88 vs. 0.33; F1-score=0.83 vs. 0.55). Applying this SVM classifier to the independent 1,000 patients resulted in similarly high performance (F1=0.81; PPV=0.94). With this method, we could identify 2,873 patients with RA among all 23,300 patients in the Leiden EHR in less than 7 seconds. In Erlangen, gradient boosting (GBM) performed best (AUC-ROC=0.94; AUC-PRC=0.85; F1=0.82) in the training set. Applying the settings of the first phase to the untouched test data once again yielded good results (F1=0.67; PPV=0.97).
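Because F1 is the harmonic mean of precision (PPV) and recall, the reported test-set F1 and PPV pairs imply an approximate recall. The sketch below solves F1 = 2PR / (P + R) for R. The resulting values are back-of-the-envelope estimates derived from rounded figures, not results reported in this abstract, and `implied_recall` is a hypothetical helper name.

```python
def implied_recall(f1, ppv):
    """Solve F1 = 2 * P * R / (P + R) for recall R, given F1 and precision P."""
    return f1 * ppv / (2 * ppv - f1)


# Test-set figures as reported above (rounded to two decimals in the abstract).
leiden_recall = implied_recall(f1=0.81, ppv=0.94)
erlangen_recall = implied_recall(f1=0.67, ppv=0.97)

print(f"Leiden implied recall:   {leiden_recall:.2f}")
print(f"Erlangen implied recall: {erlangen_recall:.2f}")
```

Under these assumptions the implied recall is roughly 0.71 for Leiden and roughly 0.51 for Erlangen, which suggests that the high-PPV threshold trades some sensitivity for precision, more so in the Erlangen test set.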

Conclusions:

We demonstrate that MLMs can extract RA cases from EHR data with high precision, allowing research on very large populations at limited cost. Our approach is language- and center-independent and can be applied to any other diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow that equips centers to obtain their own high-performing algorithm. This allows the creation of observational studies of unprecedented size, covering different countries, at low cost from data already available in EHR systems.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.