Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jul 27, 2020
Date Accepted: Jan 20, 2021

The final, peer-reviewed published version of this preprint can be found here:

Natural Language Processing and Machine Learning for Identifying Incident Stroke From Electronic Health Records: Algorithm Development and Validation

Zhao Y, Fu S, Bielinski S, Decker PA, Chamberlain AM, Roger VL, Liu H, Larson NB

J Med Internet Res 2021;23(3):e22951

DOI: 10.2196/22951

PMID: 33683212

PMCID: 7985804

Using Natural Language Processing and Machine Learning to Identify Incident Stroke from Electronic Health Records

  • Yiqing Zhao
  • Sunyang Fu
  • Suzette Bielinski
  • Paul A Decker
  • Alanna M Chamberlain
  • Veronique L Roger
  • Hongfang Liu
  • Nicholas B Larson

ABSTRACT

Background:

Stroke, a syndrome of rapid loss of cerebral function with vascular origin, is an important outcome in cardiovascular research. Incident stroke is typically ascertained through time-consuming manual chart abstraction. Existing electronic health record phenotyping efforts for stroke focus on case ascertainment rather than on incident disease, which requires knowledge of the temporal sequence of events.

Objective:

To develop a machine learning-based phenotyping algorithm for incident stroke ascertainment based on diagnosis codes, procedure codes, and clinical concepts extracted from clinical notes using natural language processing.

Methods:

The algorithm was trained and validated using an existing epidemiology cohort of 4914 patients with atrial fibrillation (AF) and manually curated incident stroke events. Various combinations of feature sets and machine learning classifiers were compared. Using a heuristic rule based on the composition of concepts and codes, we further determined the subtype (ischemic stroke/transient ischemic attack or hemorrhagic stroke) of each identified stroke. The algorithm was also evaluated on a cohort (n=150) drawn by stratified sampling from an Olmsted County population (n=74,314).
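The abstract does not spell out the subtype rule. As a minimal sketch of what a rule "based on the composition of concepts and codes" could look like, the example below takes a majority vote over subtype-specific mentions attached to one identified stroke event. The term lists, the `classify_subtype` function, and the tie-breaking behavior are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

# Hypothetical term lists for illustration only; the concept and code sets
# actually used in the study are not given in the abstract.
ISCHEMIC_TERMS = {
    "ischemic stroke",
    "cerebral infarction",
    "transient ischemic attack",
    "tia",
}
HEMORRHAGIC_TERMS = {
    "hemorrhagic stroke",
    "intracerebral hemorrhage",
    "subarachnoid hemorrhage",
}


def classify_subtype(mentions):
    """Assign a subtype by majority vote over subtype-specific mentions.

    `mentions` is an iterable of strings (concepts or code descriptions)
    linked to one identified stroke event. Mentions that match neither
    list are ignored. Ties, including the case of no matches at all,
    return "undetermined" (an assumption of this sketch).
    """
    counts = Counter()
    for mention in mentions:
        term = mention.strip().lower()
        if term in ISCHEMIC_TERMS:
            counts["ischemic/TIA"] += 1
        elif term in HEMORRHAGIC_TERMS:
            counts["hemorrhagic"] += 1

    ischemic = counts["ischemic/TIA"]
    hemorrhagic = counts["hemorrhagic"]
    if ischemic == hemorrhagic:
        return "undetermined"
    return "ischemic/TIA" if ischemic > hemorrhagic else "hemorrhagic"


if __name__ == "__main__":
    example = ["TIA", "Cerebral infarction", "subarachnoid hemorrhage", "headache"]
    print(classify_subtype(example))
```

A vote over mention counts is only one possible reading; a rule weighted by code type or by temporal proximity to the event would fit the same description.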

Results:

Among the 4914 patients with AF, 740 had validated incident stroke events. The best-performing stroke phenotyping algorithm used clinical concepts, diagnosis codes, and procedure codes as features with a random forest classifier. On the general population sample, the best model achieved a positive predictive value of 86%, negative predictive value of 96%, sensitivity of 92%, and specificity of 93%. For subtype identification, we achieved an accuracy of 83% in the AF cohort and 80% in the general population sample.
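For readers less familiar with the four reported measures, the sketch below shows how each is derived from a 2x2 confusion matrix of algorithm output against manual chart review. The function name `diagnostic_metrics` and the counts in the usage example are hypothetical; they are not the study's data, and the sketch does not reproduce any weighting the authors may have applied to the stratified sample.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def diagnostic_metrics(tp, fp, tn, fn):
    """Compute PPV, NPV, sensitivity, and specificity from confusion-matrix counts.

    tp: algorithm positive, chart review positive
    fp: algorithm positive, chart review negative
    tn: algorithm negative, chart review negative
    fn: algorithm negative, chart review positive
    """
    return {
        "ppv": _ratio(tp, tp + fp),          # of flagged patients, fraction with true stroke
        "npv": _ratio(tn, tn + fn),          # of unflagged patients, fraction truly stroke-free
        "sensitivity": _ratio(tp, tp + fn),  # of true strokes, fraction flagged
        "specificity": _ratio(tn, tn + fp),  # of stroke-free patients, fraction not flagged
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the function.
    for name, value in diagnostic_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.2f}")
```

Note that predictive values depend on how common stroke is in the evaluated sample, so values computed on a stratified sample are not directly comparable to those expected in the unsampled population.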

Conclusions:

We developed and validated an algorithm that performed well both in identifying incident stroke and in determining stroke subtype. The algorithm also performed well on a sample from a general population, which supports its generalizability and its potential for adoption by other institutions.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.