Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Nov 2, 2020
Date Accepted: Dec 12, 2020
Date Submitted to PubMed: Jan 15, 2021
Classification of the Disposition of Patients Hospitalized with COVID-19: Reading Discharge Summaries using Natural Language Processing
ABSTRACT
Background:
Medical notes are a rich source of patient data; however, the unstructured nature of the text has largely precluded the use of these data in large retrospective analyses. Transforming clinical text into structured data can enable large-scale research studies with electronic health record (EHR) data. Natural language processing (NLP) can be used for text information retrieval, reducing the need for labor-intensive chart review. Here we present an application of NLP to the large-scale analysis of medical records at two large hospitals for patients hospitalized with COVID-19.
Objective:
Our goal was to develop an NLP pipeline to classify the discharge disposition (home, inpatient rehabilitation, skilled inpatient nursing facility [SNIF], and death) of patients hospitalized with COVID-19 based on hospital discharge summary notes.
Methods:
Text mining and feature engineering were applied to unstructured text from hospital discharge summaries. The study included patients with COVID-19 discharged from 2 hospitals in the Boston, Massachusetts area (Massachusetts General Hospital and Brigham and Women’s Hospital) between March 10, 2020, and June 30, 2020. The data were divided into a training set (70%) and a hold-out test set (30%). Discharge summaries were represented as bags-of-words consisting of single words (1-grams), 2-grams, and 3-grams. The number of features was reduced during training by excluding n-grams that occurred in fewer than 10% of discharge summaries, and was reduced further by LASSO regularization during training of a multiclass logistic regression model. Model performance was evaluated in the hold-out test set.
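The bag-of-words representation and the 10% document-frequency cutoff described above can be illustrated with a minimal sketch. It is not the authors' code: the whitespace tokenizer, the function names, and the toy notes are assumptions made for illustration, and the LASSO-regularized multiclass logistic regression step is not shown. In a setup like the one described, the vocabulary would presumably be built from the training set only.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return all 1- to n_max-grams of a token list as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(docs, min_doc_fraction=0.10, n_max=3):
    """Keep n-grams that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for text in docs:
        # A set counts each n-gram once per document (document frequency).
        doc_freq.update(set(ngrams(text.lower().split(), n_max)))
    threshold = min_doc_fraction * len(docs)
    return sorted(g for g, df in doc_freq.items() if df >= threshold)


def bag_of_words(text, vocabulary, n_max=3):
    """Count vector of one document over a fixed vocabulary."""
    counts = Counter(ngrams(text.lower().split(), n_max))
    return [counts[g] for g in vocabulary]


# Invented toy notes, not taken from the study data.
toy_notes = [
    "discharged home with home health",
    "discharged to rehabilitation after intubation",
    "discharged home with follow up",
]
# A 50% cutoff is used here only because the toy corpus has three documents;
# the abstract reports a 10% cutoff.
vocab = build_vocabulary(toy_notes, min_doc_fraction=0.5)
vectors = [bag_of_words(note, vocab) for note in toy_notes]
```

Each row of `vectors` is the count vector that a downstream classifier would receive for one discharge summary.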
Results:
The study cohort comprised 1737 adult patients (median [SD] age, 61 [18] years; 55% men; 45% White and 16% Black; 14% non-survivors; 61% discharged home). The model selected 179 features from a vocabulary of 1056 engineered features, consisting of combinations of unigrams, bigrams, and trigrams. The features contributing most to the model's classification of each outcome were: ‘appointments specialty’, ‘home health’, and ‘home care’ (home); ‘intubate’ and ‘ARDS’ (inpatient rehabilitation); ‘service’ (SNIF); and ‘brief assessment’ and ‘covid’ (death). In the test set, the model achieved a micro-average area under the receiver operating characteristic curve of 0.98 (95% CI 0.97-0.98) and a micro-average precision of 0.81 (95% CI 0.75-0.84) for prediction of discharge disposition.
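For readers unfamiliar with micro-averaging, the sketch below shows one common way to compute a micro-average area under the receiver operating characteristic curve for a multiclass problem: every (patient, class) pair is pooled into a single binary problem before scoring. The abstract does not state how the authors computed the metric, so this is an assumption, and the class labels, probabilities, and function names are invented for illustration. The pairwise comparison is quadratic in the number of pairs and is meant for clarity, not speed.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_average_auroc(true_classes, prob_rows, classes):
    """Pool all (patient, class) pairs into one binary problem, then score it."""
    flat_labels, flat_scores = [], []
    for true_class, probs in zip(true_classes, prob_rows):
        for cls, p in zip(classes, probs):
            flat_labels.append(1 if cls == true_class else 0)
            flat_scores.append(p)
    return auroc(flat_labels, flat_scores)


# Invented predictions for three hypothetical patients.
classes = ["home", "rehab", "SNIF", "death"]
true_classes = ["home", "death", "rehab"]
prob_rows = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.1, 0.1, 0.6],
    [0.5, 0.3, 0.1, 0.1],
]
score = micro_average_auroc(true_classes, prob_rows, classes)
```

Because pooling weights every patient-class pair equally, the frequent classes (here, discharge home) dominate a micro-average more than they would a macro-average.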
Conclusions:
A supervised learning-based NLP approach can classify the discharge disposition of patients hospitalized with COVID-19. This approach has the potential to accelerate and increase the scale of research on patients’ discharge disposition that is possible with EHR data.
Clinical Trial: Not a clinical trial.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.