Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 12, 2024
Date Accepted: Apr 11, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
A Machine Learning Approach for Identifying People with Neuroinfectious Diseases in Electronic Health Records: Original Research Article
ABSTRACT
Background:
Identifying neuroinfectious disease (NID) cases by manual chart review is time-consuming, and billing codes are imprecise. Machine learning (ML) models that identify NID from electronic health records (EHRs) have not yet been investigated.
Objective:
To develop and validate an ML model that identifies NID cases from unstructured patient notes.
Methods:
Clinical notes from patients who had undergone lumbar puncture were obtained from the EHR of an academic hospital network (Mass General Brigham, MGB); half were associated with NID-related diagnostic codes. Ground truth was established by chart review by six NID-expert physicians. NID keywords were extracted with regular expressions, and the extracted text was converted into bag-of-words representations using n-grams (n=1, 2, 3). Notes were randomly split into training (80%) and hold-out test (20%) sets. Feature selection was performed with L1-regularized logistic regression. An extreme gradient boosting (XGBoost) model classified NID cases, and performance was evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). The model was externally validated on a dataset from an independent hospital (Beth Israel Deaconess Medical Center, BIDMC).
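The pipeline described above (bag-of-words n-grams, L1-based feature selection, gradient-boosted classification, AUROC/AUPRC evaluation) can be sketched roughly as follows. This is a minimal illustration on synthetic notes, not the authors' code: it assumes scikit-learn's CountVectorizer for the n-gram features, SelectFromModel with L1-penalized logistic regression for feature selection, and GradientBoostingClassifier as a stand-in for XGBoost; the keyword lists and note templates are invented.

```python
import random

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in corpus: NID-positive notes mention NID terms,
# negatives mention non-infectious indications (invented for this sketch).
random.seed(0)
pos_terms = ["meningitis", "ventriculitis", "meningoencephalitis"]
neg_terms = ["migraine", "headache", "pseudotumor"]
notes, labels = [], []
for i in range(200):
    is_nid = i % 2 == 0
    term = random.choice(pos_terms if is_nid else neg_terms)
    notes.append(f"lumbar puncture performed concern for {term} csf sent")
    labels.append(1 if is_nid else 0)

# 80/20 train / hold-out-test split, mirroring the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    notes, labels, test_size=0.2, random_state=0, stratify=labels
)

pipe = Pipeline([
    # Bag-of-words over unigrams, bigrams, and trigrams (n = 1, 2, 3).
    ("bow", CountVectorizer(ngram_range=(1, 3))),
    # Keep features with nonzero L1-regularized logistic-regression weights.
    ("l1", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))),
    # Gradient boosting as a stand-in for XGBoost.
    ("clf", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)

# Evaluate on the hold-out set with AUROC and AUPRC.
scores = pipe.predict_proba(X_test)[:, 1]
print(f"AUROC: {roc_auc_score(y_test, scores):.3f}")
print(f"AUPRC: {average_precision_score(y_test, scores):.3f}")
```

Because the synthetic classes are trivially separable by keyword, both metrics come out near 1.0 here; real clinical notes are far noisier.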
Results:
This study included 3,000 patient notes from MGB spanning January 22, 2010, to September 21, 2023. Of the initial 1,284 n-gram features, 342 were selected; the most significant features were ‘meningitis,’ ‘ventriculitis,’ and ‘meningoencephalitis.’ The XGBoost model achieved an AUROC of 0.977 (0.964 - 0.988) and an AUPRC of 0.894 (0.831 - 0.943) on the MGB hold-out test set. On 600 notes from BIDMC, the model achieved an AUROC of 0.976 (0.961 - 0.989) and an AUPRC of 0.779 (0.655 - 0.885).
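Interval estimates like those above are often obtained with a percentile bootstrap over the test set. The abstract does not state how its intervals were computed, so the following is only an illustrative sketch on synthetic labels and scores (all data here are invented).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic test-set labels and model scores (invented for illustration).
rng = np.random.default_rng(0)
y_true = np.array([0, 1] * 50)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=100), 0.0, 1.0)

# Percentile bootstrap: resample (label, score) pairs with replacement
# and recompute AUROC on each resample.
n = len(y_true)
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    if len(set(y_true[idx])) < 2:  # AUROC needs both classes present
        continue
    boot.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
point = roc_auc_score(y_true, y_score)
print(f"AUROC {point:.3f} ({lo:.3f} - {hi:.3f})")
```

The same loop applies unchanged to AUPRC by swapping in `average_precision_score`.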
Conclusions:
Our ML model identifies NID cases from clinical notes and was tested at two independent hospitals, which can improve efficiency in future large-scale NID research and cohort generation. Future studies incorporating patient notes from other regions are needed to establish generalizability.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.