Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Apr 29, 2023
Date Accepted: Apr 30, 2024
Early Detection of Pulmonary Embolism in a General Patient Population Immediately Upon Hospital Admission Using Machine Learning Enables Identification of New, Unidentified Risk Factors
ABSTRACT
Background:
Under-identification or late identification of pulmonary embolism (PE), a potentially lethal thrombosis of one or more pulmonary arteries, is a major challenge confronting modern medicine worldwide.
Objective:
We aim to establish accurate and informative models that identify patients at high risk for PE upon hospital admission, before the first clinical checkup is made, using only information available from the patient's medical history.
Methods:
We trained a random forest (RF) to detect PE at the earliest possible time during hospitalization, that is, upon a patient's hospital admission. We obtained a 13-year data set of 46,639 patients (1,942 PE and 44,697 non-PE) admitted to all internal departments of a tertiary medical center, including patient demographics, prior diagnoses, and chronic medications. Our first suggested method to remedy data imbalance sets the decision threshold (the probability above which a patient is classified as positive for PE) at the minority-to-majority class ratio. Our second method trains as many classifiers as the inverse of this ratio, each on a balanced set of PE and randomly selected control patients, and then averages performance over the ensemble on a balanced test set. To identify significant features from different experiments, we propose a non-parametric statistical test that compares the feature importance lists obtained from the RF model over several data permutations. Further, we suggest a supervised clustering method to identify informative clusters that may relate patient demographic and clinical characteristics on hospital admission to PE risk, with the aim of improving care.
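The two imbalance remedies described above can be sketched using the class counts reported in this abstract. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the controls for each ensemble member are assumed to be drawn at random without replacement within a subset, and the RF training step itself is omitted.

```python
import random

# Class counts reported in the abstract.
N_PE, N_NON_PE = 1942, 44697


def ratio_threshold(n_minority, n_majority):
    """Decision threshold set at the minority-to-majority class ratio."""
    return n_minority / n_majority


def classify(probabilities, threshold):
    """Label a patient positive (1) when the predicted PE probability exceeds the threshold."""
    return [int(p > threshold) for p in probabilities]


def balanced_subsets(pos_ids, neg_ids, seed=0):
    """Yield one balanced index set (all PE patients + equally many random controls)
    per ensemble member; the number of members is the inverse class ratio, rounded.
    Assumption: controls are sampled without replacement within each subset."""
    rng = random.Random(seed)
    n_models = round(len(neg_ids) / len(pos_ids))
    for _ in range(n_models):
        controls = rng.sample(neg_ids, len(pos_ids))
        yield pos_ids + controls


threshold = ratio_threshold(N_PE, N_NON_PE)
pos_ids = list(range(N_PE))                    # hypothetical patient indices
neg_ids = list(range(N_PE, N_PE + N_NON_PE))
subsets = list(balanced_subsets(pos_ids, neg_ids))

print(f"threshold = {threshold:.4f}, ensemble size = {len(subsets)}")
```

With these counts, the threshold comes to roughly 0.043 rather than the conventional 0.5, and the ensemble would contain 23 classifiers, each trained on 3,884 patients.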
Results:
The models from both imbalance-handling methods predicted PE based on age, sex, body mass index, past clinical PE events, chronic lung disease, past thrombotic events, and usage of anticoagulants, achieving a geometric mean of ~80%, an informative performance measure for imbalanced data. Although only ~4% of the patients had a final diagnosis of PE, we found two 5-cluster clustering schemes, each with one or two clusters in which over 61% of the patients were positive for PE. In the first scheme, this cluster included 36% of all PE patients, who were characterized by a definitive past PE diagnosis and by a six-fold and three-fold higher prevalence of deep vein thrombosis and pneumonia, respectively, compared with patients in the other clusters. In the second scheme, two clusters (one of only males and one of only females) included patients who all had a past PE diagnosis and a relatively high prevalence of pneumonia, and a third cluster included only patients with a past diagnosis of pneumonia.
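To illustrate why the geometric mean is informative for imbalanced data, the sketch below assumes its common definition, the square root of sensitivity times specificity; the abstract itself does not spell out the formula. The labels are synthetic (4 positives in 100, mirroring the ~4% PE prevalence) and are not study data.

```python
import math


def geometric_mean(y_true, y_pred):
    """G-mean = sqrt(sensitivity * specificity), assuming binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic labels: 4 positives, 96 negatives.
y_true = [1] * 4 + [0] * 96

# A trivial classifier that never predicts PE.
always_negative = [0] * 100

# A classifier that finds 3 of 4 positives at the cost of 24 false alarms.
balanced_guess = [1, 1, 1, 0] + [1] * 24 + [0] * 72

acc_trivial = accuracy(y_true, always_negative)
gm_trivial = geometric_mean(y_true, always_negative)
gm_balanced = geometric_mean(y_true, balanced_guess)
```

In this toy case the always-negative classifier reaches 96% accuracy but a geometric mean of 0, whereas the second classifier, with 75% sensitivity and 75% specificity, scores 0.75.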
Conclusions:
Despite the highly imbalanced scenario and the use of only information available from the patient's medical history, our models were both accurate and informative in identifying patients at high risk for PE upon hospital admission, before even the first clinical checkup was made.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.