Accepted for/Published in: JMIR Formative Research
Date Submitted: Sep 20, 2022
Date Accepted: Feb 7, 2023
Predicting Measles Outbreaks in the United States: Evaluation of Machine Learning Approaches
ABSTRACT
Background:
Measles is resurging in the US, driven by international importation and declining domestic vaccination coverage. Improved methods to predict outbreaks at the county level would facilitate the optimal allocation of public health resources.
Objective:
We aimed to develop and compare supervised, unsupervised, and hybrid machine learning models to identify US counties at risk of measles outbreaks.
Methods:
We constructed a supervised machine learning model based on eXtreme Gradient Boosting (XGBoost) and unsupervised models based on Hierarchical Density-based Spatial Clustering of Applications with Noise (HDBSCAN) and unsupervised Random Forest (uRF). The unsupervised models were used to investigate clustering patterns among counties with measles outbreaks; these clustering data were also incorporated into hybrid XGBoost models as additional input variables. The machine learning models were then compared to weighted logistic regression models with and without input from the unsupervised models.
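As a rough illustration of the hybrid step, in which cluster assignments become additional input variables, the sketch below appends one-hot cluster membership columns to each county's feature row. The abstract does not say how the clustering output was encoded, so the one-hot encoding, the function name `add_cluster_features`, and the example values are all assumptions made for illustration; the only outside convention relied on is that HDBSCAN implementations commonly label noise points as -1.

```python
def add_cluster_features(features, cluster_labels):
    """Append one-hot cluster membership columns to each feature row.

    features: list of numeric rows, one per county (hypothetical inputs).
    cluster_labels: one integer label per county; -1 marks noise points,
    the convention commonly used by HDBSCAN implementations.
    Returns the augmented rows and the label order used for the new columns.
    """
    if len(features) != len(cluster_labels):
        raise ValueError("features and cluster_labels must have the same length")

    # Sort the labels so the new columns have a fixed, reproducible order.
    categories = sorted(set(cluster_labels))

    augmented = []
    for row, label in zip(features, cluster_labels):
        one_hot = [1.0 if label == c else 0.0 for c in categories]
        augmented.append(list(row) + one_hot)
    return augmented, categories


# Hypothetical county rows: [vaccination coverage, air-travel volume].
rows = [[0.91, 120.0], [0.78, 15.0], [0.85, 40.0]]
labels = [0, -1, 0]  # two counties in cluster 0, one labelled as noise
hybrid_rows, label_order = add_cluster_features(rows, labels)
```

The augmented rows could then be passed to any supervised learner in place of the original features; whether the study used this encoding or another (for example, cluster membership probabilities) is not stated in the abstract.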
Results:
Both HDBSCAN and uRF identified clusters that included a high percentage of counties with measles outbreaks. XGBoost and its hybrid models outperformed weighted logistic regression and its hybrid models, with area under the receiver operating characteristic curve (AUROC) values of 0.920–0.926 versus 0.900–0.908, area under the precision–recall curve (AUPRC) values of 0.522–0.532 versus 0.485–0.513, and F2 scores of 0.595–0.601 versus 0.385–0.426. Weighted logistic regression and its hybrid models had higher sensitivity than XGBoost and its hybrid models (0.837–0.857 versus 0.704–0.735) but lower positive predictive value (0.122–0.141 versus 0.340–0.367) and specificity (0.793–0.821 versus 0.952–0.958). The hybrid versions of the weighted logistic regression and XGBoost models had slightly higher AUPRC, specificity, and positive predictive values than the respective models that did not include any unsupervised features.
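For readers less familiar with these metrics, the sketch below shows how sensitivity, specificity, positive predictive value (PPV), and the F2 score follow from confusion-matrix counts. The function names and the counts are hypothetical. The F2 score is the F-beta score with beta = 2, which weights sensitivity (recall) more heavily than PPV (precision).

```python
def confusion_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and PPV from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "ppv": ppv}


def f_beta(precision, recall, beta=2.0):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta=2 gives F2."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
m = confusion_metrics(tp=3, fp=3, tn=13, fn=1)
f2 = f_beta(m["ppv"], m["sensitivity"])  # 1.875 / 2.75, roughly 0.682

# Plugging in the lower ends of the reported ranges for weighted logistic
# regression (PPV 0.122, sensitivity 0.837). Pairing the range endpoints
# this way is an assumption; the abstract does not match values to models.
f2_reported = round(f_beta(0.122, 0.837), 3)  # 0.385
```

Under that pairing assumption, the result agrees with the lower end of the reported F2 range for weighted logistic regression, which shows how a low PPV pulls F2 down even when sensitivity is high.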
Conclusions:
XGBoost provided more accurate predictions of measles cases at the county level than weighted logistic regression. The prediction threshold in this model can be adjusted to align with each county’s resources, priorities, and measles risk. While clustering pattern data from unsupervised machine learning approaches improved some aspects of model performance in this imbalanced data set, the optimal way to integrate such data with supervised machine learning models requires further investigation.
Clinical Trial: N/A
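The adjustable prediction threshold mentioned in the Conclusions can be illustrated with a minimal sketch: lowering the threshold flags more counties (higher sensitivity, lower PPV), while raising it flags fewer. The function names, labels, and predicted probabilities below are hypothetical and are not taken from the study.

```python
def classify(probabilities, threshold):
    """Flag a county as at risk when its predicted probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def sensitivity_and_ppv(y_true, y_pred):
    """Sensitivity and positive predictive value for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, ppv


def sweep_thresholds(y_true, probabilities, thresholds):
    """Return (threshold, sensitivity, ppv) for each candidate threshold."""
    results = []
    for t in thresholds:
        sens, ppv = sensitivity_and_ppv(y_true, classify(probabilities, t))
        results.append((t, sens, ppv))
    return results


# Hypothetical outbreak labels and model scores for eight counties.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
sweep = sweep_thresholds(y_true, scores, [0.25, 0.5, 0.8])
for threshold, sens, ppv in sweep:
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} ppv={ppv:.2f}")
```

In this toy data, the lowest threshold catches every outbreak county at the cost of more false alarms, and the highest threshold produces no false alarms but misses two of the three outbreak counties. A county with limited resources might prefer the latter trade-off, and a high-risk county the former.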
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.