Accepted for/Published in: JMIR AI
Date Submitted: Mar 16, 2024
Open Peer Review Period: Apr 18, 2024 - Jun 13, 2024
Date Accepted: Aug 24, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Machine-learning based prediction for high health care utilizers using a multi-institution diabetes registry: model training and evaluation.
ABSTRACT
Background:
The cost of healthcare in many countries is increasing rapidly. There is a growing interest in using machine learning to predict high healthcare utilizers for population health initiatives. Previous studies have focused on individuals who contribute to the highest financial burden. However, this group is small and represents a limited opportunity for long-term cost reduction.
Objective:
We developed an ensemble of models that predict future healthcare utilization at various thresholds.
Methods:
We utilized data from a multi-institutional diabetes database from the year 2019 to develop binary classification models. These models predict healthcare utilization in the subsequent year across six different outcomes: patients having a length of stay of ≥7, ≥14, and ≥30 days, and emergency department (ED) attendance of ≥3, ≥5, and ≥10 visits. To address class imbalance, random and synthetic minority oversampling techniques were employed. The models were then applied to unseen data from 2020 and 2021 to predict healthcare utilization in the following year. A portfolio of performance metrics, with a priority on area under the receiver operating characteristic curve (AUC), sensitivity, and positive predictive value, was used for comparison.
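As an illustration only, the labeling, random oversampling, and two of the prioritized metrics described above might be sketched as follows. This is a minimal standard-library sketch, not the authors' pipeline: the function names, the dictionary keys, and the choice to oversample until the classes are exactly balanced are assumptions, and the abstract does not state the balancing ratio or the software used. Synthetic minority oversampling and AUC are not covered here.

```python
import random

# Thresholds taken from the Methods text; the key names are hypothetical.
LOS_THRESHOLDS = (7, 14, 30)   # length of stay, in days
ED_THRESHOLDS = (3, 5, 10)     # emergency department visits


def label_outcomes(los_days, ed_visits):
    """Return the six binary outcomes for one patient's subsequent-year usage."""
    labels = {f"los_ge_{t}": int(los_days >= t) for t in LOS_THRESHOLDS}
    labels.update({f"ed_ge_{t}": int(ed_visits >= t) for t in ED_THRESHOLDS})
    return labels


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced.

    Assumption: a 1:1 target ratio. Apply to training data only, so that
    evaluation data keep their natural class distribution.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def sensitivity_and_ppv(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); positive predictive value = TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sens, ppv
```

For example, a patient with 14 inpatient days and 4 ED visits in the following year would be positive for the ≥7-day, ≥14-day, and ≥3-visit outcomes and negative for the other three.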
Results:
When trained with random oversampling, four models – logistic regression, multivariate adaptive regression splines, boosted trees, and multilayer perceptron – consistently achieved high AUC (>0.80) and sensitivity (>0.60) across training-validation and test datasets. Correcting for class imbalance proved critical for model performance. Key predictors for all outcomes included age, number of ED visits in the present year, chronic kidney disease stage, inpatient bed days in the present year, and mean HbA1c levels.
Conclusions:
We successfully developed machine learning models capable of predicting high service level utilization with robust performance. These models can be integrated into wider diabetes-related population health initiatives.
Clinical Trial: Not Applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.