Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: May 14, 2020
Date Accepted: Aug 8, 2020
Predicting Mortality in ICU Cases with Machine Learning: Incorporating Case Difficulty and Explainability Using Item Response Theory
ABSTRACT
Background:
Supervised machine learning (ML) has made its way into the healthcare literature, with results frequently reported using metrics such as accuracy, sensitivity, specificity, recall, or F1 score. While each provides a different perspective on performance, all remain aggregate measures over the whole sample, discounting the uniqueness of each case/patient. Intuitively, we know that not all cases are equal, but current evaluative approaches do not take case difficulty into account.
Objective:
A more comprehensive, case-based approach is warranted to assess supervised ML outcomes and forms the rationale for this study. We demonstrate how Item Response Theory (IRT) can be used to stratify the data based on how ‘difficult’ each case is to classify, independent of the outcome measure of interest (e.g., accuracy). This stratification allows the evaluation of ML classifiers to take the form of a distribution rather than a single scalar value.
Methods:
Two large, public intensive care unit (ICU) data sets, MIMIC III and eICU, were used to showcase this method in predicting mortality. For each data set, a balanced and an imbalanced sample were drawn. Conventional metrics for ML classification are reported for methodological comparison. Several ML algorithms were used in the demonstration: logistic regression (LR), linear discriminant analysis (LDA), K-nearest neighbors (KNN), decision tree (DT), naïve Bayes (NB), and a neural network (NN). Generalized linear mixed model analyses assessed the effects of case difficulty strata, ML algorithm, and their interaction in predicting accuracy.
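The stratification step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a Rasch-style (one-parameter IRT) approximation in which a case's difficulty is the logit of its misclassification rate across classifiers, with a small continuity correction so cases every classifier gets right (or wrong) stay finite. The correctness matrix here is hypothetical.

```python
import numpy as np

# Hypothetical correctness matrix: rows are cases, columns are classifiers
# (1 = case classified correctly by that classifier, 0 = misclassified).
correct = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])

def rasch_difficulty(correct, eps=0.5):
    """Approximate Rasch (1PL) case difficulty as the logit of the
    misclassification rate, with continuity correction eps so that
    all-correct / all-wrong cases remain finite."""
    n = correct.shape[1]
    p_wrong = (n - correct.sum(axis=1) + eps) / (n + 2 * eps)
    return np.log(p_wrong / (1 - p_wrong))

difficulty = rasch_difficulty(correct)

# Stratify cases into difficulty tertiles, then report accuracy per stratum;
# harder strata should show lower aggregate accuracy.
strata = np.digitize(difficulty, np.quantile(difficulty, [1 / 3, 2 / 3]))
for s in range(3):
    mask = strata == s
    print(f"stratum {s}: n={mask.sum()}, accuracy={correct[mask].mean():.2f}")
```

Evaluating each classifier within each stratum, rather than over the pooled sample, yields the per-difficulty performance distribution the abstract describes.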
Results:
The results illustrated that all classifiers performed better on easier-to-classify cases and that, overall, the NN performed best. Significant interactions suggest that cases falling in the most difficult strata should be handled by LR, LDA, DT, or NN, but not NB or KNN. This demonstration shows that IRT is a viable method for understanding the data that are provided to ML algorithms, independent of outcome measures, and highlights how well classifiers differentiate cases of varying difficulty.
Conclusions:
This method generates an explanation of which features are indicative of healthy states and why. It enables end users to select the classifier appropriate to the difficulty level of the patient for a personalized medicine approach. Clinical Trial: N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.