Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Dec 10, 2020
Date Accepted: Jun 3, 2021
Date Submitted to PubMed: Aug 13, 2021
Current and Next Visit Prediction for Fatty Liver Disease with a Large-Scale Dataset: Model Development and Performance Comparison
ABSTRACT
Background:
Fatty liver disease (FLD) arises from the accumulation of fat in the liver and may cause liver inflammation. Past research indicates that, if not well controlled, this inflammation may progress to liver fibrosis, cirrhosis, or even hepatocellular carcinoma.
Objective:
We describe the construction of machine-learning models for current-visit prediction (CVP), which can help physicians obtain more information for accurate diagnosis, and next-visit prediction (NVP), which can help physicians provide potential high-risk patients with advice to effectively prevent FLD.
Methods:
The large-scale, high-dimensional dataset used in this study comes from the Taipei MJ Health Research Foundation in Taiwan. We used one-pass ranking (OPR) and sequential forward selection (SFS) for feature selection in FLD prediction. For CVP, we explored multiple models, including k-nearest-neighbor classifier (KNNC), AdaBoost, support vector machine (SVM), logistic regression (LR), random forest (RF), Gaussian Naïve Bayes (GNB), decision trees C4.5 (C4.5), and classification and regression trees (CART). For NVP, we used long short-term memory (LSTM) and a number of its variants as sequence classifiers that take various input sets for prediction. Model performance was evaluated based on two criteria: the accuracy on the test set, and the intersection over union (IoU) and coverage between the features selected by OPR/SFS and those selected by domain experts. The accuracy, precision, recall, F-measure, and area under the receiver operating characteristic curve (AUROC) were calculated for both CVP and NVP, separately for males and females.
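As an illustration of the second criterion, the overlap between an automatically selected feature set and an expert-selected one can be sketched as below. This is a minimal sketch, not the authors' implementation: the feature names are hypothetical, and "coverage" is assumed here to mean the fraction of expert-selected features that the automatic method also selected, since the abstract does not define it.

```python
def iou(selected, expert):
    """Intersection over union (Jaccard index) of two feature-name collections."""
    a, b = set(selected), set(expert)
    union = a | b
    # Two empty sets have no overlap to measure; return 0.0 by convention.
    return len(a & b) / len(union) if union else 0.0


def coverage(selected, expert):
    """Assumed definition: share of expert-selected features also chosen automatically."""
    a, b = set(selected), set(expert)
    return len(a & b) / len(b) if b else 0.0


# Hypothetical feature names, for illustration only.
auto_selected = ["bmi", "waist", "triglycerides", "alt"]
expert_selected = ["bmi", "alt", "ast", "triglycerides", "hdl"]

print(iou(auto_selected, expert_selected))       # 3 shared / 6 distinct = 0.5
print(coverage(auto_selected, expert_selected))  # 3 shared / 5 expert = 0.6
```

IoU penalizes extra automatically selected features as well as missed expert features, whereas coverage under this assumed definition only reflects how many expert features were recovered.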
Results:
After data cleaning, the dataset includes 34,856 and 31,394 unique visits for males and females, respectively, from 2009 to 2016. The test accuracies of CVP using KNNC, AdaBoost, SVM, LR, RF, GNB, C4.5, and CART were 84.28%, 83.84%, 82.22%, 82.21%, 76.03%, 75.78%, 75.53%, respectively. The test accuracies of NVP using LSTM, biLSTM, Stack-LSTM, Stack-biLSTM, and Attention-LSTM were {78.14%, 78.03%, 78.31%, 78.05%, 77.69%} and {77.32%, 75.53%, 76.04%, 77.48%, 78.57%}, respectively, for fixed- and variable-interval features.
Conclusions:
This study explores a large-scale fatty liver disease (FLD) dataset with high dimensionality. We have developed FLD prediction models for CVP and NVP. We have also implemented efficient feature selection schemes for CVP and NVP to compare the automatically selected features with expert-selected ones. In particular, NVP is more valuable from the viewpoint of preventive medicine. For NVP, we have proposed the use of feature set 2 (with variable intervals), which is more compact and flexible. We have also tried several variants of LSTM in combination with two feature sets to identify the best match for male and female FLD prediction. More specifically, the best model for males is Attention-LSTM using feature set 1 (with 78.57% accuracy), while the best model for females is Stack-biLSTM using feature set 2 (with 82.47% accuracy).
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.