Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Dec 10, 2020
Date Accepted: Jun 3, 2021
Date Submitted to PubMed: Aug 13, 2021
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Current and Next Visit Prediction for Fatty Liver Disease with a Large-Scale Dataset
ABSTRACT
Background:
Fatty liver disease (FLD) arises from the accumulation of fat in the liver and may cause liver inflammation which, according to past research, may develop into liver fibrosis, cirrhosis, or even hepatocellular carcinoma if not well controlled.
Objective:
We describe the construction of machine-learning models for current-visit prediction (CVP), which can help physicians obtain more information for accurate diagnosis, and next-visit prediction (NVP), which can help physicians provide potential high-risk patients with advice to effectively prevent or delay health deterioration.
Methods:
The large-scale, high-dimensional dataset used in this study comes from the MJ Health Research Foundation in Taipei. The models we created use sequential forward selection (SFS) and one-pass ranking (OPR) for feature selection. For CVP, we explored multiple models, including AdaBoost, support vector machine (SVM), logistic regression (LR), random forest (RF), Gaussian Naïve Bayes (GNB), C4.5 decision trees (C4.5), and classification and regression trees (CART). For NVP, we used long short-term memory (LSTM) as a sequence classifier with various input sets. Model performance is evaluated on two criteria: accuracy on the test set, and the intersection over union (IoU) and coverage between the features selected by OPR/SFS and those selected by domain experts.
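The two feature selection schemes and the overlap criteria can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: OPR is taken to score each feature individually once and keep the top k, SFS is taken to be the usual greedy subset growth, and coverage is taken to be the fraction of expert-selected features that the automatic method recovers. The feature names, the `UTILITY` values, and `toy_score` are hypothetical stand-ins for a real validation-accuracy function.

```python
from typing import Callable, List, Sequence

ScoreFn = Callable[[Sequence[str]], float]


def one_pass_ranking(features: Sequence[str], score_fn: ScoreFn, k: int) -> List[str]:
    """Assumed OPR: score each feature alone, once, and keep the k best."""
    ranked = sorted(features, key=lambda f: score_fn([f]), reverse=True)
    return ranked[:k]


def sequential_forward_selection(features: Sequence[str], score_fn: ScoreFn, k: int) -> List[str]:
    """Greedy SFS: repeatedly add the feature whose inclusion gives the highest score."""
    selected: List[str] = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def iou(selected: Sequence[str], expert: Sequence[str]) -> float:
    """Intersection over union of two feature sets."""
    a, b = set(selected), set(expert)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def coverage(selected: Sequence[str], expert: Sequence[str]) -> float:
    """Assumed definition: share of expert features that were also selected."""
    a, b = set(selected), set(expert)
    return len(a & b) / len(b) if b else 0.0


# Hypothetical per-feature utilities; "alt" and "ast" are treated as redundant.
UTILITY = {"alt": 0.30, "ast": 0.28, "bmi": 0.25, "age": 0.05}


def toy_score(subset: Sequence[str]) -> float:
    """Toy stand-in for validation accuracy: redundant features add nothing."""
    chosen = set(subset)
    total = sum(UTILITY[f] for f in chosen)
    if {"alt", "ast"} <= chosen:
        total -= min(UTILITY["alt"], UTILITY["ast"])
    return total


features = list(UTILITY)
expert_features = ["alt", "bmi", "waist"]  # hypothetical expert picks

opr_features = one_pass_ranking(features, toy_score, k=2)
sfs_features = sequential_forward_selection(features, toy_score, k=2)

print("OPR:", opr_features, iou(opr_features, expert_features), coverage(opr_features, expert_features))
print("SFS:", sfs_features, iou(sfs_features, expert_features), coverage(sfs_features, expert_features))
```

In this toy setting OPR keeps both redundant features because it never evaluates them jointly, while SFS skips the second one. That is the usual trade-off between the two: OPR needs one evaluation per feature, whereas SFS needs one evaluation per candidate at every step.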
Results:
The dataset includes 34,856 and 31,394 unique visits by male and female patients, respectively, during 2009-2016. The CVP test accuracies of AdaBoost, SVM, LR, RF, GNB, C4.5, and CART were 84.28%, 83.84%, 82.22%, 82.21%, 76.03%, 75.78%, and 75.53%, respectively. The NVP test accuracies of LSTM with fixed and variable intervals were 78.20% and 76.79%, respectively. These two LSTM paradigms achieved 39.29% and 41.21% error reduction, respectively, compared with a baseline model of simple induction.
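The abstract does not define error reduction. A common reading, assumed here, is the relative decrease in test error rate with respect to the baseline. Under that assumption, the sketch below shows the calculation and backs out the baseline accuracy implied by the reported figures. The implied baseline values are derived quantities, not numbers reported by the authors, and the function names are illustrative.

```python
def error_reduction(baseline_acc: float, model_acc: float) -> float:
    """Relative error reduction in percent; accuracies are given in percent."""
    baseline_err = 100.0 - baseline_acc
    model_err = 100.0 - model_acc
    return 100.0 * (baseline_err - model_err) / baseline_err


def implied_baseline_accuracy(model_acc: float, reduction: float) -> float:
    """Invert the formula above: baseline accuracy implied by a reported reduction."""
    model_err = 100.0 - model_acc
    baseline_err = model_err / (1.0 - reduction / 100.0)
    return 100.0 - baseline_err


# Reported NVP accuracy and error reduction for each LSTM paradigm.
reported = {
    "fixed interval": (78.20, 39.29),
    "variable interval": (76.79, 41.21),
}

for name, (acc, reduction) in reported.items():
    baseline = implied_baseline_accuracy(acc, reduction)
    print(f"{name}: implied baseline accuracy = {baseline:.2f}%")
```

The differing implied baselines would be consistent with the baseline being evaluated separately on each interval setting, although the abstract does not say so.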
Conclusions:
This study explores a large fatty liver disease (FLD) dataset with high dimensionality. We have developed prediction models that can be used for CVP and NVP of FLD. We have also implemented efficient feature selection schemes for CVP and NVP to compare the automatically selected features with expert-selected features.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.