Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jul 3, 2020
Date Accepted: May 17, 2021
Date Submitted to PubMed: May 19, 2021
Predicting quality of life and academic performance of school children in Norway: relative performance of machine learning and linear regression for modelling continuous outcomes
ABSTRACT
Background:
Machine learning (ML) approaches are increasingly being used in health research. It is not clear how useful these approaches are for modelling continuous health outcomes. Child quality of life (QoL) is associated with parental socioeconomic status and child activity levels, and may be associated with aerobic fitness and strength. It is not clear whether diet or academic performance (AP) is associated with QoL.
Objective:
To compare predictive performances of ML approaches with linear regression for modelling QoL and AP using parental education and lifestyle data.
Methods:
We modelled data from children attending nine schools in a quasi-experimental study (NCT02495714). We split data randomly into training and validation sets, and simulated curvilinear, non-linear, and heteroscedastic variables. We examined relative performance of ML approaches using R², making comparisons to mixed and fixed models, and regression with splines, with and without imputation. We also examined the effect of training set size on overfitting.
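The comparison described above (random split, fit on the training set, R² on both sets) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic rather than the study data, the 70/30 split ratio is arbitrary, and a 1-nearest-neighbour predictor stands in for a flexible learner; it is not one of the ML approaches used in the study.

```python
import random

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_simple_ols(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_1nn(x_train, y_train, x_new):
    """Hypothetical stand-in for a flexible learner: copy the outcome
    of the closest training point."""
    preds = []
    for v in x_new:
        nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - v))
        preds.append(y_train[nearest])
    return preds

# Synthetic data (assumption): a linear signal plus Gaussian noise.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(400)]
y = [2.0 + 0.5 * v + rng.gauss(0, 2.0) for v in x]

# Random split into training and validation sets (70/30 is arbitrary).
indices = list(range(len(x)))
rng.shuffle(indices)
cut = int(0.7 * len(indices))
train, valid = indices[:cut], indices[cut:]
x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
x_va, y_va = [x[i] for i in valid], [y[i] for i in valid]

# Linear regression: fit on training data, score on both sets.
slope, intercept = fit_simple_ols(x_tr, y_tr)
ols_train_r2 = r_squared(y_tr, [intercept + slope * v for v in x_tr])
ols_valid_r2 = r_squared(y_va, [intercept + slope * v for v in x_va])

# Flexible stand-in: memorises the training set, so its training R² is
# perfect while its validation R² shows how much of that was noise.
nn_train_r2 = r_squared(y_tr, predict_1nn(x_tr, y_tr, x_tr))
nn_valid_r2 = r_squared(y_va, predict_1nn(x_tr, y_tr, x_va))

print(f"OLS  train R2={ols_train_r2:.2f}  valid R2={ols_valid_r2:.2f}")
print(f"1-NN train R2={nn_train_r2:.2f}  valid R2={nn_valid_r2:.2f}")
```

A large gap between training and validation R² is the signature of overfitting that the validation set is there to expose. With real data, the same split would need to be applied to every model being compared.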
Results:
We had 1,711 cases. Using real data, our regression models explained 24% of AP variance in the complete-case validation set, and up to 15% of QoL variance. While ML models explained high proportions of variance in training sets, in validation sets these explained ~0% of AP and between 3% and 8% of QoL. Following imputation, ML models improved up to 15% for AP. ML models outperformed regression for modelling simulated non-linear and heteroscedastic variables only. A smaller training set did not lead to increased overfitting. The best predictors of QoL were 7-point self-reported activity (P<.001; β=1.09 (95% CI 0.53 to 1.66)) and TV/computer use (P=.002; β=-0.95 (-1.55 to -0.36)). For AP, these were mother having master’s-level education (P<.001; β=1.98 (0.25 to 3.71)) and dichotomised self-reported activity (P=.001; β=2.47 (1.08 to 3.87)). Adjusted academic performance was associated with QoL (P=.02; β=0.12 (0.02 to 0.22)).
Conclusions:
Exercising enough to cause sweating once per week and 2 hours per day of TV or computer use are associated with small-to-medium increases and decreases in child QoL, respectively. An increase in AP of 20 units is associated with a small increase in QoL. A mother having higher and master’s-level education, 2 hours per day of TV or computer use, and taking at least 2 hours of exercise are each associated with small-to-medium increases in AP. Differences between the effects of computer/TV use for work and for leisure need further investigation. Linear regression is less prone to overfitting and performs better than ML in predicting continuous health outcomes in a dataset containing missing data. Imputation improves ML performance but not enough to outperform regression. ML outperformed regression with non-linear and heteroscedastic data and may be of use when such relationships exist, and where imputation is sensible or there are no missing data.
Clinical Trial:
The data come from a quasi-experimental study rather than an RCT; nevertheless, the study from which the data are drawn is registered: NCT02495714
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.