Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jul 3, 2020
Date Accepted: May 17, 2021
Date Submitted to PubMed: May 19, 2021

The final, peer-reviewed published version of this preprint can be found here:

Relative Performance of Machine Learning and Linear Regression in Predicting Quality of Life and Academic Performance of School Children in Norway: Data Analysis of a Quasi-Experimental Study

Froud R, Hansen SH, Ruud HK, Foss J, Ferguson L, Fredriksen PM

J Med Internet Res 2021;23(7):e22021

DOI: 10.2196/22021

PMID: 34009128

PMCID: 8325075

Predicting quality of life and academic performance of school children in Norway: relative performance of machine learning and linear regression for modelling continuous outcomes

  • Robert Froud
  • Solveig Hakestad Hansen
  • Hans Kristian Ruud
  • Jonathan Foss
  • Leila Ferguson
  • Per Morten Fredriksen

ABSTRACT

Background:

Machine learning (ML) approaches are increasingly used in health research, but it is not clear how useful they are for modelling continuous health outcomes. Child quality of life (QoL) is associated with parental socioeconomic status and child activity levels, and may be associated with aerobic fitness and strength. It is not clear whether diet or academic performance (AP) is associated with QoL.

Objective:

To compare the predictive performance of ML approaches with that of linear regression for modelling QoL and AP using parental education and lifestyle data.

Methods:

We modelled data from children attending nine schools in a quasi-experimental study (NCT02495714). We split the data randomly into training and validation sets, and simulated curvilinear, non-linear, and heteroscedastic variables. We examined the relative performance of ML approaches using R², comparing them with mixed and fixed models and with regression with splines, with and without imputation. We also examined the effect of training set size on overfitting.
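The comparison described in the Methods can be illustrated with a minimal sketch. It is not the authors' code and does not use the study data: the outcomes are simulated, a 1-nearest-neighbour regressor stands in for "an ML approach", and the names `simulate`, `fit_ols`, `fit_knn`, and `compare` are hypothetical. It only shows the general pattern of a random training/validation split followed by R² computed on each set.

```python
import math
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_ols(x, y):
    """Closed-form simple linear regression with one predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda v: intercept + slope * v


def fit_knn(x, y, k=1):
    """k-nearest-neighbour regression (an assumed stand-in for an ML model)."""
    pairs = list(zip(x, y))

    def predict(v):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - v))[:k]
        return sum(p[1] for p in nearest) / k

    return predict


def simulate(n, kind, rng):
    """Simulated data only: a linear or a non-linear outcome plus unit-variance noise."""
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    if kind == "linear":
        ys = [1.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    else:
        ys = [3.0 * math.sin(2.0 * x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def compare(kind, n=1000, train_fraction=0.7, seed=1):
    """Return {model: (training R², validation R²)} after a random split."""
    rng = random.Random(seed)
    x, y = simulate(n, kind, rng)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(train_fraction * n)
    train, valid = idx[:cut], idx[cut:]
    x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
    x_va, y_va = [x[i] for i in valid], [y[i] for i in valid]
    models = {"ols": fit_ols(x_tr, y_tr), "knn": fit_knn(x_tr, y_tr, k=1)}
    return {
        name: (
            r_squared(y_tr, [m(v) for v in x_tr]),
            r_squared(y_va, [m(v) for v in x_va]),
        )
        for name, m in models.items()
    }


if __name__ == "__main__":
    for kind in ("linear", "nonlinear"):
        for name, (r2_train, r2_valid) in compare(kind).items():
            print(f"{kind:9s} {name}: train R2={r2_train:.2f}, validation R2={r2_valid:.2f}")
```

In this toy setting the flexible model fits its training set almost perfectly, so the gap between its training and validation R² is a rough indicator of overfitting. Varying `train_fraction` or `n` gives a simple way to explore how training set size affects that gap.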

Results:

We had 1,711 cases. Using real data, our regression models explained 24% of AP variance in the complete-case validation set, and up to 15% of QoL variance. While ML models explained high proportions of variance in training sets, in validation sets they explained approximately 0% of AP variance and between 3% and 8% of QoL variance. Following imputation, ML models improved up to 15% for AP. ML models outperformed regression only for modelling simulated non-linear and heteroscedastic variables. A smaller training set did not lead to increased overfitting. The best predictors of QoL were 7-point self-reported activity (P<.001; β=1.09, 95% CI 0.53 to 1.66) and TV/computer use (P=.002; β=-0.95, 95% CI -1.55 to -0.36). For AP, the best predictors were the mother having master’s-level education (P<.001; β=1.98, 95% CI 0.25 to 3.71) and dichotomised self-reported activity (P=.001; β=2.47, 95% CI 1.08 to 3.87). Adjusted academic performance was associated with QoL (P=.02; β=0.12, 95% CI 0.02 to 0.22).

Conclusions:

Exercising to the point of sweating once per week and 2 hours per day of TV or computer use are associated with small-to-medium increases and decreases in child QoL, respectively. An increase in AP of 20 units is associated with a small increase in QoL. A mother having higher and master’s-level education, 2 hours per day of TV or computer use, and taking at least 2 hours of exercise are each associated with small-to-medium increases in AP. Differences between the effects of computer/TV use for work and for leisure need further investigation. Linear regression is less prone to overfitting and performs better than ML in predicting continuous health outcomes in a dataset containing missing data. Imputation improves ML performance, but not enough to outperform regression. ML outperformed regression with non-linear and heteroscedastic data and may be of use when such relationships exist, and where imputation is sensible or there are no missing data.

Clinical Trial:

The data are from a quasi-experimental study rather than an RCT; the study is nevertheless registered: NCT02495714



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.