Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jul 18, 2024
Date Accepted: Dec 9, 2024
Identification of Clusters in an Obese Population of The Maastricht Study Data Using Machine Learning: A Hypothesis-Free Approach
ABSTRACT
Background:
Modern lifestyle risk factors, such as physical inactivity and poor nutrition, contribute to rising rates of obesity and chronic diseases such as type 2 diabetes and heart disease. Personalized interventions in particular have been shown to be effective for long-term behavior change. Machine learning (ML) can be used to uncover insights without predefined hypotheses, revealing complex relationships and distinct population clusters. New data-driven approaches, such as the factor probabilistic distance clustering (FPDC) algorithm, provide opportunities to identify potentially meaningful clusters within large and complex datasets.
Objective:
This study aims to identify potential clusters and relevant variables among obese individuals using a data-driven and hypothesis-free ML approach.
Methods:
We used cross-sectional data from individuals with abdominal obesity from The Maastricht Study. Data (2971 variables) included demographics, lifestyle, biomedical aspects, advanced phenotyping, and social factors (cohort 2010). The FPDC algorithm was applied to detect clusters within this high-dimensional data. To identify a subset of distinct independent variables, we used the statistically equivalent signature (SES) algorithm. To describe the clusters, we applied measures of central tendency and variability, and we assessed the distinctiveness of the clusters on the emerged variables using the F-test for continuous variables and the chi-square test for categorical variables at a significance level of α=.001.
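As a rough illustration of the categorical comparison described above (one cluster versus the remaining clusters), the following minimal sketch runs a Pearson chi-square test on a 2x2 table using only the Python standard library. The function name `chi_square_2x2` and the example counts are hypothetical and are not taken from the study; the authors' actual software and any multi-level categorical handling are not described in this abstract.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. one cluster vs. the remaining clusters); columns are
    the two levels of a categorical variable. No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Shortcut formula for a 2x2 table, equivalent to sum((O - E)^2 / E).
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


ALPHA = 0.001  # significance level stated in the Methods

# Hypothetical counts for illustration only (not study data).
stat, p = chi_square_2x2(30, 70, 60, 40)
print(f"chi2 = {stat:.2f}, p = {p:.2e}, significant at alpha: {p < ALPHA}")
```

The closed-form p-value only holds for 1 degree of freedom; a categorical variable with more than two levels would need the general chi-square distribution, for example from a statistics library.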
Results:
We identified 3 distinct clusters (including 44.93% of all datapoints) among individuals (n=4128) with obesity. The most significant continuous variable distinguishing cluster 1 (n=1458) from clusters 2 and 3 combined (n=2670) was lower energy intake (mean 1684, SD 393 kcal/day vs mean 2358, SD 635 kcal/day; P<.001). The most significant categorical variable was occupation (P<.001): 1236 of 1458 participants (84.77%) in cluster 1 did not work, versus 1486 of 2670 participants (55.66%) in clusters 2 and 3 combined. For cluster 2 (n=1521), the most significant continuous variable was higher energy intake (mean 2755, SD 506.2 kcal/day vs mean 1749, SD 375 kcal/day; P<.001), and the most significant categorical variable was sex (P<.001): 997 of 1521 participants (65.55%) in cluster 2 were male, compared with 885 of 2607 (33.95%) in the other 2 clusters. For cluster 3 (n=1149), the most significant continuous variable was higher overall cognitive functioning (mean 0.2349, SD 0.5702 vs mean -0.3088, SD 0.7212; P<.001), and the most significant categorical variable was educational level (P<.001): a significantly higher proportion of participants in cluster 3 (475/1149, 41.34%) had received higher vocational or university education than in clusters 1 and 2 combined (729/2979, 24.47%).
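The comparison-group sizes and percentages reported above follow directly from the cluster counts. A minimal sketch of that arithmetic, using only figures stated in the Results (the helper names `rest_size` and `percent` are illustrative assumptions):

```python
# Cluster sizes as reported in the Results.
cluster_sizes = {1: 1458, 2: 1521, 3: 1149}
total = sum(cluster_sizes.values())


def rest_size(cluster):
    """Number of participants in all clusters other than `cluster`."""
    return total - cluster_sizes[cluster]


def percent(count, n):
    """Percentage rounded to 2 decimals, as reported in the abstract."""
    return round(100 * count / n, 2)


# (label, count, group size) triples taken from the reported counts.
reported = [
    ("not working, cluster 1", 1236, cluster_sizes[1]),
    ("not working, clusters 2 and 3", 1486, rest_size(1)),
    ("male, cluster 2", 997, cluster_sizes[2]),
    ("male, clusters 1 and 3", 885, rest_size(2)),
    ("higher education, cluster 3", 475, cluster_sizes[3]),
    ("higher education, clusters 1 and 2", 729, rest_size(3)),
]

for label, count, n in reported:
    print(f"{label}: {count}/{n} = {percent(count, n):.2f}%")
```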
Conclusions:
This study demonstrates that a hypothesis-free and fully data-driven approach can be used to identify distinguishable participant clusters in large and complex datasets and to find the relevant variables on which these clusters differ within obese populations.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.