Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 18, 2024
Date Accepted: Dec 9, 2024

The final, peer-reviewed published version of this preprint can be found here:

Identification of Clusters in a Population With Obesity Using Machine Learning: Secondary Analysis of The Maastricht Study

Beuken MJ, Kleynen M, Braun S, Van Berkel K, van der Kallen C, Koster A, Bosma H, Berendschot TT, Houben AJ, Dukers N, Van den Bergh JP, Kroon AA, Maastricht Study Management , Kanera IM

JMIR Med Inform 2025;13:e64479

DOI: 10.2196/64479

PMID: 39908080

PMCID: 11840370

Identification of Clusters in an Obese Population of The Maastricht Study Data Using Machine Learning: A Hypothesis-Free Approach

  • Maik JM Beuken
  • Melanie Kleynen
  • Susy Braun
  • Kees Van Berkel
  • Carla van der Kallen
  • Annemarie Koster
  • Hans Bosma
  • Tos TJM Berendschot
  • Alfons JHM Houben
  • Nicole Dukers
  • Joop P Van den Bergh
  • Abraham A Kroon
  • Maastricht Study Management
  • Iris M Kanera

ABSTRACT

Background:

Modern lifestyle risk factors, such as physical inactivity and poor nutrition, contribute to rising rates of obesity and chronic diseases such as type 2 diabetes and heart disease. Personalized interventions in particular have been shown to be effective for long-term behavior change. Machine learning (ML) can be used to uncover insights without predefined hypotheses, revealing complex relationships and distinct population clusters. New data-driven approaches, such as the factor probabilistic distance clustering (FPDC) algorithm, provide opportunities to identify potentially meaningful clusters within large and complex datasets.

Objective:

This study aims to identify potential clusters and relevant variables among obese individuals using a data-driven and hypothesis-free ML approach.

Methods:

We used cross-sectional data from individuals with abdominal obesity from The Maastricht Study. Data (2971 variables) included demographics, lifestyle, biomedical aspects, advanced phenotyping, and social factors (cohort 2010). The FPDC algorithm was applied to detect clusters within this high-dimensional data. To identify a subset of distinct, independent variables, we used the statistically equivalent signature (SES) algorithm. To describe the clusters, we applied measures of central tendency and variability, and we assessed the distinctiveness of the clusters on the emerged variables using the F-test for continuous variables and the chi-square test for categorical variables at a significance level of α=.001.
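The F-test comparison of a continuous variable between groups can be illustrated with a minimal sketch. It assumes the F-test is a one-way analysis of variance on group means, which the abstract does not specify, and it works from per-group summaries (n, mean, SD) rather than raw data. The function name `f_statistic` and the example numbers are hypothetical and are not taken from the study.

```python
from math import fsum


def f_statistic(groups):
    """One-way ANOVA F statistic from (n, mean, sd) summaries, one tuple per group.

    Assumption: sd is the sample standard deviation (n - 1 denominator).
    Returns (F, df_between, df_within).
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = fsum(n * mean for n, mean, _ in groups) / total_n

    # Variation of the group means around the grand mean.
    ss_between = fsum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Variation of observations around their own group mean, rebuilt from the SDs.
    ss_within = fsum((n - 1) * sd ** 2 for n, _, sd in groups)

    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical summaries for one cluster versus the remaining participants.
example_groups = [(3, 1.0, 1.0), (3, 3.0, 1.0)]
f_value, df1, df2 = f_statistic(example_groups)
print(f"F({df1}, {df2}) = {f_value:.2f}")
```

With two groups, this F statistic equals the square of the pooled-variance t statistic, so a one-vs-rest comparison per cluster gives the same P value under either test.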

Results:

We identified 3 distinct clusters (including 44.93% of all data points) among individuals (n=4128) with obesity. The most significant continuous variable distinguishing cluster 1 (n=1458) from clusters 2 and 3 combined (n=2670) was lower energy intake (mean 1684, SD 393 kcal/day vs mean 2358, SD 635 kcal/day; P<.001). The most significant categorical variable (P<.001) was occupation: 1236 of the 1458 participants (84.77%) in cluster 1 did not work versus 1486 of the 2670 participants (55.66%) in clusters 2 and 3 combined. For cluster 2 (n=1521), the most significant continuous variable was higher energy intake (mean 2755, SD 506.2 kcal/day vs mean 1749, SD 375 kcal/day; P<.001). The most significant categorical variable (P<.001) was sex: 997 of the 1521 participants (65.55%) in cluster 2 were male versus 885 of the 2607 participants (33.95%) in the other 2 clusters. For cluster 3 (n=1149), the most significant continuous variable was higher overall cognitive functioning (mean 0.2349, SD 0.5702 vs mean -0.3088, SD 0.7212; P<.001), and educational level was the most significant categorical variable (P<.001). A significantly higher proportion of participants in cluster 3 (475/1149, 41.34%) had received higher vocational or university education than in clusters 1 and 2, where 729 of 2979 participants (24.47%) had received this level of education.
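The categorical comparisons above can be checked from the reported counts alone. The sketch below computes a Pearson chi-square statistic for a 2x2 table, using the occupation counts for cluster 1 versus clusters 2 and 3 combined. It assumes occupation is collapsed into two categories (not working vs working) and that no continuity correction is applied. Neither detail is stated in the abstract, and the function name `chi_square_2x2` is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Expected counts under independence: row total * column total / n.
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom the upper-tail probability has a closed form.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Counts reported in the Results (assumed split: not working vs working).
not_working_c1, n_c1 = 1236, 1458        # cluster 1
not_working_rest, n_rest = 1486, 2670    # clusters 2 and 3 combined

chi2, p = chi_square_2x2(
    not_working_c1, n_c1 - not_working_c1,
    not_working_rest, n_rest - not_working_rest,
)
print(f"chi2 = {chi2:.1f}, below alpha=.001: {p < 0.001}")
```

The same function applies to the sex and education comparisons by substituting their reported counts.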

Conclusions:

This study demonstrates that a hypothesis-free and fully data-driven approach can be used to identify distinguishable participant clusters in large and complex datasets and to find the relevant variables on which these clusters differ within obese populations.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.