Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Mar 6, 2026
Date Accepted: Jun 3, 2026
Development and validation of an explainable machine learning model to assess the prevalence probability of Gastrointestinal Heat Retention Syndrome in children: a cross-sectional study
ABSTRACT
Background:
Health issues related to energy metabolism imbalance are increasingly recognized. Our team first proposed Gastrointestinal Heat Retention Syndrome (GHRS), a gastrointestinal‑specific syndrome caused by high‑calorie diets and excessive energy intake. GHRS is closely associated with multiple pediatric diseases, including recurrent respiratory tract infections, and early intervention has important preventive value. However, current GHRS research is limited to risk factor analysis, and there is no interpretable model for assessing the current prevalence probability that is suitable for lifestyle‑based use by the public.
Objective:
To develop and validate an explainable machine learning model for assessing the current prevalence probability of GHRS in children to support early identification and intervention.
Methods:
This cross‑sectional study collected valid records from 108,447 children in 442 kindergartens in Shenzhen between May and July 2021. After records with missing values were excluded, 49,798 samples were used for analysis. Associated factors were screened using univariate logistic regression, collinearity diagnosis, LASSO regression, and multivariate logistic regression. The dataset was split into training and test sets at an 8:2 ratio and balanced using SMOTETomek. Features were selected via Pearson correlation and recursive feature elimination. Seven machine learning models were constructed and evaluated for discrimination, calibration, and clinical utility. The optimal model was interpreted using SHapley Additive exPlanations (SHAP) and deployed online with Streamlit.
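The discrimination and calibration measures named above can be illustrated with a minimal Python sketch. It assumes that discrimination is summarized by the area under the ROC curve (AUC) and calibration by the Brier score, the two metrics reported in the Results. The function names and the toy labels and probabilities are hypothetical and are not taken from the study data.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC as the probability that a random positive case is scored above a
    random negative case (ties count as 0.5).

    This pairwise form is O(n_pos * n_neg) and is meant for illustration only.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = GHRS present, 0 = GHRS absent.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_probs = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

toy_auc = roc_auc(toy_labels, toy_probs)
toy_brier = brier_score(toy_labels, toy_probs)
```

For orientation, a model that predicts 0.5 for every child has a Brier score of 0.25 and an AUC of 0.5, so a lower Brier score and a higher AUC than these indicate an improvement over an uninformative predictor.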
Results:
Fifty‑nine GHRS‑associated factors were identified, including 10 protective and 49 risk factors. In the training set, XGBoost performed best (AUC 0.7417±0.0086; Brier score 0.206±0.0031), whereas random forest (RF) showed superior overall performance in the test set (AUC 0.7427; Brier score 0.2058) with higher net benefit and reliable risk stratification. The final model was built using RF. SHAP analysis identified the top influential features: quietness level, speech volume, muscle softness, and sallow complexion. A web application was developed to estimate GHRS probability, stratify risk, and provide health guidance using 75 questionnaire items, available at https://rf-gastro-heat-predictor-kncte3chxxrnyerczz3dxb.streamlit.app/.
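Net benefit, cited above as evidence of clinical utility, is commonly computed with decision curve analysis. The sketch below assumes that standard definition: the proportion of true positives minus the proportion of false positives weighted by the odds of the threshold probability. The abstract does not state the thresholds used, so the threshold, the function names, and the toy data here are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of flagging every child whose predicted probability is at
    or above `threshold` (standard decision curve analysis definition)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)  # harm-to-benefit weight
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: flag every child regardless of the model output."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data: 1 = GHRS present, 0 = GHRS absent.
nb_labels = [1, 0, 1, 1, 0, 0]
nb_probs = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

nb_model = net_benefit(nb_labels, nb_probs, threshold=0.5)
nb_all = net_benefit_treat_all(nb_labels, threshold=0.5)
```

A model is usually considered clinically useful at a given threshold when its net benefit exceeds both the flag-everyone reference and the flag-no-one reference, whose net benefit is zero.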
Conclusions:
This explainable machine learning model enables reliable assessment of current GHRS prevalence probability in children. It supports family‑based early screening and timely intervention for unhealthy dietary and lifestyle habits, helping prevent GHRS and related pediatric diseases.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.