Previously submitted to: JMIR AI (no longer under consideration since Apr 27, 2026)
Date Submitted: Apr 21, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Normalization Is a Model-Level Design Choice in Outpatient Type 2 Diabetes AI: A Leakage-Safe Comparative Study on Public Datasets
ABSTRACT
Background:
Feature normalization is frequently underreported in clinical machine learning studies, despite its strong influence on model behavior, calibration, and interpretability. In outpatient type 2 diabetes (T2D) decision-support settings, unclear preprocessing choices can reduce reproducibility and weaken translational reliability.
Objective:
This study aimed to evaluate normalization as an explicit model-selection factor in outpatient T2D prediction workflows and to quantify how different normalization strategies affect model performance across classifier families under a leakage-safe evaluation design.
Methods:
We conducted a comparative benchmarking study on two public diabetes datasets from Hugging Face (Dataset A: GB2024/diabetes; Dataset B: khoaguin/pima-indians-diabetes-database-partitions). To ensure tractable and reproducible benchmarking across all experiments, datasets larger than 20,000 rows were capped using stratified random sampling (random_state=42). We compared 4 classifier families (Logistic Regression, SVC-RBF, KNN, Random Forest) across 6 normalization strategies (none, standard, min-max, robust, quantile-normal, Yeo-Johnson). Preprocessing (imputation, encoding, normalization) was fit on training folds only. Evaluation used stratified 5-fold cross-validation and held-out testing, with macro-F1 as the primary metric and AUC/accuracy as secondary metrics. A staged proxy-leakage sensitivity analysis was performed.
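The leakage-safe design described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn (the abstract does not name the library used), substitutes a synthetic dataset for the two public datasets, and uses default hyperparameters and median imputation as placeholders. The key point is that imputation and normalization sit inside a `Pipeline`, so cross-validation refits them on each training fold only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    MinMaxScaler,
    PowerTransformer,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): NOT the datasets used in the study.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=42
)
# Put a few features on different scales so normalization has something to do.
X = X * np.array([1.0, 10.0, 100.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# The six normalization strategies named in the Methods.
NORMALIZERS = {
    "none": "passthrough",
    "standard": StandardScaler(),
    "min-max": MinMaxScaler(),
    "robust": RobustScaler(),
    "quantile-normal": QuantileTransformer(
        output_distribution="normal", n_quantiles=100, random_state=42
    ),
    "yeo-johnson": PowerTransformer(method="yeo-johnson"),
}

# The four classifier families; hyperparameters here are illustrative defaults.
MODELS = {
    "logreg": LogisticRegression(max_iter=2000),
    "svc-rbf": SVC(kernel="rbf"),
    "knn": KNeighborsClassifier(),
    "rf": RandomForestClassifier(n_estimators=50, random_state=42),
}


def benchmark(X, y):
    """Return mean cross-validated macro-F1 per (model, normalizer) pair."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    results = {}
    for model_name, model in MODELS.items():
        for norm_name, normalizer in NORMALIZERS.items():
            # Every preprocessing step lives inside the pipeline, so it is
            # fit on the training portion of each fold and never sees the
            # validation portion.
            pipe = Pipeline(
                [
                    ("impute", SimpleImputer(strategy="median")),
                    ("normalize", normalizer),
                    ("model", model),
                ]
            )
            scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_macro")
            results[(model_name, norm_name)] = float(scores.mean())
    return results


results = benchmark(X, y)
for (model_name, norm_name), f1 in sorted(results.items()):
    print(f"{model_name:8s} {norm_name:16s} macro-F1={f1:.3f}")
```

Fitting a scaler on the full dataset before splitting would let validation-fold statistics influence the transform; wrapping the steps in a pipeline avoids that by construction.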
Results:
Normalization effects were model dependent. KNN and SVC-RBF showed larger performance sensitivity to normalization choice, while Random Forest was comparatively stable. In Dataset B, best test macro-F1 values approached 0.9916, but sensitivity analyses showed that near-ceiling performance can be partially inflated by leakage-adjacent proxy features. Across datasets, reporting only best final metrics masked important normalization-dependent performance spread.
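One way to report normalization-dependent spread alongside the best score is sketched below. The macro-F1 values are made-up placeholders for illustration only, not results from the study, and the function name `normalization_spread` is hypothetical.

```python
# Placeholder macro-F1 values (made up for illustration; NOT study results).
scores = {
    ("knn", "none"): 0.70,
    ("knn", "standard"): 0.80,
    ("knn", "robust"): 0.78,
    ("rf", "none"): 0.82,
    ("rf", "standard"): 0.82,
    ("rf", "robust"): 0.81,
}


def normalization_spread(scores):
    """Summarize, per model, the best and worst normalizer and their gap."""
    by_model = {}
    for (model, normalizer), f1 in scores.items():
        by_model.setdefault(model, {})[normalizer] = f1

    summary = {}
    for model, per_norm in by_model.items():
        best = max(per_norm, key=per_norm.get)
        worst = min(per_norm, key=per_norm.get)
        summary[model] = {
            "best": best,
            "worst": worst,
            "spread": round(per_norm[best] - per_norm[worst], 4),
        }
    return summary


summary = normalization_spread(scores)
for model, row in summary.items():
    print(model, row)
```

With these placeholder numbers the KNN spread is 0.1 and the Random Forest spread is 0.01, which is the kind of contrast a best-score-only table would hide.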
Conclusions:
In outpatient T2D clinical AI, normalization should be treated as a high-impact methodological decision rather than a default preprocessing step. A transparent two-layer preprocessing strategy (clinically meaningful feature encoding plus statistical normalization), leakage-safe validation, and proxy-leakage sensitivity checks can improve reproducibility and support safer translation into treatment-support workflows.
Clinical Trial: Not applicable.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.