JMIR Preprints #98989: Normalization Is a Model-Level Design Choice in Outpatient Type 2 Diabetes AI: A Leakage-Safe Comparative Study on Public Datasets

Normalization Is a Model-Level Design Choice in Outpatient Type 2 Diabetes AI: A Leakage-Safe Comparative Study on Public Datasets

Igor Korsakov

ABSTRACT

Background:

Feature normalization is frequently underreported in clinical machine learning studies, despite its strong influence on model behavior, calibration, and interpretability. In outpatient type 2 diabetes (T2D) decision-support settings, unclear preprocessing choices can reduce reproducibility and weaken translational reliability.

Objective:

This study aimed to evaluate normalization as an explicit model-selection factor in outpatient T2D prediction workflows and to quantify how different normalization strategies affect model performance across classifier families under a leakage-safe evaluation design.

Methods:

We conducted a comparative benchmarking study on two public diabetes datasets from Hugging Face (Dataset A: GB2024/diabetes; Dataset B: khoaguin/pima-indians-diabetes-database-partitions). To ensure tractable and reproducible benchmarking across all experiments, datasets larger than 20,000 rows were capped using stratified random sampling (random_state=42). We compared 4 classifier families (Logistic Regression, SVC-RBF, KNN, Random Forest) across 6 normalization strategies (none, standard, min-max, robust, quantile-normal, Yeo-Johnson). Preprocessing (imputation, encoding, normalization) was fit on training folds only. Evaluation used stratified 5-fold cross-validation and held-out testing, with macro-F1 as the primary metric and AUC/accuracy as secondary metrics. A staged proxy-leakage sensitivity analysis was performed.
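
The "fit on training folds only" rule can be illustrated with a minimal sketch. It assumes standard (z-score) normalization and a single numeric feature, and uses only the Python standard library. The function names and the synthetic values are hypothetical and are not taken from the study's code or data; the study's full pipeline (imputation, encoding, six normalizers, four classifiers) is not reproduced here.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign row indices to n_splits folds while preserving class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def fit_standard_scaler(values):
    """Learn the mean and standard deviation from the given values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against constant features
    return mean, std


def transform(values, mean, std):
    """Apply previously learned scaling parameters."""
    return [(v - mean) / std for v in values]


def leakage_safe_splits(feature, labels, n_splits=5, seed=42):
    """Yield (train_scaled, val_scaled, train_idx, val_idx) for each fold.

    The scaler is fit on the training rows of the fold only, then applied
    unchanged to the validation rows, so no validation statistics leak in.
    """
    folds = stratified_folds(labels, n_splits, seed)
    for k, val_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        mean, std = fit_standard_scaler([feature[i] for i in train_idx])
        train_scaled = transform([feature[i] for i in train_idx], mean, std)
        val_scaled = transform([feature[i] for i in val_idx], mean, std)
        yield train_scaled, val_scaled, train_idx, val_idx


# Synthetic toy data (an assumption for illustration, not study data):
# 30 negative and 20 positive cases with one glucose-like feature.
toy_rng = random.Random(0)
labels = [0] * 30 + [1] * 20
feature = [toy_rng.gauss(100, 15) for _ in range(30)] + [
    toy_rng.gauss(150, 20) for _ in range(20)
]

results = list(leakage_safe_splits(feature, labels))
```

In each fold the scaled training values have mean 0 and unit variance by construction, whereas the validation values generally do not, which is the expected signature of a scaler that never saw the validation rows. Fitting the scaler on the full dataset before splitting would remove that difference and is the form of leakage the study design is meant to avoid.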

Results:

Normalization effects were model dependent. KNN and SVC-RBF showed larger performance sensitivity to normalization choice, while Random Forest was comparatively stable. In Dataset B, best test macro-F1 values approached 0.9916, but sensitivity analyses showed that near-ceiling performance can be partially inflated by leakage-adjacent proxy features. Across datasets, reporting only best final metrics masked important normalization-dependent performance spread.

Conclusions:

In outpatient T2D clinical AI, normalization should be treated as a high-impact methodological decision rather than a default preprocessing step. A transparent two-layer preprocessing strategy (clinically meaningful feature encoding plus statistical normalization), leakage-safe validation, and proxy-leakage sensitivity checks can improve reproducibility and support safer translation into treatment-support workflows.

Clinical Trial:

Not applicable.

Citation

Please cite as:

Korsakov I

Normalization Is a Model-Level Design Choice in Outpatient Type 2 Diabetes AI: A Leakage-Safe Comparative Study on Public Datasets

JMIR Preprints. 21/04/2026:98989

DOI: 10.2196/preprints.98989

URL: https://preprints.jmir.org/preprint/98989

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Previously submitted to: JMIR AI (no longer under consideration since Apr 27, 2026)

Date Submitted: Apr 21, 2026
