Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 8, 2025
Date Accepted: Feb 5, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Enhancing Model Generalizability in Medical AI: Systematic Comparison of Categorical Encoding and Sampling Techniques for Imbalanced Data
ABSTRACT
Background:
Despite the increasing use of machine learning (ML) in clinical research, the early stages of data preparation—especially for structured clinical data—often receive limited methodological scrutiny. These datasets typically contain missing values, complex categorical variables, and imbalanced class distributions, all of which complicate downstream model development and interpretation.
Objective:
This study introduces a structured preprocessing framework designed to address common challenges in medical tabular data and to assess how preprocessing choices affect the stability and transferability of predictive models across settings.
Methods:
We constructed a modular workflow comprising three components. First, preprocessing strategies included imputation for missing data, three types of categorical encoding (One-Hot, Frequency, Target), and resampling approaches for class imbalance (SMOTE, ROSE). Second, six classification algorithms were used to evaluate performance patterns: Logistic Regression, Decision Tree, Random Forest, XGBoost, CatBoost, and LightGBM. Third, external validation was conducted across two datasets with distinct data-generating mechanisms: an end-stage renal disease (ESRD) patient registry (n=412) and the population-based BRFSS 2015 survey.
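The SMOTE step named above can be sketched in a few lines of NumPy. This is a minimal illustration of the core idea (synthesizing minority-class samples by interpolating between a sample and one of its k nearest minority-class neighbors), not the implementation used in the study, which presumably relies on an established library:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority-class samples by interpolating
    between each base sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k_eff = min(k, n - 1)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude each sample itself
    neighbors = np.argsort(d, axis=1)[:, :k_eff]  # k_eff nearest per sample
    base = rng.integers(0, n, size=n_new)         # random base samples
    nb = neighbors[base, rng.integers(0, k_eff, size=n_new)]
    gap = rng.random((n_new, 1))                  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# toy minority class in a 2-feature space
X_min = np.random.default_rng(0).random((10, 2))
X_syn = smote_oversample(X_min, n_new=30, k=3, rng=1)
print(X_syn.shape)  # (30, 2)
```

Because each synthetic point is a convex combination of two real minority samples, the generated points stay within the convex hull of the minority class; ROSE differs in that it draws from a smoothed (kernel-density) bootstrap around observed samples rather than interpolating between neighbors.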
Results:
One-Hot Encoding in combination with ROSE yielded the most consistent performance improvements in terms of AUC (0.940) and accuracy (0.932), particularly for classifiers sensitive to class distribution. Notably, ROSE enhanced sensitivity without substantially distorting the original data structure. Feature importance rankings also contributed to model interpretability, and performance trends were largely reproducible in external validation.
Conclusions:
Our findings suggest that preprocessing decisions—often treated as ancillary—play a central role in shaping model outcomes, especially in high-variance clinical datasets. The proposed framework offers a reproducible and adaptable tool for aligning data preparation with the unique demands of healthcare prediction tasks, and may serve as a foundation for future efforts to standardize preprocessing in clinical ML workflows.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.