Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 8, 2025
Date Accepted: Feb 5, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Enhancing Model Generalizability in Medical AI: Systematic Comparison of Categorical Encoding and Sampling Techniques for Imbalanced Data
ABSTRACT
Background:
Despite the increasing use of machine learning (ML) in clinical research, the early stages of data preparation—especially for structured clinical data—often receive limited methodological scrutiny. These datasets typically contain missing values, complex categorical variables, and imbalanced class distributions, all of which complicate downstream model development and interpretation.
Objective:
This study introduces a structured preprocessing framework designed to address common challenges in medical tabular data and to assess how preprocessing choices affect the stability and transferability of predictive models across settings.
Methods:
We constructed a modular workflow comprising three components. First, preprocessing strategies included imputation for missing data, three types of categorical encoding (One-Hot, Frequency, Target), and resampling approaches for class imbalance (SMOTE, ROSE). Second, six classification algorithms were used to evaluate performance patterns: Logistic Regression, Decision Tree, Random Forest, XGBoost, CatBoost, and LightGBM. Third, external validation was conducted across two datasets with distinct data-generating mechanisms: an end-stage renal disease (ESRD) patient registry (n=412) and the population-based BRFSS 2015 survey.
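The SMOTE step named above can be sketched in a few lines of NumPy. This is a minimal illustration of the core idea (synthesizing minority-class samples by interpolating between a sample and one of its k nearest minority-class neighbors), not the implementation used in the study, which presumably relies on an established library:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority-class samples by interpolating
    between each base sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k_eff = min(k, n - 1)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude each sample itself
    neighbors = np.argsort(d, axis=1)[:, :k_eff]  # k_eff nearest per sample
    base = rng.integers(0, n, size=n_new)         # random base samples
    nb = neighbors[base, rng.integers(0, k_eff, size=n_new)]
    gap = rng.random((n_new, 1))                  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# toy minority class in a 2-feature space
X_min = np.random.default_rng(0).random((10, 2))
X_syn = smote_oversample(X_min, n_new=30, k=3, rng=1)
print(X_syn.shape)  # (30, 2)
```

Because each synthetic point is a convex combination of two real minority samples, the generated points stay within the convex hull of the minority class; ROSE differs in that it draws from a smoothed (kernel-density) bootstrap around observed samples rather than interpolating between neighbors.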
Results:
One-Hot Encoding in combination with ROSE yielded the most consistent performance improvements in terms of AUC (0.940) and accuracy (0.932), particularly for classifiers sensitive to class distribution. Notably, ROSE enhanced sensitivity without substantially distorting the original data structure. Feature importance rankings also contributed to model interpretability, and performance trends were largely reproducible in external validation.
Conclusions:
Our findings suggest that preprocessing decisions—often treated as ancillary—play a central role in shaping model outcomes, especially in high-variance clinical datasets. The proposed framework offers a reproducible and adaptable tool for aligning data preparation with the unique demands of healthcare prediction tasks, and may serve as a foundation for future efforts to standardize preprocessing in clinical ML workflows.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.