Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 8, 2025
Date Accepted: Feb 5, 2026

The final, peer-reviewed published version of this preprint can be found here:

Enhancing Model Generalizability in Medical Artificial Intelligence: Systematic Comparison of Categorical Encoding and Sampling Techniques for Imbalanced Data

Chuang Cw, Wu CK, Wu CH, Shia BC, Chen M

JMIR Med Inform 2026;14:e75655

DOI: 10.2196/75655

PMID: 41973872

PMCID: 13075634

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Enhancing Model Generalizability in Medical AI: Systematic Comparison of Categorical Encoding and Sampling Techniques for Imbalanced Data

  • Chien-wei Chuang
  • Chung-Kuan Wu
  • Chao-Hsin Wu
  • Ben-Chang Shia
  • Mingchih Chen

ABSTRACT

Background:

Despite the increasing use of machine learning (ML) in clinical research, the early stages of data preparation—especially for structured clinical data—often receive limited methodological scrutiny. These datasets typically contain missing values, complex categorical variables, and imbalanced class distributions, all of which complicate downstream model development and interpretation.

Objective:

This study introduces a structured preprocessing framework designed to address common challenges in medical tabular data and to assess how preprocessing choices affect the stability and transferability of predictive models across settings.

Methods:

We constructed a modular workflow comprising three components. First, preprocessing strategies included imputation for missing data, three types of categorical encoding (One-Hot, Frequency, Target), and two resampling approaches for class imbalance: the Synthetic Minority Oversampling Technique (SMOTE) and Random Over-Sampling Examples (ROSE). Second, six classification algorithms were used to evaluate performance patterns: Logistic Regression, Decision Tree, Random Forest, XGBoost, CatBoost, and LightGBM. Third, external validation was conducted across two datasets with distinct data-generating mechanisms: an end-stage renal disease (ESRD) patient registry (n=412) and the population-based Behavioral Risk Factor Surveillance System (BRFSS) 2015 survey.
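To make the three encoding strategies concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the column `access`, its category values, the labels, and the `smoothing` parameter are hypothetical and chosen only for illustration. Frequency and target encodings are fitted on training data only, which is a common way to limit label leakage into validation data.

```python
from collections import Counter, defaultdict

def one_hot(values):
    """Return 0/1 indicator rows plus the (sorted) category order used for the columns."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

def frequency_encode(train_values, values):
    """Replace each category with its relative frequency in the training data."""
    counts = Counter(train_values)
    n = len(train_values)
    return [counts.get(v, 0) / n for v in values]

def target_encode(train_values, train_labels, values, smoothing=1.0):
    """Replace each category with a smoothed mean of the binary training label."""
    prior = sum(train_labels) / len(train_labels)
    sums, counts = defaultdict(float), Counter()
    for v, y in zip(train_values, train_labels):
        sums[v] += y
        counts[v] += 1
    # Shrink each category mean toward the global prior;
    # a category unseen in training falls back to the prior itself.
    return [(sums[v] + smoothing * prior) / (counts[v] + smoothing) for v in values]

# Hypothetical toy column and binary outcome (not taken from the study data).
access = ["fistula", "catheter", "fistula", "graft", "fistula", "catheter"]
labels = [0, 1, 0, 1, 0, 1]

rows, columns = one_hot(access)
freq = frequency_encode(access, access)
target = target_encode(access, labels, access + ["unknown"])
```

One-hot encoding adds one column per category, whereas frequency and target encoding keep a single numeric column, which matters for high-cardinality variables.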

Results:

One-Hot Encoding in combination with ROSE yielded the most consistent performance improvements in terms of AUC (0.940) and accuracy (0.932), particularly for classifiers sensitive to class distribution. Notably, ROSE enhanced sensitivity without substantially distorting the original data structure. Feature importance rankings also contributed to model interpretability, and performance trends were largely reproducible in external validation.
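The difference between the two resampling ideas can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the SMOTE or ROSE reference implementations and not the authors' code: the function names, the toy minority points, and the fixed `bandwidth` multiplier are hypothetical, and the ROSE R package chooses its kernel bandwidth by its own rule and resamples both classes. The SMOTE-style function interpolates between neighboring minority points, while the ROSE-style function draws a smoothed bootstrap by jittering resampled points with Gaussian noise.

```python
import math
import random

def smote_like(minority, n_new, k=3, rng=None):
    """SMOTE-style: interpolate between a minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic

def rose_like(minority, n_new, bandwidth=0.5, rng=None):
    """ROSE-style smoothed bootstrap: resample a point, then add Gaussian noise per feature."""
    rng = rng or random.Random(0)
    sds = []
    for j in range(len(minority[0])):
        col = [p[j] for p in minority]
        mean = sum(col) / len(col)
        # Per-feature standard deviation sets the noise scale (illustrative rule only).
        sds.append(math.sqrt(sum((x - mean) ** 2 for x in col) / len(col)))
    return [[x + rng.gauss(0.0, bandwidth * sd) for x, sd in zip(rng.choice(minority), sds)]
            for _ in range(n_new)]

# Hypothetical minority-class points with two numeric features.
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 2.2]]
smote_points = smote_like(minority, n_new=5)
rose_points = rose_like(minority, n_new=5)
```

In this sketch the SMOTE-style points always lie on segments between existing minority cases, while the ROSE-style points scatter around them, so the noise scale controls how far the synthetic data can drift from the observed distribution.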

Conclusions:

Our findings suggest that preprocessing decisions—often treated as ancillary—play a central role in shaping model outcomes, especially in high-variance clinical datasets. The proposed framework offers a reproducible and adaptable tool for aligning data preparation with the unique demands of healthcare prediction tasks, and may serve as a foundation for future efforts to standardize preprocessing in clinical ML workflows.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.