JMIR Preprints #44081: Benchmarking Machine Learning Models to Predict Low Birthweight Baby Outcomes and Identify Associated Risk Factors from an Extremely Unbalanced Large-Scale Dataset

Benchmarking Machine Learning Models to Predict Low Birthweight Baby Outcomes and Identify Associated Risk Factors from an Extremely Unbalanced Large-Scale Dataset

Yang Ren;
Dezhi Wu;
Yan Tong;
Ana López-De Fede;
Sarah Gareau

ABSTRACT

Background:

Low birthweight (LBW) is one of the leading causes of neonatal mortality in the United States (US) and a major cause of short- and long-term adverse health effects in newborns. To prevent adverse birth outcomes, it is critical to identify, early in pregnancy, which mothers are at high risk of delivering LBW babies. Previous studies have proposed LBW prediction models based on various machine learning (ML) algorithms, primarily using small datasets. However, the performance of these models is substantially limited by data access barriers and data quality issues, with data imbalance being a major technical challenge. To date, few studies have benchmarked the performance of ML models in maternal health; establishing such benchmarks is therefore critical to advancing the use of ML and improving birth outcomes.

Objective:

This study aims to establish several key benchmark ML models for predicting LBW and to systematically apply different data rebalancing methods to a large-scale, extremely unbalanced Medicare and Medicaid claims dataset that links mother and baby data at the state level in the US. We also investigate the risk factors that adversely affect birth outcomes and lead to LBW.

Methods:

Our large dataset consisted of 266,687 birth records across six years from one US state. Of these records, 23,019 (8.63%) were labeled as LBW. To establish benchmark ML models for predicting LBW, we applied six classic ML models (Logistic Regression, Naïve Bayes, Random Forest, Extreme Gradient Boosting [XGBoost], Adaptive Boosting [AdaBoost], and Multi-layer Perceptron) with four data rebalancing methods: random under-sampling, random oversampling, synthetic minority oversampling technique (SMOTE), and weight rebalancing. For ethical reasons, in addition to standard ML evaluation metrics such as accuracy, precision, and F1-score, we used recall as the primary evaluation metric. Recall is the proportion of actual LBW cases that the model correctly predicts as LBW; we prioritized it because a false negative (i.e., an actual LBW case predicted as non-LBW) could be fatal to the patient. We also analyzed feature importance in our best-performing models to explore how much each feature contributes to the model prediction.
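
As a rough illustration of three of the four rebalancing methods named above, the following Python sketch implements random under-sampling, random oversampling, and class weighting on toy data. It is not the authors' code. The function names, the toy data (1,000 rows with 86 positives, chosen to mimic the roughly 8.63% LBW rate), and the equal-size target ratio are assumptions made for illustration. The weighting formula, n_samples / (n_classes * class_count), is a common "balanced" heuristic and may differ from the weight rebalancing used in the study. SMOTE is omitted because it requires interpolating between neighboring feature vectors.

```python
import random
from collections import Counter


def _group_by_label(rows, labels):
    """Collect rows into one list per class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def _flatten_and_shuffle(groups, rng):
    """Turn per-class lists back into shuffled parallel (rows, labels) lists."""
    pairs = [(row, label) for label, group in groups.items() for row in group]
    rng.shuffle(pairs)
    return [r for r, _ in pairs], [l for _, l in pairs]


def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until all classes are equal in size."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    n_min = min(len(g) for g in groups.values())
    sampled = {label: rng.sample(g, n_min) for label, g in groups.items()}
    return _flatten_and_shuffle(sampled, rng)


def random_over_sample(rows, labels, seed=0):
    """Randomly duplicate rows from smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    n_max = max(len(g) for g in groups.values())
    grown = {
        label: g + [rng.choice(g) for _ in range(n_max - len(g))]
        for label, g in groups.items()
    }
    return _flatten_and_shuffle(grown, rng)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count) (assumed heuristic)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * k) for c, k in counts.items()}


# Hypothetical toy data: row ids 0..85 are LBW (label 1), the rest are non-LBW (label 0).
rows = list(range(1000))
labels = [1] * 86 + [0] * 914

under_x, under_y = random_under_sample(rows, labels)
over_x, over_y = random_over_sample(rows, labels)
weights = balanced_class_weights(labels)

print("under-sampled class counts:", Counter(under_y))
print("over-sampled class counts:", Counter(over_y))
print("class weights:", weights)
```

Under-sampling discards majority-class records, oversampling repeats minority-class records, and class weighting leaves the data unchanged while making each minority-class error count more during training. In practice, any resampling would be applied to the training split only, so that evaluation reflects the original class distribution.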

Results:

Random Forest achieved the highest recall score, 0.62, using the random under-sampling method. XGBoost achieved the same recall score with the weight rebalancing method. Our results show that various data rebalancing methods significantly improved prediction performance for the LBW group, for example, increasing the recall score from 0.34 to 0.62. The feature importance analysis identified maternal race, the sum of prior 12-month inpatient hospitalizations, predelivery disease profile, and the social vulnerability index of housing type as important risk factors associated with LBW.
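
To show why recall, rather than accuracy, is the informative metric at this class ratio, the sketch below computes standard metrics for a trivial baseline that predicts non-LBW for every record. Only the class counts (266,687 records, 23,019 LBW) come from the abstract. The baseline classifier and the helper function names are hypothetical and are not part of the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def accuracy(tp, fp, fn, tn):
    return _safe_div(tp + tn, tp + fp + fn + tn)


def recall(tp, fp, fn, tn):
    # Share of actual positives (LBW cases) that were predicted positive.
    return _safe_div(tp, tp + fn)


def precision(tp, fp, fn, tn):
    # Share of predicted positives that are actual positives.
    return _safe_div(tp, tp + fp)


def f1_score(tp, fp, fn, tn):
    p, r = precision(tp, fp, fn, tn), recall(tp, fp, fn, tn)
    return _safe_div(2 * p * r, p + r)


# Class counts taken from the abstract.
N_TOTAL = 266_687
N_LBW = 23_019

y_true = [1] * N_LBW + [0] * (N_TOTAL - N_LBW)
y_pred = [0] * N_TOTAL  # hypothetical baseline: always predict non-LBW

counts = confusion_counts(y_true, y_pred)
baseline_accuracy = accuracy(*counts)
baseline_recall = recall(*counts)
baseline_f1 = f1_score(*counts)

print(f"accuracy={baseline_accuracy:.4f} recall={baseline_recall:.2f} f1={baseline_f1:.2f}")
```

This baseline is correct for about 91% of records yet identifies no LBW case at all, so its recall is 0. Accuracy alone would therefore hide exactly the failure that the recall scores above are meant to measure.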

Conclusions:

Our study findings establish useful ML benchmarks for improving birth outcomes in the maternal health domain. They are informative for identifying minority classes in an extremely unbalanced dataset and have important practical implications for personalized LBW early prevention programs and for maternal and infant health policy changes.

Citation

Please cite as:

Ren Y, Wu D, Tong Y, López-De Fede A, Gareau S

Issue of Data Imbalance on Low Birthweight Baby Outcomes Prediction and Associated Risk Factors Identification: Establishment of Benchmarking Key Machine Learning Models With Data Rebalancing Strategies

J Med Internet Res 2023;25:e44081

DOI: 10.2196/44081

PMID: 37256674

PMCID: 10267797