Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: May 6, 2024
Date Accepted: Oct 20, 2024
Empirical Sample Size Determination for Popular Classification Algorithms and Tabular Clinical Data: Learning Curve Analysis
ABSTRACT
Background:
The performance of a classification algorithm eventually reaches a point of diminishing returns, where adding more samples does not improve results. Thus, there is a need to determine an optimal sample size that maximizes performance while accounting for computational burden or budgetary concerns.
Objective:
To develop concrete guidelines for calculating sample size within the context of machine learning for a binary outcome in the field of healthcare/clinical data analysis, using large open-source datasets.
Methods:
Sixteen large open-source datasets were collected, each containing a binary clinical outcome. Four machine learning algorithms were assessed: XGBoost (XGB), Random Forest (RF), Logistic Regression (LR), and Neural Networks (NN). For each dataset, the cross-validated area under the curve (AUC) was calculated at increasing sample sizes, and learning curves were fit. Sample sizes needed to reach the full-dataset AUC minus 2% (ie, 0.02) were calculated from the fitted learning curves and compared across the datasets and algorithms. Dataset-level characteristics (minority class proportion, full-dataset AUC, strength/number/type of features, and degree of nonlinearity) were examined. Negative binomial regression models were used to quantify relationships between these characteristics and expected sample sizes within each algorithm. Four multivariable models were constructed, each selecting the combination of dataset-specific characteristics that minimized out-of-sample prediction error. Additional models were fitted to predict the expected gap in performance at a given sample size using the same empirical learning curve data.
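The learning-curve step described above can be illustrated with a minimal sketch. The abstract does not state which curve family was fitted, so the inverse power law AUC(n) = a - b * n^(-c) used here is an assumption, as are the function names fit_power_law and size_for_target and the synthetic data points. The sketch fits the curve to (sample size, AUC) pairs and then solves for the sample size at which the fitted curve reaches the full-dataset AUC minus 0.02.

```python
def fit_power_law(sizes, aucs, exponents=None):
    """Fit AUC(n) ~ a - b * n**(-c).

    Assumed curve form (not taken from the paper). The exponent c is chosen
    by grid search; for each candidate c, a and b come from ordinary least
    squares of AUC on x = n**(-c).
    """
    if exponents is None:
        exponents = [i / 100 for i in range(5, 201)]  # c from 0.05 to 2.00
    m = len(sizes)
    y_mean = sum(aucs) / m
    best = None
    for c in exponents:
        x = [n ** (-c) for n in sizes]
        x_mean = sum(x) / m
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, aucs))
        slope = sxy / sxx              # slope of AUC on x, equal to -b
        a = y_mean - slope * x_mean    # intercept, the asymptotic AUC
        b = -slope
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, aucs))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def size_for_target(a, b, c, target_auc):
    """Solve a - b * n**(-c) = target_auc for n."""
    gap = a - target_auc
    if gap <= 0 or b <= 0:
        raise ValueError("target AUC is not reachable on the fitted curve")
    return (b / gap) ** (1 / c)


# Synthetic learning-curve points generated from a known curve (illustrative only).
sizes = [100, 200, 500, 1000, 2000, 5000, 10000, 50000]
aucs = [0.85 - 2.0 * n ** -0.5 for n in sizes]

a, b, c = fit_power_law(sizes, aucs)

# Hypothetical full-dataset AUC at n = 100,000, then the "minus 0.02" target.
full_auc = 0.85 - 2.0 * 100000 ** -0.5
n_needed = size_for_target(a, b, c, full_auc - 0.02)
print(f"a={a:.3f}, b={b:.3f}, c={c:.2f}, n_needed={n_needed:.0f}")
```

A grid search over the exponent keeps the sketch dependency-free; a nonlinear least-squares routine would be a natural substitute when a numerical library is available.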
Results:
Among the sixteen datasets (full-dataset sample sizes ranging from 70,000 to 1,000,000), the median sample sizes needed to reach AUC convergence were 9,960 (XGB), 3,404 (RF), 696 (LR), and 12,298 (NN). For all four algorithms, more balanced classes (multiplier: 0.93-0.96 for a 1% increase in minority class proportion) were associated with decreased sample size. Other characteristics varied in importance across algorithms; in general, more features, weaker features, and more complex relationships between the predictors and the response increased expected sample sizes. In multivariable analysis, the top selected predictors for XGB and RF were minority class proportion, full-dataset AUC, and dataset nonlinearity. For LR, top predictors were minority class proportion, percentage of strong linear features, and number of features. For NN, top predictors were minority class proportion, percentage of numeric features, and dataset nonlinearity.
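To make the reported multipliers concrete, the short sketch below compounds them over several percentage points. It assumes the multipliers come from a log-link model (the usual form of negative binomial regression), so that effects multiply across units, and that "1% increase" means one percentage point; both are interpretive assumptions, and the function name compounded_multiplier is hypothetical.

```python
def compounded_multiplier(per_point_multiplier, points):
    """Multiplier on expected sample size after `points` one-point increases,
    assuming effects compound multiplicatively (log-link assumption)."""
    return per_point_multiplier ** points


# Reported range of multipliers per 1% increase in minority class proportion.
low = compounded_multiplier(0.93, 10)
high = compounded_multiplier(0.96, 10)
print(f"10-point increase: multiplier between {low:.2f} and {high:.2f}")
```

Under these assumptions, raising the minority class proportion by 10 percentage points would scale the expected sample size to roughly one half to two thirds of its original value.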
Conclusions:
The sample sizes needed to reach convergence among four popular classification algorithms vary by dataset and method and are associated with dataset-specific characteristics that can be influenced or estimated prior to the start of a research study.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.