Accepted for/Published in: JMIR Formative Research
Date Submitted: Sep 10, 2025
Open Peer Review Period: Sep 10, 2025 - Nov 5, 2025
Date Accepted: Apr 30, 2026
Prediction of 30-Day All-Cause Hospital Readmissions Using Limited Structured Electronic Health Record Data: Retrospective Comparative Study
ABSTRACT
Background:
Unplanned hospital readmissions represent a critical operational and financial challenge for healthcare systems in the United States, with 3.8 million 30-day all-cause readmissions in 2018 at an average cost of $15,200 each, totaling approximately $58 billion. Many published prediction models rely on comprehensive information (e.g., full billing abstractions, discharge summaries, labs, and vitals) that becomes available only late in the encounter, limiting their usefulness for real-time, in-hospital intervention. This creates a timeliness-accuracy trade-off: the models that are most accurate retrospectively may arrive too late to act upon.
Objective:
This study tests the central hypothesis that a clinically meaningful predictive signal for 30-day all-cause readmission is present within the minimal, structured data available at the beginning of a patient’s hospital stay. This approach addresses the critical trade-off between predictive accuracy and the timeliness required for actionable intervention.
Methods:
We conducted a retrospective comparative modeling study using a large, de-identified electronic health record (EHR) cohort of 50,000 inpatient encounters. Two feature sets were constructed: (1) a Limited set simulating an early-encounter view (the first five International Classification of Diseases [ICD] codes and the first five Current Procedural Terminology [CPT] codes, plus the Charlson Comorbidity Index [CCI]) and (2) a Rich set using all available ICD/CPT codes plus the CCI. We trained four models: Random Forest, CatBoost, Multi-Layer Perceptron (MLP), and DistilBERT (structured codes mapped to text and tokenized with distilbert-base-uncased). Evaluation used an untouched hold-out set. Primary metrics were area under the receiver operating characteristic curve (AUC-ROC), area under the precision-recall curve (PR-AUC), F1, accuracy, and calibration. To address class imbalance, only the training split was balanced, via undersampling of the majority class and bootstrap oversampling of the minority class; validation/test distributions were left unchanged.
Results:
Across three of four architectures, models trained on the Limited feature set matched, or modestly exceeded, the discrimination of their Rich counterparts, indicating that early-encounter data can be competitively predictive. For example, Random Forest achieved AUC 0.5596 (Limited) vs 0.5541 (Rich), and MLP achieved AUC 0.5386 (Limited) vs 0.5287 (Rich). Differences across architectures were small in absolute terms, with threshold-dependent metrics (e.g., F1) similarly comparable.
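To help read the AUC values above: AUC-ROC equals the probability that a randomly chosen readmitted encounter receives a higher risk score than a randomly chosen non-readmitted one, so 0.5 corresponds to chance-level ranking. The sketch below computes it directly from that pairwise definition. It is an illustration only, not the evaluation code used in the study; the function name is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc_roc(labels, scores):
    """AUC-ROC via the pairwise (Mann-Whitney) definition; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
```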
Conclusions:
Minimal admission-time coding data (ICD/CPT) augmented with CCI can provide timely and competitive performance for 30-day readmission prediction. Focusing on the quality and accessibility of early-encounter data enables real-time risk stratification and supports a shift from reactive, post-discharge analysis to proactive, in-hospital resource management. These findings motivate early-warning clinical decision-support tools that prioritize timeliness without incurring a substantial loss in accuracy.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.