Currently submitted to: JMIR Medical Informatics
Date Submitted: Feb 19, 2026
Open Peer Review Period: Mar 4, 2026 - Apr 29, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Network Analysis-Driven Machine Learning Model: Identifying High-Cost Stroke Inpatients Using Hospital Discharge Data
ABSTRACT
Background:
The escalating medical burden associated with stroke poses a substantial challenge, characterized by a skewed distribution wherein a minority of high-cost patients accounts for a disproportionate share of healthcare expenditures. Consequently, the timely and accurate identification of this cohort is paramount for optimizing the quality of care and mitigating unnecessary resource utilization.
Objective:
This study aims to construct a comorbidity network for stroke patients using hospital discharge data, extract topological features characterizing disease interactions, and integrate these features with machine learning algorithms to establish a robust and clinically interpretable framework for the accurate identification of high-cost stroke patients.
Methods:
We conducted a retrospective study using hospital discharge data from 10,301 stroke inpatients at a tertiary hospital in Northeast China between 2021 and 2023. Data from the 2021–2022 period were used to construct two networks: the Phenotypic Comorbidity Network (PCN) and the Distance-based Disease Cost Network (DDCN). Topological features were extracted from these networks to capture latent associations between comorbidities and high costs. The 2023 dataset was then partitioned into training and testing sets to develop five machine learning models for identifying high-cost stroke inpatients: Logistic Regression (LR), Support Vector Machine (SVM), Neural Network (NN), Random Forest (RF), and XGBoost. Finally, SHapley Additive exPlanations (SHAP) was applied to quantify both the global and local contributions of the model features.
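The general idea of deriving network features from earlier-period discharge records and attaching them to later-period patients can be illustrated with a minimal sketch. It is not the authors' PCN or DDCN construction, which the abstract does not specify: the edge weight here is assumed to be a plain co-occurrence count, the patient-level feature (mean weighted degree) is a hypothetical stand-in for the topological features named above, and the diagnosis codes and records are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(records):
    """Count how often each pair of diagnosis codes appears in the same admission.

    records: iterable of sets of diagnosis codes, one set per admission.
    Returns a Counter mapping (code_a, code_b) with code_a < code_b to a count.
    """
    edge_counts = Counter()
    for codes in records:
        # Sort so each undirected pair is always stored in the same order.
        edge_counts.update(combinations(sorted(set(codes)), 2))
    return edge_counts


def weighted_degree(edge_counts):
    """Sum of edge weights incident to each node (one simple topological feature)."""
    degree = Counter()
    for (a, b), weight in edge_counts.items():
        degree[a] += weight
        degree[b] += weight
    return degree


def patient_feature(codes, degree):
    """Hypothetical patient-level feature: mean weighted degree of the patient's codes."""
    codes = set(codes)
    if not codes:
        return 0.0
    return sum(degree.get(c, 0) for c in codes) / len(codes)


# Invented earlier-period admissions, used only to build the network.
network_records = [
    {"I63", "I10", "E11"},
    {"I63", "I10"},
    {"I61", "I10"},
]
edge_counts = build_cooccurrence_network(network_records)
degree = weighted_degree(edge_counts)

# Invented later-period patient; the network is fixed before this patient is seen.
new_patient_feature = patient_feature({"I63", "E11"}, degree)
```

Building the network only from the earlier period, as the study design describes, keeps the later-period outcomes out of the features that the classifiers are trained and tested on.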
Results:
The integration of network features significantly improved model performance, with XGBoost exhibiting superior predictive capability (AUC = 0.911). Global feature importance analysis indicated that network features accounted for the majority of the total contribution (52.8%). Specifically, Shortest Distance (SD), length of stay, Normalized High-Cost Propensity (NHCP), age, and insurance type were identified as the top five predictors of high-cost risk. Moreover, SHAP interaction analysis revealed the phasic heterogeneity inherent in patient resource utilization.
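A group-level contribution such as the share attributed to network features is commonly obtained by summing a per-feature global importance over the group and dividing by the total. The sketch below assumes that the importance is the mean absolute SHAP value per feature; the abstract does not state the exact computation, and the numbers are invented placeholders rather than the study's values.

```python
def group_share(mean_abs_shap, group):
    """Fraction of total global importance carried by the features in `group`.

    mean_abs_shap: dict mapping feature name to its mean absolute SHAP value.
    group: collection of feature names belonging to the group of interest.
    """
    total = sum(mean_abs_shap.values())
    if total <= 0:
        raise ValueError("total importance must be positive")
    in_group = sum(v for name, v in mean_abs_shap.items() if name in group)
    return in_group / total


# Invented importances for illustration only (not results from the study).
mean_abs_shap = {
    "SD": 0.30,
    "NHCP": 0.20,
    "length_of_stay": 0.25,
    "age": 0.15,
    "insurance_type": 0.10,
}
network_features = {"SD", "NHCP"}  # assumed grouping of the network-derived features

network_share = group_share(mean_abs_shap, network_features)
```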
Conclusions:
Our comprehensive framework, integrating comorbidity network analysis with machine learning algorithms, significantly enhances the identification of high-cost stroke inpatients. These findings highlight the framework's potential utility in optimizing healthcare resource allocation and enabling proactive cost containment strategies.
Clinical Trial: Not applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.