Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jul 9, 2025
Date Accepted: Feb 17, 2026
Predicting the Risk of Progression to Dialysis in Patients with Polycystic Kidney Disease: A Population-based Machine Learning Study
ABSTRACT
Background:
Autosomal dominant polycystic kidney disease (ADPKD), characterized by progressive cyst growth and renal decline, is the leading genetic cause of end‐stage renal disease.
Objective:
Early identification of high-risk patients is critical for timely monitoring. This study aimed to develop and validate machine learning models for predicting the risk of progression to dialysis in patients with ADPKD using a nationwide administrative database.
Methods:
This retrospective cohort study utilized data from Taiwan's National Health Insurance Research Database (2007–2018) to identify newly diagnosed ADPKD patients. We employed six machine learning algorithms, including Logistic Regression, Random Forest, and eXtreme Gradient Boosting (XGBoost), to predict progression to dialysis. Models were developed using 10-fold cross-validation, with Synthetic Minority Over-sampling Technique applied within training folds to address class imbalance. An ensemble-based feature selection strategy was implemented to identify the most robust predictors and optimize final model performance. Model evaluation was conducted using a strict temporal split.
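As an illustration of applying oversampling only within training folds, the following is a minimal pure-Python sketch, not the authors' code. The helper names (`kfold_indices`, `smote_like`, `resampled_folds`), the unstratified folds, and the neighbour count are assumptions made for this example, and `smote_like` is a simplified stand-in for the Synthetic Minority Over-sampling Technique. What it shows is that each validation fold stays untouched while synthetic minority samples are generated from training data alone.

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def smote_like(X, y, minority=1, k_neighbors=3, seed=0):
    """Balance a binary data set by interpolating between a minority sample
    and one of its nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(y) - len(minority_rows)) - len(minority_rows)
    if len(minority_rows) < 2 or n_needed <= 0:
        return list(X), list(y)
    X_out, y_out = list(X), list(y)
    for _ in range(n_needed):
        base = rng.choice(minority_rows)
        # Nearest minority neighbours by squared Euclidean distance.
        others = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k_neighbors]
        neighbour = rng.choice(others)
        t = rng.random()  # position along the segment base -> neighbour
        X_out.append([a + t * (b - a) for a, b in zip(base, neighbour)])
        y_out.append(minority)
    return X_out, y_out


def resampled_folds(X, y, k=10, seed=0):
    """Yield (X_train, y_train, X_val, y_val) per fold.
    Only the training part is oversampled; validation data are left as-is."""
    folds = kfold_indices(len(y), k, seed)
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        X_tr, y_tr = smote_like(
            [X[j] for j in train_idx], [y[j] for j in train_idx], seed=seed + i
        )
        yield X_tr, y_tr, [X[j] for j in val_idx], [y[j] for j in val_idx]
```

Resampling before splitting would let synthetic points derived from validation patients leak into training, which tends to inflate cross-validated performance; resampling inside each fold avoids this.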
Results:
The study included 1,856 patients with ADPKD, of whom 302 (16.27%) progressed to dialysis. A multivariable Cox regression identified several significant risk factors, including age ≥66 years (hazard ratio [HR] 4.63, 95% CI 2.71-7.92; P<.001), anemia (HR 4.33, 95% CI 3.25-5.78; P<.001), congestive heart failure (CHF) (HR 1.81, 95% CI 1.29-2.54; P<.001), and acute kidney injury (AKI) (HR 1.69, 95% CI 1.19-2.41; P=.003). Among the machine learning models developed, the XGBoost model, using an optimized set of 27 features, demonstrated the highest predictive performance on the held-out temporal test set (accuracy 98.3%; AUC 0.955; F1 score 0.800; Brier score 0.022). The top predictors in the XGBoost model largely aligned with age, comorbidity burden, anemia, and cardiovascular disease markers, and medication use (e.g., anticoagulants, loop diuretics, febuxostat) was also among the most influential predictors. Importantly, medication-related predictors should be interpreted as proxies for disease complexity rather than direct risk modulators.
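For readers less familiar with the reported metrics, the sketch below shows how accuracy, F1 score, Brier score, and AUC can be computed from true labels and predicted probabilities. It is an illustrative pure-Python example; the function names, the 0.5 decision threshold, and the toy data are assumptions for this example and are not taken from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative
    (ties count as one half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = progressed to dialysis, 0 = did not.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold
```

With imbalanced outcomes, accuracy can be high even when some positive cases are missed, because correctly classified negatives dominate the total. This is why F1 score, AUC, and a calibration-oriented measure such as the Brier score are usually reported alongside it.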
Conclusions:
This study demonstrates that machine learning models can predict dialysis risk in ADPKD patients using administrative data with temporal validation. This approach may support risk stratification by helping identify individuals at higher predicted risk who may warrant closer monitoring and further specialist evaluation.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.