Accepted for/Published in: JMIR AI
Date Submitted: Jul 30, 2025
Open Peer Review Period: Aug 1, 2025 - Sep 26, 2025
Date Accepted: Dec 5, 2025
Date Submitted to PubMed: Dec 8, 2025
Accelerating Discovery of Leukemia Inhibitors with AI-Driven QSAR Modeling
ABSTRACT
Background:
Leukemia treatment remains a major challenge in oncology. Although thiadiazolidinone (TDZD) analogs show potential to inhibit leukemia cell proliferation, they often lack sufficient potency and selectivity. Traditional drug discovery struggles to explore the vast chemical space efficiently, highlighting the need for innovative computational strategies. Machine learning (ML)-enhanced quantitative structure-activity relationship (QSAR) modeling offers a promising route to identify and optimize inhibitors with improved activity and specificity.
Objective:
To develop and validate an integrated ML-enhanced QSAR modeling workflow for the rational design and activity prediction of thiadiazolidinone (TDZD) analogs with improved anti-leukemia activity. The workflow systematically evaluates molecular descriptors and algorithms to identify key determinants of potency and guide future inhibitor optimization.
Methods:
We analyzed 35 TDZD derivatives with confirmed anti-leukemia activity, with outliers removed to ensure data quality. Using Schrödinger MAESTRO, we calculated 220 molecular descriptors (1D–4D). Seventeen ML models, including Random Forests, XGBoost, and Neural Networks, were trained on 70% of the data and tested on the remaining 30%, using stratified sampling. Models were evaluated with 12 measures, including mean squared error (MSE), the coefficient of determination (R²), and SHAP (SHapley Additive exPlanations) values, and were optimized via hyperparameter tuning and 5-fold cross-validation. Additional analyses, including train-test gap assessment, comparison with baseline linear models, and cross-validation stability analysis, were performed to distinguish genuine learning from overfitting.
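The split, tuning, and train-test gap steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the descriptors and activities are synthetic, a closed-form ridge regression stands in for the seventeen models, a plain random split replaces the stratified sampling, and every function name (`fit_ridge`, `kfold_indices`, `cv_r2`) is hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression; the intercept is handled by centering."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def predict(model, X):
    w, b = model
    return X @ w + b

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and cut into k nearly equal validation folds."""
    return np.array_split(rng.permutation(n), k)

def cv_r2(X, y, alpha, folds):
    """R² on each held-out fold, training on the remaining folds."""
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit_ridge(X[train_idx], y[train_idx], alpha)
        scores.append(r2(y[val_idx], predict(model, X[val_idx])))
    return np.array(scores)

# Synthetic stand-in data: 35 compounds as in the abstract, but only
# 5 made-up descriptors (the study computed 220) and made-up activities.
rng = np.random.default_rng(0)
n_compounds, n_descriptors = 35, 5
X = rng.normal(size=(n_compounds, n_descriptors))
y = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(scale=0.05, size=n_compounds)

# 70/30 split (plain random here; the abstract reports stratified sampling).
perm = rng.permutation(n_compounds)
n_train = (7 * n_compounds) // 10
train, test = perm[:n_train], perm[n_train:]

# Hyperparameter tuning: choose the ridge penalty by mean 5-fold CV R²
# computed on the training portion only. The grid is illustrative.
folds = kfold_indices(n_train, 5, rng)
alphas = [0.01, 0.1, 1.0, 10.0]
mean_cv = [cv_r2(X[train], y[train], a, folds).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(mean_cv))]

# Refit on the full training portion, then measure the train-test gap.
model = fit_ridge(X[train], y[train], best_alpha)
r2_train = r2(y[train], predict(model, X[train]))
r2_test = r2(y[test], predict(model, X[test]))
mse_train = mse(y[train], predict(model, X[train]))
mse_test = mse(y[test], predict(model, X[test]))
delta_r2 = r2_test - r2_train      # negative when test R² is lower
delta_mse = mse_test - mse_train   # positive when test error is higher

print(f"best alpha = {best_alpha}")
print(f"delta R2 = {delta_r2:+.4f}, delta MSE = {delta_mse:+.6f}")
```

The sign convention here (test minus train) matches the way a gap such as ΔR² = -0.01 would be read as a small drop on held-out data. With only 35 compounds, each validation fold holds four or five molecules, so per-fold R² can vary widely; that is one reason a cross-validation stability check is informative at this sample size.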
Results:
Ensemble methods, especially LightGBM and Random Forest, showed the best predictive performance (LightGBM: MSE = 0.00063 ± 0.00012; R² = 0.971 ± 0.0084). Training-to-test performance degradation was modest (ΔR² = -0.01, ΔMSE = +0.000126), suggesting genuine pattern learning rather than memorization. Isotonic Regression ranked second, outperforming baseline models by over 15% in explained variance. SHAP analysis revealed that the most influential features contributing to anti-leukemia activity were global molecular shape (r_qp_glob; mean SHAP value = 0.52), weighted polar surface area (r_qp_WPSA; ~0.50), polarizability (r_qp_QPpolrz; ~0.49), partition coefficient (r_qp_QPlogPC16; ~0.48), solvent-accessible surface area (r_qp_SASA; ~0.48), hydrogen bond donor count (r_qp_donorHB; ~0.48), and the sum of topological distances between oxygen and chlorine atoms (i_desc_Sum_of_topological_distances_between_O.Cl; ~0.47). These descriptors highlight the importance of steric complementarity and the three-dimensional arrangement of functional groups. Aqueous solubility (r_qp_QPlogS; ~0.47) and hydrogen bond acceptor count (r_qp_accptHB; ~0.44) were also among the top ten features. The importance of these descriptors was consistent across multiple algorithms, including Random Forest, XGBoost, and partial least squares (PLS).
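To make the idea of ranking descriptors by mean SHAP value concrete, the sketch below uses the one case where SHAP values have a simple closed form: a linear model with features treated as independent, for which the contribution of feature j to sample i is w_j (x_ij - mean_j). This is an assumption-laden illustration only. The reported results come from ensemble models, the abstract does not state whether "mean SHAP value" is a mean of absolute values (a common convention, assumed here), and the descriptor names `desc_A` to `desc_C` and all data are made up.

```python
import numpy as np

def linear_shap(X, w):
    """SHAP values of a linear model under feature independence:
    phi[i, j] = w[j] * (X[i, j] - mean of column j)."""
    return (X - X.mean(axis=0)) * w

# Synthetic data: 35 samples, 3 hypothetical descriptors.
rng = np.random.default_rng(1)
names = ["desc_A", "desc_B", "desc_C"]
X = rng.normal(size=(35, 3))
y = X @ np.array([2.0, 0.5, 0.0]) + rng.normal(scale=0.05, size=35)

# Ordinary least squares with an intercept column.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

phi = linear_shap(X, w)
pred = X @ w + b

# Additivity: per-sample contributions sum to the prediction minus
# the average prediction over the data.
additive = np.allclose(phi.sum(axis=1), pred - pred.mean())

# Global importance: mean absolute SHAP value per descriptor.
importance = np.abs(phi).mean(axis=0)
ranking = [names[j] for j in np.argsort(-importance)]

for name in ranking:
    print(f"{name}: {importance[names.index(name)]:.3f}")
```

The additivity check is the property that makes such rankings interpretable: each prediction decomposes exactly into per-descriptor contributions around the average prediction. For tree ensembles the contributions are computed differently, but the resulting per-feature averages are read the same way.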
Conclusions:
Integrating advanced ML with QSAR modeling enables systematic analysis of structure-activity relationships in TDZD analogs on this dataset. While ensemble methods capture complex patterns with high internal validation metrics, external validation on independent compounds and prospective experimental testing are essential before broad therapeutic claims can be made. This work provides a methodological foundation and identifies molecular features for future validation efforts.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.