Accepted for/Published in: JMIR AI
Date Submitted: May 21, 2025
Open Peer Review Period: Jun 9, 2025 - Aug 4, 2025
Date Accepted: Aug 16, 2025
Discovery of DNA Polymerase Inhibitors: Machine Learning–Enhanced QSAR Modeling Approach
ABSTRACT
Background:
Cisplatin resistance remains a significant obstacle in cancer therapy, frequently driven by translesion DNA synthesis (TLS) mechanisms that utilize specialized polymerases such as human DNA polymerase η (hpol η). Although small-molecule inhibitors like PNR-7-02 have demonstrated potential to disrupt hpol η activity, current compounds often lack sufficient potency and specificity to effectively combat chemoresistance. The vastness of chemical space further limits traditional drug discovery approaches, underscoring the need for advanced computational strategies such as machine learning (ML)-enhanced quantitative structure-activity relationship (QSAR) modeling.
Objective:
This study aimed to develop and validate ML-augmented QSAR models to accurately predict hpol η inhibition by indole thio-barbituric acid (ITBA) analogs, with the goal of accelerating the discovery of potent and selective inhibitors to overcome cisplatin resistance.
Methods:
A curated library of 85 ITBA analogs with validated hpol η inhibition data was used, with outliers excluded to ensure data integrity. Molecular descriptors spanning 1D to 4D were computed, resulting in 220 features. Seventeen ML algorithms, including Random Forests, XGBoost, and Neural Networks, were trained on 80% of the data and evaluated with 14 performance metrics. Robustness was ensured through hyperparameter optimization and 5-fold cross-validation.
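The workflow described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn, uses synthetic random data in place of the real descriptor matrix and activity values, and the hyperparameter grid and the names `X`, `y`, and `search` are hypothetical choices made only for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the data set: 85 compounds x 220 descriptors.
# The real descriptors and inhibition values are NOT reproduced here.
rng = np.random.default_rng(0)
n_compounds, n_descriptors = 85, 220
X = rng.normal(size=(n_compounds, n_descriptors))
# Hypothetical non-linear "activity" so the model has something to learn.
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n_compounds)

# 80% of the compounds for training, the rest held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameter optimization with 5-fold cross-validation on the training set.
# The grid below is an assumption chosen to keep the example fast.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
y_pred = search.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)
print("best params:", search.best_params_)
print("test MSE:", test_mse, "test R2:", test_r2)
```

Keeping the test split outside the cross-validation loop means the reported test metrics come from compounds the tuned model never saw during hyperparameter selection.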
Results:
Ensemble methods outperformed other algorithms, with Random Forest achieving near-perfect predictive performance (training MSE = 0.0002, R² = 0.9999; testing MSE = 0.0003, R² = 0.9998). SHAP analysis revealed that electronic properties, lipophilicity, and topological atomic distances were the most important predictors of hpol η inhibition. Linear models exhibited higher error rates, highlighting the non-linear relationship between molecular descriptors and inhibitory activity.
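For readers less familiar with the two metrics quoted above, the following sketch shows how MSE and R² are conventionally computed from true and predicted values. The function names and the four sample values are illustrative assumptions, not data from the study.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical activity values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

mse = mean_squared_error(y_true, y_pred)  # approximately 0.025
r2 = r_squared(y_true, y_pred)            # approximately 0.98
print(mse, r2)
```

An R² close to 1 means the residual sum of squares is small relative to the total variance of the measured values, which is why MSE and R² are usually read together.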
Conclusions:
Integrating machine learning with QSAR modeling provides a robust framework for optimizing hpol η inhibition, offering both high predictive accuracy and biochemical interpretability. This approach accelerates the identification of potent, selective inhibitors and represents a promising strategy to overcome cisplatin resistance, thereby advancing precision oncology.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.