Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 4, 2025
Open Peer Review Period: Apr 15, 2025 - Jun 10, 2025
Date Accepted: Jul 15, 2025
Date Submitted to PubMed: Aug 8, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Development and Validation of Machine Learning Algorithm with Oversampling Technique in Limited Data Scenarios for Prediction of Present and Future Restorative Treatment Need
ABSTRACT
Background:
Untreated dental caries is the most common health condition globally. New strategies are therefore needed to reduce the manifestations of dental caries.
Objective:
The aim of this study was to develop and test a machine learning (ML) algorithm for detecting present and predicting future carious lesions in the adolescent population, utilizing a set of easy-to-collect predictive variables. A second aim was to handle a small, imbalanced dataset by using an oversampling method.
Methods:
This population-based study was conducted in 2022 among secondary schoolchildren aged 13 to 17 years in northern Finland. Participants were eligible if they had completed a web-based risk assessment questionnaire and undergone a clinical examination at Public Healthcare Services; a total of 218 participants met these inclusion criteria. Dental caries (ICDAS4-6) and active initial caries (ICDAS2+,3+) were considered as outcomes. Several predictors, such as behavioral and dietary habits, were included. An eXtreme Gradient Boosting (XGBoost) model was developed, tested, and assessed for its predictive performance. A 4-fold cross-validation (CV) was performed using the nested resampling technique. The random over-sampling examples (ROSE) method and the k-nearest neighbors (KNN) classifiers were applied in all four folds. The mean (SD) performance across all folds was computed.
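The way oversampling fits inside a 4-fold CV can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses plain random oversampling with replacement rather than ROSE (which generates smoothed synthetic examples), it omits the XGBoost model and the inner tuning loop of nested resampling, and the labels are synthetic. The helper names `stratified_folds` and `oversample_minority` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def oversample_minority(train_idx, labels, seed=0):
    """Resample smaller classes with replacement up to the largest class size.

    Assumption: simple duplication, not the smoothed bootstrap used by ROSE.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced


# Synthetic labels only: 218 samples, roughly two thirds positive.
labels = [1] * 143 + [0] * 75
folds = stratified_folds(labels, k=4)

splits = []
for i, test_idx in enumerate(folds):
    # Training data is every fold except the held-out one.
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    # Oversample the training portion only; the test fold stays untouched.
    balanced_train = oversample_minority(train_idx, labels, seed=i)
    splits.append((balanced_train, test_idx))
    counts = Counter(labels[idx] for idx in balanced_train)
    print(f"fold {i}: train class counts {dict(counts)}, test size {len(test_idx)}")
```

Resampling only the training portion of each fold keeps duplicated samples out of the held-out data, so the fold-level metrics are not inflated by leakage.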
Results:
The prevalence of dental caries was 65.56% (ICDAS2+,3+,4-6). For the XGBoost model, the mean (SD) area under the curve (AUC) was 0.769 (0.042) and the mean (SD) F1-score was 0.816 (0.055). After oversampling, the mean (SD) AUC and F1-score were 0.744 (0.045) and 0.787 (0.035), respectively. The SHapley Additive exPlanations (SHAP) values were calculated for all four folds to assess feature importance, revealing that previous dental fillings were the feature most strongly associated with the need for restorative treatment.
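A mean (SD) summary of a fold-level metric, as reported above, can be computed along these lines. This is a sketch only: the per-fold confusion counts are invented for illustration and are not the study's data, the helper names `f1_score` and `summarize` are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not say which SD formula was applied.

```python
from statistics import mean, stdev


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def summarize(values):
    """Format fold-level values as 'mean (SD)', using the sample SD."""
    return f"{mean(values):.3f} ({stdev(values):.3f})"


# Invented (TP, FP, FN) counts for four folds, for illustration only.
fold_counts = [(30, 10, 5), (28, 9, 8), (32, 7, 6), (29, 11, 4)]
fold_f1 = [f1_score(tp, fp, fn) for tp, fp, fn in fold_counts]
print("F1 mean (SD):", summarize(fold_f1))
```

With only four folds the SD is a rough indicator of variability, so it is best read alongside the individual fold values.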
Conclusions:
Based on the performance metrics (mean AUC 0.769, mean F1-score 0.816), the ML algorithm developed and tested in this study showed good predictive performance. The ML algorithm could serve as a cost-effective screening tool for dental professionals in identifying the risk of future restorative treatment need. However, future studies with longitudinal cohorts, along with external validation for generalizability, are needed to validate our model.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.