JMIR Preprints #75117: Development and Validation of Machine Learning Algorithm with Oversampling Technique in Limited Data Scenarios for Prediction of Present and Future Restorative Treatment Need

Development and Validation of Machine Learning Algorithm with Oversampling Technique in Limited Data Scenarios for Prediction of Present and Future Restorative Treatment Need

Elina Väyrynen;
Otso Tirkkonen;
Henna Tiensuu;
Jaakko Suutala;
Vuokko Anttonen;
Marja-Liisa Laitala;
Katri Kukkola;
Saujanya Karki

ABSTRACT

Background:

Untreated dental caries is the most common health condition globally, so new strategies are needed to reduce its manifestations.

Objective:

The aim of this study was to develop and test a machine learning (ML) algorithm for detecting present and predicting future carious lesions in the adolescent population, using a set of easy-to-collect predictive variables. A further aim was to handle a small, imbalanced dataset by applying an oversampling method.

Methods:

This population-based study was conducted in 2022 among secondary schoolchildren aged 13–17 years from the northern parts of Finland. A total of 218 participants met the inclusion criteria, which required having completed a web-based risk assessment questionnaire and undergone a clinical examination at Public Healthcare Services. Dental caries (ICDAS4-6) and active initial caries (ICDAS2+,3+) were considered as outcomes. Several predictors, such as behavioral and dietary habits, were included. An eXtreme Gradient Boosting (XGBoost) model was developed, tested, and assessed for its predictive performance. A 4-fold cross-validation (CV) was performed using the nested resampling technique. The random over-sampling examples (ROSE) method and the k-nearest neighbors (KNN) classifiers were applied in all four folds. The mean (SD) performance across all folds was computed.
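The resampling scheme described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it uses plain random duplication of minority-class samples in place of ROSE, it leaves out model fitting and the inner tuning loop of nested resampling, and the function names and the 143/75 label split are hypothetical values chosen only to resemble the reported sample size and prevalence. The point it shows is that oversampling is applied to the training portion of each fold only, so the held-out fold keeps its original class distribution.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=4, seed=0):
    """Assign sample indices to k folds while keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def oversample(train_idx, labels, seed=0):
    """Duplicate minority-class indices at random until classes are balanced.

    Simplification: ROSE draws synthetic samples from a smoothed bootstrap;
    here existing samples are merely repeated.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced


def cv_splits(labels, k=4, seed=0):
    """Yield (oversampled training indices, untouched test indices) per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield oversample(train_idx, labels, seed + i), test_idx


# Hypothetical labels: 218 samples, about two thirds positive.
labels = [1] * 143 + [0] * 75
splits = list(cv_splits(labels))
```

Each element of `splits` could then be passed to any classifier: fit on the balanced training indices, evaluate on the untouched test indices.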

Results:

The prevalence of dental caries was 65.56% (ICDAS2+,3+,4-6). The mean (SD) area under the curve (AUC) was 0.769 (0.042) and the mean (SD) F1-score was 0.816 (0.055) for the XGBoost model. Similarly, the mean (SD) AUC and mean (SD) F1-scores after oversampling were 0.744 (0.045) and 0.787 (0.035), respectively. The SHapley Additive exPlanations (SHAP) values were calculated for all four folds to assess feature importance, revealing that previous dental fillings were the feature most strongly associated with the need for restorative treatment.
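For readers unfamiliar with the reported metrics, the following standard-library Python sketch shows one way AUC, F1-score, and a mean (SD) summary across folds can be computed. The function names and the toy inputs are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not state which SD estimator was used.

```python
from statistics import mean, stdev


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Requires at least one sample of each class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(fold_values):
    """Return (mean, sample SD) over per-fold metric values (assumed estimator)."""
    return mean(fold_values), stdev(fold_values)


# Toy example, not study data: four samples with predicted probabilities.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

toy_auc = auc(y_true, scores)
toy_f1 = f1_score(y_true, y_pred)
```

AUC depends only on how the scores rank the samples, whereas F1 depends on the chosen decision threshold, which is one reason the two metrics can move differently after oversampling.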

Conclusions:

Based on the performance metrics, the predictive performance of the ML algorithm developed and tested in this study can be considered good. The ML algorithm could serve as a cost-effective screening tool for dental professionals in identifying the risk of future restorative treatment needs. However, future studies with longitudinal cohort data, along with external validation for generalizability, are needed to validate our model.

Citation

Please cite as:

Väyrynen E, Tirkkonen O, Tiensuu H, Suutala J, Anttonen V, Laitala ML, Kukkola K, Karki S

A Machine Learning Algorithm With an Oversampling Technique in Limited Data Scenarios for the Prediction of Present and Future Restorative Treatment Need: Development and Validation Study

JMIR Med Inform 2025;13:e75117

DOI: 10.2196/75117

PMID: 40778806

PMCID: 12426571

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 4, 2025

Open Peer Review Period: Apr 15, 2025 - Jun 10, 2025

Date Accepted: Jul 15, 2025

Date Submitted to PubMed: Aug 8, 2025
