Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 4, 2025
Open Peer Review Period: Apr 15, 2025 - Jun 10, 2025
Date Accepted: Jul 15, 2025
Date Submitted to PubMed: Aug 8, 2025

The final, peer-reviewed published version of this preprint can be found here:

A Machine Learning Algorithm With an Oversampling Technique in Limited Data Scenarios for the Prediction of Present and Future Restorative Treatment Need: Development and Validation Study

Väyrynen E, Tirkkonen O, Tiensuu H, Suutala J, Anttonen V, Laitala ML, Kukkola K, Karki S

A Machine Learning Algorithm With an Oversampling Technique in Limited Data Scenarios for the Prediction of Present and Future Restorative Treatment Need: Development and Validation Study

JMIR Med Inform 2025;13:e75117

DOI: 10.2196/75117

PMID: 40778806

PMCID: 12426571

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Development and Validation of Machine Learning Algorithm with Oversampling Technique in Limited Data Scenarios for Prediction of Present and Future Restorative Treatment Need

  • Elina Väyrynen
  • Otso Tirkkonen
  • Henna Tiensuu
  • Jaakko Suutala
  • Vuokko Anttonen
  • Marja-Liisa Laitala
  • Katri Kukkola
  • Saujanya Karki

ABSTRACT

Background:

Untreated dental caries is the most common health condition globally. Because of this, new strategies need to be developed to reduce the manifestations of dental caries.

Objective:

The aim of this study was to develop and test a machine learning (ML) algorithm for detecting present and predicting future carious lesions in the adolescent population, using a set of easy-to-collect predictive variables. A further aim was to handle a small, imbalanced dataset with an oversampling method.

Methods:

This population-based study was conducted in 2022 among secondary school children aged 13–17 years from northern parts of Finland. To be included, participants had to have completed a web-based risk assessment questionnaire and undergone a clinical examination at Public Healthcare Services; a total of 218 participants met these criteria. Dental caries (ICDAS4-6) and active initial caries (ICDAS2+,3+) were considered as outcomes. Several predictors, such as behavioral and dietary habits, were included. An eXtreme Gradient Boosting (XGBoost) model was developed, tested, and assessed for its predictive performance. A 4-fold cross-validation (CV) was performed using the nested resampling technique. The random over-sampling examples (ROSE) method and k-nearest neighbors (KNN) classifiers were applied in all four folds. The mean (SD) performance across all folds was computed.
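The resampling design described above can be illustrated with a minimal sketch in plain Python. It is not the authors' pipeline: the function names, the toy data, and the `shrink` noise factor are all hypothetical, the smoothed bootstrap is only a simplified stand-in for ROSE, and the inner tuning loop of the nested resampling is omitted. The point it shows is that oversampling is applied to the training portion of each fold only, so the held-out fold keeps its original class balance.

```python
import random
import statistics

def stratified_folds(labels, k=4, seed=0):
    """Split row indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def smoothed_oversample(X, y, seed=0, shrink=0.5):
    """ROSE-style smoothed bootstrap (simplified, assumed form): resample
    minority rows and jitter them with Gaussian noise until classes match."""
    rng = random.Random(seed)
    counts = {cls: y.count(cls) for cls in set(y)}
    minority = min(counts, key=counts.get)
    rows = [x for x, label in zip(X, y) if label == minority]
    # Per-feature spread of the minority class sets the noise scale.
    sds = [statistics.pstdev(col) for col in zip(*rows)]
    X_new, y_new = list(X), list(y)
    for _ in range(max(counts.values()) - counts[minority]):
        base = rng.choice(rows)
        X_new.append([v + rng.gauss(0.0, shrink * s) for v, s in zip(base, sds)])
        y_new.append(minority)
    return X_new, y_new

def cv_with_oversampling(X, y, k=4, seed=0):
    """Yield (X_train, y_train, X_test, y_test) per fold; only the
    training part is oversampled, the test fold is left untouched."""
    folds = stratified_folds(y, k, seed)
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        X_tr, y_tr = smoothed_oversample(
            [X[i] for i in train_idx], [y[i] for i in train_idx], seed + f
        )
        yield X_tr, y_tr, [X[i] for i in test_idx], [y[i] for i in test_idx]

# Toy data (hypothetical, not the study data): 28 positives, 12 negatives.
toy_rng = random.Random(1)
y = [1] * 28 + [0] * 12
X = [[toy_rng.gauss(label, 1.0), toy_rng.gauss(0.0, 1.0)] for label in y]
splits = list(cv_with_oversampling(X, y, k=4))
```

In a full pipeline, a classifier such as XGBoost would be fitted on each oversampled training set and scored on the corresponding untouched test fold.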

Results:

The prevalence of dental caries was 65.56% (ICDAS2+,3+,4-6). The mean (SD) area under the curve (AUC) was 0.769 (0.042) and the mean (SD) F1-score was 0.816 (0.055) for the XGBoost model. Similarly, the mean (SD) AUC and mean (SD) F1-scores after oversampling were 0.744 (0.045) and 0.787 (0.035), respectively. The SHapley Additive exPlanations (SHAP) values were calculated for all four folds to assess feature importance, revealing that previous dental fillings were the feature most strongly associated with the need for restorative treatment.
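As a rough illustration of how per-fold AUC and F1-score values are summarized as mean (SD), the following plain-Python sketch may help. The helper names are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not state which SD formula was used. The AUC here is the rank-based form: the proportion of positive-negative pairs in which the positive case receives the higher score, with ties counted as one half.

```python
import statistics

def auc(y_true, scores):
    """Rank-based AUC: share of (positive, negative) pairs where the
    positive case scores higher; ties count as 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN) for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def mean_sd(values):
    """Mean and sample SD across folds (sample SD is an assumption)."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical single-fold example, not the study's predictions.
fold_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
fold_f1 = f1_score([1, 1, 0, 1], [1, 0, 0, 1])

# Hypothetical per-fold values, used only to show the summary step.
summary_mean, summary_sd = mean_sd([0.7, 0.8, 0.9, 0.8])
```

With four folds the SD rests on only four values, so it gives a coarse indication of fold-to-fold variability rather than a precise estimate.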

Conclusions:

Based on its performance metrics, the ML algorithm developed and tested in this study can be considered to perform well. It could serve as a cost-effective screening tool for dental professionals in identifying the risk of future restorative treatment needs. However, future studies with longitudinal cohort data, along with external validation for generalizability, are needed to validate our model.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.