Accepted for/Published in: JMIR Research Protocols
Date Submitted: Apr 26, 2024
Date Accepted: Oct 18, 2024
The Application of Machine Learning Algorithms to Predict HIV Testing in Repeated Adult Population-Based Surveys in South Africa: A Protocol for Multi-Wave Cross-Sectional Analysis
ABSTRACT
Background:
Human immunodeficiency virus (HIV) testing is the cornerstone of HIV prevention and a pivotal step toward the Joint United Nations Programme on HIV/AIDS (UNAIDS) goal of ending AIDS by 2030. Although survey data on the factors associated with HIV testing in South Africa are increasingly available, the feasibility and effectiveness of using machine learning (ML) approaches to analyze and predict HIV testing in repeated adult population-based survey data remain underexplored. Closing this gap is needed to inform evidence-based interventions.
Objective:
The study aims to determine consistent predictors of HIV testing by applying supervised machine learning (SML) algorithms to repeated adult population-based surveys in South Africa.
Methods:
This retrospective analysis of multiple cross-sectional surveys will use SML algorithms to identify factors associated with HIV testing across the five cycles of the South African National HIV Prevalence, Incidence, Behavior, and Communication Survey (SABSSM). The Human Sciences Research Council (HSRC) conducted the SABSSM surveys in 2002, 2005, 2008, 2012, and 2017. The available SABSSM datasets will be imported into RStudio for cleaning and outlier removal. A chi-square test will be conducted to select important predictors of HIV testing. The selected features will be encoded using one-hot encoding based on the information available in the SABSSM surveys. Each dataset from the five cycles of the SABSSM surveys combined will be split into 80% training and 20% test samples. Logistic regression, support vector machines (SVM), random forests, and decision trees will be employed. A k-fold cross-validation technique will be used to divide the training sample into k folds, including a validation set, and the models will be trained on each fold. The models' performance will be evaluated on the validation set using metrics such as accuracy, precision, recall, F1-score, area under the receiver operating characteristic curve (AUC-ROC), and the confusion matrix.
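The preprocessing and resampling steps described above can be sketched in a few lines. The following is a minimal illustration in plain Python, not the R workflow the protocol names. The function names (chi_square_2x2, one_hot, train_test_split, k_fold), the restriction of the chi-square statistic to a 2x2 table, the fixed random seed, the choice of k = 5, and the toy values are all assumptions made for the example; none of them is specified by the protocol.

```python
import random


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 contingency table [[a, b], [c, d]], e.g. a binary predictor
    cross-tabulated against ever tested / never tested."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty row or column carries no information
    return n * (a * d - b * c) ** 2 / denom


def one_hot(values):
    """One-hot encode a categorical column.
    Returns the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return categories, rows


def train_test_split(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into training and test sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (training, validation) index lists for k-fold cross-validation.
    Each fold serves as the validation set exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


if __name__ == "__main__":
    # Hypothetical toy values, not SABSSM data.
    print(chi_square_2x2(10, 20, 30, 40))
    print(one_hot(["urban", "rural", "urban"]))
    train, test = train_test_split(100)
    for fold_train, fold_val in k_fold(train, k=5):
        print(len(fold_train), len(fold_val))
```

In this sketch the test indices are set aside before any fold is formed, so the held-out 20% never takes part in cross-validation.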
Results:
The SABSSM datasets are open-access datasets available on the HSRC database. The HSRC provided access to all the SABSSM datasets, which were explored to identify the independent variables that are likely to influence HIV testing uptake. The findings of this study will identify variables that consistently predict HIV testing uptake among the South African adult population over the course of 20 years. Furthermore, the study will evaluate and compare the performance metrics of the four ML algorithms, and the best-performing algorithm will be used to develop an HIV testing predictive model.
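As an illustration of how the performance metrics named in the Methods could be computed when comparing algorithms, the sketch below derives them from binary labels in plain Python. The function names (confusion_counts, classification_metrics, auc_roc), the coding of 1 as "tested" and 0 as "not tested", and the toy labels and scores are assumptions for the example only. The AUC-ROC is computed with the pairwise (Mann-Whitney) formulation, one of several equivalent ways to obtain it.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded 1 (positive) and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score from the confusion matrix.
    A metric with a zero denominator is reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auc_roc(y_true, scores):
    """AUC-ROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical labels and predicted scores, not study results.
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    scores = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]
    print(classification_metrics(y_true, y_pred))
    print(auc_roc(y_true, scores))
```

Running the same functions on each algorithm's validation-set predictions would yield one row of metrics per model for side-by-side comparison.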
Conclusions:
This study will contribute to knowledge and deepen understanding of the factors linked to HIV testing beyond what traditional methods offer. Consequently, the findings will inform evidence-based recommendations that can guide policymakers in formulating more effective and targeted public health approaches to strengthening HIV testing.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.