Accepted for/Published in: JMIRx Med
Date Submitted: Nov 6, 2020
Open Peer Review Period: Nov 6, 2020 - Nov 12, 2020
Date Accepted: Mar 12, 2021
Date Submitted to PubMed: Sep 19, 2023
Machine Learning for Risk Group Identification and User Data Collection, a Case-Study of Herpes Simplex Virus Patient Registry: Algorithm Development and Validation
ABSTRACT
Background:
Conducting research about people with herpes simplex virus (HSV) is challenging because of poor data quality, low user engagement, and concerns around stigma and anonymity.
Objective:
This project aimed to improve data collection for a real-world HSV registry by identifying predictors of HSV infection and selecting a limited number of relevant questions to ask new registry users to determine their level of HSV infection risk.
Methods:
The US National Health and Nutrition Examination Survey (NHANES, 2015-16) database includes the confirmed HSV1 and HSV2 status of American participants (14-49 years) as well as a wealth of demographic and health-related data. The survey questionnaires and data files were used to build two datasets, one for HSV1 and one for HSV2. These datasets were used to train and test a Random Forest model (implemented in Python) that minimizes the number of anonymous lifestyle-based questions needed to identify risk groups for HSV.
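The general idea of using a Random Forest to shrink a questionnaire can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn and NumPy, uses synthetic yes/no answers in place of NHANES data, and the helper name `select_questions`, the question names `q0` to `q19`, and the cut-off `k=5` are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split


def select_questions(X, y, names, k, seed=0):
    """Rank questions by Random Forest importance and keep the top k."""
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    # Indices of the k most important features, most important first.
    order = np.argsort(forest.feature_importances_)[::-1][:k]
    return [names[i] for i in order], order


# Synthetic stand-in for survey answers: 1000 respondents, 20 yes/no questions.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 20))
# Hypothetical label: "positive" when at least two of questions 0-2 are "yes".
y = ((X[:, 0] + X[:, 1] + X[:, 2]) >= 2).astype(int)
names = [f"q{i}" for i in range(20)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 1: pick a reduced question set using the training split only.
kept_names, kept_idx = select_questions(X_train, y_train, names, k=5)

# Step 2: retrain on the reduced question set and score on held-out data.
reduced = RandomForestClassifier(n_estimators=200, random_state=0)
reduced.fit(X_train[:, kept_idx], y_train)
pred = reduced.predict(X_test[:, kept_idx])

accuracy = accuracy_score(y_test, pred)
recall = recall_score(y_test, pred)
print(kept_names, accuracy, recall)
```

Selecting questions on the training split and scoring on a held-out split keeps the reported accuracy and recall from being inflated by the selection step itself.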
Results:
The model selected a reduced number of questions from the NHANES questionnaire that predicted HSV infection risk with high accuracy scores of 0.91 and 0.96 and high recall scores of 0.88 and 0.98 for the HSV1 and HSV2 datasets, respectively. The number of questions was reduced from 150 to an average of 40, depending on age and gender. The model therefore predicted risk of infection well with minimal required input.
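For readers less familiar with the two metrics reported above, the sketch below shows how accuracy and recall are computed from a binary confusion matrix. The counts are invented purely for illustration and are not taken from the study; the function names are likewise hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Fraction of truly positive cases that the model flags as positive."""
    return tp / (tp + fn)


# Made-up counts for 100 hypothetical respondents (not study data).
tp, fn = 40, 10   # infected: correctly flagged / missed
fp, tn = 5, 45    # not infected: wrongly flagged / correctly cleared

acc = accuracy(tp, fp, fn, tn)   # (40 + 45) / 100
rec = recall(tp, fn)             # 40 / (40 + 10)
print(acc, rec)
```

In a risk-screening setting recall is often read alongside accuracy because a false negative means an at-risk person is not identified.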
Conclusions:
This machine-learning algorithm can be used in a real-world evidence registry to collect relevant lifestyle data and identify individuals’ levels of risk of HSV infection. A current limitation is the absence of real user data and integration with electronic medical records, which would enable model learning and improvement. Future work will explore model adjustments, anonymisation options, explicit permissions and standardised data schema that meet General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), and third-party interface connectivity requirements.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.