JMIR Preprints #25560: Machine Learning for Risk Group Identification and User Data Collection, a Case-Study of Herpes Simplex Virus Patient Registry: Algorithm Development and Validation

Svitlana Surodina;
Ching Lam;
Svetislav Grbich;
Madison Milne-Ives;
Michelle van Velthoven;
Edward Meinert

ABSTRACT

Background:

Conducting research about people with herpes simplex virus (HSV) is challenging because of poor data quality, low user engagement, and concerns around stigma and anonymity.

Objective:

This project aimed to improve data collection for a real-world HSV registry by identifying predictors of HSV infection and selecting a limited number of relevant questions to ask new registry users to determine their level of HSV infection risk.

Methods:

The US National Health and Nutrition Examination Survey (NHANES, 2015-2016) database includes the confirmed HSV1 and HSV2 status of American participants aged 14-49 years, as well as a wealth of demographic and health-related data. The questionnaires and data from this survey were used to form two datasets, one for HSV1 and one for HSV2. These datasets were used to train and test a Random Forest model (implemented in Python) that minimizes the number of anonymous lifestyle-based questions needed to identify risk groups for HSV.
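As an illustration of the general approach described above, the following is a minimal sketch of using Random Forest feature importances to shortlist questions. It is not the authors' code: it assumes scikit-learn and NumPy are available, uses synthetic yes/no answers rather than NHANES data, and names such as `n_questions`, `top_k`, and `selected` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for questionnaire data (NOT NHANES): 2000 respondents,
# 20 yes/no questions. By construction only questions 0 and 1 carry signal.
rng = np.random.default_rng(0)
n_respondents, n_questions = 2000, 20
X = rng.integers(0, 2, size=(n_respondents, n_questions))
y = (X[:, 0] | X[:, 1]).astype(int)

# Flip 5% of the labels to imitate noisy real-world outcomes.
flip = rng.random(n_respondents) < 0.05
y = np.where(flip, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 1: fit a forest on all questions and rank them by importance.
full_model = RandomForestClassifier(n_estimators=200, random_state=0)
full_model.fit(X_train, y_train)
ranking = np.argsort(full_model.feature_importances_)[::-1]

# Step 2: keep only the top_k questions (top_k is an arbitrary choice here).
top_k = 5
selected = sorted(ranking[:top_k].tolist())

# Step 3: refit on the reduced questionnaire and check held-out accuracy.
reduced_model = RandomForestClassifier(n_estimators=200, random_state=0)
reduced_model.fit(X_train[:, selected], y_train)
reduced_accuracy = reduced_model.score(X_test[:, selected], y_test)

print("Selected question indices:", selected)
print("Held-out accuracy on reduced questionnaire:", round(reduced_accuracy, 3))
```

In a real registry the cut-off would presumably be tuned, for example per age and gender group, rather than fixed at a single value as in this sketch.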

Results:

The model selected a reduced number of questions from the NHANES questionnaire that predicted HSV infection risk with high accuracy scores of 0.91 and 0.96 and high recall scores of 0.88 and 0.98 for HSV1 and HSV2 datasets, respectively. The number of questions was reduced from 150 to an average of 40, depending on age and gender. The model therefore provided high predictability of risk of infection with minimal required input.
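For readers less familiar with the two metrics reported above, the sketch below shows how accuracy and recall follow from confusion-matrix counts. The counts are invented for illustration and are not taken from the study; the function name `accuracy_and_recall` is hypothetical.

```python
def accuracy_and_recall(tp: int, fp: int, tn: int, fn: int) -> tuple[float, float]:
    """Return (accuracy, recall) from confusion-matrix counts.

    accuracy = (TP + TN) / all predictions
    recall   = TP / (TP + FN), the share of truly positive cases that were found
    """
    total = tp + fp + tn + fn
    if total == 0 or (tp + fn) == 0:
        raise ValueError("need at least one sample and one positive case")
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    return accuracy, recall


# Invented counts: 50 truly positive and 50 truly negative respondents.
acc, rec = accuracy_and_recall(tp=45, fp=10, tn=40, fn=5)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

Recall matters in a screening context because a false negative means an at-risk respondent is not flagged, whereas accuracy alone can look high when one class dominates.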

Conclusions:

This machine-learning algorithm can be used in a real-world evidence registry to collect relevant lifestyle data and identify individuals’ levels of risk of HSV infection. A current limitation is the absence of real user data and integration with electronic medical records, which would enable model learning and improvement. Future work will explore model adjustments, anonymisation options, explicit permissions and standardised data schema that meet General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), and third-party interface connectivity requirements.

Citation

Please cite as:

Surodina S, Lam C, Grbich S, Milne-Ives M, van Velthoven M, Meinert E

Machine Learning for Risk Group Identification and User Data Collection in a Herpes Simplex Virus Patient Registry: Algorithm Development and Validation Study

JMIRx Med 2021;2(2):e25560

DOI: 10.2196/25560

PMID: 37725536

PMCID: 10414389

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIRx Med

Date Submitted: Nov 6, 2020

Open Peer Review Period: Nov 6, 2020 - Nov 12, 2020

Date Accepted: Mar 12, 2021

Date Submitted to PubMed: Sep 19, 2023
