JMIR Preprints #42721: Estimating Rare Diseases Incidences with Large-scale Internet Search Data: Two-step Machine Learning Method


Estimating Rare Diseases Incidences with Large-scale Internet Search Data: Two-step Machine Learning Method

Jiayu Li;
Zhiyu He;
Min Zhang;
Weizhi Ma;
Ye Jin;
Lei Zhang;
Shuyang Zhang;
Yiqun Liu;
Shaoping Ma

ABSTRACT

Background:

As rare diseases (RDs) receive increasing attention, obtaining accurate RD incidences has become an essential concern in public health. However, because RDs are difficult to diagnose, include diverse types, and have few cases, traditional epidemiological methods are costly to apply in RD registries. With the development of the internet, users have become accustomed to searching for disease-related information through search engines before seeking medical treatment. Online search data therefore provide a new way to estimate RD incidences.

Objective:

This study aimed to estimate the incidences of multiple RDs in distinct regions of China using online search data.

Methods:

Our study covered 15 RDs in China from 2016 to 2019. The online search data were obtained from Sogou, one of the top 3 commercial search engines in China. By matching multilevel keywords related to the 15 RDs over the 4 years, we retrieved keyword-matched RD-related queries. The queries issued within a period before and after each keyword-matched query formed RD-related search sessions. We then proposed a two-step method to estimate RD incidences from the user intents conveyed by the sessions. In the first step, a combination of long short-term memory (LSTM) and multilayer perceptron (MLP) networks was used to predict whether the intent of each search session was RD-concerned, News-concerned, or Others. In the second step, a linear regression (LR) model estimated the incidences of multiple RDs in distinct regions based on the numbers of RD-concerned and News-concerned sessions. For evaluation, the estimated incidences were compared with RD incidences collected from China’s national multicenter clinical database of rare diseases. Root mean square error (RMSE) and relative error rate (RER) were used as the evaluation metrics.
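The second step and the two evaluation metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation. The session counts and incidences below are made-up toy numbers, the function names (fit_linear, predict, rmse, rer) are hypothetical, and the RER formula used here (mean of |estimate - truth| / truth) is an assumption, since the abstract does not define it.

```python
from math import sqrt

def fit_linear(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    rows = [[1.0] + [float(v) for v in x] for x in X]  # prepend intercept column
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(n):
        p = max(range(c, n), key=lambda k: abs(M[k][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for k in range(n):
            if k != c:
                f = M[k][c]
                M[k] = [a - f * b for a, b in zip(M[k], M[c])]
    return [M[i][n] for i in range(n)]  # [intercept, w_rd, w_news]

def predict(coef, X):
    return [coef[0] + sum(w * v for w, v in zip(coef[1:], x)) for x in X]

def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rer(y_true, y_pred):
    """Relative error rate; ASSUMED definition: mean of |pred - true| / true."""
    return sum(abs(p - t) / t for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy inputs (NOT study data): per (disease, region) pair, the number of
# RD-concerned and News-concerned sessions, and a reference incidence.
X = [(120, 10), (80, 40), (200, 5), (50, 2), (150, 60)]
y = [0.125, 0.07, 0.2075, 0.059, 0.13]

coef = fit_linear(X, y)
estimates = predict(coef, X)
print("coefficients:", coef)
print("RMSE:", rmse(y, estimates), "RER:", rer(y, estimates))
```

In this toy setup the News-concerned count receives a negative weight, which is one way a regression could discount search volume driven by news coverage rather than by actual cases.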

Results:

The RD-related online data included 2,749,257 queries and 1,769,986 sessions from 1,380,186 users from 2016 to 2019. The best LR model with sessions as input estimated RD incidences with an RMSE of 0.017 (CI=[0.016,0.017]) and an RER of 0.365 (CI=[0.341,0.388]). The best LR model with queries as input achieved an RMSE of 0.023 (CI=[0.017,0.029]) and an RER of 0.511 (CI=[0.377,0.645]). Compared with queries, using session intents reduced the RER by 28.57% (P=.012). Analysis of different RDs and regions showed that session input was more suitable for estimating the incidences of most diseases (14 of 15 RDs). Moreover, examples for two RDs showed that News-concerned session intents reflected outbreak news and helped correct the overestimation of incidences. Experiments on RD types further indicated that RD type had no significant influence on the RD estimation task.
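The reported 28.57% figure follows from the two RER values above as a relative decrease, (0.511 - 0.365) / 0.511. A short Python check, with a hypothetical helper name:

```python
def relative_decrease(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

rer_queries = 0.511   # best LR model with queries as input
rer_sessions = 0.365  # best LR model with sessions as input

decrease_pct = 100 * relative_decrease(rer_queries, rer_sessions)
print(f"{decrease_pct:.2f}%")  # 28.57%
```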

Conclusions:

This work sheds light on a novel way of quickly estimating RD incidences in the internet era, and search session intents were especially helpful for the estimation. The two-step estimation method in this study could be a valuable supplement to traditional registries for understanding RDs, planning policies, and allocating medical resources. The use of search sessions in disease detection and estimation could also be transferred to other epidemics or chronic diseases.

Citation

Please cite as:

Li J, He Z, Zhang M, Ma W, Jin Y, Zhang L, Zhang S, Liu Y, Ma S

Estimating Rare Disease Incidences With Large-scale Internet Search Data: Development and Evaluation of a Two-step Machine Learning Method

JMIR Infodemiology 2023;3:e42721

DOI: 10.2196/42721

PMCID: 10182453


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Infodemiology

Date Submitted: Sep 15, 2022

Open Peer Review Period: Sep 15, 2022 - Sep 29, 2022

Date Accepted: Mar 27, 2023

