Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jan 9, 2019
Open Peer Review Period: Jan 11, 2019 - Mar 8, 2019
Date Accepted: Nov 29, 2019
Lifestyle Disease Surveillance Using Population Search Behavior: A Feasibility Study
ABSTRACT
Background:
As the process of producing official health statistics for lifestyle diseases is slow, the scientific community has explored using web search data from Google Trends as a non-traditional real-time data source for lifestyle disease surveillance. Existing studies, however, are prone to at least one of the following issues: ad-hoc keyword selection, overfitting, insufficient predictive evaluation, lack of generalization, and failure to compare against trivial baselines.
Objective:
The aims of this study are (1) to review key past literature on lifestyle disease surveillance using Google Trends; (2) to employ a spatio-temporal corrective approach that improves on previous methods; (3) to study the key limitations of using Google Trends for lifestyle disease surveillance; and (4) to test the generalizability of our methodology to countries beyond the U.S.
Methods:
Diabetes, obesity, exercise, and suicide rates were chosen as target variables. For each of these, prevalence rates were collected. After a rigorous keyword selection process, data from Google Trends were collected. These data were de-normalized to form spatio-temporal indices. L1-regularized regression models were trained to predict prevalence rates from the de-normalized Google Trends indices. Models were tested on a held-out set and compared against baselines from the literature as well as a trivial “last year equals this year” baseline. Furthermore, a similar analysis was done using a time-lagged regression framework in which the previous year’s prevalence was included as a covariate. The model trained on U.S. data was then applied to Canada in a transfer learning framework.
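The comparison described above, an L1-regularized regression evaluated against the trivial “last year equals this year” baseline, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the solver, the penalty strength, or the data layout, so the coordinate-descent routine, the value of `alpha`, the synthetic data, and all function and variable names below are assumptions made purely for illustration.

```python
import math
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty (the L1 proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def predict(intercept, weights, row):
    """Linear prediction for one region."""
    return intercept + sum(w * x for w, x in zip(weights, row))


def fit_lasso(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)*||y - b0 - Xw||^2 + alpha*||w||_1."""
    n, p = len(X), len(X[0])
    weights = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                partial = predict(intercept, weights, X[i]) - weights[j] * X[i][j]
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            weights[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
        # The intercept is unpenalized: set it to the mean residual.
        intercept = sum(y[i] - predict(0.0, weights, X[i]) for i in range(n)) / n
    return intercept, weights


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


if __name__ == "__main__":
    random.seed(0)
    n_regions = 40
    # Synthetic, slow-moving prevalence: this year is last year plus small noise.
    last_year = [random.uniform(8.0, 12.0) for _ in range(n_regions)]
    this_year = [v + random.gauss(0.0, 0.2) for v in last_year]
    # Hypothetical search indices: one noisy signal and two irrelevant features.
    search = [[v + random.gauss(0.0, 1.0), random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)]
              for v in this_year]

    train, test = range(0, 30), range(30, n_regions)
    b0, w = fit_lasso([search[i] for i in train], [this_year[i] for i in train], alpha=0.05)

    model_pred = [predict(b0, w, search[i]) for i in test]
    baseline_pred = [last_year[i] for i in test]  # "last year equals this year"
    actual = [this_year[i] for i in test]

    print("Lasso RMSE:   ", rmse(model_pred, actual))
    print("Baseline RMSE:", rmse(baseline_pred, actual))
```

Under the same assumptions, the time-lagged variant would correspond to appending `last_year[i]` as an extra column of each feature row before fitting.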
Results:
We find low validity in using search behavior for lifestyle disease surveillance. In the U.S. context, our proposed models improve significantly over prior work, with an average improvement of 32% in RMSE and 9.5% in the Spearman correlation coefficient. However, almost all models, including those of prior work, fail to beat the trivial baseline. As a positive result, the proposed across-country transfer learning framework shows promising results, with correlation coefficients of 0.77 for diabetes and 0.82 for obesity.
Conclusions:
Prior studies fail to compare their results against a trivial “last year is the same as this year” baseline. When this comparison is made, this study finds low validity of Google Trends for lifestyle disease surveillance, even when novel corrective approaches are applied, including a proposed de-normalization scheme. While Google Trends may not be a feasible tool for quantitative analysis of slow-moving trends, it can still be used for qualitative analysis. As an example, we find that searches for guns are spatially correlated with suicide rates. For quantitative analysis, the highest utility of Google Trends is in transfer learning, where low-resource countries could benefit from high-resource countries by using proxy models.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.