Accepted for/Published in: JMIR Research Protocols
Date Submitted: Oct 8, 2019
Date Accepted: Mar 24, 2020
Date Submitted to PubMed: May 22, 2020
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Using Google Data for Health Research: Applications to Reproductive Health in the United States
ABSTRACT
Background:
Individuals are increasingly turning to search engines like Google to obtain health information and access resources. Analysis of Google search queries offers a novel approach to understanding, in real or near-real time, the sexual and reproductive health concerns and needs of populations. While searches have been examined predominantly with the Google Trends tool, newer application programming interfaces (APIs) are now available to academics to draw a richer, more systematic landscape of searches. These APIs allow users to write code in languages like Python to retrieve sample data directly from Google servers.
Objective:
The purpose of this paper is to describe the protocol for analysis of Google searches obtained from three Google APIs. We empirically tested the protocol and verified its usefulness by comparing search traffic on abortion and birth control in 2017 in the United States (US) and Mississippi (MS).
Methods:
We used the Google Trends API, the Google Health Trends (also referred to as Flu Trends) API, and the Google Custom Search API to obtain search data from Google using Python version 2.7.13. Our simulation protocol consisted of four steps: i) developing a master list of top search queries for abortion and for birth control using the publicly available Google Trends API; ii) gathering information on relative search volume using the private Health Trends API; iii) determining the most popular sites using the publicly available Custom Search API; and iv) calculating estimated total search volume for abortion and for birth control. Two programmers working independently achieved similar results, with only insignificant variation attributable to sampling variability.
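Step i can be pictured as merging several ranked query lists into one deduplicated master list. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `build_master_list`, the seed terms, and the scores are all hypothetical, and no Google API is called.

```python
def build_master_list(top_queries_by_seed, limit=10):
    """Merge per-seed ranked query lists into one deduplicated master list.

    top_queries_by_seed maps a seed term to a list of (query, score) pairs.
    This input shape is an assumption made for illustration only.
    """
    best = {}
    for pairs in top_queries_by_seed.values():
        for query, score in pairs:
            # Normalize case and whitespace so near-duplicates collapse.
            key = " ".join(query.lower().split())
            if key not in best or score > best[key]:
                best[key] = score
    # Highest score first; ties broken alphabetically for reproducibility.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [query for query, _ in ranked[:limit]]


# Made-up example input, not study data.
example = {
    "abortion": [("abortion clinic", 100), ("abortion pill", 80)],
    "abortion pill": [("Abortion  Pill", 90), ("abortion pill cost", 40)],
}
master = build_master_list(example)
```

Keeping the highest score per normalized query is one possible design choice; a study could equally sum or average scores across seed terms.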
Results:
The simulation was successful in obtaining the top search queries, relative search volume and estimated total search volume for both locations during 2017. We were able to overcome the inherent limitations of the datasets with the addition of Planned Parenthood Federation of America website data from 2017 as a baseline for estimated search volume calculations. Nonetheless, we were only able to gain access to the most popular national websites associated with the top queries and propose the use of Google Consumer Surveys to supplement API-generated data at the state level.
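One way a baseline can turn relative search volume into an estimated total is simple proportional scaling. The abstract does not state the formula used, so the sketch below is an assumption: the function name, its parameters, and the numbers are hypothetical and serve only to show the idea.

```python
def estimate_total_volume(relative_volume, baseline_relative_volume, baseline_count):
    """Scale a relative search volume to an absolute estimate via a baseline.

    Assumes (hypothetically) that absolute volume is proportional to relative
    volume, and that the baseline's absolute count is known from another source.
    """
    if baseline_relative_volume <= 0:
        raise ValueError("baseline relative volume must be positive")
    return float(relative_volume) / baseline_relative_volume * baseline_count


# Made-up numbers for illustration only, not study results.
estimate = estimate_total_volume(
    relative_volume=50.0,
    baseline_relative_volume=20.0,
    baseline_count=1000,
)
```

Under this assumption, a query with 2.5 times the baseline's relative volume is estimated at 2.5 times the baseline's known count.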
Conclusions:
The methodology proposed in this paper combines data from multiple Google APIs and provides the thorough documentation required to systematically identify top search queries and websites, and to estimate relative and total search volume of queries in real or near-real time in specific locations. This allows other researchers to replicate the methods and to advance our understanding of population-level reproductive health concerns.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.