Using Application Programming Interfaces to Access Google Data for Health Research: Protocol for a Methodological Framework

BACKGROUND: Individuals are increasingly turning to search engines like Google to obtain health information and access resources. Analysis of Google search queries offers a novel approach, which is part of the methodological toolkit for infodemiology or infoveillance researchers, to understanding po...

Bibliographic Details
Main authors: Zepecki, Anne, Guendelman, Sylvia, DeNero, John, Prata, Ndola
Format: Online Article Text
Language: English
Published: JMIR Publications 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7381000/
https://www.ncbi.nlm.nih.gov/pubmed/32442159
http://dx.doi.org/10.2196/16543
_version_ 1783562952010366976
author Zepecki, Anne
Guendelman, Sylvia
DeNero, John
Prata, Ndola
author_facet Zepecki, Anne
Guendelman, Sylvia
DeNero, John
Prata, Ndola
author_sort Zepecki, Anne
collection PubMed
description BACKGROUND: Individuals are increasingly turning to search engines like Google to obtain health information and access resources. Analysis of Google search queries offers a novel approach, which is part of the methodological toolkit for infodemiology or infoveillance researchers, to understanding population health concerns and needs in real time or near-real time. While searches have predominantly been examined with the Google Trends website tool, newer application programming interfaces (APIs) are now available to academics to draw a richer landscape of searches. These APIs allow users to write code in languages like Python to retrieve sample data directly from Google servers.
OBJECTIVE: The purpose of this paper is to describe a novel protocol to determine the top queries, volume of queries, and the top sites reached by a population searching on the web for a specific health term. The protocol retrieves Google search data obtained from three Google APIs: Google Trends, Google Health Trends (also referred to as Flu Trends), and Google Custom Search.
METHODS: Our protocol consisted of four steps: (1) developing a master list of top search queries for an initial search term using Google Trends, (2) gathering information on relative search volume using Google Health Trends, (3) determining the most popular sites using Google Custom Search, and (4) calculating estimated total search volume. We tested the protocol following key procedures at each step and verified its usefulness by examining search traffic on birth control in 2017 in the United States. Two separate programmers working independently achieved similar results with insignificant variation due to sample variability.
RESULTS: We successfully tested the methodology on the initial search term "birth control." We identified top search queries for "birth control," of which "birth control pill" was the most popular, and obtained the relative and estimated total search volume for the top queries: relative search volume was 0.54 for the pill, corresponding to an estimated 9.3-10.7 million searches. We used the estimates of the proportion of search activity for the top queries to generate a list of the most popular websites: for the pill, the Planned Parenthood website was the top site.
CONCLUSIONS: The proposed methodological framework demonstrates how to retrieve Google query data from multiple Google APIs and provides thorough documentation required to systematically identify search queries and websites, as well as estimate relative and total search volume of queries in real time or near-real time in specific locations and time periods. Although the protocol needs further testing, it allows researchers to replicate the steps and shows promise in advancing our understanding of population-level health concerns.
INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR1-10.2196/16543
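Steps 3 and 4 of the protocol described above might be sketched in Python roughly as follows. This is an illustrative sketch only, not the authors' code. The endpoint and the `key`, `cx`, `q`, and `num` parameters belong to the public Custom Search JSON API; the function names, the proportional-scaling rule for estimated volume, and the share-weighted site scoring are assumptions made here for illustration, since the abstract does not specify the exact formulas.

```python
from urllib.parse import urlencode

# Public endpoint of the Custom Search JSON API.
CUSTOM_SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def custom_search_url(query, api_key, engine_id, num=10):
    """Build a request URL for one query (step 3). No request is sent."""
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num}
    return f"{CUSTOM_SEARCH_ENDPOINT}?{urlencode(params)}"


def estimated_volume(relative_share, total_low, total_high):
    """Step 4 under an ASSUMED proportional rule: share times a total range."""
    return relative_share * total_low, relative_share * total_high


def rank_sites(query_shares, sites_by_query):
    """ASSUMED scoring: a site's score is the summed share of the
    queries whose result lists contain it; highest score first."""
    scores = {}
    for query, share in query_shares.items():
        for site in sites_by_query.get(query, []):
            scores[site] = scores.get(site, 0.0) + share
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # All values below are hypothetical placeholders, not study data.
    print(custom_search_url("birth control pill", "YOUR_API_KEY", "YOUR_ENGINE_ID"))
    print(estimated_volume(0.5, 10.0, 20.0))
    print(rank_sites(
        {"query a": 0.5, "query b": 0.25},
        {"query a": ["site-x.example", "site-y.example"],
         "query b": ["site-y.example"]},
    ))
```

The URL builder is kept separate from any network call so that the request parameters can be inspected and logged before quota is spent; the two numeric helpers would need to be replaced with the formulas given in the full protocol.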
format Online
Article
Text
id pubmed-7381000
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-73810002020-08-06 Using Application Programming Interfaces to Access Google Data for Health Research: Protocol for a Methodological Framework Zepecki, Anne Guendelman, Sylvia DeNero, John Prata, Ndola JMIR Res Protoc Protocol BACKGROUND: Individuals are increasingly turning to search engines like Google to obtain health information and access resources. Analysis of Google search queries offers a novel approach, which is part of the methodological toolkit for infodemiology or infoveillance researchers, to understanding population health concerns and needs in real time or near-real time. While searches predominantly have been examined with the Google Trends website tool, newer application programming interfaces (APIs) are now available to academics to draw a richer landscape of searches. These APIs allow users to write code in languages like Python to retrieve sample data directly from Google servers. OBJECTIVE: The purpose of this paper is to describe a novel protocol to determine the top queries, volume of queries, and the top sites reached by a population searching on the web for a specific health term. The protocol retrieves Google search data obtained from three Google APIs: Google Trends, Google Health Trends (also referred to as Flu Trends), and Google Custom Search. METHODS: Our protocol consisted of four steps: (1) developing a master list of top search queries for an initial search term using Google Trends, (2) gathering information on relative search volume using Google Health Trends, (3) determining the most popular sites using Google Custom Search, and (4) calculating estimated total search volume. We tested the protocol following key procedures at each step and verified its usefulness by examining search traffic on birth control in 2017 in the United States. Two separate programmers working independently achieved similar results with insignificant variation due to sample variability. 
RESULTS: We successfully tested the methodology on the initial search term birth control. We identified top search queries for birth control, of which birth control pill was the most popular and obtained the relative and estimated total search volume for the top queries: relative search volume was 0.54 for the pill, corresponding to an estimated 9.3-10.7 million searches. We used the estimates of the proportion of search activity for the top queries to arrive at a generated list of the most popular websites: for the pill, the Planned Parenthood website was the top site. CONCLUSIONS: The proposed methodological framework demonstrates how to retrieve Google query data from multiple Google APIs and provides thorough documentation required to systematically identify search queries and websites, as well as estimate relative and total search volume of queries in real time or near-real time in specific locations and time periods. Although the protocol needs further testing, it allows researchers to replicate the steps and shows promise in advancing our understanding of population-level health concerns. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR1-10.2196/16543 JMIR Publications 2020-07-06 /pmc/articles/PMC7381000/ /pubmed/32442159 http://dx.doi.org/10.2196/16543 Text en ©Anne Zepecki, Sylvia Guendelman, John DeNero, Ndola Prata. Originally published in JMIR Research Protocols (http://www.researchprotocols.org), 06.07.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Research Protocols, is properly cited. The complete bibliographic information, a link to the original publication on http://www.researchprotocols.org, as well as this copyright and license information must be included.
spellingShingle Protocol
Zepecki, Anne
Guendelman, Sylvia
DeNero, John
Prata, Ndola
Using Application Programming Interfaces to Access Google Data for Health Research: Protocol for a Methodological Framework
title Using Application Programming Interfaces to Access Google Data for Health Research: Protocol for a Methodological Framework
title_full Using Application Programming Interfaces to Access Google Data for Health Research: Protocol for a Methodological Framework
title_fullStr Using Application Programming Interfaces to Access Google Data for Health Research: Protocol for a Methodological Framework
title_full_unstemmed Using Application Programming Interfaces to Access Google Data for Health Research: Protocol for a Methodological Framework
title_short Using Application Programming Interfaces to Access Google Data for Health Research: Protocol for a Methodological Framework
title_sort using application programming interfaces to access google data for health research: protocol for a methodological framework
topic Protocol
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7381000/
https://www.ncbi.nlm.nih.gov/pubmed/32442159
http://dx.doi.org/10.2196/16543
work_keys_str_mv AT zepeckianne usingapplicationprogramminginterfacestoaccessgoogledataforhealthresearchprotocolforamethodologicalframework
AT guendelmansylvia usingapplicationprogramminginterfacestoaccessgoogledataforhealthresearchprotocolforamethodologicalframework
AT denerojohn usingapplicationprogramminginterfacestoaccessgoogledataforhealthresearchprotocolforamethodologicalframework
AT pratandola usingapplicationprogramminginterfacestoaccessgoogledataforhealthresearchprotocolforamethodologicalframework