Towards Identifying and Reducing the Bias of Disease Information Extracted from Search Engine Data

The estimation of disease prevalence in online search engine data (e.g., Google Flu Trends (GFT)) has received a considerable amount of scholarly and public attention in recent years. While the utility of search engine data for disease surveillance has been demonstrated, the scientific community sti...

Bibliographic Details
Main Authors: Huang, Da-Cang; Wang, Jin-Feng; Huang, Ji-Xia; Sui, Daniel Z.; Zhang, Hong-Yan; Hu, Mao-Gui; Xu, Cheng-Dong
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4894584/
https://www.ncbi.nlm.nih.gov/pubmed/27271698
http://dx.doi.org/10.1371/journal.pcbi.1004876
author Huang, Da-Cang
Wang, Jin-Feng
Huang, Ji-Xia
Sui, Daniel Z.
Zhang, Hong-Yan
Hu, Mao-Gui
Xu, Cheng-Dong
author_sort Huang, Da-Cang
collection PubMed
description The estimation of disease prevalence in online search engine data (e.g., Google Flu Trends (GFT)) has received a considerable amount of scholarly and public attention in recent years. While the utility of search engine data for disease surveillance has been demonstrated, the scientific community still seeks ways to identify and reduce biases that are embedded in search engine data. The primary goal of this study is to explore new ways of improving the accuracy of disease prevalence estimations by combining traditional disease data with search engine data. A novel method, Biased Sentinel Hospital-based Area Disease Estimation (B-SHADE), is introduced to reduce search engine data bias from a geographical perspective. To monitor search trends on Hand, Foot and Mouth Disease (HFMD) in Guangdong Province, China, we tested our approach by selecting 11 keywords from the Baidu index platform, a Chinese big data analyst similar to GFT. The correlation between the number of real cases and the composite index was 0.8. After decomposing the composite index at the city level, we found that only 10 cities presented a correlation of close to 0.8 or higher. These cities were found to be more stable with respect to search volume, and they were selected as sample cities in order to estimate the search volume of the entire province. After the estimation, the correlation improved from 0.8 to 0.864. After fitting the revised search volume with historical cases, the mean absolute error was 11.19% lower than it was when the original search volume and historical cases were combined. To our knowledge, this is the first study to reduce search engine data bias levels through the use of rigorous spatial sampling strategies.
format Online
Article
Text
id pubmed-4894584
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4894584 2016-06-23 Towards Identifying and Reducing the Bias of Disease Information Extracted from Search Engine Data Huang, Da-Cang Wang, Jin-Feng Huang, Ji-Xia Sui, Daniel Z. Zhang, Hong-Yan Hu, Mao-Gui Xu, Cheng-Dong PLoS Comput Biol Research Article
Public Library of Science 2016-06-06 /pmc/articles/PMC4894584/ /pubmed/27271698 http://dx.doi.org/10.1371/journal.pcbi.1004876 Text en © 2016 Huang et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Towards Identifying and Reducing the Bias of Disease Information Extracted from Search Engine Data
topic Research Article
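
The abstract describes screening cities by how well their search index correlates with reported cases, keeping those with a correlation close to 0.8 or higher, and then comparing mean absolute error. The following is a minimal Python sketch of that screening idea only. It is not the B-SHADE estimator, whose weighting scheme the abstract does not describe. The function names, the strict 0.8 cut-off, and the city series are all hypothetical and synthetic, chosen purely for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sd_x * sd_y)


def select_stable_cities(index_by_city, cases_by_city, threshold=0.8):
    """Return cities whose search index correlates with cases at or above threshold.

    The 0.8 default mirrors the figure quoted in the abstract; treating it
    as a hard cut-off is an assumption of this sketch.
    """
    return sorted(
        city
        for city, index in index_by_city.items()
        if pearson(index, cases_by_city[city]) >= threshold
    )


def mean_absolute_error(predicted, observed):
    """Average absolute difference between predictions and observations."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Synthetic example data (NOT from the study).
cases = {
    "city_a": [10, 20, 30, 40],
    "city_b": [5, 15, 10, 20],
    "city_c": [8, 6, 9, 7],
}
search_index = {
    "city_a": [12, 19, 33, 41],
    "city_b": [6, 14, 11, 22],
    "city_c": [9, 9, 5, 8],
}

selected = select_stable_cities(search_index, cases)
print(selected)
```

With these synthetic series, `city_c` is screened out because its search index moves against its case counts, while the other two cities are retained as sample cities.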