Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study
BACKGROUND: As the process of producing official health statistics for lifestyle diseases is slow, researchers have explored using Web search data as a proxy for lifestyle disease surveillance. Existing studies, however, are prone to at least one of the following issues: ad-hoc keyword selection, ov...
Main Authors: | Memon, Shahan Ali; Razak, Saquib; Weber, Ingmar |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7011125/ https://www.ncbi.nlm.nih.gov/pubmed/32012050 http://dx.doi.org/10.2196/13347 |
_version_ | 1783496010378510336 |
---|---|
author | Memon, Shahan Ali Razak, Saquib Weber, Ingmar |
author_facet | Memon, Shahan Ali Razak, Saquib Weber, Ingmar |
author_sort | Memon, Shahan Ali |
collection | PubMed |
description | BACKGROUND: As the process of producing official health statistics for lifestyle diseases is slow, researchers have explored using Web search data as a proxy for lifestyle disease surveillance. Existing studies, however, are prone to at least one of the following issues: ad hoc keyword selection, overfitting, insufficient predictive evaluation, lack of generalization, and failure to compare against trivial baselines. OBJECTIVE: The aims of this study were to (1) employ a corrective approach improving previous methods; (2) study the key limitations in using Google Trends for lifestyle disease surveillance; and (3) test the generalizability of our methodology to other countries beyond the United States. METHODS: For each of the target variables (diabetes, obesity, and exercise), prevalence rates were collected. After a rigorous keyword selection process, data from Google Trends were collected. These data were denormalized to form spatio-temporal indices. L1-regularized regression models were trained to predict prevalence rates from denormalized Google Trends indices. Models were tested on a held-out set and compared against baselines from the literature as well as a trivial "last year equals this year" baseline. A similar analysis was done using a multivariate spatio-temporal model where the previous year's prevalence was included as a covariate. This model was modified to create a time-lagged regression analysis framework. Finally, a hierarchical time-lagged multivariate spatio-temporal model was created to account for subnational trends in the data. The model trained on US data was then applied in a transfer learning framework to Canada. RESULTS: In the US context, our proposed models beat the performance of prior work as well as the trivial baselines. In terms of the mean absolute error (MAE), the best of our proposed models yields 24% improvement (0.72-0.55; P<.001) for diabetes; 18% improvement (1.20-0.99; P=.001) for obesity; and 34% improvement (2.89-1.95; P<.001) for exercise. Our proposed across-country transfer learning framework also shows promising results with an average Spearman and Pearson correlation of 0.70 for diabetes and 0.90 and 0.91 for obesity, respectively. CONCLUSIONS: Although our proposed models beat the baselines, we find the modeling of lifestyle diseases to be a challenging problem, one that requires an abundance of data as well as creative modeling strategies. In doing so, this study shows a low-to-moderate validity of Google Trends in the context of lifestyle disease surveillance, even when applying novel corrective approaches, including a proposed denormalization scheme. We envision qualitative analyses to be a more practical use of Google Trends in the context of lifestyle disease surveillance. For the quantitative analyses, the highest utility of using Google Trends is in the context of transfer learning where low-resource countries could benefit from high-resource countries by using proxy models. |
format | Online Article Text |
id | pubmed-7011125 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-7011125 2020-03-05 Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study Memon, Shahan Ali Razak, Saquib Weber, Ingmar J Med Internet Res Original Paper JMIR Publications 2020-01-27 /pmc/articles/PMC7011125/ /pubmed/32012050 http://dx.doi.org/10.2196/13347 Text en ©Shahan Ali Memon, Saquib Razak, Ingmar Weber. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 27.01.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Memon, Shahan Ali Razak, Saquib Weber, Ingmar Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study |
title | Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study |
title_full | Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study |
title_fullStr | Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study |
title_full_unstemmed | Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study |
title_short | Lifestyle Disease Surveillance Using Population Search Behavior: Feasibility Study |
title_sort | lifestyle disease surveillance using population search behavior: feasibility study |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7011125/ https://www.ncbi.nlm.nih.gov/pubmed/32012050 http://dx.doi.org/10.2196/13347 |
work_keys_str_mv | AT memonshahanali lifestylediseasesurveillanceusingpopulationsearchbehaviorfeasibilitystudy AT razaksaquib lifestylediseasesurveillanceusingpopulationsearchbehaviorfeasibilitystudy AT weberingmar lifestylediseasesurveillanceusingpopulationsearchbehaviorfeasibilitystudy |
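
The abstract in this record evaluates models by mean absolute error (MAE) against a trivial "last year equals this year" baseline and reports relative improvements. The sketch below is only an illustration of that evaluation arithmetic, not the authors' code: the function names and the toy prevalence series are hypothetical, and it assumes each quoted pair such as "0.72-0.55" means baseline MAE followed by best-model MAE.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def persistence_forecast(series: Sequence[float]) -> list[float]:
    """'Last year equals this year': predict year t with the value from year t-1."""
    return list(series[:-1])


def relative_improvement(baseline_mae: float, model_mae: float) -> float:
    """Fractional MAE reduction of a model relative to a baseline."""
    return (baseline_mae - model_mae) / baseline_mae


# Hypothetical yearly prevalence (percent) for one region; NOT data from the study.
toy_prevalence = [9.0, 9.5, 9.5, 10.5]
baseline_mae = mean_absolute_error(
    toy_prevalence[1:], persistence_forecast(toy_prevalence)
)

# MAE pairs as quoted in the abstract, read here as (baseline, best model).
reported = {
    "diabetes": (0.72, 0.55),
    "obesity": (1.20, 0.99),
    "exercise": (2.89, 1.95),
}
improvements = {
    name: relative_improvement(base, model) for name, (base, model) in reported.items()
}

if __name__ == "__main__":
    print(f"toy persistence MAE: {baseline_mae:.3f}")
    for name, value in improvements.items():
        print(f"{name}: {value:.1%}")
```

Under that reading, the rounded MAEs give reductions of about 23.6% for diabetes and 17.5% for obesity, which agree with the quoted 24% and 18% after rounding. The exercise pair gives about 32.5% rather than the quoted 34%; the record does not say how the percentages were derived, so they may have been computed from other values.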