Using Twitter to Measure Public Discussion of Diseases: A Case Study

BACKGROUND: Twitter is increasingly used to estimate disease prevalence, but such measurements can be biased, due to both biased sampling and inherent ambiguity of natural language. OBJECTIVE: We characterized the extent of these biases and how they vary with disease. METHODS: We correlated self-rep...

Bibliographic Details
Main authors: Weeg, Christopher, Schwartz, H Andrew, Hill, Shawndra, Merchant, Raina M, Arango, Catalina, Ungar, Lyle
Format: Online Article Text
Language: English
Published: JMIR Publications 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4763717/
https://www.ncbi.nlm.nih.gov/pubmed/26925459
http://dx.doi.org/10.2196/publichealth.3953
_version_ 1782417316694196224
author Weeg, Christopher
Schwartz, H Andrew
Hill, Shawndra
Merchant, Raina M
Arango, Catalina
Ungar, Lyle
author_sort Weeg, Christopher
collection PubMed
description BACKGROUND: Twitter is increasingly used to estimate disease prevalence, but such measurements can be biased, due to both biased sampling and inherent ambiguity of natural language. OBJECTIVE: We characterized the extent of these biases and how they vary with disease. METHODS: We correlated self-reported prevalence rates for 22 diseases from Experian’s Simmons National Consumer Study (n=12,305) with the number of times these diseases were mentioned on Twitter during the same period (2012). We also identified and corrected for two types of bias present in Twitter data: (1) demographic variance between US Twitter users and the general US population; and (2) natural language ambiguity, which creates the possibility that mention of a disease name may not actually refer to the disease (eg, “heart attack” on Twitter often does not refer to myocardial infarction). We measured the correlation between disease prevalence and Twitter disease mentions both with and without bias correction. This allowed us to quantify each disease’s overrepresentation or underrepresentation on Twitter, relative to its prevalence. RESULTS: Our sample included 80,680,449 tweets. Adjusting disease prevalence to correct for Twitter demographics more than doubles the correlation between Twitter disease mentions and disease prevalence in the general population (from .113 to .258, P<.001). In addition, diseases varied widely in how often mentions of their names on Twitter actually referred to the diseases, from 14.89% (3827/25,704) of instances (for stroke) to 99.92% (5044/5048) of instances (for arthritis). Applying ambiguity correction to our Twitter corpus achieves a correlation between disease mentions and prevalence of .208 (P<.001). Simultaneously applying correction for both demographics and ambiguity more than triples the baseline correlation to .366 (P<.001). Compared with prevalence rates, cancer appeared most overrepresented in Twitter, whereas high cholesterol appeared most underrepresented. CONCLUSIONS: Twitter is a potentially useful tool to measure public interest in and concerns about different diseases, but when comparing diseases, improvements can be made by adjusting for population demographics and word ambiguity.
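The figures quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: it recomputes the two relevance percentages and the correlation gains from the numbers reported above. The abstract does not say how the ambiguity correction was applied, so the `corrected_mentions` helper (scaling a raw mention count by the share of mentions that actually refer to the disease) is an assumption, and all names in the code are hypothetical.

```python
# Counts quoted in the abstract:
# (mentions that actually referred to the disease, mentions examined)
RELEVANT_COUNTS = {
    "stroke": (3827, 25704),
    "arthritis": (5044, 5048),
}

# Correlations quoted in the abstract.
BASELINE = 0.113
CORRECTED = {
    "demographics": 0.258,
    "ambiguity": 0.208,
    "both": 0.366,
}


def relevance_percent(relevant, total):
    """Share of mentions that refer to the disease, as a percentage."""
    return round(100 * relevant / total, 2)


def corrected_mentions(raw_mentions, relevant, total):
    """ASSUMPTION: one plausible correction, not confirmed by the abstract.

    Scales a raw mention count by the estimated share of relevant mentions.
    """
    return raw_mentions * relevant / total


rates = {d: relevance_percent(r, t) for d, (r, t) in RELEVANT_COUNTS.items()}
gains = {name: round(r / BASELINE, 2) for name, r in CORRECTED.items()}

print(rates)  # relevance percentages for stroke and arthritis
print(gains)  # each corrected correlation as a multiple of the baseline
```

Under these assumptions, the recomputed percentages match the reported 14.89% and 99.92%, and the gain factors (about 2.28 for demographics and 3.24 for both corrections) are consistent with the abstract's "more than doubles" and "more than triples".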
format Online
Article
Text
id pubmed-4763717
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-4763717 2016-05-25 Using Twitter to Measure Public Discussion of Diseases: A Case Study. Weeg, Christopher; Schwartz, H Andrew; Hill, Shawndra; Merchant, Raina M; Arango, Catalina; Ungar, Lyle. JMIR Public Health Surveill. Original Paper. JMIR Publications 2015-06-26 /pmc/articles/PMC4763717/ /pubmed/26925459 http://dx.doi.org/10.2196/publichealth.3953 Text en ©Christopher Weeg, H. Andrew Schwartz, Shawndra Hill, Raina M Merchant, Catalina Arango, Lyle Ungar. Originally published in JMIR Public Health and Surveillance (http://publichealth.jmir.org), 26.06.2015. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
title Using Twitter to Measure Public Discussion of Diseases: A Case Study
topic Original Paper