Content Analysis of Syndromic Twitter Data

OBJECTIVE: We present an annotation scheme developed to analyze syndromic Twitter data, and the results of its application to a set of respiratory syndrome-related tweets [1]. The scheme was designed to differentiate true positive tweets (where an individual is experiencing respiratory symptoms) fro...

Bibliographic Details
Main Authors: Keffala, Bethany, Conway, Mike, Doan, Son, Collier, Nigel
Format: Online Article Text
Language: English
Published: University of Illinois at Chicago Library 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3692812/
author Keffala, Bethany
Conway, Mike
Doan, Son
Collier, Nigel
collection PubMed
description OBJECTIVE: We present an annotation scheme developed to analyze syndromic Twitter data, and the results of its application to a set of respiratory syndrome-related tweets [1]. The scheme was designed to differentiate true positive tweets (where an individual is experiencing respiratory symptoms) from false positive tweets (where an individual is not experiencing respiratory symptoms), and to quantify more fine-grained information within the data.
INTRODUCTION: The popularity of Twitter, a social-networking service, creates the opportunity for researchers to collect large amounts of free, localizable data in real time. The data take the form of short, user-written messages and have been employed for general syndromic surveillance [2] and for surveillance of public attitudes toward the H1N1 flu outbreak [3]. The accessibility of tweets in real time makes them particularly appropriate for use in early warning systems. Data collected through keyword search contain a significant amount of noise; however, annotation can help boost the signal for true positive tweets.
METHODS: The annotation scheme was developed based on information relevant for early warning systems (e.g. who is experiencing symptoms, and when) as well as other information present in the tweets (e.g. aspirations regarding symptoms, or abuse of substances such as cough syrup). Categories included Experiencer: Self/Other, Temporality: Current/Non-Current, Sentiment: Positive/Negative, Information: Providing/Seeking, Language: Non-English, Aspiration, Hyperbole, and Substance Abuse. All categories except Language and Substance Abuse were defined in reference to diseases or symptoms. The scheme was applied to 1,100 respiratory syndrome-related tweets (544 false positive, 556 true positive) from a previously collected corpus of syndromic Twitter data [2]. Inter-annotator agreement was calculated for approximately 9% of the data (100 tweets).
RESULTS: Inter-annotator agreement was generally good, although certain categories had lower scores. The categories for Experiencer, Temporality, Sentiment: Negative, Information: Providing, and Language all had Kappa values above .9; Sentiment: Positive, Aspiration, and Substance Abuse had Kappa values above .7; and Information: Seeking and Hyperbole had Kappa values above .6. There was good separation between true positive and false positive tweets, especially for the Experiencer: Self, Temporality: Current, Sentiment: Negative, Aspiration, Hyperbole, and Substance Abuse categories (see Table). True positive tweets were more likely than false positive tweets to belong to every category except Information: Providing and Substance Abuse, for which false positive tweets were more likely to be included. Within the true positive data, users were more likely to reference symptoms that they themselves were currently experiencing than to reference another person’s symptoms or non-current symptoms. Sentiment was largely negative, and there was significant use of aspiration and hyperbole.
CONCLUSIONS: Future work will apply the scheme to other syndromes, including constitutional, gastrointestinal, neurological, rash, and hemorrhagic.
format Online
Article
Text
id pubmed-3692812
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher University of Illinois at Chicago Library
record_format MEDLINE/PubMed
spelling pubmed-3692812 2013-06-26 Content Analysis of Syndromic Twitter Data Keffala, Bethany Conway, Mike Doan, Son Collier, Nigel Online J Public Health Inform ISDS 2012 Conference Abstracts University of Illinois at Chicago Library 2013-04-04 /pmc/articles/PMC3692812/ Text en ©2013 the author(s) http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/ojphi/about/submissions#copyrightNotice
This is an Open Access article. Authors own copyright of their articles appearing in the Online Journal of Public Health Informatics. Readers may copy articles without permission of the copyright owner(s), as long as the author and OJPHI are acknowledged in the copy and the copy is used for educational, not-for-profit purposes.
title Content Analysis of Syndromic Twitter Data
topic ISDS 2012 Conference Abstracts
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3692812/