Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning

BACKGROUND: Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real-time can provide important insight into trends in the public’s knowledge, attitudes, and beliefs surroundin...

Bibliographic Details
Main authors: Cole-Lewis, Heather, Varghese, Arun, Sanders, Amy, Schwarz, Mary, Pugatch, Jillian, Augustson, Erik
Format: Online Article Text
Language: English
Published: JMIR Publications Inc. 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4642404/
https://www.ncbi.nlm.nih.gov/pubmed/26307512
http://dx.doi.org/10.2196/jmir.4392
author Cole-Lewis, Heather
Varghese, Arun
Sanders, Amy
Schwarz, Mary
Pugatch, Jillian
Augustson, Erik
collection PubMed
description BACKGROUND: Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real-time can provide important insight into trends in the public’s knowledge, attitudes, and beliefs surrounding e-cigarettes, and subsequently guide public health interventions.
OBJECTIVE: Our aim was to establish a supervised machine learning algorithm to build predictive classification models that assess Twitter data for a range of factors related to e-cigarettes.
METHODS: Manual content analysis was conducted for 17,098 tweets. These tweets were coded for five categories: e-cigarette relevance, sentiment, user description, genre, and theme. Machine learning classification models were then built for each of these five categories, and word groupings (n-grams) were used to define the feature space for each classifier.
RESULTS: Predictive performance scores for classification models indicated that the models correctly labeled the tweets with the appropriate variables between 68.40% and 99.34% of the time, and the percentage of maximum possible improvement over a random baseline that was achieved by the classification models ranged from 41.59% to 80.62%. Classifiers with the highest performance scores that also achieved the highest percentage of the maximum possible improvement over a random baseline were Policy/Government (performance: 0.94; % improvement: 80.62%), Relevance (performance: 0.94; % improvement: 75.26%), Ad or Promotion (performance: 0.89; % improvement: 72.69%), and Marketing (performance: 0.91; % improvement: 72.56%). The most appropriate word-grouping unit (n-gram) was 1 for the majority of classifiers. Performance continued to marginally increase with the size of the training dataset of manually annotated data, but eventually leveled off. Even at low dataset sizes of 4000 observations, performance characteristics were fairly sound.
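The results above report a "percentage of maximum possible improvement over a random baseline", but this record does not give the formula. One plausible reading, assumed here and not confirmed by the source, is the share of the gap between the baseline and a perfect score of 1.0 that the classifier closes. The function name and the example numbers below are hypothetical.

```python
def pct_of_max_improvement(performance, baseline):
    """Percentage of the gap between a random baseline and a perfect
    score (1.0) that a classifier closes. Assumed formula:
    100 * (performance - baseline) / (1 - baseline)."""
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must be in [0, 1)")
    return 100.0 * (performance - baseline) / (1.0 - baseline)


# Hypothetical numbers for illustration, not taken from the study.
example = pct_of_max_improvement(performance=0.90, baseline=0.50)
print(f"{example:.2f}%")  # prints 80.00%
```

Under this reading, two classifiers with the same raw performance can show different improvement percentages when their random baselines differ, which would be consistent with Policy/Government and Relevance both scoring 0.94 yet reporting 80.62% and 75.26%.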
CONCLUSIONS: Social media outlets like Twitter can uncover real-time snapshots of personal sentiment, knowledge, attitudes, and behavior that are not as accessible, at this scale, through any other offline platform. Using the vast data available through social media presents an opportunity for social science and public health methodologies to utilize computational methodologies to enhance and extend research and practice. This study was successful in automating a complex five-category manual content analysis of e-cigarette-related content on Twitter using machine learning techniques. The study details machine learning model specifications that provided the best accuracy for data related to e-cigarettes, as well as a replicable methodology to allow extension of these methods to additional topics.
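As a rough illustration of the approach the abstract describes (word n-grams as the feature space for a supervised classifier), the sketch below trains a small multinomial naive Bayes model in plain Python. The record does not name the learning algorithm the authors used, so naive Bayes is a stand-in chosen for brevity; the class name, the labels, and the toy tweets are all invented for this example.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n=1):
    """Lowercase the text, split on whitespace, and return word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over n-gram counts."""

    def __init__(self, n=1):
        self.n = n
        self.label_counts = Counter()               # documents per label
        self.feature_counts = defaultdict(Counter)  # n-gram counts per label
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.label_counts[label] += 1
            for gram in ngrams(text, self.n):
                self.feature_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(doc_count / total_docs)
            for gram in ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data; the study used 17,098 manually coded tweets.
train_texts = [
    "new vape shop sale today",
    "buy e-cig starter kit discount",
    "city council votes on e-cig ban",
    "fda proposes new vape rules",
]
train_labels = ["marketing", "marketing", "policy", "policy"]

model = NaiveBayes(n=1).fit(train_texts, train_labels)
print(model.predict("vape kit sale"))
print(model.predict("council votes on ban"))
```

Setting `n=1` mirrors the abstract's finding that unigrams were the most appropriate unit for most classifiers; passing `n=2` would switch the feature space to bigrams without other changes.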
format Online
Article
Text
id pubmed-4642404
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher JMIR Publications Inc.
record_format MEDLINE/PubMed
spelling pubmed-4642404 2016-01-12 J Med Internet Res Original Paper JMIR Publications Inc. 2015-08-25 /pmc/articles/PMC4642404/ /pubmed/26307512 http://dx.doi.org/10.2196/jmir.4392 Text en ©Heather Cole-Lewis, Arun Varghese, Amy Sanders, Mary Schwarz, Jillian Pugatch, Erik Augustson. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 25.08.2015. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
topic Original Paper