
Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning

BACKGROUND: Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real-time can provide important insight into trends in the public’s knowledge, attitudes, and beliefs surroundin...


Bibliographic Details
Main authors:	Cole-Lewis, Heather; Varghese, Arun; Sanders, Amy; Schwarz, Mary; Pugatch, Jillian; Augustson, Erik
Format:	Online Article Text
Language:	English
Published:	JMIR Publications Inc. 2015
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4642404/ https://www.ncbi.nlm.nih.gov/pubmed/26307512 http://dx.doi.org/10.2196/jmir.4392

author	Cole-Lewis, Heather Varghese, Arun Sanders, Amy Schwarz, Mary Pugatch, Jillian Augustson, Erik
collection	PubMed
description	BACKGROUND: Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real-time can provide important insight into trends in the public’s knowledge, attitudes, and beliefs surrounding e-cigarettes, and subsequently guide public health interventions.
OBJECTIVE: Our aim was to establish a supervised machine learning algorithm to build predictive classification models that assess Twitter data for a range of factors related to e-cigarettes.
METHODS: Manual content analysis was conducted for 17,098 tweets. These tweets were coded for five categories: e-cigarette relevance, sentiment, user description, genre, and theme. Machine learning classification models were then built for each of these five categories, and word groupings (n-grams) were used to define the feature space for each classifier.
RESULTS: Predictive performance scores for classification models indicated that the models correctly labeled the tweets with the appropriate variables between 68.40% and 99.34% of the time, and the percentage of maximum possible improvement over a random baseline that was achieved by the classification models ranged from 41.59% to 80.62%. Classifiers with the highest performance scores that also achieved the highest percentage of the maximum possible improvement over a random baseline were Policy/Government (performance: 0.94; % improvement: 80.62%), Relevance (performance: 0.94; % improvement: 75.26%), Ad or Promotion (performance: 0.89; % improvement: 72.69%), and Marketing (performance: 0.91; % improvement: 72.56%). The most appropriate word-grouping unit (n-gram) was 1 for the majority of classifiers. Performance continued to marginally increase with the size of the training dataset of manually annotated data, but eventually leveled off. Even at low dataset sizes of 4000 observations, performance characteristics were fairly sound.
CONCLUSIONS: Social media outlets like Twitter can uncover real-time snapshots of personal sentiment, knowledge, attitudes, and behavior that are not as accessible, at this scale, through any other offline platform. Using the vast data available through social media presents an opportunity for social science and public health methodologies to utilize computational methodologies to enhance and extend research and practice. This study was successful in automating a complex five-category manual content analysis of e-cigarette-related content on Twitter using machine learning techniques. The study details machine learning model specifications that provided the best accuracy for data related to e-cigarettes, as well as a replicable methodology to allow extension of these methods to additional topics.
format	Online Article Text
id	pubmed-4642404
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	JMIR Publications Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-4642404 2016-01-12 J Med Internet Res Original Paper JMIR Publications Inc. 2015-08-25 /pmc/articles/PMC4642404/ /pubmed/26307512 http://dx.doi.org/10.2196/jmir.4392 Text en ©Heather Cole-Lewis, Arun Varghese, Amy Sanders, Mary Schwarz, Jillian Pugatch, Erik Augustson. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 25.08.2015. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
title	Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4642404/ https://www.ncbi.nlm.nih.gov/pubmed/26307512 http://dx.doi.org/10.2196/jmir.4392
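
The description field mentions word groupings (n-grams) as classifier features and a "percentage of maximum possible improvement over a random baseline", but the record does not define either precisely. The following is a minimal Python sketch of one plausible reading, not the authors' implementation. The function names, the whitespace tokenization, the sample tweet, and the improvement formula (the share of the gap between the baseline and a perfect score of 1.0 that the model closes) are all assumptions made for illustration.

```python
from typing import List, Tuple


def word_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Return contiguous word n-grams from lowercased, whitespace-split text.

    Tokenization here is an assumption; the record does not describe it.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pct_max_improvement(performance: float, baseline: float) -> float:
    """ASSUMED definition: percent of the gap between a random baseline
    and a perfect score of 1.0 that the classifier closes."""
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must be in [0, 1)")
    return 100.0 * (performance - baseline) / (1.0 - baseline)


if __name__ == "__main__":
    # Hypothetical tweet text, not taken from the study's data.
    print(word_ngrams("Trying my new ecig today", 1))
    print(word_ngrams("Trying my new ecig today", 2))
    # Hypothetical scores chosen only to illustrate the arithmetic.
    print(pct_max_improvement(0.75, 0.50))
```

With n = 1 each feature is a single word, which matches the record's statement that an n-gram size of 1 was most appropriate for the majority of classifiers.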
