Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning
BACKGROUND: Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real-time can provide important insight into trends in the public’s knowledge, attitudes, and beliefs surroundin...
Main Authors: | Cole-Lewis, Heather, Varghese, Arun, Sanders, Amy, Schwarz, Mary, Pugatch, Jillian, Augustson, Erik |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications Inc. 2015 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4642404/ https://www.ncbi.nlm.nih.gov/pubmed/26307512 http://dx.doi.org/10.2196/jmir.4392 |
_version_ | 1782400360361492480 |
---|---|
author | Cole-Lewis, Heather Varghese, Arun Sanders, Amy Schwarz, Mary Pugatch, Jillian Augustson, Erik |
author_facet | Cole-Lewis, Heather Varghese, Arun Sanders, Amy Schwarz, Mary Pugatch, Jillian Augustson, Erik |
author_sort | Cole-Lewis, Heather |
collection | PubMed |
description | BACKGROUND: Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real-time can provide important insight into trends in the public’s knowledge, attitudes, and beliefs surrounding e-cigarettes, and subsequently guide public health interventions. OBJECTIVE: Our aim was to establish a supervised machine learning algorithm to build predictive classification models that assess Twitter data for a range of factors related to e-cigarettes. METHODS: Manual content analysis was conducted for 17,098 tweets. These tweets were coded for five categories: e-cigarette relevance, sentiment, user description, genre, and theme. Machine learning classification models were then built for each of these five categories, and word groupings (n-grams) were used to define the feature space for each classifier. RESULTS: Predictive performance scores for classification models indicated that the models correctly labeled the tweets with the appropriate variables between 68.40% and 99.34% of the time, and the percentage of maximum possible improvement over a random baseline that was achieved by the classification models ranged from 41.59% to 80.62%. Classifiers with the highest performance scores that also achieved the highest percentage of the maximum possible improvement over a random baseline were Policy/Government (performance: 0.94; % improvement: 80.62%), Relevance (performance: 0.94; % improvement: 75.26%), Ad or Promotion (performance: 0.89; % improvement: 72.69%), and Marketing (performance: 0.91; % improvement: 72.56%). The most appropriate word-grouping unit (n-gram) was 1 for the majority of classifiers. Performance continued to marginally increase with the size of the training dataset of manually annotated data, but eventually leveled off. Even at low dataset sizes of 4000 observations, performance characteristics were fairly sound. 
CONCLUSIONS: Social media outlets like Twitter can uncover real-time snapshots of personal sentiment, knowledge, attitudes, and behavior that are not as accessible, at this scale, through any other offline platform. Using the vast data available through social media presents an opportunity for social science and public health methodologies to utilize computational methodologies to enhance and extend research and practice. This study was successful in automating a complex five-category manual content analysis of e-cigarette-related content on Twitter using machine learning techniques. The study details machine learning model specifications that provided the best accuracy for data related to e-cigarettes, as well as a replicable methodology to allow extension of these methods to additional topics. |
format | Online Article Text |
id | pubmed-4642404 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | JMIR Publications Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-46424042016-01-12 Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning Cole-Lewis, Heather Varghese, Arun Sanders, Amy Schwarz, Mary Pugatch, Jillian Augustson, Erik J Med Internet Res Original Paper BACKGROUND: Electronic cigarettes (e-cigarettes) continue to be a growing topic among social media users, especially on Twitter. The ability to analyze conversations about e-cigarettes in real-time can provide important insight into trends in the public’s knowledge, attitudes, and beliefs surrounding e-cigarettes, and subsequently guide public health interventions. OBJECTIVE: Our aim was to establish a supervised machine learning algorithm to build predictive classification models that assess Twitter data for a range of factors related to e-cigarettes. METHODS: Manual content analysis was conducted for 17,098 tweets. These tweets were coded for five categories: e-cigarette relevance, sentiment, user description, genre, and theme. Machine learning classification models were then built for each of these five categories, and word groupings (n-grams) were used to define the feature space for each classifier. RESULTS: Predictive performance scores for classification models indicated that the models correctly labeled the tweets with the appropriate variables between 68.40% and 99.34% of the time, and the percentage of maximum possible improvement over a random baseline that was achieved by the classification models ranged from 41.59% to 80.62%. Classifiers with the highest performance scores that also achieved the highest percentage of the maximum possible improvement over a random baseline were Policy/Government (performance: 0.94; % improvement: 80.62%), Relevance (performance: 0.94; % improvement: 75.26%), Ad or Promotion (performance: 0.89; % improvement: 72.69%), and Marketing (performance: 0.91; % improvement: 72.56%). 
The most appropriate word-grouping unit (n-gram) was 1 for the majority of classifiers. Performance continued to marginally increase with the size of the training dataset of manually annotated data, but eventually leveled off. Even at low dataset sizes of 4000 observations, performance characteristics were fairly sound. CONCLUSIONS: Social media outlets like Twitter can uncover real-time snapshots of personal sentiment, knowledge, attitudes, and behavior that are not as accessible, at this scale, through any other offline platform. Using the vast data available through social media presents an opportunity for social science and public health methodologies to utilize computational methodologies to enhance and extend research and practice. This study was successful in automating a complex five-category manual content analysis of e-cigarette-related content on Twitter using machine learning techniques. The study details machine learning model specifications that provided the best accuracy for data related to e-cigarettes, as well as a replicable methodology to allow extension of these methods to additional topics. JMIR Publications Inc. 2015-08-25 /pmc/articles/PMC4642404/ /pubmed/26307512 http://dx.doi.org/10.2196/jmir.4392 Text en ©Heather Cole-Lewis, Arun Varghese, Amy Sanders, Mary Schwarz, Jillian Pugatch, Erik Augustson. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 25.08.2015. https://creativecommons.org/licenses/by/2.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. 
The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Cole-Lewis, Heather Varghese, Arun Sanders, Amy Schwarz, Mary Pugatch, Jillian Augustson, Erik Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning |
title | Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning |
title_full | Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning |
title_fullStr | Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning |
title_full_unstemmed | Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning |
title_short | Assessing Electronic Cigarette-Related Tweets for Sentiment and Content Using Supervised Machine Learning |
title_sort | assessing electronic cigarette-related tweets for sentiment and content using supervised machine learning |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4642404/ https://www.ncbi.nlm.nih.gov/pubmed/26307512 http://dx.doi.org/10.2196/jmir.4392 |
work_keys_str_mv | AT colelewisheather assessingelectroniccigaretterelatedtweetsforsentimentandcontentusingsupervisedmachinelearning AT varghesearun assessingelectroniccigaretterelatedtweetsforsentimentandcontentusingsupervisedmachinelearning AT sandersamy assessingelectroniccigaretterelatedtweetsforsentimentandcontentusingsupervisedmachinelearning AT schwarzmary assessingelectroniccigaretterelatedtweetsforsentimentandcontentusingsupervisedmachinelearning AT pugatchjillian assessingelectroniccigaretterelatedtweetsforsentimentandcontentusingsupervisedmachinelearning AT augustsonerik assessingelectroniccigaretterelatedtweetsforsentimentandcontentusingsupervisedmachinelearning |
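
The abstract in this record mentions two ideas that a short sketch may make more concrete: word groupings (n-grams) as classifier features, and the "percentage of maximum possible improvement over a random baseline". The Python below is only an illustration, not the authors' implementation. The function names are hypothetical, tokenization by whitespace is an assumption, and the improvement formula, (performance − baseline) / (1 − baseline), is an assumed reading of the phrase, since the record does not define it.

```python
from collections import Counter


def word_ngrams(text, n=1):
    """Return the word n-grams of `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=1):
    """Count all n-grams from size 1 up to `max_n`; a simple bag-of-n-grams."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


def pct_of_max_improvement(performance, baseline):
    """Share of the gap between a baseline score and a perfect score (1.0)
    that a classifier closes, as a percentage.

    ASSUMPTION: this formula is one plausible reading of the abstract's
    wording; the catalog record does not state the definition used.
    """
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must be in [0, 1)")
    return 100.0 * (performance - baseline) / (1.0 - baseline)


if __name__ == "__main__":
    # Hypothetical tweet text, for illustration only.
    print(ngram_features("trying my new e-cig today", max_n=2))
    # Hypothetical scores, not taken from the study.
    print(pct_of_max_improvement(0.94, 0.70))
```

With `max_n=1` the feature space contains single words only, which corresponds to the n-gram size of 1 that the abstract reports as most appropriate for the majority of classifiers.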