
Machine Learning Classifiers for Twitter Surveillance of Vaping: Comparative Machine Learning Study

BACKGROUND: Twitter presents a valuable and relevant social media platform to study the prevalence of information and sentiment on vaping that may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize sentiments in them can under...


Bibliographic Details
Main authors:	Visweswaran, Shyam, Colditz, Jason B, O’Halloran, Patrick, Han, Na-Rae, Taneja, Sanya B, Welling, Joel, Chu, Kar-Hai, Sidani, Jaime E, Primack, Brian A
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2020
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7450367/ https://www.ncbi.nlm.nih.gov/pubmed/32784184 http://dx.doi.org/10.2196/17478

collection	PubMed
description	BACKGROUND: Twitter presents a valuable and relevant social media platform to study the prevalence of information and sentiment on vaping that may be useful for public health surveillance. Machine learning classifiers that identify vaping-relevant tweets and characterize sentiments in them can underpin a Twitter-based vaping surveillance system. Compared with traditional machine learning classifiers, which rely on annotations that are expensive to obtain, deep learning classifiers offer the advantage of requiring fewer annotated tweets by leveraging the large numbers of readily available unannotated tweets.
OBJECTIVE: This study aims to derive and evaluate traditional and deep learning classifiers that can identify tweets relevant to vaping, tweets of a commercial nature, and tweets with provape sentiments.
METHODS: We continuously collected tweets that matched vaping-related keywords over 2 months from August 2018 to October 2018. From this data set of tweets, a set of 4000 tweets was selected, and each tweet was manually annotated for relevance (vape relevant or not), commercial nature (commercial or not), and sentiment (provape or not). Using the annotated data, we derived traditional classifiers that included logistic regression, random forest, linear support vector machine, and multinomial naive Bayes. In addition, using the annotated data set and a larger unannotated data set of tweets, we derived deep learning classifiers that included a convolutional neural network (CNN), long short-term memory (LSTM) network, LSTM-CNN network, and bidirectional LSTM (BiLSTM) network. The unannotated tweet data were used to derive word vectors that deep learning classifiers can leverage to improve performance.
RESULTS: For relevance, LSTM-CNN performed the best, with the highest area under the receiver operating characteristic curve (AUC) of 0.96 (95% CI 0.93-0.98). For distinguishing commercial from noncommercial tweets, all deep learning classifiers, including LSTM-CNN, performed better than the traditional classifiers, with an AUC of 0.99 (95% CI 0.98-0.99). For provape sentiment, BiLSTM performed the best, with an AUC of 0.83 (95% CI 0.78-0.89). Overall, LSTM-CNN performed the best across all 3 classification tasks.
CONCLUSIONS: We derived and evaluated traditional machine learning and deep learning classifiers to identify vaping-related relevant, commercial, and provape tweets. Overall, deep learning classifiers such as LSTM-CNN had superior performance and had the added advantage of requiring no preprocessing. The performance of these classifiers supports the development of a vaping surveillance system.
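As an illustration of one of the traditional classifiers named in the abstract and of the AUC metric used to report results, the following is a minimal sketch and not the authors' implementation. It fits a multinomial naive Bayes model with add-one smoothing on a handful of invented example tweets and computes AUC as the fraction of positive/negative pairs that are ranked correctly. The toy tweets, the whitespace tokenizer, and every function name here are assumptions made for illustration; the study's actual features, preprocessing, and software are not described in this record.

```python
import math
from collections import Counter

# Hypothetical toy data, NOT from the study: 1 = vape relevant, 0 = not relevant.
TRAIN = [
    ("new vape juice flavors in stock", 1),
    ("my vape pen battery died again", 1),
    ("quit smoking and started vaping last year", 1),
    ("ecig shop opening downtown", 1),
    ("traffic downtown is terrible today", 0),
    ("my phone battery died again", 0),
    ("new pizza flavors in stock", 0),
    ("started running last year", 0),
]
TEST = [
    ("vape shop has new juice", 1),
    ("vaping with my new pen", 1),
    ("pizza downtown today", 0),
    ("running in traffic", 0),
]


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train_nb(examples, alpha=1.0):
    """Fit multinomial naive Bayes with add-alpha (Laplace) smoothing."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in examples:
        docs[label] += 1
        counts[label].update(tokenize(text))
    vocab = set(counts[0]) | set(counts[1])
    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for c in (0, 1):
        total = sum(counts[c].values()) + alpha * len(vocab)
        model["log_prior"][c] = math.log(docs[c] / len(examples))
        model["log_lik"][c] = {
            w: math.log((counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def score(model, text):
    """Log-odds of class 1 versus class 0; unseen words are ignored."""
    s = model["log_prior"][1] - model["log_prior"][0]
    for w in tokenize(text):
        if w in model["vocab"]:
            s += model["log_lik"][1][w] - model["log_lik"][0][w]
    return s


def auc(labels, scores):
    """AUC as P(positive outranks negative); ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


model = train_nb(TRAIN)
labels = [y for _, y in TEST]
scores = [score(model, t) for t, _ in TEST]
print(f"AUC on toy test set: {auc(labels, scores):.2f}")
```

The pairwise definition of AUC is quadratic in the number of examples, which is fine at this scale; for a data set of thousands of tweets a rank-based computation or an established library routine would be the more practical choice.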
format	Online Article Text
id	pubmed-7450367
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-7450367 2020-08-31 J Med Internet Res Original Paper JMIR Publications 2020-08-12 /pmc/articles/PMC7450367/ /pubmed/32784184 http://dx.doi.org/10.2196/17478 Text en ©Shyam Visweswaran, Jason B Colditz, Patrick O’Halloran, Na-Rae Han, Sanya B Taneja, Joel Welling, Kar-Hai Chu, Jaime E Sidani, Brian A Primack. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 12.08.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
title	Machine Learning Classifiers for Twitter Surveillance of Vaping: Comparative Machine Learning Study
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7450367/ https://www.ncbi.nlm.nih.gov/pubmed/32784184 http://dx.doi.org/10.2196/17478