Classification of Twitter Vaping Discourse Using BERTweet: Comparative Deep Learning Study

BACKGROUND: Twitter provides a valuable platform for the surveillance and monitoring of public health topics; however, manually categorizing large quantities of Twitter data is labor intensive and presents barriers to identifying major trends and sentiments. Additionally, while machine and deep learnin...

Bibliographic Details
Main Authors:	Baker, William; Colditz, Jason B; Dobbs, Page D; Mai, Huy; Visweswaran, Shyam; Zhan, Justin; Primack, Brian A
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2022
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9353682/ https://www.ncbi.nlm.nih.gov/pubmed/35862172 http://dx.doi.org/10.2196/33678

author	Baker, William; Colditz, Jason B; Dobbs, Page D; Mai, Huy; Visweswaran, Shyam; Zhan, Justin; Primack, Brian A
author_sort	Baker, William
collection	PubMed
description	BACKGROUND: Twitter provides a valuable platform for the surveillance and monitoring of public health topics; however, manually categorizing large quantities of Twitter data is labor intensive and presents barriers to identifying major trends and sentiments. Additionally, while machine and deep learning approaches have been proposed with high accuracy, they require large, annotated data sets. Public pretrained deep learning classification models, such as BERTweet, produce higher-quality models while using smaller annotated training sets. OBJECTIVE: This study aims to derive and evaluate a pretrained deep learning model based on BERTweet that can identify tweets relevant to vaping, tweets (related to vaping) of commercial nature, and tweets with provape sentiment. Additionally, the performance of the BERTweet classifier will be compared against a long short-term memory (LSTM) model to show the improvements a pretrained model has over traditional deep learning approaches. METHODS: Twitter data were collected from August to October 2019 using vaping-related search terms. From this set, a random subsample of 2401 English tweets was manually annotated for relevance (vaping related or not), commercial nature (commercial or not), and sentiment (positive, negative, or neutral). Using the annotated data, 3 separate classifiers were built using BERTweet with the default parameters defined by the Simple Transformer application programming interface (API). Each model was trained for 20 iterations and evaluated with a random split of the annotated tweets, reserving 10% (n=165) of tweets for evaluations. RESULTS: The relevance, commercial, and sentiment classifiers achieved an area under the receiver operating characteristic curve (AUROC) of 94.5%, 99.3%, and 81.7%, respectively. Additionally, the weighted F1 scores of each were 97.6%, 99.0%, and 86.1%, respectively. We found that BERTweet outperformed the LSTM model in the classification of all categories.
CONCLUSIONS: Large, open-source deep learning classifiers, such as BERTweet, can provide researchers the ability to reliably determine if tweets are relevant to vaping; include commercial content; and include positive, negative, or neutral content about vaping with a higher accuracy than traditional natural language processing deep learning models. Such enhancement to the utilization of Twitter data can allow for faster exploration and dissemination of time-sensitive data than traditional methodologies (eg, surveys, polling research).
format	Online Article Text
id	pubmed-9353682
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-9353682 2022-08-06 JMIR Med Inform Original Paper JMIR Publications 2022-07-21 /pmc/articles/PMC9353682/ /pubmed/35862172 http://dx.doi.org/10.2196/33678 Text en ©William Baker, Jason B Colditz, Page D Dobbs, Huy Mai, Shyam Visweswaran, Justin Zhan, Brian A Primack. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 21.07.2022. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
title	Classification of Twitter Vaping Discourse Using BERTweet: Comparative Deep Learning Study
title_sort	classification of twitter vaping discourse using bertweet: comparative deep learning study
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9353682/ https://www.ncbi.nlm.nih.gov/pubmed/35862172 http://dx.doi.org/10.2196/33678
