
Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set

BACKGROUND: In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone. OBJECTIVE: The objective of this study was to develop, evaluate, and deploy an auto...


Bibliographic Details
Main Authors:	Klein, Ari Z; Magge, Arjun; O'Connor, Karen; Flores Amaro, Jesus Ivan; Weissenbacher, Davy; Gonzalez Hernandez, Graciela
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2021
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7834613/ https://www.ncbi.nlm.nih.gov/pubmed/33449904 http://dx.doi.org/10.2196/25314

_version_	1783642322394677248
author	Klein, Ari Z; Magge, Arjun; O'Connor, Karen; Flores Amaro, Jesus Ivan; Weissenbacher, Davy; Gonzalez Hernandez, Graciela
author_sort	Klein, Ari Z
collection	PubMed
description	BACKGROUND: In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone. OBJECTIVE: The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention. METHODS: Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020. RESULTS: Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state–level geolocations. CONCLUSIONS: We have made the 13,714 tweets identified in this study, along with each tweet’s time stamp and US state–level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
format	Online Article Text
id	pubmed-7834613
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-7834613 2021-01-29 Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set Klein, Ari Z Magge, Arjun O'Connor, Karen Flores Amaro, Jesus Ivan Weissenbacher, Davy Gonzalez Hernandez, Graciela J Med Internet Res Original Paper JMIR Publications 2021-01-22 /pmc/articles/PMC7834613/ /pubmed/33449904 http://dx.doi.org/10.2196/25314 Text en ©Ari Z Klein, Arjun Magge, Karen O'Connor, Jesus Ivan Flores Amaro, Davy Weissenbacher, Graciela Gonzalez Hernandez. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 22.01.2021. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
title	Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7834613/ https://www.ncbi.nlm.nih.gov/pubmed/33449904 http://dx.doi.org/10.2196/25314
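
The first two pipeline steps described in the abstract (matching handwritten regular expressions, then filtering out "reported speech") can be sketched roughly as follows. This is a minimal illustration only: the study's actual regular expressions and filtering rules are not reproduced in this record, so every pattern and function name below is a hypothetical stand-in rather than the authors' method.

```python
import re

# Hypothetical exposure patterns. The paper's handwritten regular expressions
# are not given in this record, so these are illustrative assumptions only.
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi (?:think|believe) i (?:have|had|caught) (?:covid|coronavirus|corona)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bi(?:'m| am) (?:getting |being )?tested for (?:covid|coronavirus|corona)\b",
        re.IGNORECASE,
    ),
]

# Hypothetical cues for "reported speech": retweets, double-quoted text, links.
REPORTED_SPEECH = re.compile(r"^rt @|[\"“”]|https?://", re.IGNORECASE)


def matches_exposure(text: str) -> bool:
    """Return True if any illustrative exposure pattern occurs in the tweet."""
    return any(pattern.search(text) for pattern in EXPOSURE_PATTERNS)


def is_reported_speech(text: str) -> bool:
    """Return True if the tweet looks like a quotation, retweet, or headline link."""
    return REPORTED_SPEECH.search(text) is not None


def candidate_tweets(tweets):
    """Keep tweets that match an exposure pattern and are not reported speech."""
    return [t for t in tweets if matches_exposure(t) and not is_reported_speech(t)]
```

In the study, tweets surviving this kind of filter were then annotated and used to train BERT-based classifiers; that step is outside the scope of this sketch.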
