
Development of a COVID-19–Related Anti-Asian Tweet Data Set: Quantitative Study

BACKGROUND: Since the advent of the COVID-19 pandemic, individuals of Asian descent (colloquial usage prevalent in North America, where “Asian” is used to refer to people from East Asia, particularly China) have been the subject of stigma and hate speech in both offline and online communities. One o...


Bibliographic Details
Main Authors:	Mokhberi, Maryam; Biswas, Ahana; Masud, Zarif; Kteily-Hawa, Roula; Goldstein, Abby; Gillis, Joseph Roy; Rayana, Shebuti; Ahmed, Syed Ishtiaque
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2023
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9976773/ https://www.ncbi.nlm.nih.gov/pubmed/36693148 http://dx.doi.org/10.2196/40403

author	Mokhberi, Maryam Biswas, Ahana Masud, Zarif Kteily-Hawa, Roula Goldstein, Abby Gillis, Joseph Roy Rayana, Shebuti Ahmed, Syed Ishtiaque
author_sort	Mokhberi, Maryam
collection	PubMed
description	BACKGROUND: Since the advent of the COVID-19 pandemic, individuals of Asian descent (colloquial usage prevalent in North America, where “Asian” is used to refer to people from East Asia, particularly China) have been the subject of stigma and hate speech in both offline and online communities. One of the major venues for encountering such unfair attacks is social networks, such as Twitter. As the research community seeks to understand, analyze, and implement detection techniques, high-quality data sets are becoming immensely important. OBJECTIVE: In this study, we introduce a manually labeled data set of tweets containing anti-Asian stigmatizing content. METHODS: We sampled over 668 million tweets posted on Twitter from January to July 2020 and used an iterative data construction approach that included 3 different stages of algorithm-driven data selection. Finally, volunteers manually annotated the tweets to arrive at a high-quality data set of tweets and a second, subsampled data set with higher-quality labels from multiple annotators. We presented this final high-quality Twitter data set on stigma toward Chinese people during the COVID-19 pandemic. The data set and instructions for labeling can be viewed in the GitHub repository. Furthermore, we implemented some state-of-the-art models to detect stigmatizing tweets to set initial benchmarks for our data set. RESULTS: Our primary contributions are labeled data sets. Data Set v3.0 contained 11,263 tweets with primary labels (unknown/irrelevant, not-stigmatizing, stigmatizing-low, stigmatizing-medium, stigmatizing-high) and tweet subtopics (eg, wet market and eating habits, COVID-19 cases, bioweapon). Data Set v3.1 contained 4998 (44.4%) tweets randomly sampled from Data Set v3.0, where a second annotator labeled them only on the primary labels and then a third annotator resolved conflicts between the first and second annotators. To demonstrate the usefulness of our data set, we ran preliminary experiments, in which the Bidirectional Encoder Representations from Transformers (BERT) model achieved the highest accuracy, 79%, when detecting stigma on unseen data, while traditional models, such as a support vector machine (SVM), performed at 73% accuracy. CONCLUSIONS: Our data set can be used as a benchmark for further qualitative and quantitative research and analysis around the issue. First, it reaffirms the existence and significance of widespread discrimination and stigma toward the Asian population worldwide. Moreover, our data set and subsequent arguments should assist other researchers from various domains, including psychologists, public policy authorities, and sociologists, in analyzing the complex economic, political, historical, and cultural roots of anti-Asian stigmatization and hateful behaviors. A manually annotated data set is of paramount importance for developing algorithms that can be used to detect stigma or problematic text, particularly on social media. We believe this contribution will help predict stigma and subsequently design interventions that will significantly help reduce stigma, hate, and discrimination against marginalized populations during future crises like COVID-19.
format	Online Article Text
id	pubmed-9976773
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-9976773 2023-03-02 Development of a COVID-19–Related Anti-Asian Tweet Data Set: Quantitative Study Mokhberi, Maryam Biswas, Ahana Masud, Zarif Kteily-Hawa, Roula Goldstein, Abby Gillis, Joseph Roy Rayana, Shebuti Ahmed, Syed Ishtiaque JMIR Form Res Original Paper JMIR Publications 2023-02-28 /pmc/articles/PMC9976773/ /pubmed/36693148 http://dx.doi.org/10.2196/40403 Text en ©Maryam Mokhberi, Ahana Biswas, Zarif Masud, Roula Kteily-Hawa, Abby Goldstein, Joseph Roy Gillis, Shebuti Rayana, Syed Ishtiaque Ahmed. Originally published in JMIR Formative Research (https://formative.jmir.org), 28.02.2023. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Formative Research, is properly cited. The complete bibliographic information, a link to the original publication on https://formative.jmir.org, as well as this copyright and license information must be included.
title	Development of a COVID-19–Related Anti-Asian Tweet Data Set: Quantitative Study
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9976773/ https://www.ncbi.nlm.nih.gov/pubmed/36693148 http://dx.doi.org/10.2196/40403
