Short text classification with machine learning in the social sciences: The case of climate change on Twitter

Bibliographic Details
Main authors: Shyrokykh, Karina; Girnyk, Max; Dellmuth, Lisa
Format: Online, Article, Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10540966/
https://www.ncbi.nlm.nih.gov/pubmed/37773969
http://dx.doi.org/10.1371/journal.pone.0290762
_version_ 1785113821520592896
author Shyrokykh, Karina
Girnyk, Max
Dellmuth, Lisa
collection PubMed
description To analyse large numbers of texts, social science researchers are increasingly confronting the challenge of text classification. When manual labeling is not possible and researchers have to find automatized ways to classify texts, computer science provides a useful toolbox of machine-learning methods whose performance remains understudied in the social sciences. In this article, we compare the performance of the most widely used text classifiers by applying them to a typical research scenario in social science research: a relatively small labeled dataset with infrequent occurrence of categories of interest, which is a part of a large unlabeled dataset. As an example case, we look at Twitter communication regarding climate change, a topic of increasing scholarly interest in interdisciplinary social science research. Using a novel dataset including 5,750 tweets from various international organizations regarding the highly ambiguous concept of climate change, we evaluate the performance of methods in automatically classifying tweets based on whether they are about climate change or not. In this context, we highlight two main findings. First, supervised machine-learning methods perform better than state-of-the-art lexicons, in particular as class balance increases. Second, traditional machine-learning methods, such as logistic regression and random forest, perform similarly to sophisticated deep-learning methods, whilst requiring much less training time and computational resources. The results have important implications for the analysis of short texts in social science research.
format Online
Article
Text
id pubmed-10540966
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10540966 2023-10-01 Short text classification with machine learning in the social sciences: The case of climate change on Twitter. Shyrokykh, Karina; Girnyk, Max; Dellmuth, Lisa. PLoS One. Research Article.
Public Library of Science 2023-09-29 /pmc/articles/PMC10540966/ /pubmed/37773969 http://dx.doi.org/10.1371/journal.pone.0290762 Text en © 2023 Shyrokykh et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Short text classification with machine learning in the social sciences: The case of climate change on Twitter
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10540966/
https://www.ncbi.nlm.nih.gov/pubmed/37773969
http://dx.doi.org/10.1371/journal.pone.0290762