
Strategies for the Analysis of Large Social Media Corpora: Sampling and Keyword Extraction Methods

In the context of the COVID-19 pandemic, social media platforms such as Twitter have been of great importance for users to exchange news, ideas, and perceptions. Researchers from fields such as discourse analysis and the social sciences have resorted to this content to explore public opinion and sta...


Bibliographic Details
Main Authors:	Moreno-Ortiz, Antonio; García-Gámez, María
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2023
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10148754/ https://www.ncbi.nlm.nih.gov/pubmed/37361894 http://dx.doi.org/10.1007/s41701-023-00143-0

author	Moreno-Ortiz, Antonio; García-Gámez, María
collection	PubMed
description	In the context of the COVID-19 pandemic, social media platforms such as Twitter have been of great importance for users to exchange news, ideas, and perceptions. Researchers from fields such as discourse analysis and the social sciences have resorted to this content to explore public opinion and stance on this topic, and they have tried to gather information through the compilation of large-scale corpora. However, the size of such corpora is both an advantage and a drawback, as simple text retrieval techniques and tools may prove to be impractical or altogether incapable of handling such masses of data. This study provides methodological and practical cues on how to manage the contents of a large-scale social media corpus such as Chen et al. (JMIR Public Health Surveill 6(2):e19273, 2020) COVID-19 corpus. We compare and evaluate, in terms of efficiency and efficacy, available methods to handle such a large corpus. First, we compare different sample sizes to assess whether it is possible to achieve similar results despite the size difference and evaluate sampling methods following a specific data management approach to storing the original corpus. Second, we examine two keyword extraction methodologies commonly used to obtain a compact representation of the main subject and topics of a text: the traditional method used in corpus linguistics, which compares word frequencies using a reference corpus, and graph-based techniques as developed in Natural Language Processing tasks. The methods and strategies discussed in this study enable valuable quantitative and qualitative analyses of an otherwise intractable mass of social media data.
format	Online Article Text
id	pubmed-10148754
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-10148754 2023-05-01 Strategies for the Analysis of Large Social Media Corpora: Sampling and Keyword Extraction Methods Moreno-Ortiz, Antonio García-Gámez, María Corpus Pragmat Original Paper In the context of the COVID-19 pandemic, social media platforms such as Twitter have been of great importance for users to exchange news, ideas, and perceptions. Researchers from fields such as discourse analysis and the social sciences have resorted to this content to explore public opinion and stance on this topic, and they have tried to gather information through the compilation of large-scale corpora. However, the size of such corpora is both an advantage and a drawback, as simple text retrieval techniques and tools may prove to be impractical or altogether incapable of handling such masses of data. This study provides methodological and practical cues on how to manage the contents of a large-scale social media corpus such as Chen et al. (JMIR Public Health Surveill 6(2):e19273, 2020) COVID-19 corpus. We compare and evaluate, in terms of efficiency and efficacy, available methods to handle such a large corpus. First, we compare different sample sizes to assess whether it is possible to achieve similar results despite the size difference and evaluate sampling methods following a specific data management approach to storing the original corpus. Second, we examine two keyword extraction methodologies commonly used to obtain a compact representation of the main subject and topics of a text: the traditional method used in corpus linguistics, which compares word frequencies using a reference corpus, and graph-based techniques as developed in Natural Language Processing tasks. The methods and strategies discussed in this study enable valuable quantitative and qualitative analyses of an otherwise intractable mass of social media data.
Springer International Publishing 2023-04-30 /pmc/articles/PMC10148754/ /pubmed/37361894 http://dx.doi.org/10.1007/s41701-023-00143-0 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title	Strategies for the Analysis of Large Social Media Corpora: Sampling and Keyword Extraction Methods
topic	Original Paper
