
A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis

One of the first steps in many text-based social science studies is to retrieve documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to address this retrieval task is to apply a set of keywords and to consider t...


Bibliographic Details
Main Author:	Wankmüller, Sandra
Format:	Online Article Text
Language:	English
Published:	Springer Nature Singapore 2022
Subjects:	Survey Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9762672/ https://www.ncbi.nlm.nih.gov/pubmed/36568019 http://dx.doi.org/10.1007/s42001-022-00191-7

_version_	1784852912197861376
author	Wankmüller, Sandra
collection	PubMed
description	One of the first steps in many text-based social science studies is to retrieve documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to address this retrieval task is to apply a set of keywords and to consider those documents to be relevant that contain at least one of the keywords. But the application of incomplete keyword lists has a high risk of drawing biased inferences. More complex and costly methods such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning could have the potential to more accurately separate relevant from irrelevant documents and thereby reduce the potential size of bias. Yet, whether applying these more expensive approaches increases retrieval performance compared to keyword lists at all, and if so, by how much, is unclear as a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder in SSRN, 2017. 10.2139/ssrn.3026393), the Social Bias Inference Corpus (SBIC) (Sap et al. in Social bias frames: reasoning about social and power implications of language. In: Jurafsky et al. (eds) Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, p 5477–5490, 2020. 10.18653/v1/2020.aclmain.486), and the Reuters-21578 corpus (Lewis in Reuters-21578 (Distribution 1.0). [Data set], 1997. http://www.daviddlewis.com/resources/testcollections/reuters21578/). Results show that query expansion techniques and topic model-based classification rules in most studied settings tend to decrease rather than increase retrieval performance. Active supervised learning, however, if applied on a not too small set of labeled training instances (e.g. 1000 documents), reaches a substantially higher retrieval performance than keyword lists.
format	Online Article Text
id	pubmed-9762672
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer Nature Singapore
record_format	MEDLINE/PubMed
spelling	pubmed-9762672 2022-12-20 A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis Wankmüller, Sandra J Comput Soc Sci Survey Article One of the first steps in many text-based social science studies is to retrieve documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science to address this retrieval task is to apply a set of keywords and to consider those documents to be relevant that contain at least one of the keywords. But the application of incomplete keyword lists has a high risk of drawing biased inferences. More complex and costly methods such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning could have the potential to more accurately separate relevant from irrelevant documents and thereby reduce the potential size of bias. Yet, whether applying these more expensive approaches increases retrieval performance compared to keyword lists at all, and if so, by how much, is unclear as a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder in SSRN, 2017. 10.2139/ssrn.3026393), the Social Bias Inference Corpus (SBIC) (Sap et al. in Social bias frames: reasoning about social and power implications of language. In: Jurafsky et al. (eds) Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, p 5477–5490, 2020. 10.18653/v1/2020.aclmain.486), and the Reuters-21578 corpus (Lewis in Reuters-21578 (Distribution 1.0). [Data set], 1997. http://www.daviddlewis.com/resources/testcollections/reuters21578/). Results show that query expansion techniques and topic model-based classification rules in most studied settings tend to decrease rather than increase retrieval performance. Active supervised learning, however, if applied on a not too small set of labeled training instances (e.g. 1000 documents), reaches a substantially higher retrieval performance than keyword lists. Springer Nature Singapore 2022-12-19 2023 /pmc/articles/PMC9762672/ /pubmed/36568019 http://dx.doi.org/10.1007/s42001-022-00191-7 Text en © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. 2022, Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle	Survey Article Wankmüller, Sandra A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis
title	A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis
topic	Survey Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9762672/ https://www.ncbi.nlm.nih.gov/pubmed/36568019 http://dx.doi.org/10.1007/s42001-022-00191-7

