Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers

In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve classifiers by artificially created training data. In NLP,...

Bibliographic Details
Main authors:	Bayer, Markus; Kaufhold, Marc-André; Buchhold, Björn; Keller, Marcel; Dallmeyer, Jörg; Reuter, Christian
Format:	Online Article Text
Language:	English
Published:	Springer Berlin Heidelberg 2022
Subjects:	Original Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9001823/ https://www.ncbi.nlm.nih.gov/pubmed/35432623 http://dx.doi.org/10.1007/s13042-022-01553-3

author	Bayer, Markus; Kaufhold, Marc-André; Buchhold, Björn; Keller, Marcel; Dallmeyer, Jörg; Reuter, Christian
author_sort	Bayer, Markus
collection	PubMed
description	In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve classifiers by artificially created training data. In NLP, there is the challenge of establishing universal rules for text transformations which provide new linguistic patterns. In this paper, we present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts. We achieved promising improvements when evaluating short as well as long text tasks with the enhancement by our text generation method. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. As the current track of these constructed regimes is not universally applicable, we also show major improvements in several real world low data tasks (up to +4.84 F1-score). Since we are evaluating the method from many perspectives (in total 11 datasets), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets.
format	Online Article Text
id	pubmed-9001823
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer Berlin Heidelberg
record_format	MEDLINE/PubMed
spelling	pubmed-9001823 2022-04-12 Int J Mach Learn Cybern Original Article Springer Berlin Heidelberg 2022-04-12 2023 /pmc/articles/PMC9001823/ /pubmed/35432623 http://dx.doi.org/10.1007/s13042-022-01553-3 Text en © The Author(s) 2022. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
title	Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9001823/ https://www.ncbi.nlm.nih.gov/pubmed/35432623 http://dx.doi.org/10.1007/s13042-022-01553-3

Similar Items