Dataset construction method of cross-lingual summarization based on filtering and text augmentation

Existing cross-lingual summarization (CLS) datasets suffer from inconsistent sample quality and small scale. To address these problems, we propose a method that jointly supervises quality and scale to build CLS datasets. In terms of quality supervision, the method adopts a multi-strategy filtering algo...

Bibliographic Details
Main Authors:	Pan, Hangyu; Xi, Yaoyi; Wang, Ling; Nan, Yu; Su, Zhizhong; Cao, Rong
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2023
Subjects:	Artificial Intelligence
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280405/ https://www.ncbi.nlm.nih.gov/pubmed/37346668 http://dx.doi.org/10.7717/peerj-cs.1299

_version_	1785060786911051776
author	Pan, Hangyu; Xi, Yaoyi; Wang, Ling; Nan, Yu; Su, Zhizhong; Cao, Rong
author_facet	Pan, Hangyu; Xi, Yaoyi; Wang, Ling; Nan, Yu; Su, Zhizhong; Cao, Rong
author_sort	Pan, Hangyu
collection	PubMed
description	Existing cross-lingual summarization (CLS) datasets suffer from inconsistent sample quality and small scale. To address these problems, we propose a method that jointly supervises quality and scale to build CLS datasets. For quality supervision, the method adopts a multi-strategy filtering algorithm that removes low-quality monolingual summarization (MS) samples at the character and semantic levels, thereby improving the quality of the MS dataset. For scale supervision, the method adopts a text augmentation algorithm based on a pretrained model to increase the size of CLS datasets while maintaining quality. The method was used to build an English-Chinese CLS dataset, which was then evaluated with a data quality evaluation framework. The evaluation results show that the dataset is of good quality and large size. These outcomes suggest that the proposed method can improve both quality and scale, yielding a high-quality, large-scale CLS dataset at a lower cost.
format	Online Article Text
id	pubmed-10280405
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-10280405 2023-06-21 Dataset construction method of cross-lingual summarization based on filtering and text augmentation Pan, Hangyu; Xi, Yaoyi; Wang, Ling; Nan, Yu; Su, Zhizhong; Cao, Rong PeerJ Comput Sci Artificial Intelligence PeerJ Inc. 2023-03-28 /pmc/articles/PMC10280405/ /pubmed/37346668 http://dx.doi.org/10.7717/peerj-cs.1299 Text en ©2023 Pan et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title	Dataset construction method of cross-lingual summarization based on filtering and text augmentation
topic	Artificial Intelligence
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280405/ https://www.ncbi.nlm.nih.gov/pubmed/37346668 http://dx.doi.org/10.7717/peerj-cs.1299
