
Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets

Symbolic sequential data are produced in huge quantities in numerous contexts, such as text and speech data, biometrics, genomics, financial market indexes, music sheets, and online social media posts. In this paper, an unsupervised approach for the chunking of idiomatic units of sequential text dat...

Bibliographic Details
Main authors: Borrelli, Dario, Gongora Svartzman, Gabriela, Lipizzi, Carlo
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7279599/
https://www.ncbi.nlm.nih.gov/pubmed/32511252
http://dx.doi.org/10.1371/journal.pone.0234214
author Borrelli, Dario
Gongora Svartzman, Gabriela
Lipizzi, Carlo
collection PubMed
description Symbolic sequential data are produced in huge quantities in numerous contexts, such as text and speech data, biometrics, genomics, financial market indexes, music sheets, and online social media posts. In this paper, an unsupervised approach for the chunking of idiomatic units of sequential text data is presented. Text chunking refers to the task of splitting a string of textual information into non-overlapping groups of related units. This is a fundamental problem in numerous fields where understanding the relation between raw units of symbolic sequential data is relevant. Existing methods are based primarily on supervised and semi-supervised learning approaches; however, in this study, a novel unsupervised approach is proposed based on the existing concept of n-grams, which requires no labeled text as an input. The proposed methodology is applied to two natural language corpora: a Wall Street Journal corpus and a Twitter corpus. In both cases, the corpus length was increased gradually to measure the accuracy with a different number of unitary elements as inputs. Both corpora reveal improvements in accuracy proportional with increases in the number of tokens. For the Twitter corpus, the increase in accuracy follows a linear trend. The results show that the proposed methodology can achieve a higher accuracy with incremental usage. A future study will aim at designing an iterative system for the proposed methodology.
format Online
Article
Text
id pubmed-7279599
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7279599 2020-06-17 PLoS One Research Article
Public Library of Science 2020-06-08 /pmc/articles/PMC7279599/ /pubmed/32511252 http://dx.doi.org/10.1371/journal.pone.0234214 Text en © 2020 Borrelli et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Unsupervised acquisition of idiomatic units of symbolic natural language: An n-gram frequency-based approach for the chunking of news articles and tweets
topic Research Article