Estimating Sentence-like Structure in Synthetic Languages Using Information Topology

Estimating sentence-like units and sentence boundaries in human language is an important task in the context of natural language understanding. While this topic has been considered using a range of techniques, including rule-based approaches and supervised and unsupervised algorithms, a common aspec...

Bibliographic Details
Main authors:	Back, Andrew D.; Wiles, Janet
Format:	Online Article Text
Language:	English
Published:	MDPI 2022
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9317616/ https://www.ncbi.nlm.nih.gov/pubmed/35885083 http://dx.doi.org/10.3390/e24070859

author	Back, Andrew D.; Wiles, Janet
collection	PubMed
description	Estimating sentence-like units and sentence boundaries in human language is an important task in the context of natural language understanding. While this topic has been considered using a range of techniques, including rule-based approaches and supervised and unsupervised algorithms, a common aspect of these methods is that they inherently rely on a priori knowledge of human language in one form or another. Recently, we have been exploring synthetic languages based on the concept of modeling behaviors using emergent languages. These synthetic languages are characterized by a small alphabet and limited vocabulary and grammatical structure. A particular challenge for synthetic languages is that there is generally no a priori language model available, which limits the use of many natural language processing methods. In this paper, we explore how natural ‘chunks’ might be discovered in synthetic language sequences in terms of sentence-like units. The problem is how to do this with no linguistic or semantic language model. Our approach is to consider the problem from the perspective of information theory. We extend the basis of information geometry and propose a new concept, which we term information topology, to model the incremental flow of information in natural sequences. We introduce an information topology view of the incremental information and incremental tangent angle of the Wasserstein-1 distance of the probabilistic symbolic language input. This approach is not suggested as a fully viable alternative for sentence boundary detection per se, but it provides a new conceptual method for estimating the structure and natural limits of information flow in language sequences without any semantic knowledge. We consider relevant existing performance metrics such as the F-measure and indicate their limitations, leading to the introduction of a new information-theoretic global performance measure based on modeled distributions. Although the methodology is not proposed for human language sentence detection, we provide some examples using human language corpora where potentially useful results are shown. The proposed model shows potential advantages for overcoming difficulties due to the disambiguation of complex language and potential improvements for human language methods.
format	Online Article Text
id	pubmed-9317616
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-9317616 2022-07-27 Estimating Sentence-like Structure in Synthetic Languages Using Information Topology Back, Andrew D. Wiles, Janet Entropy (Basel) Article MDPI 2022-06-22 /pmc/articles/PMC9317616/ /pubmed/35885083 http://dx.doi.org/10.3390/e24070859 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Estimating Sentence-like Structure in Synthetic Languages Using Information Topology
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9317616/ https://www.ncbi.nlm.nih.gov/pubmed/35885083 http://dx.doi.org/10.3390/e24070859

Similar items