Comparing neural‐ and N‐gram‐based language models for word segmentation

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the la...

Bibliographic Details
Main authors:	Doval, Yerai; Gómez‐Rodríguez, Carlos
Format:	Online Article Text
Language:	English
Published:	John Wiley & Sons, Inc. 2018
Subjects:	Research Articles
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6360409/ https://www.ncbi.nlm.nih.gov/pubmed/30775406 http://dx.doi.org/10.1002/asi.24082

collection	PubMed
description	Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n‐gram model or a recurrent neural network. The resulting system analyzes the text input with no word boundaries one token at a time, where a token can be a character or a byte, and uses the information gathered by the language model to determine whether a boundary must be placed in the current position. Our aim is to use this system in a preprocessing step for a microtext normalization system. This means that it needs to cope effectively with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available word segmentation systems: the well‐known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue to improve both the precision and the efficiency of our system in the future.
id	pubmed-6360409
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
journal	J Assoc Inf Sci Technol
dates	2019-02-14; 2018-12-02; 2019-02
rights	© 2018 The Authors. Journal of the Association for Information Science and Technology published by Wiley Periodicals, Inc. on behalf of ASIS&T. This is an open access article under the terms of the http://creativecommons.org/licenses/by/4.0/ License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
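
The approach summarized in the description (a beam search that decides, one character at a time, whether to place a word boundary, scored by a character-level language model) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `CharNgramLM` and `segment`, the add-one smoothing, the `^` padding symbol, and the tiny training text are all assumptions made for illustration. A recurrent neural network could in principle be substituted by providing the same `logprob(history, ch)` interface.

```python
import math
from collections import defaultdict


class CharNgramLM:
    """Toy character n-gram model with add-one smoothing (hypothetical, illustrative)."""

    def __init__(self, order=3):
        if order < 2:
            raise ValueError("order must be at least 2")
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, text):
        # Pad the start so the first characters also have a full-length context.
        padded = "^" * (self.order - 1) + text
        self.vocab.update(padded)
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            self.counts[context][padded[i]] += 1

    def logprob(self, history, ch):
        # Log-probability of `ch` given the last (order - 1) characters of `history`.
        context = ("^" * (self.order - 1) + history)[-(self.order - 1):]
        seen = self.counts.get(context, {})
        total = sum(seen.values())
        # Add-one smoothing; the extra +1 reserves mass for unseen characters.
        return math.log((seen.get(ch, 0) + 1) / (total + len(self.vocab) + 1))


def segment(text, lm, beam_width=8):
    """Insert spaces into `text` (assumed to contain none) using beam search."""
    beam = [(0.0, "")]  # each hypothesis: (log-probability, output so far)
    for ch in text:
        candidates = []
        for score, out in beam:
            # Hypothesis 1: no boundary before the current character.
            candidates.append((score + lm.logprob(out, ch), out + ch))
            # Hypothesis 2: boundary before the current character
            # (never at the very start, never two boundaries in a row).
            if out and not out.endswith(" "):
                with_space = out + " "
                new_score = score + lm.logprob(out, " ") + lm.logprob(with_space, ch)
                candidates.append((new_score, with_space + ch))
        # Keep only the best `beam_width` hypotheses.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beam[0][1]


# Usage with a deliberately tiny training text (an assumption, not real data).
model = CharNgramLM(order=3)
model.train("the cat sat on the mat the cat ate")
print(segment("thecatsat", model, beam_width=4))
```

With so little training text the output is only indicative; the point of the sketch is the control flow, in which each hypothesis either extends the current word or opens a new one, and the language model ranks the alternatives.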
