Morpheme Matching Based Text Tokenization for a Scarce Resourced Language

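Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce resourced languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme matching based approach to Urdu text tokenization, along with additional algorithms that address boundary detection of compound words, affixation, reduplication, names, and abbreviations. The study achieved 97.28% precision, 93.71% recall, and a 95.46% F1-measure when tokenizing a corpus of 57,000 words using a morpheme list of 6,400 entries.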

Bibliographic Details
Main Authors:	Rehman, Zobia; Anwar, Waqas; Bajwa, Usama Ijaz; Xuan, Wang; Chaoying, Zhou
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2013
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3749178/ https://www.ncbi.nlm.nih.gov/pubmed/23990871 http://dx.doi.org/10.1371/journal.pone.0068178

author	Rehman, Zobia Anwar, Waqas Bajwa, Usama Ijaz Xuan, Wang Chaoying, Zhou
author_sort	Rehman, Zobia
collection	PubMed
description	Text tokenization is a fundamental pre-processing step for almost all the information processing applications. This task is nontrivial for the scarce resourced languages such as Urdu, as there is inconsistent use of space between words. In this paper a morpheme matching based approach has been proposed for Urdu text tokenization, along with some other algorithms to solve the additional issues of boundary detection of compound words, affixation, reduplication, names and abbreviations. This study resulted into 97.28% precision, 93.71% recall, and 95.46% F1-measure; while tokenizing a corpus of 57000 words by using a morpheme list with 6400 entries.
format	Online Article Text
id	pubmed-3749178
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3749178 2013-08-29 Morpheme Matching Based Text Tokenization for a Scarce Resourced Language Rehman, Zobia Anwar, Waqas Bajwa, Usama Ijaz Xuan, Wang Chaoying, Zhou PLoS One Research Article Text tokenization is a fundamental pre-processing step for almost all the information processing applications. This task is nontrivial for the scarce resourced languages such as Urdu, as there is inconsistent use of space between words. In this paper a morpheme matching based approach has been proposed for Urdu text tokenization, along with some other algorithms to solve the additional issues of boundary detection of compound words, affixation, reduplication, names and abbreviations. This study resulted into 97.28% precision, 93.71% recall, and 95.46% F1-measure; while tokenizing a corpus of 57000 words by using a morpheme list with 6400 entries. Public Library of Science 2013-08-21 /pmc/articles/PMC3749178/ /pubmed/23990871 http://dx.doi.org/10.1371/journal.pone.0068178 Text en © 2013 Rehman et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	Morpheme Matching Based Text Tokenization for a Scarce Resourced Language
topic	Research Article

Similar Items