Morpheme Matching Based Text Tokenization for a Scarce Resourced Language

Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. In this paper a morpheme matching based approach has been propo...

Bibliographic Details
Main authors: Rehman, Zobia, Anwar, Waqas, Bajwa, Usama Ijaz, Xuan, Wang, Chaoying, Zhou
Format: Online Article Text
Language: English
Published: Public Library of Science 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3749178/
https://www.ncbi.nlm.nih.gov/pubmed/23990871
http://dx.doi.org/10.1371/journal.pone.0068178
_version_ 1782281162205429760
author Rehman, Zobia
Anwar, Waqas
Bajwa, Usama Ijaz
Xuan, Wang
Chaoying, Zhou
author_facet Rehman, Zobia
Anwar, Waqas
Bajwa, Usama Ijaz
Xuan, Wang
Chaoying, Zhou
author_sort Rehman, Zobia
collection PubMed
description Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme matching based approach to Urdu text tokenization, along with additional algorithms that handle boundary detection for compound words, affixation, reduplication, names, and abbreviations. The approach achieved 97.28% precision, 93.71% recall, and a 95.46% F1-measure when tokenizing a corpus of 57000 words with a morpheme list of 6400 entries.
format Online
Article
Text
id pubmed-3749178
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-37491782013-08-29 Morpheme Matching Based Text Tokenization for a Scarce Resourced Language Rehman, Zobia Anwar, Waqas Bajwa, Usama Ijaz Xuan, Wang Chaoying, Zhou PLoS One Research Article Text tokenization is a fundamental pre-processing step for almost all the information processing applications. This task is nontrivial for the scarce resourced languages such as Urdu, as there is inconsistent use of space between words. In this paper a morpheme matching based approach has been proposed for Urdu text tokenization, along with some other algorithms to solve the additional issues of boundary detection of compound words, affixation, reduplication, names and abbreviations. This study resulted into 97.28% precision, 93.71% recall, and 95.46% F1-measure; while tokenizing a corpus of 57000 words by using a morpheme list with 6400 entries. Public Library of Science 2013-08-21 /pmc/articles/PMC3749178/ /pubmed/23990871 http://dx.doi.org/10.1371/journal.pone.0068178 Text en © 2013 Rehman et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Rehman, Zobia
Anwar, Waqas
Bajwa, Usama Ijaz
Xuan, Wang
Chaoying, Zhou
Morpheme Matching Based Text Tokenization for a Scarce Resourced Language
title Morpheme Matching Based Text Tokenization for a Scarce Resourced Language
title_full Morpheme Matching Based Text Tokenization for a Scarce Resourced Language
title_fullStr Morpheme Matching Based Text Tokenization for a Scarce Resourced Language
title_full_unstemmed Morpheme Matching Based Text Tokenization for a Scarce Resourced Language
title_short Morpheme Matching Based Text Tokenization for a Scarce Resourced Language
title_sort morpheme matching based text tokenization for a scarce resourced language
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3749178/
https://www.ncbi.nlm.nih.gov/pubmed/23990871
http://dx.doi.org/10.1371/journal.pone.0068178
work_keys_str_mv AT rehmanzobia morphemematchingbasedtexttokenizationforascarceresourcedlanguage
AT anwarwaqas morphemematchingbasedtexttokenizationforascarceresourcedlanguage
AT bajwausamaijaz morphemematchingbasedtexttokenizationforascarceresourcedlanguage
AT xuanwang morphemematchingbasedtexttokenizationforascarceresourcedlanguage
AT chaoyingzhou morphemematchingbasedtexttokenizationforascarceresourcedlanguage
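
Illustrative note: the record's description reports precision, recall, and F1 for a morpheme matching tokenizer but gives no algorithmic detail. The sketch below is therefore not the authors' method. It shows only (a) how the reported F1-measure follows from the reported precision and recall as their harmonic mean, and (b) a generic greedy forward longest-match segmenter over a morpheme list, which is one common way to do lexicon-based matching. The function names and the toy English lexicon are hypothetical and chosen for readability.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


def longest_match_tokenize(text: str, morphemes: set[str]) -> list[str]:
    """Greedy forward longest-match segmentation of a space-free string.

    Assumption: at each position the longest lexicon entry wins; a character
    that starts no known entry is emitted as a single-character token.
    """
    max_len = max((len(m) for m in morphemes), default=1)
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in morphemes:
                tokens.append(piece)
                i += size
                break
    return tokens


# Figures taken from the record's description field.
reported_f1 = round(f1_score(97.28, 93.71), 2)

# Toy lexicon (hypothetical); a real list would hold Urdu morphemes.
toy_tokens = longest_match_tokenize("thecatsat", {"the", "cat", "sat"})
```

Greedy matching is simple but can pick the wrong boundary when a longer entry overlaps the correct split. This is presumably one reason the description mentions separate handling for compound words, affixation, and reduplication.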