Morpheme Matching Based Text Tokenization for a Scarce Resourced Language
Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. In this paper a morpheme matching based approach has been propo...
Main authors: | Rehman, Zobia, Anwar, Waqas, Bajwa, Usama Ijaz, Xuan, Wang, Chaoying, Zhou |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2013 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3749178/ https://www.ncbi.nlm.nih.gov/pubmed/23990871 http://dx.doi.org/10.1371/journal.pone.0068178 |
_version_ | 1782281162205429760 |
---|---|
author | Rehman, Zobia Anwar, Waqas Bajwa, Usama Ijaz Xuan, Wang Chaoying, Zhou |
author_facet | Rehman, Zobia Anwar, Waqas Bajwa, Usama Ijaz Xuan, Wang Chaoying, Zhou |
author_sort | Rehman, Zobia |
collection | PubMed |
description | Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme-matching-based approach to Urdu text tokenization, along with additional algorithms that address boundary detection for compound words, affixation, reduplication, names, and abbreviations. The study achieved 97.28% precision, 93.71% recall, and a 95.46% F1-measure when tokenizing a corpus of 57,000 words using a morpheme list with 6,400 entries. |
format | Online Article Text |
id | pubmed-3749178 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-37491782013-08-29 Morpheme Matching Based Text Tokenization for a Scarce Resourced Language Rehman, Zobia Anwar, Waqas Bajwa, Usama Ijaz Xuan, Wang Chaoying, Zhou PLoS One Research Article Text tokenization is a fundamental pre-processing step for almost all the information processing applications. This task is nontrivial for the scarce resourced languages such as Urdu, as there is inconsistent use of space between words. In this paper a morpheme matching based approach has been proposed for Urdu text tokenization, along with some other algorithms to solve the additional issues of boundary detection of compound words, affixation, reduplication, names and abbreviations. This study resulted into 97.28% precision, 93.71% recall, and 95.46% F1-measure; while tokenizing a corpus of 57000 words by using a morpheme list with 6400 entries. Public Library of Science 2013-08-21 /pmc/articles/PMC3749178/ /pubmed/23990871 http://dx.doi.org/10.1371/journal.pone.0068178 Text en © 2013 Rehman et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Rehman, Zobia Anwar, Waqas Bajwa, Usama Ijaz Xuan, Wang Chaoying, Zhou Morpheme Matching Based Text Tokenization for a Scarce Resourced Language |
title | Morpheme Matching Based Text Tokenization for a Scarce Resourced Language |
title_full | Morpheme Matching Based Text Tokenization for a Scarce Resourced Language |
title_fullStr | Morpheme Matching Based Text Tokenization for a Scarce Resourced Language |
title_full_unstemmed | Morpheme Matching Based Text Tokenization for a Scarce Resourced Language |
title_short | Morpheme Matching Based Text Tokenization for a Scarce Resourced Language |
title_sort | morpheme matching based text tokenization for a scarce resourced language |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3749178/ https://www.ncbi.nlm.nih.gov/pubmed/23990871 http://dx.doi.org/10.1371/journal.pone.0068178 |
work_keys_str_mv | AT rehmanzobia morphemematchingbasedtexttokenizationforascarceresourcedlanguage AT anwarwaqas morphemematchingbasedtexttokenizationforascarceresourcedlanguage AT bajwausamaijaz morphemematchingbasedtexttokenizationforascarceresourcedlanguage AT xuanwang morphemematchingbasedtexttokenizationforascarceresourcedlanguage AT chaoyingzhou morphemematchingbasedtexttokenizationforascarceresourcedlanguage |
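
The description field reports precision, recall, and F1-measure together. As a quick consistency check, the sketch below recomputes F1 as the harmonic mean of the two reported values. It assumes the record uses the standard balanced F1 definition; the function name `f1_score` and the variable names are illustrative and do not come from the article.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall.

    Inputs and output share the same unit (here, percentages).
    """
    if precision + recall == 0:
        # Avoid division by zero when both scores are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the record's description field (percentages).
reported_precision = 97.28
reported_recall = 93.71

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.2f}")  # prints 95.46
```

The recomputed value agrees with the 95.46% F1-measure given in the record, so the three reported figures are mutually consistent under that definition.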