iSentenizer-μ: Multilingual Sentence Boundary Detection Model

Sentence boundary detection (SBD) systems are normally quite sensitive to the genre of the data on which they are trained. Genre here refers to shifts in text topic and to new language domains. Although new detection models can be retrained for different languages or new text genre...

Bibliographic Details
Main Authors: Wong, Derek F., Chao, Lidia S., Zeng, Xiaodong
Format: Online Article Text
Language: English
Published: Hindawi Publishing Corporation 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4030568/
https://www.ncbi.nlm.nih.gov/pubmed/24883358
http://dx.doi.org/10.1155/2014/196574
_version_ 1782317405640327168
author Wong, Derek F.
Chao, Lidia S.
Zeng, Xiaodong
author_facet Wong, Derek F.
Chao, Lidia S.
Zeng, Xiaodong
author_sort Wong, Derek F.
collection PubMed
description Sentence boundary detection (SBD) systems are normally quite sensitive to the genre of the data on which they are trained. Genre here refers to shifts in text topic and to new language domains. Although new detection models can be retrained for different languages or new text genres, the previous model has to be thrown away and the creation process restarted from scratch. In this paper, we present a multilingual sentence boundary detection system (iSentenizer-μ) for Danish, German, English, Spanish, Dutch, French, Italian, Portuguese, Greek, Finnish, and Swedish. The proposed system is able to detect the sentence boundaries of a mixture of different text genres and languages with high accuracy. We employ the i+Learning algorithm, an incremental tree learning architecture, to construct the system. Under the incremental learning framework, iSentenizer-μ adapts to text of different topics and Roman-alphabet languages by merging new data into the existing model, learning the new knowledge incrementally by revision instead of retraining. The system has been extensively evaluated on different languages and text genres and compared against two state-of-the-art SBD systems, Punkt and MaxEnt. The experimental results show that the proposed system outperforms the other systems on all datasets.
format Online
Article
Text
id pubmed-4030568
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Hindawi Publishing Corporation
record_format MEDLINE/PubMed
spelling pubmed-4030568 2014-06-01 iSentenizer-μ: Multilingual Sentence Boundary Detection Model Wong, Derek F. Chao, Lidia S. Zeng, Xiaodong ScientificWorldJournal Research Article Hindawi Publishing Corporation 2014 2014-04-15 /pmc/articles/PMC4030568/ /pubmed/24883358 http://dx.doi.org/10.1155/2014/196574 Text en Copyright © 2014 Derek F. Wong et al. https://creativecommons.org/licenses/by/3.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Wong, Derek F.
Chao, Lidia S.
Zeng, Xiaodong
iSentenizer-μ: Multilingual Sentence Boundary Detection Model
title iSentenizer-μ: Multilingual Sentence Boundary Detection Model
title_full iSentenizer-μ: Multilingual Sentence Boundary Detection Model
title_fullStr iSentenizer-μ: Multilingual Sentence Boundary Detection Model
title_full_unstemmed iSentenizer-μ: Multilingual Sentence Boundary Detection Model
title_short iSentenizer-μ: Multilingual Sentence Boundary Detection Model
title_sort isentenizer-μ: multilingual sentence boundary detection model
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4030568/
https://www.ncbi.nlm.nih.gov/pubmed/24883358
http://dx.doi.org/10.1155/2014/196574
work_keys_str_mv AT wongderekf isentenizermmultilingualsentenceboundarydetectionmodel
AT chaolidias isentenizermmultilingualsentenceboundarydetectionmodel
AT zengxiaodong isentenizermmultilingualsentenceboundarydetectionmodel