Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition

Serbian is in a group of highly inflective and morphologically rich languages that use a lot of different word suffixes to express different grammatical, syntactic, or semantic features. This kind of behaviour usually produces a lot of recognition errors, especially in large vocabulary systems—even...

Bibliographic Details
Main authors:	Pakoci, Edvin; Popović, Branislav; Pekar, Darko
Format:	Online Article Text
Language:	English
Published:	Hindawi 2019
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6421827/ https://www.ncbi.nlm.nih.gov/pubmed/30944554 http://dx.doi.org/10.1155/2019/5072918

author	Pakoci, Edvin; Popović, Branislav; Pekar, Darko
author_sort	Pakoci, Edvin
collection	PubMed
description	Serbian is in a group of highly inflective and morphologically rich languages that use a lot of different word suffixes to express different grammatical, syntactic, or semantic features. This kind of behaviour usually produces a lot of recognition errors, especially in large vocabulary systems—even when, due to good acoustical matching, the correct lemma is predicted by the automatic speech recognition system, often a wrong word ending occurs, which is nevertheless counted as an error. This effect is larger for contexts not present in the language model training corpus. In this manuscript, an approach which takes into account different morphological categories of words for language modeling is examined, and the benefits in terms of word error rates and perplexities are presented. These categories include word type, word case, grammatical number, and gender, and they were all assigned to words in the system vocabulary, where applicable. These additional word features helped to produce significant improvements in relation to the baseline system, both for n-gram-based and neural network-based language models. The proposed system can help overcome a lot of tedious errors in a large vocabulary system, for example, for dictation, both for Serbian and for other languages with similar characteristics.
format	Online Article Text
id	pubmed-6421827
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Hindawi
record_format	MEDLINE/PubMed
spelling	pubmed-6421827 2019-04-03 Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition. Pakoci, Edvin; Popović, Branislav; Pekar, Darko. Comput Intell Neurosci. Research Article. Hindawi 2019-03-03 /pmc/articles/PMC6421827/ /pubmed/30944554 http://dx.doi.org/10.1155/2019/5072918 Text en Copyright © 2019 Edvin Pakoci et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6421827/ https://www.ncbi.nlm.nih.gov/pubmed/30944554 http://dx.doi.org/10.1155/2019/5072918
