Heterophonic speech recognition using composite phones

Heterophones pose challenges during training of automatic speech recognition (ASR) systems because they involve ambiguity in the pronunciation of an orthographic representation of a word. Heterophones are words that have the same spelling but different pronunciations. This paper addresses the proble...

Bibliographic Details
Main authors: Alkhairy, Ashraf, Jafri, Afshan
Format: Online Article Text
Language: English
Published: Springer International Publishing 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5121111/
https://www.ncbi.nlm.nih.gov/pubmed/27933264
http://dx.doi.org/10.1186/s40064-016-3332-9
_version_ 1782469341773561856
author Alkhairy, Ashraf
Jafri, Afshan
author_facet Alkhairy, Ashraf
Jafri, Afshan
author_sort Alkhairy, Ashraf
collection PubMed
description Heterophones pose challenges during training of automatic speech recognition (ASR) systems because they involve ambiguity in the pronunciation of an orthographic representation of a word. Heterophones are words that have the same spelling but different pronunciations. This paper addresses the problem of heterophonic languages by developing the concept of a Composite Phoneme (CP) as a basic pronunciation unit for speech recognition. A CP is a set of alternative sequences of phonemes. CPs are developed specifically in the context of Arabic by defining phonetic units that are consonant-centric and absorb phonemically contrastive short vowels and gemination, which are not represented in the Arabic Modern Orthography (MO). CPs alleviate the need to diacritize MO into Classical Orthography (CO), to represent short vowels and stress, before generating pronunciation in terms of Simple Phonemes (SP). We develop algorithms to generate CP pronunciation from MO, and SP pronunciation from CO, to map a word to a single pronunciation. We investigate the performance of CP, SP, UG (Undiacritized Grapheme), and DG (Diacritized Grapheme) ASRs. The experimental results suggest that UG and DG are inferior to SP and CP. For the A-SpeechDB corpus with an MO vocabulary of 8,000 words, the word error rates (WERs) with a bigram model and context-dependent phones are 11.78, 12.64, and 13.59 % for CP, SP_M (SP from manually diacritized CO), and SP_A (SP from automatically diacritized MO), respectively. For a vocabulary of 24,000 MO words, the corresponding WERs are 13.69, 15.08, and 16.86 %. For a uniform statistical model, SP has a lower WER than CP. For context-independent (CI) phones, CP has a lower WER than SP.
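The WER figures in the description can be compared more directly as relative reductions. The sketch below is not part of the record: it only recomputes a derived quantity, (baseline - CP) / baseline, from the six WER values quoted in the abstract. The names `WER`, `relative_reduction`, and `cp_gains` are hypothetical and chosen for illustration.

```python
# WERs in percent, copied from the abstract (bigram model,
# context-dependent phones, A-SpeechDB corpus). Keys are MO vocabulary sizes.
WER = {
    8000: {"CP": 11.78, "SP_M": 12.64, "SP_A": 13.59},
    24000: {"CP": 13.69, "SP_M": 15.08, "SP_A": 16.86},
}


def relative_reduction(baseline: float, system: float) -> float:
    """Relative WER reduction of `system` over `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline


def cp_gains(wer_table: dict) -> dict:
    """For each vocabulary size, relative reduction of CP over each SP variant."""
    gains = {}
    for vocab, row in wer_table.items():
        gains[vocab] = {
            name: relative_reduction(wer, row["CP"])
            for name, wer in row.items()
            if name != "CP"  # CP is the system being compared, not a baseline
        }
    return gains


if __name__ == "__main__":
    for vocab, row in cp_gains(WER).items():
        for baseline, gain in row.items():
            print(f"vocab={vocab:>6}  CP vs {baseline}: {gain:.2f} % relative WER reduction")
```

Relative reduction is used here rather than the absolute difference because the baselines differ between the two vocabulary sizes; the abstract itself reports only the absolute WERs.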
format Online
Article
Text
id pubmed-5121111
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-5121111 2016-12-08 Heterophonic speech recognition using composite phones Alkhairy, Ashraf Jafri, Afshan Springerplus Research
Springer International Publishing 2016-11-24 /pmc/articles/PMC5121111/ /pubmed/27933264 http://dx.doi.org/10.1186/s40064-016-3332-9 Text en © The Author(s) 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
spellingShingle Research
Alkhairy, Ashraf
Jafri, Afshan
Heterophonic speech recognition using composite phones
title Heterophonic speech recognition using composite phones
title_sort heterophonic speech recognition using composite phones
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5121111/
https://www.ncbi.nlm.nih.gov/pubmed/27933264
http://dx.doi.org/10.1186/s40064-016-3332-9
work_keys_str_mv AT alkhairyashraf heterophonicspeechrecognitionusingcompositephones
AT jafriafshan heterophonicspeechrecognitionusingcompositephones