Deep Bottleneck Features for Spoken Language Identification

A key problem in spoken language identification (LID) is to design effective representations which are specific to language information. For example, in recent years, representations based on both phonotactic and acoustic features have proven their effectiveness for LID. Although advances in machine...

Bibliographic Details
Main Authors: Jiang, Bing, Song, Yan, Wei, Si, Liu, Jun-Hua, McLoughlin, Ian Vince, Dai, Li-Rong
Format: Online Article Text
Language: English
Published: Public Library of Science 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4077656/
https://www.ncbi.nlm.nih.gov/pubmed/24983963
http://dx.doi.org/10.1371/journal.pone.0100795
_version_ 1782323630167818240
author Jiang, Bing
Song, Yan
Wei, Si
Liu, Jun-Hua
McLoughlin, Ian Vince
Dai, Li-Rong
author_facet Jiang, Bing
Song, Yan
Wei, Si
Liu, Jun-Hua
McLoughlin, Ian Vince
Dai, Li-Rong
author_sort Jiang, Bing
collection PubMed
description A key problem in spoken language identification (LID) is to design effective representations which are specific to language information. For example, in recent years, representations based on both phonotactic and acoustic features have proven their effectiveness for LID. Although advances in machine learning have led to significant improvements, LID performance is still lacking, especially for short duration speech utterances. With the hypothesis that language information is weak and represented only latently in speech, and is largely dependent on the statistical properties of the speech content, existing representations may be insufficient. Furthermore they may be susceptible to the variations caused by different speakers, specific content of the speech segments, and background noise. To address this, we propose using Deep Bottleneck Features (DBF) for spoken LID, motivated by the success of Deep Neural Networks (DNN) in speech recognition. We show that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability. To evaluate the effectiveness of this, we design two acoustic models, termed DBF-TV and parallel DBF-TV (PDBF-TV), using a DBF based i-vector representation for each speech utterance. Results on NIST language recognition evaluation 2009 (LRE09) show significant improvements over state-of-the-art systems. By fusing the output of phonotactic and acoustic approaches, we achieve an EER of 1.08%, 1.89% and 7.01% for 30 s, 10 s and 3 s test utterances respectively. Furthermore, various DBF configurations have been extensively evaluated, and an optimal system proposed.
format Online
Article
Text
id pubmed-4077656
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4077656 2014-07-03 Deep Bottleneck Features for Spoken Language Identification Jiang, Bing Song, Yan Wei, Si Liu, Jun-Hua McLoughlin, Ian Vince Dai, Li-Rong PLoS One Research Article
Public Library of Science 2014-07-01 /pmc/articles/PMC4077656/ /pubmed/24983963 http://dx.doi.org/10.1371/journal.pone.0100795 Text en © 2014 Jiang et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Jiang, Bing
Song, Yan
Wei, Si
Liu, Jun-Hua
McLoughlin, Ian Vince
Dai, Li-Rong
Deep Bottleneck Features for Spoken Language Identification
title Deep Bottleneck Features for Spoken Language Identification
title_full Deep Bottleneck Features for Spoken Language Identification
title_fullStr Deep Bottleneck Features for Spoken Language Identification
title_full_unstemmed Deep Bottleneck Features for Spoken Language Identification
title_short Deep Bottleneck Features for Spoken Language Identification
title_sort deep bottleneck features for spoken language identification
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4077656/
https://www.ncbi.nlm.nih.gov/pubmed/24983963
http://dx.doi.org/10.1371/journal.pone.0100795
work_keys_str_mv AT jiangbing deepbottleneckfeaturesforspokenlanguageidentification
AT songyan deepbottleneckfeaturesforspokenlanguageidentification
AT weisi deepbottleneckfeaturesforspokenlanguageidentification
AT liujunhua deepbottleneckfeaturesforspokenlanguageidentification
AT mcloughlinianvince deepbottleneckfeaturesforspokenlanguageidentification
AT dailirong deepbottleneckfeaturesforspokenlanguageidentification