
Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT

Speech Analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT) that implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform g...


Bibliographic Details
Main authors:	Toledano, Doroteo T., Fernández-Gallego, María Pilar, Lozano-Diez, Alicia
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2018
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6179252/ https://www.ncbi.nlm.nih.gov/pubmed/30304055 http://dx.doi.org/10.1371/journal.pone.0205355

_version_	1783362072647565312
author	Toledano, Doroteo T. Fernández-Gallego, María Pilar Lozano-Diez, Alicia
author_facet	Toledano, Doroteo T. Fernández-Gallego, María Pilar Lozano-Diez, Alicia
author_sort	Toledano, Doroteo T.
collection	PubMed
description	Speech Analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT) that implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common speech features in speech processing over the last decades. These features were particularly well suited for the previous Hidden Markov Models/Gaussian Mixture Models (HMM/GMM) state of the art in ASR. In particular, they produced highly uncorrelated features of small dimensionality (typically 13 coefficients plus deltas and double deltas), which was very convenient for diagonal covariance GMMs, for dealing with the curse of dimensionality and for the limited computing resources of a decade ago. Currently most ASR systems use Deep Neural Networks (DNN) instead of the GMMs for modeling the acoustic features, which provides more flexibility regarding the definition of the features. In particular, acoustic features can be highly correlated and can be much larger in size because the DNNs are very powerful at processing high-dimensionality inputs. Also, the computing hardware has reached a level of evolution that makes computational cost in speech processing a less relevant issue. In this context, we have decided to revisit the problem of the time-frequency resolution in speech analysis, and in particular to check if multi-resolution speech analysis (both in time and frequency) can be helpful in improving acoustic modeling using DNNs. Our experiments start with several Kaldi baseline systems for the well-known TIMIT corpus and modify them by adding multi-resolution speech representations, built by concatenating spectra computed at different time-frequency resolutions as well as post-processed and speaker-adapted features computed at different time-frequency resolutions. Our experiments show that using a multi-resolution speech representation tends to improve over results using the baseline single-resolution speech representation, which seems to confirm our main hypothesis. However, results combining multi-resolution with the highly post-processed and speaker-adapted features, which provide the best results in Kaldi for TIMIT, yield only very modest improvements.
format	Online Article Text
id	pubmed-6179252
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-61792522018-10-26 Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT Toledano, Doroteo T. Fernández-Gallego, María Pilar Lozano-Diez, Alicia PLoS One Research Article Speech Analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT) that implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common speech features in speech processing over the last decades. These features were particularly well suited for the previous Hidden Markov Models/Gaussian Mixture Models (HMM/GMM) state of the art in ASR. In particular, they produced highly uncorrelated features of small dimensionality (typically 13 coefficients plus deltas and double deltas), which was very convenient for diagonal covariance GMMs, for dealing with the curse of dimensionality and for the limited computing resources of a decade ago. Currently most ASR systems use Deep Neural Networks (DNN) instead of the GMMs for modeling the acoustic features, which provides more flexibility regarding the definition of the features. In particular, acoustic features can be highly correlated and can be much larger in size because the DNNs are very powerful at processing high-dimensionality inputs. Also, the computing hardware has reached a level of evolution that makes computational cost in speech processing a less relevant issue. In this context, we have decided to revisit the problem of the time-frequency resolution in speech analysis, and in particular to check if multi-resolution speech analysis (both in time and frequency) can be helpful in improving acoustic modeling using DNNs. Our experiments start with several Kaldi baseline systems for the well-known TIMIT corpus and modify them by adding multi-resolution speech representations, built by concatenating spectra computed at different time-frequency resolutions as well as post-processed and speaker-adapted features computed at different time-frequency resolutions. Our experiments show that using a multi-resolution speech representation tends to improve over results using the baseline single-resolution speech representation, which seems to confirm our main hypothesis. However, results combining multi-resolution with the highly post-processed and speaker-adapted features, which provide the best results in Kaldi for TIMIT, yield only very modest improvements. Public Library of Science 2018-10-10 /pmc/articles/PMC6179252/ /pubmed/30304055 http://dx.doi.org/10.1371/journal.pone.0205355 Text en © 2018 Toledano et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Toledano, Doroteo T. Fernández-Gallego, María Pilar Lozano-Diez, Alicia Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT
title	Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT
title_full	Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT
title_fullStr	Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT
title_full_unstemmed	Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT
title_short	Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT
title_sort	multi-resolution speech analysis for automatic speech recognition using deep neural networks: experiments on timit
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6179252/ https://www.ncbi.nlm.nih.gov/pubmed/30304055 http://dx.doi.org/10.1371/journal.pone.0205355
work_keys_str_mv	AT toledanodoroteot multiresolutionspeechanalysisforautomaticspeechrecognitionusingdeepneuralnetworksexperimentsontimit AT fernandezgallegomariapilar multiresolutionspeechanalysisforautomaticspeechrecognitionusingdeepneuralnetworksexperimentsontimit AT lozanodiezalicia multiresolutionspeechanalysisforautomaticspeechrecognitionusingdeepneuralnetworksexperimentsontimit
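
The abstract describes building a multi-resolution representation by concatenating spectra computed at different time-frequency resolutions. The following is a minimal NumPy sketch of that general idea, not the authors' Kaldi setup. The function names, the window lengths (10, 25 and 50 ms), the 10 ms frame shift, the Hamming window and the 1024-point FFT are all assumptions chosen for illustration, and the Mel filterbank and DCT stages mentioned in the abstract are omitted.

```python
import numpy as np


def log_spectrogram(signal, sample_rate, win_ms, shift_ms=10.0, n_fft=1024):
    """Log-magnitude STFT using a Hamming window of win_ms milliseconds.

    Hypothetical helper for illustration; all parameter defaults are assumptions.
    """
    win_len = int(round(sample_rate * win_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if win_len > n_fft:
        raise ValueError("window is longer than the FFT size")
    # Zero-pad half a window on each side so that frame t is centred on
    # sample t * shift for every window length (keeps resolutions aligned).
    padded = np.pad(signal, (win_len // 2, win_len - win_len // 2))
    n_frames = 1 + len(signal) // shift
    window = np.hamming(win_len)
    frames = np.stack(
        [padded[t * shift : t * shift + win_len] * window for t in range(n_frames)]
    )
    # Zero-padding to n_fft gives every resolution the same number of bins.
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(spectrum + 1e-10)


def multi_resolution_features(signal, sample_rate, win_lengths_ms=(10.0, 25.0, 50.0)):
    """Concatenate, frame by frame, log spectra computed with several windows.

    Short windows favour time resolution, long windows favour frequency
    resolution; the frame shift is shared so the frame counts match.
    """
    spectra = [log_spectrogram(signal, sample_rate, w) for w in win_lengths_ms]
    return np.concatenate(spectra, axis=1)
```

For one second of audio at 16 kHz, `multi_resolution_features(signal, 16000)` returns one row per 10 ms frame, with the three spectra placed side by side. A DNN acoustic model could take such a high-dimensional, correlated input directly, which is the flexibility the abstract refers to.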
