Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition
Main Authors: | Yu, Wentao; Zeiler, Steffen; Kolossa, Dorothea |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI, 2022 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9370936/ https://www.ncbi.nlm.nih.gov/pubmed/35898005 http://dx.doi.org/10.3390/s22155501 |
_version_ | 1784766970977058816 |
author | Yu, Wentao; Zeiler, Steffen; Kolossa, Dorothea |
author_facet | Yu, Wentao; Zeiler, Steffen; Kolossa, Dorothea |
author_sort | Yu, Wentao |
collection | PubMed |
description | Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR, whether hybrid or end-to-end (E2E), still does not appear to make optimal use of this secondary information stream as the performance is still clearly diminished in noisy conditions for large-vocabulary systems. We, therefore, propose a new fusion architecture: the decision fusion net (DFN). A broad range of time-variant reliability measures are used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word-error-rate reduction of 43%, both showing the efficacy of our proposed fusion architecture. |
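The abstract contrasts the DFN with conventional dynamic stream weighting, where per-frame reliability measures determine how much each modality contributes to the fused decision. The sketch below illustrates that baseline idea only, not the paper's DFN implementation; the function name, the softmax-based weight mapping, and the choice of reliability features (e.g. estimated SNR for audio, face-detector confidence for video) are all assumptions for illustration:

```python
import numpy as np

def fuse_streams(log_post_audio, log_post_video, reliability):
    """Reliability-weighted decision fusion of per-frame log-posteriors.

    log_post_audio, log_post_video: (T, C) arrays of per-frame class
    log-posteriors from the audio and video streams.
    reliability: (T, 2) array of per-frame reliability scores, one
    column per stream (hypothetical features, e.g. SNR estimate for
    audio, face-detector confidence for video).
    Returns the (T,) array of fused per-frame class decisions.
    """
    # Map reliability scores to convex per-frame stream weights
    # via a numerically stable softmax over the stream axis.
    exp_r = np.exp(reliability - reliability.max(axis=1, keepdims=True))
    weights = exp_r / exp_r.sum(axis=1, keepdims=True)  # (T, 2)

    # Convex, time-variant combination of the two streams.
    fused = (weights[:, 0:1] * log_post_audio
             + weights[:, 1:2] * log_post_video)
    return fused.argmax(axis=1)
```

In frames where the audio reliability dominates, the fused decision follows the audio stream, and vice versa; an oracle version of this scheme (weights chosen with knowledge of the correct answer) is the upper bound the abstract says the DFN surpasses.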
format | Online Article Text |
id | pubmed-9370936 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-9370936 2022-08-12 Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition Yu, Wentao Zeiler, Steffen Kolossa, Dorothea Sensors (Basel) Article Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR, whether hybrid or end-to-end (E2E), still does not appear to make optimal use of this secondary information stream as the performance is still clearly diminished in noisy conditions for large-vocabulary systems. We, therefore, propose a new fusion architecture: the decision fusion net (DFN). A broad range of time-variant reliability measures are used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream-weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word-error-rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word-error-rate reduction of 43%, both showing the efficacy of our proposed fusion architecture. MDPI 2022-07-23 /pmc/articles/PMC9370936/ /pubmed/35898005 http://dx.doi.org/10.3390/s22155501 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Yu, Wentao Zeiler, Steffen Kolossa, Dorothea Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition |
title | Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition |
title_full | Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition |
title_fullStr | Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition |
title_full_unstemmed | Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition |
title_short | Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition |
title_sort | reliability-based large-vocabulary audio-visual speech recognition |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9370936/ https://www.ncbi.nlm.nih.gov/pubmed/35898005 http://dx.doi.org/10.3390/s22155501 |
work_keys_str_mv | AT yuwentao reliabilitybasedlargevocabularyaudiovisualspeechrecognition AT zeilersteffen reliabilitybasedlargevocabularyaudiovisualspeechrecognition AT kolossadorothea reliabilitybasedlargevocabularyaudiovisualspeechrecognition |