Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition
In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.
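The abstract describes three parallel 3D-CNN front-ends whose per-frame features are fused and passed to a two-layer bi-directional GRU, with the whole network trained under a connectionist temporal classification (CTC) loss. Below is a minimal PyTorch sketch of that pipeline. The class name, all channel counts, kernel sizes, and the simplified front-end blocks are illustrative assumptions; the record does not include the paper's hyperparameters, and the densely connected and feature-fusion variants are stood in for by plain 3D-CNN stacks.

```python
# Minimal sketch of the described multi-CNN + Bi-GRU + CTC pipeline.
# Hypothetical hyperparameters; not the authors' implementation.
import torch
import torch.nn as nn

class MultiCNNLipreader(nn.Module):
    def __init__(self, num_classes=28):  # e.g., 26 letters + space + CTC blank
        super().__init__()
        # Three parallel 3D-CNN front-ends over the video volume
        # (batch, channels, time, height, width).
        self.cnn_a = self._front_end()  # plain 3D CNN
        self.cnn_b = self._front_end()  # stand-in for the densely connected 3D CNN
        self.cnn_c = self._front_end()  # stand-in for the multi-layer feature-fusion 3D CNN
        # Two-layer bi-directional GRU over the fused per-frame features.
        self.gru = nn.GRU(input_size=3 * 64, hidden_size=256,
                          num_layers=2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 256, num_classes)

    def _front_end(self):
        # Pool only spatially, so the temporal length is preserved for CTC.
        return nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # -> (B, 64, T, 1, 1)
        )

    def forward(self, x):  # x: (B, 3, T, H, W)
        feats = [f(x).flatten(2).transpose(1, 2)  # each -> (B, T, 64)
                 for f in (self.cnn_a, self.cnn_b, self.cnn_c)]
        fused = torch.cat(feats, dim=-1)          # (B, T, 192)
        out, _ = self.gru(fused)                  # (B, T, 512)
        return self.classifier(out).log_softmax(-1)  # per-frame log-probs for CTC

# CTC training step (blank index 0 assumed); CTCLoss expects (T, B, C):
# loss = nn.CTCLoss(blank=0)(log_probs.transpose(0, 1), targets,
#                            input_lengths, target_lengths)
```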
Main Authors: | Jeon, Sanghun; Elsharkawy, Ahmed; Kim, Mun Sang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI, 2021 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8747278/ https://www.ncbi.nlm.nih.gov/pubmed/35009612 http://dx.doi.org/10.3390/s22010072 |
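The reported gains are stated as character error rate (CER) and word error rate (WER), the standard automatic speech recognition metrics. For reference only (this is not the authors' evaluation code), here is a minimal sketch of how CER and WER are computed from Levenshtein edit distance; the example sentence is illustrative.

```python
# CER/WER via Levenshtein edit distance, single-row dynamic programming.
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions."""
    dp = list(range(len(hyp) + 1))          # distances for the empty reference
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev_diag + (ref[i - 1] != hyp[j - 1]))  # substitution/match
            prev_diag = cur
    return dp[-1]

def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

# One substituted word out of six -> WER = 1/6
print(wer("bin blue at f two now", "bin blue at f one now"))  # 0.1667
```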
_version_ | 1784630795621629952 |
---|---|
author | Jeon, Sanghun; Elsharkawy, Ahmed; Kim, Mun Sang |
author_facet | Jeon, Sanghun; Elsharkawy, Ahmed; Kim, Mun Sang |
author_sort | Jeon, Sanghun |
collection | PubMed |
description | In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications. |
format | Online Article Text |
id | pubmed-8747278 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-8747278 2022-01-11 Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition Jeon, Sanghun; Elsharkawy, Ahmed; Kim, Mun Sang Sensors (Basel) Article In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications. MDPI 2021-12-23 /pmc/articles/PMC8747278/ /pubmed/35009612 http://dx.doi.org/10.3390/s22010072 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article; Jeon, Sanghun; Elsharkawy, Ahmed; Kim, Mun Sang; Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition |
title | Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition |
title_full | Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition |
title_fullStr | Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition |
title_full_unstemmed | Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition |
title_short | Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition |
title_sort | lipreading architecture based on multiple convolutional neural networks for sentence-level visual speech recognition |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8747278/ https://www.ncbi.nlm.nih.gov/pubmed/35009612 http://dx.doi.org/10.3390/s22010072 |
work_keys_str_mv | AT jeonsanghun lipreadingarchitecturebasedonmultipleconvolutionalneuralnetworksforsentencelevelvisualspeechrecognition AT elsharkawyahmed lipreadingarchitecturebasedonmultipleconvolutionalneuralnetworksforsentencelevelvisualspeechrecognition AT kimmunsang lipreadingarchitecturebasedonmultipleconvolutionalneuralnetworksforsentencelevelvisualspeechrecognition |