Learning the Relative Dynamic Features for Word-Level Lipreading

Lipreading is a technique for analyzing sequences of lip movements and then recognizing the speech content of a speaker. Limited by the structure of our vocal organs, the number of pronunciations we could make is finite, leading to problems with homophones when speaking. On the other hand, different...

Bibliographic Details
Main Authors:	Li, Hao; Yadikar, Nurbiya; Zhu, Yali; Mamut, Mutallip; Ubul, Kurban
Format:	Online Article Text
Language:	English
Published:	MDPI 2022
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9147953/ https://www.ncbi.nlm.nih.gov/pubmed/35632141 http://dx.doi.org/10.3390/s22103732

author	Li, Hao; Yadikar, Nurbiya; Zhu, Yali; Mamut, Mutallip; Ubul, Kurban
collection	PubMed
description	Lipreading is a technique for analyzing sequences of lip movements and then recognizing the speech content of a speaker. Limited by the structure of our vocal organs, the number of pronunciations we could make is finite, leading to problems with homophones when speaking. On the other hand, different speakers will have various lip movements for the same word. For these problems, we focused on the spatial–temporal feature extraction in word-level lipreading in this paper, and an efficient two-stream model was proposed to learn the relative dynamic information of lip motion. In this model, two different channel capacity CNN streams are used to extract static features in a single frame and dynamic information between multi-frame sequences, respectively. We explored a more effective convolution structure for each component in the front-end model and improved by about 8%. Then, according to the characteristics of the word-level lipreading dataset, we further studied the impact of the two sampling methods on the fast and slow channels. Furthermore, we discussed the influence of the fusion methods of the front-end and back-end models under the two-stream network structure. Finally, we evaluated the proposed model on two large-scale lipreading datasets and achieved a new state-of-the-art.
format	Online Article Text
id	pubmed-9147953
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-9147953 2022-05-29 Learning the Relative Dynamic Features for Word-Level Lipreading Li, Hao; Yadikar, Nurbiya; Zhu, Yali; Mamut, Mutallip; Ubul, Kurban Sensors (Basel) Article Lipreading is a technique for analyzing sequences of lip movements and then recognizing the speech content of a speaker. Limited by the structure of our vocal organs, the number of pronunciations we could make is finite, leading to problems with homophones when speaking. On the other hand, different speakers will have various lip movements for the same word. For these problems, we focused on the spatial–temporal feature extraction in word-level lipreading in this paper, and an efficient two-stream model was proposed to learn the relative dynamic information of lip motion. In this model, two different channel capacity CNN streams are used to extract static features in a single frame and dynamic information between multi-frame sequences, respectively. We explored a more effective convolution structure for each component in the front-end model and improved by about 8%. Then, according to the characteristics of the word-level lipreading dataset, we further studied the impact of the two sampling methods on the fast and slow channels. Furthermore, we discussed the influence of the fusion methods of the front-end and back-end models under the two-stream network structure. Finally, we evaluated the proposed model on two large-scale lipreading datasets and achieved a new state-of-the-art. MDPI 2022-05-13 /pmc/articles/PMC9147953/ /pubmed/35632141 http://dx.doi.org/10.3390/s22103732 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Learning the Relative Dynamic Features for Word-Level Lipreading
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9147953/ https://www.ncbi.nlm.nih.gov/pubmed/35632141 http://dx.doi.org/10.3390/s22103732

Similar Items