
Lip Reading by Alternating between Spatiotemporal and Spatial Convolutions

Lip reading (LR) is the task of predicting the speech utilizing only the visual information of the speaker. In this work, for the first time, the benefits of alternating between spatiotemporal and spatial convolutions for learning effective features from the LR sequences are studied. In this context...


Bibliographic Details
Main authors: Tsourounis, Dimitrios; Kastaniotis, Dimitris; Fotopoulos, Spiros
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321361/
https://www.ncbi.nlm.nih.gov/pubmed/34460687
http://dx.doi.org/10.3390/jimaging7050091
_version_ 1783730834354733056
author Tsourounis, Dimitrios
Kastaniotis, Dimitris
Fotopoulos, Spiros
author_facet Tsourounis, Dimitrios
Kastaniotis, Dimitris
Fotopoulos, Spiros
author_sort Tsourounis, Dimitrios
collection PubMed
description Lip reading (LR) is the task of predicting the speech utilizing only the visual information of the speaker. In this work, for the first time, the benefits of alternating between spatiotemporal and spatial convolutions for learning effective features from the LR sequences are studied. In this context, a new learnable module named ALSOS (Alternating Spatiotemporal and Spatial Convolutions) is introduced in the proposed LR system. The ALSOS module consists of spatiotemporal (3D) and spatial (2D) convolutions along with two conversion components (3D-to-2D and 2D-to-3D), providing a sequence-to-sequence mapping. The designed LR system utilizes the ALSOS module between ResNet blocks, as well as Temporal Convolutional Networks (TCNs) in the backend for classification. The whole framework is composed of feedforward convolutional and residual layers and can be trained end-to-end directly from the image sequences in the word-level LR problem. The ALSOS module can capture spatiotemporal dynamics and can be advantageous in the task of LR when combined with the ResNet topology. Experiments with different combinations of ALSOS with ResNet are performed on a dataset in the Greek language simulating a medical support application scenario and on the popular large-scale LRW-500 dataset of English words. Results indicate that the proposed ALSOS module can improve the performance of an LR system. Overall, the insertion of the ALSOS module into the ResNet architecture obtained higher classification accuracy since it incorporates the contribution of the temporal information captured at different spatial scales of the framework.
format Online
Article
Text
id pubmed-8321361
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8321361 2021-08-26 Lip Reading by Alternating between Spatiotemporal and Spatial Convolutions Tsourounis, Dimitrios Kastaniotis, Dimitris Fotopoulos, Spiros J Imaging Article MDPI 2021-05-20 /pmc/articles/PMC8321361/ /pubmed/34460687 http://dx.doi.org/10.3390/jimaging7050091 Text en © 2021 by the authors.
https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Tsourounis, Dimitrios
Kastaniotis, Dimitris
Fotopoulos, Spiros
Lip Reading by Alternating between Spatiotemporal and Spatial Convolutions
title Lip Reading by Alternating between Spatiotemporal and Spatial Convolutions
title_full Lip Reading by Alternating between Spatiotemporal and Spatial Convolutions
title_fullStr Lip Reading by Alternating between Spatiotemporal and Spatial Convolutions
title_full_unstemmed Lip Reading by Alternating between Spatiotemporal and Spatial Convolutions
title_short Lip Reading by Alternating between Spatiotemporal and Spatial Convolutions
title_sort lip reading by alternating between spatiotemporal and spatial convolutions
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321361/
https://www.ncbi.nlm.nih.gov/pubmed/34460687
http://dx.doi.org/10.3390/jimaging7050091
work_keys_str_mv AT tsourounisdimitrios lipreadingbyalternatingbetweenspatiotemporalandspatialconvolutions
AT kastaniotisdimitris lipreadingbyalternatingbetweenspatiotemporalandspatialconvolutions
AT fotopoulosspiros lipreadingbyalternatingbetweenspatiotemporalandspatialconvolutions