Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification
Multimodal deep learning, in the context of biometrics, encounters significant challenges due to the dependence on long speech utterances and RGB images, which are often impractical in certain situations. This paper presents a novel solution addressing these issues by leveraging ultrashort voice utterances and depth videos of the lip for person identification. The proposed method utilizes an amalgamation of residual neural networks to encode depth videos and a Time Delay Neural Network architecture to encode voice signals. In an effort to fuse information from these different modalities, we integrate self-attention and engineer a noise-resistant model that effectively manages diverse types of noise. Through rigorous testing on a benchmark dataset, our approach exhibits superior performance over existing methods, resulting in an average improvement of [Formula: see text]. This method is notably efficient for scenarios where extended utterances and RGB images are unfeasible or unattainable. Furthermore, its potential extends to various multimodal applications beyond just person identification.
Main Authors: | Moufidi, Abderrazzaq; Rousseau, David; Rasti, Pejman |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10346165/ https://www.ncbi.nlm.nih.gov/pubmed/37447739 http://dx.doi.org/10.3390/s23135890 |
_version_ | 1785073249804091392 |
---|---|
author | Moufidi, Abderrazzaq Rousseau, David Rasti, Pejman |
author_facet | Moufidi, Abderrazzaq Rousseau, David Rasti, Pejman |
author_sort | Moufidi, Abderrazzaq |
collection | PubMed |
description | Multimodal deep learning, in the context of biometrics, encounters significant challenges due to the dependence on long speech utterances and RGB images, which are often impractical in certain situations. This paper presents a novel solution addressing these issues by leveraging ultrashort voice utterances and depth videos of the lip for person identification. The proposed method utilizes an amalgamation of residual neural networks to encode depth videos and a Time Delay Neural Network architecture to encode voice signals. In an effort to fuse information from these different modalities, we integrate self-attention and engineer a noise-resistant model that effectively manages diverse types of noise. Through rigorous testing on a benchmark dataset, our approach exhibits superior performance over existing methods, resulting in an average improvement of [Formula: see text]. This method is notably efficient for scenarios where extended utterances and RGB images are unfeasible or unattainable. Furthermore, its potential extends to various multimodal applications beyond just person identification. |
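The description above names the three components (a residual-network encoder for depth videos, a TDNN encoder for voice, and self-attention fusion) only at a high level; the paper's exact layer configuration is not given in this record. Purely as an illustration of the fusion step, here is a minimal NumPy sketch of single-head self-attention over the two modality embeddings. All dimensions, weight matrices, and function names are hypothetical stand-ins, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_fusion(voice_emb, video_emb, Wq, Wk, Wv):
    """Fuse two modality embeddings with single-head self-attention.

    voice_emb, video_emb: (d,) vectors from the modality encoders.
    Wq, Wk, Wv: (d, d) projection matrices (learned in a real model,
    random here). Returns one fused (d,) identity embedding.
    """
    X = np.stack([voice_emb, video_emb])    # (2, d): one token per modality
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # query / key / value projections
    scores = Q @ K.T / np.sqrt(X.shape[1])  # scaled dot-product, shape (2, 2)
    attn = softmax(scores, axis=-1)         # each token attends to both modalities
    fused_tokens = attn @ V                 # (2, d) attended tokens
    return fused_tokens.mean(axis=0)        # pool to a single embedding

rng = np.random.default_rng(0)
d = 8  # embedding size (hypothetical)
voice = rng.normal(size=d)  # stand-in for a TDNN utterance embedding
video = rng.normal(size=d)  # stand-in for a ResNet depth-video embedding
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = self_attention_fusion(voice, video, Wq, Wk, Wv)
print(fused.shape)  # (8,)
```

In a real system the pooled vector would feed a classification or embedding-matching head for identification; the attention weights let each modality reweight the other, which is one common way such fusion handles a noisy channel.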
format | Online Article Text |
id | pubmed-10346165 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-10346165 2023-07-15 Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification Moufidi, Abderrazzaq Rousseau, David Rasti, Pejman Sensors (Basel) Article Multimodal deep learning, in the context of biometrics, encounters significant challenges due to the dependence on long speech utterances and RGB images, which are often impractical in certain situations. This paper presents a novel solution addressing these issues by leveraging ultrashort voice utterances and depth videos of the lip for person identification. The proposed method utilizes an amalgamation of residual neural networks to encode depth videos and a Time Delay Neural Network architecture to encode voice signals. In an effort to fuse information from these different modalities, we integrate self-attention and engineer a noise-resistant model that effectively manages diverse types of noise. Through rigorous testing on a benchmark dataset, our approach exhibits superior performance over existing methods, resulting in an average improvement of [Formula: see text]. This method is notably efficient for scenarios where extended utterances and RGB images are unfeasible or unattainable. Furthermore, its potential extends to various multimodal applications beyond just person identification. MDPI 2023-06-25 /pmc/articles/PMC10346165/ /pubmed/37447739 http://dx.doi.org/10.3390/s23135890 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Moufidi, Abderrazzaq Rousseau, David Rasti, Pejman Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification |
title | Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification |
title_full | Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification |
title_fullStr | Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification |
title_full_unstemmed | Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification |
title_short | Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification |
title_sort | attention-based fusion of ultrashort voice utterances and depth videos for multimodal person identification |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10346165/ https://www.ncbi.nlm.nih.gov/pubmed/37447739 http://dx.doi.org/10.3390/s23135890 |
work_keys_str_mv | AT moufidiabderrazzaq attentionbasedfusionofultrashortvoiceutterancesanddepthvideosformultimodalpersonidentification AT rousseaudavid attentionbasedfusionofultrashortvoiceutterancesanddepthvideosformultimodalpersonidentification AT rastipejman attentionbasedfusionofultrashortvoiceutterancesanddepthvideosformultimodalpersonidentification |