Audiovisual Tracking of Multiple Speakers in Smart Spaces

This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms...

Bibliographic Details
Main authors: Sanabria-Macias, Frank, Marron-Romera, Marta, Macias-Guarasa, Javier
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10422319/
https://www.ncbi.nlm.nih.gov/pubmed/37571754
http://dx.doi.org/10.3390/s23156969
_version_ 1785089179888123904
author Sanabria-Macias, Frank
Marron-Romera, Marta
Macias-Guarasa, Javier
author_facet Sanabria-Macias, Frank
Marron-Romera, Marta
Macias-Guarasa, Javier
author_sort Sanabria-Macias, Frank
collection PubMed
description This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to [Formula: see text] average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to [Formula: see text] average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to [Formula: see text] average relative improvement in the MOT task for the CAV3D dataset (3D comparison).
format Online
Article
Text
id pubmed-10422319
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-104223192023-08-13 Audiovisual Tracking of Multiple Speakers in Smart Spaces Sanabria-Macias, Frank Marron-Romera, Marta Macias-Guarasa, Javier Sensors (Basel) Article This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to [Formula: see text] average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to [Formula: see text] average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to [Formula: see text] average relative improvement in the MOT task for the CAV3D dataset (3D comparison). MDPI 2023-08-05 /pmc/articles/PMC10422319/ /pubmed/37571754 http://dx.doi.org/10.3390/s23156969 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Sanabria-Macias, Frank
Marron-Romera, Marta
Macias-Guarasa, Javier
Audiovisual Tracking of Multiple Speakers in Smart Spaces
title Audiovisual Tracking of Multiple Speakers in Smart Spaces
title_full Audiovisual Tracking of Multiple Speakers in Smart Spaces
title_fullStr Audiovisual Tracking of Multiple Speakers in Smart Spaces
title_full_unstemmed Audiovisual Tracking of Multiple Speakers in Smart Spaces
title_short Audiovisual Tracking of Multiple Speakers in Smart Spaces
title_sort audiovisual tracking of multiple speakers in smart spaces
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10422319/
https://www.ncbi.nlm.nih.gov/pubmed/37571754
http://dx.doi.org/10.3390/s23156969
work_keys_str_mv AT sanabriamaciasfrank audiovisualtrackingofmultiplespeakersinsmartspaces
AT marronromeramarta audiovisualtrackingofmultiplespeakersinsmartspaces
AT maciasguarasajavier audiovisualtrackingofmultiplespeakersinsmartspaces