
Audiovisual Tracking of Multiple Speakers in Smart Spaces

This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms...


Bibliographic Details
Main Authors:	Sanabria-Macias, Frank; Marron-Romera, Marta; Macias-Guarasa, Javier
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10422319/ https://www.ncbi.nlm.nih.gov/pubmed/37571754 http://dx.doi.org/10.3390/s23156969

author	Sanabria-Macias, Frank; Marron-Romera, Marta; Macias-Guarasa, Javier
author_sort	Sanabria-Macias, Frank
collection	PubMed
description	This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to [Formula: see text] average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to [Formula: see text] average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to [Formula: see text] average relative improvement in the MOT task for the CAV3D dataset (3D comparison).
format	Online Article Text
id	pubmed-10422319
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-10422319 2023-08-13 Audiovisual Tracking of Multiple Speakers in Smart Spaces Sanabria-Macias, Frank; Marron-Romera, Marta; Macias-Guarasa, Javier Sensors (Basel) Article MDPI 2023-08-05 /pmc/articles/PMC10422319/ /pubmed/37571754 http://dx.doi.org/10.3390/s23156969 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Audiovisual Tracking of Multiple Speakers in Smart Spaces
title_sort	audiovisual tracking of multiple speakers in smart spaces
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10422319/ https://www.ncbi.nlm.nih.gov/pubmed/37571754 http://dx.doi.org/10.3390/s23156969


Similar Items