Audiovisual Tracking of Multiple Speakers in Smart Spaces
This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms...
Main authors: | Sanabria-Macias, Frank; Marron-Romera, Marta; Macias-Guarasa, Javier |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10422319/ https://www.ncbi.nlm.nih.gov/pubmed/37571754 http://dx.doi.org/10.3390/s23156969 |
_version_ | 1785089179888123904 |
---|---|
author | Sanabria-Macias, Frank Marron-Romera, Marta Macias-Guarasa, Javier |
author_facet | Sanabria-Macias, Frank Marron-Romera, Marta Macias-Guarasa, Javier |
author_sort | Sanabria-Macias, Frank |
collection | PubMed |
description | This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to [Formula: see text] average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to [Formula: see text] average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to [Formula: see text] average relative improvement in the MOT task for the CAV3D dataset (3D comparison). |
format | Online Article Text |
id | pubmed-10422319 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-10422319 2023-08-13 Audiovisual Tracking of Multiple Speakers in Smart Spaces Sanabria-Macias, Frank Marron-Romera, Marta Macias-Guarasa, Javier Sensors (Basel) Article This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to [Formula: see text] average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to [Formula: see text] average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to [Formula: see text] average relative improvement in the MOT task for the CAV3D dataset (3D comparison). MDPI 2023-08-05 /pmc/articles/PMC10422319/ /pubmed/37571754 http://dx.doi.org/10.3390/s23156969 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Sanabria-Macias, Frank Marron-Romera, Marta Macias-Guarasa, Javier Audiovisual Tracking of Multiple Speakers in Smart Spaces |
title | Audiovisual Tracking of Multiple Speakers in Smart Spaces |
title_full | Audiovisual Tracking of Multiple Speakers in Smart Spaces |
title_fullStr | Audiovisual Tracking of Multiple Speakers in Smart Spaces |
title_full_unstemmed | Audiovisual Tracking of Multiple Speakers in Smart Spaces |
title_short | Audiovisual Tracking of Multiple Speakers in Smart Spaces |
title_sort | audiovisual tracking of multiple speakers in smart spaces |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10422319/ https://www.ncbi.nlm.nih.gov/pubmed/37571754 http://dx.doi.org/10.3390/s23156969 |
work_keys_str_mv | AT sanabriamaciasfrank audiovisualtrackingofmultiplespeakersinsmartspaces AT marronromeramarta audiovisualtrackingofmultiplespeakersinsmartspaces AT maciasguarasajavier audiovisualtrackingofmultiplespeakersinsmartspaces |
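
The description in this record states that GAVT is based on particle filters. For readers unfamiliar with the technique, the following is a minimal sketch of a generic bootstrap particle filter tracking a 3D position. It is not the authors' implementation: the random-walk motion model, the isotropic Gaussian likelihood, the noise levels, the room dimensions, and all function names (`particle_filter_step`, `estimate`, `run_demo`) are assumptions made for illustration, and the audiovisual likelihoods described in the paper are replaced here by a single noisy position measurement.

```python
import math
import random


def particle_filter_step(particles, weights, observation, rng,
                         motion_std=0.05, obs_std=0.2):
    """One predict/update/resample step of a bootstrap particle filter.

    particles:   list of (x, y, z) position hypotheses
    weights:     list of normalized weights, same length as particles
    observation: noisy (x, y, z) measurement (hypothetical sensor)
    """
    # Predict: diffuse each particle with a random-walk motion model (assumed).
    predicted = [tuple(c + rng.gauss(0.0, motion_std) for c in p)
                 for p in particles]

    # Update: reweight by an isotropic Gaussian likelihood (assumed).
    new_weights = []
    for p, w in zip(predicted, weights):
        sq_dist = sum((a - b) ** 2 for a, b in zip(p, observation))
        new_weights.append(w * math.exp(-0.5 * sq_dist / obs_std ** 2))
    total = sum(new_weights)
    if total == 0.0:
        # Every likelihood underflowed; fall back to uniform weights.
        new_weights = [1.0 / len(predicted)] * len(predicted)
    else:
        new_weights = [w / total for w in new_weights]

    # Resample in proportion to weight, then reset to uniform weights.
    resampled = rng.choices(predicted, weights=new_weights, k=len(predicted))
    uniform = [1.0 / len(resampled)] * len(resampled)
    return resampled, uniform


def estimate(particles, weights):
    """Weighted mean of the particle positions."""
    return tuple(sum(w * p[i] for p, w in zip(particles, weights))
                 for i in range(3))


def run_demo(seed=0, n_particles=1000, n_steps=30):
    """Track a static, hypothetical speaker position in a 4 x 4 x 3 m room."""
    rng = random.Random(seed)
    true_pos = (1.0, 2.0, 1.5)  # illustrative value, not from the paper
    particles = [(rng.uniform(0, 4), rng.uniform(0, 4), rng.uniform(0, 3))
                 for _ in range(n_particles)]
    weights = [1.0 / n_particles] * n_particles
    for _ in range(n_steps):
        obs = tuple(c + rng.gauss(0.0, 0.2) for c in true_pos)
        particles, weights = particle_filter_step(particles, weights, obs, rng)
    return estimate(particles, weights), true_pos
```

Calling `run_demo()` returns the filter's final position estimate together with the simulated true position, so the Euclidean distance between the two gives a quick sanity check. In a multimodal tracker of the kind the description outlines, the update step would combine audio and video likelihoods in place of the single Gaussian term used here.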