
Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

Speaker diarization systems aim to answer 'who spoke when?' in multi-speaker recordings. The datasets usually consist of meetings, TV/talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the corresponding audio. For that purpose, short video segments comprising face-only regions are extracted with a face detection technique and fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. Based on the high-confidence video segments identified by the model, the corresponding audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps generate speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. The proposed method shows a significant improvement in terms of diarization error rate (DER) compared to conventional and fully supervised audio-based speaker diarization. Its results are very close to those of more complex state-of-the-art multimodal diarization systems, which shows the significance of such a simple yet effective technique.
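The abstract only outlines the pipeline; the following Python sketch is a rough illustration of the clustering stage it describes, not the authors' implementation. It assumes that face-track MFCC-like features and audio-visual synchronization confidence scores are already available (the sync model and face detector are not reproduced here), uses placeholder data, and picks an illustrative confidence threshold.

```python
# Sketch of the GMM clustering stage: keep only well-synchronized face-track
# segments per speaker, fit one GMM per speaker, then label every audio frame.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder front-end outputs: for each visible speaker, a list of
# (segment_features, sync_confidence) pairs, where segment_features is an
# (n_frames, n_mfcc) array of audio features from that face track.
n_mfcc = 13
speakers = ["spk_A", "spk_B"]
face_tracks = {
    spk: [(rng.normal(loc=i, size=(200, n_mfcc)), rng.uniform(0.0, 1.0))
          for _ in range(10)]
    for i, spk in enumerate(speakers)
}

# 1. Keep only high-confidence (well-synchronized) segments per speaker.
CONF_THRESHOLD = 0.7  # illustrative value, not from the paper
train_data = {}
for spk, segments in face_tracks.items():
    selected = [feats for feats, conf in segments if conf >= CONF_THRESHOLD]
    if not selected:  # fall back to the single best-synchronized segment
        selected = [max(segments, key=lambda s: s[1])[0]]
    train_data[spk] = np.vstack(selected)

# 2. Train one GMM per speaker on the selected audio frames.
gmms = {
    spk: GaussianMixture(n_components=8, covariance_type="diag",
                         random_state=0).fit(feats)
    for spk, feats in train_data.items()
}

# 3. Assign every audio frame of the recording to the most likely speaker.
all_frames = rng.normal(size=(1000, n_mfcc))  # stand-in for the full recording
log_likelihoods = np.stack(
    [gmms[spk].score_samples(all_frames) for spk in speakers], axis=1)
frame_labels = np.array(speakers)[log_likelihoods.argmax(axis=1)]
print(frame_labels[:20])
```

DER, the metric cited in the abstract, is conventionally computed as the sum of missed speech, false alarm, and speaker confusion time divided by the total reference speech time.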


Bibliographic Details
Main Authors: Ahmad, Rehan; Zubair, Syed; Alquhayz, Hani; Ditta, Allah
Format: Online Article, Text
Language: English
Published: MDPI, 2019
Published in: Sensors (Basel), 25 November 2019
Subjects: Article
Collection: PubMed (record pubmed-6929047, National Center for Biotechnology Information)
Rights: © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929047/
https://www.ncbi.nlm.nih.gov/pubmed/31775385
http://dx.doi.org/10.3390/s19235163