
Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model

Speaker diarization systems aim to answer 'who spoke when?' in multi-speaker recordings. The datasets usually consist of meetings, TV/talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the corresponding audio. For that purpose, short video segments comprising face-only regions are extracted with a face detection technique and fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. Based on the high-confidence video segments identified by the model, the corresponding audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps generate speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. The proposed method shows a significant improvement in terms of diarization error rate (DER) compared to conventional and fully supervised audio-based speaker diarization. Its results are very close to those of more complex state-of-the-art multimodal diarization systems, which shows the significance of such a simple yet effective technique.
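The abstract only outlines the pipeline; the following Python sketch is a rough illustration of the clustering stage it describes, not the authors' implementation. It assumes that face-track MFCC-like features and audio-visual synchronization confidence scores are already available (the sync model and face detector are not reproduced here), uses placeholder data, and picks an illustrative confidence threshold.

```python
# Sketch of the GMM clustering stage: keep only well-synchronized face-track
# segments per speaker, fit one GMM per speaker, then label every audio frame.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder front-end outputs: for each visible speaker, a list of
# (segment_features, sync_confidence) pairs, where segment_features is an
# (n_frames, n_mfcc) array of audio features from that face track.
n_mfcc = 13
speakers = ["spk_A", "spk_B"]
face_tracks = {
    spk: [(rng.normal(loc=i, size=(200, n_mfcc)), rng.uniform(0.0, 1.0))
          for _ in range(10)]
    for i, spk in enumerate(speakers)
}

# 1. Keep only high-confidence (well-synchronized) segments per speaker.
CONF_THRESHOLD = 0.7  # illustrative value, not from the paper
train_data = {}
for spk, segments in face_tracks.items():
    selected = [feats for feats, conf in segments if conf >= CONF_THRESHOLD]
    if not selected:  # fall back to the single best-synchronized segment
        selected = [max(segments, key=lambda s: s[1])[0]]
    train_data[spk] = np.vstack(selected)

# 2. Train one GMM per speaker on the selected audio frames.
gmms = {
    spk: GaussianMixture(n_components=8, covariance_type="diag",
                         random_state=0).fit(feats)
    for spk, feats in train_data.items()
}

# 3. Assign every audio frame of the recording to the most likely speaker.
all_frames = rng.normal(size=(1000, n_mfcc))  # stand-in for the full recording
log_likelihoods = np.stack(
    [gmms[spk].score_samples(all_frames) for spk in speakers], axis=1)
frame_labels = np.array(speakers)[log_likelihoods.argmax(axis=1)]
print(frame_labels[:20])
```

DER, the metric cited in the abstract, is conventionally computed as the sum of missed speech, false alarm, and speaker confusion time divided by the total reference speech time.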


Bibliographic Details
Main Authors: Ahmad, Rehan; Zubair, Syed; Alquhayz, Hani; Ditta, Allah
Format: Online Article, Text
Language: English
Published: MDPI, 2019
Published in: Sensors (Basel), 25 November 2019
Subjects: Article
Collection: PubMed (record pubmed-6929047, National Center for Biotechnology Information)
Rights: © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929047/
https://www.ncbi.nlm.nih.gov/pubmed/31775385
http://dx.doi.org/10.3390/s19235163