Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model
Speaker diarization systems aim to find ‘who spoke when?’ in multi-speaker recordings. Such datasets usually consist of meetings, TV/talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker...
Main Authors: | Ahmad, Rehan; Zubair, Syed; Alquhayz, Hani; Ditta, Allah
Format: | Online Article Text
Language: | English
Published: | MDPI 2019
Subjects: | Article
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929047/ https://www.ncbi.nlm.nih.gov/pubmed/31775385 http://dx.doi.org/10.3390/s19235163
_version_ | 1783482614075621376 |
author | Ahmad, Rehan Zubair, Syed Alquhayz, Hani Ditta, Allah |
author_facet | Ahmad, Rehan Zubair, Syed Alquhayz, Hani Ditta, Allah |
author_sort | Ahmad, Rehan |
collection | PubMed |
description | Speaker diarization systems aim to find ‘who spoke when?’ in multi-speaker recordings. Such datasets usually consist of meetings, TV/talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the respective audio. For that purpose, short video segments comprising face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. On the basis of the high-confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and a different 5.8 h set of multimodal recordings. A significant improvement in terms of diarization error rate (DER) is observed with the proposed method when compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to those of complex state-of-the-art multimodal diarization systems, which shows the significance of such a simple yet effective technique. (A minimal illustrative sketch of the clustering step follows this record.)
format | Online Article Text |
id | pubmed-6929047 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-6929047 2019-12-26 Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model Ahmad, Rehan Zubair, Syed Alquhayz, Hani Ditta, Allah Sensors (Basel) Article Speaker diarization systems aim to find ‘who spoke when?’ in multi-speaker recordings. Such datasets usually consist of meetings, TV/talk shows, telephone calls, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique that finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to measure the synchronization between a visible person and the respective audio. For that purpose, short video segments comprising face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. On the basis of the high-confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and a different 5.8 h set of multimodal recordings. A significant improvement in terms of diarization error rate (DER) is observed with the proposed method when compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to those of complex state-of-the-art multimodal diarization systems, which shows the significance of such a simple yet effective technique. MDPI 2019-11-25 /pmc/articles/PMC6929047/ /pubmed/31775385 http://dx.doi.org/10.3390/s19235163 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle | Article Ahmad, Rehan Zubair, Syed Alquhayz, Hani Ditta, Allah Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model |
title | Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model |
title_full | Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model |
title_fullStr | Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model |
title_full_unstemmed | Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model |
title_short | Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model |
title_sort | multimodal speaker diarization using a pre-trained audio-visual synchronization model |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6929047/ https://www.ncbi.nlm.nih.gov/pubmed/31775385 http://dx.doi.org/10.3390/s19235163 |
work_keys_str_mv | AT ahmadrehan multimodalspeakerdiarizationusingapretrainedaudiovisualsynchronizationmodel AT zubairsyed multimodalspeakerdiarizationusingapretrainedaudiovisualsynchronizationmodel AT alquhayzhani multimodalspeakerdiarizationusingapretrainedaudiovisualsynchronizationmodel AT dittaallah multimodalspeakerdiarizationusingapretrainedaudiovisualsynchronizationmodel |
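The description field above outlines the method's pipeline: score face-only video segments with a pre-trained audio-visual synchronization model, keep the audio frames behind the high-confidence segments, train one GMM per visible speaker on those frames, and label every frame with the best-fitting GMM. Below is a minimal, illustrative Python sketch of that clustering step only; it is not the authors' code. The `sync_confidence` matrix, the 0.7 threshold, the 13-dimensional stand-in features, and the synthetic data are all assumptions made for the demo; a real system would extract MFCCs from the audio and obtain confidences from the pre-trained two-stream model.

```python
# Minimal sketch (assumed, not the authors' implementation) of the clustering
# step: audio frames whose video segments score high on an audio-visual
# synchronization model seed one GMM per visible speaker; all frames are then
# assigned to the GMM with the highest log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for MFCC frames of a two-speaker recording (synthetic here):
# each speaker contributes a different 13-dimensional Gaussian blob.
frames = np.vstack([rng.normal(0.0, 1.0, (500, 13)),
                    rng.normal(3.0, 1.0, (500, 13))])

# Hypothetical per-frame sync scores, one column per visible face
# (higher = better audio-visual match); a real system would get these
# from the pre-trained synchronization network.
sync_confidence = np.vstack([np.column_stack([rng.uniform(0.6, 1.0, 500),
                                              rng.uniform(0.0, 0.4, 500)]),
                             np.column_stack([rng.uniform(0.0, 0.4, 500),
                                              rng.uniform(0.6, 1.0, 500)])])

THRESHOLD = 0.7  # assumed cut-off defining "high-confidence" segments

# Train one speaker-specific GMM on each speaker's high-confidence frames.
gmms = []
for spk in range(sync_confidence.shape[1]):
    seed = frames[sync_confidence[:, spk] > THRESHOLD]
    gmms.append(GaussianMixture(n_components=4, covariance_type="diag",
                                random_state=0).fit(seed))

# Diarize every frame by maximum log-likelihood across the speaker GMMs.
loglik = np.column_stack([g.score_samples(frames) for g in gmms])
labels = loglik.argmax(axis=1)
print("frames assigned per speaker:", np.bincount(labels))
```

Diagonal covariances are used in the sketch because the high-confidence seed segments can be short, and full covariance matrices would be poorly estimated from few frames; the component count and covariance type are illustrative choices, not values taken from the paper.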