
Generalisation Gap of Keyword Spotters in a Cross-Speaker Low-Resource Scenario

Models for keyword spotting in continuous recordings can significantly improve the experience of navigating vast libraries of audio recordings. In this paper, we describe the development of such a keyword spotting system detecting regions of interest in Polish call centre conversations. Unfortunatel...


Bibliographic Details
Main Authors:	Lepak, Łukasz; Radzikowski, Kacper; Nowak, Robert; Piczak, Karol J.
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8704929/ https://www.ncbi.nlm.nih.gov/pubmed/34960407 http://dx.doi.org/10.3390/s21248313

_version_	1784621824143785984
author	Lepak, Łukasz; Radzikowski, Kacper; Nowak, Robert; Piczak, Karol J.
author_facet	Lepak, Łukasz; Radzikowski, Kacper; Nowak, Robert; Piczak, Karol J.
author_sort	Lepak, Łukasz
collection	PubMed
description	Models for keyword spotting in continuous recordings can significantly improve the experience of navigating vast libraries of audio recordings. In this paper, we describe the development of such a keyword spotting system detecting regions of interest in Polish call centre conversations. Unfortunately, in spite of recent advancements in automatic speech recognition systems, human-level transcription accuracy reported on English benchmarks does not reflect the performance achievable in low-resource languages, such as Polish. Therefore, in this work, we shift our focus from complete speech-to-text conversion to acoustic similarity matching in the hope of reducing the demand for data annotation. As our primary approach, we evaluate Siamese and prototypical neural networks trained on several datasets of English and Polish recordings. While we obtain usable results in English, our models’ performance remains unsatisfactory when applied to Polish speech, both after mono- and cross-lingual training. This performance gap shows that generalisation with limited training resources is a significant obstacle for actual deployments in low-resource languages. As a potential countermeasure, we implement a detector using audio embeddings generated with a generic pre-trained model provided by Google. It has a much more favourable profile when applied in a cross-lingual setup to detect Polish audio patterns. Nevertheless, despite these promising results, its performance on out-of-distribution data is still far from stellar. It would indicate that, in spite of the richness of internal representations created by more generic models, such speech embeddings are not entirely malleable to cross-language transfer.
format	Online Article Text
id	pubmed-8704929
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8704929 2021-12-25 Generalisation Gap of Keyword Spotters in a Cross-Speaker Low-Resource Scenario Lepak, Łukasz; Radzikowski, Kacper; Nowak, Robert; Piczak, Karol J. Sensors (Basel) Article MDPI 2021-12-12 /pmc/articles/PMC8704929/ /pubmed/34960407 http://dx.doi.org/10.3390/s21248313 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Lepak, Łukasz Radzikowski, Kacper Nowak, Robert Piczak, Karol J. Generalisation Gap of Keyword Spotters in a Cross-Speaker Low-Resource Scenario
title	Generalisation Gap of Keyword Spotters in a Cross-Speaker Low-Resource Scenario
title_full	Generalisation Gap of Keyword Spotters in a Cross-Speaker Low-Resource Scenario
title_fullStr	Generalisation Gap of Keyword Spotters in a Cross-Speaker Low-Resource Scenario
title_full_unstemmed	Generalisation Gap of Keyword Spotters in a Cross-Speaker Low-Resource Scenario
title_short	Generalisation Gap of Keyword Spotters in a Cross-Speaker Low-Resource Scenario
title_sort	generalisation gap of keyword spotters in a cross-speaker low-resource scenario
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8704929/ https://www.ncbi.nlm.nih.gov/pubmed/34960407 http://dx.doi.org/10.3390/s21248313
work_keys_str_mv	AT lepakłukasz generalisationgapofkeywordspottersinacrossspeakerlowresourcescenario AT radzikowskikacper generalisationgapofkeywordspottersinacrossspeakerlowresourcescenario AT nowakrobert generalisationgapofkeywordspottersinacrossspeakerlowresourcescenario AT piczakkarolj generalisationgapofkeywordspottersinacrossspeakerlowresourcescenario


Similar Items