Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while ma...

Bibliographic Details
Main Authors:	Laptev, Aleksandr, Andrusenko, Andrei, Podluzhny, Ivan, Mitrofanov, Anton, Medennikov, Ivan, Matveev, Yuri
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8124527/ https://www.ncbi.nlm.nih.gov/pubmed/33924798 http://dx.doi.org/10.3390/s21093063

_version_	1783693231603580928
author	Laptev, Aleksandr Andrusenko, Andrei Podluzhny, Ivan Mitrofanov, Anton Medennikov, Ivan Matveev, Yuri
author_facet	Laptev, Aleksandr Andrusenko, Andrei Podluzhny, Ivan Mitrofanov, Anton Medennikov, Ivan Matveev, Yuri
author_sort	Laptev, Aleksandr
collection	PubMed
description	With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token’s contexts and to regularize their distribution for the model’s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
format	Online Article Text
id	pubmed-8124527
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8124527 2021-05-17 Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition Laptev, Aleksandr Andrusenko, Andrei Podluzhny, Ivan Mitrofanov, Anton Medennikov, Ivan Matveev, Yuri Sensors (Basel) Article With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token’s contexts and to regularize their distribution for the model’s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system. MDPI 2021-04-28 /pmc/articles/PMC8124527/ /pubmed/33924798 http://dx.doi.org/10.3390/s21093063 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Laptev, Aleksandr Andrusenko, Andrei Podluzhny, Ivan Mitrofanov, Anton Medennikov, Ivan Matveev, Yuri Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition
title	Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition
title_full	Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition
title_fullStr	Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition
title_full_unstemmed	Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition
title_short	Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition
title_sort	dynamic acoustic unit augmentation with bpe-dropout for low-resource end-to-end speech recognition
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8124527/ https://www.ncbi.nlm.nih.gov/pubmed/33924798 http://dx.doi.org/10.3390/s21093063
work_keys_str_mv	AT laptevaleksandr dynamicacousticunitaugmentationwithbpedropoutforlowresourceendtoendspeechrecognition AT andrusenkoandrei dynamicacousticunitaugmentationwithbpedropoutforlowresourceendtoendspeechrecognition AT podluzhnyivan dynamicacousticunitaugmentationwithbpedropoutforlowresourceendtoendspeechrecognition AT mitrofanovanton dynamicacousticunitaugmentationwithbpedropoutforlowresourceendtoendspeechrecognition AT medennikovivan dynamicacousticunitaugmentationwithbpedropoutforlowresourceendtoendspeechrecognition AT matveevyuri dynamicacousticunitaugmentationwithbpedropoutforlowresourceendtoendspeechrecognition
