
Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices

In recent years, machine learning techniques have produced state-of-the-art results in several audio-related tasks. The success of these approaches is largely due to the availability of large open-source datasets and growing computational resources. However, a shortcoming of these methods is that they often fail to generalize to real-life scenarios because of domain mismatch. One such task is foreground speech detection from wearable audio devices. Interfering factors such as dynamically varying environmental conditions, including background speakers, TV, or radio audio, make foreground speech detection challenging. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate the development of such models from annotations available at a lower time resolution (coarse labels). We show how MIL can be applied to localize foreground speech in coarsely labeled audio and report both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to the densely distributed events observed in our application. Finally, we show improvements from using speech activity detection embeddings as features for foreground detection.


Bibliographic Details
Main Authors: Hebbar, Rajat, Papadopoulos, Pavlos, Reyes, Ramon, Danvers, Alexander F., Polsinelli, Angelina J., Moseley, Suzanne A., Sbarra, David A., Mehl, Matthias R., Narayanan, Shrikanth
Format: Online Article Text
Language: English
Published: Springer International Publishing 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7858549/
https://www.ncbi.nlm.nih.gov/pubmed/33584835
http://dx.doi.org/10.1186/s13636-020-00194-0
collection PubMed
description In recent years, machine learning techniques have produced state-of-the-art results in several audio-related tasks. The success of these approaches is largely due to the availability of large open-source datasets and growing computational resources. However, a shortcoming of these methods is that they often fail to generalize to real-life scenarios because of domain mismatch. One such task is foreground speech detection from wearable audio devices. Interfering factors such as dynamically varying environmental conditions, including background speakers, TV, or radio audio, make foreground speech detection challenging. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate the development of such models from annotations available at a lower time resolution (coarse labels). We show how MIL can be applied to localize foreground speech in coarsely labeled audio and report both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to the densely distributed events observed in our application. Finally, we show improvements from using speech activity detection embeddings as features for foreground detection.
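The abstract describes the MIL setup: a coarsely labeled audio clip is treated as a "bag" of short instances, a pooling function turns per-instance scores into a bag-level prediction, and instance-level scores localize the event. The sketch below illustrates the general idea with two common MIL pooling operators (max and linear softmax, the latter often preferred for densely distributed events); the scores, bag length, and threshold are illustrative assumptions, not the paper's actual architecture or values.

```python
import numpy as np

def max_pool(scores):
    # Classic MIL pooling: the bag is scored by its single most
    # confident instance.
    return float(np.max(scores))

def linear_softmax_pool(scores):
    # Weights each instance by its own score, so many moderately
    # confident instances (densely distributed events) contribute,
    # rather than only the single maximum.
    s = np.asarray(scores, dtype=float)
    return float(np.sum(s ** 2) / np.sum(s))

# Hypothetical per-instance foreground-speech probabilities for one
# coarsely labeled 10 s bag split into ten 1 s instances.
instance_scores = np.array([0.1, 0.2, 0.9, 0.8, 0.85, 0.1, 0.05, 0.9, 0.7, 0.2])

bag_score_max = max_pool(instance_scores)           # bag-level prediction
bag_score_soft = linear_softmax_pool(instance_scores)

# Instance-level localization: threshold the per-instance scores
# (0.5 is an illustrative threshold).
foreground_mask = instance_scores >= 0.5
```

During training, only the bag-level label is supervised through the pooling function; the per-instance scores used for localization come out as a by-product, which is what makes coarse labels sufficient.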
id pubmed-7858549
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-7858549 2021-02-11
journal EURASIP J Audio Speech Music Process (Research)
Springer International Publishing 2021-02-03 /pmc/articles/PMC7858549/ /pubmed/33584835 http://dx.doi.org/10.1186/s13636-020-00194-0 Text en
© The Author(s) 2021. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
topic Research