Privacy-Preserving Deep Speaker Separation for Smartphone-Based Passive Speech Assessment
Goal: Smartphones can be used to passively assess and monitor patients’ speech impairments caused by ailments such as Parkinson’s disease, Traumatic Brain Injury (TBI), Post-Traumatic Stress Disorder (PTSD) and neurodegenerative diseases such as Alzheimer’s disease and dementia. However, passive audio recordings in natural settings often capture the speech of non-target speakers (cross-talk). Consequently, speaker separation, which identifies the target speakers’ speech in audio recordings with two or more speakers’ voices, is a crucial pre-processing step in such scenarios. Prior speech separation methods analyzed raw audio. However, in order to preserve speaker privacy, passively recorded smartphone audio and machine learning-based speech assessment are often performed on derived speech features such as Mel-Frequency Cepstral Coefficients (MFCCs). In this paper, we propose a novel Deep MFCC bAsed SpeaKer Separation (Deep-MASKS). Methods: Deep-MASKS uses an autoencoder to reconstruct MFCC components of an individual’s speech from an i-vector, x-vector or d-vector representation of their speech learned during the enrollment period. Deep-MASKS utilizes a Deep Neural Network (DNN) for MFCC signal reconstructions, which yields a more accurate, higher-order function compared to prior work that utilized a mask. Unlike prior work that operates on utterances, Deep-MASKS operates on continuous audio recordings. Results: Deep-MASKS outperforms baselines, reducing the Mean Squared Error (MSE) of MFCC reconstruction by up to 44% and the number of additional bits required to represent clean speech entropy by 36%.
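The abstract describes a DNN that maps mixed-speech MFCC frames, conditioned on a target speaker's enrollment embedding (an i-vector, x-vector or d-vector), directly to that speaker's clean MFCCs rather than predicting a multiplicative mask. The forward pass of such a network can be sketched as below; note this is a minimal illustrative sketch, not the paper's actual model: the layer sizes, embedding dimension, and the `DeepMasksSketch` class itself are assumptions, and the weights here are random (untrained).

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


class DeepMasksSketch:
    """Toy forward pass of a Deep-MASKS-style reconstruction network.

    Maps a frame of mixed-speech MFCCs, concatenated with the target
    speaker's enrollment embedding, to an estimate of that speaker's
    clean MFCCs. Hyperparameters are illustrative assumptions only.
    """

    def __init__(self, n_mfcc=13, emb_dim=64, hidden=128, seed=0):
        rng = np.random.default_rng(seed)
        d_in = n_mfcc + emb_dim
        # Random (untrained) weights; a real system would learn these
        # during the enrollment/training phase described in the abstract.
        self.W1 = rng.standard_normal((d_in, hidden)) * 0.1
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, n_mfcc)) * 0.1
        self.b2 = np.zeros(n_mfcc)

    def reconstruct(self, mixed_mfcc, speaker_emb):
        # mixed_mfcc: (frames, n_mfcc); speaker_emb: (emb_dim,)
        frames = mixed_mfcc.shape[0]
        emb = np.broadcast_to(speaker_emb, (frames, speaker_emb.shape[0]))
        x = np.concatenate([mixed_mfcc, emb], axis=1)
        h = relu(x @ self.W1 + self.b1)
        return h @ self.W2 + self.b2  # direct MFCC estimate, not a mask


net = DeepMasksSketch()
mixed = np.random.default_rng(1).standard_normal((100, 13))  # 100 MFCC frames
emb = np.random.default_rng(2).standard_normal(64)           # enrollment vector
clean_est = net.reconstruct(mixed, emb)                      # shape (100, 13)
```

Because the network regresses MFCC values directly, it can represent a higher-order mapping than an element-wise mask applied to the mixture, which is the contrast the abstract draws with prior work.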
Format: Online Article Text
Language: English
Published: IEEE, 2021
Online access:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8940203/
https://www.ncbi.nlm.nih.gov/pubmed/35402977
http://dx.doi.org/10.1109/OJEMB.2021.3063994
Record details:
Collection: PubMed (National Center for Biotechnology Information)
Record ID: pubmed-8940203
Record format: MEDLINE/PubMed
Publisher: IEEE
Published online: 2021-03-04
License: This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/