Ensemble learning with speaker embeddings in multiple speech task stimuli for depression detection

INTRODUCTION: As a biomarker of depression, the speech signal has attracted the interest of many researchers because it is easy to collect and non-invasive. However, subjects’ speech variation under different scenes and emotional stimuli, the insufficient amount of depression speech data...

Bibliographic Details
Main Authors:	Liu, Zhenyu; Yu, Huimin; Li, Gang; Chen, Qiongqiong; Ding, Zhijie; Feng, Lei; Yao, Zhijun; Hu, Bin
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2023
Subjects:	Neuroscience
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10076578/ https://www.ncbi.nlm.nih.gov/pubmed/37034153 http://dx.doi.org/10.3389/fnins.2023.1141621

author	Liu, Zhenyu; Yu, Huimin; Li, Gang; Chen, Qiongqiong; Ding, Zhijie; Feng, Lei; Yao, Zhijun; Hu, Bin
author_sort	Liu, Zhenyu
collection	PubMed
description	INTRODUCTION: As a biomarker of depression, the speech signal has attracted the interest of many researchers because it is easy to collect and non-invasive. However, subjects’ speech variation under different scenes and emotional stimuli, the insufficient amount of depression speech data for deep learning, and the variable length of speech frame-level features all affect recognition performance. METHODS: To address these problems, this study proposes a multi-task ensemble learning method based on speaker embeddings for depression classification. First, we extract the Mel Frequency Cepstral Coefficients (MFCC), the Perceptual Linear Predictive Coefficients (PLP), and the Filter Bank (FBANK) features from the out-of-domain dataset (CN-Celeb) and train the Resnet x-vector extractor, the Time Delay Neural Network (TDNN) x-vector extractor, and the i-vector extractor. Then, we extract the corresponding fixed-length speaker embeddings from the depression speech database of the Gansu Provincial Key Laboratory of Wearable Computing. Support Vector Machine (SVM) and Random Forest (RF) classifiers are used to obtain the classification results of the speaker embeddings in nine speech tasks. To make full use of the information in speech tasks with different scenes and emotions, we aggregate the classification results of the nine tasks into new features and then obtain the final classification results with a Multilayer Perceptron (MLP). To take advantage of the complementary effects of different features, Resnet x-vectors based on different acoustic features are fused in the ensemble learning method.
RESULTS: Experimental results demonstrate that (1) MFCC-based Resnet x-vectors perform best among the nine speaker embeddings for depression detection; (2) interview speech performs better than picture description speech, and the neutral stimulus is the best of the three emotional valences in the depression recognition task; (3) our multi-task ensemble learning method with MFCC-based Resnet x-vectors can effectively identify depressed patients; (4) in all cases, the combination of MFCC-based Resnet x-vectors and PLP-based Resnet x-vectors in our ensemble learning method achieves the best results, outperforming other published studies using the depression speech database. DISCUSSION: Our multi-task ensemble learning method with MFCC-based Resnet x-vectors can effectively fuse the depression-related information of different stimuli, which provides a new approach for depression detection. The limitation of this method is that the speaker embedding extractors were pre-trained on the out-of-domain dataset. We will consider using an augmented in-domain dataset for pre-training to further improve depression recognition performance.
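The aggregation step described under METHODS (per-task classification results become new features for a final classifier) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: synthetic Gaussian vectors stand in for speaker embeddings, a nearest-centroid scorer stands in for the SVM/RF task classifiers, and a logistic regression stands in for the MLP. All function names, the embedding dimension, and the base/meta/test split are hypothetical.

```python
import math
import random

N_TASKS = 9  # nine speech tasks, as stated in the abstract


def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_base(embeddings, labels):
    """Stand-in for SVM/RF: one nearest-centroid model for a single task."""
    dep = [e for e, y in zip(embeddings, labels) if y == 1]
    ctl = [e for e, y in zip(embeddings, labels) if y == 0]
    return centroid(dep), centroid(ctl)


def task_score(x, c_dep, c_ctl):
    """Positive when x lies closer to the depressed centroid."""
    return math.dist(x, c_ctl) - math.dist(x, c_dep)


def stack_features(subject, base_models):
    """Aggregate the nine task-level scores into one feature vector."""
    return [task_score(subject[t], *base_models[t]) for t in range(N_TASKS)]


def fit_meta(features, labels, epochs=200, lr=0.1):
    """Stand-in for the MLP: logistic regression trained by plain SGD."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            g = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def make_subject(rng, label, dim=4):
    """Synthetic subject: one toy 'embedding' per task (assumed data)."""
    shift = 0.8 if label == 1 else 0.0
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(N_TASKS)]


rng = random.Random(0)
labels = [i % 2 for i in range(120)]
subjects = [make_subject(rng, y) for y in labels]

# Separate subjects for the base and meta stages so that the meta-classifier
# is not trained on scores the base models produced for their own training data.
base_X, base_y = subjects[:40], labels[:40]
meta_X, meta_y = subjects[40:80], labels[40:80]
test_X, test_y = subjects[80:], labels[80:]

base_models = [fit_base([s[t] for s in base_X], base_y) for t in range(N_TASKS)]
w, b = fit_meta([stack_features(s, base_models) for s in meta_X], meta_y)


def predict(subject):
    """Final decision from the nine aggregated task scores (1 = depressed)."""
    f = stack_features(subject, base_models)
    return int(b + sum(wi * fi for wi, fi in zip(w, f)) > 0)


accuracy = sum(predict(s) == y for s, y in zip(test_X, test_y)) / len(test_y)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The point of the sketch is the shape of the data flow: each subject contributes nine task-level scores, and only that nine-dimensional vector reaches the final classifier. The accuracy it prints reflects the synthetic data only and says nothing about the results reported in the article.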
format	Online Article Text
id	pubmed-10076578
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-10076578 2023-04-07 Ensemble learning with speaker embeddings in multiple speech task stimuli for depression detection Liu, Zhenyu; Yu, Huimin; Li, Gang; Chen, Qiongqiong; Ding, Zhijie; Feng, Lei; Yao, Zhijun; Hu, Bin Front Neurosci Neuroscience Frontiers Media S.A. 2023-03-23 /pmc/articles/PMC10076578/ /pubmed/37034153 http://dx.doi.org/10.3389/fnins.2023.1141621 Text en Copyright © 2023 Liu, Yu, Li, Chen, Ding, Feng, Yao and Hu. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Ensemble learning with speaker embeddings in multiple speech task stimuli for depression detection
topic	Neuroscience
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10076578/ https://www.ncbi.nlm.nih.gov/pubmed/37034153 http://dx.doi.org/10.3389/fnins.2023.1141621
