Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations

Bibliographic Details
Main Authors: Li, Jialu, Hasegawa-Johnson, Mark, McElwain, Nancy L.
Format: Online Article Text
Language: English
Published: 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9435967/
https://www.ncbi.nlm.nih.gov/pubmed/36062214
http://dx.doi.org/10.1016/j.specom.2021.07.010
_version_ 1784781264277995520
author Li, Jialu
Hasegawa-Johnson, Mark
McElwain, Nancy L.
author_facet Li, Jialu
Hasegawa-Johnson, Mark
McElwain, Nancy L.
author_sort Li, Jialu
collection PubMed
description Classification of infant and parent vocalizations, particularly emotional vocalizations, is critical to understanding how infants learn to regulate emotions in social dyadic processes. This work is an experimental study of classifiers, features, and data augmentation strategies applied to the task of classifying infant and parent vocalization types. Our data were recorded both in the home and in the laboratory. Infant vocalizations were manually labeled as cry, fus (fuss), lau (laugh), bab (babble) or scr (screech), while parent (mostly mother) vocalizations were labeled as ids (infant-directed speech), ads (adult-directed speech), pla (playful), rhy (rhythmic speech or singing), lau (laugh) or whi (whisper). Linear discriminant analysis (LDA) was selected as a baseline classifier because it gave the highest accuracy in a previously published study covering part of this corpus. LDA was compared to two neural network architectures: a two-layer fully-connected network (FCN), and a convolutional neural network with self-attention (CNSA). Baseline features extracted using the OpenSMILE toolkit were augmented by extra voice quality, phonetic, and prosodic features, each targeting perceptual features of one or more of the vocalization types. Three data augmentation and transfer learning methods were tested: pre-training of network weights on a related task (adult emotion classification), augmentation of under-represented classes using data uniformly sampled from other corpora, and augmentation of under-represented classes using data selected by a minimum cross-corpus information difference criterion. Feature selection using Fisher scores was also tested, as were weighted and unweighted samplers. Two datasets were evaluated: a benchmark dataset (CRIED) and our own corpus.
On the CRIED dataset, the CNSA achieved a higher unweighted average recall (UAR) than previously published results. In terms of classification accuracy, weighted F1, and macro F1 on our own dataset, both neural networks significantly outperformed LDA; the FCN slightly (but not significantly) outperformed the CNSA. Cross-examining the features selected by different feature selection algorithms permits a type of post-hoc feature analysis, in which the most important acoustic features for each binary type discrimination are listed. Examples of each vocalization type, illustrating the features selected by overlapping algorithms, were chosen; their spectrograms are presented and discussed with respect to the type-discriminative acoustic features. MFCC, log Mel Frequency Band Energy, LSP frequency, and F1 are found to be the most important spectral envelope features; F0 is found to be the most important prosodic feature.
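The abstract mentions feature selection using Fisher scores. As an illustrative sketch only (not the authors' implementation), the per-feature Fisher score can be computed as the ratio of between-class variance of the class means to the pooled within-class variance; higher scores mark features that better separate the classes:

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score for each feature column: between-class variance
    of the class means divided by the pooled within-class variance."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                      # overall mean per feature
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        nc = len(Xc)
        num += nc * (Xc.mean(axis=0) - mu) ** 2   # between-class term
        den += nc * Xc.var(axis=0)                # within-class term
    return num / den

# Toy example: feature 0 separates the two classes, feature 1 is noise.
rng = np.random.default_rng(0)
X = np.vstack([
    np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1, 100)]),
    np.column_stack([rng.normal(5, 1, 100), rng.normal(0, 1, 100)]),
])
y = np.array([0] * 100 + [1] * 100)
scores = fisher_scores(X, y)
print(scores[0] > scores[1])  # → True: the discriminative feature scores higher
```

Ranking features by this score and keeping the top-k is one common way such a criterion is used for selection.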
format Online
Article
Text
id pubmed-9435967
institution National Center for Biotechnology Information
language English
publishDate 2021
record_format MEDLINE/PubMed
spelling pubmed-9435967 2022-09-01 Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations Li, Jialu Hasegawa-Johnson, Mark McElwain, Nancy L. Speech Commun Article 2021-10 2021-08-18 /pmc/articles/PMC9435967/ /pubmed/36062214 http://dx.doi.org/10.1016/j.specom.2021.07.010 Text en This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle Article
Li, Jialu
Hasegawa-Johnson, Mark
McElwain, Nancy L.
Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations
title Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations
title_full Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations
title_fullStr Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations
title_full_unstemmed Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations
title_short Analysis of acoustic and voice quality features for the classification of infant and mother vocalizations
title_sort analysis of acoustic and voice quality features for the classification of infant and mother vocalizations
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9435967/
https://www.ncbi.nlm.nih.gov/pubmed/36062214
http://dx.doi.org/10.1016/j.specom.2021.07.010
work_keys_str_mv AT lijialu analysisofacousticandvoicequalityfeaturesfortheclassificationofinfantandmothervocalizations
AT hasegawajohnsonmark analysisofacousticandvoicequalityfeaturesfortheclassificationofinfantandmothervocalizations
AT mcelwainnancyl analysisofacousticandvoicequalityfeaturesfortheclassificationofinfantandmothervocalizations