
Improving the workflow to crack Small, Unbalanced, Noisy, but Genuine (SUNG) datasets in bioacoustics: The case of bonobo calls

Despite the accumulation of data and studies, deciphering animal vocal communication remains challenging. In most cases, researchers must deal with the sparse recordings composing Small, Unbalanced, Noisy, but Genuine (SUNG) datasets. SUNG datasets are characterized by a limited number of recordings...


Bibliographic Details
Main authors: Arnaud, Vincent, Pellegrino, François, Keenan, Sumir, St-Gelais, Xavier, Mathevon, Nicolas, Levréro, Florence, Coupé, Christophe
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10129004/
https://www.ncbi.nlm.nih.gov/pubmed/37053268
http://dx.doi.org/10.1371/journal.pcbi.1010325
author Arnaud, Vincent
Pellegrino, François
Keenan, Sumir
St-Gelais, Xavier
Mathevon, Nicolas
Levréro, Florence
Coupé, Christophe
author_sort Arnaud, Vincent
collection PubMed
description Despite the accumulation of data and studies, deciphering animal vocal communication remains challenging. In most cases, researchers must deal with the sparse recordings composing Small, Unbalanced, Noisy, but Genuine (SUNG) datasets. SUNG datasets are characterized by a limited number of recordings, most often noisy, and unbalanced in number between the individuals or categories of vocalizations. SUNG datasets therefore offer a valuable but inevitably distorted vision of communication systems. Adopting the best practices in their analysis is essential to effectively extract the available information and draw reliable conclusions. Here we show that the most recent advances in machine learning applied to a SUNG dataset succeed in unraveling the complex vocal repertoire of the bonobo, and we propose a workflow that can be effective with other animal species. We implement acoustic parameterization in three feature spaces and run a Supervised Uniform Manifold Approximation and Projection (S-UMAP) to evaluate how call types and individual signatures cluster in the bonobo acoustic space. We then implement three classification algorithms (Support Vector Machine, xgboost, neural networks) and their combination to explore the structure and variability of bonobo calls, as well as the robustness of the individual signature they encode. We underscore how classification performance is affected by the feature set and identify the most informative features. In addition, we highlight the need to address data leakage in the evaluation of classification performance to avoid misleading interpretations. Our results lead to identifying several practical approaches that are generalizable to any other animal communication system. 
To improve the reliability and replicability of vocal communication studies with SUNG datasets, we thus recommend: i) comparing several acoustic parameterizations; ii) visualizing the dataset with supervised UMAP to examine the species acoustic space; iii) adopting Support Vector Machines as the baseline classification approach; iv) explicitly evaluating data leakage and possibly implementing a mitigation strategy.
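Recommendation iv) can be illustrated with a small, self-contained sketch. It is not the authors' pipeline: the synthetic data, the 1-nearest-neighbour stand-in classifier (used here instead of an SVM to avoid external dependencies), and the assumption that leakage arises when calls from the same recording sequence land in both the training and test sets are all illustrative choices, and the helper names (`make_calls`, `random_split`, `grouped_split`, `nn_accuracy`) are hypothetical. The sequence-specific offset is deliberately exaggerated so the effect is easy to see.

```python
import random

def make_calls(rng, n_individuals=4, seqs_per_ind=6, calls_per_seq=10):
    """Build synthetic calls as (features, individual_id, sequence_id) tuples."""
    calls = []
    seq_id = 0
    for ind in range(n_individuals):
        for _ in range(seqs_per_ind):
            # Assumption: each recording sequence adds its own large offset
            # (context, distance, background noise), shared by all its calls.
            offset = [rng.gauss(0, 5.0), rng.gauss(0, 5.0)]
            for _ in range(calls_per_seq):
                x = [ind + offset[0] + rng.gauss(0, 0.05),
                     offset[1] + rng.gauss(0, 0.05)]
                calls.append((x, ind, seq_id))
            seq_id += 1
    return calls

def nn_accuracy(train, test):
    """Accuracy of a 1-nearest-neighbour classifier (stand-in for an SVM)."""
    correct = 0
    for x, label, _ in test:
        nearest = min(
            train,
            key=lambda c: (c[0][0] - x[0]) ** 2 + (c[0][1] - x[1]) ** 2,
        )
        correct += nearest[1] == label
    return correct / len(test)

def random_split(calls, rng, test_frac=0.25):
    """Split individual calls at random; sequences may straddle both sets."""
    shuffled = calls[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def grouped_split(calls, rng, test_frac=0.25):
    """Hold out whole sequences so no sequence appears in both sets."""
    seq_ids = sorted({c[2] for c in calls})
    rng.shuffle(seq_ids)
    n_test = max(1, int(len(seq_ids) * test_frac))
    held_out = set(seq_ids[:n_test])
    train = [c for c in calls if c[2] not in held_out]
    test = [c for c in calls if c[2] in held_out]
    return train, test

rng = random.Random(0)
calls = make_calls(rng)

leaky_train, leaky_test = random_split(calls, rng)
clean_train, clean_test = grouped_split(calls, rng)

leaky_acc = nn_accuracy(leaky_train, leaky_test)
clean_acc = nn_accuracy(clean_train, clean_test)

print(f"random split (sequences shared): accuracy = {leaky_acc:.2f}")
print(f"grouped split (sequences held out): accuracy = {clean_acc:.2f}")
```

In this toy setting the random split tends to score much higher, because a test call can be matched to a near-identical call from its own sequence. Comparing the two scores is one simple way to evaluate how much of the apparent individual signature is attributable to leakage.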
format Online
Article
Text
id pubmed-10129004
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10129004 2023-04-26 PLoS Comput Biol Research Article Public Library of Science 2023-04-13 /pmc/articles/PMC10129004/ /pubmed/37053268 http://dx.doi.org/10.1371/journal.pcbi.1010325 Text en © 2023 Arnaud et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Improving the workflow to crack Small, Unbalanced, Noisy, but Genuine (SUNG) datasets in bioacoustics: The case of bonobo calls
topic Research Article