
Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while maintaining the rest unchanged. Research in neural VC has accomplished considerable breakthroughs with the capacity to falsify a voice identity using a small amount of data with a highly realistic rendering. This paper goes beyond voice identity manipulation and presents an original neural architecture that allows the manipulation of voice attributes (e.g., gender and age). The proposed architecture is inspired by the fader network, transferring the same ideas to voice manipulation. The information conveyed by the speech signal is disentangled into interpretative voice attributes by means of minimizing adversarial loss to make the encoded information mutually independent while preserving the capacity to generate a speech signal from the disentangled codes. During inference for voice conversion, the disentangled voice attributes can be manipulated and the speech signal can be generated accordingly. For experimental evaluation, the proposed method is applied to the task of voice gender conversion using the freely available VCTK dataset. Quantitative measurements of mutual information between the variables of speaker identity and speaker gender show that the proposed architecture can learn gender-independent representation of speakers. Additional measurements of speaker recognition indicate that speaker identity can be recognized accurately from the gender-independent representation. Finally, a subjective experiment conducted on the task of voice gender manipulation shows that the proposed architecture can convert voice gender with very high efficiency and good naturalness.
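The abstract describes a fader-network-style scheme: an encoder produces disentangled codes, an adversarial loss makes those codes independent of a chosen attribute, and a decoder regenerates speech from the codes plus the attribute. The following is a minimal sketch of such a training step, assuming a PyTorch implementation, illustrative mel-spectrogram frames, and a binary gender attribute; it is not the authors' architecture, and all module sizes and hyperparameters are placeholders.

```python
# Minimal sketch (not the paper's exact model) of fader-network-style adversarial
# disentanglement: a classifier tries to predict the attribute (e.g., gender) from
# the latent code, while the encoder/decoder reconstruct the input from the code
# plus the attribute label and try to fool the classifier.
import torch
import torch.nn as nn

FEAT_DIM, LATENT_DIM, N_ATTR = 80, 128, 2  # mel bins, code size, binary attribute

encoder = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM + N_ATTR, 256), nn.ReLU(), nn.Linear(256, FEAT_DIM))
attr_clf = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, N_ATTR))

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
opt_clf = torch.optim.Adam(attr_clf.parameters(), lr=1e-4)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

def train_step(x, attr, adv_weight=1.0):
    """x: (batch, FEAT_DIM) speech frames; attr: (batch,) integer attribute labels."""
    # 1) Train the attribute classifier on the (detached) latent code.
    z = encoder(x)
    loss_clf = ce(attr_clf(z.detach()), attr)
    opt_clf.zero_grad(); loss_clf.backward(); opt_clf.step()

    # 2) Train encoder/decoder: reconstruct x from (code, attribute) while pushing
    #    the classifier toward the flipped label, i.e., making the code carry as
    #    little attribute information as possible (binary-attribute case).
    z = encoder(x)
    attr_onehot = torch.nn.functional.one_hot(attr, N_ATTR).float()
    x_hat = decoder(torch.cat([z, attr_onehot], dim=-1))
    loss_ae = mse(x_hat, x) + adv_weight * ce(attr_clf(z), 1 - attr)
    opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()
    return loss_ae.item(), loss_clf.item()
```

At conversion time, the same decoder would simply be fed the encoded speech together with a different attribute label, which mirrors the inference procedure sketched in the abstract.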


Bibliographic Details
Main Authors: Benaroya, Laurent; Obin, Nicolas; Roebel, Axel
Format: Online, Article, Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9955323/
https://www.ncbi.nlm.nih.gov/pubmed/36832741
http://dx.doi.org/10.3390/e25020375
_version_ 1784894320544841728
author Benaroya, Laurent
Obin, Nicolas
Roebel, Axel
author_facet Benaroya, Laurent
Obin, Nicolas
Roebel, Axel
author_sort Benaroya, Laurent
collection PubMed
description Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while maintaining the rest unchanged. Research in neural VC has accomplished considerable breakthroughs with the capacity to falsify a voice identity using a small amount of data with a highly realistic rendering. This paper goes beyond voice identity manipulation and presents an original neural architecture that allows the manipulation of voice attributes (e.g., gender and age). The proposed architecture is inspired by the fader network, transferring the same ideas to voice manipulation. The information conveyed by the speech signal is disentangled into interpretative voice attributes by means of minimizing adversarial loss to make the encoded information mutually independent while preserving the capacity to generate a speech signal from the disentangled codes. During inference for voice conversion, the disentangled voice attributes can be manipulated and the speech signal can be generated accordingly. For experimental evaluation, the proposed method is applied to the task of voice gender conversion using the freely available VCTK dataset. Quantitative measurements of mutual information between the variables of speaker identity and speaker gender show that the proposed architecture can learn gender-independent representation of speakers. Additional measurements of speaker recognition indicate that speaker identity can be recognized accurately from the gender-independent representation. Finally, a subjective experiment conducted on the task of voice gender manipulation shows that the proposed architecture can convert voice gender with very high efficiency and good naturalness.
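The quantitative evaluation mentioned in the description measures mutual information between the speaker-identity and speaker-gender variables. Below is a minimal sketch of such a measurement, assuming the learned speaker codes are first quantized (here by k-means) so a plug-in discrete estimator can be used; the scikit-learn estimator and the placeholder data are illustrative assumptions, not the estimator reported in the paper.

```python
# Hedged sketch: plug-in mutual information between gender labels and a quantized
# speaker representation; a value near zero suggests the representation is close
# to gender-independent. All data below are synthetic placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
speaker_codes = rng.normal(size=(1000, 128))     # placeholder learned speaker embeddings
gender_labels = rng.integers(0, 2, size=1000)    # placeholder 0/1 gender labels

# Discretize the continuous codes so discrete MI can be estimated.
code_symbols = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(speaker_codes)

mi = mutual_info_score(gender_labels, code_symbols)  # in nats
print(f"I(gender; quantized speaker code) ~ {mi:.4f} nats")
```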
format Online
Article
Text
id pubmed-9955323
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9955323 2023-02-25
Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations
Benaroya, Laurent; Obin, Nicolas; Roebel, Axel
Entropy (Basel), Article
Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while maintaining the rest unchanged. Research in neural VC has accomplished considerable breakthroughs with the capacity to falsify a voice identity using a small amount of data with a highly realistic rendering. This paper goes beyond voice identity manipulation and presents an original neural architecture that allows the manipulation of voice attributes (e.g., gender and age). The proposed architecture is inspired by the fader network, transferring the same ideas to voice manipulation. The information conveyed by the speech signal is disentangled into interpretative voice attributes by means of minimizing adversarial loss to make the encoded information mutually independent while preserving the capacity to generate a speech signal from the disentangled codes. During inference for voice conversion, the disentangled voice attributes can be manipulated and the speech signal can be generated accordingly. For experimental evaluation, the proposed method is applied to the task of voice gender conversion using the freely available VCTK dataset. Quantitative measurements of mutual information between the variables of speaker identity and speaker gender show that the proposed architecture can learn gender-independent representation of speakers. Additional measurements of speaker recognition indicate that speaker identity can be recognized accurately from the gender-independent representation. Finally, a subjective experiment conducted on the task of voice gender manipulation shows that the proposed architecture can convert voice gender with very high efficiency and good naturalness.
MDPI 2023-02-18 /pmc/articles/PMC9955323/ /pubmed/36832741 http://dx.doi.org/10.3390/e25020375
Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Benaroya, Laurent
Obin, Nicolas
Roebel, Axel
Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations
title Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations
title_full Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations
title_fullStr Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations
title_full_unstemmed Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations
title_short Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations
title_sort manipulating voice attributes by adversarial learning of structured disentangled representations
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9955323/
https://www.ncbi.nlm.nih.gov/pubmed/36832741
http://dx.doi.org/10.3390/e25020375
work_keys_str_mv AT benaroyalaurent manipulatingvoiceattributesbyadversariallearningofstructureddisentangledrepresentations
AT obinnicolas manipulatingvoiceattributesbyadversariallearningofstructureddisentangledrepresentations
AT roebelaxel manipulatingvoiceattributesbyadversariallearningofstructureddisentangledrepresentations