
A Voice Cloning Method Based on the Improved HiFi-GAN Model


Bibliographic Details
Main Authors: Qiu, Zeyu, Tang, Jun, Zhang, Yaxin, Li, Jiaxin, Bai, Xishan
Format: Online Article Text
Language: English
Published: Hindawi 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9578849/
https://www.ncbi.nlm.nih.gov/pubmed/36268148
http://dx.doi.org/10.1155/2022/6707304
_version_ 1784812051009372160
author Qiu, Zeyu
Tang, Jun
Zhang, Yaxin
Li, Jiaxin
Bai, Xishan
author_facet Qiu, Zeyu
Tang, Jun
Zhang, Yaxin
Li, Jiaxin
Bai, Xishan
author_sort Qiu, Zeyu
collection PubMed
description Voice cloning provides a personalized Text-to-Speech (TTS) service by adapting a source TTS model to synthesize a target speaker's voice from only a few speech samples. Although a Tacotron 2-based multi-speaker TTS system can implement voice cloning by introducing a d-vector into the speaker encoder, the speaker characteristics described by the d-vector cannot capture the voice information of the entire utterance, which limits the similarity of the cloned voice. Moreover, WaveNet, when used as the vocoder, sacrifices speech generation speed. To balance model parameters, inference speed, and voice quality, this paper proposes a voice cloning method based on an improved HiFi-GAN. (1) To improve the feature representation ability of the speaker encoder, the x-vector is used as the embedding vector characterizing the target speaker. (2) To improve the performance of the HiFi-GAN vocoder, the input Mel spectrum is processed by a competitive multiscale convolution strategy. (3) One-dimensional depth-wise separable convolutions replace all standard one-dimensional convolutions, significantly reducing the model parameters and increasing inference speed. The improved HiFi-GAN model reduces the number of vocoder model parameters by about 68.58% and boosts inference speed: speed on the GPU and CPU increases by 11.84% and 30.99%, respectively. Voice quality is also marginally improved, with MOS increasing by 0.13 and PESQ by 0.11. The improved HiFi-GAN model exhibits strong performance and compatibility in the voice cloning task; combined with the x-vector embedding, the proposed model achieves the highest scores across all models and test sets.
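The parameter savings from swapping standard one-dimensional convolutions for depth-wise separable ones can be sanity-checked with a short sketch. The channel counts and kernel size below are hypothetical illustrations (the record does not state the paper's layer dimensions), and bias terms are ignored:

```python
def conv1d_params(c_in: int, c_out: int, k: int) -> int:
    # Standard 1-D convolution: one k-tap filter per (input, output) channel pair.
    return c_in * c_out * k

def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    # Depth-wise step: one k-tap filter per input channel,
    # followed by a point-wise (kernel-size-1) convolution that mixes channels.
    return c_in * k + c_in * c_out

# Hypothetical layer sizes, not taken from the paper.
c_in, c_out, k = 512, 512, 7
std = conv1d_params(c_in, c_out, k)          # 1,835,008
sep = depthwise_separable_params(c_in, c_out, k)  # 265,728
print(f"standard: {std}, separable: {sep}, reduction: {1 - sep / std:.1%}")
```

For these illustrative sizes the separable form needs roughly 85% fewer weights; the overall 68.58% figure reported in the abstract covers the whole vocoder, where not every parameter sits in a replaceable convolution.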
format Online
Article
Text
id pubmed-9578849
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-95788492022-10-19 A Voice Cloning Method Based on the Improved HiFi-GAN Model Qiu, Zeyu Tang, Jun Zhang, Yaxin Li, Jiaxin Bai, Xishan Comput Intell Neurosci Research Article Voice cloning provides a personalized Text-to-Speech (TTS) service by adapting a source TTS model to synthesize a target speaker's voice from only a few speech samples. Although a Tacotron 2-based multi-speaker TTS system can implement voice cloning by introducing a d-vector into the speaker encoder, the speaker characteristics described by the d-vector cannot capture the voice information of the entire utterance, which limits the similarity of the cloned voice. Moreover, WaveNet, when used as the vocoder, sacrifices speech generation speed. To balance model parameters, inference speed, and voice quality, this paper proposes a voice cloning method based on an improved HiFi-GAN. (1) To improve the feature representation ability of the speaker encoder, the x-vector is used as the embedding vector characterizing the target speaker. (2) To improve the performance of the HiFi-GAN vocoder, the input Mel spectrum is processed by a competitive multiscale convolution strategy. (3) One-dimensional depth-wise separable convolutions replace all standard one-dimensional convolutions, significantly reducing the model parameters and increasing inference speed. The improved HiFi-GAN model reduces the number of vocoder model parameters by about 68.58% and boosts inference speed: speed on the GPU and CPU increases by 11.84% and 30.99%, respectively. Voice quality is also marginally improved, with MOS increasing by 0.13 and PESQ by 0.11. The improved HiFi-GAN model exhibits strong performance and compatibility in the voice cloning task; combined with the x-vector embedding, the proposed model achieves the highest scores across all models and test sets.
Hindawi 2022-10-11 /pmc/articles/PMC9578849/ /pubmed/36268148 http://dx.doi.org/10.1155/2022/6707304 Text en Copyright © 2022 Zeyu Qiu et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Qiu, Zeyu
Tang, Jun
Zhang, Yaxin
Li, Jiaxin
Bai, Xishan
A Voice Cloning Method Based on the Improved HiFi-GAN Model
title A Voice Cloning Method Based on the Improved HiFi-GAN Model
title_full A Voice Cloning Method Based on the Improved HiFi-GAN Model
title_fullStr A Voice Cloning Method Based on the Improved HiFi-GAN Model
title_full_unstemmed A Voice Cloning Method Based on the Improved HiFi-GAN Model
title_short A Voice Cloning Method Based on the Improved HiFi-GAN Model
title_sort voice cloning method based on the improved hifi-gan model
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9578849/
https://www.ncbi.nlm.nih.gov/pubmed/36268148
http://dx.doi.org/10.1155/2022/6707304
work_keys_str_mv AT qiuzeyu avoicecloningmethodbasedontheimprovedhifiganmodel
AT tangjun avoicecloningmethodbasedontheimprovedhifiganmodel
AT zhangyaxin avoicecloningmethodbasedontheimprovedhifiganmodel
AT lijiaxin avoicecloningmethodbasedontheimprovedhifiganmodel
AT baixishan avoicecloningmethodbasedontheimprovedhifiganmodel
AT qiuzeyu voicecloningmethodbasedontheimprovedhifiganmodel
AT tangjun voicecloningmethodbasedontheimprovedhifiganmodel
AT zhangyaxin voicecloningmethodbasedontheimprovedhifiganmodel
AT lijiaxin voicecloningmethodbasedontheimprovedhifiganmodel
AT baixishan voicecloningmethodbasedontheimprovedhifiganmodel