A Voice Cloning Method Based on the Improved HiFi-GAN Model
Main Authors: | Qiu, Zeyu; Tang, Jun; Zhang, Yaxin; Li, Jiaxin; Bai, Xishan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Hindawi, 2022 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9578849/ https://www.ncbi.nlm.nih.gov/pubmed/36268148 http://dx.doi.org/10.1155/2022/6707304 |
_version_ | 1784812051009372160 |
---|---|
author | Qiu, Zeyu; Tang, Jun; Zhang, Yaxin; Li, Jiaxin; Bai, Xishan |
author_sort | Qiu, Zeyu |
collection | PubMed |
description | Voice cloning provides a personalized Text-to-Speech (TTS) service by adapting a source TTS model to synthesize a target speaker's voice from only a few of that speaker's speech samples. Although a Tacotron 2-based multi-speaker TTS system can implement voice cloning by introducing a d-vector into the speaker encoder, the speaker characteristics described by the d-vector fail to capture the voice information of the entire utterance, which limits the similarity of the cloned voice. In addition, the WaveNet vocoder sacrifices speech generation speed. To balance model parameters, inference speed, and voice quality, this paper proposes a voice cloning method based on an improved HiFi-GAN. (1) To improve the feature representation ability of the speaker encoder, the x-vector is used as the embedding vector characterizing the target speaker. (2) To improve the performance of the HiFi-GAN vocoder, the input Mel spectrogram is processed by a competitive multiscale convolution strategy. (3) One-dimensional depth-wise separable convolutions replace all standard one-dimensional convolutions, significantly reducing model parameters and increasing inference speed. The improved HiFi-GAN model reduces the number of vocoder parameters by about 68.58% and raises inference speed on the GPU and CPU by 11.84% and 30.99%, respectively. Voice quality also improves marginally: MOS increases by 0.13 and PESQ by 0.11. The improved HiFi-GAN model exhibits outstanding performance and compatibility in the voice cloning task; combined with the x-vector embedding, the proposed model achieves the highest score across all models and test sets. |
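The parameter savings the abstract attributes to swapping standard one-dimensional convolutions for depth-wise separable ones can be sketched with a quick parameter count. This is a minimal illustration, not the paper's actual layer configuration: the channel widths and kernel size below are assumed for the example. A depth-wise separable convolution factors a standard convolution into a per-channel (depth-wise) convolution followed by a 1x1 point-wise convolution, which multiplies far fewer weights together:

```python
def conv1d_params(c_in, c_out, k, bias=True):
    """Parameter count of a standard 1-D convolution layer."""
    return c_in * c_out * k + (c_out if bias else 0)

def separable_conv1d_params(c_in, c_out, k, bias=True):
    """Depth-wise conv (one k-tap filter per input channel, i.e. groups=c_in)
    followed by a point-wise (kernel size 1) conv mixing channels."""
    depthwise = c_in * k + (c_in if bias else 0)
    pointwise = c_in * c_out * 1 + (c_out if bias else 0)
    return depthwise + pointwise

# Illustrative layer: 256 -> 256 channels, kernel size 3 (assumed values)
std = conv1d_params(256, 256, 3)
sep = separable_conv1d_params(256, 256, 3)
print(f"standard: {std}, separable: {sep}, reduction: {1 - sep / std:.1%}")
```

For this hypothetical layer the separable form cuts parameters by roughly two thirds, which is in the same ballpark as the ~68.58% whole-vocoder reduction the paper reports; the exact figure depends on each layer's channel counts and kernel sizes.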
format | Online Article Text |
id | pubmed-9578849 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Hindawi |
record_format | MEDLINE/PubMed |
spelling | pubmed-9578849 2022-10-19 A Voice Cloning Method Based on the Improved HiFi-GAN Model. Qiu, Zeyu; Tang, Jun; Zhang, Yaxin; Li, Jiaxin; Bai, Xishan. Comput Intell Neurosci, Research Article. (Abstract as given in the description field.) Hindawi 2022-10-11 /pmc/articles/PMC9578849/ /pubmed/36268148 http://dx.doi.org/10.1155/2022/6707304 Text en Copyright © 2022 Zeyu Qiu et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | A Voice Cloning Method Based on the Improved HiFi-GAN Model |
title_sort | voice cloning method based on the improved hifi-gan model |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9578849/ https://www.ncbi.nlm.nih.gov/pubmed/36268148 http://dx.doi.org/10.1155/2022/6707304 |