Research on Speech Synthesis Based on Mixture Alignment Mechanism

In recent years, deep learning-based speech synthesis has attracted considerable attention from the machine learning and speech communities. In this paper, we propose Mixture-TTS, a non-autoregressive speech synthesis model based on a mixture alignment mechanism. Mixture-TTS aims to optimize the alignment...

Bibliographic Details
Main authors: Deng, Yan; Wu, Ning; Qiu, Chengjun; Chen, Yan; Gao, Xueshan
Format: Online Article Text
Language: English
Published: MDPI 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457820/
https://www.ncbi.nlm.nih.gov/pubmed/37631819
http://dx.doi.org/10.3390/s23167283
author Deng, Yan
Wu, Ning
Qiu, Chengjun
Chen, Yan
Gao, Xueshan
collection PubMed
description In recent years, deep learning-based speech synthesis has attracted considerable attention from the machine learning and speech communities. In this paper, we propose Mixture-TTS, a non-autoregressive speech synthesis model based on a mixture alignment mechanism. Mixture-TTS aims to optimize the alignment information between text sequences and the mel-spectrogram. It uses a linguistic encoder that combines soft phoneme-level alignment with hard word-level alignment to explicitly extract word-level semantic information, and it introduces pitch and energy predictors to predict the rhythmic information of the audio. In addition, Mixture-TTS introduces a post-net based on a five-layer 1D convolution network to improve the reconstruction of the mel-spectrogram. The output of the decoder is connected to the post-net through a residual connection. The mel-spectrogram is converted into the final audio by the HiFi-GAN vocoder. We evaluate the performance of Mixture-TTS on the AISHELL3 and LJSpeech datasets. Experimental results show that Mixture-TTS achieves somewhat better alignment between text sequences and mel-spectrograms and produces high-quality audio. The ablation studies demonstrate that the structure of Mixture-TTS is effective.
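The post-net step in the abstract (a five-layer 1D convolution stack whose output is added back to the decoder output) can be illustrated with a minimal pure-Python sketch. The abstract states only the layer count and the residual link; the kernel size of 5, the tanh activations on all but the last layer, the hidden width, and the random weights below are assumptions in the style of the Tacotron 2 post-net, and the function names are hypothetical rather than taken from the authors' code.

```python
import math
import random


def conv1d_same(x, weight, bias):
    """1D convolution over time with zero padding, so the length is preserved.

    x:      list of T frames, each a list of C_in floats
    weight: nested list shaped [C_out][C_in][K], with K odd
    bias:   list of C_out floats
    """
    n_frames, c_in = len(x), len(x[0])
    k = len(weight[0][0])
    pad = k // 2
    out = []
    for t in range(n_frames):
        frame = []
        for o, b in enumerate(bias):
            acc = b
            for i in range(c_in):
                for j in range(k):
                    s = t + j - pad
                    if 0 <= s < n_frames:  # outside the signal counts as zero
                        acc += weight[o][i][j] * x[s][i]
            frame.append(acc)
        out.append(frame)
    return out


def make_postnet(n_mels, hidden, k=5, n_layers=5, scale=0.05, seed=0):
    """Build n_layers (weight, bias) pairs: n_mels -> hidden -> ... -> n_mels.

    Weights are random placeholders (assumption); a real model learns them.
    """
    rng = random.Random(seed)
    dims = [n_mels] + [hidden] * (n_layers - 1) + [n_mels]
    layers = []
    for c_in, c_out in zip(dims[:-1], dims[1:]):
        weight = [[[rng.uniform(-scale, scale) for _ in range(k)]
                   for _ in range(c_in)] for _ in range(c_out)]
        layers.append((weight, [0.0] * c_out))
    return layers


def postnet_forward(mel, layers):
    """Run the conv stack on the decoder output and add it back (residual)."""
    h = mel
    for idx, (weight, bias) in enumerate(layers):
        h = conv1d_same(h, weight, bias)
        if idx < len(layers) - 1:  # assumed: tanh on all but the final layer
            h = [[math.tanh(v) for v in frame] for frame in h]
    # Residual connection: refined mel = decoder mel + post-net correction.
    return [[m + r for m, r in zip(mel_frame, res_frame)]
            for mel_frame, res_frame in zip(mel, h)]


if __name__ == "__main__":
    decoder_mel = [[0.1 * (t + c) for c in range(4)] for t in range(6)]
    refined = postnet_forward(decoder_mel, make_postnet(n_mels=4, hidden=3))
    print(len(refined), len(refined[0]))
```

Because of the residual link, the post-net only has to learn a correction to the decoder output: with all-zero weights the refined mel-spectrogram equals the decoder output unchanged.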
format Online
Article
Text
id pubmed-10457820
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10457820 2023-08-27
Sensors (Basel)
Article
MDPI 2023-08-20
/pmc/articles/PMC10457820/
/pubmed/37631819
http://dx.doi.org/10.3390/s23167283
Text en
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Research on Speech Synthesis Based on Mixture Alignment Mechanism
topic Article