Cargando…

Research on Speech Synthesis Based on Mixture Alignment Mechanism

In recent years, deep learning-based speech synthesis has attracted a lot of attention from the machine learning and speech communities. In this paper, we propose Mixture-TTS, a non-autoregressive speech synthesis model based on mixture alignment mechanism. Mixture-TTS aims to optimize the alignment...

Descripción completa

Detalles Bibliográficos
Autores principales: Deng, Yan, Wu, Ning, Qiu, Chengjun, Chen, Yan, Gao, Xueshan
Formato: Online Artículo Texto
Lenguaje:English
Publicado: MDPI 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457820/
https://www.ncbi.nlm.nih.gov/pubmed/37631819
http://dx.doi.org/10.3390/s23167283
Descripción
Sumario:In recent years, deep learning-based speech synthesis has attracted a lot of attention from the machine learning and speech communities. In this paper, we propose Mixture-TTS, a non-autoregressive speech synthesis model based on mixture alignment mechanism. Mixture-TTS aims to optimize the alignment information between text sequences and mel-spectrogram. Mixture-TTS uses a linguistic encoder based on soft phoneme-level alignment and hard word-level alignment approaches, which explicitly extract word-level semantic information, and introduce pitch and energy predictors to optimally predict the rhythmic information of the audio. Specifically, Mixture-TTS introduces a post-net based on a five-layer 1D convolution network to optimize the reconfiguration capability of the mel-spectrogram. We connect the output of the decoder to the post-net through the residual network. The mel-spectrogram is converted into the final audio by the HiFi-GAN vocoder. We evaluate the performance of the Mixture-TTS on the AISHELL3 and LJSpeech datasets. Experimental results show that Mixture-TTS is somewhat better in alignment information between the text sequences and mel-spectrogram, and is able to achieve high-quality audio. The ablation studies demonstrate that the structure of Mixture-TTS is effective.