Research on Speech Synthesis Based on Mixture Alignment Mechanism
In recent years, deep learning-based speech synthesis has attracted considerable attention from the machine learning and speech communities. In this paper, we propose Mixture-TTS, a non-autoregressive speech synthesis model based on a mixture alignment mechanism. Mixture-TTS aims to optimize the alignment...
Main authors: | Deng, Yan; Wu, Ning; Qiu, Chengjun; Chen, Yan; Gao, Xueshan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457820/ https://www.ncbi.nlm.nih.gov/pubmed/37631819 http://dx.doi.org/10.3390/s23167283 |
_version_ | 1785097016383111168 |
---|---|
author | Deng, Yan Wu, Ning Qiu, Chengjun Chen, Yan Gao, Xueshan |
author_facet | Deng, Yan Wu, Ning Qiu, Chengjun Chen, Yan Gao, Xueshan |
author_sort | Deng, Yan |
collection | PubMed |
description | In recent years, deep learning-based speech synthesis has attracted considerable attention from the machine learning and speech communities. In this paper, we propose Mixture-TTS, a non-autoregressive speech synthesis model based on a mixture alignment mechanism. Mixture-TTS aims to optimize the alignment information between text sequences and the mel-spectrogram. It uses a linguistic encoder based on soft phoneme-level alignment and hard word-level alignment, which explicitly extracts word-level semantic information, and introduces pitch and energy predictors to predict the rhythmic information of the audio. Specifically, Mixture-TTS introduces a post-net based on a five-layer 1D convolution network to optimize the reconfiguration capability of the mel-spectrogram. The output of the decoder is connected to the post-net through a residual connection. The mel-spectrogram is converted into the final audio by the HiFi-GAN vocoder. We evaluate the performance of Mixture-TTS on the AISHELL3 and LJSpeech datasets. Experimental results show that Mixture-TTS achieves somewhat better alignment between the text sequences and the mel-spectrogram and produces high-quality audio. The ablation studies demonstrate that the structure of Mixture-TTS is effective. |
format | Online Article Text |
id | pubmed-10457820 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-10457820 2023-08-27 Research on Speech Synthesis Based on Mixture Alignment Mechanism Deng, Yan Wu, Ning Qiu, Chengjun Chen, Yan Gao, Xueshan Sensors (Basel) Article MDPI 2023-08-20 /pmc/articles/PMC10457820/ /pubmed/37631819 http://dx.doi.org/10.3390/s23167283 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
title | Research on Speech Synthesis Based on Mixture Alignment Mechanism |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10457820/ https://www.ncbi.nlm.nih.gov/pubmed/37631819 http://dx.doi.org/10.3390/s23167283 |
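
The description above mentions a post-net built from five 1D convolution layers whose output is joined to the decoder output through a residual connection. The following is only a minimal pure-Python sketch of that idea, not the authors' implementation. The kernel size of 5, the tanh activations on all but the last layer, the hidden channel count, the random initialization, and the function names `conv1d_same`, `init_postnet`, and `postnet_residual` are all assumptions made for illustration; the record does not state them.

```python
import math
import random


def conv1d_same(x, weight, bias):
    """Zero-padded 1D convolution that keeps the time length unchanged.

    x:      [in_channels][T]
    weight: [out_channels][in_channels][kernel_size]
    bias:   [out_channels]
    """
    kernel_size = len(weight[0][0])
    pad = kernel_size // 2
    n_frames = len(x[0])
    out = []
    for oc, w_oc in enumerate(weight):
        row = []
        for t in range(n_frames):
            acc = bias[oc]
            for ic, w_ic in enumerate(w_oc):
                for j, w in enumerate(w_ic):
                    s = t + j - pad
                    if 0 <= s < n_frames:  # positions outside the input count as zero
                        acc += w * x[ic][s]
            row.append(acc)
        out.append(row)
    return out


def init_postnet(n_mels, hidden, kernel_size=5, n_layers=5, seed=0):
    """Build n_layers conv layers: n_mels -> hidden -> ... -> hidden -> n_mels.

    All sizes and the uniform initialization are illustrative assumptions.
    """
    rng = random.Random(seed)
    channels = [n_mels] + [hidden] * (n_layers - 1) + [n_mels]
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        scale = 1.0 / math.sqrt(c_in * kernel_size)
        weight = [[[rng.uniform(-scale, scale) for _ in range(kernel_size)]
                   for _ in range(c_in)]
                  for _ in range(c_out)]
        layers.append((weight, [0.0] * c_out))
    return layers


def postnet_residual(decoder_out, layers):
    """Return decoder_out + postnet(decoder_out), element by element."""
    h = decoder_out
    for i, (weight, bias) in enumerate(layers):
        h = conv1d_same(h, weight, bias)
        if i < len(layers) - 1:  # assumed: tanh on every layer except the last
            h = [[math.tanh(v) for v in row] for row in h]
    return [[d + r for d, r in zip(d_row, r_row)]
            for d_row, r_row in zip(decoder_out, h)]


if __name__ == "__main__":
    # Toy "mel-spectrogram": 4 mel bins, 6 frames (hypothetical sizes).
    mel = [[0.1 * (b + t) for t in range(6)] for b in range(4)]
    refined = postnet_residual(mel, init_postnet(n_mels=4, hidden=8))
    print(len(refined), len(refined[0]))
```

Because of the residual connection, the post-net only has to model a correction to the decoder output: if its last layer outputs zeros, the refined mel-spectrogram equals the decoder output unchanged.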