DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer


Bibliographic Details
Main Authors: Yu, Junxiao, Xu, Zhengyuan, He, Xu, Wang, Jian, Liu, Bin, Feng, Rui, Zhu, Songsheng, Wang, Wei, Li, Jianqing
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9857677/
https://www.ncbi.nlm.nih.gov/pubmed/36673182
http://dx.doi.org/10.3390/e25010041
_version_ 1784873922607448064
author Yu, Junxiao
Xu, Zhengyuan
He, Xu
Wang, Jian
Liu, Bin
Feng, Rui
Zhu, Songsheng
Wang, Wei
Li, Jianqing
author_facet Yu, Junxiao
Xu, Zhengyuan
He, Xu
Wang, Jian
Liu, Bin
Feng, Rui
Zhu, Songsheng
Wang, Wei
Li, Jianqing
author_sort Yu, Junxiao
collection PubMed
description Text-to-speech (TTS) synthesizers are widely used as vital assistive tools in various fields. Traditional sequence-to-sequence (seq2seq) TTS models such as Tacotron2 use a single soft attention mechanism for encoder–decoder alignment; its biggest shortcoming is that it generates incorrect or repeated words when dealing with long sentences. It may also produce sentences with run-on or misplaced breaks regardless of punctuation marks, which causes the synthesized waveform to lack emotion and sound unnatural. In this paper, we propose an end-to-end neural generative TTS model based on a deep-inherited attention (DIA) mechanism together with an adjustable local-sensitive factor (LSF). The inheritance mechanism allows multiple iterations of the DIA that share the same training parameters, which tightens the token–frame correlation and speeds up the alignment process. The LSF is adopted to enhance context connections by expanding the DIA concentration region. In addition, a multi-RNN block is used in the decoder for better acoustic feature extraction and generation; hidden-state information from the multi-RNN layers is used for attention alignment. The collaboration of the DIA and multi-RNN layers yields high-quality prediction of the phrase breaks of the synthesized speech. We used WaveGlow as the vocoder for real-time, human-like audio synthesis. Human subjective experiments show that DIA-TTS achieved a mean opinion score (MOS) of 4.48 in terms of naturalness. Ablation studies further demonstrate the superiority of the DIA mechanism in enhancing phrase breaks and attention robustness.
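The abstract contrasts the DIA mechanism with the single soft attention used in Tacotron2-style seq2seq TTS. As a point of reference, one decoder step of such location-sensitive soft attention can be sketched as follows; this is a minimal NumPy illustration under assumed toy dimensions, not code from the paper, and all names (`soft_attention_step`, weight matrices, kernel) are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention_step(query, memory, prev_align, W_q, W_m, W_loc, v, loc_kernel):
    """One decoder step of location-sensitive soft attention (Tacotron2-style sketch).

    query:      (d_q,)   current decoder hidden state
    memory:     (T, d_m) encoder outputs, one row per input token
    prev_align: (T,)     attention weights from the previous decoder step
    Returns (context, align): context vector and the new alignment.
    """
    # Location features: smooth the previous alignment with a 1-D convolution,
    # so the mechanism is aware of where it attended on the last step.
    loc_feat = np.convolve(prev_align, loc_kernel, mode="same")               # (T,)
    # Additive (Bahdanau-style) energies with an extra location term.
    energy = np.tanh(query @ W_q + memory @ W_m + loc_feat[:, None] @ W_loc) @ v  # (T,)
    align = softmax(energy)        # soft alignment: probability over input tokens
    context = align @ memory       # (d_m,) weighted sum of encoder outputs
    return context, align

# Tiny demo with random weights (dimensions chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
T, d_q, d_m, d_a = 5, 4, 6, 8
query = rng.standard_normal(d_q)
memory = rng.standard_normal((T, d_m))
prev_align = np.eye(T)[0]          # start by attending to the first token
W_q = rng.standard_normal((d_q, d_a))
W_m = rng.standard_normal((d_m, d_a))
W_loc = rng.standard_normal((1, d_a))
v = rng.standard_normal(d_a)
loc_kernel = np.array([0.25, 0.5, 0.25])

context, align = soft_attention_step(query, memory, prev_align,
                                     W_q, W_m, W_loc, v, loc_kernel)
print(align.sum())  # the alignment is a probability distribution over tokens
```

Because a single such alignment is recomputed independently at every decoder step, long inputs can cause it to skip or revisit tokens; the paper's DIA mechanism addresses this by iterating the attention with shared parameters.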
format Online
Article
Text
id pubmed-9857677
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-98576772023-01-21 DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer Yu, Junxiao Xu, Zhengyuan He, Xu Wang, Jian Liu, Bin Feng, Rui Zhu, Songsheng Wang, Wei Li, Jianqing Entropy (Basel) Article
MDPI 2022-12-26 /pmc/articles/PMC9857677/ /pubmed/36673182 http://dx.doi.org/10.3390/e25010041 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Yu, Junxiao
Xu, Zhengyuan
He, Xu
Wang, Jian
Liu, Bin
Feng, Rui
Zhu, Songsheng
Wang, Wei
Li, Jianqing
DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer
title DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer
title_full DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer
title_fullStr DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer
title_full_unstemmed DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer
title_short DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer
title_sort dia-tts: deep-inherited attention-based text-to-speech synthesizer
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9857677/
https://www.ncbi.nlm.nih.gov/pubmed/36673182
http://dx.doi.org/10.3390/e25010041
work_keys_str_mv AT yujunxiao diattsdeepinheritedattentionbasedtexttospeechsynthesizer
AT xuzhengyuan diattsdeepinheritedattentionbasedtexttospeechsynthesizer
AT hexu diattsdeepinheritedattentionbasedtexttospeechsynthesizer
AT wangjian diattsdeepinheritedattentionbasedtexttospeechsynthesizer
AT liubin diattsdeepinheritedattentionbasedtexttospeechsynthesizer
AT fengrui diattsdeepinheritedattentionbasedtexttospeechsynthesizer
AT zhusongsheng diattsdeepinheritedattentionbasedtexttospeechsynthesizer
AT wangwei diattsdeepinheritedattentionbasedtexttospeechsynthesizer
AT lijianqing diattsdeepinheritedattentionbasedtexttospeechsynthesizer