The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset

Recent trends in voicebot application development have enabled utilization of both speech-to-text and text-to-speech (TTS) generation techniques. In order to generate a voice response to a given speech, one needs to use a TTS engine. The recently developed TTS engines are shifting towards end-to-end...

Bibliographic Details
Main Author: Tran, Duc Chung
Format: Online Article Text
Language: English
Published: Elsevier 2020
Subjects: Computer Science
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7287261/
https://www.ncbi.nlm.nih.gov/pubmed/32551347
http://dx.doi.org/10.1016/j.dib.2020.105775
_version_ 1783545034226794496
author Tran, Duc Chung
author_facet Tran, Duc Chung
author_sort Tran, Duc Chung
collection PubMed
description Recent trends in voicebot application development have enabled the use of both speech-to-text and text-to-speech (TTS) generation techniques. To generate a voice response to a given speech input, one needs a TTS engine. Recently developed TTS engines are shifting towards end-to-end approaches that use models such as Tacotron, Tacotron-2, WaveNet, and WaveGlow. This shift lets a TTS service provider focus on developing training and validation datasets comprising labelled texts and recorded speech, instead of designing an entirely new model that outperforms the others, which is time-consuming and costly. In this context, this work introduces the first Vietnamese FPT Open Speech Data (FOSD)-Tacotron-2-based TTS model dataset. The dataset comprises a configuration file in *.json format; training and validation text input files (in *.csv format); a 225,000-step checkpoint of the trained model; and several sample generated audio files. The published dataset is valuable as a baseline for benchmarking other newly developed TTS models and engines. In addition, it opens a new TTS research optimization problem: how to effectively generate speech from text given a black-box (trained) TTS model and its training and validation input texts.
format Online
Article
Text
id pubmed-7287261
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-7287261 2020-06-17 The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset Tran, Duc Chung Data Brief Computer Science Elsevier 2020-05-27 /pmc/articles/PMC7287261/ /pubmed/32551347 http://dx.doi.org/10.1016/j.dib.2020.105775 Text en © 2020 The Author(s) http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Computer Science
Tran, Duc Chung
The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset
title The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset
title_full The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset
title_fullStr The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset
title_full_unstemmed The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset
title_short The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset
title_sort first vietnamese fosd-tacotron-2-based text-to-speech model dataset
topic Computer Science
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7287261/
https://www.ncbi.nlm.nih.gov/pubmed/32551347
http://dx.doi.org/10.1016/j.dib.2020.105775
work_keys_str_mv AT tranducchung thefirstvietnamesefosdtacotron2basedtexttospeechmodeldataset
AT tranducchung firstvietnamesefosdtacotron2basedtexttospeechmodeldataset