
FlexLip: A Controllable Text-to-Lip System

Bibliographic Details
Main Authors: Oneață, Dan, Lőrincz, Beáta, Stan, Adriana, Cucu, Horia
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9185457/
https://www.ncbi.nlm.nih.gov/pubmed/35684727
http://dx.doi.org/10.3390/s22114104
_version_ 1784724728932466688
author Oneață, Dan
Lőrincz, Beáta
Stan, Adriana
Cucu, Horia
author_facet Oneață, Dan
Lőrincz, Beáta
Stan, Adriana
Cucu, Horia
author_sort Oneață, Dan
collection PubMed
description The task of converting text input into video content is becoming an important topic for synthetic media generation. Several methods have been proposed with some of them reaching close-to-natural performances in constrained tasks. In this paper, we tackle a subissue of the text-to-video generation problem, by converting the text into lip landmarks. However, we do this using a modular, controllable system architecture and evaluate each of its individual components. Our system, entitled FlexLip, is split into two separate modules: text-to-speech and speech-to-lip, both having underlying controllable deep neural network architectures. This modularity enables the easy replacement of each of its components, while also ensuring the fast adaptation to new speaker identities by disentangling or projecting the input features. We show that by using as little as 20 min of data for the audio generation component, and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples. We also introduce a series of objective evaluation measures over the complete flow of our system by taking into consideration several aspects of the data and system configuration. These aspects pertain to the quality and amount of training data, the use of pretrained models, and the data contained therein, as well as the identity of the target speaker; with regard to the latter, we show that we can perform zero-shot lip adaptation to an unseen identity by simply updating the shape of the lips in our model.
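The description above outlines a two-module architecture: a text-to-speech stage feeding a speech-to-lip stage, with zero-shot identity adaptation done by swapping the lip shape. The sketch below is purely illustrative of that modular composition; the function names, frame dimensions, and landmark count are assumptions, not the authors' actual FlexLip API.

```python
# Hypothetical sketch of a FlexLip-style two-module pipeline.
# All names and dimensions here are illustrative assumptions.

def text_to_speech(text):
    # Stand-in for the controllable TTS module: emit one dummy
    # acoustic frame (an 80-dim mel-like vector) per character.
    return [[0.0] * 80 for _ in text]

def speech_to_lip(frames, lip_shape=None):
    # Stand-in for the speech-to-lip module: one set of 2D lip
    # landmarks per acoustic frame. Passing `lip_shape` mimics the
    # zero-shot identity adaptation described in the abstract, where
    # only the lip shape is updated for an unseen speaker.
    base = lip_shape if lip_shape is not None else [(0.0, 0.0)] * 20
    return [base for _ in frames]

def flexlip_pipeline(text, lip_shape=None):
    # Modularity: either stage can be swapped out independently,
    # as long as the acoustic-frame interface between them is kept.
    return speech_to_lip(text_to_speech(text), lip_shape)

landmarks = flexlip_pipeline("hello")
print(len(landmarks), len(landmarks[0]))  # 5 frames, 20 landmarks each
```

The point of the sketch is the interface between the two stages: because the speech-to-lip module consumes generic acoustic frames, replacing the TTS component (or adapting either stage with a few minutes of speaker data, as the paper reports) does not require retraining the whole system.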
format Online
Article
Text
id pubmed-9185457
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9185457 2022-06-11 FlexLip: A Controllable Text-to-Lip System Oneață, Dan Lőrincz, Beáta Stan, Adriana Cucu, Horia Sensors (Basel) Article MDPI 2022-05-28 /pmc/articles/PMC9185457/ /pubmed/35684727 http://dx.doi.org/10.3390/s22114104 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. https://creativecommons.org/licenses/by/4.0/
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Oneață, Dan
Lőrincz, Beáta
Stan, Adriana
Cucu, Horia
FlexLip: A Controllable Text-to-Lip System
title FlexLip: A Controllable Text-to-Lip System
title_full FlexLip: A Controllable Text-to-Lip System
title_fullStr FlexLip: A Controllable Text-to-Lip System
title_full_unstemmed FlexLip: A Controllable Text-to-Lip System
title_short FlexLip: A Controllable Text-to-Lip System
title_sort flexlip: a controllable text-to-lip system
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9185457/
https://www.ncbi.nlm.nih.gov/pubmed/35684727
http://dx.doi.org/10.3390/s22114104
work_keys_str_mv AT oneatadan flexlipacontrollabletexttolipsystem
AT lorinczbeata flexlipacontrollabletexttolipsystem
AT stanadriana flexlipacontrollabletexttolipsystem
AT cucuhoria flexlipacontrollabletexttolipsystem