
Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis

BACKGROUND: Virtual humans have become part of our everyday life (movies, internet, and computer games). Even though they are becoming more and more realistic, their speech capabilities are, most of the time, limited and not coherent and/or not synchronous with the corresponding acoustic signal. MET...


Bibliographic Details
Main Authors:	Gibert, Guillaume, Olsen, Kirk N., Leung, Yvonne, Stevens, Catherine J.
Format:	Online Article Text
Language:	English
Published:	Springer Singapore 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5125409/ https://www.ncbi.nlm.nih.gov/pubmed/27980889 http://dx.doi.org/10.1186/s40469-015-0007-8

_version_	1782469973795405824
author	Gibert, Guillaume Olsen, Kirk N. Leung, Yvonne Stevens, Catherine J.
author_facet	Gibert, Guillaume Olsen, Kirk N. Leung, Yvonne Stevens, Catherine J.
author_sort	Gibert, Guillaume
collection	PubMed
description	BACKGROUND: Virtual humans have become part of our everyday life (movies, internet, and computer games). Even though they are becoming more and more realistic, their speech capabilities are, most of the time, limited and not coherent and/or not synchronous with the corresponding acoustic signal. METHODS: We describe a method to convert a virtual human avatar (animated through key frames and interpolation) into a more naturalistic talking head. In fact, speech articulation cannot be accurately replicated using interpolation between key frames and talking heads with good speech capabilities are derived from real speech production data. Motion capture data are commonly used to provide accurate facial motion for visible speech articulators (jaw and lips) synchronous with acoustics. To access tongue trajectories (partially occluded speech articulator), electromagnetic articulography (EMA) is often used. We recorded a large database of phonetically-balanced English sentences with synchronous EMA, motion capture data, and acoustics. An articulatory model was computed on this database to recover missing data and to provide ‘normalized’ animation (i.e., articulatory) parameters. In addition, semi-automatic segmentation was performed on the acoustic stream. A dictionary of multimodal Australian English diphones was created. It is composed of the variation of the articulatory parameters between all the successive stable allophones. RESULTS: The avatar’s facial key frames were converted into articulatory parameters steering its speech articulators (jaw, lips and tongue). The speech production database was used to drive the Embodied Conversational Agent (ECA) and to enhance its speech capabilities. A Text-To-Auditory Visual Speech synthesizer was created based on the MaryTTS software and on the diphone dictionary derived from the speech production database. 
CONCLUSIONS: We describe a method to transform an ECA with generic tongue model and animation by key frames into a talking head that displays naturalistic tongue, jaw and lip motions. Thanks to a multimodal speech production database, a Text-To-Auditory Visual Speech synthesizer drives the ECA’s facial movements enhancing its speech capabilities. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s40469-015-0007-8) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5125409
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Springer Singapore
record_format	MEDLINE/PubMed
spelling	pubmed-5125409 2016-12-13 Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis Gibert, Guillaume Olsen, Kirk N. Leung, Yvonne Stevens, Catherine J. Comput Cogn Sci Research Article Springer Singapore 2015-09-08 2015 /pmc/articles/PMC5125409/ /pubmed/27980889 http://dx.doi.org/10.1186/s40469-015-0007-8 Text en © Gibert et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
title	Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5125409/ https://www.ncbi.nlm.nih.gov/pubmed/27980889 http://dx.doi.org/10.1186/s40469-015-0007-8


Similar Items