A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint

For decades, correlating different data domains to realize the full potential of machines has driven research, especially in neural networks. Text and visual data (images and videos) are two such distinct domains, each with an extensive research history. Recently, using natural language to process 2D or 3D images and videos with the power of neural networks has shown great promise. Despite a diverse range of remarkable work in this field, particularly in the past few years, rapid progress has also raised new challenges for researchers. Moreover, the connection between the two domains has relied mainly on GANs, limiting the horizons of the field. This review analyzes Text-to-Image (T2I) synthesis within the broader picture of Text-guided Visual output (T2Vo), with the primary goal of highlighting the gaps by proposing a more comprehensive taxonomy. We broadly categorize text-guided visual output into three main divisions with meaningful subdivisions by critically examining an extensive body of literature from top-tier computer vision venues and closely related fields, such as machine learning and human–computer interaction, focusing on state-of-the-art models with a comparative analysis. This study builds on previous T2I surveys, adding value by evaluating the diverse range of existing methods, covering different generative models and several types of visual output, critically examining the various approaches, highlighting their shortcomings, and suggesting future directions for research.

Bibliographic Details
Main Authors: Ullah, Ubaid; Lee, Jeong-Sik; An, Chang-Hyeon; Lee, Hyeonjin; Park, Su-Yeong; Baek, Rock-Hyun; Choi, Hyun-Chul
Format: Online Article (Text)
Language: English
Journal: Sensors (Basel)
Published: MDPI, 8 September 2022
Subjects: Review
License: © 2022 by the authors. Open access under the Creative Commons Attribution (CC BY 4.0) license (https://creativecommons.org/licenses/by/4.0/)
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9503702/
https://www.ncbi.nlm.nih.gov/pubmed/36146161
http://dx.doi.org/10.3390/s22186816