A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint
| Main Authors: | Ullah, Ubaid; Lee, Jeong-Sik; An, Chang-Hyeon; Lee, Hyeonjin; Park, Su-Yeong; Baek, Rock-Hyun; Choi, Hyun-Chul |
|---|---|
| Format: | Online Article Text |
| Language: | English |
| Published: | MDPI, 2022 |
| Subjects: | Review |
| Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9503702/ https://www.ncbi.nlm.nih.gov/pubmed/36146161 http://dx.doi.org/10.3390/s22186816 |
_version_ | 1784796031256363008 |
---|---|
author | Ullah, Ubaid; Lee, Jeong-Sik; An, Chang-Hyeon; Lee, Hyeonjin; Park, Su-Yeong; Baek, Rock-Hyun; Choi, Hyun-Chul |
author_facet | Ullah, Ubaid; Lee, Jeong-Sik; An, Chang-Hyeon; Lee, Hyeonjin; Park, Su-Yeong; Baek, Rock-Hyun; Choi, Hyun-Chul |
author_sort | Ullah, Ubaid |
collection | PubMed |
description | For decades, correlating different data domains to attain the maximum potential of machines has driven research, especially in neural networks. Text and visual data (images and videos) are two such distinct domains, each with an extensive research history. Recently, using natural language to process 2D or 3D images and videos with the immense power of neural nets has shown great promise. Despite a diverse range of remarkable work in this field, rapid improvements in the past few years have also resolved challenges once left for future work. Moreover, the connection between these two domains has mainly relied on GANs, limiting the horizons of the field. This review analyzes Text-to-Image (T2I) synthesis within a broader picture, Text-guided Visual-output (T2Vo), with the primary goal of highlighting the gaps by proposing a more comprehensive taxonomy. We broadly categorize text-guided visual output into three main divisions and meaningful subdivisions by critically examining an extensive body of literature from top-tier computer vision venues and closely related fields, such as machine learning and human–computer interaction, focusing on state-of-the-art models with a comparative analysis. This study follows up on previous T2I surveys, adding value by evaluating the diverse range of existing methods, including different generative models and several types of visual output, critically examining the various approaches, highlighting their shortcomings, and suggesting future directions for research. |
format | Online Article Text |
id | pubmed-9503702 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-9503702 2022-09-24 A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint Ullah, Ubaid; Lee, Jeong-Sik; An, Chang-Hyeon; Lee, Hyeonjin; Park, Su-Yeong; Baek, Rock-Hyun; Choi, Hyun-Chul Sensors (Basel) Review For decades, correlating different data domains to attain the maximum potential of machines has driven research, especially in neural networks. Text and visual data (images and videos) are two such distinct domains, each with an extensive research history. Recently, using natural language to process 2D or 3D images and videos with the immense power of neural nets has shown great promise. Despite a diverse range of remarkable work in this field, rapid improvements in the past few years have also resolved challenges once left for future work. Moreover, the connection between these two domains has mainly relied on GANs, limiting the horizons of the field. This review analyzes Text-to-Image (T2I) synthesis within a broader picture, Text-guided Visual-output (T2Vo), with the primary goal of highlighting the gaps by proposing a more comprehensive taxonomy. We broadly categorize text-guided visual output into three main divisions and meaningful subdivisions by critically examining an extensive body of literature from top-tier computer vision venues and closely related fields, such as machine learning and human–computer interaction, focusing on state-of-the-art models with a comparative analysis. This study follows up on previous T2I surveys, adding value by evaluating the diverse range of existing methods, including different generative models and several types of visual output, critically examining the various approaches, highlighting their shortcomings, and suggesting future directions for research. MDPI 2022-09-08 /pmc/articles/PMC9503702/ /pubmed/36146161 http://dx.doi.org/10.3390/s22186816 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Review; Ullah, Ubaid; Lee, Jeong-Sik; An, Chang-Hyeon; Lee, Hyeonjin; Park, Su-Yeong; Baek, Rock-Hyun; Choi, Hyun-Chul; A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint |
title | A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint |
title_full | A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint |
title_fullStr | A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint |
title_full_unstemmed | A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint |
title_short | A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint |
title_sort | review of multi-modal learning from the text-guided visual processing viewpoint |
topic | Review |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9503702/ https://www.ncbi.nlm.nih.gov/pubmed/36146161 http://dx.doi.org/10.3390/s22186816 |
work_keys_str_mv | AT ullahubaid areviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT leejeongsik areviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT anchanghyeon areviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT leehyeonjin areviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT parksuyeong areviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT baekrockhyun areviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT choihyunchul areviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT ullahubaid reviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT leejeongsik reviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT anchanghyeon reviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT leehyeonjin reviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT parksuyeong reviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT baekrockhyun reviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint AT choihyunchul reviewofmultimodallearningfromthetextguidedvisualprocessingviewpoint |