Vision–Language–Knowledge Co-Embedding for Visual Commonsense Reasoning

Visual commonsense reasoning is an intelligent task performed to decide the most appropriate answer to a question while providing the rationale or reason for the answer when an image, a natural language question, and candidate responses are given. For effective visual commonsense reasoning, both the...

Bibliographic Details
Main authors:	Lee, JaeYun; Kim, Incheol
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8122639/ https://www.ncbi.nlm.nih.gov/pubmed/33919196 http://dx.doi.org/10.3390/s21092911

_version_	1783692672164167680
author	Lee, JaeYun; Kim, Incheol
author_facet	Lee, JaeYun; Kim, Incheol
author_sort	Lee, JaeYun
collection	PubMed
description	Visual commonsense reasoning is an intelligent task performed to decide the most appropriate answer to a question while providing the rationale or reason for the answer when an image, a natural language question, and candidate responses are given. For effective visual commonsense reasoning, both the knowledge acquisition problem and the multimodal alignment problem need to be solved. Therefore, we propose a novel Vision–Language–Knowledge Co-embedding (ViLaKC) model that extracts knowledge graphs relevant to the question from an external knowledge base, ConceptNet, and uses them together with the input image to answer the question. The proposed model uses a pretrained vision–language–knowledge embedding module, which co-embeds multimodal data including images, natural language texts, and knowledge graphs into a single feature vector. To reflect the structural information of the knowledge graph, the proposed model uses the graph convolutional neural network layer to embed the knowledge graph first and then uses multi-head self-attention layers to co-embed it with the image and natural language question. The effectiveness and performance of the proposed model are experimentally validated using the VCR v1.0 benchmark dataset.
format	Online Article Text
id	pubmed-8122639
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8122639 2021-05-16 Vision–Language–Knowledge Co-Embedding for Visual Commonsense Reasoning. Lee, JaeYun; Kim, Incheol. Sensors (Basel). Article. MDPI 2021-04-21 /pmc/articles/PMC8122639/ /pubmed/33919196 http://dx.doi.org/10.3390/s21092911 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Lee, JaeYun Kim, Incheol Vision–Language–Knowledge Co-Embedding for Visual Commonsense Reasoning
title	Vision–Language–Knowledge Co-Embedding for Visual Commonsense Reasoning
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8122639/ https://www.ncbi.nlm.nih.gov/pubmed/33919196 http://dx.doi.org/10.3390/s21092911
work_keys_str_mv	AT leejaeyun visionlanguageknowledgecoembeddingforvisualcommonsensereasoning AT kimincheol visionlanguageknowledgecoembeddingforvisualcommonsensereasoning
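
The description field above outlines a two-stage data flow: a graph convolutional layer embeds the knowledge graph, and multi-head self-attention then co-embeds the graph with the image and the question into a single feature vector. The following is a minimal sketch of that flow in plain Python, not the authors' ViLaKC implementation. The function names, the random untrained weights, the toy feature sizes, and the mean-pooling step used to obtain one vector are all assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights (assumption: stands in for learned parameters)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so every node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def multi_head_self_attention(tokens, heads, rng):
    """Scaled dot-product self-attention with `heads` heads, outputs concatenated."""
    d_head = len(tokens[0]) // heads
    head_outputs = []
    for _ in range(heads):
        wq, wk, wv = (random_matrix(len(tokens[0]), d_head, rng) for _ in range(3))
        q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
        k_t = [list(col) for col in zip(*k)]
        scores = [[s / math.sqrt(d_head) for s in row] for row in matmul(q, k_t)]
        head_outputs.append(matmul([softmax(row) for row in scores], v))
    # Concatenate the per-head outputs token by token.
    return [sum((h[i] for h in head_outputs), []) for i in range(len(tokens))]


def co_embed(image_feats, text_feats, adj, node_feats, heads=2, seed=0):
    """Embed the graph first, then attend jointly over image, text, and graph tokens."""
    rng = random.Random(seed)
    d = len(image_feats[0])
    graph_feats = gcn_layer(adj, node_feats, random_matrix(len(node_feats[0]), d, rng))
    tokens = image_feats + text_feats + graph_feats
    attended = multi_head_self_attention(tokens, heads, rng)
    # Assumption: mean-pool the attended tokens to get a single feature vector.
    return [sum(col) / len(attended) for col in zip(*attended)]


# Toy inputs: 2 image-region features, 1 question-token feature, a 3-node graph.
image_feats = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
text_feats = [[0.3, 0.3, 0.1, 0.0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

vector = co_embed(image_feats, text_feats, adj, node_feats)
print(len(vector))  # 4
```

In this sketch the graph is embedded before attention so that each node's token already mixes in its neighbours' features, which is the ordering the description gives as the way to reflect the graph's structural information.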
