
Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations

Existing visual question answering methods typically concentrate only on the visual targets in an image and ignore the key textual content it contains, limiting the depth and accuracy of image comprehension. Motivated by this, we focus on the task of text-based visual question answering, address the performance bottleneck caused by over-fitting in existing self-attention-based models, and propose a scene-text visual question answering method, INT2-VQA, that fuses knowledge representations based on inter-modality and intra-modality collaborations. Specifically, we model the complementary prior knowledge of locational collaboration between visual and textual targets across modalities and of contextual semantic collaboration among textual word targets within a modality. On this basis, a universal knowledge-reinforced attention module is designed to encode both relations in a unified representation. Extensive ablation experiments, comparison experiments, and visualization analyses demonstrate the effectiveness of the proposed method and its superiority over other state-of-the-art methods.
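The abstract describes a knowledge-reinforced attention module that injects inter-modality locational priors and intra-modality semantic priors into self-attention. As a rough sketch only, not the paper's implementation, the following PyTorch snippet biases standard multi-head attention logits with a precomputed pairwise relational prior; the names (KnowledgeReinforcedAttention, relation_bias, prior_gate) and shapes are illustrative assumptions.

# Hypothetical sketch: scaled dot-product attention whose logits are shifted by a
# precomputed pairwise relational prior, gated per head by a learned scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeReinforcedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learned per-head gate controlling how strongly the prior biases attention
        # (initialized to zero, so the prior is phased in during training).
        self.prior_gate = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x: torch.Tensor, relation_bias: torch.Tensor) -> torch.Tensor:
        # x:             (batch, seq, dim)  fused visual + text-token features
        # relation_bias: (batch, seq, seq)  precomputed pairwise prior scores
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each to (batch, heads, seq, head_dim).
        q, k, v = [t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v)]
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # Add the relational prior, broadcast over heads and gated per head.
        logits = logits + self.prior_gate.view(1, -1, 1, 1) * relation_bias.unsqueeze(1)
        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.proj(out)

if __name__ == "__main__":
    layer = KnowledgeReinforcedAttention(dim=64, num_heads=4)
    feats = torch.randn(2, 10, 64)   # 10 visual + text targets per image
    prior = torch.randn(2, 10, 10)   # e.g. spatial-overlap or co-occurrence scores
    print(layer(feats, prior).shape)  # torch.Size([2, 10, 64])

Here relation_bias stands in for priors that could be computed offline, for example from the spatial layout of detected visual objects and OCR tokens (inter-modality) or from contextual relations among the text tokens (intra-modality), mirroring the two collaboration types named in the abstract.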


Bibliographic Details
Main Author: Sheng, Xianli
Format: Online Article Text
Language: English
Published: Public Library of Science, 2023
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10468077/
https://www.ncbi.nlm.nih.gov/pubmed/37647277
http://dx.doi.org/10.1371/journal.pone.0290315
author Sheng, Xianli
collection PubMed
description Existing visual question answering methods typically concentrate only on the visual targets in an image and ignore the key textual content it contains, limiting the depth and accuracy of image comprehension. Motivated by this, we focus on the task of text-based visual question answering, address the performance bottleneck caused by over-fitting in existing self-attention-based models, and propose a scene-text visual question answering method, INT2-VQA, that fuses knowledge representations based on inter-modality and intra-modality collaborations. Specifically, we model the complementary prior knowledge of locational collaboration between visual and textual targets across modalities and of contextual semantic collaboration among textual word targets within a modality. On this basis, a universal knowledge-reinforced attention module is designed to encode both relations in a unified representation. Extensive ablation experiments, comparison experiments, and visualization analyses demonstrate the effectiveness of the proposed method and its superiority over other state-of-the-art methods.
format Online
Article
Text
id pubmed-10468077
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10468077 2023-08-31 Sheng, Xianli PLoS One Research Article Public Library of Science 2023-08-30 /pmc/articles/PMC10468077/ /pubmed/37647277 http://dx.doi.org/10.1371/journal.pone.0290315 Text en © 2023 Xianli Sheng. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10468077/
https://www.ncbi.nlm.nih.gov/pubmed/37647277
http://dx.doi.org/10.1371/journal.pone.0290315