Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations
Existing visual question answering methods typically concentrate only on visual targets in images, ignoring the key textual content in the images, thereby limiting the depth and accuracy of image content comprehension. Motivated by this, we focus on the task of text-based visual question answering, address the performance bottleneck caused by the over-fitting risk in existing self-attention-based models, and propose a scene-text visual question answering method called INT2-VQA that fuses knowledge representations based on inter-modality and intra-modality collaborations. Specifically, we model the complementary a priori knowledge of locational collaboration between visual targets and textual targets across modalities and of contextual semantic collaboration among textual word targets within a modality. Based on this, a universal knowledge-reinforced attention module is designed to achieve a unified encoded representation of both relations. Extensive ablation experiments, comparison experiments, and visual analyses demonstrate the effectiveness of the proposed method and its superiority over other state-of-the-art methods.
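The abstract describes a "universal knowledge-reinforced attention module" that injects inter-modality (visual–text location) and intra-modality (text–text semantic) relation priors into a single attention encoder. As a rough illustration only, the sketch below shows one common way such pairwise priors can be fused into self-attention, as an additive bias on the attention logits; all names, shapes, and the additive-bias fusion itself are assumptions, not the paper's implementation.

```python
# Hypothetical sketch: scaled dot-product self-attention whose logits are
# biased by a precomputed pairwise relation prior (e.g., spatial relations
# between visual and OCR-text objects, or semantic relations among OCR
# tokens). Illustrative only; not the INT2-VQA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeReinforcedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Maps a scalar pairwise relation feature to one additive bias per head.
        self.relation_bias = nn.Linear(1, num_heads)

    def forward(self, x: torch.Tensor, relation_prior: torch.Tensor) -> torch.Tensor:
        # x:              (batch, n_tokens, dim)       fused visual + OCR-text features
        # relation_prior: (batch, n_tokens, n_tokens)  precomputed pairwise prior
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))

        # Standard scaled dot-product logits.
        logits = torch.matmul(q, k.transpose(-2, -1)) / self.head_dim ** 0.5
        # Inject the prior knowledge as an additive, per-head bias on the logits.
        bias = self.relation_bias(relation_prior.unsqueeze(-1)).permute(0, 3, 1, 2)
        attn = F.softmax(logits + bias, dim=-1)

        out = torch.matmul(attn, v).transpose(1, 2).reshape(b, n, -1)
        return self.proj(out)
```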
Main Author: | Sheng, Xianli |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science, 2023 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10468077/ https://www.ncbi.nlm.nih.gov/pubmed/37647277 http://dx.doi.org/10.1371/journal.pone.0290315 |
Field | Value
---|---|
_version_ | 1785099167551455232 |
author | Sheng, Xianli |
author_facet | Sheng, Xianli |
author_sort | Sheng, Xianli |
collection | PubMed |
description | Existing visual question answering methods typically concentrate only on visual targets in images, ignoring the key textual content in the images, thereby limiting the depth and accuracy of image content comprehension. Motivated by this, we focus on the task of text-based visual question answering, address the performance bottleneck caused by the over-fitting risk in existing self-attention-based models, and propose a scene-text visual question answering method called INT2-VQA that fuses knowledge representations based on inter-modality and intra-modality collaborations. Specifically, we model the complementary a priori knowledge of locational collaboration between visual targets and textual targets across modalities and of contextual semantic collaboration among textual word targets within a modality. Based on this, a universal knowledge-reinforced attention module is designed to achieve a unified encoded representation of both relations. Extensive ablation experiments, comparison experiments, and visual analyses demonstrate the effectiveness of the proposed method and its superiority over other state-of-the-art methods. |
format | Online Article Text |
id | pubmed-10468077 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-104680772023-08-31 Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations Sheng, Xianli PLoS One Research Article Existing visual question answering methods typically concentrate only on visual targets in images, ignoring the key textual content in the images, thereby limiting the depth and accuracy of image content comprehension. Motivated by this, we focus on the task of text-based visual question answering, address the performance bottleneck caused by the over-fitting risk in existing self-attention-based models, and propose a scene-text visual question answering method called INT2-VQA that fuses knowledge representations based on inter-modality and intra-modality collaborations. Specifically, we model the complementary a priori knowledge of locational collaboration between visual targets and textual targets across modalities and of contextual semantic collaboration among textual word targets within a modality. Based on this, a universal knowledge-reinforced attention module is designed to achieve a unified encoded representation of both relations. Extensive ablation experiments, comparison experiments, and visual analyses demonstrate the effectiveness of the proposed method and its superiority over other state-of-the-art methods. Public Library of Science 2023-08-30 /pmc/articles/PMC10468077/ /pubmed/37647277 http://dx.doi.org/10.1371/journal.pone.0290315 Text en © 2023 Xianli Sheng https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Sheng, Xianli Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations |
title | Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations |
title_full | Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations |
title_fullStr | Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations |
title_full_unstemmed | Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations |
title_short | Image to English translation and comprehension: INT2-VQA method based on inter-modality and intra-modality collaborations |
title_sort | image to english translation and comprehension: int2-vqa method based on inter-modality and intra-modality collaborations |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10468077/ https://www.ncbi.nlm.nih.gov/pubmed/37647277 http://dx.doi.org/10.1371/journal.pone.0290315 |
work_keys_str_mv | AT shengxianli imagetoenglishtranslationandcomprehensionint2vqamethodbasedonintermodalityandintramodalitycollaborations |