Context-Fused Guidance for Image Captioning Using Sequence-Level Training
Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in generating human-like sentences. However, the explicit separation between encoder and decoder introduces a disconnect between the image and the sentence, which usually leads to a rough image description: the generated caption covers the main instances but unexpectedly neglects additional objects and scenes, reducing the caption's consistency with the image. To address this issue, we propose an image captioning system with context-fused guidance. It combines regional and global image representations into compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, visual concepts are employed. To avoid misleading the decoding, a context fusion gate is introduced that computes the textual context by selectively aggregating visual-concept and word-embedding information. The context-fused image guidance is then formulated from the compositional visual features and the textual context, providing the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture generates the captions. Moreover, to overcome exposure bias, we train the proposed model through sequence-level decision-making. Experiments on the MS COCO dataset show the strong performance of our approach, and linguistic analysis demonstrates that our model improves the consistency between caption and image.
Main Authors: | Feng, Junlong; Zhao, Jianping |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Hindawi, 2022 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8754620/ https://www.ncbi.nlm.nih.gov/pubmed/35035470 http://dx.doi.org/10.1155/2022/9743123 |
_version_ | 1784632309449752576 |
---|---|
author | Feng, Junlong Zhao, Jianping |
author_facet | Feng, Junlong Zhao, Jianping |
author_sort | Feng, Junlong |
collection | PubMed |
description | Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in generating human-like sentences. However, the explicit separation between encoder and decoder introduces a disconnect between the image and the sentence, which usually leads to a rough image description: the generated caption covers the main instances but unexpectedly neglects additional objects and scenes, reducing the caption's consistency with the image. To address this issue, we propose an image captioning system with context-fused guidance. It combines regional and global image representations into compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, visual concepts are employed. To avoid misleading the decoding, a context fusion gate is introduced that computes the textual context by selectively aggregating visual-concept and word-embedding information. The context-fused image guidance is then formulated from the compositional visual features and the textual context, providing the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture generates the captions. Moreover, to overcome exposure bias, we train the proposed model through sequence-level decision-making. Experiments on the MS COCO dataset show the strong performance of our approach, and linguistic analysis demonstrates that our model improves the consistency between caption and image. |
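The description names three mechanisms: a context fusion gate over visual concepts and word embeddings, a context-fused guidance vector built from compositional visual features and the textual context, and sequence-level (reward-driven) training to counter exposure bias. The sketch below is a minimal PyTorch reading of those mechanisms; the module names, the exact gating and fusion formulas, and the CIDEr reward are our assumptions, not the authors' published code.

```python
import torch
import torch.nn as nn

class ContextFusionGate(nn.Module):
    """Sigmoid gate that selectively aggregates visual-concept and
    word-embedding information into a textual context vector
    (assumed formulation; the paper defines the exact gate)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, concept: torch.Tensor, word: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per dimension, how much of the visual
        # concept vs. the current word embedding enters the context.
        g = torch.sigmoid(self.gate(torch.cat([concept, word], dim=-1)))
        return g * concept + (1.0 - g) * word

class ContextFusedGuidance(nn.Module):
    """Fuses compositional visual features (regional + global) with the
    gated textual context to guide the two-layer LSTM decoder."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.fuse(torch.cat([visual, context], dim=-1)))

def sequence_level_loss(log_probs: torch.Tensor,
                        sample_reward: torch.Tensor,
                        baseline_reward: torch.Tensor) -> torch.Tensor:
    """REINFORCE-with-baseline loss for sequence decision-making.
    log_probs: (batch, T) log-probabilities of a sampled caption;
    rewards: (batch,) sentence-level scores, e.g. CIDEr (assumption)."""
    advantage = (sample_reward - baseline_reward).detach()
    return -(advantage * log_probs.sum(dim=1)).mean()
```

In a typical self-critical setup, `baseline_reward` is the score of the greedy-decoded caption, so only sampled captions that beat the greedy baseline are reinforced, which directly optimizes the sentence-level metric instead of per-word likelihood.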
format | Online Article Text |
id | pubmed-8754620 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Hindawi |
record_format | MEDLINE/PubMed |
spelling | pubmed-8754620 2022-01-13 Context-Fused Guidance for Image Captioning Using Sequence-Level Training Feng, Junlong Zhao, Jianping Comput Intell Neurosci Research Article Hindawi 2022-01-05 /pmc/articles/PMC8754620/ /pubmed/35035470 http://dx.doi.org/10.1155/2022/9743123 Text en Copyright © 2022 Junlong Feng and Jianping Zhao. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Feng, Junlong Zhao, Jianping Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title | Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title_full | Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title_fullStr | Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title_full_unstemmed | Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title_short | Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title_sort | context-fused guidance for image captioning using sequence-level training |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8754620/ https://www.ncbi.nlm.nih.gov/pubmed/35035470 http://dx.doi.org/10.1155/2022/9743123 |
work_keys_str_mv | AT fengjunlong contextfusedguidanceforimagecaptioningusingsequenceleveltraining AT zhaojianping contextfusedguidanceforimagecaptioningusingsequenceleveltraining |