
Context-Fused Guidance for Image Captioning Using Sequence-Level Training

Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in generating human-like sentences. However, the explicit separation between the encoder and the decoder introduces a disconnection between the image and the sentence. This usually leads to a rough image description: the generated caption covers only the main instances while unexpectedly neglecting additional objects and scenes, which reduces the consistency between the caption and the image. To address this issue, we propose an image captioning system with context-fused guidance in this paper. It incorporates regional and global image representations as compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, the visual concept is employed. To avoid misleading the decoding, a context fusion gate is introduced to compute the textual context by selectively aggregating the information of the visual concept and the word embedding. Subsequently, the context-fused image guidance is formulated from the compositional visual features and the textual context; it provides the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture is constructed to generate captions. Moreover, to overcome exposure bias, we train the proposed model through sequential decision-making. Experiments conducted on the MS COCO dataset show the outstanding performance of our approach. The linguistic analysis demonstrates that our model improves the consistency between the caption and the image.
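
The abstract gives no implementation details for the context fusion gate, so the following is only a minimal PyTorch sketch of the gating idea it describes: a learned sigmoid gate that selectively blends a visual-concept vector with the current word embedding to form the textual context. The class name, layer layout, and dimensions are all hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn


class ContextFusionGate(nn.Module):
    """Sketch of a context fusion gate: a sigmoid gate selectively
    aggregates the visual concept vector and the current word embedding
    into a single textual-context vector (layout is an assumption)."""

    def __init__(self, dim: int):
        super().__init__()
        # Gate computed from the concatenation of both inputs (assumption).
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual_concept: torch.Tensor,
                word_emb: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per dimension, how much visual-concept
        # information flows into the textual context.
        g = torch.sigmoid(self.gate(torch.cat([visual_concept, word_emb], dim=-1)))
        return g * visual_concept + (1.0 - g) * word_emb


# Usage: blend a batch of 8 visual-concept vectors with word embeddings.
gate = ContextFusionGate(dim=512)
context = gate(torch.randn(8, 512), torch.randn(8, 512))  # -> shape (8, 512)
```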
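The captioner itself is described only as a two-layer LSTM architecture. As a rough illustration, here is a generic two-layer LSTM decoding step in the common top-down layout; the way the context-fused guidance enters layer 1 is an assumption, as are all names and sizes.

```python
import torch
import torch.nn as nn


class TwoLayerCaptioner(nn.Module):
    """Generic two-layer LSTM captioner sketch (not the paper's exact
    architecture). Layer 1 consumes the word embedding plus the
    context-fused guidance; layer 2 maps layer 1's state to the vocabulary."""

    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm1 = nn.LSTMCell(2 * dim, dim)   # word embedding + guidance
        self.lstm2 = nn.LSTMCell(dim, dim)
        self.logits = nn.Linear(dim, vocab_size)

    def step(self, word, guidance, state1, state2):
        # One decoding step: returns next-word logits and updated states.
        x = torch.cat([self.embed(word), guidance], dim=-1)
        h1, c1 = self.lstm1(x, state1)
        h2, c2 = self.lstm2(h1, state2)
        return self.logits(h2), (h1, c1), (h2, c2)


# Usage: one step for a batch of 4, starting from zero states.
dim = 512
model = TwoLayerCaptioner(vocab_size=10000, dim=dim)
zeros = (torch.zeros(4, dim), torch.zeros(4, dim))
logits, s1, s2 = model.step(torch.zeros(4, dtype=torch.long),
                            torch.randn(4, dim), zeros, zeros)
```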
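Finally, the abstract says the model is trained through sequential decision-making to overcome exposure bias, without naming the algorithm. One common instantiation of sequence-level training is self-critical sequence training (SCST), sketched below as an assumption; the sampled log-probabilities and the rewards (e.g. CIDEr scores) are taken as given.

```python
import torch


def scst_loss(sample_logprobs: torch.Tensor,
              sample_reward: torch.Tensor,
              greedy_reward: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style sequence-level loss in the self-critical form:
    the greedily decoded caption's reward serves as the baseline, so
    sampled captions scoring above it get their log-probability pushed up.
    This is a generic sketch, not necessarily the paper's exact objective."""
    advantage = sample_reward - greedy_reward            # baseline-subtracted reward
    return -(advantage.detach() * sample_logprobs).mean()


# Example with dummy numbers for a batch of two sampled captions.
loss = scst_loss(torch.tensor([-12.3, -9.8], requires_grad=True),
                 torch.tensor([1.1, 0.9]),    # rewards of sampled captions
                 torch.tensor([1.0, 1.0]))    # rewards of greedy captions
loss.backward()
```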


Bibliographic Details
Main Authors: Feng, Junlong, Zhao, Jianping
Format: Online Article Text
Language: English
Published: Hindawi 2022
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8754620/
https://www.ncbi.nlm.nih.gov/pubmed/35035470
http://dx.doi.org/10.1155/2022/9743123
_version_ 1784632309449752576
author Feng, Junlong
Zhao, Jianping
author_facet Feng, Junlong
Zhao, Jianping
author_sort Feng, Junlong
collection PubMed
description Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in generating human-like sentences. However, the explicit separation between the encoder and the decoder introduces a disconnection between the image and the sentence. This usually leads to a rough image description: the generated caption covers only the main instances while unexpectedly neglecting additional objects and scenes, which reduces the consistency between the caption and the image. To address this issue, we propose an image captioning system with context-fused guidance in this paper. It incorporates regional and global image representations as compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, the visual concept is employed. To avoid misleading the decoding, a context fusion gate is introduced to compute the textual context by selectively aggregating the information of the visual concept and the word embedding. Subsequently, the context-fused image guidance is formulated from the compositional visual features and the textual context; it provides the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture is constructed to generate captions. Moreover, to overcome exposure bias, we train the proposed model through sequential decision-making. Experiments conducted on the MS COCO dataset show the outstanding performance of our approach. The linguistic analysis demonstrates that our model improves the consistency between the caption and the image.
format Online
Article
Text
id pubmed-8754620
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-8754620 2022-01-13 Context-Fused Guidance for Image Captioning Using Sequence-Level Training Feng, Junlong Zhao, Jianping Comput Intell Neurosci Research Article Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in generating human-like sentences. However, the explicit separation between the encoder and the decoder introduces a disconnection between the image and the sentence. This usually leads to a rough image description: the generated caption covers only the main instances while unexpectedly neglecting additional objects and scenes, which reduces the consistency between the caption and the image. To address this issue, we propose an image captioning system with context-fused guidance in this paper. It incorporates regional and global image representations as compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, the visual concept is employed. To avoid misleading the decoding, a context fusion gate is introduced to compute the textual context by selectively aggregating the information of the visual concept and the word embedding. Subsequently, the context-fused image guidance is formulated from the compositional visual features and the textual context; it provides the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture is constructed to generate captions. Moreover, to overcome exposure bias, we train the proposed model through sequential decision-making. Experiments conducted on the MS COCO dataset show the outstanding performance of our approach. The linguistic analysis demonstrates that our model improves the consistency between the caption and the image. Hindawi 2022-01-05 /pmc/articles/PMC8754620/ /pubmed/35035470 http://dx.doi.org/10.1155/2022/9743123 Text en Copyright © 2022 Junlong Feng and Jianping Zhao. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Feng, Junlong
Zhao, Jianping
Context-Fused Guidance for Image Captioning Using Sequence-Level Training
title Context-Fused Guidance for Image Captioning Using Sequence-Level Training
title_full Context-Fused Guidance for Image Captioning Using Sequence-Level Training
title_fullStr Context-Fused Guidance for Image Captioning Using Sequence-Level Training
title_full_unstemmed Context-Fused Guidance for Image Captioning Using Sequence-Level Training
title_short Context-Fused Guidance for Image Captioning Using Sequence-Level Training
title_sort context-fused guidance for image captioning using sequence-level training
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8754620/
https://www.ncbi.nlm.nih.gov/pubmed/35035470
http://dx.doi.org/10.1155/2022/9743123
work_keys_str_mv AT fengjunlong contextfusedguidanceforimagecaptioningusingsequenceleveltraining
AT zhaojianping contextfusedguidanceforimagecaptioningusingsequenceleveltraining