Context-Fused Guidance for Image Captioning Using Sequence-Level Training
Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in generating human-like sentences. However, the explicit separation between encoder and decoder introduces a disconnect between the image and the sentence, which usually leads to a rough image description: the generated caption covers the main instances but unexpectedly neglects additional objects and scenes, reducing the caption's consistency with the image. To address this issue, we propose an image captioning system with context-fused guidance. It combines regional and global image representations into compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, visual concepts are employed. To avoid misleading the decoding, a context fusion gate is introduced that computes the textual context by selectively aggregating visual-concept and word-embedding information. The context-fused image guidance is then formulated from the compositional visual features and the textual context, providing the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture generates the captions. Moreover, to overcome exposure bias, we train the proposed model through sequence-level decision-making. Experiments on the MS COCO dataset show the strong performance of our approach, and linguistic analysis demonstrates that our model improves the consistency between caption and image.
Main Authors: | Feng, Junlong; Zhao, Jianping |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Hindawi, 2022 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8754620/ https://www.ncbi.nlm.nih.gov/pubmed/35035470 http://dx.doi.org/10.1155/2022/9743123 |
_version_ | 1784632309449752576 |
---|---|
author | Feng, Junlong Zhao, Jianping |
author_facet | Feng, Junlong Zhao, Jianping |
author_sort | Feng, Junlong |
collection | PubMed |
description | Recent image captioning models based on the encoder-decoder framework have achieved remarkable success in generating human-like sentences. However, the explicit separation between encoder and decoder introduces a disconnect between the image and the sentence, which usually leads to a rough image description: the generated caption covers the main instances but unexpectedly neglects additional objects and scenes, reducing the caption's consistency with the image. To address this issue, we propose an image captioning system with context-fused guidance. It combines regional and global image representations into compositional visual features to learn the objects and attributes in images. To integrate image-level semantic information, visual concepts are employed. To avoid misleading the decoding, a context fusion gate is introduced that computes the textual context by selectively aggregating visual-concept and word-embedding information. The context-fused image guidance is then formulated from the compositional visual features and the textual context, providing the decoder with informative semantic knowledge. Finally, a captioner with a two-layer LSTM architecture generates the captions. Moreover, to overcome exposure bias, we train the proposed model through sequence-level decision-making. Experiments on the MS COCO dataset show the strong performance of our approach, and linguistic analysis demonstrates that our model improves the consistency between caption and image. |
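The description names three mechanisms: a context fusion gate over visual concepts and word embeddings, a context-fused guidance vector built from compositional visual features and the textual context, and sequence-level (reward-driven) training to counter exposure bias. The sketch below is a minimal PyTorch reading of those mechanisms; the module names, the exact gating and fusion formulas, and the CIDEr reward are our assumptions, not the authors' published code.

```python
import torch
import torch.nn as nn

class ContextFusionGate(nn.Module):
    """Sigmoid gate that selectively aggregates visual-concept and
    word-embedding information into a textual context vector
    (assumed formulation; the paper defines the exact gate)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, concept: torch.Tensor, word: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per dimension, how much of the visual
        # concept vs. the current word embedding enters the context.
        g = torch.sigmoid(self.gate(torch.cat([concept, word], dim=-1)))
        return g * concept + (1.0 - g) * word

class ContextFusedGuidance(nn.Module):
    """Fuses compositional visual features (regional + global) with the
    gated textual context to guide the two-layer LSTM decoder."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.fuse(torch.cat([visual, context], dim=-1)))

def sequence_level_loss(log_probs: torch.Tensor,
                        sample_reward: torch.Tensor,
                        baseline_reward: torch.Tensor) -> torch.Tensor:
    """REINFORCE-with-baseline loss for sequence decision-making.
    log_probs: (batch, T) log-probabilities of a sampled caption;
    rewards: (batch,) sentence-level scores, e.g. CIDEr (assumption)."""
    advantage = (sample_reward - baseline_reward).detach()
    return -(advantage * log_probs.sum(dim=1)).mean()
```

In a typical self-critical setup, `baseline_reward` is the score of the greedy-decoded caption, so only sampled captions that beat the greedy baseline are reinforced, which directly optimizes the sentence-level metric instead of per-word likelihood.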
format | Online Article Text |
id | pubmed-8754620 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Hindawi |
record_format | MEDLINE/PubMed |
spelling | pubmed-8754620 2022-01-13 Context-Fused Guidance for Image Captioning Using Sequence-Level Training Feng, Junlong Zhao, Jianping Comput Intell Neurosci Research Article Hindawi 2022-01-05 /pmc/articles/PMC8754620/ /pubmed/35035470 http://dx.doi.org/10.1155/2022/9743123 Text en Copyright © 2022 Junlong Feng and Jianping Zhao. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Feng, Junlong Zhao, Jianping Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title | Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title_full | Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title_fullStr | Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title_full_unstemmed | Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title_short | Context-Fused Guidance for Image Captioning Using Sequence-Level Training |
title_sort | context-fused guidance for image captioning using sequence-level training |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8754620/ https://www.ncbi.nlm.nih.gov/pubmed/35035470 http://dx.doi.org/10.1155/2022/9743123 |
work_keys_str_mv | AT fengjunlong contextfusedguidanceforimagecaptioningusingsequenceleveltraining AT zhaojianping contextfusedguidanceforimagecaptioningusingsequenceleveltraining |