Visual-Text Reference Pretraining Model for Image Captioning
People can accurately describe an image by repeatedly referring to its visual content and to its key textual information. Inspired by this idea, we propose VTR-PTM (Visual-Text Reference Pretraining Model) for image captioning. First, building on a pretraining model (BERT/UNILM), we design a dual-stream input mode of image reference and text reference and use two different mask modes (bidirectional and sequence-to-sequence) to make VTR-PTM suitable for generation tasks. Second, the target dataset is used to fine-tune VTR-PTM. To the best of our knowledge, VTR-PTM is the first reported pretraining model to use visual-text references in the learning process. To evaluate the model, we conduct experiments on benchmark image captioning datasets, including MS COCO and Visual Genome, and achieve significant improvements on most metrics. The code is available at https://github.com/lpfworld/VTR-PTM.
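The two mask modes named in the abstract follow the UNILM convention for adapting a bidirectional encoder to generation. As a rough illustration only, not the authors' released code (the function name and tensor layout here are assumptions), a minimal PyTorch sketch of how such masks are typically built:

```python
import torch

def build_self_attention_mask(src_len: int, tgt_len: int,
                              mode: str = "seq2seq") -> torch.Tensor:
    """Hypothetical sketch of UNILM-style self-attention masks over a
    concatenated [reference ; caption] sequence.
    1 = position may attend, 0 = blocked."""
    total = src_len + tgt_len
    if mode == "bidirectional":
        # Cloze-style pretraining: every token attends to every token.
        return torch.ones(total, total, dtype=torch.long)
    if mode == "seq2seq":
        mask = torch.zeros(total, total, dtype=torch.long)
        # Reference tokens attend bidirectionally, but only within the reference.
        mask[:src_len, :src_len] = 1
        # Caption tokens see the entire reference...
        mask[src_len:, :src_len] = 1
        # ...and attend causally over the caption, enabling left-to-right decoding.
        mask[src_len:, src_len:] = torch.tril(
            torch.ones(tgt_len, tgt_len, dtype=torch.long))
        return mask
    raise ValueError(f"unknown mask mode: {mode!r}")
```

The sequence-to-sequence mask is what lets a BERT-style encoder generate captions left to right during fine-tuning, while the bidirectional mask supports standard masked-token pretraining.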
Main Authors: | Li, Pengfei; Zhang, Min; Lin, Peijie; Wan, Jian; Jiang, Ming |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Hindawi, 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8799330/ https://www.ncbi.nlm.nih.gov/pubmed/35096050 http://dx.doi.org/10.1155/2022/9400999 |
_version_ | 1784642045982277632 |
---|---|
author | Li, Pengfei; Zhang, Min; Lin, Peijie; Wan, Jian; Jiang, Ming |
collection | PubMed |
description | People can accurately describe an image by repeatedly referring to its visual content and to its key textual information. Inspired by this idea, we propose VTR-PTM (Visual-Text Reference Pretraining Model) for image captioning. First, building on a pretraining model (BERT/UNILM), we design a dual-stream input mode of image reference and text reference and use two different mask modes (bidirectional and sequence-to-sequence) to make VTR-PTM suitable for generation tasks. Second, the target dataset is used to fine-tune VTR-PTM. To the best of our knowledge, VTR-PTM is the first reported pretraining model to use visual-text references in the learning process. To evaluate the model, we conduct experiments on benchmark image captioning datasets, including MS COCO and Visual Genome, and achieve significant improvements on most metrics. The code is available at https://github.com/lpfworld/VTR-PTM. |
format | Online Article Text |
id | pubmed-8799330 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Hindawi |
record_format | MEDLINE/PubMed |
spelling | pubmed-8799330 2022-01-29 Visual-Text Reference Pretraining Model for Image Captioning. Li, Pengfei; Zhang, Min; Lin, Peijie; Wan, Jian; Jiang, Ming. Comput Intell Neurosci, Research Article. Hindawi 2022-01-21 /pmc/articles/PMC8799330/ /pubmed/35096050 http://dx.doi.org/10.1155/2022/9400999 Text en Copyright © 2022 Pengfei Li et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Visual-Text Reference Pretraining Model for Image Captioning |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8799330/ https://www.ncbi.nlm.nih.gov/pubmed/35096050 http://dx.doi.org/10.1155/2022/9400999 |