
Visual-Text Reference Pretraining Model for Image Captioning

People can accurately describe an image by constantly referring to the visual information and key text information of the image. Inspired by this idea, we propose the VTR-PTM (Visual-Text Reference Pretraining Model) for image captioning. First, based on the pretraining model (BERT/UNILM), we design...

Full description

Bibliographic Details
Main Authors: Li, Pengfei, Zhang, Min, Lin, Peijie, Wan, Jian, Jiang, Ming
Format: Online Article Text
Language: English
Published: Hindawi 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8799330/
https://www.ncbi.nlm.nih.gov/pubmed/35096050
http://dx.doi.org/10.1155/2022/9400999
_version_ 1784642045982277632
author Li, Pengfei
Zhang, Min
Lin, Peijie
Wan, Jian
Jiang, Ming
author_sort Li, Pengfei
collection PubMed
description People can accurately describe an image by constantly referring to the visual information and key text information of the image. Inspired by this idea, we propose the VTR-PTM (Visual-Text Reference Pretraining Model) for image captioning. First, based on the pretraining model (BERT/UNILM), we design the dual-stream input mode of image reference and text reference and use two different mask modes (bidirectional and sequence to sequence) to realize the VTR-PTM suitable for generating tasks. Second, the target dataset is used to fine-tune the VTR-PTM. To the best of our knowledge, VTR-PTM is the first reported pretraining model to use visual-text references in the learning process. To evaluate the model, we conduct several experiments on the benchmark datasets of image captioning, including MS COCO and Visual Genome, and achieve significant improvements on most metrics. The code is available at https://github.com/lpfworld/VTR-PTM.
format Online
Article
Text
id pubmed-8799330
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-8799330 2022-01-29 Visual-Text Reference Pretraining Model for Image Captioning Li, Pengfei Zhang, Min Lin, Peijie Wan, Jian Jiang, Ming Comput Intell Neurosci Research Article People can accurately describe an image by constantly referring to the visual information and key text information of the image. Inspired by this idea, we propose the VTR-PTM (Visual-Text Reference Pretraining Model) for image captioning. First, based on the pretraining model (BERT/UNILM), we design the dual-stream input mode of image reference and text reference and use two different mask modes (bidirectional and sequence to sequence) to realize the VTR-PTM suitable for generating tasks. Second, the target dataset is used to fine-tune the VTR-PTM. To the best of our knowledge, VTR-PTM is the first reported pretraining model to use visual-text references in the learning process. To evaluate the model, we conduct several experiments on the benchmark datasets of image captioning, including MS COCO and Visual Genome, and achieve significant improvements on most metrics. The code is available at https://github.com/lpfworld/VTR-PTM. Hindawi 2022-01-21 /pmc/articles/PMC8799330/ /pubmed/35096050 http://dx.doi.org/10.1155/2022/9400999 Text en Copyright © 2022 Pengfei Li et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Visual-Text Reference Pretraining Model for Image Captioning
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8799330/
https://www.ncbi.nlm.nih.gov/pubmed/35096050
http://dx.doi.org/10.1155/2022/9400999
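The abstract above mentions two attention-mask modes, bidirectional and sequence-to-sequence, used to adapt a BERT/UNILM-style Transformer so it can both encode references and generate captions. The following is a minimal illustrative sketch of how such masks are commonly constructed in PyTorch; it is not taken from the VTR-PTM paper or its repository, and the split into "reference" and "caption" token segments is an assumption made here for illustration.

```python
# Minimal sketch (assumed, UNILM-style) of the two attention-mask modes:
# bidirectional attention for encoding and a sequence-to-sequence mask for
# caption generation. Not the authors' implementation.
import torch


def build_attention_mask(num_ref_tokens: int, num_caption_tokens: int, mode: str) -> torch.Tensor:
    """Return a (seq_len, seq_len) mask where 1 means the position may be attended to.

    mode="bidirectional": every token attends to every token.
    mode="seq2seq": reference tokens attend only within the reference segment;
                    caption tokens attend to the whole reference segment and to
                    earlier caption positions (causal), enabling generation.
    """
    seq_len = num_ref_tokens + num_caption_tokens
    if mode == "bidirectional":
        return torch.ones(seq_len, seq_len, dtype=torch.long)

    mask = torch.zeros(seq_len, seq_len, dtype=torch.long)
    # Reference segment: full self-attention among reference tokens.
    mask[:num_ref_tokens, :num_ref_tokens] = 1
    # Caption segment: may attend to the entire reference segment...
    mask[num_ref_tokens:, :num_ref_tokens] = 1
    # ...and only to earlier (or current) caption positions (lower triangle).
    causal = torch.tril(torch.ones(num_caption_tokens, num_caption_tokens, dtype=torch.long))
    mask[num_ref_tokens:, num_ref_tokens:] = causal
    return mask


if __name__ == "__main__":
    # Example: 4 reference tokens (visual/text) followed by 3 caption tokens.
    print(build_attention_mask(4, 3, mode="seq2seq"))
```

In practice such a mask is broadcast over the batch and attention heads and added (as a large negative value at disallowed positions) to the attention logits; the exact wiring in VTR-PTM is described in the paper and repository linked above.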