Visual-Text Reference Pretraining Model for Image Captioning
People can accurately describe an image by repeatedly referring to its visual content and to its key textual information. Inspired by this idea, we propose VTR-PTM (Visual-Text Reference Pretraining Model) for image captioning. First, building on a pretraining model (BERT/UNILM), we design a dual-stream input mode of image reference and text reference and use two different mask modes (bidirectional and sequence-to-sequence) to make VTR-PTM suitable for generation tasks. Second, the target dataset is used to fine-tune VTR-PTM. To the best of our knowledge, VTR-PTM is the first reported pretraining model to use visual-text references in the learning process. To evaluate the model, we conduct experiments on benchmark image captioning datasets, including MS COCO and Visual Genome, and achieve significant improvements on most metrics. The code is available at https://github.com/lpfworld/VTR-PTM.
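The two mask modes named in the abstract follow the UNILM convention for adapting a bidirectional encoder to generation. As a rough illustration only, not the authors' released code (the function name and tensor layout here are assumptions), a minimal PyTorch sketch of how such masks are typically built:

```python
import torch

def build_self_attention_mask(src_len: int, tgt_len: int,
                              mode: str = "seq2seq") -> torch.Tensor:
    """Hypothetical sketch of UNILM-style self-attention masks over a
    concatenated [reference ; caption] sequence.
    1 = position may attend, 0 = blocked."""
    total = src_len + tgt_len
    if mode == "bidirectional":
        # Cloze-style pretraining: every token attends to every token.
        return torch.ones(total, total, dtype=torch.long)
    if mode == "seq2seq":
        mask = torch.zeros(total, total, dtype=torch.long)
        # Reference tokens attend bidirectionally, but only within the reference.
        mask[:src_len, :src_len] = 1
        # Caption tokens see the entire reference...
        mask[src_len:, :src_len] = 1
        # ...and attend causally over the caption, enabling left-to-right decoding.
        mask[src_len:, src_len:] = torch.tril(
            torch.ones(tgt_len, tgt_len, dtype=torch.long))
        return mask
    raise ValueError(f"unknown mask mode: {mode!r}")
```

The sequence-to-sequence mask is what lets a BERT-style encoder generate captions left to right during fine-tuning, while the bidirectional mask supports standard masked-token pretraining.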
Main Authors: | Li, Pengfei; Zhang, Min; Lin, Peijie; Wan, Jian; Jiang, Ming |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Hindawi, 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8799330/ https://www.ncbi.nlm.nih.gov/pubmed/35096050 http://dx.doi.org/10.1155/2022/9400999 |
_version_ | 1784642045982277632 |
---|---|
author | Li, Pengfei; Zhang, Min; Lin, Peijie; Wan, Jian; Jiang, Ming |
collection | PubMed |
description | People can accurately describe an image by repeatedly referring to its visual content and to its key textual information. Inspired by this idea, we propose VTR-PTM (Visual-Text Reference Pretraining Model) for image captioning. First, building on a pretraining model (BERT/UNILM), we design a dual-stream input mode of image reference and text reference and use two different mask modes (bidirectional and sequence-to-sequence) to make VTR-PTM suitable for generation tasks. Second, the target dataset is used to fine-tune VTR-PTM. To the best of our knowledge, VTR-PTM is the first reported pretraining model to use visual-text references in the learning process. To evaluate the model, we conduct experiments on benchmark image captioning datasets, including MS COCO and Visual Genome, and achieve significant improvements on most metrics. The code is available at https://github.com/lpfworld/VTR-PTM. |
format | Online Article Text |
id | pubmed-8799330 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Hindawi |
record_format | MEDLINE/PubMed |
spelling | pubmed-8799330 2022-01-29 Visual-Text Reference Pretraining Model for Image Captioning. Li, Pengfei; Zhang, Min; Lin, Peijie; Wan, Jian; Jiang, Ming. Comput Intell Neurosci, Research Article. Hindawi 2022-01-21 /pmc/articles/PMC8799330/ /pubmed/35096050 http://dx.doi.org/10.1155/2022/9400999 Text en Copyright © 2022 Pengfei Li et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Visual-Text Reference Pretraining Model for Image Captioning |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8799330/ https://www.ncbi.nlm.nih.gov/pubmed/35096050 http://dx.doi.org/10.1155/2022/9400999 |