
Video captioning based on vision transformer and reinforcement learning

Bibliographic Details
Main Authors: Zhao, Hong, Chen, Zhiwen, Guo, Lan, Han, Zeyu
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects: Artificial Intelligence
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044334/
https://www.ncbi.nlm.nih.gov/pubmed/35494808
http://dx.doi.org/10.7717/peerj-cs.916
_version_ 1784695083559288832
author Zhao, Hong
Chen, Zhiwen
Guo, Lan
Han, Zeyu
author_facet Zhao, Hong
Chen, Zhiwen
Guo, Lan
Han, Zeyu
author_sort Zhao, Hong
collection PubMed
description Global encoding of visual features in video captioning is important for improving the description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. Firstly, ResNet-152 and ResNeXt-101 are used to extract features from videos. Secondly, the encoding block of the ViT network is applied to encode the video features. Thirdly, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a video content description. Finally, the accuracy of the video content description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, our model improves by 2.9%, 1.4%, 0.9% and 4.8% under the four evaluation metrics BLEU-4, METEOR, ROUGE-L and CIDEr-D, respectively.
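The abstract's final step, fine-tuning the caption generator with reinforcement learning, is commonly realized as REINFORCE with a greedy-decoding baseline (self-critical sequence training), where the reward is a caption metric such as CIDEr-D. The sketch below illustrates only that reward/advantage computation; `toy_cider` is a hypothetical stand-in for the real CIDEr-D metric, and the function names are illustrative assumptions, not the paper's actual implementation.

```python
def toy_cider(candidate, reference):
    """Toy stand-in for CIDEr-D: fraction of candidate words that
    appear in the reference caption (the real metric uses TF-IDF
    weighted n-gram similarity across multiple references)."""
    ref_words = set(reference.split())
    cand_words = candidate.split()
    if not cand_words:
        return 0.0
    return sum(w in ref_words for w in cand_words) / len(cand_words)

def scst_advantage(sampled, greedy, reference):
    """Self-critical advantage: reward of a sampled caption minus the
    reward of the greedy (baseline) caption. A positive value means
    the sampled caption's log-probability should be reinforced."""
    return toy_cider(sampled, reference) - toy_cider(greedy, reference)

# Example: the sampled caption matches the reference better than the
# greedy baseline, so its advantage is positive.
reference = "a man is playing a guitar"
adv = scst_advantage("a man playing guitar", "a person with an object", reference)
```

During fine-tuning, this advantage would scale the negative log-likelihood of the sampled caption, so the LSTM decoder is pushed directly toward captions that score higher on the (non-differentiable) evaluation metric.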
format Online
Article
Text
id pubmed-9044334
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-90443342022-04-28 Video captioning based on vision transformer and reinforcement learning Zhao, Hong Chen, Zhiwen Guo, Lan Han, Zeyu PeerJ Comput Sci Artificial Intelligence Global encoding of visual features in video captioning is important for improving the description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. Firstly, ResNet-152 and ResNeXt-101 are used to extract features from videos. Secondly, the encoding block of the ViT network is applied to encode the video features. Thirdly, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a video content description. Finally, the accuracy of the video content description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, our model improves by 2.9%, 1.4%, 0.9% and 4.8% under the four evaluation metrics BLEU-4, METEOR, ROUGE-L and CIDEr-D, respectively. PeerJ Inc. 2022-03-16 /pmc/articles/PMC9044334/ /pubmed/35494808 http://dx.doi.org/10.7717/peerj-cs.916 Text en © 2022 Zhao et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle Artificial Intelligence
Zhao, Hong
Chen, Zhiwen
Guo, Lan
Han, Zeyu
Video captioning based on vision transformer and reinforcement learning
title Video captioning based on vision transformer and reinforcement learning
title_full Video captioning based on vision transformer and reinforcement learning
title_fullStr Video captioning based on vision transformer and reinforcement learning
title_full_unstemmed Video captioning based on vision transformer and reinforcement learning
title_short Video captioning based on vision transformer and reinforcement learning
title_sort video captioning based on vision transformer and reinforcement learning
topic Artificial Intelligence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044334/
https://www.ncbi.nlm.nih.gov/pubmed/35494808
http://dx.doi.org/10.7717/peerj-cs.916
work_keys_str_mv AT zhaohong videocaptioningbasedonvisiontransformerandreinforcementlearning
AT chenzhiwen videocaptioningbasedonvisiontransformerandreinforcementlearning
AT guolan videocaptioningbasedonvisiontransformerandreinforcementlearning
AT hanzeyu videocaptioningbasedonvisiontransformerandreinforcementlearning