Video captioning based on vision transformer and reinforcement learning
Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by reinforcement-learning fine-tuning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves the BLEU-4, METEOR, ROUGE-L, and CIDEr-D scores by 2.9%, 1.4%, 0.9%, and 4.8%, respectively.
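The abstract names ResNet-152 and ResNeXt-101 as the feature extractors. As a rough illustration of that first step, the sketch below pulls per-frame appearance features with torchvision's pretrained ResNet-152 by swapping its classifier head for an identity; the frame count, input size, and the omission of the ResNeXt-101 motion stream are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of step 1 (feature extraction), assuming torchvision's
# pretrained ResNet-152; the paper also uses ResNeXt-101 (typically a 3D
# variant for motion features), which is omitted here for brevity.
import torch
import torchvision.models as models

# Load ResNet-152 and drop its classifier so it emits 2048-d pooled features.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

frames = torch.randn(20, 3, 224, 224)  # 20 sampled frames (toy input)
with torch.no_grad():
    feats = resnet(frames)             # (20, 2048) per-frame features
print(feats.shape)
```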
Main Authors: | Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc., 2022 |
Subjects: | Artificial Intelligence |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044334/ https://www.ncbi.nlm.nih.gov/pubmed/35494808 http://dx.doi.org/10.7717/peerj-cs.916 |
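Continuing from the abstract's pipeline (steps 2 and 3), the following sketch shows how a ViT-style Transformer encoder over precomputed frame features can feed an LSTM decoder. It is a minimal sketch under assumptions: PyTorch's `nn.TransformerEncoder` stands in for the ViT encoding block, the feature dimension assumes concatenated ResNet-152 and ResNeXt-101 vectors, and all layer sizes and the vocabulary are illustrative, not the paper's configuration.

```python
# Hedged sketch of steps 2-3: ViT-style encoder + LSTM decoder.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, feat_dim=4096, d_model=512, n_heads=8,
                 n_layers=4, hidden=512, vocab_size=10000):
        super().__init__()
        # Project concatenated ResNet-152 + ResNeXt-101 features to d_model.
        self.proj = nn.Linear(feat_dim, d_model)
        # Transformer encoder blocks stand in for the ViT encoding block.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        # LSTM decoder sees the token embedding plus a global video context.
        self.lstm = nn.LSTM(d_model * 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, tokens):
        # feats: (B, T, feat_dim) precomputed frame features
        # tokens: (B, L) caption tokens for teacher forcing
        memory = self.encoder(self.proj(feats))         # (B, T, d_model)
        ctx = memory.mean(dim=1, keepdim=True)          # global encoding
        emb = self.embed(tokens)                        # (B, L, d_model)
        ctx = ctx.expand(-1, emb.size(1), -1)           # repeat per step
        h, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
        return self.out(h)                              # (B, L, vocab_size)

# Smoke test with random "video features": 2 videos, 20 frames, 12 tokens.
model = CaptionModel()
logits = model(torch.randn(2, 20, 4096), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

Mean-pooling the encoder output gives the decoder a global video context at every step, which is one simple way to realize the "global encoding" the abstract emphasizes.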
_version_ | 1784695083559288832 |
---|---|
author | Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu |
author_facet | Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu |
author_sort | Zhao, Hong |
collection | PubMed |
description | Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by reinforcement-learning fine-tuning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves the BLEU-4, METEOR, ROUGE-L, and CIDEr-D scores by 2.9%, 1.4%, 0.9%, and 4.8%, respectively. |
format | Online Article Text |
id | pubmed-9044334 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9044334 2022-04-28 Video captioning based on vision transformer and reinforcement learning Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu PeerJ Comput Sci Artificial Intelligence Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by reinforcement-learning fine-tuning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves the BLEU-4, METEOR, ROUGE-L, and CIDEr-D scores by 2.9%, 1.4%, 0.9%, and 4.8%, respectively. PeerJ Inc. 2022-03-16 /pmc/articles/PMC9044334/ /pubmed/35494808 http://dx.doi.org/10.7717/peerj-cs.916 Text en © 2022 Zhao et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Artificial Intelligence Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu Video captioning based on vision transformer and reinforcement learning |
title | Video captioning based on vision transformer and reinforcement learning |
title_full | Video captioning based on vision transformer and reinforcement learning |
title_fullStr | Video captioning based on vision transformer and reinforcement learning |
title_full_unstemmed | Video captioning based on vision transformer and reinforcement learning |
title_short | Video captioning based on vision transformer and reinforcement learning |
title_sort | video captioning based on vision transformer and reinforcement learning |
topic | Artificial Intelligence |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044334/ https://www.ncbi.nlm.nih.gov/pubmed/35494808 http://dx.doi.org/10.7717/peerj-cs.916 |
work_keys_str_mv | AT zhaohong videocaptioningbasedonvisiontransformerandreinforcementlearning AT chenzhiwen videocaptioningbasedonvisiontransformerandreinforcementlearning AT guolan videocaptioningbasedonvisiontransformerandreinforcementlearning AT hanzeyu videocaptioningbasedonvisiontransformerandreinforcementlearning |
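The abstract's final step says caption accuracy is "further improved by reinforcement-learning fine-tuning," but this record gives no details of the reward or training scheme. A common recipe for RL fine-tuning in captioning is self-critical sequence training (SCST) with a CIDEr-D reward, sketched below; `sample_captions`, `greedy_captions`, and `cider_reward` are hypothetical helpers standing in for sampling-based decoding, greedy decoding, and a CIDEr-D scorer, and the whole routine is an assumption about the approach, not the paper's published code.

```python
# Hedged sketch of self-critical sequence training (SCST), one standard
# reinforcement-learning fine-tuning recipe for captioning models.
import torch

def scst_step(model, feats, refs, optimizer,
              sample_captions, greedy_captions, cider_reward):
    """One REINFORCE update using the model's own greedy output as baseline."""
    model.train()
    # Sample a caption per video, keeping per-token log-probabilities.
    sampled, logprobs = sample_captions(model, feats)   # (B, L), (B, L)
    # Greedy decoding provides the reward baseline; no gradient needed.
    with torch.no_grad():
        baseline = greedy_captions(model, feats)        # (B, L)
    # Score both caption sets against the reference captions.
    r_sample = cider_reward(sampled, refs)              # (B,)
    r_base = cider_reward(baseline, refs)               # (B,)
    # Positive advantage reinforces sampled captions that beat the
    # model's own greedy output; negative advantage suppresses them.
    advantage = (r_sample - r_base).unsqueeze(1)        # (B, 1)
    loss = -(advantage * logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Optimizing CIDEr-D directly in this way is a plausible explanation for the 4.8% CIDEr-D gain being the largest of the four reported improvements, though the record itself does not say which reward the authors used.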