Video captioning based on vision transformer and reinforcement learning
Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by reinforcement-learning fine-tuning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves the BLEU-4, METEOR, ROUGE-L, and CIDEr-D scores by 2.9%, 1.4%, 0.9%, and 4.8%, respectively.
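The abstract names ResNet-152 and ResNeXt-101 as the feature extractors. As a rough illustration of that first step, the sketch below pulls per-frame appearance features with torchvision's pretrained ResNet-152 by swapping its classifier head for an identity; the frame count, input size, and the omission of the ResNeXt-101 motion stream are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of step 1 (feature extraction), assuming torchvision's
# pretrained ResNet-152; the paper also uses ResNeXt-101 (typically a 3D
# variant for motion features), which is omitted here for brevity.
import torch
import torchvision.models as models

# Load ResNet-152 and drop its classifier so it emits 2048-d pooled features.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

frames = torch.randn(20, 3, 224, 224)  # 20 sampled frames (toy input)
with torch.no_grad():
    feats = resnet(frames)             # (20, 2048) per-frame features
print(feats.shape)
```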
Main Authors: | Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc., 2022 |
Subjects: | Artificial Intelligence |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044334/ https://www.ncbi.nlm.nih.gov/pubmed/35494808 http://dx.doi.org/10.7717/peerj-cs.916 |
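Continuing from the abstract's pipeline (steps 2 and 3), the following sketch shows how a ViT-style Transformer encoder over precomputed frame features can feed an LSTM decoder. It is a minimal sketch under assumptions: PyTorch's `nn.TransformerEncoder` stands in for the ViT encoding block, the feature dimension assumes concatenated ResNet-152 and ResNeXt-101 vectors, and all layer sizes and the vocabulary are illustrative, not the paper's configuration.

```python
# Hedged sketch of steps 2-3: ViT-style encoder + LSTM decoder.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, feat_dim=4096, d_model=512, n_heads=8,
                 n_layers=4, hidden=512, vocab_size=10000):
        super().__init__()
        # Project concatenated ResNet-152 + ResNeXt-101 features to d_model.
        self.proj = nn.Linear(feat_dim, d_model)
        # Transformer encoder blocks stand in for the ViT encoding block.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        # LSTM decoder sees the token embedding plus a global video context.
        self.lstm = nn.LSTM(d_model * 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, tokens):
        # feats: (B, T, feat_dim) precomputed frame features
        # tokens: (B, L) caption tokens for teacher forcing
        memory = self.encoder(self.proj(feats))         # (B, T, d_model)
        ctx = memory.mean(dim=1, keepdim=True)          # global encoding
        emb = self.embed(tokens)                        # (B, L, d_model)
        ctx = ctx.expand(-1, emb.size(1), -1)           # repeat per step
        h, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
        return self.out(h)                              # (B, L, vocab_size)

# Smoke test with random "video features": 2 videos, 20 frames, 12 tokens.
model = CaptionModel()
logits = model(torch.randn(2, 20, 4096), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

Mean-pooling the encoder output gives the decoder a global video context at every step, which is one simple way to realize the "global encoding" the abstract emphasizes.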
_version_ | 1784695083559288832 |
---|---|
author | Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu |
author_facet | Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu |
author_sort | Zhao, Hong |
collection | PubMed |
description | Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by reinforcement-learning fine-tuning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves the BLEU-4, METEOR, ROUGE-L, and CIDEr-D scores by 2.9%, 1.4%, 0.9%, and 4.8%, respectively. |
format | Online Article Text |
id | pubmed-9044334 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9044334 2022-04-28 Video captioning based on vision transformer and reinforcement learning Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu PeerJ Comput Sci Artificial Intelligence Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by reinforcement-learning fine-tuning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves the BLEU-4, METEOR, ROUGE-L, and CIDEr-D scores by 2.9%, 1.4%, 0.9%, and 4.8%, respectively. PeerJ Inc. 2022-03-16 /pmc/articles/PMC9044334/ /pubmed/35494808 http://dx.doi.org/10.7717/peerj-cs.916 Text en © 2022 Zhao et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Artificial Intelligence Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu Video captioning based on vision transformer and reinforcement learning |
title | Video captioning based on vision transformer and reinforcement learning |
title_full | Video captioning based on vision transformer and reinforcement learning |
title_fullStr | Video captioning based on vision transformer and reinforcement learning |
title_full_unstemmed | Video captioning based on vision transformer and reinforcement learning |
title_short | Video captioning based on vision transformer and reinforcement learning |
title_sort | video captioning based on vision transformer and reinforcement learning |
topic | Artificial Intelligence |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044334/ https://www.ncbi.nlm.nih.gov/pubmed/35494808 http://dx.doi.org/10.7717/peerj-cs.916 |
work_keys_str_mv | AT zhaohong videocaptioningbasedonvisiontransformerandreinforcementlearning AT chenzhiwen videocaptioningbasedonvisiontransformerandreinforcementlearning AT guolan videocaptioningbasedonvisiontransformerandreinforcementlearning AT hanzeyu videocaptioningbasedonvisiontransformerandreinforcementlearning |
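The abstract's final step says caption accuracy is "further improved by reinforcement-learning fine-tuning," but this record gives no details of the reward or training scheme. A common recipe for RL fine-tuning in captioning is self-critical sequence training (SCST) with a CIDEr-D reward, sketched below; `sample_captions`, `greedy_captions`, and `cider_reward` are hypothetical helpers standing in for sampling-based decoding, greedy decoding, and a CIDEr-D scorer, and the whole routine is an assumption about the approach, not the paper's published code.

```python
# Hedged sketch of self-critical sequence training (SCST), one standard
# reinforcement-learning fine-tuning recipe for captioning models.
import torch

def scst_step(model, feats, refs, optimizer,
              sample_captions, greedy_captions, cider_reward):
    """One REINFORCE update using the model's own greedy output as baseline."""
    model.train()
    # Sample a caption per video, keeping per-token log-probabilities.
    sampled, logprobs = sample_captions(model, feats)   # (B, L), (B, L)
    # Greedy decoding provides the reward baseline; no gradient needed.
    with torch.no_grad():
        baseline = greedy_captions(model, feats)        # (B, L)
    # Score both caption sets against the reference captions.
    r_sample = cider_reward(sampled, refs)              # (B,)
    r_base = cider_reward(baseline, refs)               # (B,)
    # Positive advantage reinforces sampled captions that beat the
    # model's own greedy output; negative advantage suppresses them.
    advantage = (r_sample - r_base).unsqueeze(1)        # (B, 1)
    loss = -(advantage * logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Optimizing CIDEr-D directly in this way is a plausible explanation for the 4.8% CIDEr-D gain being the largest of the four reported improvements, though the record itself does not say which reward the authors used.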