Video captioning based on vision transformer and reinforcement learning

Global encoding of visual features is important for improving description accuracy in video captioning. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from vid...

Bibliographic Details
Main authors: Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044334/
https://www.ncbi.nlm.nih.gov/pubmed/35494808
http://dx.doi.org/10.7717/peerj-cs.916