Video captioning based on vision transformer and reinforcement learning

Global encoding of visual features in video captioning is important for improving description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from vid...

Bibliographic Details
Main Authors:	Zhao, Hong; Chen, Zhiwen; Guo, Lan; Han, Zeyu
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2022
Subjects:	Artificial Intelligence
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044334/ https://www.ncbi.nlm.nih.gov/pubmed/35494808 http://dx.doi.org/10.7717/peerj-cs.916
