
Research on Video Captioning Based on Multifeature Fusion

To address the problems that existing video captioning models attend to incomplete information and generate insufficiently accurate descriptions, a video captioning model that fuses image, audio, and motion (optical-flow) features is proposed. Models pretrained on a variety of large-scale datasets are used to extract video frame features, motion information, audio features, and video sequence features. An embedding layer based on the self-attention mechanism is designed to embed each single-modality feature and learn its parameters. Two schemes, joint representation and cooperative representation, are then used to fuse the multimodal feature vectors output by the embedding layer, so that the model can attend to the different objects in a video and their interactions, which effectively improves captioning performance. Experiments are carried out on the large-scale MSR-VTT and LSMDC datasets. On the MSR-VTT benchmark, the model obtains scores of 0.443, 0.327, 0.619, and 0.521 under the BLEU-4, METEOR, ROUGE-L, and CIDEr metrics, respectively. The results show that the proposed method effectively improves video captioning performance, with all evaluation indexes improving over the comparison models.

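The pipeline the abstract describes (per-modality embedding with self-attention, followed by multimodal fusion) can be sketched as follows. This is a minimal PyTorch illustration under assumed dimensions, not the authors' implementation: the input widths, the mean pooling, and the concatenation-based joint fusion are placeholders, and the paper's cooperative-representation scheme is not shown.

```python
# Minimal sketch (not the authors' code): one self-attention embedding
# block per modality, then joint-representation fusion by concatenation.
# All feature dimensions and layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityEmbedding(nn.Module):
    """Embeds one modality's feature sequence with self-attention."""
    def __init__(self, in_dim: int, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)   # map raw features to a shared width
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj(x)                         # (batch, seq, d_model)
        a, _ = self.attn(h, h, h)                # self-attention over the sequence
        return self.norm(h + a)                  # residual + layer norm

class JointFusion(nn.Module):
    """Joint representation: concatenate pooled modality embeddings."""
    def __init__(self, n_modalities: int, d_model: int = 512):
        super().__init__()
        self.fuse = nn.Linear(n_modalities * d_model, d_model)

    def forward(self, embeddings: list[torch.Tensor]) -> torch.Tensor:
        pooled = [e.mean(dim=1) for e in embeddings]  # mean-pool each (batch, seq, d) sequence
        return self.fuse(torch.cat(pooled, dim=-1))   # (batch, d_model) fused vector

# Example with made-up input widths for three of the modalities.
image = ModalityEmbedding(in_dim=2048)   # e.g. CNN frame features
flow  = ModalityEmbedding(in_dim=1024)   # e.g. optical-flow / motion features
audio = ModalityEmbedding(in_dim=128)    # e.g. audio features

feats = [image(torch.randn(2, 20, 2048)),
         flow(torch.randn(2, 20, 1024)),
         audio(torch.randn(2, 20, 128))]
fused = JointFusion(n_modalities=3)(feats)
print(fused.shape)                       # torch.Size([2, 512])
```

In the paper's setting, a fused vector like this would condition the sequence decoder that generates the caption text.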

Bibliographic Details
Main Authors: Zhao, Hong; Guo, Lan; Chen, ZhiWen; Zheng, HouZe
Format: Online Article (Text)
Language: English
Published: Hindawi, 2022
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9071958/
https://www.ncbi.nlm.nih.gov/pubmed/35528356
http://dx.doi.org/10.1155/2022/1204909
Collection: PubMed
Record ID: pubmed-9071958
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Journal: Comput Intell Neurosci (Research Article)
Published Online: 2022-04-28
License: Copyright © 2022 Hong Zhao et al. This is an open access article distributed under the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.