Fusion of Multi-Modal Features to Enhance Dense Video Caption

Dense video caption is a task that aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most of the existing methods only use visual features in the video and ignore the audio features that are also essential for understanding...

Bibliographic Details
Main Authors: Huang, Xuefei; Chan, Ka-Hou; Wu, Weifan; Sheng, Hao; Ke, Wei
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10304565/
https://www.ncbi.nlm.nih.gov/pubmed/37420732
http://dx.doi.org/10.3390/s23125565
author Huang, Xuefei
Chan, Ka-Hou
Wu, Weifan
Sheng, Hao
Ke, Wei
collection PubMed
description Dense video caption is a task that aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most of the existing methods only use visual features in the video and ignore the audio features that are also essential for understanding the video. In this paper, we propose a fusion model that combines the Transformer framework to integrate both visual and audio features in the video for captioning. We use multi-head attention to deal with the variations in sequence lengths between the models involved in our approach. We also introduce a Common Pool to store the generated features and align them with the time steps, thus filtering the information and eliminating redundancy based on the confidence scores. Moreover, we use LSTM as a decoder to generate the description sentences, which reduces the memory size of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
format Online
Article
Text
id pubmed-10304565
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10304565 2023-06-29 Fusion of Multi-Modal Features to Enhance Dense Video Caption Huang, Xuefei Chan, Ka-Hou Wu, Weifan Sheng, Hao Ke, Wei Sensors (Basel) Article Dense video caption is a task that aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most of the existing methods only use visual features in the video and ignore the audio features that are also essential for understanding the video. In this paper, we propose a fusion model that combines the Transformer framework to integrate both visual and audio features in the video for captioning. We use multi-head attention to deal with the variations in sequence lengths between the models involved in our approach. We also introduce a Common Pool to store the generated features and align them with the time steps, thus filtering the information and eliminating redundancy based on the confidence scores. Moreover, we use LSTM as a decoder to generate the description sentences, which reduces the memory size of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset. MDPI 2023-06-14 /pmc/articles/PMC10304565/ /pubmed/37420732 http://dx.doi.org/10.3390/s23125565 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Fusion of Multi-Modal Features to Enhance Dense Video Caption
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10304565/
https://www.ncbi.nlm.nih.gov/pubmed/37420732
http://dx.doi.org/10.3390/s23125565