
Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Dense video captioning (DVC) aims at generating a description for each scene in a video. Despite encouraging progress on this task, previous works usually concentrate only on exploiting visual features while neglecting audio information in the video, resulting in inaccurate scene event localization. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode data in different modalities. An event refactoring algorithm is proposed to deal with inaccurate event localization caused by overlapping events. Besides, a shared encoder is utilized to reduce model redundancy. CR optimizes the logic of generated captions with both heterogeneous prior knowledge and entity association reasoning, achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments are conducted on the ActivityNet Captions dataset, and the results demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also conduct ablation experiments to analyze the contributions of the different modules.
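
The abstract describes a cross-modal attention mechanism that encodes visual and audio features jointly. The sketch below is only an illustration of what such a mechanism could look like in PyTorch, assuming it resembles standard query/key/value attention applied across modalities; the module names, dimensions, and fusion strategy are illustrative guesses and are not taken from the article.

```python
# Minimal sketch of cross-modal attention between visual and audio streams.
# All names and dimensions here are hypothetical; the article's CM module may differ.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Each modality queries the other one.
        self.vis_to_aud = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.aud_to_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_a = nn.LayerNorm(d_model)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor):
        # visual: (batch, T_v, d_model), audio: (batch, T_a, d_model)
        v_ctx, _ = self.vis_to_aud(query=visual, key=audio, value=audio)
        a_ctx, _ = self.aud_to_vis(query=audio, key=visual, value=visual)
        # Residual connections preserve each modality's original features.
        return self.norm_v(visual + v_ctx), self.norm_a(audio + a_ctx)

# Usage: fuse per-segment visual and audio features before event localization.
vis = torch.randn(2, 100, 512)  # e.g. 100 visual segments
aud = torch.randn(2, 50, 512)   # e.g. 50 audio segments
fused_v, fused_a = CrossModalAttention()(vis, aud)
print(fused_v.shape, fused_a.shape)  # (2, 100, 512) and (2, 50, 512)
```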


Bibliographic Details
Main Authors: Han, Shixing, Liu, Jin, Zhang, Jinyingming, Gong, Peizhu, Zhang, Xiliang, He, Huihua
Format: Online Article Text
Language: English
Published: Springer International Publishing 2023
Subjects: Original Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9950023/
https://www.ncbi.nlm.nih.gov/pubmed/36855683
http://dx.doi.org/10.1007/s40747-023-00998-5
author Han, Shixing
Liu, Jin
Zhang, Jinyingming
Gong, Peizhu
Zhang, Xiliang
He, Huihua
collection PubMed
description Dense video captioning (DVC) aims at generating a description for each scene in a video. Despite encouraging progress on this task, previous works usually concentrate only on exploiting visual features while neglecting audio information in the video, resulting in inaccurate scene event localization. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode data in different modalities. An event refactoring algorithm is proposed to deal with inaccurate event localization caused by overlapping events. Besides, a shared encoder is utilized to reduce model redundancy. CR optimizes the logic of generated captions with both heterogeneous prior knowledge and entity association reasoning, achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments are conducted on the ActivityNet Captions dataset, and the results demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also conduct ablation experiments to analyze the contributions of the different modules.
format Online
Article
Text
id pubmed-9950023
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-9950023 2023-02-24 Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph. Han, Shixing; Liu, Jin; Zhang, Jinyingming; Gong, Peizhu; Zhang, Xiliang; He, Huihua. Complex Intell Systems, Original Article. Springer International Publishing, 2023-02-24. /pmc/articles/PMC9950023/ /pubmed/36855683 http://dx.doi.org/10.1007/s40747-023-00998-5. © The Author(s) 2023. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as appropriate credit is given to the original author(s) and the source, a link to the Creative Commons licence is provided, and any changes are indicated. Images or other third-party material in this article are included in the article's Creative Commons licence unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and the intended use is not permitted by statutory regulation or exceeds the permitted use, permission must be obtained directly from the copyright holder.
title Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9950023/
https://www.ncbi.nlm.nih.gov/pubmed/36855683
http://dx.doi.org/10.1007/s40747-023-00998-5