Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph
Main Authors: | Han, Shixing; Liu, Jin; Zhang, Jinyingming; Gong, Peizhu; Zhang, Xiliang; He, Huihua |
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing, 2023 |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9950023/ https://www.ncbi.nlm.nih.gov/pubmed/36855683 http://dx.doi.org/10.1007/s40747-023-00998-5 |
_version_ | 1784893073505910784 |
author | Han, Shixing Liu, Jin Zhang, Jinyingming Gong, Peizhu Zhang, Xiliang He, Huihua |
author_facet | Han, Shixing Liu, Jin Zhang, Jinyingming Gong, Peizhu Zhang, Xiliang He, Huihua |
author_sort | Han, Shixing |
collection | PubMed |
description | Dense video captioning (DVC) aims at generating a description for each scene in a video. Despite attractive progress on this task, previous works usually concentrate only on exploiting visual features while neglecting audio information in the video, resulting in inaccurate scene event localization. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode data in different modalities. An event refactoring algorithm is proposed to deal with inaccurate event localization caused by overlapping events. In addition, a shared encoder is utilized to reduce model redundancy. CR optimizes the logic of generated captions with both heterogeneous prior knowledge and entities’ association reasoning, achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments are conducted on the ActivityNet Captions dataset, and the results demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also apply ablation experiments to analyze the contributions of different modules. |
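For context, the abstract above describes a cross-modal attention mechanism that lets features from one modality (e.g., visual) attend to another (e.g., audio). The snippet below is a minimal sketch of that general technique in PyTorch; it is not the authors' CMCR implementation, and the class name, dimensions, and modality shapes are hypothetical illustrations only.

```python
# Minimal sketch of cross-modal attention, assuming PyTorch (not the CMCR code).
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One modality (queries) attends over another modality (keys/values)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (batch, T_q, dim) features of the modality being refined
        # context_feats: (batch, T_c, dim) features of the other modality
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + attended)  # residual connection + layer norm


if __name__ == "__main__":
    visual = torch.randn(2, 100, 512)  # hypothetical per-frame visual features
    audio = torch.randn(2, 50, 512)    # hypothetical audio segment features
    fused = CrossModalAttention()(visual, audio)
    print(fused.shape)  # torch.Size([2, 100, 512])
```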
format | Online Article Text |
id | pubmed-9950023 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-9950023 2023-02-24 Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph Han, Shixing Liu, Jin Zhang, Jinyingming Gong, Peizhu Zhang, Xiliang He, Huihua Complex Intell Systems Original Article Dense video captioning (DVC) aims at generating a description for each scene in a video. Despite attractive progress on this task, previous works usually concentrate only on exploiting visual features while neglecting audio information in the video, resulting in inaccurate scene event localization. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode data in different modalities. An event refactoring algorithm is proposed to deal with inaccurate event localization caused by overlapping events. In addition, a shared encoder is utilized to reduce model redundancy. CR optimizes the logic of generated captions with both heterogeneous prior knowledge and entities’ association reasoning, achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments are conducted on the ActivityNet Captions dataset, and the results demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also apply ablation experiments to analyze the contributions of different modules. Springer International Publishing 2023-02-24 /pmc/articles/PMC9950023/ /pubmed/36855683 http://dx.doi.org/10.1007/s40747-023-00998-5 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Original Article Han, Shixing Liu, Jin Zhang, Jinyingming Gong, Peizhu Zhang, Xiliang He, Huihua Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph |
title | Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph |
title_full | Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph |
title_fullStr | Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph |
title_full_unstemmed | Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph |
title_short | Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph |
title_sort | lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9950023/ https://www.ncbi.nlm.nih.gov/pubmed/36855683 http://dx.doi.org/10.1007/s40747-023-00998-5 |
work_keys_str_mv | AT hanshixing lightweightdensevideocaptioningwithcrossmodalattentionandknowledgeenhancedunbiasedscenegraph AT liujin lightweightdensevideocaptioningwithcrossmodalattentionandknowledgeenhancedunbiasedscenegraph AT zhangjinyingming lightweightdensevideocaptioningwithcrossmodalattentionandknowledgeenhancedunbiasedscenegraph AT gongpeizhu lightweightdensevideocaptioningwithcrossmodalattentionandknowledgeenhancedunbiasedscenegraph AT zhangxiliang lightweightdensevideocaptioningwithcrossmodalattentionandknowledgeenhancedunbiasedscenegraph AT hehuihua lightweightdensevideocaptioningwithcrossmodalattentionandknowledgeenhancedunbiasedscenegraph |