
Video captioning with stacked attention and semantic hard pull

Bibliographic Details
Main Authors: Rahman, Md. Mushfiqur, Abedin, Thasin, Prottoy, Khondokar S.S., Moshruba, Ayana, Siddiqui, Fazlul Hasan
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects: Human–Computer Interaction
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8356660/
https://www.ncbi.nlm.nih.gov/pubmed/34435104
http://dx.doi.org/10.7717/peerj-cs.664
_version_ 1783736988773384192
author Rahman, Md. Mushfiqur
Abedin, Thasin
Prottoy, Khondokar S.S.
Moshruba, Ayana
Siddiqui, Fazlul Hasan
author_facet Rahman, Md. Mushfiqur
Abedin, Thasin
Prottoy, Khondokar S.S.
Moshruba, Ayana
Siddiqui, Fazlul Hasan
author_sort Rahman, Md. Mushfiqur
collection PubMed
description Video captioning, i.e., the task of generating captions from video sequences, creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. The task of generating a semantically accurate description of a video is quite complex. Considering the complexity of the problem, the results obtained in recent research works are praiseworthy. However, there is plenty of scope for further investigation. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers—one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism by using two novel approaches—“stacked attention” and “semantic hard pull”. As there are no exclusive metrics for evaluating video captioning models, we emphasize both quantitative and qualitative analysis of our model. Hence, we have used the BLEU scoring metric for quantitative analysis and have proposed a human evaluation metric for qualitative analysis, namely the Semantic Sensibility (SS) scoring metric. The SS score overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures.
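The abstract describes a layout shared by most video captioning models: a video-to-context encoder, attention over its outputs, and a context-to-caption decoder, with SSVC's contribution lying in the stacked attention and semantic hard pull used for context generation. The sketch below is a minimal PyTorch illustration of that generic layout only, under stated assumptions: it is not the authors' SSVC code, the class names, layer sizes, and the particular way the attention passes are stacked are invented for illustration, and the semantic hard pull step is omitted because the abstract does not define it.

# Minimal sketch (PyTorch), assuming pre-extracted per-frame CNN features.
# Hypothetical names; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttention(nn.Module):
    """Additive attention over encoder states applied several times in sequence,
    refining the query after each pass (one common reading of 'stacked attention')."""
    def __init__(self, dim: int, num_passes: int = 2):
        super().__init__()
        self.score_layers = nn.ModuleList(
            [nn.Linear(2 * dim, 1) for _ in range(num_passes)]
        )

    def forward(self, query, enc_states):
        # query: (batch, dim); enc_states: (batch, time, dim)
        for score in self.score_layers:
            tiled = query.unsqueeze(1).expand_as(enc_states)         # (B, T, D)
            logits = score(torch.cat([tiled, enc_states], dim=-1))   # (B, T, 1)
            weights = F.softmax(logits, dim=1)                       # attention over frames
            context = (weights * enc_states).sum(dim=1)              # (B, D)
            query = query + context                                  # refined query for the next pass
        return query

class CaptioningSketch(nn.Module):
    def __init__(self, feat_dim: int, hidden: int, vocab: int):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)   # video-to-context layer
        self.attend = StackedAttention(hidden)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.GRUCell(2 * hidden, hidden)                # context-to-caption layer
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim); captions: (B, T_words) token ids for teacher forcing
        enc_states, h = self.encoder(frame_feats)
        h = h.squeeze(0)
        logits = []
        for t in range(captions.size(1)):
            ctx = self.attend(h, enc_states)                         # attended video context
            step = torch.cat([self.embed(captions[:, t]), ctx], dim=-1)
            h = self.decoder(step, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                            # (B, T_words, vocab)

# Hypothetical usage with 2048-d frame features and a 10,000-word vocabulary:
# model = CaptioningSketch(feat_dim=2048, hidden=512, vocab=10000)

On the evaluation side, the BLEU scores used for quantitative analysis can be computed with an off-the-shelf implementation such as nltk.translate.bleu_score, whereas the proposed Semantic Sensibility (SS) score is a human evaluation protocol defined in the paper itself.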
format Online
Article
Text
id pubmed-8356660
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-8356660 2021-08-24 Video captioning with stacked attention and semantic hard pull Rahman, Md. Mushfiqur Abedin, Thasin Prottoy, Khondokar S.S. Moshruba, Ayana Siddiqui, Fazlul Hasan PeerJ Comput Sci Human–Computer Interaction Video captioning, i.e., the task of generating captions from video sequences, creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. The task of generating a semantically accurate description of a video is quite complex. Considering the complexity of the problem, the results obtained in recent research works are praiseworthy. However, there is plenty of scope for further investigation. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers—one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism by using two novel approaches—“stacked attention” and “semantic hard pull”. As there are no exclusive metrics for evaluating video captioning models, we emphasize both quantitative and qualitative analysis of our model. Hence, we have used the BLEU scoring metric for quantitative analysis and have proposed a human evaluation metric for qualitative analysis, namely the Semantic Sensibility (SS) scoring metric. The SS score overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures. PeerJ Inc. 2021-08-05 /pmc/articles/PMC8356660/ /pubmed/34435104 http://dx.doi.org/10.7717/peerj-cs.664 Text en ©2021 Rahman et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
spellingShingle Human–Computer Interaction
Rahman, Md. Mushfiqur
Abedin, Thasin
Prottoy, Khondokar S.S.
Moshruba, Ayana
Siddiqui, Fazlul Hasan
Video captioning with stacked attention and semantic hard pull
title Video captioning with stacked attention and semantic hard pull
title_full Video captioning with stacked attention and semantic hard pull
title_fullStr Video captioning with stacked attention and semantic hard pull
title_full_unstemmed Video captioning with stacked attention and semantic hard pull
title_short Video captioning with stacked attention and semantic hard pull
title_sort video captioning with stacked attention and semantic hard pull
topic Human–Computer Interaction
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8356660/
https://www.ncbi.nlm.nih.gov/pubmed/34435104
http://dx.doi.org/10.7717/peerj-cs.664
work_keys_str_mv AT rahmanmdmushfiqur videocaptioningwithstackedattentionandsemantichardpull
AT abedinthasin videocaptioningwithstackedattentionandsemantichardpull
AT prottoykhondokarss videocaptioningwithstackedattentionandsemantichardpull
AT moshrubaayana videocaptioningwithstackedattentionandsemantichardpull
AT siddiquifazlulhasan videocaptioningwithstackedattentionandsemantichardpull