
Visual Relationship Detection with Multimodal Fusion and Reasoning


Bibliographic Details
Main Authors: Xiao, Shouguan; Fu, Weiping
Format: Online Article Text
Language: English
Published: MDPI 2022
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9611296/
https://www.ncbi.nlm.nih.gov/pubmed/36298269
http://dx.doi.org/10.3390/s22207918
Description
Summary: Visual relationship detection aims to fully understand visual scenes and has recently received increasing attention. However, current methods use only the visual features of images to train the semantic network, which does not match how humans perceive scenes: we recognize obvious features and infer covert states using common sense. Consequently, these methods cannot predict certain hidden relationships between object pairs in complex scenes. To address this problem, we propose unifying vision–language fusion and knowledge graph reasoning, combining visual feature embeddings with external common-sense knowledge to determine the visual relationships between objects. In addition, before training the relationship detection network, we devise an object–pair proposal module to solve the combinatorial explosion problem. Extensive experiments show that our proposed method outperforms state-of-the-art methods on the Visual Genome and Visual Relationship Detection datasets.
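To make the summarized architecture concrete, the sketch below shows one plausible way to fuse a visual object-pair embedding with external knowledge-graph embeddings of the subject and object classes before scoring predicates, plus a trivial top-k pair filter standing in for the object–pair proposal module. This is a minimal illustration under assumed dimensions and layer choices, not the authors' actual model; all names (VisionKnowledgeFusion, top_k_pairs) and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class VisionKnowledgeFusion(nn.Module):
    """Illustrative fusion head: projects a visual object-pair feature and
    the concatenated knowledge-graph embeddings of the subject and object
    classes into a shared space, then scores a fixed predicate vocabulary.
    Dimensions are assumptions for the sake of the example."""

    def __init__(self, vis_dim=1024, kg_dim=300, hidden=512, num_predicates=50):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.kg_proj = nn.Linear(2 * kg_dim, hidden)  # subject + object KG embeddings
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(hidden, num_predicates))

    def forward(self, pair_visual_feat, subj_kg_emb, obj_kg_emb):
        # Fuse the two modalities by additive combination in the shared space.
        v = self.vis_proj(pair_visual_feat)
        k = self.kg_proj(torch.cat([subj_kg_emb, obj_kg_emb], dim=-1))
        return self.classifier(v + k)  # predicate logits per candidate pair

def top_k_pairs(pair_scores, k=64):
    """Keep only the k highest-scoring candidate pairs: a simple stand-in
    for a learned object-pair proposal module that avoids classifying all
    N^2 detector-output combinations."""
    return torch.topk(pair_scores, min(k, pair_scores.numel())).indices

# Toy usage: 8 candidate object pairs with random features.
model = VisionKnowledgeFusion()
logits = model(torch.randn(8, 1024), torch.randn(8, 300), torch.randn(8, 300))
print(logits.shape)                      # torch.Size([8, 50])
print(top_k_pairs(torch.randn(8), k=4))  # indices of the 4 best pairs
```

In a full pipeline, the pair filter would run on detector outputs before relationship classification, and the knowledge-graph embeddings would come from a pretrained external source of common-sense relations rather than random tensors.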