
What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations

Bibliographic Details
Main Authors: Ilinykh, Nikolai, Dobnik, Simon
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8679841/
https://www.ncbi.nlm.nih.gov/pubmed/34927063
http://dx.doi.org/10.3389/frai.2021.767971
_version_ 1784616614439682048
author Ilinykh, Nikolai
Dobnik, Simon
author_facet Ilinykh, Nikolai
Dobnik, Simon
author_sort Ilinykh, Nikolai
collection PubMed
description Neural networks have proven to be very successful in automatically capturing the composition of language and different structures across a range of multi-modal tasks. Thus, an important question to investigate is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for their respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers where linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their respective interplay, and the task’s effect on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream. Our results indicate that the information about different relations between objects in the visual stream is hierarchical and varies from a local to a global object-level understanding of the image. In particular, while visual representations in the first layers encode knowledge of relations between semantically similar object detections, often constituting neighbouring objects, deeper layers expand their attention across more distant objects and learn global relations between them. We also show that globally attended objects in deeper layers can be linked with entities described in image descriptions, indicating a critical finding: the indirect effect of language on visual representations. In addition, we highlight how object-based input representations affect the structure of learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question that we investigate is whether insights from cognitive science echo the structure of representations that the current neural architecture learns. The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve such processes as the pre-training of large-scale multi-modal architectures, multi-modal information fusion, and the probing of attention weights. In general, we contribute to explainable multi-modal natural language processing and to the currently shallow understanding of how the input representations and the structure of the multi-modal transformer affect visual representations.
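A rough, hypothetical sketch of the kind of per-layer attention analysis summarised in the description above (local vs. global attention over detected object regions) is given below in Python; the function names, array shapes and the attention-weighted distance measure are illustrative assumptions and are not taken from the paper.

import numpy as np

def attention_distance_per_layer(attentions, boxes):
    # attentions: list of (num_objects, num_objects) arrays, one per layer,
    #             with each row softmax-normalised (rows sum to 1).
    # boxes:      (num_objects, 4) array of [x1, y1, x2, y2] region coordinates.
    # Returns, per layer, the mean attention-weighted distance between object
    # centres: small values suggest mostly local attention, large values global.
    centres = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    dists = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=-1)
    return [float((att * dists).sum(axis=1).mean()) for att in attentions]

# Minimal usage example with synthetic data standing in for real object
# detections and attention weights extracted from the model's visual stream.
rng = np.random.default_rng(0)
n = 5                                     # pretend the detector returned 5 regions
boxes = rng.uniform(0, 100, size=(n, 4))
boxes[:, 2:] += boxes[:, :2]              # ensure x2 > x1 and y2 > y1
raw = rng.uniform(size=(3, n, n))         # three fake layers of attention scores
atts = [a / a.sum(axis=1, keepdims=True) for a in raw]
print(attention_distance_per_layer(atts, boxes))  # one distance per layer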
format Online
Article
Text
id pubmed-8679841
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8679841 2021-12-18 What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations Ilinykh, Nikolai Dobnik, Simon Front Artif Intell Artificial Intelligence Frontiers Media S.A. 2021-12-03 /pmc/articles/PMC8679841/ /pubmed/34927063 http://dx.doi.org/10.3389/frai.2021.767971 Text en Copyright © 2021 Ilinykh and Dobnik. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Artificial Intelligence
Ilinykh, Nikolai
Dobnik, Simon
What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations
title What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations
title_full What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations
title_fullStr What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations
title_full_unstemmed What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations
title_short What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations
title_sort what does a language-and-vision transformer see: the impact of semantic information on visual representations
topic Artificial Intelligence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8679841/
https://www.ncbi.nlm.nih.gov/pubmed/34927063
http://dx.doi.org/10.3389/frai.2021.767971
work_keys_str_mv AT ilinykhnikolai whatdoesalanguageandvisiontransformerseetheimpactofsemanticinformationonvisualrepresentations
AT dobniksimon whatdoesalanguageandvisiontransformerseetheimpactofsemanticinformationonvisualrepresentations