
Transformers bridge vision and language to estimate and understand scene meaning

Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic...


Bibliographic Details
Main Authors: Hayes, Taylor R., Henderson, John M.
Format: Online Article Text
Language: English
Published: American Journal Experts 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312955/
https://www.ncbi.nlm.nih.gov/pubmed/37398443
http://dx.doi.org/10.21203/rs.3.rs-2968381/v1
author Hayes, Taylor R.
Henderson, John M.
collection PubMed
description Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic representations play in scene guidance remains difficult to study and poorly understood. Here, we apply a state-of-the-art multimodal transformer trained on billions of image-text pairs to help advance our understanding of the role semantic representations play in scene understanding. We demonstrate across multiple studies that this transformer-based approach can be used to automatically estimate local scene meaning in indoor and outdoor scenes, predict where people look in these scenes, detect changes in local semantic content, and provide a human-interpretable account of why one scene region is more meaningful than another. Taken together, these findings highlight how multimodal transformers can advance our understanding of the role scene semantics play in scene understanding by serving as a representational framework that bridges vision and language.
format Online
Article
Text
id pubmed-10312955
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Journal Experts
record_format MEDLINE/PubMed
spelling pubmed-10312955 2023-07-01 Transformers bridge vision and language to estimate and understand scene meaning Hayes, Taylor R. Henderson, John M. Res Sq Article Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic representations play in scene guidance remains difficult to study and poorly understood. Here, we apply a state-of-the-art multimodal transformer trained on billions of image-text pairs to help advance our understanding of the role semantic representations play in scene understanding. We demonstrate across multiple studies that this transformer-based approach can be used to automatically estimate local scene meaning in indoor and outdoor scenes, predict where people look in these scenes, detect changes in local semantic content, and provide a human-interpretable account of why one scene region is more meaningful than another. Taken together, these findings highlight how multimodal transformers can advance our understanding of the role scene semantics play in scene understanding by serving as a representational framework that bridges vision and language. American Journal Experts 2023-05-29 /pmc/articles/PMC10312955/ /pubmed/37398443 http://dx.doi.org/10.21203/rs.3.rs-2968381/v1 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title Transformers bridge vision and language to estimate and understand scene meaning
topic Article