Transformers bridge vision and language to estimate and understand scene meaning
Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic representations play in scene guidance remains difficult to study and poorly understood…
Main Authors: | Hayes, Taylor R., Henderson, John M. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Journal Experts, 2023 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312955/ https://www.ncbi.nlm.nih.gov/pubmed/37398443 http://dx.doi.org/10.21203/rs.3.rs-2968381/v1 |
_version_ | 1785067018384310272 |
---|---|
author | Hayes, Taylor R.; Henderson, John M. |
author_facet | Hayes, Taylor R.; Henderson, John M. |
author_sort | Hayes, Taylor R. |
collection | PubMed |
description | Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic representations play in scene guidance remains difficult to study and poorly understood. Here, we apply a state-of-the-art multimodal transformer trained on billions of image-text pairs to help advance our understanding of the role semantic representations play in scene understanding. We demonstrate across multiple studies that this transformer-based approach can be used to automatically estimate local scene meaning in indoor and outdoor scenes, predict where people look in these scenes, detect changes in local semantic content, and provide a human-interpretable account of why one scene region is more meaningful than another. Taken together, these findings highlight how multimodal transformers can advance our understanding of the role scene semantics play in scene understanding by serving as a representational framework that bridges vision and language. |
format | Online Article Text |
id | pubmed-10312955 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | American Journal Experts |
record_format | MEDLINE/PubMed |
spelling | pubmed-103129552023-07-01 Transformers bridge vision and language to estimate and understand scene meaning Hayes, Taylor R. Henderson, John M. Res Sq Article Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic representations play in scene guidance remains difficult to study and poorly understood. Here, we apply a state-of-the-art multimodal transformer trained on billions of image-text pairs to help advance our understanding of the role semantic representations play in scene understanding. We demonstrate across multiple studies that this transformer-based approach can be used to automatically estimate local scene meaning in indoor and outdoor scenes, predict where people look in these scenes, detect changes in local semantic content, and provide a human-interpretable account of why one scene region is more meaningful than another. Taken together, these findings highlight how multimodal transformers can advance our understanding of the role scene semantics play in scene understanding by serving as a representational framework that bridges vision and language. American Journal Experts 2023-05-29 /pmc/articles/PMC10312955/ /pubmed/37398443 http://dx.doi.org/10.21203/rs.3.rs-2968381/v1 Text en https://creativecommons.org/licenses/by/4.0/This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/) , which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. |
spellingShingle | Article Hayes, Taylor R. Henderson, John M. Transformers bridge vision and language to estimate and understand scene meaning |
title | Transformers bridge vision and language to estimate and understand scene meaning |
title_full | Transformers bridge vision and language to estimate and understand scene meaning |
title_fullStr | Transformers bridge vision and language to estimate and understand scene meaning |
title_full_unstemmed | Transformers bridge vision and language to estimate and understand scene meaning |
title_short | Transformers bridge vision and language to estimate and understand scene meaning |
title_sort | transformers bridge vision and language to estimate and understand scene meaning |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312955/ https://www.ncbi.nlm.nih.gov/pubmed/37398443 http://dx.doi.org/10.21203/rs.3.rs-2968381/v1 |
work_keys_str_mv | AT hayestaylorr transformersbridgevisionandlanguagetoestimateandunderstandscenemeaning AT hendersonjohnm transformersbridgevisionandlanguagetoestimateandunderstandscenemeaning |