Transformers bridge vision and language to estimate and understand scene meaning
Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic representations play in scene guidance remains difficult to study and poorly understood…
Main Authors: | Hayes, Taylor R., Henderson, John M. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Journal Experts, 2023 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312955/ https://www.ncbi.nlm.nih.gov/pubmed/37398443 http://dx.doi.org/10.21203/rs.3.rs-2968381/v1 |
_version_ | 1785067018384310272 |
---|---|
author | Hayes, Taylor R.; Henderson, John M. |
author_facet | Hayes, Taylor R.; Henderson, John M. |
author_sort | Hayes, Taylor R. |
collection | PubMed |
description | Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic representations play in scene guidance remains difficult to study and poorly understood. Here, we apply a state-of-the-art multimodal transformer trained on billions of image-text pairs to help advance our understanding of the role semantic representations play in scene understanding. We demonstrate across multiple studies that this transformer-based approach can be used to automatically estimate local scene meaning in indoor and outdoor scenes, predict where people look in these scenes, detect changes in local semantic content, and provide a human-interpretable account of why one scene region is more meaningful than another. Taken together, these findings highlight how multimodal transformers can advance our understanding of the role scene semantics play in scene understanding by serving as a representational framework that bridges vision and language. |
format | Online Article Text |
id | pubmed-10312955 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | American Journal Experts |
record_format | MEDLINE/PubMed |
spelling | pubmed-103129552023-07-01 Transformers bridge vision and language to estimate and understand scene meaning Hayes, Taylor R. Henderson, John M. Res Sq Article Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic representations play in scene guidance remains difficult to study and poorly understood. Here, we apply a state-of-the-art multimodal transformer trained on billions of image-text pairs to help advance our understanding of the role semantic representations play in scene understanding. We demonstrate across multiple studies that this transformer-based approach can be used to automatically estimate local scene meaning in indoor and outdoor scenes, predict where people look in these scenes, detect changes in local semantic content, and provide a human-interpretable account of why one scene region is more meaningful than another. Taken together, these findings highlight how multimodal transformers can advance our understanding of the role scene semantics play in scene understanding by serving as a representational framework that bridges vision and language. American Journal Experts 2023-05-29 /pmc/articles/PMC10312955/ /pubmed/37398443 http://dx.doi.org/10.21203/rs.3.rs-2968381/v1 Text en https://creativecommons.org/licenses/by/4.0/This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/) , which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. |
spellingShingle | Article Hayes, Taylor R. Henderson, John M. Transformers bridge vision and language to estimate and understand scene meaning |
title | Transformers bridge vision and language to estimate and understand scene meaning |
title_full | Transformers bridge vision and language to estimate and understand scene meaning |
title_fullStr | Transformers bridge vision and language to estimate and understand scene meaning |
title_full_unstemmed | Transformers bridge vision and language to estimate and understand scene meaning |
title_short | Transformers bridge vision and language to estimate and understand scene meaning |
title_sort | transformers bridge vision and language to estimate and understand scene meaning |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10312955/ https://www.ncbi.nlm.nih.gov/pubmed/37398443 http://dx.doi.org/10.21203/rs.3.rs-2968381/v1 |
work_keys_str_mv | AT hayestaylorr transformersbridgevisionandlanguagetoestimateandunderstandscenemeaning AT hendersonjohnm transformersbridgevisionandlanguagetoestimateandunderstandscenemeaning |