Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes

Bibliographic Details
Main Authors: Nayebi, Aran, Rajalingham, Rishi, Jazayeri, Mehrdad, Yang, Guangyu Robert
Format: Online Article Text
Language: English
Published: Cornell University 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10246064/
https://www.ncbi.nlm.nih.gov/pubmed/37292459
_version_ 1785054970939179008
author Nayebi, Aran
Rajalingham, Rishi
Jazayeri, Mehrdad
Yang, Guangyu Robert
author_facet Nayebi, Aran
Rajalingham, Rishi
Jazayeri, Mehrdad
Yang, Guangyu Robert
author_sort Nayebi, Aran
collection PubMed
description Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, plausible future states, and use that to plan and anticipate the consequences of actions. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts that contain thousands of comparisons to directly impinge on this question. Specifically, we construct and evaluate several classes of sensory-cognitive networks to predict the future state of rich, ethologically-relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-slot objectives, to models that future predict in the latent space of purely static image-pretrained or dynamic video-pretrained foundation models. We find that “scale is not all you need”, and that many state-of-the-art machine learning models fail to perform well on our neural and behavioral benchmarks for future prediction. In fact, only one class of models matches these data well overall. We find that neural responses are currently best predicted by models trained to predict the future state of their environment in the latent space of pretrained foundation models optimized for dynamic scenes in a self-supervised manner. These models also approach the neurons’ ability to predict the environmental state variables that are visually hidden from view, despite not being explicitly trained to do so. Finally, we find that not all foundation model latents are equal. Notably, models that future predict in the latent space of video foundation models that are optimized to support a diverse range of egocentric sensorimotor tasks, reasonably match both human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation have strong inductive biases associated with them, and are thus far most consistent with being optimized to future predict on reusable visual representations that are useful for Embodied AI more generally.
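The best-performing model class in the description above factorizes the problem into a frozen, self-supervised video foundation-model encoder and a separate dynamics module trained to forecast the encoder's own future latents. As a minimal sketch only (the PyTorch encoder interface, LSTM dynamics module, and dimensions below are illustrative assumptions, not the authors' implementation), the objective can be written as:

```python
import torch
import torch.nn as nn


class LatentFuturePredictor(nn.Module):
    """Sketch of latent-space future prediction: a small dynamics network is
    trained to forecast the next-step embeddings produced by a frozen,
    pretrained video foundation-model encoder. Encoder interface, LSTM
    dynamics, and dimensions are assumptions, not the paper's architecture."""

    def __init__(self, encoder: nn.Module, latent_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        self.encoder = encoder.eval()               # pretrained encoder, kept frozen
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.dynamics = nn.LSTM(latent_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, latent_dim)

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        with torch.no_grad():                       # no gradients through the foundation model
            z = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        h, _ = self.dynamics(z[:, :-1])             # condition on latents for frames 0..t-2
        z_pred = self.readout(h)                    # predicted latents for frames 1..t-1
        return z_pred, z[:, 1:]                     # (prediction, self-supervised target)


def future_prediction_loss(model: LatentFuturePredictor, frames: torch.Tensor) -> torch.Tensor:
    """Self-supervised objective: match the foundation model's own latents one
    step ahead, with no pixel reconstruction and no labels."""
    z_pred, z_target = model(frames)
    return nn.functional.mse_loss(z_pred, z_target)
```

In a setup like this, probes of visually hidden environmental state variables and comparisons to neural dynamics would read out from the predicted latents rather than from pixels, consistent with the analyses summarized in the description.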
format Online
Article
Text
id pubmed-10246064
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cornell University
record_format MEDLINE/PubMed
spelling pubmed-102460642023-06-08 Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes Nayebi, Aran Rajalingham, Rishi Jazayeri, Mehrdad Yang, Guangyu Robert ArXiv Article Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, plausible future states, and use that to plan and anticipate the consequences of actions. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts that contain thousands of comparisons to directly impinge on this question. Specifically, we construct and evaluate several classes of sensory-cognitive networks to predict the future state of rich, ethologically-relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-slot objectives, to models that future predict in the latent space of purely static image-pretrained or dynamic video-pretrained foundation models. We find that “scale is not all you need”, and that many state-of-the-art machine learning models fail to perform well on our neural and behavioral benchmarks for future prediction. In fact, only one class of models matches these data well overall. We find that neural responses are currently best predicted by models trained to predict the future state of their environment in the latent space of pretrained foundation models optimized for dynamic scenes in a self-supervised manner. These models also approach the neurons’ ability to predict the environmental state variables that are visually hidden from view, despite not being explicitly trained to do so. Finally, we find that not all foundation model latents are equal. Notably, models that future predict in the latent space of video foundation models that are optimized to support a diverse range of egocentric sensorimotor tasks, reasonably match both human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation have strong inductive biases associated with them, and are thus far most consistent with being optimized to future predict on reusable visual representations that are useful for Embodied AI more generally. Cornell University 2023-10-25 /pmc/articles/PMC10246064/ /pubmed/37292459 Text en https://creativecommons.org/licenses/by-nc-sa/4.0/This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (https://creativecommons.org/licenses/by-nc-sa/4.0/) , which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.
spellingShingle Article
Nayebi, Aran
Rajalingham, Rishi
Jazayeri, Mehrdad
Yang, Guangyu Robert
Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
title Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
title_full Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
title_fullStr Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
title_full_unstemmed Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
title_short Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
title_sort neural foundations of mental simulation: future prediction of latent representations on dynamic scenes
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10246064/
https://www.ncbi.nlm.nih.gov/pubmed/37292459
work_keys_str_mv AT nayebiaran neuralfoundationsofmentalsimulationfuturepredictionoflatentrepresentationsondynamicscenes
AT rajalinghamrishi neuralfoundationsofmentalsimulationfuturepredictionoflatentrepresentationsondynamicscenes
AT jazayerimehrdad neuralfoundationsofmentalsimulationfuturepredictionoflatentrepresentationsondynamicscenes
AT yangguangyurobert neuralfoundationsofmentalsimulationfuturepredictionoflatentrepresentationsondynamicscenes