Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes
Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, predict plausible future states, and use these to plan and anticipate the consequences of actions. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts that contain thousands of comparisons to directly address this question. Specifically, we construct and evaluate several classes of sensory-cognitive networks to predict the future state of rich, ethologically relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-slot objectives, to models that future predict in the latent space of purely static image-pretrained or dynamic video-pretrained foundation models. We find that “scale is not all you need”, and that many state-of-the-art machine learning models fail to perform well on our neural and behavioral benchmarks for future prediction. In fact, only one class of models matches these data well overall. We find that neural responses are currently best predicted by models trained to predict the future state of their environment in the latent space of pretrained foundation models optimized for dynamic scenes in a self-supervised manner. These models also approach the neurons’ ability to predict the environmental state variables that are visually hidden from view, despite not being explicitly trained to do so. Finally, we find that not all foundation model latents are equal. Notably, models that future predict in the latent space of video foundation models that are optimized to support a diverse range of egocentric sensorimotor tasks reasonably match both human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation have strong inductive biases associated with them, and are thus far most consistent with being optimized to future predict on reusable visual representations that are useful for Embodied AI more generally.
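The abstract's best-performing model class, future prediction in the latent space of a frozen pretrained visual encoder, can be illustrated with a short sketch. The PyTorch code below is a minimal, assumed setup: the encoder, module names, and dimensions are placeholders, not the authors' architecture or an actual foundation model, whose large pretrained weights are not reproduced here. A companion sketch of the neural-predictivity comparison follows the full record at the bottom of this page.

```python
# Minimal sketch of latent-space future prediction: a *frozen* pretrained
# encoder supplies latents, and only a small dynamics module is trained
# to roll those latents forward. All module names and sizes are
# illustrative placeholders, not the paper's implementation.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained visual foundation model (kept frozen)."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        for p in self.parameters():   # freeze: only the dynamics model learns
            p.requires_grad = False

    def forward(self, frames):        # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        z = self.net(frames.flatten(0, 1))
        return z.view(b, t, -1)       # latents: (B, T, latent_dim)

class LatentDynamics(nn.Module):
    """Recurrent forward model trained to predict the next latent state."""
    def __init__(self, latent_dim=256, hidden=512):
        super().__init__()
        self.rnn = nn.LSTM(latent_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, latent_dim)

    def forward(self, z_past):        # z_past: (B, T_past, latent_dim)
        h, _ = self.rnn(z_past)
        return self.readout(h[:, -1]) # predicted latent of the next frame

encoder, dynamics = FrozenEncoder(), LatentDynamics()
opt = torch.optim.Adam(dynamics.parameters(), lr=1e-4)

frames = torch.randn(8, 10, 3, 64, 64)  # toy batch: 8 clips of 10 frames
z = encoder(frames)
pred = dynamics(z[:, :-1])               # predict frame 10 from frames 1-9
loss = nn.functional.mse_loss(pred, z[:, -1])  # self-supervised latent MSE
opt.zero_grad(); loss.backward(); opt.step()
```

Freezing the encoder is the defining design choice of this model class: the hypothesis being tested is that future prediction over reusable pretrained representations, rather than over raw pixels or object slots, best matches primate neural dynamics and behavior.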
Main Authors: | Nayebi, Aran; Rajalingham, Rishi; Jazayeri, Mehrdad; Yang, Guangyu Robert
---|---
Format: | Online Article Text
Language: | English
Published: | Cornell University, 2023
Subjects: | Article
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10246064/ https://www.ncbi.nlm.nih.gov/pubmed/37292459
Field | Value
---|---
_version_ | 1785054970939179008
author | Nayebi, Aran Rajalingham, Rishi Jazayeri, Mehrdad Yang, Guangyu Robert |
author_facet | Nayebi, Aran Rajalingham, Rishi Jazayeri, Mehrdad Yang, Guangyu Robert |
author_sort | Nayebi, Aran |
collection | PubMed |
description | Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, predict plausible future states, and use these to plan and anticipate the consequences of actions. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts that contain thousands of comparisons to directly address this question. Specifically, we construct and evaluate several classes of sensory-cognitive networks to predict the future state of rich, ethologically relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-slot objectives, to models that future predict in the latent space of purely static image-pretrained or dynamic video-pretrained foundation models. We find that “scale is not all you need”, and that many state-of-the-art machine learning models fail to perform well on our neural and behavioral benchmarks for future prediction. In fact, only one class of models matches these data well overall. We find that neural responses are currently best predicted by models trained to predict the future state of their environment in the latent space of pretrained foundation models optimized for dynamic scenes in a self-supervised manner. These models also approach the neurons’ ability to predict the environmental state variables that are visually hidden from view, despite not being explicitly trained to do so. Finally, we find that not all foundation model latents are equal. Notably, models that future predict in the latent space of video foundation models that are optimized to support a diverse range of egocentric sensorimotor tasks reasonably match both human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation have strong inductive biases associated with them, and are thus far most consistent with being optimized to future predict on reusable visual representations that are useful for Embodied AI more generally. |
format | Online Article Text |
id | pubmed-10246064 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cornell University |
record_format | MEDLINE/PubMed |
spelling | pubmed-10246064 2023-06-08 Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes Nayebi, Aran; Rajalingham, Rishi; Jazayeri, Mehrdad; Yang, Guangyu Robert. ArXiv Article. Cornell University 2023-10-25 /pmc/articles/PMC10246064/ /pubmed/37292459 Text en https://creativecommons.org/licenses/by-nc-sa/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (https://creativecommons.org/licenses/by-nc-sa/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms. |
spellingShingle | Article Nayebi, Aran Rajalingham, Rishi Jazayeri, Mehrdad Yang, Guangyu Robert Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes |
title | Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes |
title_full | Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes |
title_fullStr | Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes |
title_full_unstemmed | Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes |
title_short | Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes |
title_sort | neural foundations of mental simulation: future prediction of latent representations on dynamic scenes |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10246064/ https://www.ncbi.nlm.nih.gov/pubmed/37292459 |
work_keys_str_mv | AT nayebiaran neuralfoundationsofmentalsimulationfuturepredictionoflatentrepresentationsondynamicscenes AT rajalinghamrishi neuralfoundationsofmentalsimulationfuturepredictionoflatentrepresentationsondynamicscenes AT jazayerimehrdad neuralfoundationsofmentalsimulationfuturepredictionoflatentrepresentationsondynamicscenes AT yangguangyurobert neuralfoundationsofmentalsimulationfuturepredictionoflatentrepresentationsondynamicscenes |
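The neural and behavioral benchmarks in the description compare model latents against neurophysiological recordings and human error patterns. One standard way to implement the neural comparison, sketched below on synthetic data as an assumption rather than the authors' exact pipeline, is cross-validated ridge regression from model features to per-neuron responses, scored by held-out correlation.

```python
# Hedged sketch of a neural-predictivity metric: fit a linear map from
# model latents to recorded responses, then score generalization on
# held-out stimuli. Data here is synthetic; the paper's exact fitting
# and scoring procedure may differ.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
latents = rng.standard_normal((500, 256))   # model latents: 500 stimuli x 256 dims
mixing = rng.standard_normal((256, 40))
neurons = latents @ mixing + rng.standard_normal((500, 40))  # 40 synthetic "neurons"

X_tr, X_te, Y_tr, Y_te = train_test_split(latents, neurons,
                                          test_size=0.2, random_state=0)
fit = Ridge(alpha=1.0).fit(X_tr, Y_tr)      # regularized linear readout
Y_hat = fit.predict(X_te)

# Per-neuron Pearson r between predicted and held-out responses,
# summarized by the median across neurons.
r = [np.corrcoef(Y_hat[:, i], Y_te[:, i])[0, 1] for i in range(Y_te.shape[1])]
print(f"median neural predictivity (Pearson r): {np.median(r):.3f}")
```

Under a metric like this, the description's claim is that latent-space future predictors built on video foundation models score highest, while end-to-end pixel-wise and object-slot models lag behind.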