
Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons


Bibliographic Details
Main Authors: Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram
Format: Online Article Text
Language: English
Published: Public Library of Science 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3623741/
https://www.ncbi.nlm.nih.gov/pubmed/23592970
http://dx.doi.org/10.1371/journal.pcbi.1003024
author Frémaux, Nicolas
Sprekeler, Henning
Gerstner, Wulfram
collection PubMed
description Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
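For orientation, a minimal sketch of the continuous-time temporal difference (TD) formulation of Doya (2000) that the description refers to; the symbols used here (value estimate V(t), reward rate r(t), discount time constant \tau_r) are illustrative, and the paper's exact discounting and normalization conventions may differ:

\[ V(t) \;\approx\; \int_t^{\infty} e^{-(s-t)/\tau_r}\, r(s)\, ds, \qquad \delta(t) \;=\; r(t) \;-\; \frac{1}{\tau_r}\, V(t) \;+\; \frac{dV(t)}{dt}. \]

With an exact value estimate the TD error \delta(t) vanishes; in the model described above, the critic's spiking activity provides the estimate V(t), and \delta(t) acts as the global neuromodulatory signal that gates synaptic plasticity in both the critic and the actor.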
format Online
Article
Text
id pubmed-3623741
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3623741 2013-04-16 Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons. Frémaux, Nicolas; Sprekeler, Henning; Gerstner, Wulfram. PLoS Comput Biol, Research Article. Public Library of Science 2013-04-11 /pmc/articles/PMC3623741/ /pubmed/23592970 http://dx.doi.org/10.1371/journal.pcbi.1003024 Text en © 2013 Frémaux et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3623741/
https://www.ncbi.nlm.nih.gov/pubmed/23592970
http://dx.doi.org/10.1371/journal.pcbi.1003024