Safe reinforcement learning under temporal logic with reward design and quantum action selection
Main Authors: | Cai, Mingyu, Xiao, Shaoping, Li, Junchao, Kan, Zhen |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2023 |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9894922/ https://www.ncbi.nlm.nih.gov/pubmed/36732441 http://dx.doi.org/10.1038/s41598-023-28582-4 |
_version_ | 1784881837202472960 |
---|---|
author | Cai, Mingyu Xiao, Shaoping Li, Junchao Kan, Zhen |
author_facet | Cai, Mingyu Xiao, Shaoping Li, Junchao Kan, Zhen |
author_sort | Cai, Mingyu |
collection | PubMed |
description | This paper proposes an advanced Reinforcement Learning (RL) method, incorporating reward shaping, safety value functions, and a quantum action selection algorithm. The method is model-free and can synthesize a finite policy that maximizes the probability of satisfying a complex task. Although RL is a promising approach, it suffers from unsafe traps and sparse rewards and becomes impractical when applied to real-world problems. To improve safety during training, we introduce a concept of safety values, which results in a model-based adaptive scenario due to online updates of transition probabilities. On the other hand, a high-level complex task is usually formulated via formal languages, including Linear Temporal Logic (LTL). Another novelty of this work is the use of an Embedded Limit-Deterministic Generalized Büchi Automaton (E-LDGBA) to represent an LTL formula. The obtained deterministic policy generalizes to tasks over both infinite and finite horizons. We design an automaton-based reward, and the theoretical analysis shows that an agent can accomplish task specifications with maximum probability by following the optimal policy. Furthermore, a reward-shaping process is developed to avoid sparse rewards and enforce RL convergence while keeping the optimal policies invariant. In addition, inspired by quantum computing, we propose a quantum action selection algorithm to replace the existing ε-greedy algorithm to balance exploration and exploitation during learning. Simulations demonstrate how the proposed framework achieves good performance by dramatically reducing the number of visits to unsafe states while converging to optimal policies. |
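The abstract's claim that shaping can densify sparse rewards "while keeping the optimal policies invariant" matches the standard potential-based shaping scheme. A minimal sketch of that general idea, assuming a toy 5-state chain MDP and a hypothetical potential function `phi` (the paper's actual automaton-based reward design is not reproduced here):

```python
import random

GAMMA = 0.9  # discount factor

def phi(state):
    # Hypothetical potential: negative distance to the goal state 4.
    return -abs(4 - state)

def shaped_reward(base_reward, s, s_next):
    # Potential-based shaping F = gamma*phi(s') - phi(s) densifies a
    # sparse reward signal without changing the optimal policy.
    return base_reward + GAMMA * phi(s_next) - phi(s)

def step(s, a):
    # Deterministic chain dynamics; base reward 1 only at the goal.
    s_next = min(4, max(0, s + a))
    base = 1.0 if s_next == 4 else 0.0
    return s_next, base

# Tabular Q-learning with a purely random behavior policy.
Q = {(s, a): 0.0 for s in range(5) for a in (-1, +1)}
ALPHA = 0.5
random.seed(0)
for _ in range(500):
    s = 0
    while s != 4:
        a = random.choice((-1, +1))
        s_next, base = step(s, a)
        r = shaped_reward(base, s, s_next)
        best_next = max(Q[(s_next, b)] for b in (-1, +1))
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

# Greedy policy in non-goal states: always move toward the goal.
greedy = [max((-1, +1), key=lambda a: Q[(s, a)]) for s in range(4)]
print(greedy)
```

Even though every per-step reward is altered, the greedy policy recovered is the same one plain Q-learning would converge to, illustrating the policy-invariance property the abstract relies on.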
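The "quantum action selection" replacing ε-greedy can be pictured as a measurement-like collapse: each action carries an amplitude, and it is selected with probability equal to the squared amplitude. The following is a rough classical simulation of that idea only, with an assumed Boltzmann-style amplitude assignment; it is an illustration, not the authors' algorithm:

```python
import math
import random

def quantum_inspired_select(q_values, temperature=1.0, rng=random):
    """Pick an action index with probability |amplitude|^2, where the
    amplitude grows with the estimated Q-value. Classical simulation of
    a collapse-like measurement; no quantum hardware involved."""
    # Amplitudes proportional to exp(Q/(2T)), so the resulting
    # probabilities |a|^2 form a Boltzmann distribution over Q-values:
    # high-Q actions are exploited, low-Q actions are still explored.
    amps = [math.exp(q / (2.0 * temperature)) for q in q_values]
    norm = math.sqrt(sum(a * a for a in amps))
    probs = [(a / norm) ** 2 for a in amps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

# Sample the selector many times for three actions with Q = 1, 0, 2.
random.seed(1)
counts = [0, 0, 0]
for _ in range(10000):
    counts[quantum_inspired_select([1.0, 0.0, 2.0])] += 1
print(counts)
```

Unlike ε-greedy, which explores uniformly with probability ε, this scheme grades exploration by estimated value: the highest-Q action dominates, but every action keeps a nonzero selection probability.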
format | Online Article Text |
id | pubmed-9894922 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-9894922 2023-02-04 Safe reinforcement learning under temporal logic with reward design and quantum action selection Cai, Mingyu Xiao, Shaoping Li, Junchao Kan, Zhen Sci Rep Article Nature Publishing Group UK 2023-02-02 /pmc/articles/PMC9894922/ /pubmed/36732441 http://dx.doi.org/10.1038/s41598-023-28582-4 Text en © The Author(s) 2023. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License; to view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/ . |
spellingShingle | Article Cai, Mingyu Xiao, Shaoping Li, Junchao Kan, Zhen Safe reinforcement learning under temporal logic with reward design and quantum action selection |
title | Safe reinforcement learning under temporal logic with reward design and quantum action selection |
title_full | Safe reinforcement learning under temporal logic with reward design and quantum action selection |
title_fullStr | Safe reinforcement learning under temporal logic with reward design and quantum action selection |
title_full_unstemmed | Safe reinforcement learning under temporal logic with reward design and quantum action selection |
title_short | Safe reinforcement learning under temporal logic with reward design and quantum action selection |
title_sort | safe reinforcement learning under temporal logic with reward design and quantum action selection |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9894922/ https://www.ncbi.nlm.nih.gov/pubmed/36732441 http://dx.doi.org/10.1038/s41598-023-28582-4 |
work_keys_str_mv | AT caimingyu safereinforcementlearningundertemporallogicwithrewarddesignandquantumactionselection AT xiaoshaoping safereinforcementlearningundertemporallogicwithrewarddesignandquantumactionselection AT lijunchao safereinforcementlearningundertemporallogicwithrewarddesignandquantumactionselection AT kanzhen safereinforcementlearningundertemporallogicwithrewarddesignandquantumactionselection |