

Poster

Policy improvement by planning with Gumbel

Ivo Danihelka · Arthur Guez · Julian Schrittwieser · David Silver

Virtual

Keywords: [ reinforcement learning ] [ MuZero ]


Abstract:

AlphaZero is a powerful reinforcement learning algorithm based on approximate policy iteration and tree search. However, AlphaZero can fail to improve its policy network if it does not visit all actions at the root of the search tree. To address this issue, we propose a policy improvement algorithm based on sampling actions without replacement. Furthermore, we use the idea of policy improvement to replace the more heuristic mechanisms by which AlphaZero selects and uses actions, both at root nodes and at non-root nodes. Our new algorithms, Gumbel AlphaZero and Gumbel MuZero, respectively without and with model-learning, match the state of the art on Go, chess, and Atari, and significantly improve prior performance when planning with few simulations.
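
The abstract does not spell out how actions are sampled without replacement. One standard technique that fits the "Gumbel" in the title is the Gumbel-Top-k trick: add independent Gumbel(0, 1) noise to each action's logit and keep the k largest perturbed values. The sketch below is illustrative only and is not the authors' code. It assumes the policy network's output is available as a plain list of logits, and the function name `gumbel_top_k` and its parameters are hypothetical.

```python
import math
import random


def gumbel_top_k(logits, k, rng=None):
    """Sample k distinct action indices without replacement.

    Illustrative sketch of the Gumbel-Top-k trick: perturb each logit with
    independent Gumbel(0, 1) noise and keep the k largest perturbed values.
    `logits` is assumed to be a list of unnormalized log-probabilities.
    """
    if rng is None:
        rng = random.Random()
    if not 0 < k <= len(logits):
        raise ValueError("k must be between 1 and the number of actions")

    perturbed = []
    for action, logit in enumerate(logits):
        # Gumbel(0, 1) sample: -log(-log(U)) with U ~ Uniform(0, 1).
        # Clamp U away from 0 so the inner logarithm is always defined.
        u = max(rng.random(), 1e-300)
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel, action))

    # Highest perturbed logit first; the first k entries are the sample.
    perturbed.sort(reverse=True)
    return [action for _, action in perturbed[:k]]


if __name__ == "__main__":
    # Hypothetical root-node logits for four actions.
    root_logits = [2.0, 0.5, 0.1, -1.0]
    print(gumbel_top_k(root_logits, k=2, rng=random.Random(0)))
```

Because each action receives exactly one noise draw, no action can be selected twice, and the first index returned is distributed as a single draw from the softmax of the logits.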
