Poster

Actor-Critic Policy Optimization in a Large-Scale Imperfect-Information Game

Haobo Fu ⋅ Weiming Liu ⋅ Shuang Wu ⋅ Yijia Wang ⋅ Tao Yang ⋅ Kai Li ⋅ Junliang Xing ⋅ Bin Li ⋅ Bo Ma ⋅ Qiang Fu ⋅ Yang Wei

Keywords: policy optimization

2022 Poster


Abstract

Deep policy gradient methods have demonstrated promising results in many large-scale games, where the agent learns purely from its own experience. Yet policy gradient methods with self-play have difficulty converging to a Nash equilibrium (NE) in multi-agent settings. Counterfactual regret minimization (CFR) is guaranteed to converge to an NE in 2-player zero-sum games, but it usually needs domain-specific abstractions to deal with large-scale games. Inheriting merits from both methods, in this paper we extend the actor-critic algorithm framework in deep reinforcement learning to tackle a large-scale 2-player zero-sum imperfect-information game, 1-on-1 Mahjong, whose information set size and game length are much larger than those of poker. The proposed algorithm, named Actor-Critic Hedge (ACH), changes the policy optimization objective from maximizing the discounted return to minimizing a type of weighted cumulative counterfactual regret. This modification is achieved by approximating the regret with a deep neural network and minimizing the regret by generating self-play policies using Hedge. ACH is theoretically justified, as it is derived from a neural-based weighted CFR, for which we prove convergence to an NE under certain conditions. Experimental results on the proposed 1-on-1 Mahjong benchmark and on benchmarks from the literature demonstrate that ACH outperforms related state-of-the-art methods. In addition, the agent obtained by ACH defeats a human champion in 1-on-1 Mahjong.
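
To make the Hedge step mentioned in the abstract concrete, the following is a minimal tabular sketch of how a policy can be generated from cumulative regrets at a single information set. It is not the authors' ACH implementation: ACH approximates the regret with a deep neural network and uses a weighted counterfactual regret, whereas this sketch keeps a plain list of regrets. The function names `hedge_policy` and `update_regrets`, the step size `eta`, and the fixed action values are all assumptions made for illustration.

```python
import math


def hedge_policy(cumulative_regrets, eta=1.0):
    """Hedge: play each action with probability proportional to
    exp(eta * cumulative regret). `eta` is an assumed step size."""
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting distribution.
    m = max(cumulative_regrets)
    weights = [math.exp(eta * (r - m)) for r in cumulative_regrets]
    total = sum(weights)
    return [w / total for w in weights]


def update_regrets(cumulative_regrets, action_values, policy):
    """Add the instantaneous regret of each action: its value minus the
    expected value of the current policy (tabular stand-in for a critic)."""
    baseline = sum(p * q for p, q in zip(policy, action_values))
    return [r + (q - baseline) for r, q in zip(cumulative_regrets, action_values)]


if __name__ == "__main__":
    # Hypothetical action values at one information set (illustration only).
    action_values = [1.0, 0.0, -1.0]
    regrets = [0.0, 0.0, 0.0]
    for _ in range(20):
        policy = hedge_policy(regrets, eta=0.5)
        regrets = update_regrets(regrets, action_values, policy)
    print(hedge_policy(regrets, eta=0.5))
```

With zero regrets the policy is uniform; as regret accumulates for the best action, the policy concentrates on it. In a self-play setting the action values would change every iteration as the opponent's policy changes, which this single-decision sketch does not model.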
