Poster
in
Workshop: Bridging the Gap Between Practice and Theory in Deep Learning

Distributed Reward-Free Exploration: A Provably Efficient Policy Optimization Algorithm

Hongyi Guo ⋅ Zhuoran Yang ⋅ Zhaoran Wang

Project Page [ OpenReview]

Abstract

Distributed training has become essential to the success of applying deep reinforcement learning (RL) algorithms to large-scale problems that require a huge number of samples. However, there is few work that studies distributed RL algorithms on the theoretical side. In this paper, we fill this vacancy by proposing DPORF, a distributed policy optimization algorithm for reward-free exploration and giving a theoretical analysis of our algorithm. Distinct from existing works, our algorithm does NOT require aggregating the collected samples in every iteration. Thus, we can hugely reduce the communication overhead, which is known to be the bottleneck for distributed RL. We show that, for the case with linear function approximations, DPORF needs $\tilde{\mathcal{O}} (d^2H^6/(M\epsilon^2))$ episodes to output an $\epsilon$-optimal policy for any reward function, where $d$ is the feature dimension, $H$ is the episode horizon, and $M$ is the number of agents. We also establish an $\Omega(d^2H^3/(M\epsilon^2))$ lower bound, which indicates that our upper bound matches the lower bound in terms of the dependence on $d, M$, and $\epsilon$. To the best of our knowledge, DPORF is the first provably efficient policy optimization algorithm for distributed reward-free reinforcement learning that does not require data aggregation in every iteration.

Chat is not available.