Beyond Softmax and Entropy: $f$-Regularized Policy Gradients with Coupled Parametrizations
Safwan Labbi · Daniil Tiapkin · Paul Mangold · Eric Moulines
Abstract
We introduce $\texttt{f-PG}$, a new class of stochastic policy gradient methods regularized by a family of $f$-divergences, including entropy and Tsallis divergences. For each divergence, we employ a $\textit{coupled}$ parameterization, defined by the $f$-softargmax, which allows us to establish the first explicit, non-asymptotic, last-iterate convergence rates for stochastic policy gradient methods. To derive our analysis, we prove that the $f$-regularized value function is smooth and satisfies a Polyak-Łojasiewicz inequality as a function of the $f$-softargmax parameters. To establish the latter, we introduce a general policy improvement operator that restricts optimization to a well-defined policy space excluding ill-behaved policies. In the softmax case, this allows us to escape the "gravitational pull" and yields the first $\textit{explicit}$ convergence guarantees for this parameterization, closing a gap in the literature. Finally, we leverage these rates to derive sample complexity bounds for the unregularized problem and show that $\texttt{f-PG}$ with Tsallis divergences provides a provably better trade-off between sample complexity and regularization bias than softmax-based policy gradient with entropy regularization.
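For intuition, a generic $f$-regularized objective of this kind can be written (as an illustrative sketch following the usual regularized-MDP convention; the exact objective and $f$-softargmax map used by $\texttt{f-PG}$ are defined in the paper and may differ) as
$$
V_\tau^{\pi}(s) \;=\; \mathbb{E}^{\pi}\Big[\sum_{t \ge 0} \gamma^t \big( r(s_t, a_t) - \tau\, \Omega_f\big(\pi(\cdot \mid s_t)\big) \big) \,\Big|\, s_0 = s \Big],
\qquad
\Omega_f(p) \;=\; \sum_{a} f\big(p(a)\big),
$$
where $f(x) = x \log x$ recovers (negative-)entropy regularization, whose associated maximizing map over the simplex is the standard softmax, while Tsallis-type choices of $f$ yield Tsallis regularization.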