Relative Value Learning
Marc Höftmann · Jan Robine · Stefan Harmeling
Abstract
In reinforcement learning (RL), critics traditionally learn absolute state values, estimating how good a particular state is in isolation. Since adding any constant to $V(s)$ leaves action preferences unchanged, only value differences matter for decision making. Motivated by this observation, we ask whether these differences can be learned directly. To this end, we propose \emph{Relative Value Learning} (RV), a framework built on antisymmetric value differences $\Delta(s_i, s_j) = V(s_i) - V(s_j)$. We define a new pairwise Bellman operator, prove that it is a $\gamma$-contraction whose unique fixed point equals the true value differences, derive well-posed $1$-step, $n$-step, and $\lambda$-return targets, and reconstruct generalized advantage estimation from pairwise differences, yielding an unbiased policy-gradient estimator (R-GAE). Beyond these theoretical contributions, we integrate RV with PPO and achieve performance competitive with standard PPO on the Atari benchmark (49 games, ALE), indicating that relative value estimation is an effective alternative to absolute critics. Source code will be made available.
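To make the construction concrete, the following is an illustrative (assumed) form of a pairwise Bellman operator consistent with the properties stated above; the paper's exact definition may differ. For a policy $\pi$, with $a_k \sim \pi(\cdot \mid s_k)$ and $s_k' \sim P(\cdot \mid s_k, a_k)$ for $k \in \{i, j\}$,
\[
(\mathcal{T}^{\pi}\Delta)(s_i, s_j) = \mathbb{E}_{\pi}\!\left[\big(r(s_i, a_i) - r(s_j, a_j)\big) + \gamma\, \Delta(s_i', s_j')\right],
\]
which satisfies $\lVert \mathcal{T}^{\pi}\Delta_1 - \mathcal{T}^{\pi}\Delta_2 \rVert_{\infty} \le \gamma \lVert \Delta_1 - \Delta_2 \rVert_{\infty}$ and therefore admits the unique fixed point $\Delta^{\pi}(s_i, s_j) = V^{\pi}(s_i) - V^{\pi}(s_j)$.

The sketch below shows how a 1-step regression target for such a difference critic could be implemented. It is a minimal illustration, assuming a simple antisymmetric parameterization $\Delta_\theta(s_i, s_j) = \phi_\theta(s_i) - \phi_\theta(s_j)$; the names RelativeCritic and relative_td_loss are hypothetical and not taken from the paper or its released code.

import torch
import torch.nn as nn

class RelativeCritic(nn.Module):
    """Antisymmetric by construction: Delta(s_i, s_j) = phi(s_i) - phi(s_j)."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, s_i: torch.Tensor, s_j: torch.Tensor) -> torch.Tensor:
        # Difference of a shared scalar head, so antisymmetry holds exactly.
        return (self.phi(s_i) - self.phi(s_j)).squeeze(-1)

def relative_td_loss(critic, s_i, r_i, s_i_next, s_j, r_j, s_j_next, gamma=0.99):
    # 1-step pairwise target (r_i - r_j) + gamma * Delta(s_i', s_j'),
    # mirroring the pairwise Bellman relation sketched above.
    with torch.no_grad():
        target = (r_i - r_j) + gamma * critic(s_i_next, s_j_next)
    return nn.functional.mse_loss(critic(s_i, s_j), target)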