Poster in Workshop: Algorithmic Fairness Across Alignment Procedures and Agentic Systems

Distortion of AI Alignment Revisited: RLHF Is a Decent Utilitarian Aligner

Kazusato Oko ⋅ Annie Ulichney ⋅ Nika Haghtalab ⋅ Han Bao

Abstract

While Reinforcement Learning from Human Feedback (RLHF) is the standard paradigm for aligning large language models with human preferences, its effectiveness in pluralistic settings has been called into question. Notably, recent work by Golz et al. (2025) demonstrated that the *distortion*, defined as the multiplicative gap between the average user utility of the RLHF policy and the optimal average utility, can scale exponentially with the Bradley-Terry temperature parameter $\beta$ when users have heterogeneous preferences. In this work, we present a fine-grained analysis of the distortion of RLHF with reward clipping and demonstrate that such exponential degradation is not a fundamental property of the algorithm but rather a consequence of a mismatch between the distribution generating preference data ($\mu$) and the KL reference policy ($\pi_{\mathrm{ref}}$). We establish tight upper and lower bounds on the distortion of RLHF across multiple regimes of the KL regularization strength. We show that in a representative regime, under the Bradley-Terry model, the distortion is $\tilde{\Theta}(\beta B)$, where $B$ is an upper bound on the log density ratio between $\mu$ and $\pi_{\mathrm{ref}}$. As a consequence, when there is no distribution mismatch (i.e., $\mu = \pi_{\mathrm{ref}}$), RLHF achieves the optimal distortion of $O(\beta)$ up to a constant.
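
As a toy illustration of the two quantities the abstract names, the sketch below computes a distortion ratio and a log density ratio bound for a small finite response set. It is not taken from the paper: the function names, the finite-response setting, the choice to write distortion as optimal average utility divided by the policy's average utility, and the use of a maximum absolute log ratio for $B$ are all assumptions made for illustration, and the paper's formal definitions may differ.

```python
import math

def average_utility(policy, utilities):
    """Mean over users of the expected utility of `policy`.

    `policy` is a list of probabilities over responses; `utilities[i][a]`
    is the (hypothetical) utility user i assigns to response a.
    """
    per_user = [sum(p * u for p, u in zip(policy, row)) for row in utilities]
    return sum(per_user) / len(per_user)

def distortion(policy, utilities):
    """Optimal average utility divided by the policy's average utility.

    Average utility is linear in the policy, so the optimum is attained
    by putting all mass on the single best response.
    """
    n_users = len(utilities)
    n_responses = len(policy)
    best = max(
        sum(row[a] for row in utilities) / n_users for a in range(n_responses)
    )
    return best / average_utility(policy, utilities)

def log_density_ratio_bound(mu, pi_ref):
    """Assumed reading of B: the largest |log(mu(a) / pi_ref(a))| over responses."""
    return max(abs(math.log(m / r)) for m, r in zip(mu, pi_ref))

# Three users, two responses: two users prefer response 0, one prefers response 1.
utilities = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
uniform = [0.5, 0.5]
majority = [1.0, 0.0]

print(distortion(uniform, utilities))    # ratio for the uniform policy
print(distortion(majority, utilities))   # ratio for the majority-preferred response
print(log_density_ratio_bound([0.5, 0.5], [0.5, 0.5]))  # no mismatch
print(log_density_ratio_bound([0.8, 0.2], [0.5, 0.5]))  # mismatch between mu and pi_ref
```

In this toy population the uniform policy earns an average utility of 1/2 against an optimum of 2/3, so its distortion is 4/3, while the policy that always returns the majority-preferred response has distortion 1. When the two distributions coincide the assumed bound is 0, matching the no-mismatch case $\mu = \pi_{\mathrm{ref}}$ discussed in the abstract.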
