Poster

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen ⋅ Ruiqi Zhong ⋅ Akbir Khan ⋅ Ethan Perez ⋅ Jacob Steinhardt ⋅ Minlie Huang ⋅ Sam Bowman ⋅ He He ⋅ Shi Feng

2025 Poster

[OpenReview]

Abstract

Language models (LMs) can produce errors that are hard for humans to detect, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing humans that they are right even when they are wrong. We study this phenomenon under a standard RLHF pipeline, calling it "U-Sophistry" since it is Unintended by model developers. Specifically, we ask time-constrained (e.g., 3-10 minutes) human subjects to evaluate the correctness of model outputs and calculate humans' accuracy against gold labels. On a question-answering task (QuALITY) and a programming task (APPS), RLHF makes LMs better at convincing our subjects but not at completing the task correctly. RLHF also makes the model harder to evaluate: our subjects' false positive rate increases by 24.1% on QuALITY and 18.3% on APPS. Finally, we show that probing, a state-of-the-art approach for detecting Intended Sophistry (e.g., backdoored LMs), does not generalize to U-Sophistry. Our results highlight an important failure mode of RLHF and call for more research on assisting humans in aligning LMs.
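
The two evaluation quantities named in the abstract, human accuracy against gold labels and false positive rate, can be illustrated with a minimal sketch. It assumes that a false positive is an incorrect model output that a human evaluator approves as correct; the abstract does not spell out this definition. The function name `evaluation_metrics` and its arguments are hypothetical and are not taken from the authors' code.

```python
def evaluation_metrics(human_judgments, gold_labels):
    """Compare human verdicts on model outputs with gold correctness labels.

    human_judgments[i] is True if the evaluator judged output i correct.
    gold_labels[i] is True if output i is actually correct.
    Assumption: a false positive is an incorrect output judged correct.
    """
    if len(human_judgments) != len(gold_labels):
        raise ValueError("judgments and labels must have the same length")
    if not gold_labels:
        raise ValueError("at least one evaluated output is required")

    pairs = list(zip(human_judgments, gold_labels))

    # Accuracy: share of outputs where the human verdict matches the gold label.
    accuracy = sum(h == g for h, g in pairs) / len(pairs)

    # False positive rate: among incorrect outputs, the share judged correct.
    verdicts_on_wrong = [h for h, g in pairs if not g]
    if verdicts_on_wrong:
        false_positive_rate = sum(verdicts_on_wrong) / len(verdicts_on_wrong)
    else:
        false_positive_rate = None  # undefined when no output is incorrect

    return {"accuracy": accuracy, "false_positive_rate": false_positive_rate}


# Toy data for illustration only, not results from the study.
toy = evaluation_metrics(
    human_judgments=[True, True, False, True],
    gold_labels=[True, False, False, False],
)
```

Under this definition, a rising false positive rate means evaluators approve a larger share of wrong outputs, which is what "harder to evaluate" refers to.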
