

Poster

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen ⋅ Ruiqi Zhong ⋅ Akbir Khan ⋅ Ethan Perez ⋅ Jacob Steinhardt ⋅ Minlie Huang ⋅ Sam Bowman ⋅ He He ⋅ Shi Feng
2025 Poster


