
Oral in Workshop: ICLR 2025 Workshop on Bidirectional Human-AI Alignment

ICLR Oral 2: Preference Optimization For Concept Bottleneck Models

Emiliano Penaloza


Abstract:

Aligning large language models (LLMs) with user preferences often relies on learning a reward model as a proxy from feedback. However, such reward models can fail on out-of-distribution examples and, if kept static, may reinforce incorrect preferences. We propose a dynamic alignment method that uses an energy-based out-of-distribution (OOD) scoring mechanism to identify potential misjudgments, then judiciously collects oracle feedback to refine both the policy and the reward model. By focusing on OOD examples, our approach iteratively improves alignment and robustness in preference-based training. Empirically, we show that our method enhances the policy model's generative capabilities on the LM Eval Harness benchmark and improves the reward model's judgment capability on RewardBench.
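
Below is a minimal sketch, not the authors' implementation, of how an energy-based OOD score could flag examples where a reward model's judgment is least trustworthy, so that oracle feedback is requested only for those cases. The function names, the shape of `reward_logits`, and the `energy_threshold` value are assumptions for illustration.

```python
# Sketch of energy-based OOD scoring over reward-model logits (illustrative only).
import torch


def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Free-energy score E(x) = -T * logsumexp(logits / T).

    Higher energy is conventionally read as more out-of-distribution.
    `logits` is assumed to have shape (batch, num_candidates).
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


def select_ood_examples(reward_logits: torch.Tensor, energy_threshold: float) -> torch.Tensor:
    """Return indices of examples whose energy exceeds the threshold.

    These are the candidates for which oracle feedback would be collected
    before refining the policy and reward model.
    """
    energies = energy_score(reward_logits)
    return torch.nonzero(energies > energy_threshold, as_tuple=False).squeeze(-1)


if __name__ == "__main__":
    # Random logits stand in for reward-model outputs on 8 prompts with 2 responses each.
    torch.manual_seed(0)
    logits = torch.randn(8, 2)
    ood_idx = select_ood_examples(logits, energy_threshold=-0.5)
    print("Query oracle feedback for examples:", ood_idx.tolist())
```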
