
Oral in Workshop: ICLR 2025 Workshop on Bidirectional Human-AI Alignment

ICLR Oral 2: Preference Optimization For Concept Bottleneck Models

Emiliano Penaloza


Abstract:

Aligning large language models (LLMs) with user preferences often relies on learning a reward model as a proxy from feedback. However, such reward models can fail on out-of-distribution examples and, if kept static, may reinforce incorrect preferences. We propose a dynamic alignment method that uses an energy-based out-of-distribution (OOD) scoring mechanism to identify potential misjudgments, then judiciously collects oracle feedback to refine both the policy and the reward model. By focusing on OOD examples, our approach iteratively improves alignment and robustness in preference-based training. Empirically, we show that our method enhances the policy model's generative capabilities on the LM Eval Harness benchmark and improves the reward model's judgment capability on RewardBench.
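
Below is a minimal sketch, not the authors' implementation, of how an energy-based OOD score could flag examples where a reward model's judgment is least trustworthy, so that oracle feedback is requested only for those cases. The function names, the shape of `reward_logits`, and the `energy_threshold` value are assumptions for illustration.

```python
# Sketch of energy-based OOD scoring over reward-model logits (illustrative only).
import torch


def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Free-energy score E(x) = -T * logsumexp(logits / T).

    Higher energy is conventionally read as more out-of-distribution.
    `logits` is assumed to have shape (batch, num_candidates).
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


def select_ood_examples(reward_logits: torch.Tensor, energy_threshold: float) -> torch.Tensor:
    """Return indices of examples whose energy exceeds the threshold.

    These are the candidates for which oracle feedback would be collected
    before refining the policy and reward model.
    """
    energies = energy_score(reward_logits)
    return torch.nonzero(energies > energy_threshold, as_tuple=False).squeeze(-1)


if __name__ == "__main__":
    # Random logits stand in for reward-model outputs on 8 prompts with 2 responses each.
    torch.manual_seed(0)
    logits = torch.randn(8, 2)
    ood_idx = select_ood_examples(logits, energy_threshold=-0.5)
    print("Query oracle feedback for examples:", ood_idx.tolist())
```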
