

Oral in Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions

Bias-Resilient Preference Optimization: Addressing Content-Aware, Multi-Source Biases in Preference Learning

Amirabbas Afzali · Amirhossein Afsharrad · Seyed Mousavi · Sanjay Lall

Keywords: [ Preference Alignment ] [ Robust Preference Optimization ] [ Backdoor Attack ] [ Reinforcement Learning from Human Feedback ]


Abstract:

Large Language Models (LLMs) have made significant strides in generating human-like responses, largely due to preference alignment techniques. However, these methods often assume unbiased human feedback, which is rarely the case in real-world scenarios. This paper introduces Bias-Resilient Preference Optimization (BRPO), a novel framework that addresses multiple sources of content-dependent bias in preference learning. BRPO employs a multi-objective optimization approach to separate true preferences from biases, effectively mitigating their impact. We leverage backdoor attack mechanisms to efficiently learn and control for various biases within a single model. Theoretical analysis and extensive experiments on both synthetic and real-world datasets demonstrate that BRPO significantly improves alignment with primary human preferences while controlling for secondary biases such as response length and harmfulness.
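As a rough illustration of the idea described above, the following is a minimal, hypothetical sketch of how a DPO-style preference loss could be split into a primary-preference term and a separate, trigger-conditioned bias term, in the spirit of the multi-objective and backdoor-style conditioning the abstract mentions. All names here (`dpo_logits`, `brpo_style_loss`, `bias_labels`, `lambda_bias`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not the authors' code): a DPO-style objective with a
# second, bias-specific term. The bias term is meant to be computed on
# trigger-conditioned inputs (e.g., a special token prepended to the prompt),
# so that bias-driven comparisons are absorbed by a dedicated objective
# instead of contaminating the primary preference signal.

import torch
import torch.nn.functional as F


def dpo_logits(policy_logps_chosen, policy_logps_rejected,
               ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Standard DPO preference logit: beta * difference of log-ratios."""
    return beta * ((policy_logps_chosen - ref_logps_chosen)
                   - (policy_logps_rejected - ref_logps_rejected))


def brpo_style_loss(main_logits, bias_logits, bias_labels, lambda_bias=0.5):
    """
    Multi-objective sketch:
      - main term: fit the primary (assumed bias-free) preference comparisons
      - bias term: fit bias-correlated comparisons under the trigger condition
    `main_logits` and `bias_logits` would both come from `dpo_logits`, computed
    on the standard prompt and the trigger-prefixed prompt respectively.
    `bias_labels` is 1 where a pair is flagged as bias-driven (e.g., the longer
    response was preferred mainly because of length), else 0.
    """
    main_loss = -F.logsigmoid(main_logits)   # primary preference objective
    bias_loss = -F.logsigmoid(bias_logits)   # bias objective under trigger
    per_pair = (1 - bias_labels) * main_loss + lambda_bias * bias_labels * bias_loss
    return per_pair.mean()
```

In such a setup, weighting the two terms (here via `lambda_bias`) is what makes the optimization multi-objective: the primary preference is learned from comparisons believed to be clean, while a single trigger-conditioned model absorbs each secondary bias, which is the role the abstract attributes to the backdoor-attack mechanism.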
