

Oral in Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions

Bias-Resilient Preference Optimization: Addressing Content-Aware, Multi-Source Biases in Preference Learning

Amirabbas Afzali · Amirhossein Afsharrad · Seyed Mousavi · Sanjay Lall

Keywords: [ Preference Alignment ] [ Robust Preference Optimization ] [ Backdoor Attack ] [ Reinforcement Learning from Human Feedback ]


Abstract:

Large Language Models (LLMs) have made significant strides in generating human-like responses, largely due to preference alignment techniques. However, these methods often assume unbiased human feedback, which is rarely the case in real-world scenarios. This paper introduces Bias-Resilient Preference Optimization (BRPO), a novel framework that addresses multiple sources of content-dependent bias in preference learning. BRPO employs a multi-objective optimization approach to separate true preferences from biases, effectively mitigating their impact. We leverage backdoor attack mechanisms to efficiently learn and control for various biases within a single model. Theoretical analysis and extensive experiments on both synthetic and real-world datasets demonstrate that BRPO significantly improves alignment with primary human preferences while controlling for secondary biases such as response length and harmfulness.
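As a rough illustration of the idea described above, the following is a minimal, hypothetical sketch of how a DPO-style preference loss could be split into a primary-preference term and a separate, trigger-conditioned bias term, in the spirit of the multi-objective and backdoor-style conditioning the abstract mentions. All names here (`dpo_logits`, `brpo_style_loss`, `bias_labels`, `lambda_bias`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch (not the authors' code): a DPO-style objective with a
# second, bias-specific term. The bias term is meant to be computed on
# trigger-conditioned inputs (e.g., a special token prepended to the prompt),
# so that bias-driven comparisons are absorbed by a dedicated objective
# instead of contaminating the primary preference signal.

import torch
import torch.nn.functional as F


def dpo_logits(policy_logps_chosen, policy_logps_rejected,
               ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Standard DPO preference logit: beta * difference of log-ratios."""
    return beta * ((policy_logps_chosen - ref_logps_chosen)
                   - (policy_logps_rejected - ref_logps_rejected))


def brpo_style_loss(main_logits, bias_logits, bias_labels, lambda_bias=0.5):
    """
    Multi-objective sketch:
      - main term: fit the primary (assumed bias-free) preference comparisons
      - bias term: fit bias-correlated comparisons under the trigger condition
    `main_logits` and `bias_logits` would both come from `dpo_logits`, computed
    on the standard prompt and the trigger-prefixed prompt respectively.
    `bias_labels` is 1 where a pair is flagged as bias-driven (e.g., the longer
    response was preferred mainly because of length), else 0.
    """
    main_loss = -F.logsigmoid(main_logits)   # primary preference objective
    bias_loss = -F.logsigmoid(bias_logits)   # bias objective under trigger
    per_pair = (1 - bias_labels) * main_loss + lambda_bias * bias_labels * bias_loss
    return per_pair.mean()
```

In such a setup, weighting the two terms (here via `lambda_bias`) is what makes the optimization multi-objective: the primary preference is learned from comparisons believed to be clean, while a single trigger-conditioned model absorbs each secondary bias, which is the role the abstract attributes to the backdoor-attack mechanism.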
