Bias-Resilient Preference Optimization: Addressing Content-Aware, Multi-Source Biases in Preference Learning
Abstract
Large Language Models (LLMs) have made significant strides in generating human-like responses, largely due to preference alignment techniques. However, these methods typically assume unbiased human feedback, an assumption that rarely holds in real-world settings. This paper introduces Bias-Resilient Preference Optimization (BRPO), a novel framework that addresses multiple sources of content-dependent bias in preference learning. BRPO employs a multi-objective optimization approach to separate true preferences from biases, effectively mitigating their impact on alignment. We leverage backdoor attack mechanisms to efficiently learn and control for various biases within a single model. Theoretical analysis and extensive experiments on both synthetic and real-world datasets demonstrate that BRPO significantly improves alignment with primary human preferences while controlling for secondary biases such as those tied to response length and harmfulness.
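As a rough illustration of the multi-objective idea summarized above (a sketch under stated assumptions, not the paper's actual formulation), the snippet below combines a standard DPO-style preference loss with penalty terms that discourage the implicit reward margin from co-varying with bias factors such as length or harmfulness gaps. All names and hyperparameters (`beta`, `lambda_len`, `lambda_harm`, and the gap inputs) are hypothetical.

```python
import torch
import torch.nn.functional as F

def bias_resilient_preference_loss(
    logp_chosen, logp_rejected,          # policy log-probs of chosen/rejected responses
    ref_logp_chosen, ref_logp_rejected,  # frozen reference-model log-probs
    length_gap,                          # hypothetical per-pair length difference (normalized)
    harm_gap,                            # hypothetical per-pair harmfulness-score difference
    beta=0.1, lambda_len=0.05, lambda_harm=0.05,
):
    # Implicit reward margin between chosen and rejected responses (DPO-style).
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))

    # Primary objective: fit the observed preference labels.
    primary = -F.logsigmoid(margin).mean()

    # Secondary objectives: penalize batch-level covariance between the margin
    # and each bias factor, so preferences are not explained by length/harmfulness alone.
    def cov(a, b):
        return ((a - a.mean()) * (b - b.mean())).mean()

    secondary = lambda_len * cov(margin, length_gap).square() \
              + lambda_harm * cov(margin, harm_gap).square()

    return primary + secondary
```

In practice, the bias inputs would come from simple per-pair statistics, e.g., tokenized length differences and scores from a harmfulness classifier; the paper's actual mechanism (backdoor-style triggers within a single model) is not reproduced here.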