Towards Statistical Verification of Fairness in Agentic Alignment
Abstract
Although modern machine learning systems such as large language models have achieved strong performance in low-stakes domains, their increasing deployment in autonomous decision-making roles can lead to systematic unfairness. While alignment strategies (e.g., RLHF) effectively optimize for general preferences, they typically rely on empirical validation rather than statistical certification, and therefore often cannot guarantee that normative fairness constraints hold under uncertainty. One emerging strategy imposes fairness-based probabilistic constraints that explicitly account for statistical uncertainty arising from the data-generating process and its interaction with the learning pipeline. In this work, we analyze these approaches as alignment mechanisms for enforcing high-confidence fairness and reliability in agentic and generative settings. We present a reduction that characterizes the relationship between high-confidence guarantees and their statistical verification, and develop a theoretical framework that situates a broad class of existing methods within a common structure. Using this framework, we reveal how underlying assumptions can fail in practice, particularly in composite data regimes common to modern alignment (e.g., RLAIF), where the discrepancy between proxy feedback and ground truth compromises standard statistical guarantees. Lastly, we theoretically characterize sufficient conditions under which these guarantees can be recovered, and empirically illustrate their practical implications.