Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs
Abstract
As artificial intelligence (AI) permeates society, ensuring fairness has become a foundational challenge. However, the field faces a “Babel Tower” dilemma: fairness metrics abound, yet their underlying philosophical assumptions often conflict, hindering unified evaluation paradigms. The problem is especially acute in unified multimodal large language models (UMLLMs), where biases propagate systemically across tasks. To address this, we introduce the IRIS Benchmark, to our knowledge the first benchmark designed to evaluate the fairness of both understanding and generation in UMLLMs synchronously. Built on our high-fidelity demographic classifier, ARES, and four supporting large-scale datasets, the benchmark normalizes and aggregates arbitrary metrics into a high-dimensional “fairness space”, integrating 60 granular metrics across three dimensions: Ideal Fairness, Real-world Fidelity, and Bias Inertia & Steerability (IRIS). Using this benchmark, our evaluation of leading UMLLMs uncovers systemic phenomena such as the “generation gap”, individual inconsistencies such as “personality splits”, and the “counter-stereotype reward”, while offering diagnostics to guide improvements in their fairness. With its extensible framework, the IRIS Benchmark can integrate ever-evolving fairness metrics, ultimately helping to resolve the “Babel Tower” impasse.