Invited Talk by Isabela Albuquerque: From Pixels to Actions: A Roadmap for Evaluating Safety in Multimodal Agents
Abstract
As generative models evolve into autonomous agents, evaluation must shift from static "Pixel Parity" to dynamic "Procedural Parity". While traditional safety frameworks audit a single output, agentic workflows introduce sequential risks such as behavioral drift over long horizons. This talk outlines a roadmap toward a formal Evaluation Science, focusing on critical safety aspects such as fairness and alignment in multimodal agents. We bridge foundational statistical baselines used to measure multimodal alignment and distributional skew with Equality of Service, which assesses whether quality remains uniform across demographic intersections. By investigating how agents resolve query ambiguity, we highlight the importance of evaluating the equity of the decision-making process itself. Finally, we describe a statistical method that unifies human and automated signals, providing a scalable path toward the ultimate goal of treating responsible AI as a measurable, predictable property of autonomous systems.