Principled Design for Trustworthy AI: Interpretability, Robustness, and Safety Across Modalities
Abstract
Modern AI systems, particularly large language models, vision-language models, and deep vision networks, are increasingly deployed in high-stakes settings such as healthcare, autonomous driving, and legal decision-making. Yet their lack of transparency, fragility under distribution shifts between training and test environments, and representation misalignment in emerging tasks and data modalities raise serious concerns about their trustworthiness. This workshop focuses on developing trustworthy AI systems by principled design: models that are interpretable, robust, and aligned across the full lifecycle – from training and evaluation to inference-time behavior and deployment. We aim to unify efforts across modalities (language, vision, audio, and time series) and across technical areas spanning interpretability, robustness, uncertainty, safety, and policy. Our goal is to create a platform for cross-disciplinary discussion and idea exchange on key dimensions of trustworthiness in modern AI systems, including interpretability and mechanistic transparency, uncertainty quantification and risk assessment for safe operation, adversarial and distributional robustness, and representation and safety alignment across diverse tasks and modalities. By bringing these efforts together under a cohesive design paradigm, the workshop seeks to advance forward-looking solutions and foster community building around shared technical and societal challenges in building trustworthy AI systems. This workshop differs from recent prior workshops (e.g., ICML'24 TiFA, NeurIPS'24 Interpretable AI, IJCAI'24 Trustworthy AI) in its unique focus on building trustworthy AI systems by design and its broad coverage of the full machine learning lifecycle across both single- and multi-modal settings.

Topics of interest span six pillars:
(1) Interpretable and Intervenable Models: concept bottlenecks and modular architectures; neuron tracing and causal influence methods; mechanistic interpretability and concept-based reasoning; interpretability for control and real-time intervention.
(2) Inference-Time Safety and Monitoring: reasoning trace auditing in LLMs and VLMs; inference-time safeguards and safety mechanisms; chain-of-thought consistency and hallucination detection; real-time monitoring and failure intervention mechanisms.
(3) Multimodal Trust Challenges: grounding failures and cross-modal misalignment; safety in vision-language and deep vision systems; cross-modal alignment and robust multimodal reasoning; trust and uncertainty in video, audio, and time-series models.
(4) Robustness and Threat Models: adversarial attacks and defenses; robustness to distributional, conceptual, and cascading shifts; formal verification methods and safety guarantees; robustness under streaming, online, or low-resource conditions.
(5) Trust Evaluation and Responsible Deployment: human-AI trust calibration, confidence estimation, and uncertainty quantification; metrics for interpretability, alignment, and robustness; transparent, reproducible, and accountable deployment pipelines; safety alignment in fine-tuning, instruction-tuning, and retrieval-augmented systems.
(6) Safety and Trustworthiness in LLM Agents: autonomous tool use and agentic behavior in LLMs; safety and failures in planning and action execution; emergent behaviors in multi-agent interactions; intervention and control in agent loops; alignment of long-horizon goals with user intent; auditing and debugging LLM agents in real-world deployment.