Oral in Workshop: ICLR 2025 Workshop on Bidirectional Human-AI Alignment

The Alignment Trilemma: A Theoretical Perspective on Recursive Misalignment and Human-AI Adaptation Dynamics

Abstract

We introduce the \emph{Alignment Trilemma} as a theoretical framework to explain the recursive misalignment observed in contemporary AI alignment methods. Our formulation decomposes misalignment into three interdependent components---direct alignment, capability preservation, and meta-alignment---whose conflicting optimization can trigger cycles of drift. In light of recent work on human-AI adaptation dynamics \citep{Shen2024Bidirectional, Carroll2024DRMDP, Harland2024MORL} and adaptive teaming architectures \citep{Ni2021Adaptive, Mahmood2024Behavior}, we propose a holistic approach that includes a novel metric, the \emph{Alignment Performance Score (APS)}, which captures the overall quality of alignment across these three dimensions. Our insights aim to guide the development of AI systems that co-evolve safely with human partners.
