

Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation

Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution

Shichang Zhang · Tessa Han · Usha Bhalla · Hima Lakkaraju


Abstract:

The increasing complexity of AI systems has made understanding their behavior and building trust in them a critical challenge, especially for large language models. Numerous methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of approaches and terminology. We argue that feature, data, and component attribution methods share fundamental similarities, and that bridging them can benefit interpretability research. We conduct a detailed analysis of successful methods across these three attribution aspects and present a unified view demonstrating that they employ similar approaches: perturbations, gradients, and linear approximations. Our unified view enhances understanding of attribution methods and highlights new directions for interpretability and broader AI areas, including model editing, steering, and regulation.
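As a purely illustrative sketch (not the paper's method), one instance of the gradient-based family the abstract refers to is input-times-gradient feature attribution; in the snippet below, `model`, `inputs`, and `target_class` are hypothetical placeholders.

```python
# Illustrative sketch: gradient-based feature attribution via input * gradient.
# `model` is assumed to map an input tensor to class logits.
import torch

def input_x_gradient(model, inputs, target_class):
    """Attribute a prediction to input features with the input * gradient heuristic."""
    inputs = inputs.clone().detach().requires_grad_(True)
    logits = model(inputs)                     # forward pass
    score = logits[:, target_class].sum()      # scalar score for the class of interest
    score.backward()                           # gradients of the score w.r.t. the inputs
    return (inputs * inputs.grad).detach()     # elementwise input-times-gradient attribution
```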
