

Oral in Workshop: ICLR 2025 Workshop on Bidirectional Human-AI Alignment

Position: Interpretability is a Bidirectional Communication Problem


Abstract:

Interpretability is the process of explaining neural networks in a human-understandable way. A good explanation has three core components: it is (1) faithful to the explained model, (2) understandable to the interpreter, and (3) effectively communicated. We argue that current mechanistic interpretability methods focus primarily on faithfulness and could improve by additionally considering the human interpreter and the communication process. We propose and analyse two approaches to Concept Enrichment for the human interpreter, Pre-Explanation Learning and Mechanistic Socratic Explanation, which use the AI's representations to teach the interpreter novel and useful concepts. We reframe the Interpretability Problem as a Bidirectional Communication Problem between the model and the interpreter, highlighting interpretability's pedagogical aspects. We suggest that Concept Enrichment may be a key way to aid Conceptual Alignment between AIs and humans for improved mutual understanding.
