

Poster

Successor Heads: Recurring, Interpretable Attention Heads In The Wild

Rhys Gould · Euan Ong · George Ogden · Arthur Conmy

Halle B #43
Fri 10 May 1:45 a.m. PDT — 3:45 a.m. PDT

Abstract:

In this work we describe successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment 'Monday' into 'Tuesday'. We explain successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has struggled to find recurring, mechanistically interpretable large language model (LLM) components beyond small toy models, and existing results have yielded little insight into the internals of the larger models used in practice. In this paper, we analyze the behavior of successor heads in LLMs and find that they implement abstract representations common to different architectures. Successor heads form in LLMs with as few as 31 million parameters, and in models with at least 12 billion, such as GPT-2, Pythia, and Llama-2. We find a set of 'mod 10' features that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, where we find that successor heads are important for achieving low loss on examples involving succession, and we also identify interpretable polysemanticity in a Pythia successor head.
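To make the 'mod 10' vector-arithmetic idea concrete, the sketch below shows the form such an edit could take: subtracting one hypothetical 'mod 10' feature direction from a token's representation and adding another before projecting through a head's output-value (OV) circuit. All names here (mod10_features, W_OV, edit_mod10) and the random placeholder weights are illustrative assumptions, not the authors' code or any real model's checkpoints.

```python
# Illustrative sketch of a 'mod 10' vector-arithmetic edit.
# Assumed names and random placeholder weights -- not the paper's code.
import numpy as np

d_model = 512
rng = np.random.default_rng(0)

# Hypothetical feature directions, one per residue class mod 10.
mod10_features = rng.normal(size=(10, d_model))

# Hypothetical OV matrix for a successor head.
W_OV = rng.normal(size=(d_model, d_model))

def edit_mod10(x: np.ndarray, old_digit: int, new_digit: int) -> np.ndarray:
    """Swap one 'mod 10' feature for another via vector arithmetic."""
    return x - mod10_features[old_digit] + mod10_features[new_digit]

# Placeholder representation of a token whose value is 4 mod 10.
x = rng.normal(size=d_model) + mod10_features[4]

# In a real model, this edit would shift the head's successor output
# (e.g. from predicting '5' to predicting '8'); here we show only the
# mechanics of the edit itself.
out = W_OV @ edit_mod10(x, old_digit=4, new_digit=7)
```

In the paper's setting, the analogous arithmetic is performed on real model activations, which is how the authors edit head behavior and probe numeric representations.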
