ICLR Neural Networks Learn Representation Theory: Reverse Engineering how Networks Perform Group Operations

Poster in Workshop: Physics for Machine Learning

Bilal Chughtai · Lawrence Chan · Neel Nanda

Abstract:

We present a novel algorithm by which neural networks may implement composition for any finite group via mathematical representation theory, through learning several irreducible representations of the group and converting group composition to matrix multiplication. We show small networks consistently learn this algorithm when trained on composition of group elements by reverse engineering model logits and weights, and confirm our understanding using ablations. We use this as an algorithmic test bed for the hypothesis of universality in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks. By studying networks trained on various groups and architectures, we find mixed evidence for universality: using our algorithm, we can completely characterize the family of circuits and features that networks learn on this task, but for a given network the precise circuits learned -- as well as the order they develop -- are arbitrary.
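To make the idea of "converting group composition to matrix multiplication" concrete, here is a minimal sketch, not the authors' code or a trained network. It assumes the symmetric group S_3 and uses its 3x3 permutation representation as a stand-in for learned representations; this representation is reducible (trivial plus standard), whereas the abstract refers to several irreducible ones. The trace-based score and the helper names `perm_matrix`, `compose`, and `predict` are illustrative assumptions, not taken from the abstract.

```python
from itertools import permutations

import numpy as np


def perm_matrix(p):
    """Matrix M with M[p[i], i] = 1, so that M @ e_i = e_{p[i]}."""
    n = len(p)
    m = np.zeros((n, n))
    m[list(p), np.arange(n)] = 1.0
    return m


def compose(p, q):
    """Ground-truth composition p∘q: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))


# All six elements of S_3, each mapped to its (assumed) representation matrix.
elements = list(permutations(range(3)))
rho = {g: perm_matrix(g) for g in elements}


def predict(a, b):
    """Predict a∘b by scoring every candidate c with tr(rho(a) rho(b) rho(c)^-1).

    Permutation matrices are orthogonal, so the inverse is the transpose.
    The score counts positions where a∘b and c agree, so it is maximal
    only when c equals a∘b.
    """
    ab = rho[a] @ rho[b]  # group composition becomes a matrix product
    scores = [np.trace(ab @ rho[c].T) for c in elements]
    return elements[int(np.argmax(scores))]


# Compare the matrix-based prediction against direct composition.
all_correct = all(
    predict(a, b) == compose(a, b) for a in elements for b in elements
)
print("matrix-based composition matches ground truth:", all_correct)
```

The same pattern should carry over to other finite groups by swapping in a faithful representation of that group; how a network distributes this computation across its weights is what the paper investigates and is not modelled here.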
