Correlations in the Data Lead to Semantically Rich Feature Geometry Under Superposition
Abstract
Recent advances in mechanistic interpretability have shown that many features represented by deep learning models can be captured by dictionary learning approaches such as sparse autoencoders. However, our understanding of the structures formed by these internal representations is still limited. Initial “toy-model” analyses showed that, in an idealized setting, features can be arranged in local structures, such as small regular polytopes, through a phenomenon known as superposition. These local structures, however, have not been observed in real language models. Instead, language models display rich structures, like semantically clustered representations or ordered circles for the months of the year, which are not predicted by current theories. In this work, we introduce Bag-of-Words Superposition (BOWS), a framework in which autoencoders (AEs) with a non-linearity are trained to compress sparse, binary bag-of-words vectors drawn from Internet-scale text. Our framework reveals that, under restrictive bottlenecks or when trained with weight decay, non-linear AEs linearly encode the low-rank structure in the data, arranging feature representations according to their co-activation patterns. This linear superposition gives rise to structures like ordered circles and semantic clusters, similar to those observed in language models. Our findings suggest that the semantically meaningful structures observed in language models could arise from compression alone, without necessarily having a functional role beyond efficiently arranging feature representations.
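The setup described above can be illustrated with a minimal sketch: a non-linear autoencoder trained to reconstruct sparse binary bag-of-words vectors through a restrictive bottleneck, with weight decay as the alternative regularization regime. All specifics below (vocabulary size, bottleneck width, the topic-based synthetic data standing in for Internet-scale text, and the placement of the non-linearity) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed details, not the paper's code): a non-linear AE
# compressing sparse binary bag-of-words vectors whose words co-activate
# according to simple topic structure.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024      # number of word features (assumed)
BOTTLENECK = 32        # restrictive hidden width (assumed)
WORDS_PER_DOC = 10     # active words per bag (assumed)
N_TOPICS = 16          # topics inducing co-activation correlations (assumed)


def sample_bow_batch(batch_size: int) -> torch.Tensor:
    """Sample binary bag-of-words vectors where words from the same topic
    co-activate, a crude stand-in for co-occurrence statistics of real text."""
    words_per_topic = VOCAB_SIZE // N_TOPICS
    x = torch.zeros(batch_size, VOCAB_SIZE)
    topics = torch.randint(N_TOPICS, (batch_size,))
    for i, t in enumerate(topics):
        start = t.item() * words_per_topic
        idx = start + torch.randint(words_per_topic, (WORDS_PER_DOC,))
        x[i, idx] = 1.0
    return x


class BOWSAutoencoder(nn.Module):
    """Linear encoder into a narrow bottleneck, non-linearity on the output."""

    def __init__(self, vocab_size: int, bottleneck: int):
        super().__init__()
        self.encoder = nn.Linear(vocab_size, bottleneck, bias=False)
        self.decoder = nn.Linear(bottleneck, vocab_size, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)                  # linear compression
        return torch.relu(self.decoder(z))   # non-linear reconstruction


model = BOWSAutoencoder(VOCAB_SIZE, BOTTLENECK)
# Weight decay is the second regime mentioned in the abstract; it is combined
# with the bottleneck here purely for illustration.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

for step in range(1000):
    x = sample_bow_batch(256)
    loss = nn.functional.mse_loss(model(x), x)
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, each word's decoder column is its feature representation;
# projecting these to 2D should show correlated (same-topic) words arranged
# close together, the kind of geometry the abstract attributes to compression.
```

In this toy setting, the geometric structure of interest lives in the decoder weights: words that tend to co-activate end up with similar (or systematically arranged) representation vectors, which is the compression-driven effect the abstract describes.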