Poster · Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
High Frequency Latents Are Features, Not Bugs
Xiaoqing (Lily) Sun · Josh Engels · Max Tegmark
Sparse autoencoders (SAEs) have shown success at decomposing language model activations into a sparse set of interpretable linear representations ("latents"). However, recent work identifies a challenge for SAEs: high frequency latents (HFLs) that are seemingly uninterpretable and occur on greater than 10% of tokens. In this work, we find that HFLs have many unique properties: 1) most HFLs have a "pair", another HFL pointing in the geometrically opposite direction that they never co-occur with; 2) the HFL subspace is robust to the SAE initialization seed, but HFLs themselves are not; 3) when an SAE is trained on activations with the HFL subspace ablated, no new HFLs are learned; and 4) HFLs have uniquely high similarity with the SAE bias vector. Our experiments lead us to hypothesize that the HFL subspace is not an artifact of SAE training, but instead represents a subspace of truly dense language model features. We present preliminary results interpreting this dense subspace, including finding HFLs that represent context position, HFLs that fire continuously on large blocks of text, HFLs that fire on topic sentences, and HFLs that fire on numeric data.
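The two defining diagnostics above (firing frequency above 10% and opposite-direction, never co-occurring pairs) can be sketched on toy data. This is a minimal illustrative example, not the authors' actual pipeline: the decoder matrix, activations, and the -0.9 cosine cutoff are assumptions; only the 10% frequency threshold comes from the abstract.

```python
import numpy as np

# Hypothetical sketch: given an SAE's (normalized) decoder directions and
# latent activations over a token stream, flag high-frequency latents (HFLs)
# and look for "paired" HFLs pointing in opposite directions that never
# fire on the same token.

rng = np.random.default_rng(0)
n_tokens, n_latents, d_model = 1000, 8, 16

# Toy decoder: latents 0 and 1 are constructed as an exact opposite pair.
W_dec = rng.normal(size=(n_latents, d_model))
W_dec[1] = -W_dec[0]
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Toy activations: latents 0 and 1 each fire often, but never together.
acts = np.zeros((n_tokens, n_latents))
mask = rng.random(n_tokens) < 0.5
acts[mask, 0] = rng.random(mask.sum())
acts[~mask, 1] = rng.random((~mask).sum())

# 1) High-frequency latents: those firing on more than 10% of tokens.
freq = (acts > 0).mean(axis=0)
hfls = np.flatnonzero(freq > 0.10)

# 2) Candidate pairs: decoder cosine similarity near -1 (cutoff of -0.9
#    is an assumption) and zero co-occurrence across the token stream.
pairs = []
for i in hfls:
    for j in hfls:
        if i < j:
            cos = W_dec[i] @ W_dec[j]
            cooc = np.mean((acts[:, i] > 0) & (acts[:, j] > 0))
            if cos < -0.9 and cooc == 0.0:
                pairs.append((int(i), int(j)))

print(hfls, pairs)  # latents 0 and 1 are flagged and matched as a pair
```

On this toy data the two constructed latents are recovered as the only HFL pair; on real SAE activations the same two statistics would be computed over a large token sample.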