Poster in Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
Latent Scaling Robustly Identifies Chat-Specific Latents in Crosscoders
Julian Minder · Clément Dumas · Bilal Chughtai · Neel Nanda
Abstract:
Most state-of-the-art language models undergo chat-tuning, yet its effects are still poorly understood. The recently introduced crosscoder – a variant of the sparse autoencoder – provides a tool for understanding how chat-tuning changes language models by systematically comparing the pre-trained and chat-tuned models' latent spaces. By analyzing the crosscoder training loss, we identify two theoretical challenges, termed Complete Shrinkage and Latent Decoupling, that can lead to latents being misclassified as chat-tuning-specific. We propose Latent Scaling, a technique to detect and filter out these issues, and empirically observe that it identifies highly interpretable latents that appear unique to the chat model's behavior.
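The abstract does not spell out how Latent Scaling works; the sketch below is only one plausible instantiation of a per-latent least-squares scaling test, under the assumption that each latent's contribution is rescaled to best explain a single model's activations. All names (`latent_scaling_beta`, `latent_acts`, `decoder_dir`, `target_acts`) are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a per-latent scaling test (assumed formulation,
# not the authors' implementation).
import numpy as np

def latent_scaling_beta(latent_acts, decoder_dir, target_acts):
    """Closed-form least-squares scalar beta minimizing
    sum_x || beta * f_j(x) * d_j - a(x) ||^2.

    latent_acts : (N,)   crosscoder latent activations f_j(x) on N inputs
    decoder_dir : (D,)   decoder direction d_j for this latent in one model
    target_acts : (N, D) that model's activations a(x) on the same inputs
    """
    # Numerator:  sum_x f_j(x) * <d_j, a(x)>
    num = np.einsum("n,nd,d->", latent_acts, target_acts, decoder_dir)
    # Denominator: sum_x f_j(x)^2 * ||d_j||^2
    den = np.sum(latent_acts ** 2) * np.dot(decoder_dir, decoder_dir)
    return num / den if den > 0 else 0.0

# A latent flagged as chat-specific by decoder norms alone, but whose beta on
# the *base* model's activations is far from zero, would be a candidate for
# Complete Shrinkage or Latent Decoupling rather than a genuinely
# chat-only feature.
```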