Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation
Steering Fine-Tuning Generalization with Targeted Concept Ablation
Helena Casademunt · Caden Juang · Samuel Marks · Senthooran Rajamanoharan · Neel Nanda
During fine-tuning, multiple solutions may emerge that perform similarly on the training data but generalize differently out of distribution. For instance, a deceptive model may be indistinguishable from an aligned model during training, yet behave catastrophically at deployment. We present a novel technique for controlling what models learn during fine-tuning by identifying and ablating specific sparse autoencoder latents that represent undesired concepts. Our approach steers models toward the intended generalization when multiple policies correctly fit the training data. We evaluate our method on two tasks, on both of which it significantly outperforms baselines: a gender bias task containing spurious correlations, and a double multiple-choice task in which models must learn to focus on the intended question while ignoring the other. On the gender bias task, our method completely eliminates the spurious correlation, leading to strong out-of-distribution performance. On double multiple choice, it succeeds in 12 out of 16 scenarios. Our results mark an initial step toward using interpretability techniques to ensure the safe and reliable deployment of frontier AI systems.
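The core operation described above, removing a chosen sparse autoencoder (SAE) latent from the model's activations so that fine-tuning cannot rely on the corresponding concept, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a simple ReLU SAE with encoder weights W_enc, encoder bias b_enc, and decoder weights W_dec, and a PyTorch model whose residual-stream activations can be intercepted with a forward hook; the names SAELatentAblator, latent_idx, and LAYER are hypothetical.

import torch

class SAELatentAblator:
    """Forward hook that removes one SAE latent's contribution from activations.

    Minimal sketch: assumes a simple ReLU SAE; the authors' method may differ.
    """

    def __init__(self, W_enc, b_enc, W_dec, latent_idx):
        self.W_enc = W_enc            # (d_model, n_latents) encoder weights
        self.b_enc = b_enc            # (n_latents,) encoder bias
        self.W_dec = W_dec            # (n_latents, d_model) decoder weights
        self.latent_idx = latent_idx  # index of the undesired-concept latent

    def __call__(self, module, inputs, output):
        acts = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        # Encode activations into the SAE latent space and read off the targeted latent.
        latent = torch.relu(acts @ self.W_enc + self.b_enc)[..., self.latent_idx]
        # Subtract that latent's reconstruction, i.e. remove the concept direction
        # with the strength the SAE assigns it at each token position.
        acts = acts - latent.unsqueeze(-1) * self.W_dec[self.latent_idx]
        return (acts,) + output[1:] if isinstance(output, tuple) else acts

# Hypothetical usage: attach the hook to one transformer block before fine-tuning,
# so gradient updates never depend on the ablated concept.
# ablator = SAELatentAblator(W_enc, b_enc, W_dec, latent_idx=1234)
# handle = model.transformer.h[LAYER].register_forward_hook(ablator)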