Poster in Workshop: XAI4Science: From Understanding Model Behavior to Discovering New Scientific Knowledge
TIME-AWARE FEATURE SELECTION: ADAPTIVE TEMPORAL MASKING FOR STABLE SPARSE AUTOENCODER TRAINING
T. Ed Li · Junyu Ren
Abstract:
Understanding the internal representations of large language models is crucial for ensuring their reliability and safety, and sparse autoencoders (SAEs) have emerged as a promising interpretability approach. However, current SAE training methods suffer from feature absorption, where learned features (or neurons) are absorbed into one another to minimize the $L_1$ penalty, making it difficult to consistently identify and analyze model behaviors. We introduce Adaptive Temporal Masking (ATM), a novel training approach that dynamically adjusts feature selection by tracking both rapid and gradual activation patterns through dual exponential moving averages, combined with Bayesian uncertainty estimation for robust thresholding. Through extensive experiments on the Gemma-2-2b model, we demonstrate that ATM achieves substantially lower absorption scores than existing methods such as TopK and JumpReLU SAEs while maintaining excellent reconstruction quality. These results establish ATM as a principled solution for learning stable, interpretable features in neural networks, providing a foundation for more reliable model analysis.
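The abstract describes the mechanism (dual exponential moving averages over activation patterns plus an uncertainty-aware threshold) but not its implementation. The following is a minimal, hypothetical PyTorch sketch of that idea only; the class name `AdaptiveTemporalMask`, the decay rates, and the use of a running variance as a stand-in for the Bayesian uncertainty term are all assumptions, not the authors' actual method.

```python
import torch

class AdaptiveTemporalMask:
    """Hypothetical sketch of dual-EMA feature masking for SAE training.

    Tracks a fast and a slow exponential moving average of feature
    activation magnitude, and keeps a feature only when its fast trend
    exceeds the slow trend by an uncertainty-scaled margin. All
    hyperparameters here are assumed, not taken from the paper.
    """

    def __init__(self, num_features, fast_decay=0.9, slow_decay=0.99, z=1.0):
        self.fast = torch.zeros(num_features)  # rapid activation trend
        self.slow = torch.zeros(num_features)  # gradual activation trend
        self.var = torch.ones(num_features)    # running variance (uncertainty proxy)
        self.fast_decay = fast_decay
        self.slow_decay = slow_decay
        self.z = z  # width of the uncertainty margin, in standard deviations

    @torch.no_grad()
    def update(self, acts):
        """acts: (batch, num_features) SAE feature activations.

        Returns a (num_features,) binary mask of features to keep.
        """
        mean_act = acts.abs().mean(dim=0)
        # Dual exponential moving averages over activation magnitude.
        self.fast = self.fast_decay * self.fast + (1 - self.fast_decay) * mean_act
        self.slow = self.slow_decay * self.slow + (1 - self.slow_decay) * mean_act
        # Running variance around the slow trend, used here as a crude
        # proxy for the paper's Bayesian uncertainty estimate.
        self.var = self.slow_decay * self.var + \
            (1 - self.slow_decay) * (mean_act - self.slow) ** 2
        # Keep features whose fast trend exceeds the slow trend by more
        # than z standard deviations.
        threshold = self.slow + self.z * self.var.sqrt()
        return (self.fast > threshold).float()
```

In use, the returned mask would multiply the SAE's latent activations during training (e.g., `latents = mask * latents` before the decoder), so that features with only transient or uncertain activity are suppressed rather than absorbed into other features.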