Poster in Workshop: Integrating Generative and Experimental Platforms for Biomolecular Design
Scaling Sparse Autoencoders to Interpret Protein Structure Prediction
John J. Yang · David Yang · Nithin Parsan
While protein structure prediction models have achieved remarkable success, their internal mechanisms remain poorly understood. We apply sparse autoencoders (SAEs) to ESM2-3B, the largest pretrained protein language model analyzed to date, enabling the first mechanistic interpretability studies of ESMFold's structure prediction capabilities. We also introduce Matryoshka SAEs, which organize sparse features hierarchically in nested layers, improving both model performance and interpretability. Our evaluations on sequence-based annotations and structural predictions demonstrate that SAEs trained on ESM2-3B capture significantly more biological concepts than those trained on smaller models. Through intervention-based case studies, we show how specific SAE features influence ESMFold's structural predictions, including the ability to increase surface hydrophobicity while maintaining structural integrity. We release our code, trained models, and visualization tools to facilitate further investigation by the research community.
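To make the Matryoshka SAE idea concrete, here is a minimal sketch of one plausible formulation: a standard ReLU sparse autoencoder whose latents are trained so that each nested prefix of the dictionary must reconstruct the input on its own, pushing coarse concepts into early latents and finer ones later. All names, sizes, and hyperparameters (e.g. `nested_sizes`, the 2560-dimensional hidden size used for ESM2-3B activations) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a Matryoshka-style sparse autoencoder (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MatryoshkaSAE(nn.Module):
    def __init__(self, d_model: int, n_latents: int, nested_sizes=(512, 2048, 8192)):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)
        self.nested_sizes = [m for m in nested_sizes if m <= n_latents]

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))          # sparse latent code
        losses = []
        for m in self.nested_sizes:          # each nested prefix must reconstruct x alone
            z_m = torch.zeros_like(z)
            z_m[..., :m] = z[..., :m]
            losses.append(F.mse_loss(self.decoder(z_m), x))
        recon_loss = torch.stack(losses).mean()
        sparsity = z.abs().mean()            # L1 penalty encourages sparse, interpretable features
        return recon_loss, sparsity, z


# Toy usage on random activations standing in for ESM2-3B hidden states.
sae = MatryoshkaSAE(d_model=2560, n_latents=8192)
acts = torch.randn(4, 2560)
recon, l1, z = sae(acts)
(recon + 1e-3 * l1).backward()
```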
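The intervention-based case studies could, in spirit, look like the following sketch: amplify one SAE feature direction in a layer's activations and pass the edited activations downstream to the folding head. The hook and folding calls (`get_activations`, `run_folding_from_acts`) and the feature index are hypothetical placeholders, not the paper's actual interface.

```python
# Sketch of steering a single SAE feature in protein language model activations
# (assumed interfaces; not the authors' released tooling).
import torch


@torch.no_grad()
def steer_feature(sae, acts: torch.Tensor, feature_idx: int, scale: float = 5.0):
    """Add `scale` units of one decoder direction to the activations."""
    direction = sae.decoder.weight[:, feature_idx]       # (d_model,) feature direction
    return acts + scale * direction


# acts = get_activations(esm2_3b, sequence, layer=24)    # hypothetical activation hook
# steered = steer_feature(sae, acts, feature_idx=1234, scale=8.0)
# structure = run_folding_from_acts(esmfold, steered)    # hypothetical downstream folding call
```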