Representing local protein environments with machine learning force fields
Abstract
The local structure of a protein strongly impacts its function and its interactions with other molecules. Representing local biomolecular environments remains a key challenge when applying machine learning to protein structures. The structural and chemical variability of these environments makes them difficult to model, and representation learning on them remains largely under-explored. In this work, we propose representations for local protein environments that leverage intermediate features from machine learning force fields (MLFFs). We extensively benchmark state-of-the-art MLFFs, comparing their performance across latent spaces and downstream tasks, and show that their embeddings capture local structural features (e.g., secondary structure motifs) and chemical features (e.g., amino acid identity and protonation state), organizing protein environments into a structured manifold. We show that these representations enable zero-shot generalization and transfer across diverse downstream tasks. As a case study, we build a physics-informed, uncertainty-aware chemical shift predictor that achieves state-of-the-art accuracy in biomolecular NMR spectroscopy. Our results establish MLFFs as general-purpose, reusable representation learners for protein modeling, opening new directions in representation learning for structured physical systems.