Poster in Workshop: Integrating Generative and Experimental Platforms for Biomolecular Design
Structure-Aware Language Models Trained on Ultra-Mega-Scale Metagenomic Data Improve Protein Folding Stability Prediction
Yehlin Cho · Kotaro Tsuboyama · Gabriel Rocklin · Sergey Ovchinnikov
Predicting absolute protein stability remains challenging due to the limited availability of experimental datasets and the intricate interplay between sequence and structure contributions to protein stability. In this study, we experimentally measured the folding stability of 2 million high-quality, diverse metagenomic MGnify sequences using high-throughput cDNA display methods. This dataset includes 814,000 wild-type (WT) proteins along with sequences carrying point mutations and insertions/deletions. We fine-tuned the structure-aware protein language models SaProt and ESM-3 on these stability measurements using LoRA (Low-Rank Adaptation), achieving a Spearman correlation of 0.87 on the MGnify test dataset. Our results demonstrate that these models can predict absolute folding stability, capturing the effects of both point mutations and insertions/deletions, even on datasets not derived from cDNA display that cover a wide stability range and include large proteins.
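As a rough illustration of the fine-tuning recipe described in the abstract, the sketch below attaches LoRA adapters to a protein language model and trains a single-output regression head on stability labels, using Hugging Face transformers and peft. The checkpoint (a public ESM-2 model as a stand-in), LoRA hyperparameters, and toy data are assumptions for illustration, not the authors' actual configuration; SaProt and ESM-3 additionally consume structure-aware token inputs.

```python
# A minimal sketch of LoRA fine-tuning for stability regression, assuming
# Hugging Face `transformers` + `peft`. The checkpoint, hyperparameters, and
# data below are illustrative stand-ins, not the setup used in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Stand-in backbone; the paper fine-tunes SaProt and ESM-3, which also take
# structure-aware tokens as input.
CHECKPOINT = "facebook/esm2_t33_650M_UR50D"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
# num_labels=1 with problem_type="regression" yields a single-value head,
# e.g. for absolute folding stability.
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=1, problem_type="regression"
)

# Attach low-rank adapters to the attention projections; only the adapters
# and the new regression head remain trainable.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.1,  # illustrative LoRA values
    target_modules=["query", "value"],     # ESM-style attention modules
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Toy batch: one sequence with a hypothetical stability label (kcal/mol).
batch = tokenizer(["MKVLILACLVALALA"], return_tensors="pt")
labels = torch.tensor([1.5])

out = model(**batch, labels=labels)  # MSE loss against the stability label
out.loss.backward()                  # one gradient step (optimizer omitted)
```

In practice, this step would sit inside a standard training loop (or the transformers Trainer) over the stability measurements, with held-out MGnify sequences evaluated by Spearman correlation as reported above.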