AntigenLM: Structure-Aware DNA Language Modeling for Influenza
Abstract
Language models have transformed sequence analysis, yet DNA foundation models often underperform task-specific approaches, for reasons that remain poorly understood. We introduce AntigenLM, a generative DNA language model explicitly pretrained on aligned, intact functional units of influenza genomes. This structure-aware pretraining enables AntigenLM to robustly capture evolutionary constraints and to transfer effectively to multiple downstream tasks. Fine-tuned on hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts antigenic variants for upcoming influenza seasons across diverse geographic regions, including minor subtypes and regions unseen during training, and outperforms conventional phylogenetic and evolution-based models. Beyond forecasting, AntigenLM achieves near-perfect subtype classification (~100% accuracy), demonstrating strong representation learning. Ablation studies show that pretraining on unaligned or fragmented gene sequences drastically degrades performance, underscoring the critical but previously overlooked role of both alignment and functional-unit preservation in DNA language modeling. AntigenLM thus provides not only a high-accuracy framework for predicting antigen evolution, essential for vaccine design, but also a methodological insight: respecting biological sequence structure can guide the next generation of DNA foundation models for functional genomics.
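The abstract's central methodological claim is that pretraining should respect sequence structure: aligned, intact functional units rather than unaligned fragments. The sketch below illustrates, under stated assumptions, how such a pretraining corpus might be assembled; it is not the authors' pipeline, and the FASTA path, the intact-ORF heuristic, and the codon-level tokenizer are all hypothetical.

```python
# Illustrative sketch only (not the authors' pipeline): assembling a
# pretraining corpus from aligned, intact gene segments, as the abstract
# describes. The file path, intact-ORF heuristic, and codon tokenizer
# are assumptions made for illustration.

from pathlib import Path

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    for line in Path(path).read_text().splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.strip().upper())
    if header is not None:
        yield header, "".join(chunks)

def is_intact_orf(seq):
    """Crude proxy for an 'intact functional unit': whole codons,
    a start codon, a terminal stop, and no premature stop."""
    if len(seq) % 3 != 0 or not seq.startswith("ATG"):
        return False
    stops = {"TAA", "TAG", "TGA"}
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return codons[-1] in stops and not any(c in stops for c in codons[:-1])

def tokenize(aligned_seq, k=3):
    """Non-overlapping k-mer tokens over the *aligned* sequence;
    gap-containing k-mers become distinct tokens, so every example
    shares the alignment's coordinate system."""
    return [aligned_seq[i:i + k] for i in range(0, len(aligned_seq) - k + 1, k)]

def build_corpus(fasta_path):
    """Whole segments only, never random fragments: the abstract's
    ablations report that fragmenting gene sequences degrades
    downstream performance."""
    return [
        tokenize(aligned)
        for _, aligned in read_fasta(fasta_path)
        if is_intact_orf(aligned.replace("-", ""))  # check intactness on the ungapped sequence
    ]

# corpus = build_corpus("aligned_HA_segments.fasta")  # hypothetical path
```

The point of the sketch is the filter in build_corpus: every training example is a whole, alignment-indexed gene segment, mirroring the structure-preservation constraint that the ablations identify as critical.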