Oral in Workshop on Spurious Correlation and Shortcut Learning: Foundations and Solutions
Assessing Robustness to Spurious Correlations in Post-Training Language Models
Julia Shuieh · Prasann Singhal · Apaar Shanker · John Heyer · George Pu · Samuel Denton
Keywords: Data bias, Large Language Models, Preference-Based Fine-Tuning, Spurious Correlations
Supervised and preference-based fine-tuning techniques have become popular for aligning large language models (LLMs) with user intent and correctness criteria. However, real-world training data often exhibits spurious correlations (arising from biases, dataset artifacts, or other "shortcut" features) that can compromise a model's performance or generalization. In this paper, we systematically evaluate three post-training algorithms, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO), across a diverse set of synthetic tasks and spuriousness conditions. Our tasks span mathematical reasoning, constrained instruction-following, and document-grounded question answering. We vary the degree of spurious correlation (10% vs. 90%) and investigate two forms of artifacts: "Feature Ambiguity" and "Distributional Narrowness." Our results show that models often, but not always, degrade under higher spuriousness. The preference-based methods (DPO/KTO) can demonstrate relative robustness in mathematical reasoning tasks. By contrast, SFT maintains stronger performance in complex, context-intensive tasks. These findings highlight that no single post-training strategy universally outperforms in all scenarios; the best choice depends on the type of target task and the nature of the spurious correlations.
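For readers less familiar with the preference-based objectives compared here, the standard DPO loss (Rafailov et al., 2023) contrasts a chosen response y_w against a rejected response y_l relative to a frozen reference policy. The abstract does not specify the hyperparameters used in this work, so the form below is the generic objective rather than the authors' exact configuration:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log\sigma\!\left(
\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
- \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
\right)\right]
\]

KTO, by contrast, optimizes a Kahneman-Tversky-inspired value function over unpaired desirable and undesirable examples rather than explicit preference pairs, which may partly explain why the two methods respond differently when spurious features contaminate the training signal.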