Oral in Workshop: SCOPE: SCALABLE OPTIMIZATION FOR EFFICIENT AND ADAPTIVE FOUNDATION MODELS
Overtrained Language Models Are Harder to Fine-Tune
Jacob Springer · Sachin Goyal · Kaiyue Wen · Tanishq Kumar · Xiang Yue · Sadhika Malladi · Graham Neubig · Aditi Raghunathan
Keywords: [ catastrophic forgetting ] [ transfer learning ] [ fine-tuning ] [ pre-training ]
Abstract:
Large language models are pre-trained with ever-increasing token budgets, operating under the largely unexamined premise that better pre-training performance translates into better downstream performance, even when the number of model parameters is kept fixed for efficient inference. In this work, we show that this widely held assumption is in fact false! Pre-training on an extremely large number of tokens eventually makes the model harder to fine-tune, leading to worse downstream performance. For instance, after instruction tuning or multimodal fine-tuning, OLMo-1B models pre-trained on 3T tokens underperform their 2.3T-token counterparts by over $2\%$ on standard LLM benchmarks. Controlled experiments and theoretical analysis show that the phenomenon of \textit{catastrophic overtraining} is both fundamental and universal. Our results suggest that as token budgets continue to scale, models will experience increasingly severe fine-tuning degradation across a wider range of tasks. This calls for a critical reassessment of pre-training design that takes the entire model lifecycle into account.
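The comparison the abstract describes can be illustrated with a rough sketch (not the authors' pipeline): fine-tune two pre-training checkpoints of the same model on identical instruction data and compare a held-out metric afterwards. The checkpoint paths, the Alpaca dataset, and all hyperparameters below are placeholder assumptions, not details from the paper.

```python
# Illustrative sketch only: fine-tune two pre-training checkpoints of the same
# model on the same instruction data, then compare held-out loss as a rough
# downstream proxy. Paths, dataset, and hyperparameters are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

CHECKPOINTS = {
    "2.3T tokens": "path/to/olmo-1b-checkpoint-2.3T",  # placeholder: intermediate checkpoint
    "3T tokens": "path/to/olmo-1b-checkpoint-3T",      # placeholder: final checkpoint
}

def fine_tune_and_eval(ckpt: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(ckpt)

    # Placeholder instruction-tuning data; any instruction dataset would do.
    raw = load_dataset("tatsu-lab/alpaca", split="train[:2000]")
    splits = raw.train_test_split(test_size=0.1, seed=0)

    def tokenize(batch):
        texts = [i + "\n" + o for i, o in zip(batch["instruction"], batch["output"])]
        return tokenizer(texts, truncation=True, max_length=512)

    tokenized = splits.map(tokenize, batched=True,
                           remove_columns=splits["train"].column_names)
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir="ft-" + ckpt.split("/")[-1],
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
        report_to=[],
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        data_collator=collator,
    )
    trainer.train()
    # Held-out loss after identical fine-tuning; lower is better.
    return trainer.evaluate()["eval_loss"]

if __name__ == "__main__":
    for name, path in CHECKPOINTS.items():
        print(f"{name}: held-out loss after fine-tuning = {fine_tune_and_eval(path):.4f}")
```

Under the paper's claim, the later (more overtrained) checkpoint would end up with the worse post-fine-tuning metric despite its better pre-training loss; the paper's actual evidence comes from standard LLM benchmarks rather than held-out loss.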