

The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

Tian Jin · Ahmed Imtiaz Humayun · Utku Evci · Suvinay Subramanian · Amir Yazdanbakhsh · Dan Alistarh · Gintare Karolina Dziugaite

2025 Poster

Abstract:

Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While much prior work focuses on post-training pruning, sparse pre-training, which combines pruning and pre-training into a single phase, provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs, examining 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for an equivalent compute budget, it yields a substantially smaller model, enabling potentially significant computational savings during inference.
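
To make the proposed modification concrete, below is a minimal sketch (not the authors' code) of a Chinchilla-style loss fit L(N, D) = E + A/N^alpha + B/D^beta in which the parameter count N is replaced by the average parameter count over pre-training, as the abstract proposes. The coefficient values are illustrative placeholders, and the linear pruning ramp between the 25% and 75% compute marks is an assumed schedule shape, not taken from the paper.

```python
import numpy as np

# Illustrative (unfitted) Chinchilla-style coefficients: placeholders only.
E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28


def average_param_count(n_dense, sparsity, start=0.25, end=0.75, steps=10_000):
    """Average parameter count over training for a hypothetical pruning
    schedule that ramps linearly from dense to the target sparsity between
    the `start` and `end` fractions of total training compute."""
    t = np.linspace(0.0, 1.0, steps)
    frac_pruned = np.clip((t - start) / (end - start), 0.0, 1.0) * sparsity
    return float(np.mean(n_dense * (1.0 - frac_pruned)))


def modified_chinchilla_loss(n_avg, d_tokens):
    """Loss prediction using the averaged parameter count in place of N."""
    return E + A / n_avg**alpha + B / d_tokens**beta


# Example: a 1B-parameter model pruned to 50% sparsity, trained on 20B tokens.
n_avg = average_param_count(n_dense=1.0e9, sparsity=0.5)
print(modified_chinchilla_loss(n_avg, d_tokens=20.0e9))
```

Under this linear ramp, the average parameter count works out to the dense count reduced by half the target sparsity (0.75B parameters in the example), which is the quantity the modified scaling law would consume.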
