

Poster in Workshop: 2nd Workshop on Navigating and Addressing Data Problems for Foundation Models (DATA-FM)

Data Efficient Pre-training for Language Models: An Empirical Study of Compute Efficiency and Linguistic Competence

Andreas Paraskeva · Max van Duijn · Maarten de Rijke · Suzan Verberne · Jan van Rijn


Abstract:

Training large language models such as Llama is compute- and data-intensive, limiting optimisation, hindering low-resource training, and increasing environmental impact. This paper examines the effectiveness of pre-training on small, curated datasets in terms of (i) linguistic competence and (ii) compute efficiency. We compare two small curated datasets for pre-training decoder-only Llama models of different sizes. The first dataset, TinyStories, is a collection of ChatGPT-generated children's stories. The second dataset, BabyLM, is a small, open-domain dataset used for training language models in the BabyLM challenge. We perform experiments with increasing amounts of data (yielding a learning curve) and size variants of a Llama-based architecture. We find that Llama models trained on BabyLM outperform those trained on TinyStories in formal linguistic competence, while both datasets yield comparable results on functional linguistic tasks. Our analysis generally indicates more robust training on BabyLM, with lower observed variance across training instances. These findings suggest promising directions for data-efficient pre-training of language models. Narrative data benefits early-stage training, and its inclusion in curriculum learning settings is worth investigating. BabyLM shows potential in resource-constrained settings by helping select promising candidate models, given that small data samples appear representative of the model's ultimate performance. Future work will expand datasets and benchmarks to validate these insights.
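The abstract does not give implementation details. As a rough illustration of the learning-curve setup it describes (pre-training small decoder-only Llama-style models on increasing fractions of a curated dataset), the sketch below uses Hugging Face transformers and the publicly available TinyStories dataset; BabyLM would be loaded analogously. The dataset identifier, tokenizer, model dimensions, data fractions, and training hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of a learning-curve pre-training run (assumed setup, not the paper's).
from datasets import load_dataset
from transformers import (AutoTokenizer, LlamaConfig, LlamaForCausalLM,
                          Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling)

# Placeholder tokenizer; the paper does not specify which tokenizer was used.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# TinyStories is hosted on the Hugging Face Hub.
stories = load_dataset("roneneldan/TinyStories", split="train")

for fraction in (0.1, 0.25, 0.5, 1.0):  # increasing data budget traces a learning curve
    subset = stories.select(range(int(len(stories) * fraction)))
    tokenized = subset.map(tokenize, batched=True,
                           remove_columns=subset.column_names)

    # One small size variant of a Llama-style architecture (dimensions are assumptions).
    config = LlamaConfig(vocab_size=len(tokenizer), hidden_size=256,
                         num_hidden_layers=4, num_attention_heads=4,
                         intermediate_size=1024)
    model = LlamaForCausalLM(config)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"llama-tiny-{fraction}",
                               per_device_train_batch_size=16,
                               num_train_epochs=1,
                               report_to="none"),
        train_dataset=tokenized,
        # mlm=False gives standard causal-LM label shifting.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
```

Each run would then be evaluated on formal and functional linguistic-competence benchmarks to populate one point of the learning curve; repeating runs with different seeds gives the variance across training instances mentioned in the abstract.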
