

Poster in Workshop: I Can't Believe It's Not Better: Challenges in Applied Deep Learning

Data Mixing can Induce Phase Transitions in Knowledge Acquisition

Xinran Gu · Kaifeng Lyu · Jiazheng Li · Jingzhao Zhang


Abstract:

Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. First, through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this threshold, it rapidly memorizes more biographies. We then adopt an information-theoretic perspective to understand and characterize the existence and value of the thresholds. Based on these insights, we identify two mitigation strategies that improve the efficiency of knowledge acquisition from knowledge-dense datasets, and validate their effectiveness on both synthetic and real-world Wikipedia datasets.
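The experimental setup described in the abstract, mixing a small knowledge-dense corpus into web-scraped data at a fixed ratio, can be illustrated with a minimal sampling sketch. This is not the authors' implementation; all names (web_docs, bio_docs, r) are hypothetical placeholders used only to show how a mixing ratio governs batch composition.

# Minimal sketch (not the paper's code): build training batches in which each
# example is drawn from the knowledge-dense set with probability r and from
# the web-scrape set otherwise. Names are illustrative placeholders.
import random

def mixed_batch_iterator(web_docs, bio_docs, r, batch_size, seed=0):
    """Yield batches sampled from two corpora at mixing ratio r."""
    rng = random.Random(seed)
    while True:
        batch = [
            rng.choice(bio_docs) if rng.random() < r else rng.choice(web_docs)
            for _ in range(batch_size)
        ]
        yield batch

# Example: r = 0.02 means roughly 2% of examples come from the biography data.
# The abstract's claim is that below some critical r the model memorizes almost
# nothing from this set, while above it memorization rises rapidly.
if __name__ == "__main__":
    web = [f"web_{i}" for i in range(1000)]
    bios = [f"bio_{i}" for i in range(100)]
    batches = mixed_batch_iterator(web, bios, r=0.02, batch_size=8)
    print(next(batches))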
