

Poster in Workshop: Self-Improving Foundation Models Without Human Supervision

D3: A Large Dataset for Training Code Language Models to Act Diff-by-Diff

Ulyana Piterbarg · Kanishk Gandhi · Lerrel Pinto · Noah Goodman · Rob Fergus

Keywords: [ data filtering ] [ synthetic data ] [ language models ] [ code synthesis ]


Abstract:

We introduce D3 ("Diverse Data for Diff-by-Diff Coding"), a large dataset for training LMs to iteratively synthesize general-purpose Python source code by generating file diffs. D3 frames code synthesis as a goal-conditioned sequential decision-making problem, where goals, states, and actions are represented by token sequences corresponding to the description of a functionality to add, the current contents of a file, and a file diff, respectively. The dataset contains 8 billion tokens of instruction + file-state + file-diff-sequence examples sampled from 850,000 human-written Python source files. To construct D3, we filter, augment, and annotate source code from The Stack by sampling synthetic file-diff sequences with a code analysis tool and labeling each sample with an LLM-generated rationale. In our experiments, we show that mid-training LMs like Llama 3.2 1B and 3B on D3 prior to supervised fine-tuning (SFT) on task-curated data improves performance on synthesis and editing tasks. On benchmarks like HumanEvalSynth and HumanEvalFix, we observe improvements in pass@1 of 3 to 6 points compared to direct SFT. D3-trained models are particularly strong at completing partial human-written solutions to programming problems.
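To make the goal/state/action framing concrete, the sketch below shows one possible way a D3-style trajectory could be represented and flattened into a single token sequence for language-model training. The class names, field names, and tag-based serialization format are illustrative assumptions for exposition, not the released D3 schema.

from dataclasses import dataclass

@dataclass
class DiffStep:
    """One action in a diff-by-diff trajectory."""
    rationale: str   # LLM-generated rationale explaining the edit (assumed field)
    file_diff: str   # unified diff applied to the current file state

@dataclass
class D3Example:
    """One trajectory: goal, initial state, and a sequence of diff actions."""
    instruction: str        # goal: description of the functionality to add
    initial_file: str       # state: current contents of the Python file
    steps: list[DiffStep]   # actions: sequence of file diffs (with rationales)

def to_training_text(ex: D3Example) -> str:
    """Flatten a trajectory into one sequence; the tag format here is hypothetical."""
    parts = [
        f"<instruction>\n{ex.instruction}\n</instruction>",
        f"<file>\n{ex.initial_file}\n</file>",
    ]
    for step in ex.steps:
        parts.append(f"<rationale>\n{step.rationale}\n</rationale>")
        parts.append(f"<diff>\n{step.file_diff}\n</diff>")
    return "\n".join(parts)

Under this framing, a model trained on such sequences learns to emit the next rationale and diff conditioned on the instruction, the current file state, and the diffs applied so far.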
