Poster in Workshop: Will Synthetic Data Finally Solve the Data Access Problem?

[Tiny] Parameterized Synthetic Text Generation with SimpleStories

Lennart Finke · Juan Rodriguez · Thomas Dooms · Mat Allen · Thomas Marshall · Noa Nabeshima · Dan Braun


Abstract:

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million stories each in English and Japanese. Our method employs parametrization of prompts with features at multiple levels of abstraction, allowing for systematic control over story characteristics to ensure broad syntactic and semantic diversity. Building on and addressing limitations in the TinyStories dataset, our approach demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.
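
As a rough illustration of what prompt parameterization can look like, the following minimal sketch samples one value per feature and fills a prompt template. The feature names, their values, the template wording, and the helper functions `sample_features` and `build_prompt` are all hypothetical; the abstract does not specify the features or prompts actually used for SimpleStories.

```python
import random

# Hypothetical feature pools at different levels of abstraction.
# These are illustrative only and are not taken from the paper.
FEATURES = {
    "theme": ["friendship", "courage", "curiosity"],          # high-level content
    "style": ["fairy tale", "diary entry", "dialogue-heavy"],  # narrative form
    "grammar": ["past tense", "present tense"],                # low-level syntax
}


def sample_features(rng):
    """Pick one value for each feature, independently and uniformly."""
    return {name: rng.choice(values) for name, values in FEATURES.items()}


def build_prompt(features, language="English"):
    """Fill a fixed template with the sampled feature values."""
    return (
        f"Write a short story in simple {language}. "
        f"Theme: {features['theme']}. "
        f"Style: {features['style']}. "
        f"Grammar: {features['grammar']}."
    )


# A seeded generator keeps the sampled prompts reproducible.
rng = random.Random(0)
prompts = [build_prompt(sample_features(rng)) for _ in range(5)]

for prompt in prompts:
    print(prompt)
```

Sampling each feature independently means the number of distinct prompts grows multiplicatively with the number of features, which is one plausible way to obtain variety while the template keeps the language requirement fixed.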