Skip to yearly menu bar Skip to main content


Poster

Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling

Louis Bradshaw · Simon Colton

[ ] [ Project Page ]
2025 Poster

Abstract:

We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet based on their metadata, followed by a stage of pruning and segmentation using an audio classifier. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags, which we also provide. Dataset available at https://github.com/loubbrad/aria-midi.

Chat is not available.