Invited Talk
in
Workshop: 2nd Workshop on Navigating and Addressing Data Problems for Foundation Models (DATA-FM)

Invited Talk: Kyle Lo 🤝🗣️ The OLMo Cookbook: Open Recipes for Language Model Data Curation

Kyle Lo

2025 Invited Talk

Abstract

Abstract: Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it can be challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities, risks and limitations. In this talk, I'll present how we approach data curation research for OLMo, our project to develop and share fully open language models. Reflecting on our journey from OLMo 1 to our latest release of OLMo 2, I'll explore how data curation practices have matured across our work and the broader open data research ecosystem. Finally, I'll examine key challenges and opportunities for open data amid a rapidly changing language model landscape.

Bio: Kyle Lo is a research scientist at the Allen Institute for AI (Ai2), where he co-leads the OLMo project on open language modeling research. His current work focuses on data-driven approaches to model behavior and efficient language model experimentation. His research on language model development and adaptation, evaluation methods, and human-AI interaction has won awards at ACL, EMNLP, EACL and CHI. Kyle’s work on language models for science research assistance, including fact checking, summarization, and augmented reading, has been featured in Nature, Science, TechCrunch and other publications. Kyle holds a degree in Statistics from the University of Washington. Outside of work, he enjoys board games, boba tea, D&D, and spending time with his cat Belphegor.

Speaker

Kyle Lo

I’m a Lead Scientist at the Allen Institute for AI working on the OLMo and Semantic Scholar projects. I specialize in topics in natural language processing, machine learning and human-AI interaction. My areas of interest include:

* Open Data for Language Models
* Adapting Language Models to Specialized Texts
* Standards and Best Practices in NLP Evaluation
* NLP for Sensemaking over Large Collections
* AI-Powered Reading Assistance
