3rd Workshop on Navigating and Addressing Data Problems for Foundation Models (DATA-FM)
Abstract
The past year has witnessed remarkable advances in foundation models (FMs): new post-training paradigms such as reinforcement learning with verifiable rewards (RLVR) that strengthen reasoning, increasingly multimodal and agentic systems, and renewed attention to benchmark design and evaluation. Each of these advances depends on distinct data innovations: verifiable reward signals and reasoning traces for RLVR; aligned cross-modal corpora and interaction logs for multimodality and agency; and leak-resistant, representative test sets for evaluation. Taken together, these dependencies underscore the continuing centrality of data as a design variable at the forefront of FM research. Meanwhile, longstanding challenges in data collection, curation, and synthesis remain unresolved, and concerns surrounding copyright, privacy, and fairness have only intensified. Building on the success of the first two DATA-FM workshops at ICLR 2024 and 2025, the third edition will revisit these persistent issues while highlighting emerging ones at the frontiers of post-training, multimodality, and evaluation. By convening researchers and practitioners across diverse research communities, DATA-FM seeks to advance understanding of data’s evolving role in FMs and foster innovative solutions shaping the next generation of models and applications.