LLM-based agents struggle to execute complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Existing benchmarks fail to capture the procedural complexity and tool orchestration demands of real-world workflows. We introduce SOP-Bench, a benchmark of 2,000+ tasks from human expert-authored SOPs across 12 business domains (healthcare, logistics, finance, content moderation, etc.). Using a human-AI collaborative framework, experts crafted authentic SOPs while AI generated artifacts (tools, APIs, datasets), all human-validated, yielding realistic tasks with executable interfaces and ground-truth outputs.
SOP-Bench serves as a research enabler for systematically investigating agent architectures, model capabilities, and deployment considerations across diverse procedural tasks. In this talk, we discuss the challenges of real-world SOPs and a framework for evaluating them.
NLP and AI Researchers Affinity Social
This affinity group is for NLP and AI researchers with a prior background or interest in the humanities. As AI research becomes increasingly interdisciplinary, drawing on theories and methods from outside computer science, it is important to engage with what the humanities have to offer for studying language, culture, narrative, subjectivity, mind, emotion, etc. Disciplines like literary and cultural studies, media studies, philosophy, and history have a great deal to offer for studying, building, and improving language systems. Topics of interest include (but are not limited to): co-intelligence and co-creative systems, narrative understanding, cultural analytics, literary NLP, AI literacy, AI ethics, culture and cognition, etc.
Negotiating in AI: How to Get What You're Worth
Let's be honest, nobody teaches researchers how to negotiate in grad school. They spend years mastering ML, publishing papers, and building models, and then suddenly they're staring at an offer letter with no idea if it's good or how to push back.
That's exactly why we're hosting this social.
Rora has been in the trenches with researchers since 2017. We've worked with 1,000+ people navigating offers from top AI labs, startups, and big tech, and have helped negotiate over $1 billion in total compensation. We've seen it all, and we want to share what actually works. Come hang out, grab a drink, and ask the questions you've been too afraid to ask your recruiter.
Whether you're mid-process, just starting to explore, or already have an offer in hand, there's something here for you. No slides, no fluff — just real talk from people who've been doing this for a while.
NL-to-SQL Systems
Large language models are increasingly used to translate natural language into SQL. But how reliable are they in real-world settings?
In this session, we present a focused evaluation framework for measuring NL-to-SQL performance, including execution correctness, robustness, and query efficiency under varying levels of database context. We’ll discuss how structured QA and validation approaches can help move LLM systems from benchmark success to production reliability.
We invite researchers and practitioners working on NL-to-SQL systems, LLM evaluation, and database applications to participate in the discussion and share perspectives from real-world deployments.
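The execution-correctness metric mentioned above can be illustrated with a minimal sketch: run the predicted and gold SQL against the same database and compare result sets order-insensitively. The schema, queries, and helper name below are hypothetical, for illustration only; production evaluators also need timeouts, sandboxing, and type normalization.

```python
import sqlite3

def execution_match(db: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    """Compare predicted and gold SQL by executing both and comparing
    result sets order-insensitively (a common execution-accuracy metric)."""
    try:
        pred_rows = db.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # predicted query failed to parse or execute
    gold_rows = db.execute(gold_sql).fetchall()
    # Sort rows so semantically equivalent queries that differ only
    # in ORDER BY still match.
    return sorted(map(tuple, pred_rows)) == sorted(map(tuple, gold_rows))

# Toy database and queries (hypothetical schema for illustration).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "EU", 10.0), (2, "US", 25.0), (3, "EU", 5.0)])

gold = "SELECT region, SUM(total) FROM orders GROUP BY region"
pred = "SELECT region, SUM(total) AS s FROM orders GROUP BY region ORDER BY s"
print(execution_match(db, pred, gold))  # True: same result set
```

Order-insensitive comparison is a deliberate choice: it avoids penalizing queries that return the right rows in a different order, at the cost of missing cases where ordering is actually part of the task intent.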
What is the Role of World Models in Decision-Making?
World models have recently gained popularity thanks to impressive results and the availability of data. However, no consensus has been reached on how they should help improve decision-making. This social aims to foster discussions around the role of world models in decision-making.
World models are used in many ways. Some approaches use world models as synthetic data generators. Others leverage them at test time to reason or to evaluate policies. Video models, combined with an inverse dynamics model, can be used to infer actions. Discussing these choices is essential to assessing their role in decision-making.
When data is scarce, some methods learn a world model only to improve representation learning, relying on model-free methods for decision-making to avoid hallucinations. We believe it is fruitful to discuss the scenarios in which world models can be trusted.
Discussion can focus on the situations in which world models have an edge over algorithms that do not directly learn the transition dynamics. It is also not yet clear whether world models are more relevant at certain levels of hierarchy.
Finally, discussing why world models can enable better generalization can provide an answer to the question asked in this social.
Verification Practices Across the ML Research Lifecycle
Women in Machine Learning (WiML) Social @ ICLR
The Women in Machine Learning (WiML) initiative, founded in 2006, was created to connect and support the relatively small but growing community of researchers in ML who identify as women or nonbinary. Over the years, WiML events at conferences such as NeurIPS, ICML, ICLR, and other conferences have highlighted cutting-edge research, fostered mentorship, and created space for meaningful technical exchange. For ICLR, we propose a WiML Social that keeps WiML’s core mission while emphasising interaction and networking. The event will feature a panel discussion, facilitated roundtables, and structured networking activities designed to spark in-depth conversations and future collaborations. Building on the success of the highly interactive WiML formats at ICLR in the past, we will include small-group discussions, allowing participants to engage directly on open research questions and career paths. The goals remain the same: to celebrate the work of researchers who identify as women or nonbinary, to create opportunities for junior and senior participants to connect, and to strengthen community ties within the broader ICLR ecosystem.
Agent benchmarks are evolving from static prediction tasks to multi-step tasks requiring tool use, yet most evaluations remain short-horizon, loosely coupled, and state-agnostic. These settings fail to capture the properties that determine reliability in real workflows: persistent state, dense relational structure, role-based access control, and policy-constrained execution. This talk introduces an evaluation framework built on three components: workflow-grounded task definitions, scenario construction derived from operational traces, and a stateful execution environment with persistent databases and verifiable outcomes. Agents are evaluated across tasks that require multi-step planning over interconnected subsystems, exposing failure modes that simply don't appear in short-horizon benchmarks. Scoring combines automated outcome verification with rubrics that assess domain reasoning quality, including constraint prioritization and decision-making under ambiguity. We show that performance degrades predictably with increasing horizon length and constraint density. The primary bottleneck for frontier models is strategic planning under constraint, not tool invocation accuracy. Structural failures surface reliably through outcome checks, while subtler breakdowns in judgment require reasoning-quality assessment. These results suggest that deployable autonomy requires evaluation frameworks that integrate realistic workflows, operational constraints, stateful simulation, and scoring that captures both outcome correctness and the quality of domain-grounded reasoning. This session outlines the design principles, empirical findings, and open research challenges that define this next generation of agent evaluation.
As large language models (LLMs) evolve from short-burst chatbots into long-horizon autonomous agents, progress is increasingly bottlenecked by verification asymmetry: rapid gains in domains with cheap correctness signals (e.g., math and code) contrast sharply with limited progress in tasks with weak or delayed verification, such as research planning and strategic decision-making. This talk argues that evaluation and reinforcement learning (RL) beyond easily verifiable domains are the next critical frontier for AI capability. We present results from three new evaluation frameworks. Humanity’s Last Exam (HLE) shows that frontier models are frequently wrong and overconfident at the human-expert level. The Remote Labor Index (RLI) demonstrates that current agents automate only ~2.5% of real, paid freelance work. Visual ToolBench reveals that 70–80% of multimodal agent failures stem from visual perception rather than reasoning. To close these gaps, we introduce Rubrics as Rewards (RaR) within a Group Relative Policy Optimization (GRPO) framework. We show that Dynamic Rubrics, which adaptively elicit evaluation criteria by contrasting model outputs during training, outperform static human-written rubrics and reduce reward hacking in the high-reward regime. These findings motivate a shift from static benchmarks to high-fidelity RL environments, such as Scale Gymnasium, that train agents through interaction rather than imitation.
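As a rough sketch of how rubric-based rewards can plug into a group-relative update (this is not the RaR implementation; the keyword-matching judge and the example rubric below are hypothetical stand-ins for an LLM grader):

```python
from statistics import mean, pstdev

def rubric_reward(output: str, rubric: list[tuple[str, float]]) -> float:
    """Score an output as the weighted fraction of rubric criteria it
    satisfies. Each criterion is a simple keyword check here; in
    practice an LLM judge would grade each criterion."""
    total = sum(w for _, w in rubric)
    hit = sum(w for crit, w in rubric if crit.lower() in output.lower())
    return hit / total

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward against the group of
    sampled outputs for the same prompt (mean-zero, unit-variance)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # uninformative group: no gradient signal
    return [(r - mu) / sigma for r in rewards]

# Hypothetical rubric and a group of two sampled outputs for one prompt.
rubric = [("cites evidence", 1.0), ("states assumptions", 1.0), ("limitation", 0.5)]
group = [
    "The plan cites evidence and states assumptions; one limitation is cost.",
    "A quick answer with no justification.",
]
rewards = [rubric_reward(o, rubric) for o in group]
advs = group_relative_advantages(rewards)
print(rewards, advs)  # [1.0, 0.0] [1.0, -1.0]
```

The dynamic-rubric idea in the talk goes further: rather than fixing `rubric` in advance, the criteria themselves would be elicited by contrasting model outputs during training.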
Queer in AI Social
This social is an informal community gathering organised by Queer in AI to foster connection, visibility, and mutual support among LGBTQ+ researchers and allies in AI/ML. The event provides a welcoming space for queer scientists, students, and practitioners to meet, share experiences, and build professional and personal networks within the broader research community.
X-informed AI
Scientific ML is more than applying ML to science — it's about letting domain structure shape the model itself. In this social, we bring together researchers working across domains (neuroscience, PDEs, physics simulations, and beyond) to discuss: What is the urgent 'X' in your X-informed AI? How do you identify the right inductive bias? How do we build a community that supports this? Come share how your domain shapes your models.
AI for Science Social
AI for Science is rapidly emerging as a key area where machine learning can accelerate discovery in domains such as materials science, biology, physics, and mathematics. Progress in this space increasingly depends on collaboration between machine learning researchers and domain scientists, as well as open ecosystems of data, models, and tools. This social aims to create an informal space at ICLR for researchers interested in AI for Science and open collaboration to connect, exchange ideas, and build new collaborations.
The social will bring together participants from several ICLR workshops related to AI for Science, including AI4Mat, FM4Science, AI&PDE, and Sci4DL, and foster interaction across these communities. The session will focus on practical challenges and opportunities in building open scientific ecosystems, including open datasets and benchmarks, open-source tools and foundation models for science, cross-disciplinary collaboration, and community-driven initiatives.
The format will emphasize interaction and networking, with brief opening remarks, structured speed networking, themed small-group discussions, and a short open panel conversation where participants can share insights and identify opportunities for collaboration. The social will also provide a welcoming environment for students and early-career researchers to engage with both academic and industry researchers working on AI-driven scientific discovery.
LLMs for Metascience Social
This social is focused on facilitating conversations across industry, academia, and government about how we are all using AI to understand the current landscape of AI research. What’s coming at us the fastest? What problems have we effectively “solved”? Where can existing methods be applied to support progress in overlooked areas?
This is meant to be a relaxed, discussion-driven space. We’re excited for a cross-sector exchange where people can share tools, workflows, rough ideas, and open questions, as well as the challenges of using AI systems to reason about science itself. Input from those who conduct, fund, and review research will be especially valuable in shaping a more complete picture of the different components in the AI research landscape.
Evaluating LLMs Holistically in a World Where Benchmarks Leak
Benchmark contamination is no longer a theoretical concern. As frontier models are trained on open-web data, public test sets are routinely absorbed into pre-training corpora — and beyond passive contamination, labs are known to actively optimize against known benchmarks and selectively report favorable results. When a model claims state-of-the-art on a public leaderboard, it is increasingly unclear whether that reflects genuine generalization or familiarity with the test. Private-only benchmarks — never released publicly, evaluated under controlled conditions, and continuously refreshed — offer a structural solution. If a model cannot train against a benchmark it has never seen, contamination becomes impossible by design. Built across capability families rather than isolated skills, such benchmarks can also surface cross-domain failure modes that narrow public evaluations miss entirely. This social will examine what private, holistic evaluation infrastructure could look like in practice, with short talks from practitioners followed by open discussion on what it would take for the community to coalesce around shared private evaluation standards.
World Models and Beyond: Bridging Video, Simulation, and Robotic Intelligence
World models have emerged as a unifying thread across video generation, model-based reinforcement learning, and robotic planning — yet the communities working on these problems often don't overlap at conferences. This Social brings together researchers working on learned simulators, video prediction, world models for decision-making, and sim-to-real transfer for an informal, discussion-first session.
Cooperative AI Foundation
As AI agents increasingly operate in shared environments — negotiating, transacting, and making joint decisions — questions of coordination and cooperation become inseparable from questions of system design. How do we incentivize cooperation among autonomous agents when no single party controls the system? What coordination protocols, commitment devices, and oversight mechanisms work in distributed settings? How do we prevent collusion and ensure robustness as agent networks scale?
This social is hosted by Cooperative AI Foundation and The Institute for Decentralized AI as a gathering for researchers working on multi-agent systems, mechanism design, AI safety, distributed systems, and related areas who are interested in how cooperative and decentralized approaches to AI intersect and inform each other. Drinks will be provided.