Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665x over naive sequential application, allowing nonlinear RNNs to be trained at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to that of similarly-sized Transformer and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
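The core idea above can be illustrated with a minimal sketch: stack the residuals `h_t - f(h_{t-1}, x_t)` into one system and apply Newton's method, so each iteration evaluates `f` and its Jacobian at all timesteps simultaneously. This toy example (a scalar tanh recurrence, names and constants are illustrative, not the ParaRNN implementation) solves the Newton linear system with a plain loop; that loop is itself a linear recurrence, which is what the paper's custom parallel reductions would replace.

```python
import numpy as np

def f(h_prev, x, a=0.5):
    # Toy nonlinear recurrence cell: h_t = tanh(a * h_{t-1} + x_t)
    return np.tanh(a * h_prev + x)

def df(h_prev, x, a=0.5):
    # Derivative of f with respect to h_prev
    return a * (1.0 - np.tanh(a * h_prev + x) ** 2)

def sequential_rnn(x, h0=0.0):
    # Naive sequential evaluation, one timestep at a time
    h = np.empty_like(x)
    prev = h0
    for t in range(len(x)):
        prev = f(prev, x[t])
        h[t] = prev
    return h

def newton_parallel_rnn(x, h0=0.0, iters=30, tol=1e-12):
    # Solve the stacked system  h_t - f(h_{t-1}, x_t) = 0  with Newton's method.
    # The residual and Jacobian evaluations below touch all timesteps at once
    # (parallelizable across t); only the bidiagonal solve remains a linear
    # recurrence, amenable to a parallel scan/reduction.
    T = len(x)
    h = np.zeros(T)  # initial guess
    for _ in range(iters):
        prev = np.concatenate(([h0], h[:-1]))
        r = h - f(prev, x)   # residual at every t, computed in parallel
        c = df(prev, x)      # Jacobian sub-diagonal at every t
        # Newton step: delta_t = -r_t + c_t * delta_{t-1}  (a LINEAR recurrence)
        delta = np.empty(T)
        acc = 0.0
        for t in range(T):
            acc = -r[t] + c[t] * acc
            delta[t] = acc
        h = h + delta
        if np.max(np.abs(delta)) < tol:
            break
    return h
```

Because the stacked system is lower triangular, each Newton sweep makes at least one additional timestep exact, so the iterate matches the sequential evaluation after at most `T` iterations (and far fewer in practice for contractive cells).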
Jump Trading processes terabytes of market data daily, searching for predictive signals across thousands of instruments worldwide. Under tight latency constraints, our pipeline spans raw order book events, NLP signal integration, model training, and live execution. The nature of financial data, from irregularity and non-stationarity to microstructure noise, poses unique challenges and opens new directions in HPC, reinforcement learning, AI, and agent research. In this talk, we describe quantitative research workflows applying foundation models and multi-agent systems to market data. We present results from fine-tuning large language models on textual input to produce live trading signals, and from multi-agent systems that combine a broad range of unstructured sources into forecasts consumed by human traders and automated strategies alike. We also discuss hallucination risk management, strict point-in-time correctness, and the evaluation methodology we apply to ensure rigor and reliability in production.
NL-to-SQL systems
Large language models are increasingly used to translate natural language into SQL. But how reliable are they in real-world settings? In this session, we present a focused evaluation framework for measuring NL-to-SQL performance, including execution correctness, robustness, and query efficiency under varying levels of database context. We’ll discuss how structured QA and validation approaches can help move LLM systems from benchmark success to production reliability. We invite researchers and practitioners working on NL-to-SQL systems, LLM evaluation, and database applications to participate in the discussion and share perspectives from real-world deployments.
Digital Olfaction Social
This discussion booth explores the emerging intersection of machine learning and smell technology, also known as digital olfaction. The space is designed for researchers, students, builders, and curious attendees interested in how AI can be used to detect, classify, generate, and interpret scent-related data. Topics may include electronic noses, olfactory sensing, multimodal AI, applications in healthcare, food quality, environmental monitoring, fragrance, and human-computer interaction. The goal is to create an open and interdisciplinary conversation around both the technical challenges and the creative opportunities in this fast-growing field. Whether you work directly on ML models for chemical signal analysis or are simply interested in the future of smell interfaces and sensory intelligence, this booth offers a place to exchange ideas, share projects, discuss collaborations, and connect with others working at the frontier of AI and olfaction.
AI compensation and negotiation
As AI reshapes industries, compensation is changing faster than researchers can track. New labs, startups, and top companies are competing for talent with vastly different pay structures, currencies, and cultural norms. Yet most researchers are never formally taught how to understand their worth or navigate these systems. The result is an uneven landscape where brilliant minds often make life-changing decisions without the information they need.
The session begins with a concise, data-driven talk on current AI compensation and negotiation trends, grounded in real stories and case studies. From there, a fireside chat and open Q&A invite candid, experience-driven insights from researchers who have navigated these conversations firsthand.
Key takeaways for attendees:
* How to evaluate compensation (salary, equity, bonuses) across industry roles such as Research/Applied/Data Scientist, Research/Machine Learning/Software Engineer, and more
* How to compare global opportunities and account for regional differences in pay structures
* How to identify leverage points and negotiate effectively at different career stages and levels
* How to respond to pushback and recognize red vs. green flags in job offers
* How to negotiate an offer's deadline without having the offer rescinded
* How to advocate for yourself without a competing offer in hand, and despite fears of jeopardizing the offer
A/B testing is the gold standard for evaluating e-commerce UI changes, yet it is expensive or infeasible for many merchants. At Shopify we have built SimGym, a scalable system for rapid offline A/B testing using merchant-specific AI shoppers that operate the browser as a human would. SimGym leverages per-merchant storefront logs from real shoppers to build the AI shoppers and runs them on both control and treatment storefronts to decide which alternative is better. We validate SimGym against real Add-to-Cart lift from historical UI changes on Shopify shops. Even without post-training alignment, SimGym accurately predicts the direction and relative magnitude of the treatment effect, reducing experiment iteration time from weeks to under an hour and enabling rapid screening without exposing real buyers.
The capabilities of frontier AI systems are advancing faster than the methods used to evaluate them. Benchmark saturation, unrepresentative preference data, and safety evaluations that fail to reflect real deployment conditions have created a widening gap between measured performance and real-world impact. Closing this gap requires treating evaluation not as an afterthought to model development, but as a first-class infrastructure problem, one that demands scientific rigor in data collection design, representative and verified human populations, and evaluation paradigms that go beyond static benchmarks. In this talk, we present Prolific's approach to building evaluation methodology and infrastructure for frontier AI. Prolific supports over 200,000 verified participants across 45 countries and has underpinned the data and methodology of more than 30,000 publications. We describe how we build on this foundation to develop evaluations that serve the needs of leading AI labs and research institutions: from demographically stratified preference studies and adversarial red-teaming to domain expert evaluation and alignment data collection. We share what we have learned from designing evaluations that capture realistic scenarios and surface failures that matter. We ground this in two case studies from our own research presented at ICLR 2026. “Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework” reveals how aggregate leaderboards conceal systematic preference disagreement across populations and how evaluation dimensions like trust and safety demand different methodological approaches than standard open-ended comparison. “The Missing Red Line: How Commercial Pressure Erodes AI Safety Boundaries” is an adversarial audit showing that even mild commercial objectives embedded in system prompts can override model safety training, even in scenarios with life-threatening consequences.
An open discussion led by the organizing committee on topics related to ICLR, such as the review process, policy, and venue.