ICLR Expo Talk Panel EUREKA: Evaluating and Understanding Large Foundation Models

Besmira Nushi ⋅ Vidhisha Balachandran ⋅ Vibhav Vineet

Abstract:

Rigorous evaluation of large foundation models is critical for assessing the state of the art, informing improvements, and guiding scientific advances in AI. It’s also crucial for app developers using these models. In practice, however, evaluation is hampered by benchmark saturation, lack of transparency, the difficulty of measuring generative tasks, and the large number of capabilities that a comprehensive model comparison must cover. We also need a deeper understanding of model failures and whether they are consistent over time.

Moreover, with models advancing in reasoning capabilities, a robust evaluation framework is necessary. This session introduces Eureka as a reusable and open framework for standardizing evaluations beyond single-score reporting. We’ll also present Eureka-Bench, which offers benchmarks for challenging and fundamental capabilities in language and vision, including reasoning skills (math, science, hard algorithmic and planning problems). Non-saturated benchmarks help identify meaningful differences between models.

We’ll present insights from analyzing 12 state-of-the-art models, uncovering granular weaknesses and guiding targeted improvements. We’ll also highlight findings from our recent paper on inference-time scaling, an empirical study of inference-time scaling methods for improving reasoning in LLMs across diverse, complex tasks. The study analyzes the effectiveness, cost-efficiency, and limitations of these methods, including the tradeoff between reasoning performance and compute.

Eureka is available as open source, fosters transparent and reproducible evaluations, and has gained significant industry interest, including in prominent press releases.

Useful links:

Blog: https://aka.ms/eureka-ml-insights-blog
Technical report on Eureka: https://aka.ms/eureka-ml-insights-report
Paper on Inference Time Scaling: https://arxiv.org/abs/2504.00294v1
GitHub repository: https://github.com/microsoft/eureka-ml-insights
Website: https://microsoft.github.io/eureka-ml-insights
