I Can't Believe It's Not Better: Where Large Language Models Need to Improve
Abstract
Large language models (LLMs) have advanced rapidly, yet these advances have also exposed persistent gaps, such as hallucination, brittle reasoning, alignment failures, and hard efficiency and scaling constraints, especially in safety-critical settings. Ideally, evidence of such limitations would quickly translate into improvements, but compute constraints and unfruitful approaches often stall iteration; meanwhile, publication norms still prioritize positive results over informative null or negative findings. This workshop creates a venue for negative results on LLMs, including: (i) rigorous studies that demonstrate and analyze limitations (e.g., leak-resistant reasoning probes, alignment stress tests, failure audits in critical applications), and (ii) applications of well-established ideas that did not deliver the expected gains, accompanied by analyses that identify failure modes, boundary conditions, and lessons learned. We welcome diagnostics, replications, counterfactual evaluations, and ablations that separate genuine capability from shortcut learning and clarify when methods break, why they break, and how to fix them. By aggregating evidence of negative results and actionable takeaways, the workshop aims to convert setbacks into robust principles and practices for building more reliable LLMs.