Poster

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike Merrill · Alexander Shaw · Nicholas Carlini · Boxuan Li · Harsh Raj · Ivan Bercovich · Lin Shi · Jeong Shin · Thomas Walshe · E. Kelly Buchanan · Junhong Shen · Guanghao Ye · Haowei Lin · Jason Poulos · Maoyu Wang · Jenia Jitsev · Marianna Nezhurina · Di Lu · Orfeas Menis Mastromichalakis · Zhiwei Xu · Zizhao Chen · Yue Liu · Robert Zhang · Leon Liangyu Chen · Anurag Kashyap · Jan-Lucas Uslu · Jeffrey Li · Jianbo Wu · Minghao Yan · Song Bian · Vedang Sharma · Ke Sun · Steven Dillmann · Akshay Anand · Andrew Lanpouthakoun · Bardia Koopah · Changran Hu · Etash Guha · Gabriel Dreiman · Jiacheng Zhu · Karl Krauth · Li Zhong · Niklas Muennighoff · Robert Amanfu · Shangyin Tan · Shreyas Pimpalgaonkar · Tushar Aggarwal · Xiangning Lin · Xin Lan · Xuandong Zhao · Yiqing Liang · Yuanli Wang · Zilong (Ryan) Wang · Jason Chou · David Heineman · Hange Liu · Harsh Trivedi · John Yang · Junhong Lin · Manish Shetty · Michael Yang · Nabil Omi · Negin Raoof · Shanda Li · Terry Yue Zhuo · Wuwei Lin · Yiwei Dai · Yuxin Wang · Wenhao Chai · Shang Zhou · Dariush Wahdany · Ziyu She · Jiaming Hu · Zhikang Dong · Yuxuan Zhu · Sasha Cui · Ahson Saiyed · Arinbjörn Kolbeinsson · Christopher Rytting · Ryan Marten · Yixin Wang · Alex Dimakis · Andy Konwinski · Ludwig Schmidt

Abstract

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks or are not difficult enough to meaningfully measure frontier models. To address this gap, we present Terminal-Bench 1.5: a carefully curated, hard benchmark composed of 74 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, a human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 50% on the benchmark, and we conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work.