NetArena: Dynamically Generated LLM Benchmarks for Network Applications
Abstract
As large language models (LLMs) expand into high-stakes domains like network system operations, evaluating their real-world reliability becomes increasingly critical. However, existing benchmarks risk contamination due to static design, show high statistical variance from limited dataset size, and fail to reflect the complexity of production environments. We introduce NetArena, a dynamic benchmark generation framework for network applications. NetArena features a novel abstraction and unified interface that generalizes across applications, effectively addressing the challenges of dynamic benchmarking posed by the diversity of network tasks. At runtime, users can generate unlimited queries on demand. NetArena integrates with network emulators to provide execution-time feedback on correctness, safety, and latency. We demonstrate NetArena on three representative applications and find that (1) it significantly improves the statistical reliability of comparisons among LLM agents (confidence interval overlap reduced from 85% to 0%), (2) agents achieve only 13–38% average performance (as low as 3%) on large-scale, realistic queries, and (3) it reveals finer-grained behaviors missed by static, correctness-only benchmarks. NetArena also enables use cases such as supervised fine-tuning (SFT) and reinforcement learning (RL) fine-tuning on network system tasks. Code is available anonymously at https://anonymous.4open.science/r/netarena_iclr2026-BE94/README.md
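To make the abstract's description concrete, the sketch below illustrates the general idea of a dynamic, emulator-backed benchmark: a generator that produces fresh task instances on demand and a scoring hook that returns execution-time feedback. All class and method names here (DynamicBenchmark, next_query, score) are hypothetical illustrations, not NetArena's actual API.

```python
# Minimal, hypothetical sketch of a dynamic-benchmark interface (not NetArena's real API):
# queries are generated on demand from a seeded sampler, and agent outputs are scored
# by an (elided) emulator-in-the-loop step reporting correctness, safety, and latency.
import random
from dataclasses import dataclass

@dataclass
class Query:
    prompt: str          # natural-language task handed to the LLM agent
    topology_seed: int   # seed used to instantiate an emulated network for this query

class DynamicBenchmark:
    """Generates unlimited query instances for one network application."""

    def __init__(self, app_name: str, seed: int = 0):
        self.app_name = app_name
        self.rng = random.Random(seed)

    def next_query(self) -> Query:
        # Sample a fresh task instance; a real generator would vary topology,
        # traffic, and intent rather than a single integer parameter.
        n_nodes = self.rng.randint(4, 64)
        prompt = (f"[{self.app_name}] Configure routing for a {n_nodes}-node "
                  f"topology so that all node pairs remain reachable.")
        return Query(prompt=prompt, topology_seed=self.rng.randint(0, 2**31))

    def score(self, query: Query, agent_output: str) -> dict:
        # Placeholder for execution-time evaluation: a real implementation would
        # apply the agent's configuration in an emulator and measure the outcome.
        return {"correct": False, "safe": True, "latency_ms": None}

if __name__ == "__main__":
    bench = DynamicBenchmark("routing")
    q = bench.next_query()
    print(q.prompt)
    print(bench.score(q, agent_output="<agent-generated config>"))
```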