Poster in Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
FASTER, CHEAPER, JUST AS GOOD: COST- AND LATENCY-CONSTRAINED ROUTING FOR LLMS
Javid Lakha · Minlan Yu · Rana Shahout
Large Language Models (LLMs) span a broad spectrum of sizes, each presenting distinct trade-offs in cost, latency, and performance that complicate large-scale deployment. Although larger models often provide higher-quality responses to complex prompts, their greater computational overhead and slower inference can degrade user experience in real-time applications. Meanwhile, AI development is moving toward compound AI systems that integrate multiple LLMs of different sizes. In such environments, deciding when to invoke smaller or larger models becomes critical, especially under shifting system loads, since high output quality, tight cost budgets, and acceptable response times must be balanced simultaneously. We propose SCORE, a routing system that maximizes response quality while respecting user-specified cost and latency constraints. For each incoming request, SCORE predicts each model's response quality and length and selects the option that best meets the current cost, latency, and quality requirements: either a less expensive model under lighter load or a more resource-intensive model for a complex prompt. By continually adapting these decisions as requests arrive, SCORE balances the system load, enforces budget limits, and maintains user satisfaction through timely, cost-effective, and accurate responses.
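To make the selection rule concrete, here is a minimal Python sketch of a constrained router in this spirit. Everything in it is an illustrative assumption rather than SCORE's actual design: the `ModelInfo` metadata, the placeholder `predict_quality` and `predict_length` functions standing in for SCORE's learned predictors, and the linear cost and latency estimates derived from the predicted output length.

```python
# Minimal sketch of cost- and latency-constrained routing, as described in
# the abstract. All names and formulas here are illustrative assumptions,
# not SCORE's actual predictors, API, or scoring rule.
from dataclasses import dataclass


@dataclass
class ModelInfo:
    name: str
    cost_per_token: float     # dollars per generated token (assumed pricing)
    tokens_per_second: float  # serving throughput under current load
    quality_prior: float      # crude stand-in for a learned quality score


def predict_quality(model: ModelInfo, prompt: str) -> float:
    """Placeholder for a learned per-request quality predictor in [0, 1]."""
    return model.quality_prior


def predict_length(model: ModelInfo, prompt: str) -> int:
    """Placeholder for a learned response-length predictor (in tokens)."""
    return 2 * len(prompt.split()) + 64


def route(prompt: str, models: list[ModelInfo],
          cost_budget: float, latency_budget: float) -> ModelInfo:
    """Pick the highest predicted-quality model whose estimated cost and
    latency fit the user's budgets; otherwise fall back to the cheapest."""
    feasible = []
    for m in models:
        n_tokens = predict_length(m, prompt)
        est_cost = n_tokens * m.cost_per_token
        est_latency = n_tokens / m.tokens_per_second
        if est_cost <= cost_budget and est_latency <= latency_budget:
            feasible.append((predict_quality(m, prompt), m))
    if feasible:
        return max(feasible, key=lambda qm: qm[0])[1]
    return min(models, key=lambda m: m.cost_per_token)


if __name__ == "__main__":
    # Hypothetical two-model deployment: under a loose latency budget the
    # larger model wins on quality; tightening it shifts traffic to the
    # smaller, faster model.
    models = [
        ModelInfo("small-llm", cost_per_token=1e-6,
                  tokens_per_second=120.0, quality_prior=0.70),
        ModelInfo("large-llm", cost_per_token=1e-5,
                  tokens_per_second=30.0, quality_prior=0.90),
    ]
    print(route("Summarize this contract.", models,
                cost_budget=0.01, latency_budget=5.0).name)  # large-llm
    print(route("Summarize this contract.", models,
                cost_budget=0.01, latency_budget=1.0).name)  # small-llm
```

In a live system, `tokens_per_second` would track shifting load per model, which is how a scheme like this can divert requests away from congested large models, but the adaptation loop itself is beyond this sketch.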