Track: Oral Session 6B

Sat 26 April 0:30 - 0:42 PDT

MoDeGPT: Modular Decomposition for Large Language Model Compression

Chi-Heng Lin · Shangqian Gao · James Smith · Abhishek Patel · Shikhar Tuli · Yilin Shen · Hongxia Jin · Yen-Chang Hsu

Large Language Models (LLMs) have significantly advanced AI with their exceptional performance across a wide range of tasks. However, their extensive computational requirements restrict their use on devices with limited resources. While recent compression methods based on low-rank matrices show potential solutions, they often suffer from significant loss of accuracy or introduce substantial overhead in parameters and inference time. In this paper, we introduce Modular Decomposition (MoDeGPT), a new, efficient, and structured compression framework that overcomes these limitations. MoDeGPT jointly decomposes pairs of consecutive subcomponents within Transformer blocks, reduces hidden dimensions through output reconstruction on a larger structural scale than conventional low-rank methods, and repurposes three classical matrix decomposition algorithms (Nyström approximation, CR decomposition, and SVD) to ensure bounded errors in our novel decomposition approach. Our experiments show that MoDeGPT, without relying on backward propagation, consistently matches or surpasses the performance of prior techniques that depend on gradient information, while achieving a 98% reduction in compute costs when compressing a 13B-parameter model. On LLaMA-2/3 and OPT models, MoDeGPT retains 90-95% of zero-shot performance with compression rates of 25-30%. The compression process can be completed on a single GPU in a few hours, boosting inference throughput by up to 46%.

Sat 26 April 0:42 - 0:54 PDT

AlphaEdit: Null-Space Constrained Model Editing for Language Models

Junfeng Fang · Houcheng Jiang · Kun Wang · Yunshan Ma · Jie Shi · Xiang Wang · Xiangnan He · Tat-Seng Chua

Large language models (LLMs) often exhibit hallucinations, producing incorrect or outdated knowledge. Hence, model editing methods have emerged to enable targeted knowledge updates. To achieve this, a prevailing paradigm is the locating-then-editing approach, which first locates influential parameters and then edits them by introducing a perturbation. While effective, current studies have demonstrated that this perturbation inevitably disrupts the originally preserved knowledge within LLMs, especially in sequential editing scenarios. To address this, we introduce AlphaEdit, a novel solution that projects the perturbation onto the null space of the preserved knowledge before applying it to the parameters. We theoretically prove that this projection ensures the output of post-edited LLMs remains unchanged when queried about the preserved knowledge, thereby mitigating the issue of disruption. Extensive experiments on various LLMs, including LLaMA3, GPT2-XL, and GPT-J, show that AlphaEdit boosts the performance of most locating-then-editing methods by an average of 36.7% with only a single additional line of code for the projection.
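
The null-space idea in the abstract above can be illustrated with a small, self-contained sketch. This is not AlphaEdit's code: the function names, the toy matrices, and the use of Gram-Schmidt to build the projection are all assumptions made for illustration. It only shows that once each row of a perturbation is made orthogonal to the preserved keys, the edited weights return the same outputs for those keys.

```python
# Illustrative sketch only; all names and toy values are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given key vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip keys that are linearly dependent on earlier ones
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(delta, preserved_keys):
    """Remove from every row of delta its component in span(preserved_keys).

    Afterwards delta_projected @ k == 0 for each preserved key k, so
    (W + delta_projected) @ k == W @ k.
    """
    basis = orthonormal_basis(preserved_keys)
    projected = []
    for row in delta:
        r = list(row)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        projected.append(r)
    return projected

# Toy setup (assumed): a 2x3 weight matrix and two preserved keys in R^3.
W = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
preserved_keys = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
delta = [[0.5, -0.3, 0.8], [1.0, 2.0, -1.0]]  # raw perturbation from some editor

delta_p = project_to_null_space(delta, preserved_keys)
W_edited = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta_p)]
```

In this toy case the preserved keys span the first two coordinates, so the projected perturbation can only act on the third; a query along a new direction still sees the edit, while the preserved keys do not.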

Sat 26 April 0:54 - 1:06 PDT

Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment

Gregor Bachmann · Sotiris Anagnostidis · Albert Pumarola · Markos Georgopoulos · Artsiom Sanakoyeu · Yuming Du · Edgar Schoenfeld · Ali Thabet · Jonas Kohler

The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive generation, leveraging a fast draft model to propose candidate tokens, which are then verified in parallel based on their likelihood under the target model. While this approach is guaranteed to reproduce the target output, it incurs a substantial penalty: many high-quality draft tokens are rejected, even when they represent objectively valid continuations. Indeed, we show that even powerful draft models such as GPT-4o, as well as human text, cannot achieve high acceptance rates under the standard verification scheme. This severely limits the speedup potential of current speculative decoding methods, as an early rejection becomes overwhelmingly likely when solely relying on alignment of draft and target. We thus ask the following question: Can we adapt verification to recognize correct, but non-aligned replies? To this end, we draw inspiration from the LLM-as-a-judge framework, which demonstrated that LLMs are able to rate answers in a versatile way. We carefully design a dataset coined TokenCourt to elicit the same capability in the target model by training a compact module on top of the embeddings to produce "judgements" of the current continuation. We showcase our strategy on the Llama-3.1 family, where our 8B/405B-Judge achieves a speedup of $9\times$ over Llama-405B, while maintaining its quality on a large range of benchmarks. These benefits remain present even in optimized inference frameworks, where our method reaches up to $141$ tokens/s for 8B/70B-Judge and $129$ tokens/s for 8B/405B on $2$ and $8$ H100s respectively.
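
For readers unfamiliar with the standard verification scheme the abstract above criticises, the following is a minimal sketch of the usual speculative-sampling acceptance test for a single token. It is not the Judge Decoding method. The function names and the toy distributions `P_TARGET` and `Q_DRAFT` are assumptions; the rule itself (accept with probability min(1, p/q), otherwise resample from the renormalised residual max(0, p - q)) is the commonly used one.

```python
import random

def acceptance_probability(token, p_target, q_draft):
    """Probability that the target accepts a token proposed by the draft."""
    return min(1.0, p_target.get(token, 0.0) / q_draft[token])

def verify_draft_token(token, p_target, q_draft, rng):
    """Standard check for one drafted token. Returns (accepted, emitted_token)."""
    if rng.random() < acceptance_probability(token, p_target, q_draft):
        return True, token
    # Rejected: resample from the residual distribution max(0, p - q), renormalised.
    residual = {t: max(0.0, p - q_draft.get(t, 0.0)) for t, p in p_target.items()}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]

def speculative_step(p_target, q_draft, rng):
    """Draft one token from q, then verify it against p."""
    tokens = list(q_draft)
    drafted = rng.choices(tokens, weights=[q_draft[t] for t in tokens], k=1)[0]
    return verify_draft_token(drafted, p_target, q_draft, rng)

# Toy next-token distributions (assumed): both models consider "big" a valid
# continuation, but they disagree on how likely it is.
P_TARGET = {"large": 0.7, "big": 0.2, "huge": 0.1}
Q_DRAFT = {"large": 0.1, "big": 0.8, "huge": 0.1}
```

With these toy numbers the draft's favourite token "big" is accepted only a quarter of the time even though the target also treats it as plausible, which is the kind of alignment-driven rejection the abstract describes. The emitted tokens are still distributed according to `P_TARGET`.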

Sat 26 April 1:06 - 1:18 PDT

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

Xiaosen Zheng · Tianyu Pang · Chao Du · Qian Liu · Jing Jiang · Min Lin

Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a **"null model"** that always outputs a **constant** response (*irrelevant to input instructions*) can cheat automatic benchmarks and achieve top-ranked win rates: an $86.5\%$ LC win rate on AlpacaEval 2.0; an $83.0$ score on Arena-Hard-Auto; and a $9.55$ score on MT-Bench. Moreover, the crafted cheating outputs are **transferable** because we assume that the instructions of these benchmarks (e.g., $805$ samples of AlpacaEval 2.0) are *private* and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.

Sat 26 April 1:18 - 1:30 PDT

Faster Cascades via Speculative Decoding

Harikrishna Narasimhan · Wittawat Jitkrittum · Ankit Singh Rawat · Seungyeon Kim · Neha Gupta · Aditya Krishna Menon · Sanjiv Kumar

Cascades and speculative decoding are two common approaches to improving language models' inference efficiency. Both approaches interleave two models, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for “hard” inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel scoring mode. These mechanisms offer different benefits: empirically, cascades offer compelling cost-quality trade-offs, often even outperforming the large model; speculative decoding offers impressive speed-ups, while guaranteeing quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Experiments with Gemma and T5 models on a range of language benchmarks show that our approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
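
As a point of reference for the deferral rule mentioned above, here is a minimal sketch of a plain two-model cascade with a confidence-threshold rule. This is a common baseline form of deferral, not the optimal rule or the speculative cascade proposed in the paper. The toy `small_model` and `large_model` functions, the labels, and the threshold value are hypothetical stand-ins.

```python
def cascade_predict(x, small_model, large_model, threshold=0.8):
    """Return (prediction, used_large) under a confidence-threshold deferral rule.

    Each model maps an input to a dict of label -> probability.
    The large model is invoked only when the small model is not confident.
    """
    small_probs = small_model(x)
    label, confidence = max(small_probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return label, False  # "easy" input: keep the cheap answer
    large_probs = large_model(x)
    return max(large_probs, key=large_probs.get), True  # "hard" input: defer

# Hypothetical toy models: the small one is confident only on short inputs.
def small_model(x):
    if len(x.split()) <= 3:
        return {"positive": 0.95, "negative": 0.05}
    return {"positive": 0.55, "negative": 0.45}

def large_model(x):
    return {"positive": 0.10, "negative": 0.90}

easy = cascade_predict("great movie", small_model, large_model)
hard = cascade_predict("not exactly what I would call great", small_model, large_model)
```

Raising `threshold` sends more inputs to the large model, trading cost for quality; the cost-quality trade-off in the abstract refers to sweeping this kind of knob.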

Sat 26 April 1:30 - 1:42 PDT

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

João Loula · Benjamin LeBrun · Li Du · Ben Lipkin · Clemente Pasti · Gabriel Grand · Tianyu Liu · Yahya Emara · Marjorie Freedman · Jason Eisner · Ryan Cotterell · Vikash Mansinghka · Alexander Lew · Tim Vieira · Timothy O'Donnell

A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution (which can differ substantially from the LM’s base distribution) is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). This SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains (Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis), we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
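
The general shape of SMC-based constrained generation described above can be shown with a toy sketch. It does not reproduce the paper's system: the two-token vocabulary, the fixed `lm_next_token_probs` stand-in for a language model, the 0/1 `potential` that forbids two consecutive "b" tokens, and the resample-at-half-ESS rule are all assumptions chosen to keep the example small.

```python
import random

def lm_next_token_probs(prefix):
    """Hypothetical stand-in for an LM: mildly prefers 'b' whatever the prefix."""
    return {"a": 0.4, "b": 0.6}

def potential(prefix):
    """Constraint as a 0/1 potential: forbid two consecutive 'b' tokens."""
    return 0.0 if prefix[-2:] == ["b", "b"] else 1.0

def smc_generate(num_particles, length, rng):
    """Return (particles, weights) after `length` extend/reweight/resample steps."""
    particles = [[] for _ in range(num_particles)]
    weights = [1.0] * num_particles
    for _ in range(length):
        # Extend each particle with a token proposed by the LM, then reweight
        # by the constraint potential.
        for i, prefix in enumerate(particles):
            probs = lm_next_token_probs(prefix)
            token = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
            prefix.append(token)
            weights[i] *= potential(prefix)
        total = sum(weights)
        if total == 0.0:
            raise RuntimeError("all particles violated the constraint")
        # Resample when the effective sample size drops below half the particles,
        # moving computation away from prefixes that already failed.
        ess = total ** 2 / sum(w * w for w in weights)
        if ess < num_particles / 2:
            chosen = rng.choices(range(num_particles), weights=weights, k=num_particles)
            particles = [list(particles[j]) for j in chosen]
            weights = [total / num_particles] * num_particles
    return particles, weights
```

Calling `smc_generate(200, 8, random.Random(0))` returns 200 length-8 token lists; every particle with non-zero weight satisfies the constraint. Resampling is the step that corresponds to reallocating computation during generation: particles that have already violated the constraint are replaced by copies of surviving ones.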