Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation
Integrated Gradients Provides Faithful Language Model Attributions for In-Context Learning
Theo Datta · Erik Wang · Kayla Huang · Finale Doshi-Velez
In-context learning (ICL) allows large language models to perform a wide range of tasks without parameter updates by conditioning on examples provided in the prompt. While effective, the mechanisms by which ICL leverages information within the prompt remain poorly understood. Furthermore, while Integrated Gradients (IG) is one of the most popular feature-attribution methods in deep learning, it has not been widely adopted for explaining large language models. Instead, LLM research has gravitated toward alternative interpretability techniques that focus more on the model’s internal representations. This work investigates the faithfulness of IG as an explainability method for LLMs, focusing specifically on biases in in-context learning, and aims to shed light on how models use in-context examples. We demonstrate that IG attributions reliably capture the influence of individual examples on model predictions and reveal a positional bias consistent with results from non-gradient-based approaches, with later examples generally receiving higher attribution scores. Finally, our results provide insight into which prompt formats best support the faithful application of IG.
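For reference, the standard Integrated Gradients attribution (Sundararajan et al., 2017) for the i-th input feature, given an input x, a baseline x', and a scalar model output F (e.g., the logit of the predicted token), is the path integral of gradients along the straight line from the baseline to the input:

\[
\mathrm{IG}_i(x) \;=\; (x_i - x'_i) \int_{0}^{1} \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha .
\]

In practice the integral is approximated by a Riemann sum over a small number of interpolation steps; the per-example attributions discussed above can then be obtained by aggregating token-level scores over the tokens of each in-context example. The choice of baseline and aggregation scheme is not specified in this abstract and would follow the paper.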