Poster in Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation
The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Kristina Nikolić · Luze Sun · Jie Zhang · Florian Tramer
Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground-truth answers, by aligning models to refuse questions on benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks shows vast disparities in the utility of model responses. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math questions, this comes at the cost of an accuracy drop of up to 97%. Overall, our work proposes jailbreak utility as an important new metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks.
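As a rough illustration of the kind of measurement the abstract describes, the sketch below scores answers against a ground-truth benchmark and reports the relative accuracy drop between a model's ordinary answers and the answers obtained by jailbreaking an aligned-to-refuse model. This is a minimal sketch under my own assumptions: the function names, the exact-match scoring, and the precise definition of the "jailbreak tax" here are illustrative, not the authors' implementation.

```python
# Hypothetical sketch: measure utility loss ("jailbreak tax") as the relative
# accuracy drop on a benchmark with known ground-truth answers.
from typing import Sequence


def accuracy(answers: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of answers that exactly match the known ground truth."""
    assert len(answers) == len(ground_truth)
    correct = sum(a.strip() == g.strip() for a, g in zip(answers, ground_truth))
    return correct / len(ground_truth)


def jailbreak_tax(
    base_answers: Sequence[str],
    jailbroken_answers: Sequence[str],
    ground_truth: Sequence[str],
) -> float:
    """Relative accuracy drop caused by the jailbreak:
    0.0 = no utility lost, 1.0 = all previously correct answers are now wrong."""
    base_acc = accuracy(base_answers, ground_truth)
    jb_acc = accuracy(jailbroken_answers, ground_truth)
    if base_acc == 0:
        return 0.0  # nothing to lose if the base model was never correct
    return (base_acc - jb_acc) / base_acc


if __name__ == "__main__":
    truth = ["4", "9", "16"]
    base = ["4", "9", "16"]        # unaligned model answers correctly
    jailbroken = ["4", "7", "12"]  # jailbroken aligned model: guardrail bypassed,
                                   # but two answers are now wrong
    print(f"jailbreak tax: {jailbreak_tax(base, jailbroken, truth):.0%}")  # 67%
```

In this toy example the jailbreak succeeds at eliciting answers, yet two thirds of the base model's correct answers are lost, which is the kind of utility gap the proposed benchmarks are meant to expose.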