Oral in Workshop: Secure and Trustworthy Large Language Models

Single-pass detection of jailbreaking input in large language models

Leyla Naz Candogan ⋅ Yongtao Wu ⋅ Elias Abad Rocamora ⋅ Grigorios Chrysos ⋅ Volkan Cevher


Abstract

Recent advancements have exposed the vulnerability of aligned large language models (LLMs) to jailbreaking attacks, sparking a wave of research on post-defense strategies. However, some existing approaches require either multiple requests to the model or additional auxiliary LLMs, which is time- and resource-consuming. To this end, we propose single-pass detection (SPD), a method for detecting jailbreaking inputs via the logit values in a single forward pass. On the open-source Llama 2 and Vicuna models, SPD achieves a higher attack detection rate and faster detection than existing defense mechanisms, with minimal misclassification of benign inputs. Finally, we demonstrate the efficacy of SPD even without access to the full logits, on both GPT-3.5 and GPT-4. We firmly believe that our proposed defense presents a promising approach to safeguarding LLMs against adversarial attacks.
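
The idea of flagging an input from the logits of a single forward pass can be illustrated with a minimal sketch. The abstract does not specify the features or the classifier, so everything below is an assumption made for illustration: the feature choice (the top-k logits at the first few output positions), the logistic scoring, the function names, and the toy weights are all hypothetical and are not the authors' implementation.

```python
import math


def top_k_logit_features(logits, k=3):
    """Flatten the k largest logits at each position into one feature vector.

    `logits` is a list of per-position logit lists, such as a single
    forward pass might yield for the first few output positions.
    (Assumed input format, for illustration only.)
    """
    features = []
    for position in logits:
        features.extend(sorted(position, reverse=True)[:k])
    return features


def jailbreak_score(features, weights, bias):
    """Logistic score in (0, 1) from a linear model with hypothetical weights."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def is_flagged(logits, weights, bias, k=3, threshold=0.5):
    """Return True when the score exceeds the (assumed) decision threshold."""
    features = top_k_logit_features(logits, k)
    return jailbreak_score(features, weights, bias) > threshold


# Toy logits for two output positions over a four-token vocabulary.
toy_logits = [[2.0, 0.5, -1.0, 0.1], [1.5, 1.4, 0.0, -0.3]]
toy_weights = [1.0, -1.0, 1.0, -1.0]  # hypothetical, not learned from data

print(top_k_logit_features(toy_logits, k=2))
print(is_flagged(toy_logits, toy_weights, bias=-1.0, k=2))
```

In practice the weights would be fitted on labeled benign and attack prompts. When only the top few log-probabilities are exposed, as with some hosted APIs, the same kind of feature vector could be built from those values instead, again as an assumption rather than a description of SPD itself.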
