Oral in Workshop: Secure and Trustworthy Large Language Models

Open Sesame! Universal Black-Box Jailbreaking of Large Language Models

Raz Lapid ⋅ Ron Langberg ⋅ Moshe Sipper

Abstract

We introduce a novel approach that employs a genetic algorithm (GA) to manipulate LLMs when model architecture and parameters are inaccessible. The GA attack works by optimizing a universal adversarial prompt that, when combined with a user’s query, disrupts the attacked model’s alignment, resulting in unintended and potentially harmful outputs. To our knowledge, this is the first automated universal black-box jailbreak attack.
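
The abstract does not give the GA's details, so the following is only a generic textbook sketch of black-box genetic search over a token sequence, not the authors' method. It shows the general pattern the abstract alludes to: the optimizer sees nothing but a scalar score per candidate. All names (`genetic_search`, `make_toy_oracle`), the hyperparameters, and the scoring function are assumptions for illustration. The oracle here is a harmless toy that counts positions matching a hidden string, and it does not query any language model.

```python
import random


def genetic_search(fitness, vocab, length=6, pop_size=40, generations=100,
                   elite=4, mutation_rate=0.1, seed=0):
    """Maximize a black-box `fitness` over token sequences of fixed length.

    Only fitness(candidate) -> number is used; no gradients or internals.
    """
    rng = random.Random(seed)
    # Random initial population of candidate sequences.
    population = [[rng.choice(vocab) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # Elitism: carry the best candidates over unchanged.
        next_gen = [list(ind) for ind in ranked[:elite]]
        parents = ranked[:pop_size // 2]
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: resample each token with a small probability.
            child = [rng.choice(vocab) if rng.random() < mutation_rate else tok
                     for tok in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


def make_toy_oracle(target):
    """Hypothetical stand-in scorer: number of positions equal to a hidden target."""
    def score(candidate):
        return sum(c == t for c, t in zip(candidate, target))
    return score


vocab = list("abcdefghijklmnopqrstuvwxyz")
oracle = make_toy_oracle(list("sesame"))
best = genetic_search(oracle, vocab, length=6)
print("".join(best), oracle(best))
```

Because the elite candidates are copied forward unchanged, the best score in the population never decreases from one generation to the next. That is a property of this sketch and not a claim about the paper's algorithm.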
