Poster
in
Workshop: Building Trust in LLMs and LLM Applications: From Guardrails to Explainability to Regulation

VideoJail: Exploiting Video-Modality Vulnerabilities for Jailbreak Attacks on Multimodal Large Language Models

Wenbo Hu · Shishen Gu · Youze Wang · Richang Hong


Abstract: With the rapid development of multimodal large language models (MLLMs), an increasing number of models focus on video understanding capabilities while overlooking the security implications of the video modality. Previous studies have highlighted the vulnerability of MLLMs to jailbreak attacks in the image modality. This paper explores the impact of the video modality on the safety alignment of MLLMs. We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs, revealing vulnerabilities introduced by video input. Motivated by these findings, we propose a novel jailbreak method, VideoJail, which leverages video generation models to amplify harmful content in images. Using carefully crafted text prompts, VideoJail directs the model's attention to malicious queries embedded within the video, successfully breaking through existing defense mechanisms. Experimental results show that VideoJail is highly effective in jailbreaking even the most advanced open-source MLLMs, achieving an average attack success rate (ASR) of 96.53% for LLaVA-Video and 96.00% for Qwen2-VL. For closed-source MLLMs with harmful visual content detection capabilities, we exploit the dynamic characteristics of the video modality, using a jigsaw-based approach to bypass their safety alignment mechanisms and achieving an average attack success rate of 92.13% for Gemini-1.5-flash.