Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
Abstract
Automated red-teaming has emerged as an essential approach for identifying vulnerabilities in large language models (LLMs). However, most existing methods rely on fixed attack templates and focus primarily on individual high-severity flaws, limiting their adaptability to evolving defenses and their ability to detect complex, highly exploitable vulnerabilities. To address these limitations, we propose AUTO-RT, a reinforcement learning framework for automatic jailbreak strategy exploration, i.e., discovering diverse and effective prompts capable of bypassing the safety restrictions of LLMs. AUTO-RT autonomously explores and optimizes attack strategies by interacting with the target model and generating crafted queries that trigger security failures. Specifically, AUTO-RT introduces two key techniques to improve exploration efficiency and attack effectiveness: 1) Dynamic Strategy Pruning, which focuses exploration on high-potential strategies by eliminating highly redundant paths early, and 2) Progressive Reward Tracking, which leverages intermediate downgraded models and a novel First Inverse Rate (FIR) metric to smooth sparse rewards and guide learning. Extensive experiments across diverse white-box and black-box LLM settings demonstrate that AUTO-RT significantly improves attack success rates (by up to 16.63%), expands vulnerability coverage, and accelerates vulnerability discovery compared to existing methods.