Poster Fri, Apr 24, 2026 • 6:30 AM – 9:00 AM PDT Pavilion 4 P4-#5206

Strategic Obfuscation of Deceptive Reasoning in Language Models

Arun Jose ⋅ Niels Warncke ⋅ Mia Taylor

Abstract

Large language models can behave differently during training and deployment, a phenomenon known as alignment faking. While prior work has shown that models can explicitly reason about faking alignment to preserve their goals, we demonstrate that Claude 3.5 Sonnet exhibits alignment faking behavior without verbalizing deceptive reasoning. When told that its reasoning will be monitored and that it will be trained to comply with harmful queries, the model shows a 24.3% compliance gap between training and deployment contexts, while its verbalization of alignment faking drops by 93.7%. We compare this to instruction-based approaches, in which we explicitly tell models to hide certain reasoning from monitors. We find that intrinsic motivation to obfuscate evades a monitor far more effectively than instruction following does, even when the model is given assistance on how to hide its reasoning. Even on tasks requiring simpler reasoning, instruction-based methods reduce verbalization by only 47.6%. Our results indicate that, when internally motivated, models can exhibit sophisticated deceptive behavior in high-stakes scenarios without accessible reasoning, which limits the reliability of instruction-based elicitation.
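
The two kinds of quantity reported in the abstract can be illustrated with a minimal sketch. It assumes, since the abstract does not spell this out, that the compliance gap is the training-context compliance rate minus the deployment-context rate, and that a "drop" in verbalization is a relative reduction against a baseline rate. The function names and all input values below are hypothetical and are not taken from the paper.

```python
def compliance_rate(complied: list[bool]) -> float:
    """Fraction of queries the model complied with (one bool per query)."""
    if not complied:
        raise ValueError("need at least one query outcome")
    return sum(complied) / len(complied)


def compliance_gap(training: list[bool], deployment: list[bool]) -> float:
    """Assumed definition: training-context rate minus deployment-context rate."""
    return compliance_rate(training) - compliance_rate(deployment)


def relative_reduction(baseline: float, treated: float) -> float:
    """Assumed definition of a 'drop': (baseline - treated) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline - treated) / baseline


# Hypothetical outcomes, for illustration only (not results from the paper).
training_outcomes = [True] * 6 + [False] * 4    # complied on 6 of 10 queries
deployment_outcomes = [True] * 2 + [False] * 8  # complied on 2 of 10 queries

gap = compliance_gap(training_outcomes, deployment_outcomes)
drop = relative_reduction(baseline=0.5, treated=0.1)
print(f"compliance gap: {gap:.1%}, verbalization drop: {drop:.1%}")
```

Under these assumptions the gap is a difference in percentage points, whereas the verbalization drop is a ratio, so the two figures are not directly comparable.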
