SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
Abstract
Large language models (LLMs) are increasingly deployed in contexts where their failures can carry sociopolitical consequences. Yet existing safety benchmarks only sparsely test vulnerabilities in domains such as political manipulation, propaganda generation, or surveillance and information control. To address this gap, we introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, grounded in real-world events and designed to evaluate LLM vulnerabilities to sociopolitical harms. Using SocialHarmBench, we provide: (1) adversarial evaluation coverage of high-risk domains, including authoritarian surveillance, disinformation campaigns, erosion of democratic processes, and crimes against humanity; (2) adversarial evaluations of open-source models that establish baseline robustness and measure attack efficiency in politically charged settings; and (3) insights from domain-specific vulnerability comparisons, temporal analyses tracing the time periods to which models are most vulnerable, and region-specific vulnerabilities. Our findings reveal that existing safeguards fail to transfer effectively to sociopolitical contexts, exposing partisan biases and limitations in preserving human rights and democratic values.