Poster in Workshop: ICLR 2025 Workshop on Human-AI Coevolution
How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers
Antonio-Gabriel Chacón Menke · Phan Tan
Abstract:
Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods like Constitutional AI (CAI). This paper explores CAI's self-critique mechanism on small, uncensored 7-9B-parameter models: DeepSeek-R1-8B, Gemma-2-9B, Llama-3.1-8B, and Qwen2.5-7B. We show that while Llama-based models exhibited significant harm reduction through self-critique, other architectures struggled with harm detection post-abliteration. These findings suggest that CAI's effectiveness may vary with model architecture and reasoning capability.
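For readers unfamiliar with the mechanism the abstract refers to, below is a minimal sketch of CAI's critique-and-revise loop (after Bai et al., 2022). The `generate` callable is a hypothetical stand-in for any of the studied 7-9B models, and the principle strings and prompt templates are illustrative assumptions, not the authors' exact setup.

```python
# Illustrative sketch of the Constitutional AI self-critique loop.
# `generate` is a hypothetical text-completion function wrapping a
# local 7-9B model; this is NOT the paper's implementation.

from typing import Callable

# Example constitutional principles (illustrative, not the paper's).
CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or dangerous.",
    "Rewrite the response to remove any harmful content.",
]

def self_critique(prompt: str, generate: Callable[[str], str]) -> str:
    """One critique-and-revise pass over the model's own output."""
    # 1. Initial (possibly harmful) draft from the uncensored model.
    draft = generate(prompt)

    # 2. The same model critiques its draft against a principle;
    #    per the abstract, this harm-detection step is where
    #    non-Llama architectures tended to fail post-abliteration.
    critique = generate(
        f"Prompt: {prompt}\nResponse: {draft}\n"
        f"Critique request: {CONSTITUTION[0]}\nCritique:"
    )

    # 3. The model revises the draft in light of its own critique.
    revision = generate(
        f"Prompt: {prompt}\nResponse: {draft}\n"
        f"Critique: {critique}\n"
        f"Revision request: {CONSTITUTION[1]}\nRevision:"
    )
    return revision
```

With a real backend, `generate` could wrap, for example, a local inference call to one of the four models; the loop itself requires no fine-tuning, which is what makes it applicable to small, abliterated checkpoints.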