Poster

Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron

Yiran Zhao ⋅ Wenxuan Zhang ⋅ Yuxi Xie ⋅ Anirudh Goyal ⋅ Kenji Kawaguchi ⋅ Michael Qizhe Shieh

2025 Poster

[ OpenReview]

Abstract

Safety alignment for large language models (LLMs) has become a critical issue due to their rapid progress. However, our understanding of effective safety mechanisms in LLMs remains limited, leading to safety alignment training that mainly focuses on improving optimization, data-level enhancement, or adding extra structures to intentionally block harmful outputs. To address this gap, we develop a neuron detection method to identify safety neurons—those consistently crucial for handling and defending against harmful queries. Our findings reveal that these safety neurons constitute less than $1\%$ of all parameters, are language-specific and are predominantly located in self-attention layers. Moreover, safety is collectively managed by these neurons in the first several layers. Based on these observations, we introduce a $\underline{S}$afety $\underline{N}$euron $\underline{Tun}$ing method, named $\texttt{SN-Tune}$, that exclusively tune safety neurons without compromising models' general capabilities. $\texttt{SN-Tune}$ significantly enhances the safety of instruction-tuned models, notably reducing the harmful scores of Llama3-8B-Instruction from $65.5$ to $2.0$, Mistral-7B-Instruct-v0.2 from $70.8$ to $4.5$, and Vicuna-13B-1.5 from $93.5$ to $3.0$. Moreover, $\texttt{SN-Tune}$ can be applied to base models on efficiently establishing LLMs' safety mechanism. In addition, we propose $\underline{R}$obust $\underline{S}$afety $\underline{N}$euron $\underline{Tun}$ing method ($\texttt{RSN-Tune}$), which preserves the integrity of LLMs' safety mechanisms during downstream task fine-tuning by separating the safety neurons from models' foundation neurons.

Video

Chat is not available.