Oral in Workshop: ICLR 2025 Workshop on Bidirectional Human-AI Alignment

Investigating Alignment Signals in Initial Token Representations


Abstract:

Large language models (LLMs) have become integral to modern text-based applications, necessitating robust alignment frameworks that ensure ethical, context-sensitive, and safe outputs. However, how refusal decisions form remains underexplored, particularly whether these behaviors emerge from surface-level cues or from deeper contextual representations at the earliest stages of generation. To explore this, we analyze hidden states at the very first assistant token with a simple logistic regression probe, aiming to pinpoint when refusal signals emerge and how they are encoded. Our findings reveal that these early internal representations reliably predict refusal outcomes, achieving 93.1% accuracy on a held-out test set. Moreover, causal intervention experiments confirm that manipulating these initial hidden states can decisively flip model decisions, underscoring their direct influence on the model's behavior. By illuminating the decisive role of initial hidden states, this study offers new insights that may guide the development of more robust alignment interventions, ultimately promoting safer and more dependable LLM deployments.
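The abstract does not include code, but the probing setup it describes can be sketched as follows: extract the hidden state at the position where the first assistant token would be generated, then fit a logistic regression classifier to predict refusal. This is a minimal illustrative sketch, not the authors' implementation; the model name, the choice of layer, and the load_labeled_prompts helper are assumptions introduced here for illustration.

```python
# Minimal sketch (not the paper's code): probe first-assistant-token hidden
# states for refusal prediction with logistic regression.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed; any chat LLM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def first_token_hidden_state(prompt: str, layer: int = -1) -> torch.Tensor:
    """Hidden state at the position that produces the first assistant token."""
    chat = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    )
    with torch.no_grad():
        out = model(input_ids)
    # Last prompt position: the representation from which the first
    # assistant token is sampled.
    return out.hidden_states[layer][0, -1, :]

# prompts: user requests; labels: 1 if the model's completion was a refusal,
# 0 otherwise. load_labeled_prompts() is a hypothetical data-loading helper.
prompts, labels = load_labeled_prompts()

X = torch.stack([first_token_hidden_state(p) for p in prompts]).float().numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```

A causal-intervention variant of this setup would, under the same assumptions, add or subtract a direction derived from the probe's weight vector to the first-token hidden state during generation and check whether the refusal decision flips; the abstract reports that such manipulations decisively change model behavior.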
