Probing Implicit Bias Risk Framing in Language Models
Abstract
Do large language models encode whether demographic information is implicitly framed as decision-relevant? We study 903 synthetic, LLM-generated decision-support prompts spanning 15 high-stakes domains, labeled according to a controlled framing distinction: demographic mentions that serve as incidental administrative context versus mentions framed as subtly decision-relevant social context. We train linear probes on hidden states and evaluate them under cross-generator transfer, which requires generalization across independently generated prompt distributions. Probes outperform both bag-of-words and frozen transformer embedding baselines (AUROC 0.93 vs. 0.82 for bag-of-words and 0.71–0.72 for embeddings), indicating that the signal is not fully reducible to surface lexical cues or off-the-shelf sentence embeddings. The effect holds across Llama and Qwen models, and layer-wise analysis shows architecture-specific peaks. These results provide preliminary evidence that LLM representations linearly encode this controlled framing distinction, while leaving open broader questions about human-grounded implicit bias.
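As a rough illustration of the evaluation protocol described above, the following sketch trains a linear probe on vectors from one prompt generator and scores it by AUROC on vectors from another. It is not the authors' pipeline. The hidden states are replaced by synthetic Gaussian vectors, the probe is assumed to be a logistic regression trained by plain gradient descent, and every name (`make_split`, `train_probe`, the `offset` parameter that mimics generator-specific shift) is hypothetical.

```python
import math
import random


def auroc(labels, scores):
    """Rank-based AUROC: chance that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_split(rng, n, dim, offset):
    """Synthetic stand-in for hidden states from one prompt generator
    (assumption: real experiments would extract model activations)."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label == 1 else -1.0
        # First half of the dimensions carry the framing label; the rest
        # carry only a generator-specific offset plus noise.
        row = [rng.gauss(sign if j < dim // 2 else offset, 1.0)
               for j in range(dim)]
        X.append(row)
        y.append(label)
    return X, y


def score(w, b, row):
    """Linear probe logit for one vector."""
    return b + sum(wj * xj for wj, xj in zip(w, row))


def train_probe(X, y, steps=200, lr=0.1):
    """Logistic-regression probe fitted by full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for row, label in zip(X, y):
            z = max(-30.0, min(30.0, score(w, b, row)))  # avoid overflow
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * row[j]
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b


rng = random.Random(0)
# Train on "generator A", evaluate on "generator B" (different offset).
X_train, y_train = make_split(rng, n=200, dim=8, offset=0.5)
X_test, y_test = make_split(rng, n=200, dim=8, offset=-0.5)

w, b = train_probe(X_train, y_train)
transfer_auc = auroc(y_test, [score(w, b, row) for row in X_test])
print(f"cross-generator AUROC: {transfer_auc:.3f}")
```

Because AUROC depends only on the ranking of scores, a constant shift in the probe's logits caused by the generator-specific offset does not change the metric. This is one reason a rank-based measure suits transfer evaluation across differently distributed prompt sets.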