Semantic Calibration of LLMs Through the Lens of Temperature Scaling
Abstract
Calibration of large language models (LLMs) is typically studied at the token level, but for generative tasks like question-answering (QA), semantic calibration is more relevant. Recent multi-sampling techniques provide a means to elicit semantic confidence measures from LLMs, yet the impact of temperature scaling on these measures remains underexplored. Since temperature scaling influences both generative diversity and token-level calibration, it offers a simple yet effective approach for improving semantic calibration. In this work, we define several semantic confidence measures and evaluate various temperature scaling methods across multiple QA datasets. We introduce a novel calibration loss function, Selective Logit Smoothing (SLS), with a principled Bayesian interpretation. Our empirical findings demonstrate that scalar temperature scaling, when paired with appropriate loss functions such as SLS, can improve semantic calibration.
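As background for the claim above that temperature scaling affects both generative diversity and token-level calibration, recall the standard formulation of scalar temperature scaling, which rescales the logits by a single parameter before the softmax (the notation below, $z_i$ for logits and $T$ for the temperature, is conventional rather than specific to this paper):
\[
p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad T > 0,
\]
where $T = 1$ recovers the unscaled distribution, $T > 1$ flattens it (increasing sampling diversity and lowering token confidence), and $T < 1$ sharpens it.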