Poster in Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI
Semantic Calibration of LLMs Through the Lens of Temperature Scaling
Tom A. Lamb · Desi R Ivanova · Philip Torr · Tim G. J. Rudner
Keywords: [ LLM ] [ calibration ] [ uncertainty ] [ semantic calibration ] [ uncertainty quantification ]
Calibration of large language models (LLMs) is typically studied at the token level, but for generative tasks such as question answering (QA), semantic calibration is more relevant. Recent multi-sampling techniques provide a means to elicit semantic confidence measures from LLMs, yet the impact of temperature scaling on these measures remains underexplored. Since temperature scaling influences both generative diversity and token-level calibration, it offers a simple yet effective approach for improving semantic calibration. In this work, we define several semantic confidence measures and evaluate various temperature scaling methods across multiple QA datasets. We introduce a novel calibration loss function, Selective Logit Smoothing (SLS), with a principled Bayesian interpretation. Our empirical findings demonstrate that scalar temperature scaling, when paired with appropriate loss functions such as SLS, can improve semantic calibration.
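The abstract does not spell out the exact semantic confidence definitions or the SLS loss, so the following is only a minimal sketch of the generic recipe that temperature scaling acts on: scale token logits by a scalar temperature, sample several answers, cluster semantically equivalent ones, and read off a confidence from the cluster frequencies. All function names and the crude string-normalization equivalence proxy below are hypothetical stand-ins (papers in this area typically use an NLI-based or exact-match equivalence check).

```python
# Illustrative sketch, not the paper's method. Names are hypothetical.
from collections import Counter


def scale_logits(logits: list[float], temperature: float) -> list[float]:
    """Scalar temperature scaling: divide each token logit by a learned T."""
    return [z / temperature for z in logits]


def normalize(answer: str) -> str:
    """Crude semantic-equivalence proxy (strip + lowercase); a real pipeline
    would use an NLI model or normalized exact match to cluster answers."""
    return answer.strip().lower()


def semantic_confidence(sampled_answers: list[str]) -> tuple[str, float]:
    """Cluster sampled answers by (proxy) semantic equivalence and return the
    majority cluster's answer with its empirical frequency as a confidence."""
    clusters = Counter(normalize(a) for a in sampled_answers)
    answer, count = clusters.most_common(1)[0]
    return answer, count / len(sampled_answers)


# Usage: sample N answers at a chosen decoding temperature, then score them.
answers = ["Paris", "paris", "Lyon", "Paris ", "Paris"]
pred, conf = semantic_confidence(answers)
print(pred, conf)  # "paris" 0.8
```

Because the decoding temperature controls how diverse the sampled answers are, the same scalar also shifts the resulting semantic confidences, which is the coupling the abstract highlights.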