Semantic Calibration of LLMs Through the Lens of Temperature Scaling
Abstract
Calibration of large language models (LLMs) is typically studied at the token level, but for generative tasks like question-answering (QA), semantic calibration is more relevant. Recent multi-sampling techniques provide a means to elicit semantic confidence measures from LLMs, yet the impact of temperature scaling on these measures remains underexplored. Since temperature scaling influences both generative diversity and token-level calibration, it offers a simple yet effective approach for improving semantic calibration. In this work, we define several semantic confidence measures and evaluate various temperature scaling methods across multiple QA datasets. We introduce a novel calibration loss function, Selective Logit Smoothing (SLS), with a principled Bayesian interpretation. Our empirical findings demonstrate that scalar temperature scaling, when paired with appropriate loss functions such as SLS, can improve semantic calibration.
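As background for the claim above that temperature scaling affects both generative diversity and token-level calibration, recall the standard formulation of scalar temperature scaling, which rescales the logits by a single parameter before the softmax (the notation below, $z_i$ for logits and $T$ for the temperature, is conventional rather than specific to this paper):
\[
p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad T > 0,
\]
where $T = 1$ recovers the unscaled distribution, $T > 1$ flattens it (increasing sampling diversity and lowering token confidence), and $T < 1$ sharpens it.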