Epistemic Uncertainty Quantification To Improve Decisions From Black-Box Models
Abstract
Distinguishing a model's lack of knowledge (epistemic uncertainty) from inherent task randomness (aleatoric uncertainty) is crucial for reliable AI. However, standard metrics for evaluating confidence scores target different aspects of uncertainty: AUC and accuracy capture predictive signal, proper scoring rules capture overall uncertainty, and calibration metrics isolate part of the epistemic uncertainty but ignore the heterogeneity of errors within bins, known as the grouping loss. We close this evaluation gap by introducing asymptotically consistent and sample-efficient lower-bound estimators for the grouping loss and the excess risk, i.e., the suboptimality of a prediction. Our estimators complement existing calibration metrics to provide a more complete, fine-grained assessment of epistemic uncertainty. Applied to LLM question answering with inherent aleatoric noise, our estimator reveals substantial grouping loss, which decreases with model scale but is amplified by instruction tuning. The local nature of our estimators provides actionable insights: they automatically identify subgroups with systematic over- or under-confidence, enabling interpretable audits. We also demonstrate that they better reveal the need for post-training. Finally, we leverage our estimator to design efficient LLM cascades that defer to stronger models, achieving higher accuracy at lower cost than competing approaches.
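As context for the grouping loss mentioned above, the following display recalls the standard decomposition of a proper scoring rule from the calibration literature, stated here for the Brier score; the notation S, C, Q is ours and this sketch does not specify the paper's estimators. Writing S for the confidence score, Y for the label, C = E[Y | S] for the calibrated score, and Q = E[Y | X] for the true posterior probability,
\[
\underbrace{\mathbb{E}\big[(S - Y)^2\big]}_{\text{Brier score}}
= \underbrace{\mathbb{E}\big[(S - C)^2\big]}_{\text{calibration loss}}
+ \underbrace{\mathbb{E}\big[(C - Q)^2\big]}_{\text{grouping loss}}
+ \underbrace{\mathbb{E}\big[(Q - Y)^2\big]}_{\text{irreducible (aleatoric) loss}}.
\]
Calibration metrics estimate only the first term; the grouping loss term captures the within-bin heterogeneity referred to in the abstract, and the epistemic part of the loss is the sum of the first two terms.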