Calibration

Confidence, i.e. the output of the softmax classifier, should match the observed probability of being correct: among all predictions made with confidence 0.8, about 80% should be correct. In practice this is usually not the case, as seen in the figure below:

calibration-curve

This shows that the softmax output is not a true probability.

reliability_diagrams

Expected Calibration Error (ECE)

ECE measures the expected (bin-weighted) gap between accuracy and confidence. Predictions are partitioned into $M$ equally spaced confidence bins $B_m$:

$$ \mathrm{ECE}=\sum_{m=1}^{M} \frac{\left|B_{m}\right|}{n}\left|\operatorname{acc}\left(B_{m}\right)-\operatorname{conf}\left(B_{m}\right)\right| $$

where $n$ is the total number of samples across all bins. Perfect calibration is achieved when $\mathrm{ECE}=0$, i.e. $\operatorname{acc}\left(B_{m}\right)=\operatorname{conf}\left(B_{m}\right)$ for all bins $m$.
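The binning above can be sketched in a few lines of numpy (a minimal illustration, not from the lecture; equal-width bins, with `correct` a 0/1 vector of per-sample correctness):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin predictions by confidence, then sum the per-bin
    |accuracy - confidence| gap weighted by the bin's sample fraction."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()    # acc(B_m)
            conf = confidences[in_bin].mean()  # conf(B_m)
            ece += in_bin.mean() * abs(acc - conf)  # |B_m|/n weighting
    return ece
```

For example, a model that says 0.9 on every sample and is always right has an ECE of 0.1: it is under-confident by exactly that margin.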

calibration_factors

From the figure, we can see that increasing depth, increasing width, and batch normalization all tend to hurt model calibration. Only weight decay seems to improve ECE while also improving accuracy.

Maximum Calibration Error (MCE)

MCE is appropriate for high-risk applications, where the goal is to minimize the worst-case deviation between confidence and accuracy.

$$ \mathrm{MCE}=\max _{m \in\{1, \ldots, M\}}\left|\operatorname{acc}\left(B_{m}\right)-\operatorname{conf}\left(B_{m}\right)\right| $$
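Reusing the same binning as for ECE, MCE replaces the weighted sum with a max over non-empty bins (again a minimal numpy sketch, not from the lecture):

```python
import numpy as np

def maximum_calibration_error(confidences, correct, n_bins=15):
    """MCE: worst-case |accuracy - confidence| gap over non-empty bins."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    gaps = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gaps.append(abs(correct[in_bin].mean() - confidences[in_bin].mean()))
    return max(gaps) if gaps else 0.0
```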

Is miscalibration due to overfitting of the loss?

In the figure below, we can see that NLL and test error are fairly correlated during early epochs, but in later epochs, after the learning rate is reduced, the correlation drops and the NLL starts to increase again.

NLL-overfitting

Is that the cause of reduced calibration? Possibly:

  • Theory says that a network trained to minimize NLL is calibrated if and only if the (global) optimum is found.

This result can be generalized as follows:

  • Given a fixed function $f(\cdot)$ and a new function $g(\cdot)$,
  • if $g(f(\cdot))$ is trained to minimize NLL, the composition is calibrated (under some conditions).

Solutions

Temperature Scaling

Use a temperature parameter $T$ in the softmax to soften over-confident predictions. The value of $T$ can be estimated like any other hyperparameter, e.g. on a validation set. Note that dividing the logits by $T$ does not change the arg max, so accuracy is unaffected.

$$ P(\hat{y}=i)=\frac{e^{z_{i} / T}}{\sum_{j} e^{z_{j} / T}} $$

calibration-temperature
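A minimal sketch of estimating $T$ (my own illustration, assuming held-out validation logits and labels are available; a simple grid search over the NLL stands in for a proper optimizer):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick T minimizing the validation NLL via grid search."""
    def nll(T):
        p = softmax(logits, T)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return min(grid, key=nll)
```

For an over-confident model (e.g. logits that are too spread out relative to its accuracy), the fitted $T$ comes out larger than 1, flattening the predicted distribution without changing any predicted class.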

G-layers

  • Strip any softmax layers from the trained network $f$.
  • Train $g(f(x))$ on a calibration set $X$ to minimize NLL.
  • The resulting network is calibrated.

g-layers
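The recipe above can be sketched as follows (my own numpy illustration; here $g$ is just an affine map $g(z)=a z+b$ on the frozen logits, trained by gradient descent on NLL — temperature scaling is the special case $a=1/T$, $b=0$):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_g_layer(logits, labels, lr=0.1, steps=500):
    """Fit g(z) = a*z + b on frozen logits of f by minimizing NLL.
    Gradients flow only into (a, b); the base network f stays fixed."""
    n, k = logits.shape
    a, b = 1.0, np.zeros(k)
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        p = softmax(a * logits + b)
        grad_z = (p - onehot) / n            # dNLL/dz for softmax + CE
        a -= lr * np.sum(grad_z * logits)    # chain rule: dz/da = logits
        b -= lr * grad_z.sum(axis=0)
    return a, b
```

On over-confident logits the fitted scale $a$ shrinks below 1, which is exactly the flattening effect of a temperature $T>1$.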

References

  1. Guest Lecture, Thomas Mensink, Google Amsterdam
  2. On Calibration of Modern Neural Networks https://arxiv.org/abs/1706.04599
  3. The Importance of Calibrating Your Deep Production Model http://alondaks.com/2017/12/31/the-importance-of-calibrating-your-deep-model/