Calibration

Confidence, i.e. the output of the softmax classifier, should match the observed probability of being correct: among all predictions made with confidence 0.8, about 80% should be correct. In practice this is usually not the case, as seen in the figure below:

calibration-curve

This shows that the softmax output is not a true probability.

reliability_diagrams

Expected Calibration Error (ECE)

ECE measures the expected (bin-weighted) gap between accuracy and confidence. Predictions are partitioned into $M$ equally spaced confidence bins $B_m$:

$$ \mathrm{ECE}=\sum_{m=1}^{M} \frac{\left|B_{m}\right|}{n}\left|\operatorname{acc}\left(B_{m}\right)-\operatorname{conf}\left(B_{m}\right)\right| $$

where $n$ is the total number of samples across all bins. Perfect calibration is achieved when $\mathrm{ECE}=0$, i.e. $\operatorname{acc}\left(B_{m}\right)=\operatorname{conf}\left(B_{m}\right)$ for all bins $m$.
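The binning above can be sketched in a few lines of numpy (a minimal illustration, not from the lecture; equal-width bins, with `correct` a 0/1 vector of per-sample correctness):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: bin predictions by confidence, then sum the per-bin
    |accuracy - confidence| gap weighted by the bin's sample fraction."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()    # acc(B_m)
            conf = confidences[in_bin].mean()  # conf(B_m)
            ece += in_bin.mean() * abs(acc - conf)  # |B_m|/n weighting
    return ece
```

For example, a model that says 0.9 on every sample and is always right has an ECE of 0.1: it is under-confident by exactly that margin.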

calibration_factors

From the figure, we can see that increasing depth, increasing width, and batch normalization all tend to hurt model calibration. Only weight decay seems to improve ECE while also improving accuracy.

Maximum Calibration Error (MCE)

MCE is appropriate for high-risk applications, where the goal is to minimize the worst-case deviation between confidence and accuracy.

$$ \mathrm{MCE}=\max _{m \in\{1, \ldots, M\}}\left|\operatorname{acc}\left(B_{m}\right)-\operatorname{conf}\left(B_{m}\right)\right| $$
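Reusing the same binning as for ECE, MCE replaces the weighted sum with a max over non-empty bins (again a minimal numpy sketch, not from the lecture):

```python
import numpy as np

def maximum_calibration_error(confidences, correct, n_bins=15):
    """MCE: worst-case |accuracy - confidence| gap over non-empty bins."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    gaps = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gaps.append(abs(correct[in_bin].mean() - confidences[in_bin].mean()))
    return max(gaps) if gaps else 0.0
```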

Is miscalibration due to overfitting of the loss?

In the figure below, we can see that NLL and test error are fairly correlated during early epochs, but in later epochs, after the learning rate is reduced, the correlation drops and the NLL starts to increase again.

NLL-overfitting

Is that the cause of reduced calibration? Possibly:

  • Theory says that a network trained to minimize NLL is calibrated if and only if the (global) optimum is found.

This result can be generalized as follows:

  • Given a fixed function $f(\cdot)$ and a new function $g(\cdot)$,
  • if $g(f(\cdot))$ is trained to minimize NLL, the composition is calibrated (under some conditions).

Solutions

Temperature Scaling

Use a temperature parameter $T$ in the softmax to soften over-confident predictions. The value of $T$ can be estimated like any other hyperparameter, e.g. on a validation set. Note that dividing the logits by $T$ does not change the arg max, so accuracy is unaffected.

$$ P(\hat{y}=i)=\frac{e^{z_{i} / T}}{\sum_{j} e^{z_{j} / T}} $$

calibration-temperature
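A minimal sketch of estimating $T$ (my own illustration, assuming held-out validation logits and labels are available; a simple grid search over the NLL stands in for a proper optimizer):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick T minimizing the validation NLL via grid search."""
    def nll(T):
        p = softmax(logits, T)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return min(grid, key=nll)
```

For an over-confident model (e.g. logits that are too spread out relative to its accuracy), the fitted $T$ comes out larger than 1, flattening the predicted distribution without changing any predicted class.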

G-layers

  • Strip any softmax layers from the trained network $f$.
  • Train $g(f(x))$ on a calibration set $X$ to minimize NLL.
  • The resulting network is calibrated.

g-layers
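The recipe above can be sketched as follows (my own numpy illustration; here $g$ is just an affine map $g(z)=a z+b$ on the frozen logits, trained by gradient descent on NLL — temperature scaling is the special case $a=1/T$, $b=0$):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_g_layer(logits, labels, lr=0.1, steps=500):
    """Fit g(z) = a*z + b on frozen logits of f by minimizing NLL.
    Gradients flow only into (a, b); the base network f stays fixed."""
    n, k = logits.shape
    a, b = 1.0, np.zeros(k)
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        p = softmax(a * logits + b)
        grad_z = (p - onehot) / n            # dNLL/dz for softmax + CE
        a -= lr * np.sum(grad_z * logits)    # chain rule: dz/da = logits
        b -= lr * grad_z.sum(axis=0)
    return a, b
```

On over-confident logits the fitted scale $a$ shrinks below 1, which is exactly the flattening effect of a temperature $T>1$.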

References

  1. Guest Lecture, Thomas Mensink, Google Amsterdam
  2. On Calibration of Modern Neural Networks https://arxiv.org/abs/1706.04599
  3. The Importance of Calibrating Your Deep Production Model http://alondaks.com/2017/12/31/the-importance-of-calibrating-your-deep-model/