Uncertainty in Machine Learning

There are two distinct types of uncertainty in this modeling process: data uncertainty and model uncertainty.

While the model uncertainty can be reduced by training on more data, the data uncertainty is inherent to the data generating process and is irreducible.

Data Uncertainty

  • Data uncertainty arises from the stochastic variability inherent in the data generating process. It is also called aleatoric uncertainty.
  • For example, the toxicity label y for a comment can vary between 0 and 1 depending on raters’ different understandings of the comment or of the annotation guidelines.
  • A learned classifier $f_W(x)$ describes the data uncertainty via its predictive probability, e.g. $p(y \mid x, W)=\operatorname{sigmoid}\left(f_{W}(x)\right)$, as sketched below.
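
A minimal sketch of this view, assuming a binary toxicity classifier whose logit $f_W(x)$ has already been computed; using the binary entropy of the predictive probability as the data-uncertainty score is one common choice, not the only one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def data_uncertainty(logit):
    """Binary entropy of the predictive probability p(y | x, W).

    High entropy (p near 0.5) means the label is intrinsically ambiguous
    under this model, e.g. raters themselves would likely disagree.
    """
    p = sigmoid(logit)
    eps = 1e-12  # avoid log(0)
    return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

print(data_uncertainty(4.0))  # clear-cut comment: low entropy (~0.09 nats)
print(data_uncertainty(0.1))  # ambiguous comment: near the 0.69-nat maximum
```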

Model Uncertainty

  • Model uncertainty arises from the model’s lack of knowledge about the world, commonly caused by insufficient coverage of the training data. It is also called epistemic uncertainty.
  • For example, at evaluation time, the toxicity classifier may encounter neologisms or misspellings that did not appear in the training data, making it more likely to make a mistake.
  • A classifier can quantify model uncertainty by using probabilistic methods to learn the posterior distribution of the model parameters: $W \sim p(W)$
  • This distribution over $W$ induces a distribution over the predictive probabilities $p(y \mid x, W)$. At inference time, the model can sample weights $\left\{W_{m}\right\}_{m=1}^{M}$ from the posterior distribution $p(W)$ and compute the corresponding posterior samples of the predictive probability, $\left\{p\left(y \mid x, W_{m}\right)\right\}_{m=1}^{M}$. The model can then express its model uncertainty through the variance of this posterior predictive distribution, $\operatorname{Var}(p(y \mid x, W))$, as sketched below.
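
A minimal sketch of this sampling scheme on a toy logistic model; the `posterior_samples` below are drawn around a hypothetical point estimate and merely stand in for whatever MC Dropout, a deep ensemble, or variational inference would produce:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical posterior samples W_m ~ p(W): weight vectors drawn around a
# point estimate (in practice they come from MC Dropout, an ensemble, etc.).
W_mean = np.array([1.5, -2.0])
posterior_samples = [W_mean + 0.3 * rng.standard_normal(2) for _ in range(30)]

x = np.array([0.8, 0.1])  # features of a single input

# Posterior samples of the predictive probability p(y | x, W_m).
probs = np.array([sigmoid(W @ x) for W in posterior_samples])

model_uncertainty = probs.var()  # Var(p(y | x, W)) across samples
marginal_prob = probs.mean()     # Monte Carlo estimate of p(y | x), see below
print(marginal_prob, model_uncertainty)
```
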
Intuition

Aleatoric: we can't make good predictions because the world is random
Epistemic: we can't make good predictions because we don't know how the world works well enough

Combined Uncertainty

In practice, it is convenient to compute a single uncertainty score capturing both types of uncertainty. To this end, we can first compute the marginalized predictive probability:

$$ p(y \mid x)=\int p(y \mid x, W) p(W) d W $$

This marginalization captures both types of uncertainty: it averages the data uncertainty in $p(y \mid x, W)$ over the model uncertainty in $p(W)$. A summary statistic of the marginal distribution, such as its entropy or variance, then serves as the single combined score.
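
The integral is rarely tractable, so in practice it is approximated with a Monte Carlo average over posterior samples. A minimal sketch, assuming the array of sampled predictive probabilities $p(y \mid x, W_m)$ is already available (hypothetical values below):

```python
import numpy as np

# Posterior samples of p(y | x, W_m), e.g. from MC Dropout or an ensemble
# (hypothetical values for illustration).
probs = np.array([0.62, 0.71, 0.55, 0.80, 0.66])

# Monte Carlo estimate of the marginal p(y | x) = E_{p(W)}[p(y | x, W)].
p_marginal = probs.mean()

# One combined score: entropy of the marginal predictive distribution,
# which is high when either type of uncertainty is high.
combined = -(p_marginal * np.log(p_marginal)
             + (1 - p_marginal) * np.log(1 - p_marginal))
print(p_marginal, combined)
```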

Uncertainty Estimation Methods

For deep learning models, these are some common methods for estimating uncertainty taken from Ovadia et al., 2019 (https://arxiv.org/pdf/1906.02530.pdf):

  • (Vanilla) Maximum softmax probability (Hendrycks & Gimpel, 2017)
  • (Temp Scaling) Post-hoc calibration by temperature scaling using a validation set (Guo et al., 2017); see Calibration > Temperature Scaling
  • (Dropout) Monte-Carlo Dropout (Gal & Ghahramani, 2016; Srivastava et al., 2014) with rate $p$ (sketched after this list)
  • (Ensembles) Ensembles of $M$ networks trained independently on the entire dataset using random initialization (Lakshminarayanan et al., 2017)
  • (SVI) Stochastic Variational Bayesian Inference for deep learning (Blundell et al., 2015; Graves, 2011; Louizos & Welling, 2017, 2016; Wen et al., 2018).
  • (LL) Approx. Bayesian inference for the parameters of the last layer only (Riquelme et al., 2018)
    • (LLSVI) Mean field stochastic variational inference on the last layer only
    • (LL Dropout) Dropout only on the activations before the last layer
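
As an example of one method from the list above, here is a minimal MC Dropout sketch in PyTorch; the toy architecture and the binary-classification setup are assumptions, and the essential trick is simply to keep dropout layers active at inference and average $M$ stochastic forward passes:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, num_samples: int = 30):
    """Average `num_samples` stochastic forward passes with dropout enabled.

    Returns the mean predictive probability (an estimate of p(y | x)) and
    its variance across samples (a model-uncertainty estimate).
    """
    model.eval()
    # Re-enable only the dropout layers, leaving e.g. batch norm in eval mode.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()

    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(x)) for _ in range(num_samples)])
    return probs.mean(dim=0), probs.var(dim=0)

# Toy binary classifier with a dropout layer (hypothetical architecture).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 1))
mean_p, var_p = mc_dropout_predict(model, torch.randn(4, 16))
```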

Evaluating Uncertainty Quality

Negative Log-Likelihood (NLL)

Commonly used to evaluate the quality of model uncertainty on some held-out set. Lower is better.

Drawbacks: Although NLL is a proper scoring rule (the optimum score corresponds to a perfect prediction), it can over-emphasize tail probabilities.
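
A minimal sketch for binary predictions, assuming held-out labels and (marginal) predictive probabilities are available as arrays:

```python
import numpy as np

def negative_log_likelihood(y_true, p_pred, eps=1e-12):
    """Mean NLL of held-out binary labels under predicted probabilities."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.8, 0.4])
print(negative_log_likelihood(y_true, p_pred))  # lower is better
```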

Brier Score

Proper scoring rule for measuring the accuracy of predicted probabilities. It is computed as the squared error between a predicted probability vector, $p\left(y \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)$, and the one-hot encoded true response, $y_{n}$. That is,

$$ \mathrm{BS}=|\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}}\left(p\left(y \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)-\delta\left(y-y_{n}\right)\right)^{2}=|\mathcal{Y}|^{-1}\left(1-2 p\left(y_{n} \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)+\sum_{y \in \mathcal{Y}} p\left(y \mid \boldsymbol{x}_{n}, \boldsymbol{\theta}\right)^{2}\right) . $$

Drawbacks: Brier score is insensitive to predicted probabilities associated with infrequent events.
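
A minimal sketch of the per-example score above, averaged over a batch; the class-probability matrix and integer labels are assumed inputs:

```python
import numpy as np

def brier_score(probs, labels):
    """Mean Brier score: squared error between predicted probability vectors
    and one-hot labels, normalized by the number of classes |Y|."""
    n, num_classes = probs.shape
    one_hot = np.eye(num_classes)[labels]
    return np.mean(np.sum((probs - one_hot) ** 2, axis=1) / num_classes)

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 2])  # the second example is misclassified
print(brier_score(probs, labels))
```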

Calibration > Expected Calibration Error ECE

Predictive Entropy

The smaller the PE, the more confident the model is about its predictions. With $\mu_{c}=p(y=c \mid x)$ denoting the predictive probability of class $c$:

$$ P E=-\sum_{c} \mu_{c} \log \mu_{c} $$
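
A minimal sketch, assuming a matrix with one row of class probabilities $\mu_c$ per example:

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Entropy of the predictive distribution, one score per example."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

probs = np.array([[0.98, 0.01, 0.01],   # confident prediction -> low PE
                  [0.34, 0.33, 0.33]])  # uncertain prediction -> high PE
print(predictive_entropy(probs))
```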

Calibration AUC

A common approach to evaluating a model’s uncertainty quality is to measure its calibration performance, i.e., whether the model’s predictive uncertainty is indicative of its predictive error.

This metric evaluates uncertainty estimation by recasting it as a binary prediction problem, where the binary label is the model's prediction error $\mathbb{I}\left(f\left(x_{i}\right) \neq y_{i}\right)$, and the predictive score is the model uncertainty. This formulation leads to the uncertainty confusion matrix:

                       Uncertain   Certain
Inaccurate prediction  TP          FN
Accurate prediction    FP          TN

TP - Prediction is inaccurate and the model is uncertain
TN - Prediction is accurate and model is certain
FN - Prediction is inaccurate and model is certain i.e. overconfidence
FP - Prediction is accurate and model is uncertain i.e. under-confidence

Precision - TP/(TP+FP) - fraction of uncertain examples where the prediction is inaccurate
Recall - TP/(TP+FN) - fraction of inaccurate examples where the model is uncertain
False positive rate (FPR) - FP/(FP+TN) - fraction of accurate examples where the model is uncertain, i.e. under-confidence
Accuracy - (TP+TN)/(TP+TN+FP+FN)

Thus, the model's calibration performance can be measured using the area under the precision-recall curve (Calibration AUPRC) and the area under the ROC curve (Calibration AUROC).
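
A minimal sketch with scikit-learn, assuming per-example labels, predictions, and an uncertainty score (e.g. predictive entropy); average precision is used here as the AUPRC summary:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def calibration_auc(y_true, y_pred, uncertainty):
    """Calibration AUROC/AUPRC: how well uncertainty ranks prediction errors.

    The binary target is the prediction error I(y_pred != y_true); the
    score for each example is the model's uncertainty.
    """
    errors = (y_pred != y_true).astype(int)
    return {
        "calibration_auroc": roc_auc_score(errors, uncertainty),
        "calibration_auprc": average_precision_score(errors, uncertainty),
    }

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
uncertainty = np.array([0.1, 0.2, 0.8, 0.3, 0.7, 0.2])  # e.g. predictive entropy
print(calibration_auc(y_true, y_pred, uncertainty))
```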

References

  1. Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation, Kivlichan et al., 2022
  2. Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift, Ovadia et al., 2019