Loss Functions

Mean-Squared-Error Loss

Data: inputs $\mathbf{X}=\left(\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right)^{T},$ and targets $\mathbf{t}=\left(t_{1}, \ldots, t_{N}\right)^{T}$
Assume target distribution as Gaussian:

$$ p(t \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}\left(t \mid y(\mathbf{x},\mathbf{w}),\beta^{-1}\right) $$

single target $\rightarrow$ single output unit: $y(\mathbf{x}, \mathbf{w})=h^{(L)}\left(a^{\text {out }}\right)$
Since the targets are real-valued, use the identity output activation function:

$$ y(\mathbf{x}, \mathbf{w})=h^{(L)}\left(a^{\text {out }}\right)=a^{\text {out }} $$

Maximum Likelihood/minimum negative log likelihood:

$$ E(\mathbf{w})=-\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w})=\frac{\beta}{2} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}\right)-t_{n}\right\}^{2}-\frac{N}{2} \ln \beta+\frac{N}{2} \ln 2 \pi $$

Equivalently,

$$ E(\mathbf{w})=\frac{1}{2} \sum_{n=1}^{N}\left\{y\left(\mathbf{x}_{n}, \mathbf{w}\right)-t_{n}\right\}^{2} $$

commonly referred to as the sum-of-squares error or quadratic loss; dividing by $N$ gives the Mean-Squared Error (MSE).
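The loss above can be sketched in a few lines of NumPy (function and variable names are illustrative, not from a specific library):

```python
import numpy as np

def sum_of_squares_loss(y, t):
    """E(w) = 1/2 * sum_n (y_n - t_n)^2 over network outputs y and targets t."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return 0.5 * np.sum((y - t) ** 2)

# One prediction off by 1, one exact: E = 0.5 * (1^2 + 0^2) = 0.5
print(sum_of_squares_loss([1.0, 2.0], [0.0, 2.0]))  # 0.5
```

Note the constant terms in $\beta$ drop out because they do not depend on $\mathbf{w}$, so minimizing this expression is equivalent to maximizing the Gaussian likelihood.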

Binary Cross Entropy Loss

Data: inputs $\mathbf{X}=\left(\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right)^{T},$ and targets $\mathbf{t}=\left(t_{1}, \ldots, t_{N}\right)^{T}$
Assume the target distribution is Bernoulli, and let the network output predict the probability of class 1:

$$ y(\mathbf{x}, \mathbf{w})=p(t=1 \mid \mathbf{x}) $$

$$ p(t \mid \mathbf{x}, \mathbf{w})=y(\mathbf{x}, \mathbf{w})^{t}\left(1-y(\mathbf{x}, \mathbf{w})\right)^{1-t} $$

Since the targets are binary, use the sigmoid output activation function:

$$ y(\mathbf{x}, \mathbf{w})=h^{(L)}\left(a^{\text {out }}\right)=\sigma\left(a^{\text {out }}\right) $$

Maximum Likelihood/minimum negative log likelihood:

$$ E(\mathbf{w})=-\sum_{n=1}^{N}\left\{t_{n} \ln y\left(\mathbf{x}_{n}, \mathbf{w}\right)+\left(1-t_{n}\right) \ln \left(1-y\left(\mathbf{x}_{n}, \mathbf{w}\right)\right)\right\} $$

commonly referred to as the binary cross-entropy loss.
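A minimal NumPy sketch of the sigmoid output unit and this loss (the clipping constant is a practical guard against $\ln 0$, not part of the derivation):

```python
import numpy as np

def sigmoid(a):
    """Sigmoid output activation: y = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(a, dtype=float)))

def binary_cross_entropy(y, t, eps=1e-12):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    y = np.clip(np.asarray(y, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    t = np.asarray(t, dtype=float)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Two maximally uncertain predictions (y = 0.5) cost ln 2 each.
y = sigmoid([0.0, 0.0])                       # -> [0.5, 0.5]
print(binary_cross_entropy(y, [1, 0]))        # 2 * ln 2 ≈ 1.386
```

In practice, frameworks compute this loss directly from the pre-activation $a^{\text{out}}$ (logits) for numerical stability rather than from $y$ itself.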

Cross Entropy Loss

Data: inputs $\mathbf{X}=\left(\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}\right)^{T},$ and one-hot encoded targets $\mathbf{T}=\left(\mathbf{t}_{1}, . . ., \mathbf{t}_{N}\right)^{T}$

Assume the target distribution is a generalized Bernoulli (categorical) distribution:

$$ p\left(\mathbf{t}_{n} \mid \mathbf{x}_{n}, \mathbf{w}\right) = \prod_{k=1}^{K} y_{k}\left(\mathbf{x}_{n},\mathbf{w}\right)^{t_{nk}} $$

$K$ targets $\rightarrow$ $K$ output units: $\quad y_{k}(\mathbf{x}, \mathbf{w})=h_{k}^{(L)}\left(\mathbf{a}^{\text {out }}\right)$

Since the targets are categorical, use the softmax output activation function:

$$ y_{k}(\mathbf{x}, \mathbf{w})=h_{k}^{(L)}\left(\mathbf{a}^{\text {out }}\right)=\frac{\exp \left(a_{k}^{\text {out }}\right)}{\sum_{j=1}^{K} \exp \left(a_{j}^{\text {out }}\right)} $$

Maximum Likelihood/minimum negative log likelihood:

$$ E(\mathbf{w})=-\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{k}\left(\mathbf{x}_{n}, \mathbf{w}\right) $$

commonly referred to as the (categorical) cross-entropy loss.
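The softmax and the resulting loss can be sketched as follows (the max-shift inside the softmax is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(a):
    """y_k = exp(a_k) / sum_j exp(a_j), computed row-wise."""
    a = np.asarray(a, dtype=float)
    z = a - a.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(Y, T, eps=1e-12):
    """E(w) = -sum_n sum_k t_nk ln y_k(x_n, w) for one-hot targets T."""
    Y = np.clip(np.asarray(Y, dtype=float), eps, 1.0)  # avoid log(0)
    return -np.sum(np.asarray(T, dtype=float) * np.log(Y))

# Uniform prediction over K=3 classes costs ln 3 per example.
Y = softmax([[0.0, 0.0, 0.0]])
print(cross_entropy(Y, [[1, 0, 0]]))  # ln 3 ≈ 1.099
```

Because the targets are one-hot, only the log-probability of the correct class contributes to each term of the sum, so this is equivalent to negative log-likelihood of the correct class.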

Other Loss functions

PolyLoss: reinterprets cross-entropy (and focal loss) as a weighted sum of polynomial terms in $(1-p_t)$, whose leading coefficients can be tuned per task.