Softmax and Cross Entropy Loss
Understanding the intuition and maths behind softmax and the cross entropy loss — the ubiquitous combination in machine learning.
The Softmax Function
The softmax function takes an N-dimensional vector of real numbers and transforms it into a probability vector where each element lies in the range (0, 1) and all elements sum to 1.
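Concretely, for an input vector $a = (a_1, \ldots, a_N)$, the $i$-th output is:

$$\mathrm{softmax}(a)_i = p_i = \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}}$$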
As the name suggests, softmax is a "soft" version of the max function. Rather than selecting a single maximum, it distributes the total probability mass (1) across all elements, with the largest value receiving the greatest share while smaller values receive proportionally less.
This probability distribution output makes softmax well-suited for classification tasks, where we want to interpret model outputs as class probabilities.
A naive Python implementation looks like this:
```python
import numpy as np

def softmax(X):
    exps = np.exp(X)
    return exps / np.sum(exps)
```
However, floating point numbers in NumPy have a limited range: for float64 the largest representable value is about $1.8 \times 10^{308}$, so `np.exp` overflows to `inf` for inputs larger than roughly 710, and the subsequent `inf / inf` division produces `nan`.
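A quick demonstration of the overflow, using the naive softmax from above:

```python
import numpy as np

def softmax(X):
    # naive implementation: overflows for large inputs
    exps = np.exp(X)
    return exps / np.sum(exps)

# exp(1000) overflows to inf, and inf / inf is nan
with np.errstate(over="ignore", invalid="ignore"):
    out = softmax(np.array([1000.0, 1000.0]))
print(out)  # [nan nan]
```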
To make softmax numerically stable, we multiply the numerator and denominator by a constant $C$:

$$p_i = \frac{C e^{a_i}}{C \sum_{k=1}^N e^{a_k}} = \frac{e^{a_i + \log(C)}}{\sum_{k=1}^N e^{a_k + \log(C)}}$$

We can choose any value for $\log(C)$, but the standard choice is $\log(C) = -\max(a)$. This shifts all elements so the maximum value becomes zero. The shifted values are all non-positive, so their exponentials saturate toward zero rather than overflowing to infinity.
The numerically stable version looks like this:
```python
def stable_softmax(X):
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)
```
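A quick sanity check that the stable version handles inputs the naive one cannot, and that shifting the input leaves the output unchanged:

```python
import numpy as np

def stable_softmax(X):
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)

a = np.array([0.0, 1.0, 2.0])
p = stable_softmax(a)
print(np.allclose(p, stable_softmax(a + 1000)))  # True: softmax is shift-invariant
print(np.isclose(p.sum(), 1.0))                  # True: outputs sum to 1
```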
Derivative of Softmax
Softmax is commonly used as the output activation in classification networks. To train with backpropagation, we need its gradient with respect to each input $a_j$.
Writing $p_i = \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}}$ and applying the quotient rule, for $f(x) = \frac{g(x)}{h(x)}$ we have $f'(x) = \frac{g'(x)h(x) - h'(x)g(x)}{h(x)^2}$.
In our case $g = e^{a_i}$ and $h = \sum_{k=1}^N e^{a_k}$. The sum $h$ always contains the term $e^{a_j}$, so $\frac{\partial h}{\partial a_j} = e^{a_j}$. The numerator $g = e^{a_i}$ depends on $a_j$ only when $i = j$; otherwise its derivative is 0.
If $i = j$,

$$\frac{\partial p_i}{\partial a_j} = \frac{e^{a_i} \sum_{k} e^{a_k} - e^{a_j} e^{a_i}}{\left(\sum_{k} e^{a_k}\right)^2} = p_i (1 - p_j)$$
For $i \neq j$,

$$\frac{\partial p_i}{\partial a_j} = \frac{0 - e^{a_j} e^{a_i}}{\left(\sum_{k} e^{a_k}\right)^2} = -p_j \, p_i$$
So the derivative of the softmax function is:

$$\frac{\partial p_i}{\partial a_j} = \begin{cases} p_i (1 - p_j) & \text{if } i = j \\ -p_j \, p_i & \text{if } i \neq j \end{cases}$$
Or, using the Kronecker delta $\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$:

$$\frac{\partial p_i}{\partial a_j} = p_i (\delta_{ij} - p_j)$$
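This formula is easy to verify numerically. The `softmax_jacobian` helper below is illustrative (not from the original post); it builds $p_i(\delta_{ij} - p_j)$ as a matrix and compares it against central finite differences:

```python
import numpy as np

def stable_softmax(X):
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)

def softmax_jacobian(a):
    # J[i, j] = p_i * (delta_ij - p_j)
    p = stable_softmax(a)
    return np.diag(p) - np.outer(p, p)

# compare against central finite differences
a = np.array([0.5, -1.0, 2.0])
eps = 1e-6
num = np.zeros((3, 3))
for j in range(3):
    da = np.zeros(3)
    da[j] = eps
    num[:, j] = (stable_softmax(a + da) - stable_softmax(a - da)) / (2 * eps)

J = softmax_jacobian(a)
print(np.allclose(J, num, atol=1e-8))  # True
```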
Derivative of Cross Entropy Loss with Softmax
Softmax paired with cross entropy loss is the standard output layer for classification. The cross entropy loss for a single example is $L = -\sum_k y_k \log(p_k)$, where $y$ is the one-hot label vector and $p$ is the softmax output. Differentiating with respect to the logit $a_i$:

$$\frac{\partial L}{\partial a_i} = -\sum_k y_k \frac{\partial \log(p_k)}{\partial a_i} = -\sum_k \frac{y_k}{p_k} \frac{\partial p_k}{\partial a_i}$$
Substituting the softmax derivative from above, and splitting on the $k = i$ and $k \neq i$ cases:

$$\frac{\partial L}{\partial a_i} = -\frac{y_i}{p_i} p_i (1 - p_i) - \sum_{k \neq i} \frac{y_k}{p_k} (-p_k \, p_i) = -y_i + y_i p_i + \sum_{k \neq i} y_k p_i$$
Since $y$ is a one-hot encoded label vector, $\sum_k y_k = y_i + \sum_{k \neq i} y_k = 1$. This gives:

$$\frac{\partial L}{\partial a_i} = p_i - y_i$$
A remarkably simple result. In code:
```python
def delta_cross_entropy(X, y):
    """
    X is the output from the fully connected layer (num_examples x num_classes)
    y is labels (num_examples x 1)
    Note that y is not a one-hot encoded vector.
    It can be computed as y.argmax(axis=1) from one-hot encoded labels if required.
    """
    m = y.shape[0]
    # softmax must be applied row-wise for a batch of examples
    exps = np.exp(X - np.max(X, axis=1, keepdims=True))
    grad = exps / np.sum(exps, axis=1, keepdims=True)
    # subtract 1 from the probability of each true class: grad = p - y
    grad[range(m), y] -= 1
    grad = grad / m
    return grad
```
Citation
If you find this post useful, please cite it as:
Dahal, Paras. (Jun 2017). Softmax and Cross Entropy Loss. Paras Dahal. https://parasdahal.com/softmax-crossentropy.
Or in BibTeX format:
@article{dahal2017softmax,
title = "Softmax and Cross Entropy Loss",
author = "Dahal, Paras",
journal = "parasdahal.com",
year = "2017",
month = "Jun",
url = "https://parasdahal.com/softmax-crossentropy"
}