The Softmax Function

The softmax function takes an N-dimensional vector of real numbers and transforms it into a probability vector where each element lies in the range (0, 1) and all elements sum to 1.

$$ p_i = \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}} $$

As the name suggests, softmax is a "soft" version of the max function. Rather than selecting a single maximum, it distributes the total probability mass (1) across all elements, with the largest value receiving the greatest share while smaller values receive proportionally less.

This probability distribution output makes softmax well-suited for classification tasks, where we want to interpret model outputs as class probabilities.
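A quick numeric illustration (input values chosen arbitrarily) shows how the largest element takes the largest share of the probability mass:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
p = np.exp(a) / np.sum(np.exp(a))

print(p)        # largest input receives the largest probability
print(p.sum())  # all elements sum to 1
```

Note that the output is "soft": the smaller elements still receive non-zero probability, unlike a hard max.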

A naive Python implementation looks like this:

import numpy as np

def softmax(X):
    exps = np.exp(X)
    return exps / np.sum(exps)

However, floating point numbers in NumPy have a limited range. For float64, the largest representable value is about $1.8 \times 10^{308}$, so np.exp overflows to inf for inputs above roughly 710, and the resulting inf / inf division produces nan.
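We can reproduce the failure directly (overflow and invalid-value warnings suppressed for clarity):

```python
import numpy as np

def softmax(X):
    exps = np.exp(X)
    return exps / np.sum(exps)

with np.errstate(over='ignore', invalid='ignore'):
    out = softmax(np.array([1000.0, 1000.0]))

print(out)  # [nan nan]: exp(1000) overflows to inf, and inf / inf is nan
```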

To make softmax numerically stable, we multiply both the numerator and the denominator by a constant $C$.

$$ \begin{align*} p_i &= \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}} \\ &= \frac{Ce^{a_i}}{C\sum_{k=1}^N e^{a_k}} \\ &= \frac{e^{a_i + \log(C)}}{\sum_{k=1}^N e^{a_k + \log(C)}} \\ \end{align*} $$

We can choose any value for $\log(C)$, but the standard choice is $\log(C) = -\max(a)$. This shifts all elements so the maximum value becomes zero; every exponent is then non-positive, so $e^{a_i - \max(a)}$ can at worst underflow to zero but never overflow to infinity.

The numerically stable version looks like this:

def stable_softmax(X):
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)
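The same inputs that broke the naive version are now handled cleanly:

```python
import numpy as np

def stable_softmax(X):
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)

# Inputs that overflow the naive implementation work fine after the shift
p = stable_softmax(np.array([1000.0, 1000.0]))
print(p)  # [0.5 0.5]
```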

Derivative of Softmax

Softmax is commonly used as the output activation in classification networks. To train with backpropagation, we need its gradient with respect to each input $a_j$.

$$ \begin{align*} \frac{\partial p_i}{\partial a_j} &= \frac{\partial \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}}}{\partial a_j} \\ \end{align*} $$

Applying the quotient rule, for $f(x) = \frac{g(x)}{h(x)}$ we have $f'(x) = \frac{g'(x)h(x) - h'(x)g(x)}{h(x)^2}$.

Here $g(a_j) = e^{a_i}$ and $h(a_j) = \sum_{k=1}^N e^{a_k}$. The denominator $h(a_j)$ always contains $e^{a_j}$, so its derivative with respect to $a_j$ is always $e^{a_j}$. The numerator $g(a_j) = e^{a_i}$ depends on $a_j$ only when $i = j$; otherwise its derivative is 0.

If $i=j$,

$$ \begin{align*} \frac{\partial \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}}}{\partial a_j}&= \frac{e^{a_i} \sum_{k=1}^N e^{a_k} - e^{a_j}e^{a_i}}{\left( \sum_{k=1}^N e^{a_k}\right)^2} \\ &= \frac{e^{a_i} \left( \sum_{k=1}^N e^{a_k} - e^{a_j}\right )}{\left( \sum_{k=1}^N e^{a_k}\right)^2} \\ &= \frac{ e^{a_i} }{\sum_{k=1}^N e^{a_k} } \times \frac{ \sum_{k=1}^N e^{a_k} - e^{a_j} }{\sum_{k=1}^N e^{a_k} } \\ &= p_i(1-p_j) \end{align*} $$

For $i \neq j$,

$$ \begin{align*} \frac{\partial \frac{e^{a_i}}{\sum_{k=1}^N e^{a_k}}}{\partial a_j}&= \frac{0 - e^{a_j}e^{a_i}}{\left( \sum_{k=1}^N e^{a_k}\right)^2} \\ &= -\frac{ e^{a_j} }{\sum_{k=1}^N e^{a_k} } \times \frac{e^{a_i} }{\sum_{k=1}^N e^{a_k} } \\ &= - p_j p_i \end{align*} $$

So the derivative of the softmax function is:

$$ \frac{\partial p_i}{\partial a_j} = \begin{cases} p_i(1-p_j) & \text{if } i=j \\ -p_j p_i & \text{if } i \neq j \end{cases} $$

Or, using the Kronecker delta $\delta_{ij} = \begin{cases} 1 & \text{if } i=j \\ 0 & \text{if } i\neq j \end{cases}$,

$$ \frac{\partial p_i}{\partial a_j} = p_i(\delta_{ij}-p_j) $$
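The compact form $p_i(\delta_{ij}-p_j)$ translates directly into a Jacobian matrix. A minimal sketch (the helper name `softmax_jacobian` and the test input are illustrative), checked against a central finite difference:

```python
import numpy as np

def stable_softmax(X):
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)

def softmax_jacobian(a):
    """Jacobian J[i, j] = p_i * (delta_ij - p_j) = diag(p) - p p^T."""
    p = stable_softmax(a)
    return np.diag(p) - np.outer(p, p)

a = np.array([0.5, 1.0, -2.0])
J = softmax_jacobian(a)

# Check column j=1 against a finite difference in a_1
eps = 1e-6
e1 = np.eye(3)[1]
num = (stable_softmax(a + eps * e1) - stable_softmax(a - eps * e1)) / (2 * eps)
print(np.allclose(J[:, 1], num, atol=1e-5))  # True
```

Each column of $J$ sums to zero, since the probabilities always sum to 1 regardless of the inputs.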

Derivative of Cross Entropy Loss with Softmax

Softmax paired with cross entropy loss is the standard output layer for classification. Using the softmax derivative derived above, we can compute the gradient of the loss with respect to the pre-activation outputs $o_i$.

$$ \begin{align*} L &= - \sum_i y_i \log(p_i) \\ \frac{\partial L}{\partial o_i} &= - \sum_k y_k \frac{\partial \log(p_k)}{\partial o_i } \\ &= - \sum_k y_k \frac{\partial \log(p_k)}{\partial p_k} \times \frac{\partial p_k}{ \partial o_i} \\ &= - \sum_k y_k \frac{1}{p_k} \times \frac{\partial p_k}{\partial o_i} \end{align*} $$

Substituting the softmax derivative from above, and splitting on the $k = i$ and $k \neq i$ cases:

$$ \begin{align*} \frac{\partial L}{\partial o_i} &= -y_i(1-p_i) - \sum_{k\neq i} y_k \frac{1}{p_k}(-p_k p_i) \\ &= -y_i(1-p_i) + \sum_{k \neq i} y_k p_i \\ &= - y_i + y_ip_i + \sum_{k \neq i} y_k p_i \\ &= p_i\left( y_i + \sum_{k \neq i} y_k\right) - y_i \end{align*} $$

Since $y$ is a one-hot encoded label vector, $\sum_k y_k = 1$, so $y_i + \sum_{k \neq i} y_k = 1$. This gives:

$$ \frac{\partial L}{\partial o_i} = p_i - y_i $$
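The result $p_i - y_i$ is easy to verify numerically: differentiate the cross entropy loss by finite differences and compare (the test vectors below are arbitrary):

```python
import numpy as np

def stable_softmax(X):
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)

def cross_entropy(o, y):
    """L = -sum_i y_i * log(p_i) for a one-hot label vector y."""
    return -np.sum(y * np.log(stable_softmax(o)))

o = np.array([0.2, -1.3, 0.8])
y = np.array([0.0, 0.0, 1.0])  # one-hot: class 2 is correct

analytic = stable_softmax(o) - y  # the p - y result derived above

# Central-difference gradient of the loss with respect to each o_i
eps = 1e-6
numeric = np.array([
    (cross_entropy(o + eps * e, y) - cross_entropy(o - eps * e, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```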

A remarkably simple result. In code:

def delta_cross_entropy(X, y):
    """
    X is the output from the fully connected layer (num_examples x num_classes).
    y holds integer class labels (num_examples,).
    Note that y is not one-hot encoded; it can be obtained as
    y.argmax(axis=1) from one-hot label vectors if required.
    """
    m = y.shape[0]
    # Row-wise stable softmax: each example is normalized independently
    exps = np.exp(X - np.max(X, axis=1, keepdims=True))
    grad = exps / np.sum(exps, axis=1, keepdims=True)
    grad[range(m), y] -= 1  # subtract 1 at the true class: p - y
    grad = grad / m         # average the gradient over the batch
    return grad
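A quick sanity check on a small batch (the logits and labels are made up; the softmax here is applied row-wise, since each example in the batch must be normalized independently):

```python
import numpy as np

def delta_cross_entropy(X, y):
    m = y.shape[0]
    # Row-wise stable softmax of the logits
    exps = np.exp(X - np.max(X, axis=1, keepdims=True))
    grad = exps / np.sum(exps, axis=1, keepdims=True)
    grad[range(m), y] -= 1  # p - y at the true class
    return grad / m

X = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])
y = np.array([0, 1])  # integer class labels, not one-hot

grad = delta_cross_entropy(X, y)
print(grad.shape)        # (2, 3)
print(grad.sum(axis=1))  # each row sums to ~0, since sum(p) = 1 = sum(y)
```

The row sums vanishing is a useful invariant to test in practice: it follows directly from $\sum_i (p_i - y_i) = 1 - 1 = 0$.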