
Deriving categorical cross entropy and softmax

Introduction

Recently, on the PyTorch discussion forum, someone asked about the derivation of categorical cross entropy and softmax, so I thought it would be a good idea to write a blog post working through it in more detail.

Cross-Entropy from information theory

The idea of this section is to understand categorical cross entropy with a bit of information theory background. The definition of cross-entropy is:

The cross entropy is the average number of bits required to encode an event drawn from a probability distribution if another probability distribution is used to define the code.

A layman's definition, in terms of surprise, would be:

How much surprise, on average, I will get about the outcome of an event if I assume that my model distribution describes the true distribution of the event.
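
To make the "average number of bits" reading concrete, here is a small toy example (not from the original derivation) that computes entropy and cross-entropy in base 2 with NumPy:

```python
import numpy as np

# Toy example: true distribution p and a model distribution q over 4 outcomes.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Entropy H(p): the best achievable average code length (in bits) for events from p.
entropy = -np.sum(p * np.log2(p))
# Cross-entropy H(p, q): average bits needed when the code is built for q instead.
cross_entropy = -np.sum(p * np.log2(q))

print(f"H(p)    = {entropy:.3f} bits")        # 1.750
print(f"H(p, q) = {cross_entropy:.3f} bits")  # 2.000 (always >= H(p))
```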

Cross-Entropy loss

We want to minimize this loss/surprise/average number of bits required. With a one-hot target, the cross-entropy loss is simply the negative log-likelihood of the true class under the predicted distribution. So if $ y $ is the true distribution and $ s $ is the distribution we model it with, the cross-entropy loss is

\[\mathcal{L}(s, y) = - \sum_{i=1}^{C} y_i \log s_i\]

where $ s $ is the predicted distribution, i.e., the output of the softmax applied to the logits, and $ y $ is the true distribution in the form of a one-hot encoded vector.
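
A minimal NumPy sketch of this loss, assuming $ s $ is already a valid probability vector (the softmax output) and $ y $ is one-hot:

```python
import numpy as np

def cross_entropy_loss(s, y):
    """Categorical cross-entropy: -sum_i y_i * log(s_i).

    s : predicted probabilities (softmax output), shape (C,)
    y : one-hot encoded true distribution, shape (C,)
    """
    return -np.sum(y * np.log(s))

s = np.array([0.7, 0.2, 0.1])    # predicted distribution over 3 classes
y = np.array([1.0, 0.0, 0.0])    # the true class is class 0
print(cross_entropy_loss(s, y))  # -log(0.7) ≈ 0.357
```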

Softmax

Softmax is a generalization of the logistic function to multiple dimensions. It takes a vector of K real numbers as input and normalizes it into a probability distribution of K probabilities: the outputs are non-negative and sum to 1.

Mathematically, softmax is a vector function that maps an input vector to an output vector, $ s : \mathbb{R}^n \to \mathbb{R}^n $:

\[\text{softmax}(x)_i = \frac{e^{x_i / T}}{\sum_{j=1}^{C} e^{x_j / T}}\]

where $T$ is the temperature of the softmax, usually set to $1$. I will talk about temperature in more detail in a later blog post; for now, let's assume that $T = 1$ and the function becomes

\[s_i = \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}} \tag{1}\]

The reason we talk about softmax here is that we need the softmax and its derivative to get the derivative of the cross-entropy loss.
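
For reference, here is a small softmax sketch with the temperature as an optional argument; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(x, T=1.0):
    """Softmax of a 1-D array x with temperature T (Eq. 1 when T = 1)."""
    z = x / T
    z = z - np.max(z)          # shift by the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([2.0, 1.0, 0.1])
s = softmax(x)
print(s, s.sum())              # non-negative probabilities that sum to 1
```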

Derivation of softmax

When we talk about the derivative of a vector function, we talk about its Jacobian. The Jacobian of softmax is the matrix of all first-order partial derivatives of the softmax function.

\[\nabla \text{softmax} = \left(\begin{array}{cccc} \frac{\partial s_1}{\partial x_1} & \frac{\partial s_1}{\partial x_2} & \cdots & \frac{\partial s_1}{\partial x_n} \\ \frac{\partial s_2}{\partial x_1} & \frac{\partial s_2}{\partial x_2} & \cdots & \frac{\partial s_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial s_n}{\partial x_1} & \frac{\partial s_n}{\partial x_2} & \cdots & \frac{\partial s_n}{\partial x_n} \end{array}\right) \tag{2}\]

Since each output element depends on all inputs through the normalizing sum, the off-diagonal elements of the Jacobian are not zero.

Let's begin by taking the log of softmax. We often use $\log$ and $\ln$ interchangeably: in information theory we usually stick to $\log$ with base 2, but in deep learning we more frequently use the natural log, i.e., $\log$ with base $e$ $= \ln$. \(\begin{align*} \log s_i &= \log \left( \frac{e^{x_i}}{\sum_{j=1}^{C} e^{x_j}} \right) \\ &= \log \left( e^{x_i} \right) - \log \left( \sum_{j=1}^{C} e^{x_j} \right) \tag{log(a/b) = log a - log b} \\ &= x_i - \log \left( \sum_{j=1}^{C} e^{x_j} \right) \end{align*}\) Now, differentiating the above equation with respect to some arbitrary $x_k$, we get

\[\begin{align*} \frac{\partial \log s_i}{\partial x_k} &= \frac{\partial x_i}{\partial x_k} - \frac{\partial \log \left( \sum_{j=1}^{C} e^{x_j} \right)}{\partial x_k} \tag{3} \\ \end{align*}\]

The first term on the RHS is \(\frac{\partial x_i}{\partial x_k} = \begin{cases} 1 & i = k \\ 0 & \text{otherwise} \end{cases} = \mathbb{1}(i = k) = \delta_{ik}\) where $\mathbb{1}(i = k)$ is the indicator function, which in this case can also be represented as $\delta_{ik}$, the Kronecker delta.

And the second term on the RHS can be simplified as

\[\begin{align*} \frac{\partial \log \left( \sum_{j=1}^{C} e^{x_j} \right)}{\partial x_k} &= \frac{1}{\sum_{j=1}^{C} e^{x_j}} \frac{\partial \left( \sum_{j=1}^{C} e^{x_j} \right)}{\partial x_k} \\ &= \frac{1}{\sum_{j=1}^{C} e^{x_j}} \sum_{j=1}^{C} \frac{\partial e^{x_j}}{\partial x_k} \\ &= \frac{1}{\sum_{j=1}^{C} e^{x_j}} \sum_{j=1}^{C} e^{x_j} \frac{\partial x_j}{\partial x_k} \\ &= \frac{1}{\sum_{j=1}^{C} e^{x_j}} \sum_{j=1}^{C} e^{x_j} \mathbb{1}(j = k) \\ &= \frac{1}{\sum_{j=1}^{C} e^{x_j}} \sum_{j=1}^{C} e^{x_j} \delta_{jk} \\ &= \frac{e^{x_k}}{\sum_{j=1}^{C} e^{x_j}} \\ &= s_k \end{align*}\]

So the final equation becomes:

\[\begin{align*} \frac{\partial \log s_i}{ \partial x_k} &= \delta_{ik} - s_k \\ \frac{1}{s_i} \frac{\partial s_i}{\partial x_k} &= \delta_{ik} - s_k \\ \frac{\partial s_i}{\partial x_k} &= s_i \left(\delta_{ik} - s_k \right) \tag{4} \\ \end{align*}\]

We can now use $\frac{\partial s_i}{\partial x_k}$ to fill in the Jacobian of softmax from equation (2).

\[\nabla \text{softmax} = \left(\begin{array}{cccc} s_1\left(1 - s_1\right) & -s_1s_2 & \cdots & -s_1s_n \\ -s_2s_1 & s_2(1-s_2) & \cdots & -s_2s_n \\ \vdots & \vdots & \ddots & \vdots \\ -s_ns_1 & -s_ns_2 & \cdots & s_n(1-s_n) \end{array}\right) \tag{5}\]

Equation (5) is the Jacobian of the softmax function, i.e., what we need for its backward/reverse pass.
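
In matrix form, Eq. 5 is $\text{diag}(s) - s s^\top$. A short sketch (using an arbitrary example vector $s$) that builds the Jacobian and checks it against Eq. 4:

```python
import numpy as np

def softmax_jacobian(s):
    """Jacobian of softmax at the output s (Eq. 5): diag(s) - s s^T."""
    return np.diag(s) - np.outer(s, s)

s = np.array([0.6, 0.3, 0.1])   # an example softmax output
J = softmax_jacobian(s)

# Entry (i, k) equals s_i * (delta_ik - s_k), matching Eq. 4.
i, k = 0, 1
print(np.isclose(J[i, k], s[i] * ((i == k) - s[k])))   # True
# Each row sums to zero because the outputs are constrained to sum to 1.
print(np.allclose(J.sum(axis=1), 0.0))                 # True
```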

Differentiating cross-entropy loss

We differentiate the cross-entropy loss with respect to an arbitrary input of the softmax, $x_k$, because the local derivative of softmax from Eq. 4 makes the loss's local derivative come out very simply.

\[\begin{align*} \mathcal{L}(s, y) &= - \sum_{i=1}^{C} y_i \log s_i \\ \frac{\partial \mathcal{L}(s, y)}{\partial x_k} &= - \sum_{i=1}^{C} y_i \frac{\partial \log s_i}{\partial x_k} \\ &= - \sum_{i=1}^{C} \frac{y_i}{s_i} \frac{\partial s_i}{\partial x_k} \\ &= - \sum_{i=1}^{C} \frac{y_i}{s_i} s_i \left(\delta_{ik} - s_k \right) \tag{from 4}\\ &= - \sum_{i=1}^{C} y_i \left(\delta_{ik} - s_k \right) \\ &= - \sum_{i=1}^{C} y_i \delta_{ik} + \sum_{i=1}^{C} y_i s_k \end{align*}\]

The first term is nonzero only when $i = k$, in which case it equals $y_k$, so

\[\begin{align*} \frac{\partial \mathcal{L}(s, y)}{\partial x_k} &= - y_k + s_k \sum_{i=1}^{C} y_i \end{align*}\]

where $\sum_{i=1}^C y_i = 1$ since $y$ is a one-hot vector.

\[\begin{align*} \frac{\partial \mathcal{L}(s, y)}{\partial x_k} &= - y_k + s_k \\ \frac{\partial \mathcal{L}(s, y)}{\partial x_k} &= s_k - y_k \end{align*}\]

In the vectorized (numpy array/torch tensor) form it can simply be written as

\[\begin{align*} \frac{\partial \mathcal{L}(s, y)}{\partial x} &= s - y \tag{6} \end{align*}\]

which completes the derivation: the gradient of the categorical cross-entropy loss with respect to the logits is simply $s - y$.
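
Since the original question came from the PyTorch forum, one way to sanity-check Eq. 6 is to compare it with autograd. A small sketch, assuming a single example with $C = 5$ classes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, requires_grad=True)    # logits for C = 5 classes
target = torch.tensor(2)                  # index of the true class

# PyTorch's cross_entropy applies softmax internally, so we pass raw logits.
loss = F.cross_entropy(x.unsqueeze(0), target.unsqueeze(0))
loss.backward()

# Analytic gradient from Eq. 6: s - y.
s = F.softmax(x.detach(), dim=0)
y = F.one_hot(target, num_classes=5).float()
print(torch.allclose(x.grad, s - y))      # True
```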

I hope this helps with the theoretical understanding of softmax and the cross-entropy loss function. I have tried to keep it as simple as possible and to explain every single step. If you have any questions or suggestions, please let me know, and I will try to answer them as soon as possible.

This post is licensed under CC BY 4.0 by the author.
