June 1, 2020 Miscellaneous
Cross Entropy : An intuitive explanation with Entropy and KL-Divergence
Cross-entropy is something that you see over and over in machine learning and deep learning. This article explains it from an information theory perspective and tries to connect the dots. KL divergence is also very important and is used in decision trees and in generative models like variational autoencoders.

In machine learning and deep learning, cross entropy is used extensively as a loss function in classification problems. To understand the intuition behind it, we need to understand its origin in an area of computer science called information theory.

In the digital era, the main goal is to transfer data reliably and efficiently from a sender to a recipient. We know that the transfer is done in bits. The main problem is how to encode the data and, most importantly, how to tell whether an encoding is good or bad, that is, how much unnecessary information we are sending.

While sending a message, not all bits are useful. Claude Shannon proposed a way to measure how efficient the transfer is in a paper that later became the foundation of information theory.

In Shannon's theory, to transmit one bit of information means to reduce the recipient's uncertainty by a factor of 2.

What is Uncertainty factor reduction?

Let's take the example of a weather station transmitting weather information to you. We consider a few cases to understand the reduction in uncertainty for different weather states.

1. Weather being random with 2 states

In this case we assume that the weather can either be Sunny or Rainy with equal probability.

$$P(Sunny)=\frac{1}{2}=0.5 \quad \text{and} \quad P(Rainy)=\frac{1}{2}=0.5$$

These are the probabilities that we assume prior to the transmission. If the weather station transmits that it is going to be rainy, then according to Shannon, the weather station has reduced our uncertainty by a factor of 2.

It is very important to note that we are not trying to predict anything here. The probabilities are something that we assumed (generally from previous data), and we are receiving the forecast from the weather station. All we are trying to do here is measure how efficient the transfer is. It doesn't matter whether it actually rains tomorrow or not.

  • One way to interpret this: before the forecast we are only 50% certain, and after it we are 100% certain. Our certainty has increased from 50% to 100%, that is, by a factor of 2.
$$50\%*certainty\_factor\rightarrow 100\%\Longrightarrow\boxed{certainty\_factor=2}$$
  • Another way to put this is as follows. We had 2 options, one sunny and the other rainy. We were not sure (uncertain) which one we were going to receive from the weather station. After the forecast that it is going to be rainy, we are down to one option (rainy) from the available options (rainy and sunny). This is a reduction. Just like before, we can calculate the reduction in our uncertainty as follows.
$$\frac{2\enspace options}{uncertainty\_factor}\rightarrow 1\enspace option\Longrightarrow\boxed{uncertainty\_factor=2}$$

When calculating the certainty factor, we multiply by the factor because we want to measure by how much our certainty has increased.

Similarly, when calculating the uncertainty factor, we divide by the factor because we want to measure by how much our uncertainty has been reduced.

2. Weather being random with 8 possible states

In this example, let's say the weather can be in any of 8 states, ranging from sunny to rainy. It is important to note that all of them are equally likely.

$$\begin{gathered} \text{Weather states} = \lbrace S_{1},S_{2},...,S_{8} \rbrace\\ P(S_{1}) =P(S_{2}) = \cdots = P(S_{8}) =\frac{1}{8} \end{gathered}$$

From the weather station, let's say we received the forecast $S_1$. Now, let's calculate the factor by which certainty increases or uncertainty decreases.

  • Certainty $\rightarrow$ increased from $12.5\%$ to $100\%$, i.e., from $\frac{1}{8}$ to $1$.

    $$12.5\%*certainty\_factor\rightarrow 100\%\Longrightarrow \boxed{certainty\_factor=8}$$
  • Uncertainty $\rightarrow$ decreased from 8 possible cases ($\{S_1,\cdots,S_8\}$) to 1 case ($S_1$).

    $$\frac{8 \enspace options}{uncertainty\_factor} \rightarrow 1 \enspace option \Longrightarrow \boxed{uncertainty\_factor=8}$$

3. Weather with 2 states ( But NOT Equally likely )

In this case we assume that there is a 75% chance that the weather will be sunny and a 25% chance that it will be rainy. In the previous two cases, the reduction factor was the same (2 in the first example and 8 in the second) irrespective of the forecast, because all the probabilities were equal. Here, it is a little different.

Another way to put this is to imagine 4 equally likely options, of which 3 are sunny and 1 is rainy.

$$P(Sunny)=\frac{3}{4} = 0.75 \quad \text{and} \quad P(Rainy)=\frac{1}{4}=0.25$$

Because the probabilities are different, the reduction in uncertainty (or the increase in certainty) will be different for sunny and rainy.

3.1 Rainy

First, let's say that the weather station forecasted it as rainy. Before the forecast, we are only 25% sure that it is going to be rainy, but after the forecast, we are 100% sure.

  • Certainty $\rightarrow$ increased from 25% to 100%.
    $$25\%*certainty\_factor \rightarrow 100\%\Longrightarrow \boxed{certainty\_factor=4}$$

In other words, we were uncertain which of the four cases we were going to receive, but after the forecast we know for sure that it will be rainy. So we went down from 4 cases to 1.

  • Uncertainty $\rightarrow$ decreased from 4 possible options to 1 option.
    $$\frac{4 \enspace options}{uncertainty\_factor} \rightarrow 1 \enspace option\Longrightarrow \boxed{uncertainty\_factor=4}$$

3.2 Sunny

If the forecast is sunny, we can calculate the increase in certainty or the decrease in uncertainty the same way as before. This one is a little tricky, but if you followed the previous cases it should feel intuitive.

  • Certainty $\rightarrow$ increased from $75\%$ to $100\%$.

    $$75\%*certainty\_factor\rightarrow 100\%\Longrightarrow \boxed{certainty\_factor=\frac{4}{3}}$$
  • Uncertainty $\rightarrow$ decreased from 4 possible options (3 sunny and 1 rainy) to 3 options (sunny).

    $$\frac{4\enspace options}{uncertainty\_factor} \rightarrow 3\enspace options\Longrightarrow \boxed{uncertainty\_factor=\frac43}$$
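In every case worked through so far, the factor comes out as one over the probability we assumed for the announced event. A minimal sketch of that observation in Python, using exact fractions; the helper name `reduction_factor` is made up for illustration:

```python
from fractions import Fraction


def reduction_factor(prior: Fraction) -> Fraction:
    """Factor by which uncertainty shrinks once an event with this prior is announced."""
    if not 0 < prior <= 1:
        raise ValueError("prior must be in (0, 1]")
    return 1 / prior


# The priors used in the three weather cases above.
cases = {
    "2 equally likely states": Fraction(1, 2),
    "8 equally likely states": Fraction(1, 8),
    "rainy forecast (25%)": Fraction(1, 4),
    "sunny forecast (75%)": Fraction(3, 4),
}

for label, prior in cases.items():
    print(f"{label}: factor = {reduction_factor(prior)}")
```

The printed factors should match the boxed values: 2, 8, 4 and 4/3.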

What about Information ?

So far we've been learning about the uncertainty factor, but what about information? According to Shannon, transferring 1 bit means reducing the recipient's uncertainty by a factor of 2. We will try to come up with a general notation below.

| Reduction in uncertainty by a factor of | Information transferred in bits |
|---|---|
| $2$ | 1 bit |
| $4$ | 2 bits |
| $8$ | 3 bits |
| $f$ | $\log_{2} f$ bits |

In all three examples above, the reduction factor is exactly the inverse of the probability that we assumed before the transmission. From this, we can write the (useful) information received as follows.

If the probability of an event happening is $p_e$, then the (useful) information transferred to the recipient is $\log_{2} \frac{1}{p_e} = -\log_{2} p_e$.

$$\text{Useful information transferred} = -\log_2{p_e} \text{ bits}.$$

We call the transferred information useful because the sender might convey the same information in many different ways (in bits). For the weather, they could encode each weather state as a string in which every character takes 1 byte, or they could pick a single character for each weather state and send that instead of the whole string. Whatever the encoding may be, according to Shannon, the useful number of bits, which we are calling information, is the negative logarithm of the probability of that event, i.e., $-\log_2{p_e}$ bits.
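The formula can be checked against the earlier weather cases with a few lines of Python. This is only a sketch; the function name `information_bits` is an illustrative choice.

```python
import math


def information_bits(p_event: float) -> float:
    """Useful information, in bits, carried by announcing an event of probability p_event."""
    if not 0.0 < p_event <= 1.0:
        raise ValueError("p_event must be in (0, 1]")
    return -math.log2(p_event)


# Probabilities from the weather examples: 1/2, 1/4, 1/8 and 3/4.
for p in (0.5, 0.25, 0.125, 0.75):
    print(f"p = {p}: {information_bits(p):.3f} bits")
```

The first three values reproduce the 1, 2 and 3 bits from the table, while the 75% sunny forecast carries only about 0.415 bits.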

The number of useful bits transmitted on average is nothing but the entropy. The physics definition is that it is a measure of uncertainty or randomness, which makes sense.

Statistically, entropy is the expected information: the average amount of useful information transferred per message. As mentioned before, the probabilities are learned from past data.

$$\boxed{\text{Entropy }=H(p) = -\sum _{i} p_{i} * \log_{2} p_{i}}$$

Let's consider the case where the probability of the weather being sunny is 75% and rainy is 25%. In this case the entropy is $0.81$ bits, i.e., $-0.75*\log_2(0.75)-0.25*\log_2(0.25)=0.81$. You can verify this with a calculator.

We can interpret this entropy as the average amount of information that we get from one sample drawn from a given probability distribution. In other words, it is just a weighted average of the information of each event, with the probabilities as weights.
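Instead of a calculator, the same check can be done in Python. The sketch below assumes base-2 logarithms (so the result is in bits) and skips zero-probability events, which contribute nothing to the sum; `entropy_bits` is an illustrative name.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits: the probability-weighted average of -log2(p)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(round(entropy_bits([0.75, 0.25]), 2))  # 0.81
print(entropy_bits([1 / 8] * 8))             # 8 equally likely states: 3 bits
```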

Entropy - Measure of Uncertainty?

It may be a little confusing at first, but entropy actually tells you how uncertain our events are. If the weather is the same almost every day, say sunny, then the average amount of information that we get is not much. In fact, it is close to zero.

The graph below shows the entropy when we have two events. Here the entropy is calculated as $y=-x\log(x) - (1-x)\log(1-x)$, because the probabilities sum to 1 when there are two events.

[Figure: Entropy for two classes]

As we have only two events, if the probability of one event is high ($x \uparrow$), then the other event's probability is low ($(1 - x)\downarrow$).

Low Entropy when events are certain

Take the previous case with two states (sunny and rainy). If the weather is sunny almost all the time, say 95% of the time, then on average the forecast carries very little information. You can observe the same in the graph as well. It makes sense intuitively too: if someone tells you that it will be sunny in a place where it is sunny almost every day (95%), then on average you do not benefit much from that message.

High Entropy for uncertain events

Consider the case where the weather is random every day. The extreme case would be 50% - 50% for sunny and rainy. In such cases, whatever message you receive from the weather station is very useful on average. You can observe the same in the graph as well: the entropy is high.

With this, you should be comfortable with entropy being a measure of uncertainty. The more uncertain the events are, the higher the entropy, and the more certain the events are, the lower the entropy.
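The shape of the two-event curve can be reproduced numerically. The sketch below assumes base-2 logarithms, so the peak is 1 bit, and defines the entropy as 0 at the endpoints, where one event is certain.

```python
import math


def binary_entropy(x: float) -> float:
    """Entropy in bits of a two-event distribution with probabilities x and 1 - x."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be in [0, 1]")
    if x in (0.0, 1.0):
        return 0.0  # a certain event carries no information
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


for x in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"P(sunny) = {x:.2f} -> entropy = {binary_entropy(x):.3f} bits")
```

The values are symmetric around 0.5, where the entropy peaks at 1 bit, and drop below 0.3 bits when one state has a 95% probability.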

Cross Entropy

Before discussing this, we first need to clear up a few things about entropy.

  • For entropy, we were calculating the ideal / useful information given just the probabilities, nothing else.

  • In all the above examples, the events come from a single source, so they are mutually exclusive and their probabilities sum to 1. That is not always the case. Events might come from independent sources: the information that you get from a weather station is completely independent of the information that you get from a different source. As these are independent, we can simply sum the information content from the two sources to get the total useful information.

    • From source $S_1$, let's say the event probabilities are $p_1 \text{ and }p_2$, and the information $I_1$ is $-p_1\log{p_1}-p_2\log{p_2}$.

    • From source $S_2$, say the event probabilities are $p_3\text{ and }p_4$, and the information $I_2$ is $-p_3\log{p_3}-p_4\log{p_4}$.

    • Total information, $\displaystyle I=I_1+I_2=-\sum ^{4}_{i=1} p_{i} *\log p_{i}$.

    • This can be treated as if all the events were coming from a single source; the overall (useful) information is not affected. All we need is a list of the event probabilities to calculate the entropy.
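One way to sanity-check the additivity claim is to build the joint distribution of two independent sources and compare its entropy with the sum of the individual entropies. The two distributions below are made-up examples, not values from the article.

```python
import math
from itertools import product


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability events contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical sources: a weather station and an unrelated second source.
source_1 = [0.75, 0.25]
source_2 = [0.5, 0.5]

# For independent sources, each joint probability is a product of marginals.
joint = [a * b for a, b in product(source_1, source_2)]

summed = entropy_bits(source_1) + entropy_bits(source_2)
print(f"I1 + I2       = {summed:.4f} bits")
print(f"joint entropy = {entropy_bits(joint):.4f} bits")
```

Both lines should print the same value, which is what lets us add the information from independent sources.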

What is cross-Entropy?

It is nothing but the average number of bits actually sent in a transmission. Consider the following example, where we encode every event with 3 bits and assume that all the events are equally likely. In this case, the cross entropy is 3 bits.

[Figure: Cross Entropy - Weather with 8 states - 1]

When we encode information in bits, we are implicitly assuming its probabilities. If we go back to the section where we calculated the information received with a message from the probability of its event, we can see that if we transfer 3 bits for an event, then the probability that we are assuming for that event is $\frac1{2^3}$.

$$\text {Information transferred in bits} = -\log_{2} p_{e}$$

From this, we can get the probability as follows.

$$n\text{ bits}=-\log_{2} p_{e} \Longrightarrow \boxed{p_{e} = \frac{1}{2^{n}}}$$

In this case we are assuming a probability of $\frac18$ for each event, and the true probability is also $\frac18$ for each class. If we take $p$ as our true distribution and $q$ as our predicted distribution, then we can write the cross entropy as follows.

$$\boxed{H(p,q) = -\sum _{i} p_{i} * \log q_{i}}$$

where $p$ is the true distribution and $q$ is the predicted distribution. It is exactly equal to the entropy if $p$ is equal to $q$, i.e., when our predicted probability distribution is the same as our true probability distribution. That is true in this case, so we can conclude that the transmission here is ideal.
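As a small illustration, the sketch below turns code lengths into the probabilities they imply ($p_e = 2^{-n}$) and evaluates the cross entropy for the 8-state, 3-bit example. The function names are illustrative, and base-2 logarithms are assumed so that the result is in bits.

```python
import math


def implied_probability(n_bits: int) -> float:
    """Probability implicitly assumed for an event that is encoded with n_bits bits."""
    return 2.0 ** -n_bits


def cross_entropy_bits(p, q):
    """H(p, q) in bits for a true distribution p and a predicted distribution q."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Terms with p_i = 0 contribute nothing; q_i = 0 with p_i > 0 raises an error.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


code_lengths = [3] * 8                               # every state is sent as 3 bits
q = [implied_probability(n) for n in code_lengths]   # 1/8 for every state
p = [1 / 8] * 8                                      # true distribution: also uniform

print(cross_entropy_bits(p, q))  # 3.0
```

Because $p$ and $q$ coincide here, this value is also the entropy of $p$.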

But this is not always the case. Suppose the states are not equally likely, with the true distribution shown below, while the encoding stays the same.

[Figure: Cross Entropy - Weather with 8 states - 1]

Here, the predicted distribution is the same ($\frac18$) for all the states, as we are using the same number of bits for every state. But the true distribution is different, so the entropy and the cross entropy will be different.

$$\begin{aligned} \text{Entropy, } H(p) &= -0.35*\log_2 0.35 - \cdots - 0.01*\log_2 0.01\\ &=2.23 \text{ bits} \end{aligned}$$

This is the useful number of bits in this transmission (on average). Let's calculate the actual number of bits that are transmitted on average, which is the cross entropy.

$$\begin{aligned} H(p,q) &= -0.35*\log_2 \left(\frac{1}{8}\right) - \cdots - 0.01*\log_2\left(\frac{1}{8}\right)\\ &=3 \text{ bits} \end{aligned}$$

On average, we are sending 3 bits, but only 2.23 bits of that is useful information. One more important thing to note is that the cross entropy is never less than the entropy. You might ask for a proof, but the intuition is this: when we use more bits for the events that are more probable, it costs us a few extra bits on average. This will become clearer with the following examples.

[Figure: Cross Entropy - Weather with 8 states - 2]

With this encoding, our cross entropy is $H(p,q)=2.42 \text{ bits}$, which is close to the entropy of $2.23 \text{ bits}$. As you can see, we are still sending about 0.19 extra bits with this encoding, but it is better than the previous one.

But if we reverse the encoding, i.e., choose more bits for the more probable events, the result is a high cross entropy. For example, take the following encoding.

[Figure: Cross Entropy - Weather with 8 states - 2]

Here, the average message length, or cross entropy, is $4.58$ bits, which is about twice the entropy ($2.23$ bits). This is because of the encoding. Here also, we can see that the cross entropy is greater than the entropy by some amount ($2.35$ bits).
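The encoding tables from the figures are not reproduced in the text, so the sketch below relies on assumptions: a true distribution and two sets of code lengths chosen to be consistent with the quoted endpoints (0.35 and 0.01) and with the quoted results (2.23, 3, 2.42 and 4.58 bits). Since a code of $n$ bits implies $q = 2^{-n}$, the cross entropy is simply the probability-weighted average code length.

```python
import math

# Assumed true distribution over the 8 weather states (not given in the text).
p_true = [0.35, 0.35, 0.10, 0.10, 0.04, 0.04, 0.01, 0.01]

# Assumed code lengths, in bits, for each state under three encodings.
encodings = {
    "fixed 3 bits per state": [3, 3, 3, 3, 3, 3, 3, 3],
    "short codes for likely states": [2, 2, 3, 3, 4, 4, 5, 5],
    "reversed (long codes for likely states)": [5, 5, 4, 4, 3, 3, 2, 2],
}


def entropy_bits(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def average_code_length(p, code_lengths):
    """Cross entropy H(p, q) in bits when q_i = 2 ** -code_lengths[i]."""
    return sum(pi * n for pi, n in zip(p, code_lengths))


h = entropy_bits(p_true)
print(f"entropy: {h:.2f} bits")
for name, lengths in encodings.items():
    ce = average_code_length(p_true, lengths)
    print(f"{name}: cross entropy = {ce:.2f} bits, extra = {ce - h:.2f} bits")
```

Under these assumptions the three encodings cost about 0.77, 0.19 and 2.35 extra bits per message, and none of them goes below the entropy.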

Hence, we can conclude that the cross entropy is always greater than or equal to the entropy, i.e., $H(p,q) \geqslant H(p)$. Put differently, the cross entropy exceeds the entropy by some amount $\geqslant 0$. This extra amount is called the relative entropy or Kullback-Leibler divergence, usually shortened to KL divergence, and is denoted as $D_{KL}(p \, || \, q)$.

$$\boxed{D_{KL}(p \, || \, q) =H(p,q) - H(p)}$$

Even though the main topic is cross entropy, KL divergence is used in several places, such as building simple decision trees, t-SNE (a dimensionality reduction technique), and even advanced generative models like variational autoencoders.

But we generally write this in a different form, one that is mostly used as a quantity measuring how far apart two given distributions are.

$$\begin{aligned} D_{KL}(p\,||\,q) &= H(p,q) - H(p)\\ &=-\sum _{i} p_{i} *\log q_{i} +\sum _{i} p_{i} *\log p_{i}\\ &=-\sum _{i} p_{i} *\log\frac{q_{i}}{p_{i}} \end{aligned}$$
$$\Longrightarrow \boxed{D_{KL}(p\,||\,q) = \sum _{i} p_{i} *\log\frac{p_{i}}{q_{i}}}$$

where $p$ is the true probability distribution and $q$ is the predicted probability distribution.

As you may have observed by now, it is not symmetric. The above equation tells you how different $q$ is when compared with $p$. The order is important. You'll know which one to use based on the problem.
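Both forms of the KL divergence, and its asymmetry, can be checked numerically. The sketch below uses base-2 logarithms and an illustrative pair of distributions: the 75%/25% weather example as $p$ and a uniform guess as $q$.

```python
import math


def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy_bits(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence_bits(p, q):
    """D_KL(p || q) = sum_i p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.75, 0.25]  # true distribution
q = [0.5, 0.5]    # predicted distribution

print(kl_divergence_bits(p, q))                   # about 0.1887
print(cross_entropy_bits(p, q) - entropy_bits(p)) # same value via H(p, q) - H(p)
print(kl_divergence_bits(q, p))                   # about 0.2075, so the order matters
```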

That is all. Now you know what cross entropy is and how it is derived from information theory. But it is used quite differently in machine learning. These details don't matter when we optimize things. In deep learning, we take the one-hot representation of the true class label as the true distribution and the probabilities predicted by the model as our predicted probability distribution.

  • Instead of the binary logarithm (base 2), we use the natural logarithm (base e). This only rescales the loss by a constant factor, so it does not affect the optimization, but it helps when calculating the gradients.

  • For multiclass classification you might've used categorical cross entropy, which is the cross entropy described above (multi-label problems usually apply binary cross entropy to each label instead). We have n neurons in the final layer for an n-class problem.

  • For binary classification, you generally use binary cross entropy, which requires only one neuron in the final layer. You can simply rewrite the cross entropy in terms of one probability, as described in one of the previous sections.

$$\text{Binary cross entropy }= -\frac{1}{N} \sum _{i=1} ^N\left(y_i\log(p_i) + (1-y_i)\log(1-p_i)\right)$$

which is also referred to as log loss or logistic loss.
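To make the machine learning usage concrete, here is a plain-Python sketch of both losses with natural logarithms. It uses the standard form of binary cross entropy, in which the two terms are added inside the sum, and assumes every predicted probability lies strictly between 0 and 1. The labels and predictions are made-up values, and real frameworks add numerical safeguards that are left out here.

```python
import math


def categorical_cross_entropy(one_hot, predicted):
    """Cross entropy in nats; with a one-hot target it reduces to -ln(q_true_class)."""
    return -sum(y * math.log(q) for y, q in zip(one_hot, predicted) if y > 0)


def binary_cross_entropy(labels, probs):
    """Mean log loss over N samples; probs are predicted probabilities of class 1."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probs)
    )
    return -total / len(labels)


# Hypothetical 3-class sample whose true class is the second one.
loss_nats = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
print(loss_nats)                # equals -ln(0.7)
print(loss_nats / math.log(2))  # the same loss in bits: a constant rescaling

# Hypothetical binary labels and predicted probabilities.
print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))
```

Dividing by $\ln 2$ converts nats to bits, which is why the choice of base does not change where the minimum of the loss lies.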

Sometimes we have to find the cross entropy but we don't have access to the true distribution, so we approximate it. Please read this Wikipedia article to learn more.


©2021 designed and coded by Ramana Reddy