Cross-Entropy, Log-Loss, And Intuition Behind It

In this blog, you will get an intuition behind the use of cross-entropy and log-loss in machine learning.

Ritesh Ranjan
Towards Data Science



You don't need any prior knowledge to go through this blog, as I will start with the basics. The following topics will be covered and explained in detail.

  1. Random variable
  2. Information content
  3. Entropy
  4. Cross-entropy and K-L divergence
  5. Log-loss

Random Variable:

This is defined as a variable whose value is the outcome of a random event. For example, we can define a variable X that takes the value of the outcome of rolling a die. Here X can take values from 1 to 6.

We can also calculate the probability that the random variable takes a particular value. For example, P(X=1) is 1/6, and the same holds for the other values as well, since they are all equally likely to occur.
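Here is a minimal Python sketch, assuming a fair die, that estimates P(X=1) by simulating many rolls:

```python
import random

# Simulate rolling a fair six-sided die many times and estimate P(X = 1).
rolls = [random.randint(1, 6) for _ in range(100_000)]
estimate = sum(1 for r in rolls if r == 1) / len(rolls)

print(f"Estimated P(X=1) = {estimate:.3f}")  # close to 1/6 ≈ 0.167
```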

Information Content (IC):

When we talk about information content, we do so in the context of a random variable. Think of it this way: if you know an event is going to occur and that event occurs frequently, then its information content is very low. If, instead, you know an event is going to occur and that event occurs very rarely, then its information content is very high. For example, take two random variables X and Y, where X tells whether the sun will rise today and Y tells whether there will be an earthquake (the surprise element) today. You can easily come to the conclusion that Y contains more information.

Information content is directly proportional to the surprise element.

Therefore, we can conclude that the information content of a random variable depends on the probability of an event occurring. If the probability is very low the information content is very high. We can write it mathematically as follows:

IC(X=S) = f(P(X=S)), where IC grows as P(X=S) shrinks.

Hence IC is a function of probability. Here P(X=S) means the probability that X will take value S. This will be used in the rest of the blog. Information content shows the following property.

Property of IC if X and Y are independent events:

  1. P(X=S and Y=T) = P(X=S) * P(Y=T)
  2. IC(X=S and Y=T) = IC(X=S) + IC(Y=T)

The second property above, which turns a product of probabilities into a sum of information contents, is exactly the behaviour of the logarithmic family of functions. This gives us the intuition of using the log as the function to calculate IC. Hence we can now define IC mathematically as follows:

IC(X=S) = -log2(P(X=S)) = log2(1 / P(X=S))
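A minimal Python sketch of this formula, taking the log to base 2 so that IC is measured in bits (the probabilities below are just illustrative):

```python
import math

def information_content(p):
    """Information content, in bits, of an event that occurs with probability p."""
    return -math.log2(p)

# Frequent events carry little information, rare events carry a lot.
print(information_content(0.999))     # ~0.001 bits (the sun rising: almost no surprise)
print(information_content(1 / 6))     # ~2.585 bits (a particular die face)
print(information_content(1 / 1000))  # ~9.966 bits (a rare event such as an earthquake)
```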

Entropy:

The entropy of a random variable is defined as the expected information content of the random variable. We will use ideas from information theory, the same ideas used in electronics and communications, to explain it in a more intuitive way.

In information theory, the number of bits needed to send the message "X = some value" is the information content of that value. A signal takes different values at different intervals, so we can model a signal as a random variable over the N possible values it can take. The expected number of bits needed to send the signal is then the expected information content of the random variable; in other words, its entropy. The following example will make it clearer.

Example: suppose the signal takes the value A with probability 1/2, B with probability 1/4, and C with probability 1/4, so A is sent with 1 bit while B and C take 2 bits each.

The entropy/expected IC is calculated using the following formula

Entropy = expected IC = sum over all values s of P(X=s) * IC(X=s) = -sum over all values s of P(X=s) * log2(P(X=s))

Using this formula we get entropy = (1/2 * 1) + (1/4 * 2) + (1/4 * 2) = 3/2. Hence, on average, we will use 1.5 bits to send this signal.
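The same calculation as a small Python sketch, reusing the probabilities 1/2, 1/4, 1/4 from the example:

```python
import math

def entropy(probs):
    """Expected information content, in bits, of a distribution."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

# Probabilities 1/2, 1/4, 1/4 from the example above.
print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```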

Cross-entropy and K-L Divergence:

Now imagine transferring this data from a sender to a receiver.

Here at the receiver, we don't know the actual distribution of the random variable. We might observe, say, 100 received signals and use them to estimate the distribution. Now let us assume that the actual distribution at the sender is ‘y’ and the estimated one is ‘y^’. Here, distribution means the probability that the random variable takes each particular value. The number of bits required to send the data is then as follows.

No. of bits at the sender (entropy) = -sum over i of yi * log2(yi)

No. of bits at the receiver (cross-entropy) = -sum over i of yi * log2(yi^)

Now, we already know that the expected number of bits is the same as the entropy. The quantity at the sender is simply called entropy, and the estimated quantity at the receiver is called cross-entropy. It is called cross-entropy because we use both the actual distribution and the estimated distribution to calculate the expected number of bits at the receiver end. Another question that may be popping into your mind is why we use the actual distribution (y) at the receiver end. The answer is that -log(yi^) is the estimated number of bits for each value of the random variable, but how often each value actually arrives at the receiver is governed by the actual distribution y. If this is still not clear, let's take an example.

Suppose we estimate P(X=A) to be 1/4 while the actual probability is 1/8. The estimated number of bits for this value is -log2(1/4) = 2, but its contribution to the final answer is 1/8 * 2 = 1/4, because we receive this value at the receiver only 1/8 of the time. I think this will be clear by now.

K-L divergence is equal to the difference between cross-entropy and entropy.

K-L divergence = cross-entropy - entropy = sum over i of yi * log2(yi / yi^)
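A short Python sketch tying these pieces together; the distributions y and y^ below are made-up values, chosen only to check that the K-L divergence equals cross-entropy minus entropy:

```python
import math

def entropy(y):
    return -sum(p * math.log2(p) for p in y if p > 0)

def cross_entropy(y, y_hat):
    # Expected number of bits when values arrive according to the actual
    # distribution y but are encoded using the estimated distribution y_hat.
    return -sum(p * math.log2(q) for p, q in zip(y, y_hat) if p > 0)

def kl_divergence(y, y_hat):
    return sum(p * math.log2(p / q) for p, q in zip(y, y_hat) if p > 0)

y     = [0.5, 0.25, 0.125, 0.125]  # actual distribution at the sender
y_hat = [0.25, 0.25, 0.25, 0.25]   # estimated distribution at the receiver

print(entropy(y))                            # 1.75 bits
print(cross_entropy(y, y_hat))               # 2.0 bits
print(kl_divergence(y, y_hat))               # 0.25 bits
print(cross_entropy(y, y_hat) - entropy(y))  # 0.25 bits, same as the K-L divergence
```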

Log-loss:

Now we will move to the machine learning section. I am assuming that we know what y^ means. If you don't, y^ is the predicted probability of a given data point belonging to a particular class/label. For example, we can have a model in machine learning which tells whether a text is abusive or not. The formula to calculate y^ is given below.

y^ = P(data point belongs to the positive class) = 1 / (1 + e^-(w.x + b))

Here ‘w’ is the weight vector, ‘x’ is the d-dimensional representation of the data point, and ‘b’ is the bias term.
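As a rough sketch, assuming the model is a logistic-regression-style classifier (the weights, bias, and data point below are made-up values), y^ can be computed like this:

```python
import math

def predict_proba(w, x, b):
    """Predicted probability y^ that data point x belongs to the positive class."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # w.x + b
    return 1.0 / (1.0 + math.exp(-z))             # sigmoid

w = [0.4, -1.2, 0.7]  # weight vector (illustrative values)
b = 0.1               # bias term
x = [1.0, 0.5, 2.0]   # d-dimensional representation of a data point

print(predict_proba(w, x, b))  # a probability between 0 and 1
```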

Our main objective here is to make the model estimate the distribution of the training dataset as accurately as possible.

Here the training set can be treated as the sender, and the model as the receiver which tries to estimate the distribution.

The best estimation happens when the K-L divergence is at its minimum. Hence we will find the (w, b) corresponding to the minimum K-L divergence. While updating (w, b) we can ignore the entropy term, as it is a constant and only the cross-entropy term varies. Hence our loss equation looks as below.

Loss = cross-entropy = -sum over data points i of yi * log(yi^)

This is the loss term that we generally call log-loss, as it contains a log term.

For binary classification, where ‘yi’ can be 0 or 1, this loss will look like loss = -(y * log(y^) + (1 - y) * log(1 - y^)). This is the form most of us are familiar with. This is all for now. I hope you got all of this.
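A small Python sketch of this binary log-loss, averaged over a few made-up data points:

```python
import math

def log_loss(y_true, y_pred):
    """Binary cross-entropy / log-loss averaged over the data points."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / len(y_true)

y_true = [1, 0, 1, 1]          # actual labels
y_pred = [0.9, 0.2, 0.7, 0.6]  # predicted probabilities y^

print(log_loss(y_true, y_pred))  # ~0.30; confident wrong predictions are punished heavily
```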

If you loved the content then let me know in the comments below.


If you want to learn what calibration is, then go through this blog.

Want to learn how to prevent your model from overfitting and underfitting? Then go through this blog.


Machine Learning Enthusiast. Writer in Towards Data Science, Analytics Vidhya, and AI In Plain English. LinkedIn: https://www.linkedin.com/in/riteshranjan11055/