Have you ever wondered why the cost function used in gradient descent looks the way it does? If yes, today we will build it up the right way. This is going to be a little longer, but totally worth it. So grab your coffee.

A little kick-starter:

You have a model to build, which you are hoping to improve iteratively (in the true sense of the word). Let’s take the simplest problem: classifying 2 points as Red and Blue.

So, the task at hand: **Classify the Red point as Red and the Blue point as Blue.**

To ‘improve’ the model, you need to compare one model with another, so that in effect you select the ‘better’ one. How do you do that? A good model is one that does what we want with greater confidence. In our case: classifying the Red point as Red and the Blue point as Blue.

To get confidence: we make the model predict the output classes (Red and Blue) as probabilities.

For example: P(Red)=0.4 means the model has 40% confidence that the given point is Red. For the same point, it also means that P(Blue)=0.6, given there are only 2 classes.

So far so good? Cool.

Now, for all points classified by my model, I need a single number with the following property:

**If this number is greater, we have correctly classified the points; if it is smaller, we have some misclassifications.**

Then we **maximize** this number to get the best model.

All we have now is the probability of each point belonging to its actual class. Let’s take the product of these, as the events are independent.

Say we got, for the Red point, P(R)=0.6 and, for the Blue point, P(B)=0.7.

The number we have is the product of P(R) for red points and P(B) for blue points, i.e. P(R) * P(B) = 0.6 x 0.7 = 0.42

Note that if the model had a lower confidence in predicting the correct class, the number would go down. E.g. if P(R) for the red point were 0.3 instead of 0.6, the result would be 0.3 x 0.7 = 0.21, which is less than what we had before. So the property above is satisfied.
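The comparison above can be sketched in a few lines of Python (the probabilities are the ones assumed in this example; the function name is mine):

```python
import math

# Probabilities our model assigns to each point's TRUE class (the example's numbers)
good_model = [0.6, 0.7]   # P(R) for the red point, P(B) for the blue point
worse_model = [0.3, 0.7]  # lower confidence on the red point

def likelihood(probs):
    """Product of the probabilities of the true classes."""
    return math.prod(probs)

print(likelihood(good_model))   # ≈ 0.42
print(likelihood(worse_model))  # ≈ 0.21, lower, so the first model is better
```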

Just a little more, hang in there.

This could very well be our ‘model comparator’, but notice something: since we are taking products of small numbers (in the range 0–1), the results keep getting smaller. We would also want to **avoid products because a slight change in one factor can have a significant impact on the whole.**

So products are BAD. But sure, a SUM could work. Let’s see. How do we convert a PRODUCT into a SUM? **Log** to the rescue, as

log(a * b) = log(a)+log(b)
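A quick numerical check of this identity with the numbers from our example (a minimal Python sketch):

```python
import math

p_r, p_b = 0.6, 0.7  # probabilities from the example above

print(math.log(p_r * p_b))            # ln(0.42) ≈ -0.8675
print(math.log(p_r) + math.log(p_b))  # same value: the product became a sum
```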

So our model comparator becomes :

Output = ln(P(R)) + ln(P(B))

Well, we do have another problem here. The log of a number less than 1 is negative, and we want a positive number, so we add a minus sign. Also, given the added negative sign, instead of maximizing this number we’ll have to minimize it. Updating, we have:

Loss = -ln(P(R)) - ln(P(B))
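Plugging in the numbers from our example (a minimal sketch):

```python
import math

p_r, p_b = 0.6, 0.7  # probabilities of the correct class for each point

loss = -math.log(p_r) - math.log(p_b)
print(loss)  # ≈ 0.8675; a lower value now means a better model
```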

The ‘Output’ above is the amount of information you have extracted.

The function looks like -ln(P(y)) for the event that occurred.

If we have 2 classes, the function looks like:

Loss = -{y * ln(p)} - {(1-y) * ln(1-p)}, where y is the true label (1 or 0) and p = P(y=1)

To take the information only when a particular event actually occurs, we have a representative term, ‘y’ or ‘(1-y)’, switching on the term for the event that occurred. (There is no information gained from the term for the event that did not occur.)
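Putting the pieces together, here is a minimal sketch of this function in Python (the name `binary_cross_entropy` is my own, not from any particular library):

```python
import math

def binary_cross_entropy(y, p):
    """Loss for a single point.
    y: the true label (1 or 0); p: predicted probability that the label is 1."""
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# Red point (label 1) predicted with P=0.6; Blue point (label 0) with P(label 1)=0.3
total = binary_cross_entropy(1, 0.6) + binary_cross_entropy(0, 0.3)
print(total)  # ≈ 0.8675, i.e. -ln(0.6) - ln(0.7) as before
```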

Congratulations, you have built your own loss function, which earlier seemed to appear out of nowhere. This is formally called the **Cross Entropy Loss function.** Fancy, right?

Now, the above function is true for 2 output classes, i.e. Binary Classification.

For more output classes, can you guess the changes?

The values will no longer be y and (1-y); instead we have y1, y2, …, yC, one indicator per output class (C classes in total), each multiplied by the log of the probability of that class.

Representing the ‘+’ over all classes with a summation would give you the formal definition of the Cross Entropy Loss function, which you can just search for. I’ll leave the rest to you.
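As a sketch of the multi-class version (assuming one-hot labels; the function name is my own):

```python
import math

def cross_entropy(y_true, p_pred):
    """y_true: one-hot labels (1 for the class that occurred, 0 elsewhere).
    p_pred: predicted probability for each class (they sum to 1)."""
    return -sum(y * math.log(p) for y, p in zip(y_true, p_pred) if y > 0)

# 3 classes; the point actually belongs to the second class
print(cross_entropy([0, 1, 0], [0.2, 0.5, 0.3]))  # -ln(0.5) ≈ 0.693
```

With 2 classes and y_true = [y, 1-y], this reduces exactly to the binary formula above.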

That’s it for today, folks. Happy Machine Learning!

P.S.: As always, feedback welcome!