Causing the COST Function

Have you ever wondered why the cost function used with gradient descent looks the way it does? If yes, today we will build it up the right way, from scratch. This is going to be a little longer, but totally worth it. So grab your coffee.

A little Kickstarter:

You have a model to build, which you are hoping to improve iteratively (in the true sense of the word). Let’s take the simplest problem: classifying 2 points as Red and Blue.

So, the task at hand: Classify the Red point as Red and the Blue point as Blue.

To ‘improve’ a model, you need to compare one model with another, so that you can select the ‘better’ one. How do you do that? A good model is one that does what we want with greater confidence. In our case, that means classifying the Red point as Red and the Blue point as Blue.

To get confidence: we make the model predict the output classes (Red and Blue) as probabilities.

For example: P(Red)=0.4 means that the model is 40% confident that the given point is Red. It also means, for the same point, that P(Blue)=0.6, given there are only 2 classes.

So far so good? Cool.

Now, across all points classified by my model, I need a single number which has the following property:

The greater it is, the more confidently the points have been classified correctly, and the smaller it is, the more mislabelled classifications we have.

Then we maximize this number to get the best model.

All we have now is the probability of each point belonging to its actual class. Let’s take the product of these, treating the predictions for the individual points as independent events.

Say we got, for the Red point, P(R)=0.6 and, for the Blue point, P(B)=0.7.

The number we have is the product of P(R) for red points and P(B) for blue points, i.e. P(R) * P(B) = 0.6 x 0.7 = 0.42

Note that if the model had a lower confidence in predicting the correct class, the number would go down, e.g. if P(R) for the red point was 0.3 instead of 0.6, the result would be 0.3 x 0.7 = 0.21, which is less than what we had before. So the above property is satisfied.
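If you want to play with the numbers yourself, here is a minimal Python sketch of this ‘model comparator’. The function name likelihood is my own label for illustration; the probabilities are the ones from the example above.

```python
from math import prod

def likelihood(probs_of_actual_class):
    """Product of the probabilities the model gave to each point's actual class."""
    return prod(probs_of_actual_class)

# P(R) for the red point, P(B) for the blue point
confident = likelihood([0.6, 0.7])
# Same blue point, but the model is less sure about the red point
hesitant = likelihood([0.3, 0.7])

print(confident, hesitant)
print(confident > hesitant)  # the more confident model scores higher
```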

Just a little more, hang in there.

This could very well be our ‘model comparator’, but notice something: as we are taking products of small numbers (in the range 0–1), the result keeps getting smaller with every point we add. We would also want to avoid products because a slight change in one factor scales the entire result.

So, a product is BAD. But sure, a SUM could work. Let’s see. How do we convert PRODUCT → SUM? Log to the rescue, as

log(a * b) = log(a)+log(b)

So our model comparator becomes :

Output = ln(P(R)) + ln(P(B))
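Here is a small Python sketch to convince yourself of both points: the log of the product equals the sum of the logs, and the sum stays usable where the product collapses. The 2000-point dataset with probability 0.5 each is an assumption I made up purely for illustration.

```python
import math

# The two-point example: log of the product equals the sum of the logs.
p_red, p_blue = 0.6, 0.7
log_of_product = math.log(p_red * p_blue)
sum_of_logs = math.log(p_red) + math.log(p_blue)

# Hypothetical larger dataset (illustration only): 2000 points,
# each given probability 0.5 for its actual class.
many_probs = [0.5] * 2000
product = math.prod(many_probs)                 # too small for a float, collapses to 0.0
log_sum = sum(math.log(p) for p in many_probs)  # a perfectly ordinary negative number

print(log_of_product, sum_of_logs)
print(product, log_sum)
```

Once the product has collapsed to zero, two models can no longer be compared with it, whereas the sum of logs still tells them apart.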

Well, we do have another problem here. The log of a number less than 1 is negative. We would like a positive number, so we add a minus sign. Also, instead of maximizing it, we’ll now have to minimize it, given we flipped the sign. Updating, we have:

Loss = -ln(P(R)) - ln(P(B))

The ‘Output’ here is the information you have extracted.

The function looks like -ln(P(y)) for the event that occurred.

If we have 2 classes, the function looks like:

 Loss = -{y * ln(p)} - {(1-y) * ln(1-p)}

Here y is the actual label (1 for one class, 0 for the other) and p is the predicted probability of the class labelled 1. The factors ‘y’ and ‘(1-y)’ act as switches, so that only the term for the event that actually occurred contributes. (There is no information gained when that event did not occur.)
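A minimal Python sketch of the two-class loss, applied to our two points. The encoding Red = 1, Blue = 0 is an assumption I picked for illustration; here y is the actual label and p is the predicted probability of the class labelled 1.

```python
import math

def binary_cross_entropy(y, p):
    """Loss for one point: y is the actual label (1 or 0),
    p is the predicted probability of the class labelled 1."""
    return -(y * math.log(p)) - ((1 - y) * math.log(1 - p))

# Assumption for illustration: Red is encoded as y = 1, Blue as y = 0.
red_loss = binary_cross_entropy(1, 0.6)   # model says P(Red) = 0.6 for the red point
blue_loss = binary_cross_entropy(0, 0.3)  # model says P(Red) = 0.3, i.e. P(Blue) = 0.7
total_loss = red_loss + blue_loss

print(total_loss)
```

The total is the same -ln(0.6) - ln(0.7) we arrived at earlier, because for each point one of the two terms is switched off.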

Congratulations, you have built your own loss function, the one that at first seemed to appear out of nowhere. This is formally called the Cross Entropy Loss Function. Fancy, right?

Now, the above function holds for 2 output classes, i.e. Binary Classification.

For more output classes, let’s guess. The changes:

The labels will not be y and (1-y); instead there will be y', y'', … y''', one per output class, each multiplied by the log of the predicted probability of that class.

Representing the ‘+’ across all these terms with a summation gives you the formal definition of the Cross Entropy Loss function, which you can easily look up. I’ll leave the rest to you.
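If you would like to check your guess afterwards, here is one possible sketch in Python. It assumes the labels are given as a one-hot list (1 for the actual class, 0 elsewhere); the third class and all the probabilities are made up for illustration.

```python
import math

def cross_entropy(y_one_hot, probs):
    """Sum of -y * ln(p) over all classes; terms with y == 0 contribute nothing."""
    return -sum(y * math.log(p) for y, p in zip(y_one_hot, probs) if y != 0)

# Hypothetical 3-class example (Red, Blue, Green); the actual class is Blue.
loss_three = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])

# With only 2 classes it reduces to the binary case: here just -ln(0.6).
loss_two = cross_entropy([1, 0], [0.6, 0.4])

print(loss_three, loss_two)
```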

That’s it for today, folks. Happy Machine Learning!

P.S.: As always, feedback welcome!

Demystifying Regularisation!

The motivation for this specific topic?
Often when I ask around, the answer to ‘When to use Regularization’ is ‘To prevent overfitting of the model’. While that is true, it is important to understand how it works, so that you have an overall understanding of your model. So, as suggested by one of my colleagues at ThoughtWorks, here’s my attempt at intuiting it.

Gradient Descent (some basics)

Let’s get started.


Machine Learning : How NOT to get started with Machine Learning.

Over the past year, I have seen quite a number of folks start with Data Science. And there are plenty of articles indicating the surface area of the entire domain. Many start, but few continue. Here, I try to list some traps that could stall an aspiring Data Scientist’s progress. As always, feel free to share any feedback you have.

A typical ML journey looks like this:

  1. Get fascinated by all the hype and aim to become a data scientist.
  2. Get started with Andrew Ng’s ML course.
  3. Don’t understand what’s going on for 3 weeks, and wonder when the ‘actual’ deep learning will start.
  4. Switch to the course.
  5. Get a feeling of learning, but deep down still don’t understand how this works.
  6. Feel hollow and abandon all hope.


Re-Structuring Machine Learning Execution

Recently, I gained some insight into structuring Machine Learning projects. How I wish I had this insight when we ran some experiments in the ML domain in the not so distant past. Anyway, I wouldn’t want anybody else to get hit by the same stones, so below is the crux of what I think I have understood. Feel free to share your inputs if you are not totally on board or have a different insight.

A typical ML project pathway would follow this to-do list:

  1. Fit Training set well.
  2. Fit Dev Set well.
  3. Fit Test Set well.
  4. Perform well in real world.


Before starting Andrew Ng’s ML Course…

If you are thinking of starting ML, Andrew Ng’s Course on Coursera is, without a doubt, the best place to start.

However, here are a few things that should ease your journey.

  1. Make sure you complete at least 4 Weeks. The first 2 assignments are the mountain that you must scale before witnessing the beautiful horizons of ML.
  2. Do not skip any videos/lectures (I tried to act smart and skipped some; let’s just say, not one of my brightest ideas…).
  3. Do not shy away from watching videos again and again(and again), if required. Everything you need is right there, in the videos.
  4. Use Emacs or Sublime Text as an editor. (I spent quite a bit of time setting up the ‘ideal environment’ only to later use Sublime Text. In the event that you discover a better alternative, please share.)
  5. If you think you lack the fundamentals to get this right, NOW is the time to get them (looking back, sleeping through that lecture on probability was not cool). The good news is you are smarter now with more resources at your disposal. It only takes 10 minutes to get each of the fundamentals correct.
  6. It is OKAY to reverse engineer for the first 2 weeks and then try again. There were times when I did not get to the solution directly. I saw the solutions on Github. With that as a reference, reverse engineered it. (Just make sure you re-implement them later on and are able to explain what has been done.)
  7. Octave-CLI is your ally; trust it, use it. (You’ll know this once you complete the installation in Week 2.)
  8. Get peers!! Get someone to discuss with. With so many equations messing around, I can’t stress enough the importance of having someone to discuss them with. Start in pairs (get someone in the same boat as yourself); that should really help reinforce your learning.
  9. If the speed of videos is slow for you, watch videos at 1.25x or 1.5x.
  10. Don’t hush that inner voice that whispers, ‘You really did not get that, did you?’. Instead, embrace it.
  11. If you feel lost, you are on the right track (that implies you understand a tiny bit of it and are questioning the rest). The transition from an expert programmer to a grad student surely takes a toll.
  12. This is the most fundamental course there is that completely covers the basics of ML (AFAIK). So, yes, you’ll have to get through it.
  13. Discuss, Explain and Talk ML. Nothing will solidify your understanding like explaining it to someone else.
  14. If at some point while watching lectures, you feel that you are not following, PAUSE the video right there, go back and start over. Don’t, seriously DON’T, be in a hurry to finish the video up. Take your time and really understand what’s going on in there. Or it will come back and bite you later.
  15. Take pen and paper! Solve the algorithm with a very small dataset manually on paper. It really helps in understanding what’s happening to the data and how the algorithm is working.

That’s it, folks.

Happy Machine Learning :).

[Image subject to copyright by Coursera.]