The motivation for this specific topic?

Often when I ask around, the answer to ‘When to use Regularization’ is ‘To prevent overfitting of the model’. While that is true, it is important to understand how it works so that you have an overall understanding of your model.

So, as suggested by one of my colleagues at ThoughtWorks, here’s my attempt at intuiting it.

Prerequisites:

Gradient Descent (some basics)

Let’s get started.

One way to think of your model is as a curve which separates some outputs from others. This curve is defined by a function (a line, for example, is defined by the function y = mx + c).

The function will have a slope at any given point. This slope gets steeper as the constants associated with it (the weights, or coefficients) grow larger.
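To make that concrete, here is a minimal sketch. It assumes a one-input classifier that passes w * x through a sigmoid; the sigmoid and the function names are my choices for illustration, not something the rest of this post depends on.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def slope_at_boundary(w):
    """Slope of sigmoid(w * x) with respect to x, measured at x = 0.

    The derivative is w * s * (1 - s), where s = sigmoid(w * 0) = 0.5,
    so the slope at the boundary grows in direct proportion to w.
    """
    s = sigmoid(0.0)
    return w * s * (1.0 - s)

# Hypothetical weights, picked only to contrast small and large.
for w in (1.0, 10.0):
    print("weight:", w,
          "slope at boundary:", slope_at_boundary(w),
          "prediction at x = 1:", sigmoid(w * 1.0))
```

The larger weight gives a steeper curve, and its prediction for a point just one unit away from the boundary is already very close to 1.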

To do gradient descent properly, we need a model which has a gentle slope and not a steep slope.

For intuition, one can say that a model with a very steep slope is very certain of its predictions and leaves little room for gradient descent.

Also, if such a certain model makes errors in classification, the errors generated will be large, and it will be difficult to tune the model to correct them.
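Here is a rough sketch of that last point. It assumes the error is measured with log loss on a sigmoid output, which is my assumption for illustration; the point, its label, and the two weights are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, x, y):
    """Log loss of a one-weight sigmoid model on a single point (x, y)."""
    p = sigmoid(w * x)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# A made-up point whose true label is 0. Any positive weight pushes the
# prediction towards 1, so both models below get it wrong.
x, y = 1.0, 0

error_small_weight = log_loss(1.0, x, y)   # mildly confident, wrong
error_large_weight = log_loss(10.0, x, y)  # very confident, wrong

print("error with small weight:", error_small_weight)
print("error with large weight:", error_large_weight)
```

Both models misclassify the same point, but the one with the large weight is far more certain about its wrong answer and pays a much bigger error for it.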

The intuition here is we say that larger coefficients could probably lead you into overfitting.

So we would want to avoid large coefficients.

Having established the above, what do we do?

We punish high coefficients. And how do we do it?

Given that we minimize error using an error function, and we want to stay away from high coefficients, we add a penalty term to the error function. It increases the error when the coefficients are high and contributes little when the coefficients are small.

Now, all we need is some way to measure the coefficients (weights); to this measure, we can apply our penalty.

Two approaches to measuring the coefficients:

a. Simply add up the absolute values of the weights (the weights can be negative, and we don’t want them canceling the positive weights) and multiply by our penalty factor lambda.

b. Take the sum of squares of the weights and multiply by our penalty factor lambda.

The first one is L1-Regularization, and the latter is L2-Regularization.
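The two measures can be sketched in a few lines. The function names, the sample weights, and the value of lambda (written as lam, since lambda is a reserved word in Python) are all hypothetical and only there for illustration.

```python
def l1_penalty(weights, lam):
    """L1: lambda times the sum of absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2: lambda times the sum of squares of the weights."""
    return lam * sum(w * w for w in weights)

def regularized_error(base_error, weights, lam, penalty=l2_penalty):
    """Error we actually minimize: the original error plus the penalty."""
    return base_error + penalty(weights, lam)

# Made-up weights: same signs and pattern, different magnitudes.
small_weights = [0.5, -0.5]
large_weights = [5.0, -5.0]
lam = 0.1

for name, weights in (("small", small_weights), ("large", large_weights)):
    print(name,
          "L1 penalty:", l1_penalty(weights, lam),
          "L2 penalty:", l2_penalty(weights, lam))
```

Note that the negative weights do not cancel the positive ones in either measure, and that squaring makes the L2 penalty grow much faster than the L1 penalty as the weights get large.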

To summarize:

a. Larger coefficients could well lead to overfitting

b. Need to punish them

c. Update the error function to add a term that does the second.
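To see the three points working together, here is a small sketch of gradient descent with and without an L2 penalty. Everything in it is an assumption for illustration: the toy data, the one-weight sigmoid model, the log loss, the learning rate, and the value of lambda.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy, perfectly separable 1-D data, made up for this sketch.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

def train(lam, steps=2000, learning_rate=0.1):
    """Fit a single weight w by gradient descent on log loss + lam * w^2."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean log loss with respect to w.
        grad = sum((sigmoid(w * x) - y) * x for x, y in data) / len(data)
        # Gradient of the L2 penalty term lam * w^2.
        grad += 2.0 * lam * w
        w -= learning_rate * grad
    return w

w_plain = train(lam=0.0)        # no penalty
w_regularized = train(lam=0.1)  # large weights are punished

print("weight without penalty:", w_plain)
print("weight with L2 penalty:", w_regularized)
```

Without the penalty the weight keeps creeping upwards on this data, because a steeper curve always lowers the error a little more. With the penalty, the weight settles at a smaller value.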

Yes, that is all there is to it. You should now have an idea of what exactly we are doing when we say ‘regularize the model’. That’s it for today.

Oh yes, if you have feedback, please share.

Happy Learning, folks.