LSTM Cell – with a magnifier!


Today, we’ll be talking about LSTM networks, Q&A style. The motivation is that, as I have often seen, people who read about LSTMs end up with more questions than answers. So here I will try to give the gist of LSTM networks in comparison to an FFN (Feed Forward Network, or a regular NN).

So, let’s begin.

Q. How is an LSTM cell different from an FFN cell (that is, a node in a hidden layer of a Feed Forward Network)?

>The fundamental difference between an LSTM cell and an FFN cell is that the LSTM cell is itself a combination of 4 networks. Yes, you read that right!

Q. Why do we have 4 networks in a single cell?

>Because we need memory, or ‘previous state information’, and these 4 networks are one way of achieving that.

Q. What are these 4 networks?

>We’ll go into the details shortly. Refer to the section ‘The Worthy Networks!’

Q. Why do we need ‘previous state information’?

>Because we don’t want to lose context, and we believe that knowing the context could help us better predict/classify the target.

Q. What if you are wrong, in the sense that previous state information is not required?

>Well, we could be. Generally, in applications where ‘sequence’ is important, LSTMs do well. They essentially capture the sequence information (previous state) in the ‘memory’ cells.

Q. Wait, you said sequence. CNNs (Convolutional Neural Networks) are good with sequences. Could I get away with using a CNN instead of an LSTM?

>Again, you could, depending on your application. One could say there is a similarity in how these two work, as opposed to an FFN: they both take a sequence (batch) of inputs. But that is where the similarity ends.

Once the data is in the hidden layers, a CNN does not pass the hidden state of the first image window (batch) on to the next image window; an LSTM does pass its hidden state on to the next step. I know this looks short, but this topic deserves an article of its own; I’ll try to come up with one soon.

Q. Okay, so you say that a single LSTM cell is a cell with 4 networks. Then, for a hidden layer in an LSTM implementation, do I have multiple LSTM cells?

>Well, let me put it this way: a single LSTM cell is equivalent to a hidden layer in an FFN. Again, that IS true. Just as a hidden layer has a number of nodes, an LSTM cell takes the number of hidden nodes as a parameter. This is the number of nodes in each of its 4 networks. As far as stacking goes, you can stack multiple LSTM cells, which is similar to stacking hidden layers in an FFN.

Okay, let’s take a stab at how an LSTM works. The cell has the following components:

  1. Long Term Memory,
  2. Short Term Memory,
  3. Input

LSTM cell

The above diagram is read from left to right, bottom to top. So, as in the case of an FFN, you take an input and predict an output. In addition, you use the short term memory and the long term memory to make the prediction, and you update both memories along the way.

The Worthy Networks!

  1.  The Learn Network.
  2.  The Forget Network
  3.  The Remember Network.
  4.  The Use Network.

Let’s take a look at each network in a bit more detail.

  1. Learn Network: This network combines the event data and the short term memory, learns new information from the event, and forgets what is not required.
  2. Forget Network: The long term memory goes here and forgets what is not required. It needs the short term memory as well to identify what is not required.
  3. Remember Network: We combine the long term memory and the short term memory to get the updated long term memory.
  4. Use Network: As the name suggests, it takes into account whatever we already know (long term memory) and whatever we learnt recently (short term memory), and combines them to predict the output. The output becomes both the new short term memory and the prediction.


Learn Network:


There is nothing to worry about.

We had 2 operations for the short term memory, Combine and Ignore, and we are doing nothing more here.

Why tanh for Combine and sigmoid for Ignore? Because it works! I am sorry about that. If someone has a better understanding, please share. The documents I found had empirical conclusions, not something I could derive intuition from.
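To make Combine and Ignore concrete, here is a minimal NumPy sketch of the Learn Network as described above. The names `learn_network`, `W_n`, `b_n`, `W_i` and `b_i` are my own placeholders, and the weights are random rather than trained. One common way to read the two activations: tanh squashes the combined information into signed values between -1 and 1, while sigmoid produces a fraction between 0 and 1 that decides how much of each value gets through.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def learn_network(stm_prev, event, W_n, b_n, W_i, b_i):
    # Combine: join short term memory and event data, then squash with tanh.
    combined = np.concatenate([stm_prev, event])
    new_info = np.tanh(W_n @ combined + b_n)
    # Ignore: a 0..1 factor per node that scales down what is not required.
    ignore = sigmoid(W_i @ combined + b_i)
    return new_info * ignore

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
hidden, n_in = 3, 2
stm_prev = rng.normal(size=hidden)
event = rng.normal(size=n_in)
W_n = rng.normal(size=(hidden, hidden + n_in))
W_i = rng.normal(size=(hidden, hidden + n_in))
b_n = np.zeros(hidden)
b_i = np.zeros(hidden)

learned = learn_network(stm_prev, event, W_n, b_n, W_i, b_i)
print(learned)  # one value per hidden node, each between -1 and 1
```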

Forget Network.

Here, for the long term memory, we need to forget what is not required. So, we multiply the LTM by a forget factor. How do we get the forget factor? Another network to the rescue!

Forget Gate
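As a rough sketch of the forget factor, assuming it comes from a small sigmoid layer fed with the short term memory and the event (the names `forget_network`, `W_f` and `b_f` are made up for this example):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forget_network(ltm_prev, stm_prev, event, W_f, b_f):
    # The forget factor is computed from short term memory and event data.
    combined = np.concatenate([stm_prev, event])
    forget_factor = sigmoid(W_f @ combined + b_f)  # one value in (0, 1) per node
    # Multiply the long term memory by the factor: near 1 keeps, near 0 forgets.
    return ltm_prev * forget_factor

ltm_prev = np.array([0.8, -0.5, 0.3])
stm_prev = np.zeros(3)
event = np.zeros(2)
W_f = np.zeros((3, 5))  # zero weights so that only the bias decides here

# A large positive bias pushes the factor towards 1 (keep everything),
# a large negative bias pushes it towards 0 (forget everything).
keep_all = forget_network(ltm_prev, stm_prev, event, W_f, np.full(3, 50.0))
drop_all = forget_network(ltm_prev, stm_prev, event, W_f, np.full(3, -50.0))
print(keep_all, drop_all)
```

In a trained network the weights, not just the bias, decide the factor, so what gets forgotten depends on the current event and the short term memory.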

Remember Network

Here, we update the long term memory. How do we do it? Combine Long Term (whatever is important) with Short Term (updated with the latest event data).

The Learn Network and Forget Network have exactly what we need. So we just add them.

Remember Gate
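In code, the Remember Network is just an element-wise addition. A tiny sketch with made-up numbers standing in for the outputs of the Forget and Learn Networks:

```python
import numpy as np

def remember_network(forget_out, learn_out):
    # New long term memory = what survived forgetting + what was just learnt.
    return forget_out + learn_out

forget_out = np.array([0.5, -0.2, 0.0])   # hypothetical Forget Network output
learn_out = np.array([0.1, 0.3, -0.4])    # hypothetical Learn Network output

new_ltm = remember_network(forget_out, learn_out)
print(new_ltm)
```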

Use Network

Here, we take what is useful from the long term memory (the output of the Forget Network), take the most recent short term memory, and combine them; that is going to be our new short term memory.

Use Gate

One catch here is the tanh function. If we only wanted the updated LTM, we could have used the output of the Forget Network directly. Empirically, the tanh has been found to improve the performance of the networks under test. Again, if someone has a better understanding, please share your inputs, and I will update the article to be more intuitive.
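Here is one way to sketch the Use Network as described in this section. It is an assumption on my part that the Forget Network output passes through a small tanh layer (`W_u`, `b_u`) and that the short term memory and event pass through a sigmoid layer (`W_v`, `b_v`); all names are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def use_network(forget_out, stm_prev, event, W_u, b_u, W_v, b_v):
    # What is useful from long term memory, squashed into (-1, 1) by tanh.
    useful = np.tanh(W_u @ forget_out + b_u)
    # How much of it to use, decided by recent short term memory and the event.
    combined = np.concatenate([stm_prev, event])
    gate = sigmoid(W_v @ combined + b_v)
    # The product is both the prediction and the new short term memory.
    return useful * gate

rng = np.random.default_rng(1)
hidden, n_in = 3, 2
forget_out = rng.normal(size=hidden)
stm_prev = rng.normal(size=hidden)
event = rng.normal(size=n_in)
W_u = rng.normal(size=(hidden, hidden))
W_v = rng.normal(size=(hidden, hidden + n_in))
b_u = np.zeros(hidden)
b_v = np.zeros(hidden)

new_stm = use_network(forget_out, stm_prev, event, W_u, b_u, W_v, b_v)
print(new_stm)
```

Note that the commonly published LSTM equations apply the tanh to the updated long term memory (the Remember Network output) rather than to the Forget Network output.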


Putting it all together:

LSTM cell

This is the famous LSTM cell. Now if you plug the individual networks into their respective cells, you should see the scary LSTM cell that one generally sees. It should not look scary now. Try that as an exercise. It is actually fun.

That’s it, guys.
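If you want to check your exercise against code, below is a minimal sketch of one step of the widely used LSTM formulation, with the article’s names in the comments. The function name `lstm_step`, the stacked weight matrix `W` and the sizes are assumptions for illustration. In this common formulation the tanh in the last step is applied to the updated long term memory.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(event, stm_prev, ltm_prev, W, b):
    hidden = stm_prev.shape[0]
    combined = np.concatenate([stm_prev, event])
    # The 4 networks are stacked in one matrix: 4 * hidden rows in total.
    z = W @ combined + b
    ignore = sigmoid(z[0 * hidden:1 * hidden])     # Learn Network: Ignore part
    forget = sigmoid(z[1 * hidden:2 * hidden])     # Forget Network factor
    new_info = np.tanh(z[2 * hidden:3 * hidden])   # Learn Network: Combine part
    use = sigmoid(z[3 * hidden:4 * hidden])        # Use Network gate
    ltm = forget * ltm_prev + ignore * new_info    # Remember Network
    stm = use * np.tanh(ltm)                       # new short term memory / output
    return stm, ltm

# Hypothetical sizes, random (untrained) weights, and a random 5-step sequence.
rng = np.random.default_rng(0)
hidden, n_in = 4, 3
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + n_in))
b = np.zeros(4 * hidden)
stm = np.zeros(hidden)
ltm = np.zeros(hidden)

for event in rng.normal(size=(5, n_in)):
    # Both memories are carried from one step of the sequence to the next.
    stm, ltm = lstm_step(event, stm, ltm, W, b)

print(stm.shape, ltm.shape)
```

Notice that `hidden` is the one size parameter shared by all 4 networks, which is the ‘number of hidden nodes’ mentioned in the Q&A above.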

Cheers. Happy Learning







Re-Structuring Machine Learning Execution

Recently, I gained some insight into structuring Machine Learning projects. How I wish I had this insight when we did some experiments in the ML domain in the not so distant past. Anyway, I wouldn’t want anybody else to get hit by the same stones, so below is the crux of what I think I have understood. Feel free to share your inputs if you are not totally on board or have a different insight.




A typical ML project pathway follows this to-do list:

  1. Fit Training set well.
  2. Fit Dev Set well.
  3. Fit Test Set well.
  4. Perform well in real world.


Causing the COST Function

Have you ever wondered why the cost function under gradient descent looks the way it does? If yes, today we will be building it up the right way. This is going to be a little longer, but totally worth it. So grab your coffee.

A little Kickstarter:

You have a model to build, which you are hoping to improve iteratively (in the true sense of the word). Let’s take the simplest problem of classifying 2 points as Red and Blue.

So, the task at hand: classify the Red point as Red and the Blue point as Blue.

To ‘improve’ a model, you need to compare one model with another, so that you can select the ‘better’ one. How do you do that? A good model is one that does what we want with greater confidence. In our case, that means classifying the Red point as Red and the Blue point as Blue.

To get confidence, we make the model predict the output classes (Red and Blue) as probabilities.

For example, P(Red)=0.4 means that I have 40% confidence that the given point is Red. It also means, for the same point, that P(Blue)=0.6, given there are only 2 classes.

So far so good? Cool.

Now, for all points classified by my model, I need a number which has the following property:

The greater it is, the more confidently the points are classified correctly; the smaller it is, the more likely it is that some points are mislabelled.

Then we maximize this number to get the best model.

All we have now is the probability of each point belonging to its actual class. Let’s take the product of these, as the events are independent.

Say we got P(R)=0.6 for the Red point and P(B)=0.7 for the Blue point.

The number we have is the product of P(R) for red points and P(B) for blue points, i.e. P(R) * P(B) = 0.6 x 0.7 = 0.42

Note that if the model had a lower confidence in predicting the correct class, the number would go down. E.g. if P(R) for the red point was 0.3 instead of 0.6, the result would be 0.3 x 0.7 = 0.21, which is less than what we had before. So we do have the above property satisfied.
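The arithmetic above is easy to check in a few lines of Python. The function name `model_score` is just a label I picked for the ‘number’ we are building.

```python
import math

def model_score(probs_of_true_class):
    # Product of the probabilities each point got for its actual class.
    return math.prod(probs_of_true_class)

confident = model_score([0.6, 0.7])       # P(R)=0.6 for Red, P(B)=0.7 for Blue
less_confident = model_score([0.3, 0.7])  # P(R) drops to 0.3

print(confident, less_confident)  # roughly 0.42 and 0.21
```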

Just a little more, hang in there.

This could very well be our ‘model comparator’, but notice something: as we are taking products of small numbers (in the range 0 to 1), the results keep getting smaller. We would also want to avoid products because a slight change in one factor could have a significant impact on the result.

So, a product is BAD. But sure, a SUM could work. Let’s see. How do we convert PRODUCT → SUM? Log to the rescue, as

log(a * b) = log(a)+log(b)

So our model comparator becomes :

Output = ln(P(R)) + ln(P(B))

Well, we do have another problem here. The log of a number between 0 and 1 is negative. We want a positive number, so we add a minus sign. Also, instead of maximizing it, we’ll now have to minimize it, given that we flipped the sign. Updating, we have:

Loss = -ln(P(R)) - ln(P(B))
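Here is a small sketch of why the switch from product to sum of logs helps. The function name `negative_log_loss` is my own; the 2000 points at probability 0.5 are a made-up example to show what happens to the raw product when there are many points.

```python
import math

def negative_log_loss(probs_of_true_class):
    # Sum of -ln(p) over all points; smaller is better.
    return -sum(math.log(p) for p in probs_of_true_class)

loss_good = negative_log_loss([0.6, 0.7])  # about 0.868
loss_bad = negative_log_loss([0.3, 0.7])   # about 1.561

# With many points, the raw product underflows to zero in floating point,
# while the sum of logs stays a perfectly usable number.
many_points = [0.5] * 2000
raw_product = math.prod(many_points)        # 0.0
log_loss = negative_log_loss(many_points)   # about 1386.29

print(loss_good, loss_bad, raw_product, log_loss)
```

The better model now has the smaller number, which is why we minimize instead of maximize.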

The ‘Output’ here is the information you have extracted.

The function looks like -ln(P(y)) for the event that occurred.

If we have 2 classes, with y = 1 for one class and y = 0 for the other, and p the predicted probability that y = 1, the function looks like:

 Loss = -{y * ln(p)} - {(1-y) * ln(1-p)}

To take the information only when a particular event actually occurs, we multiply by ‘y’ or ‘(1-y)’, which represents the occurrence of that event. (There is no information gained when that event did not occur.)
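A quick sketch of the 2-class function, reusing the Red and Blue points from earlier. I am assuming Red is labelled y = 1 and Blue is labelled y = 0, and that `p` is the model’s probability of Red; the function name is a placeholder.

```python
import math

def binary_cross_entropy(y, p):
    # y is the actual class (1 or 0), p is the predicted probability of class 1.
    # Only one of the two terms is non-zero for any given point.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

red_loss = binary_cross_entropy(1, 0.6)   # Red point, P(Red)=0.6
blue_loss = binary_cross_entropy(0, 0.3)  # Blue point, P(Red)=0.3, so P(Blue)=0.7
total = red_loss + blue_loss

print(total)  # same as -ln(0.6) - ln(0.7), about 0.868
```

This sketch breaks if `p` is exactly 0 or 1, since ln(0) is undefined; practical implementations usually clip `p` slightly away from both ends.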

Congratulations, you have built your own Loss function, the one that at first seemed to appear out of nowhere. This is formally called the Cross Entropy Loss Function. Fancy, right?

Now, the above function holds for 2 output classes, i.e. Binary Classification.

For more output classes, let’s guess the changes:

The values will not be y and (1-y) but y', y'', y''' and so on, one for each output class, each multiplied by the log of the probability of that event happening.

Representing the ‘+’ for all elements with a summation gives you the formal definition of the Cross Entropy Loss function, which you can look up. I’ll leave the rest to you.
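If you want something to compare your answer with, here is a minimal sketch for a single point with 3 classes. The one-hot list `y_true` plays the role of y', y'', y''' (1 for the actual class, 0 for the others); the names and numbers are made up for illustration.

```python
import math

def cross_entropy(y_true, probs):
    # Sum over classes of -y * ln(p); classes with y = 0 contribute nothing.
    return -sum(y * math.log(p) for y, p in zip(y_true, probs) if y > 0)

y_true = [0, 1, 0]        # the point actually belongs to the second class
probs = [0.2, 0.7, 0.1]   # predicted probabilities, summing to 1

loss = cross_entropy(y_true, probs)
print(loss)  # -ln(0.7), about 0.357
```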

That’s it for today, folks. Happy Machine Learning!

P.S.: As always, feedback welcome!

Demystifying Regularisation!

The motivation for this specific topic?
Often when I connect with people, the answer to ‘When to use Regularization’ is ‘To prevent overfitting of the model’. While that is true, it is important to understand how it works in order to get an overall understanding of your model. So, as suggested by one of my colleagues at ThoughtWorks, here’s my attempt at building an intuition for it.

Gradient Descent(some basics)

Let’s get started.


Machine Learning : How NOT to get started with Machine Learning.

Over the past year, I have seen quite a number of folks start with Data Science. And there are plenty of articles indicating the surface area of the entire domain. Many start, but few continue. Here, I try to list some traps that could stall an aspiring Data Scientist’s progress. As always, feel free to share any feedback you have.

A typical ML journey looks like this:

  1. Get fascinated by all the hype and aim to become a data scientist.
  2. Get started with Andrew Ng’s ML course.
  3. Don’t understand what’s going on for 3 weeks, and wonder when we will start ‘actual’ deep learning.
  4. Switch to the course.
  5. Get a feel of learning but deep down still not understanding how this works.
  6. Feel hollow and abandon all hope.


Before starting Andrew Ng’s ML Course…

If you are thinking of starting ML, without a doubt Andrew Ng’s course on Coursera is the best place to start.

However, here are a couple of things that should ease your journey.

  1. Make sure you complete at least 4 Weeks. The first 2 assignments are the mountain that you must scale before witnessing the beautiful horizons of ML.
  2. Do not skip any videos/lectures (I tried to act smart and tried; Let’s just say, not one of my brightest ideas…).
  3. Do not shy away from watching videos again and again(and again), if required. Everything you need is right there, in the videos.
  4. Use Emacs or Sublime Text as an editor. (I spent quite a bit of time setting up the ‘ideal environment’, only to later use Sublime Text. In the event that you discover a better alternative, please share.)
  5. If you think you lack the fundamentals to get this right, NOW is the time to get them (looking back, sleeping through that lecture on probability was not cool). The good news is you are smarter now, with more resources at your disposal. It only takes 10 minutes to get each of the fundamentals correct.
  6. It is OKAY to reverse engineer for the first 2 weeks and then try again. There were times when I did not get to the solution directly. I saw the solutions on Github. With that as a reference, reverse engineered it. (Just make sure you re-implement them later on and are able to explain what has been done.)
  7.  Octave-CLI is your ally, trust it; use it. (You’ll know this once you complete installation in Week 2)
  8. Get peers!! Get someone to discuss. With so many equations messing around, I can’t stress enough the importance of having someone to discuss them with. Start in pairs (get someone in the same boat as yourself); that should really help reinforce your learning.
  9. If the speed of videos is slow for you, watch videos at 1.25x or 1.5x.
  10. Don’t hush that inner voice that whispers, ‘You really did not get that, did you?’. Instead, embrace it.
  11. If you feel lost, you are on the right track(that implies you understand a tiny bit of it and are questioning the rest). The transition from an expert programmer to a grad student surely takes a toll.
  12. This is the most fundamental course there is, and it completely covers the basics of ML (AFAIK). So, yes, you’ll have to get through it.
  13. Discuss, Explain and Talk ML. Nothing will solidify your understanding better than explaining it to someone else.
  14. If at some point while watching lectures, you feel that you are not following, PAUSE the video right there, go back and start over. Don’t, seriously DON’T, be in a hurry to finish the video up. Take your time and really understand what’s going on in there. Or it will come back and bite you later.
  15. Take pen and paper! Solve the algorithm with a very small dataset manually on paper. It really helps in understanding what’s happening to the data and how the algorithm is working.

That’s it, folks.

Happy Machine Learning :).

[Image subject to copyright by Coursera.]