Long Short-Term Memory (LSTM) in Keras

In this article, you will learn how to build an LSTM network in Keras. Here I will explain all the small details which will help you to start working with LSTMs straight away.

Ritesh Ranjan
Towards Data Science


In this article, we will focus on unidirectional and bidirectional LSTMs, and I will explain all the steps, from preparing the data for training to defining the LSTM model. I will cover only the Sequential model, and I will use LSTMs to do sentiment analysis of Amazon Fine Food Reviews. So let's get started.

Vectorization and input format:

In this section, you will learn how to vectorize the data and pass it as input to the architecture. I will take an example so that you can learn to give input based on your dataset. Here I will take the amazon reviews dataset to show how to vectorize your data. You can download the dataset from here.

Before moving forward, let us look at the data and do some data cleaning. First, we will load the .csv file using the pandas library. We will keep only those reviews whose rating is not equal to 3 stars, since a 3-star review is neutral. Below is the code to do the same.

Code to filter out neutral reviews.
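
A minimal sketch of this filtering step, assuming the Kaggle version of the Amazon Fine Food Reviews dataset, where the rating lives in a 'Score' column and the review text in a 'Text' column; the file name Reviews.csv is also an assumption.

    import pandas as pd

    # Load the Amazon Fine Food Reviews file (file name assumed).
    df = pd.read_csv('Reviews.csv')

    # Drop 3-star reviews, which we treat as neutral.
    df = df[df['Score'] != 3]

    # Ratings above 3 become the positive class (1), the rest negative (0).
    df['label'] = (df['Score'] > 3).astype(int)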

Now we will do standard data preprocessing by removing HTML tags, stopwords, and duplicates. You can see the details here. After cleaning the dataset, we will have two columns: one will contain the reviews and the other will contain the labels 1 or 0, where 1 means a positive review and 0 means a negative review.

Now the next step is to vectorize the data. We will vectorize the data using the Tokenizer class of the Keras module. This assigns every unique word an integer, so each word is now identified by that integer. Below is the code that does exactly this.

Code to vectorize the text data.
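
A sketch of this step using Keras's Tokenizer and pad_sequences. The maximum review length of 150 comes from this article; the vocabulary cap of 5,000 words and the variable names train_reviews and test_reviews (lists of cleaned review strings) are assumptions.

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    max_len = 150       # maximum review length used in this article
    vocab_size = 5000   # assumed cap on the vocabulary; adjust for your data

    # Assign every unique word an integer index based on the training reviews.
    tokenizer = Tokenizer(num_words=vocab_size)
    tokenizer.fit_on_texts(train_reviews)

    # Turn each review into a sequence of integers, padded/truncated to 150 tokens.
    x_train = pad_sequences(tokenizer.texts_to_sequences(train_reviews), maxlen=max_len)
    x_test = pad_sequences(tokenizer.texts_to_sequences(test_reviews), maxlen=max_len)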

Every word should be represented by a vector so that the model can understand it. There are two ways to do this: one is to use pre-trained word embeddings such as word2vec or GloVe, the other is to train your own word embeddings on your dataset. We can do the latter by using the Embedding layer of the Keras module while training our deep learning model. The Embedding layer takes the vocabulary size, the dimension of the word vectors, and the input length of each review. Here we will keep the dimension at 32, so each word will be represented by a vector of length 32 after training is done. We have already fixed the maximum length of the reviews to 150. Below is the code to define the embedding layer.

Defining the embedding layer.
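
A sketch of the embedding layer as the first layer of a Sequential model; the output dimension of 32 and the input length of 150 follow the article, while the vocabulary size of 5,000 is an assumed placeholder.

    from keras.models import Sequential
    from keras.layers import Embedding

    vocab_size = 5000    # assumed number of unique words (plus 1 for index 0)
    embedding_dim = 32   # dimension of each word vector
    max_len = 150        # fixed review length

    model = Sequential()
    # The Embedding layer maps each integer word index to a trainable
    # 32-dimensional vector that is learned along with the rest of the model.
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len))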

You can see that the Embedding layer is the first layer of the model. We must specify the following three arguments.

  1. vocab size: This is the number of unique words in the training set. If the words are encoded from 0 to 100, then the vocab size will be 101.
  2. output dim: This is the dimension to which each word in the training set will get encoded. You can choose any dimension. Ex: 8, 32, 100 etc. Here we have selected 32.
  3. input length: This is the length of the input sequences. If all the input sequences in your data have length 100, then this value will be 100. For our case, we have chosen 150.

Model architecture:

In this section, we will define the model. In our architecture, we will use two LSTM layers of 128 units each, one stacked on the other. A normal LSTM layer is 3 to 4 times slower than CuDNNLSTM, so if you are training on a GPU, use CuDNNLSTM; you will see that the training speed magically increases 3 to 4 times. Below is the code to define the architecture.

Code to define the model.
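
Below is a sketch of the described architecture: two stacked CuDNNLSTM layers of 128 units each on top of the embedding layer, with a single sigmoid unit for the binary sentiment label. The loss, optimizer, and vocabulary size are assumptions not stated in the article; swap CuDNNLSTM for LSTM if you are training on a CPU.

    from keras.models import Sequential
    from keras.layers import Embedding, Dense, CuDNNLSTM

    model = Sequential()
    model.add(Embedding(input_dim=5000, output_dim=32, input_length=150))
    # The first LSTM layer returns the full sequence so it can feed the next one.
    model.add(CuDNNLSTM(128, return_sequences=True))
    # The last LSTM layer returns only its final output.
    model.add(CuDNNLSTM(128))
    # One sigmoid unit for the positive/negative prediction.
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])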

If you want to stack LSTM layers, set return_sequences=True on every LSTM layer that feeds into another LSTM layer. For the last LSTM layer, there is no need to use return_sequences=True. You can change the number of units used; here we have used 128, but you can use whatever suits your case. You can vary these values and keep the best one. You can also try other activation functions, see which gives the best performance, and then select the best one. You can also try stacking more layers and check whether there is an improvement or not.

Model Training:

In this section, we will train the model that we defined above. We can train the model using the model.fit method, which takes the following arguments:

  1. x_train: This is the vectorized data for training. Here we have vectorized the reviews and stored them in x_train.
  2. y_train: This contains the labels of the reviews that were vectorized.
  3. batch_size: This is the number of data points the model sees before it updates the weights. Don’t keep this too low, as the training time will increase. Try a bunch of values and find the best one for your case. Here we have taken a batch size of 1024.
  4. epochs: Number of times the model will see the data during training before ending the training.
  5. validation_split: This is the fraction of the training data on which validation will be done during training.
  6. callbacks: These are functions that are called during training to monitor the state of the model. While training, we don’t know when to stop, since the model will only stop after the specified number of epochs has been reached. Sometimes the model reaches an optimum before that many epochs, yet it keeps training and overfits. To prevent that, we will use a callback that monitors the loss of the model during training: if the loss doesn't decrease for some number of epochs, we halt training. For this case, we have decided that if the loss doesn't decrease even after 4 epochs, we will stop.

The following code does what has been mentioned above.

Code to start the training.
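
A sketch of the training call with an EarlyStopping callback. The batch size of 1024 and the patience of 4 epochs come from the article; the epoch limit, the validation split of 0.2, and monitoring val_loss are assumptions.

    from keras.callbacks import EarlyStopping

    # Stop training if the monitored loss has not improved for 4 consecutive epochs.
    early_stop = EarlyStopping(monitor='val_loss', patience=4)

    history = model.fit(
        x_train, y_train,
        batch_size=1024,       # batch size used in this article
        epochs=30,             # assumed upper bound; early stopping usually ends training sooner
        validation_split=0.2,  # assumed fraction of training data held out for validation
        callbacks=[early_stop],
    )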

With this architecture, we got an accuracy of 83% on the test dataset. Using bidirectional LSTMs instead of unidirectional LSTMs gave us an accuracy of 92%. Below is the code for using bidirectional LSTMs.

Code of Bidirectional LSTM model.
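
A sketch of the same model with each LSTM layer wrapped in Keras's Bidirectional wrapper; as before, the vocabulary size and the compile settings are assumptions.

    from keras.models import Sequential
    from keras.layers import Embedding, Dense, Bidirectional, CuDNNLSTM

    model = Sequential()
    model.add(Embedding(input_dim=5000, output_dim=32, input_length=150))
    # Each Bidirectional wrapper runs the LSTM over the review left-to-right and
    # right-to-left and concatenates the two encodings.
    model.add(Bidirectional(CuDNNLSTM(128, return_sequences=True)))
    model.add(Bidirectional(CuDNNLSTM(128)))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])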

Now the question is: what is the intuition behind using bidirectional LSTMs? In a unidirectional LSTM, we encode a word by looking only at the words on its left. In a bidirectional LSTM, we encode a word by looking at the words on both its left and its right. It is quite intuitive that we can encode a word better if we also observe the words on its right. Right?

We can see this in the results we got: using bidirectional LSTMs improved the accuracy of our model from 83% to 92%, which is a significant jump.

I hope this will help you start working with LSTMs. I will provide the link to the Jupyter notebook for further reference. You can view the notebook here.

If you have any doubts do let me know.

References:

  1. https://keras.io/getting-started/sequential-model-guide/
  2. http://www.bioinf.jku.at/publications/older/2604.pdf
  3. https://keras.io/callbacks/
  4. https://keras.io/layers/recurrent/
