What Is LSTM (Long Short-Term Memory)

Neural Network Tutorials - Herong's Tutorial Examples


This section provides a quick introduction to the LSTM (Long Short-Term Memory) recurrent neural network model. LSTM uses two state vectors, s representing the short-term memory and l representing the long-term memory, to address the vanishing gradient problem in the standard recurrent neural network model.

What Is LSTM (Long Short-Term Memory)? LSTM, introduced in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, is an enhancement of the RNN (Recurrent Neural Network) designed to address the vanishing gradient problem.

In order to enhance the capability of the neural network cell, LSTM uses two state vectors, s representing the short-term memory and l representing long-term memory, as described in this tutorial.

1. If we continue with our compact format of the RNN layer architecture, the LSTM layer architecture, which is a 2-state recurrent neural network layer, can be illustrated as follows.

Figure: RNN vs. LSTM Layer Architecture

2. If we unfold the LSTM layer architecture, using (..., t-1, t, t+1, ...) to represent the sample sequence, we will have:

Figure: LSTM Layer - Unfolded Format

3. Now the recursive function R() for an LSTM layer can be expressed as follows.

(y_t, s_t, l_t) = R(x_t, W_t, s_t-1, U_t, l_t-1, V_t)

Inputs:

  x_t represents the input vector of the current sample.

  W_t represents the weight matrix on the input vector
  for the current sample.

  s_t-1 represents the short-term memory state vector generated from
  the calculation of the previous sample.

  U_t represents the weight matrix on the short-term memory state vector
  for the current sample.

  l_t-1 represents the long-term memory state vector generated from
  the calculation of the previous sample.

  V_t represents the weight matrix on the long-term memory state vector
  for the current sample.

Outputs:

  y_t represents the output vector of the current sample.

  s_t represents the short-term memory state vector to feed information to
  the next sample.

  l_t represents the long-term memory state vector to feed information to
  the next sample.
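
The unfolded format and the signature of R() above can be sketched in Python as a loop that carries both state vectors from one sample to the next. This is only a minimal sketch under stated assumptions: the names unfold and toy_R are hypothetical, the weight matrices are assumed to be held inside R() instead of being passed per sample, and toy_R is a made-up stand-in whose only purpose is to make the loop runnable.

```python
import math

def unfold(R, xs, s0, l0):
    """Apply R to each sample in order, carrying both state vectors forward."""
    s, l = s0, l0
    ys = []
    for x_t in xs:
        # (y_t, s_t, l_t) = R(x_t, s_t-1, l_t-1); weights live inside R here.
        y_t, s, l = R(x_t, s, l)
        ys.append(y_t)
    return ys, s, l

def toy_R(x_t, s_prev, l_prev):
    """Hypothetical stand-in for R(): NOT an LSTM, illustration only.

    The long-term state accumulates the inputs and the short-term
    state is a squashed copy of it. s_prev is accepted but unused.
    """
    l_t = [lp + x for lp, x in zip(l_prev, x_t)]
    s_t = [math.tanh(v) for v in l_t]
    return s_t, s_t, l_t

# Three samples of a 1-element input vector, starting from zero states.
ys, s_last, l_last = unfold(toy_R, [[1.0], [2.0], [-0.5]], [0.0], [0.0])
print(ys, s_last, l_last)
```

The point of the sketch is the data flow: each call receives the two state vectors produced by the previous call, which is exactly what the unfolded diagram depicts.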

4. One popular way of constructing the recursive function R() is to use gate functions to control the flow of information from inputs to outputs. A gate function is a pointwise multiplication of a gate vector, with values in the range of (0.0, 1.0), and the targeted information vector. A gate function acts like a switch. The targeted information passes through if the switch is on (gate vector values close to 1). The targeted information is stopped if the switch is off (gate vector values close to 0). Mathematically, a gate function G() for vectors of n elements can be expressed as:

              |g₁ |   |i₁ |   |g₁*i₁  |
              |g₂ |   |i₂ |   |g₂*i₂  |
G() = g * i = |g₃ | * |i₃ | = |g₃*i₃  |
              |...|   |...|   |  ...  |
              |g_n|   |i_n|   |g_n*i_n|

g represents the gate vector
i represents the target information vector
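
The gate function above can be sketched in a few lines of plain Python. The function name gate and the sample vectors are assumptions made for this illustration. The gate values 0.0 and 1.0 are idealized extremes, since a sigmoid-generated gate only approaches them.

```python
def gate(g, i):
    """Pointwise multiplication of gate vector g and information vector i."""
    if len(g) != len(i):
        raise ValueError("gate and information vectors must have the same length")
    return [g_k * i_k for g_k, i_k in zip(g, i)]

info = [0.5, 2.0, 3.0]

print(gate([1.0, 1.0, 1.0], info))  # switch on: every element passes through
print(gate([0.0, 0.0, 0.0], info))  # switch off: every element is blocked
print(gate([1.0, 0.0, 0.5], info))  # mixed: kept, blocked, halved
```

Because the multiplication is element by element, each position of the information vector has its own switch, and values between 0 and 1 let part of the information through.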

5. The standard LSTM architecture splits the recursive function R() into two parts: the input part R_i() and the output part R_o(). Three gate functions, the forget gate G_f(), the input gate G_i(), and the output gate G_o(), are then added to control the flow of information. The weight matrices are redistributed among the partial recursive functions and the gate functions. The result is shown on the right side of the diagram below:

6. Now we have some options to construct partial recursive functions and gate functions. One of them is commonly used and can be expressed below using 3 gate vectors as temporary variables:

Forget gate vector:
  f = sigmoid(Wgf_t · x_t + Ugf_t · s_t-1)

Input gate vector:
  i = sigmoid(Wgi_t · x_t + Ugi_t · s_t-1)

Output gate vector:
  o = sigmoid(Wgo_t · x_t + Ugo_t · s_t-1)

Forget gate function:
  G_f() = f * l_t-1

Input part of recursive function:
  R_i() = tanh(Wi_t·x_t + Ui_t·s_t-1)

Input gate function:
  G_i() = i * R_i()
    = i * tanh(Wi_t·x_t + Ui_t·s_t-1)

Long-term memory vector:
  l_t = G_f() + G_i()
    = f * l_t-1 + i * tanh(Wi_t·x_t + Ui_t·s_t-1)

Output part of recursive function:
  R_o() = tanh(l_t)

Output gate function:
  G_o() = o * R_o()
    = o * tanh(l_t)

Short-term memory vector:
  s_t = G_o()
    = o * tanh(l_t)

Output of the layer:
  y_t = s_t
    = o * tanh(l_t)

Where:
  · represents the dot operation of a matrix and a vector.
  * represents pointwise multiplication of two vectors.
  Wgf_t, Wgi_t, and Wgo_t represent weight matrices on x_t in the gate vectors.
  Ugf_t, Ugi_t, and Ugo_t represent weight matrices on s_t-1 in the gate vectors.
  Wi_t and Ui_t represent the weight matrices on x_t and s_t-1 in R_i().
  V_t, the weight matrix on l_t-1, is not used.
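
The equations of step 6 can be put together into one forward step. What follows is a minimal sketch in plain Python, not a reference implementation: the helper names (matvec, add, mul, lstm_step), the dictionary layout of the weights, and the weight values are all assumptions made for illustration, and bias terms are left out because the formulas above do not include them.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh(v):
    return [math.tanh(a) for a in v]

def matvec(M, v):
    """Dot operation of a matrix (list of rows) and a vector."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def mul(a, b):
    """Pointwise multiplication of two vectors."""
    return [p * q for p, q in zip(a, b)]

def lstm_step(x_t, s_prev, l_prev, p):
    """One forward step; p maps the weight names used in the text to matrices."""
    f = sigmoid(add(matvec(p["Wgf"], x_t), matvec(p["Ugf"], s_prev)))  # forget gate vector
    i = sigmoid(add(matvec(p["Wgi"], x_t), matvec(p["Ugi"], s_prev)))  # input gate vector
    o = sigmoid(add(matvec(p["Wgo"], x_t), matvec(p["Ugo"], s_prev)))  # output gate vector
    r_i = tanh(add(matvec(p["Wi"], x_t), matvec(p["Ui"], s_prev)))     # R_i()
    l_t = add(mul(f, l_prev), mul(i, r_i))                             # G_f() + G_i()
    s_t = mul(o, tanh(l_t))                                            # G_o()
    y_t = s_t
    return y_t, s_t, l_t

# Hypothetical weights: 2 inputs, 1 hidden unit, values chosen arbitrarily.
params = {
    "Wgf": [[0.1, 0.2]], "Ugf": [[0.3]],
    "Wgi": [[0.4, -0.1]], "Ugi": [[0.2]],
    "Wgo": [[-0.3, 0.5]], "Ugo": [[0.1]],
    "Wi": [[0.6, -0.2]], "Ui": [[0.4]],
}
y_t, s_t, l_t = lstm_step([1.0, 0.5], [0.0], [0.0], params)
print(y_t, s_t, l_t)
```

Note that l_prev only ever meets the forget gate vector f, never a weight matrix, which is the code-level counterpart of V_t being unused.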

7. If you like matrix format, the LSTM mathematical model can be written as:
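
One possible matrix form, consistent with the equations in step 6, is sketched below in LaTeX. It rests on several assumptions: the four weight pairs are stacked into a single block matrix, each activation in the column applies to the corresponding block of rows, and g is a shorthand introduced here for R_i(). Superscripts stand in for the names used above, for example W^{gf}_t for Wgf_t, and * remains pointwise multiplication.

```latex
\begin{aligned}
\begin{pmatrix} f \\ i \\ o \\ g \end{pmatrix}
&=
\begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
\left(
\begin{pmatrix}
W^{gf}_t & U^{gf}_t \\
W^{gi}_t & U^{gi}_t \\
W^{go}_t & U^{go}_t \\
W^{i}_t  & U^{i}_t
\end{pmatrix}
\begin{pmatrix} x_t \\ s_{t-1} \end{pmatrix}
\right) \\
l_t &= f * l_{t-1} + i * g \\
s_t &= o * \tanh(l_t) \\
y_t &= s_t
\end{aligned}
```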

8. The above LSTM mathematical model can be illustrated graphically as the diagram below:

9. If we look again at how those gate vectors are calculated, they are actually RNN layers themselves. They follow the same forward calculation pattern as the standard RNN layer:

Standard RNN forward calculation pattern:
  y_t = sigmoid(W_t·x_t + U_t·s_t-1)
  s_t = y_t
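
The shared pattern can be made concrete with a single helper that is reused for a gate vector and for the input part R_i(), changing only the activation. This is a hedged sketch: the name rnn_forward and the weight values are hypothetical, and the same weights are reused for both calls only to keep the example short, whereas each layer in an LSTM has its own W and U.

```python
import math

def rnn_forward(W, U, x_t, s_prev, activation):
    """Compute activation(W·x_t + U·s_prev) element by element."""
    z = [
        sum(w * x for w, x in zip(w_row, x_t))
        + sum(u * s for u, s in zip(u_row, s_prev))
        for w_row, u_row in zip(W, U)
    ]
    return [activation(v) for v in z]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical weights: 2 inputs, 1 hidden unit.
x_t, s_prev = [1.0, -1.0], [0.5]
W, U = [[0.2, 0.4]], [[0.1]]

f = rnn_forward(W, U, x_t, s_prev, sigmoid)       # same form as a gate vector
r_i = rnn_forward(W, U, x_t, s_prev, math.tanh)   # same form as R_i()
print(f, r_i)
```

One difference from a standalone RNN layer is worth keeping in mind: inside the LSTM, the s_t-1 fed to all of these layers is the short-term memory of the LSTM layer as a whole, not each layer's own previous output.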

10. The following diagram shows 4 RNN layers inside the common LSTM model.

11. Of course, the common LSTM model can be illustrated differently. Here are some examples I have collected from the Internet.

LSTM Model Illustration from geeksforgeeks.org
