**Neural Network Tutorials - Herong's Tutorial Examples** - 1.20, by Dr. Herong Yang

What Is LSTM (Long Short-Term Memory)

This section provides a quick introduction to the LSTM (Long Short-Term Memory) recurrent neural network model. LSTM uses two state vectors, s representing the short-term memory and l representing the long-term memory, to resolve the vanishing gradient problem in the standard recurrent neural network model.

**What Is LSTM (Long Short-Term Memory)?**
LSTM, introduced in 1997 by Sepp Hochreiter and Jürgen Schmidhuber,
is an enhancement of the RNN (Recurrent Neural Network)
to resolve the vanishing gradient problem.

In order to enhance the capability of the neural network cell, LSTM uses two state vectors, s representing the short-term memory and l representing the long-term memory, as described in this tutorial.

1. If we continue with our compact format of the RNN layer architecture, the LSTM layer architecture, which is a 2-state recursive neural network layer, can be illustrated as follows.

2. If we unfold the LSTM layer architecture, using (..., t-1, t, t+1, ...) to represent the sample sequence, we will have:

3. Now the recursive function R() for an LSTM layer can be expressed as follows.

(y_{t}, s_{t}, l_{t}) = R(x_{t}, W_{t}, s_{t-1}, U_{t}, l_{t-1}, V_{t})

Inputs:
  x_{t}   represents the input vector of the current sample.
  W_{t}   represents the weight matrix on the input vector for the current sample.
  s_{t-1} represents the short-term memory state vector generated from the calculation of the previous sample.
  U_{t}   represents the weight matrix on the short-term memory state vector for the current sample.
  l_{t-1} represents the long-term memory state vector generated from the calculation of the previous sample.
  V_{t}   represents the weight matrix on the long-term memory state vector for the current sample.

Outputs:
  y_{t}   represents the output vector of the current sample.
  s_{t}   represents the short-term memory state vector to feed information to the next sample.
  l_{t}   represents the long-term memory state vector to feed information to the next sample.
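Because R() takes the previous states as inputs and returns the new states as outputs, applying it over a sample sequence is a simple loop. The sketch below shows this unrolling for any 2-state recursive function with the signature R(x, s, l, params); the function and parameter names are illustrative, not part of the original model.

```python
def run_lstm_layer(xs, R, params, s0, l0):
    """Unroll a 2-state recursive layer over a sample sequence.

    xs     - the sequence of input vectors (..., x_{t-1}, x_t, x_{t+1}, ...)
    R      - any recursive function returning (y_t, s_t, l_t)
    params - the weight matrices, passed through to R unchanged
    s0, l0 - initial short-term and long-term memory states
    """
    s, l = s0, l0
    ys = []
    for x in xs:
        # Each step consumes the states produced by the previous step.
        y, s, l = R(x, s, l, params)
        ys.append(y)
    return ys, s, l
```

The same loop works for a standard 1-state RNN layer by simply ignoring the long-term state l.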

4. One popular way of constructing the recursive function R() is to use gate functions to control the flow of information from inputs to outputs. A gate function is a pointwise multiplication of a gate vector, with values in the range of (0.0, 1.0), and the targeted information vector. A gate function acts like a switch. The targeted information passes through if the switch is on (gate vector of 1), and is stopped if the switch is off (gate vector of 0). Mathematically, a gate function G() for vectors of n elements can be expressed as:

             |g_{1}|   |i_{1}|   |g_{1}*i_{1}|
             |g_{2}|   |i_{2}|   |g_{2}*i_{2}|
G() = g * i = |g_{3}| * |i_{3}| = |g_{3}*i_{3}|
             |g_{.}|   |i_{.}|   |g_{.}*i_{.}|
             |g_{n}|   |i_{n}|   |g_{n}*i_{n}|

Where:
  g represents the gate vector.
  i represents the target information vector.
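As a quick sketch of the gate function using NumPy, with illustrative values: a gate element near 1 lets information through, and a gate element of 0 blocks it.

```python
import numpy as np

# Hypothetical gate and information vectors (illustrative values only).
g = np.array([1.0, 0.0, 0.8])   # gate vector, elements in (0.0, 1.0)
i = np.array([2.0, 3.0, 5.0])   # target information vector

# The gate function is a pointwise (element-wise) multiplication.
gated = g * i
print(gated)
```

The second element is zeroed out because its gate value is 0 (switch off), while the others pass through fully or partially.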

5. The standard LSTM architecture
splits the recursive function R() into two parts,
the input part R_{i}() and the output part R_{o}().
Then 3 gate functions,
the forget gate G_{f}(), the input gate G_{i}(), and the output gate G_{o}(),
are added to control the flow of information.
Weight matrices are re-distributed into the partial recursive functions
and gate functions. The result is shown on the right side of the diagram below:

6. Now we have several options for constructing the partial recursive functions and gate functions. One of them is commonly used and can be expressed below, using the 3 gate vectors as temporary variables:

Forget gate vector:
  f = sigmoid(Wgf_{t}·x_{t} + Ugf_{t}·s_{t-1})

Input gate vector:
  i = sigmoid(Wgi_{t}·x_{t} + Ugi_{t}·s_{t-1})

Output gate vector:
  o = sigmoid(Wgo_{t}·x_{t} + Ugo_{t}·s_{t-1})

Forget gate function:
  G_{f}() = f * l_{t-1}

Input part of the recursive function:
  R_{i}() = tanh(Wi_{t}·x_{t} + Ui_{t}·s_{t-1})

Input gate function:
  G_{i}() = i * R_{i}() = i * tanh(Wi_{t}·x_{t} + Ui_{t}·s_{t-1})

Long-term memory vector:
  l_{t} = G_{f}() + G_{i}() = f * l_{t-1} + i * tanh(Wi_{t}·x_{t} + Ui_{t}·s_{t-1})

Output part of the recursive function:
  R_{o}() = tanh(l_{t})

Output gate function:
  G_{o}() = o * R_{o}() = o * tanh(l_{t})

Short-term memory vector:
  s_{t} = G_{o}() = o * tanh(l_{t})

Output of the layer:
  y_{t} = s_{t} = o * tanh(l_{t})

Where:
  · represents the dot operation of a matrix and a vector.
  * represents the pointwise multiplication of two vectors.
  Wgf_{t}, Wgi_{t}, and Wgo_{t} represent weight matrices on x_{t}.
  Ugf_{t}, Ugi_{t}, and Ugo_{t} represent weight matrices on s_{t-1}.
  V_{t}, the weight matrix on l_{t-1}, is not used.
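The equations above can be sketched as a single forward step in NumPy. The parameter names (Wgf, Ugf, Wi, Ui, ...) mirror the symbols in the text but are illustrative; like the equations, this sketch omits bias vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, l_prev, params):
    """One LSTM forward step, following the gate equations above."""
    # The 3 gate vectors, each computed from x_t and s_{t-1}.
    f = sigmoid(params["Wgf"] @ x_t + params["Ugf"] @ s_prev)  # forget gate
    i = sigmoid(params["Wgi"] @ x_t + params["Ugi"] @ s_prev)  # input gate
    o = sigmoid(params["Wgo"] @ x_t + params["Ugo"] @ s_prev)  # output gate

    # Input part of the recursive function, R_i().
    r_i = np.tanh(params["Wi"] @ x_t + params["Ui"] @ s_prev)

    # Long-term memory: forget some of l_{t-1}, admit some new input.
    l_t = f * l_prev + i * r_i

    # Short-term memory: gated output part R_o() = tanh(l_t).
    s_t = o * np.tanh(l_t)

    # The layer output equals the short-term memory state.
    y_t = s_t
    return y_t, s_t, l_t
```

With all-zero weights every gate opens halfway (sigmoid(0) = 0.5), so the long-term memory simply decays by half at each step, which is a convenient sanity check.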

7. If you prefer the matrix format, the LSTM mathematical model can be written as:

8. The above LSTM mathematical model can be illustrated graphically as the diagram below:

9. If we look again at how those gate vectors are calculated, we can see that they are actually RNN layers themselves. They follow the same forward calculation pattern as the standard RNN layer:

Standard RNN forward calculation pattern:
  y_{t} = sigmoid(W_{t}·x_{t} + U_{t}·s_{t-1})
  s_{t} = y_{t}
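This pattern can be sketched in a few lines of NumPy; each LSTM gate vector follows exactly this computation with its own pair of weight matrices (names here are illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, s_prev, W, U):
    """Standard RNN forward pattern with a sigmoid activation.

    The output doubles as the next state: s_t = y_t.
    """
    y_t = sigmoid(W @ x_t + U @ s_prev)
    return y_t, y_t  # (output, new state)
```

Each of the forget, input, and output gates is this computation with (Wgf, Ugf), (Wgi, Ugi), or (Wgo, Ugo), which is why the LSTM cell can be viewed as several small RNN layers working together.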

10. The following diagram shows 4 RNN layers inside the common LSTM model.

11. Of course, the common LSTM model can be illustrated differently. Here are some examples I have collected from the Internet.
