Long Short-Term Memory

There seems to be a resurgence in using these units in the past year. They were first proposed in 1997 by Hochreiter and Schmidhuber, but, along with most neural network literature seemed to have been forgotten for a while, until work on neural networks made a comeback, and focus started shifting toward RNNs again. Some of the more interesting recent work using LSTMs have come from Schmidhuber’s student Alex Graves. Notice the spike here in 2009 when Graves first wrote about cursive handwriting recognition (and generation) using LSTMs.

In the last year however, they’ve gotten even more attention. The Neural Turing Machines paper used LSTMs as a benchmark as it is one of the more popular ways of maintaining some form of short-term memory when doing pattern recognition on a sequence. They’ve been incorporated into a new paradigm with sequence-to-sequence classification where an input sequence encoded with one LSTM, and then decoded into the label sequence using another LSTM. And it’s not just sequences, the recent burst of systems that attempt to do image caption generation also make use of a similar prinicple: turn an image into a vector representation, and decode that vector into a word sequence.

So… how do they work?

In Hochreiter and Schmidhuber’s 1997 paper, they identify a problem with vanishing/exploding gradients when training RNNs when analysing the derivatives of an RNN. The problem comes from propagating back errors for many time steps. If the magnitude of the error for all time steps is greater than one, then it leads to a situation where the delta increases exponentially with the number of time steps. A magnitude of smaller than one leads to vanishing $\Delta w$s.

To correct this, the idea of the constant error carousel was introduced to force a constant error flow. If the expressions for the new unit were calculated by working backwards by setting the error term to 1, it results in a network that isn’t very useful. As a result, they came up with a gating system that calculates updates to an internal state known as the cell. In the original paper, there are just 2 gates,

\newcommand{output}{\mathbf{o}}

\newcommand{cell}{\mathbf{c}}

\newcommand{input}{\mathbf{i}}

\newcommand{forget}{\mathbf{f}}

\newcommand{x}{\mathbf{x}}

\newcommand{h}{\mathbf{h}}

\newcommand{b}{\mathbf{b}}

\newcommand{W}{\mathbf{W}}

\begin{align*}

\mathbf{y}_t &= \output_t \circ h(\cell_t) \

\cell_t &= \mathbf{c}_{t-1} + \input_t \circ g(\mathbf{x}) \

\end{align*}

Where $\input_t$ and $\output_t$ is a function of $\x_t$, but the paper has left that up to the user to define. In more recent work using LSTMs, the de facto standard is to define them as a direct connection from the input data, like so:

$$\begin{align*}

\input_t &= \sigma(\W_{xi}\x_t + \W_{hi}\h_{t-1} + \b_i) \

\forget_t &= \sigma(\W_{xf}\x_t + \W_{hf}\h_{t-1} + \b_i) \

\cell_t &= \forget_t \circ \cell_{t-1} + \input_t \circ \tanh(\W_{xc}\x_t + \W_{hc}\h_{t-1} + \b_c)\

\h_t &= \output_t \circ \tanh(\cell_t)

\end{align*}$$

Most variants now include what is known as a peephole connection, which includes a connection from $\cell_t$ to the various gates. As always, all these functions are differentiable, and so backpropagation through time (BPTT) works to update the various weights here.

TL;DR: As with all neural network systems, they work by magic.

An intuitive idea of the LSTM unit

It’s easier to think about these `gates’ if you assume they only take a value of strictly 0 or 1, and that the cells are sort of like variables that get updated and modified over time.

The input gate acts like sort of a mask, whatever the updates from the $\tanh$ functions are, applying the input gate as a mask will add those values to the corresponding cell positions. The forget gate will blank out those cells where it predicts a zero, and keep the values for those it predicts a 1. The output gate masks the squashed version of the cells as output from the LSTM unit.

Also notice that the cells are not restricted to a range if we view them as a vector. This may be why they are useful as vector representations of sequences, since traditional RNNs using a $\tanh$ or $\sigma$ function restrict the range in which the output can be in.