Remembering sequences (poorly) with RNNs

I’ve recently been working on a project that aims to train a recurrent network to memorise and repeat sequences of characters. It’s here, and it hasn’t been going particularly well, but I thought I’d share a little bit of why I wanted to do this and why I thought it might work.

Generating Text with Recurrent Neural Networks

This paper by Ilya Sutskever et al. shows how, by essentially just reading Wikipedia, a multiplicative recurrent neural network is able to learn syntactic rules as a character-level language model: balancing brackets, handling tense, and creating “plausible” words like “continge”. I figured the model needed to store some kind of short-term memory in its internal state in order to know when it is appropriate to generate a verb or a noun, or when to use the past tense, and if it could do that, those hidden states might be able to encode entire sentences as well.

So I went ahead and built something that completely does not work. But here’s why I thought that it might:

Consider the problem of trying to memorise and encode a sequence of 5 binary digits. Start with a completely empty internal state $S_0$ and the following hidden-to-hidden transition matrix $T$:

$$S_0 = \left[\begin{matrix} 0\\ 0\\ 0\\ 0\\ 0 \end{matrix}\right], T = \left[\begin{matrix} 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 0 & 0\\ \end{matrix}\right]$$
inputs are multiplied by the following weight vector W:
$$W = \left[\begin{matrix} 1\\ 0\\ 0\\ 0\\ 0 \end{matrix}\right]$$
and this update rule, which is similar to how recurrent neural networks update their internal states,
$$S_t = T^{\top} S_{t-1} + i_t W$$
where $i_t$ is the input at time $t$.
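As a sanity check, the example above fits in a few lines of NumPy (the sizes and the sequence are the ones used below; the state is kept as a column vector throughout):

```python
import numpy as np

n = 5
T = np.eye(n, k=1)               # ones on the superdiagonal, as in T above
W = np.zeros(n); W[0] = 1.0      # inputs are written into the first slot

S = np.zeros(n)                  # the empty internal state S_0
for i_t in [1, 0, 0, 1, 0]:
    # the update rule: shift the state by one slot, then add the new input
    S = T.T @ S + i_t * W

print(S)  # [0. 1. 0. 0. 1.]
```

Each step shifts every stored value one slot deeper into the state, so after 5 steps the first input has travelled to the last slot.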

So for the sequence 1, 0, 0, 1, 0, here’s what the states would look like:
$$S_1 = \left[\begin{matrix} 1\\ 0\\ 0\\ 0\\ 0 \end{matrix}\right], S_2 = \left[\begin{matrix} 0\\ 1\\ 0\\ 0\\ 0 \end{matrix}\right], S_3 = \left[\begin{matrix} 0\\ 0\\ 1\\ 0\\ 0 \end{matrix}\right], S_4 = \left[\begin{matrix} 1\\ 0\\ 0\\ 1\\ 0 \end{matrix}\right], S_5 = \left[\begin{matrix} 0\\ 1\\ 0\\ 0\\ 1 \end{matrix}\right]$$

I won’t go through the whole thing again, but you can see how this could then be run in reverse to retrieve the values from the stored vector. It’s also possible to encode more than just binary values: use a binary encoding of each symbol, and a transition matrix that shifts the internal representation by the width of whatever encoding you’re using.
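To make the reverse direction concrete, here’s a small NumPy sketch that encodes the example sequence and then reads it back out. The key observation is that the oldest input always sits in the last slot, and applying the same shift again exposes the next one:

```python
import numpy as np

n = 5
T = np.eye(n, k=1)                    # the shift matrix from the example
W = np.zeros(n); W[0] = 1.0
sequence = [1, 0, 0, 1, 0]

# Encode: shift the state one slot deeper, write the input into slot 1.
S = np.zeros(n)
for i_t in sequence:
    S = T.T @ S + i_t * W

# Decode: read the oldest value from the last slot, then shift again to
# expose the next-oldest, and so on.
decoded = []
for _ in range(n):
    decoded.append(int(round(S[-1])))
    S = T.T @ S

print(decoded)  # [1, 0, 0, 1, 0] -- the original sequence, recovered
```

A lossless round trip like this only works because the transition matrix is an exact shift; a trained network would have to discover something equivalent on its own.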

So yeah, that works for fixed-length strings, which doesn’t really help much when it comes to encoding sentences. But I went ahead and tried it anyway, on English dictionary words, thinking it might perhaps learn a more efficient encoding of English spelling sequences.

Recursive Auto-encoders

Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions by Richard Socher et al. introduces recursive autoencoders, a way to train a network to encode sentences. The idea starts from already-trained word vectors and, at every step, trains the network to reconstruct the previous internal representation and the current word, which largely sidesteps the problem of propagating long-range dependencies back through time when training recurrent networks.
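As a rough illustration of the reconstruction idea (this is not the paper’s actual architecture or training procedure; the weight shapes, random initialisation, and toy dimensionality are my own assumptions), a single autoencoder merge step could look something like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # toy word-vector dimensionality

# One merge: encode two d-vectors into a single parent vector, then try
# to reconstruct both children from the parent alone.
W_enc = rng.normal(scale=0.1, size=(d, 2 * d))
W_dec = rng.normal(scale=0.1, size=(2 * d, d))

def merge(c1, c2):
    children = np.concatenate([c1, c2])
    parent = np.tanh(W_enc @ children)
    recon = W_dec @ parent              # reconstruction of [c1; c2]
    loss = np.sum((recon - children) ** 2)
    return parent, loss

# Folding a three-word "sentence" into a single vector, right to left:
words = [rng.normal(size=d) for _ in range(3)]
p, loss1 = merge(words[1], words[2])
sentence, loss2 = merge(words[0], p)
```

Minimising the summed reconstruction losses is what forces the parent vectors to keep enough information about their children, and each loss is local to one merge rather than spanning the whole sequence.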

Remaining Questions

It remains to be seen, however, whether these encodings could be learned from the ground up and, to make things even harder, at the character level. And would it then be possible to have hierarchies of vector representations that go from encoding words to encoding paragraphs, or even whole books?

Interesting problems, and good times ahead for Natural Language Processing with RNNs.