I’ve talked a little bit about recursive auto-encoders a couple of posts ago. In deep learning lingo, an auto-encoder usually refers to an architecture that takes an input vector and, through a series of transformations, is trained to reproduce that input at its prediction layer. The reason for doing this is to extract features that describe the input. One might think of it as a form of compression: if the network is asked to reproduce an input after passing it through hidden layers with far fewer neurons than the input layer, then some sort of compression has to happen in order for it to produce a good reconstruction. So let’s consider the above network: 8 inputs, 8 outputs, and 3 neurons in the hidden layer. If we feed the network a one-hot encoding of 1 to 8 (setting only the neuron corresponding to the input to 1), and insist that that input be reconstructed at the output layer, guess what happens?
The matrix above shows the reconstructed values, and the second matrix shows the internal representations that the network has learnt. Notice it’s decided that each of the different inputs corresponds to a sort of binary encoding of 0 to 7. Pretty cool, huh? The network has figured out that there’s a more efficient way to represent the 8 inputs using 3 hidden units. That’s kind of the essence of what auto-encoders are. This is obviously a toy example, but the same idea has been applied to both images and text.
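To make this concrete, here’s a minimal NumPy sketch of the 8-3-8 experiment: full-batch gradient descent on a sigmoid auto-encoder with a cross-entropy loss. The learning rate, iteration count, and initialisation here are arbitrary choices for illustration, not necessarily the settings used above.

```python
import numpy as np

rng = np.random.default_rng(0)

# The 8 one-hot input patterns, one per row.
X = np.eye(8)

# 8-3-8 auto-encoder with sigmoid units (a sketch; the exact
# activations and training procedure in the post may differ).
W1 = rng.normal(scale=0.5, size=(8, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 8)); b2 = np.zeros(8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    H = sigmoid(X @ W1 + b1)   # hidden codes, shape (8, 3)
    Y = sigmoid(H @ W2 + b2)   # reconstructions, shape (8, 8)
    return H, Y

def loss(Y):
    # Mean cross-entropy between inputs and reconstructions,
    # clipped to avoid log(0).
    Yc = np.clip(Y, 1e-12, 1 - 1e-12)
    return -np.mean(X * np.log(Yc) + (1 - X) * np.log(1 - Yc))

initial_loss = loss(forward(X)[1])

lr = 0.2
for _ in range(20000):  # full-batch gradient descent
    H, Y = forward(X)
    dZ2 = Y - X                        # grad of cross-entropy wrt output pre-activation
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)   # backprop through the hidden sigmoid
    W2 -= lr * (H.T @ dZ2); b2 -= lr * dZ2.sum(axis=0)
    W1 -= lr * (X.T @ dZ1); b1 -= lr * dZ1.sum(axis=0)

H, Y = forward(X)
final_loss = loss(Y)
codes = (H > 0.5).astype(int)  # threshold the hidden units to read off a 3-bit code
```

After training, thresholding the three hidden activations at 0.5 typically reads off a binary-like code per input, which is the kind of encoding shown in the second matrix.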
Now that I’ve talked a little about what auto-encoders are, let’s get into what we’re really here to talk about: recursive auto-encoders. I first came across the idea in Socher’s paper, but the particular method was first described in Pollack’s paper, which was a pretty interesting read.
The idea is simple: Since we’re dealing with time-series data, at each time-step, reconstruct both the input vector and the hidden vector of the previous time step. This forces the hidden representation at each time step to be in a format from which the previous hidden state and the input can be reconstructed. If we assume this can be done without loss, then we can essentially reconstruct the entire sequence from the final encoding.
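A sketch of that recursive step, with untrained weights, a made-up dimensionality, and tanh chosen arbitrarily as the nonlinearity: the encoder maps the concatenation of the current input and previous hidden state down to a new hidden vector, and the decoder maps that hidden vector back up to a reconstruction of both.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 4  # assumed dimensionality of both the input and hidden vectors

# Untrained parameters, for illustration only; in practice these are
# learnt by minimising the reconstruction error at every time step.
W_enc = rng.normal(scale=0.1, size=(d, 2 * d)); b_enc = np.zeros(d)
W_dec = rng.normal(scale=0.1, size=(2 * d, d)); b_dec = np.zeros(2 * d)

def encode_step(x, h_prev):
    """Merge the current input and previous hidden state into one vector."""
    return np.tanh(W_enc @ np.concatenate([x, h_prev]) + b_enc)

def decode_step(h):
    """Recover (approximately) the input and the previous hidden state."""
    out = W_dec @ h + b_dec
    return out[:d], out[d:]   # (x_hat, h_prev_hat)

# Encode a sequence left to right into a single vector...
xs = [rng.normal(size=d) for _ in range(5)]
h = np.zeros(d)
for x in xs:
    h = encode_step(x, h)

# ...then unroll backwards: each decode step yields a reconstructed
# input plus the hidden state to decode at the next step back.
x_hat, h = decode_step(h)
```

If the per-step reconstructions were lossless, repeating `decode_step` would walk back through the whole sequence from that one final vector.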
When it comes to language models, combining word vectors in this way makes more sense than simply summing them up. The phrase “a fantastic movie” completely changes its meaning when preceded by “not”. In Socher’s other paper, he uses this technique as a way of getting a representation of the semantics of a sentence: the parsed grammar tree of the sentence is used, and the leaf nodes are recursively merged upwards into one single vector.
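The merging up a tree can be sketched like so, with random stand-ins for the word vectors and a hypothetical parse of the phrase; this is just the shape of the computation, not Socher’s actual model. Because each merge is order-sensitive, prepending “not” yields a different vector than any sum of the words would.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4  # assumed embedding dimensionality

# Shared (untrained) merge parameters: two child vectors in, one parent out.
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)

# Random stand-ins for learnt word embeddings.
vec = {w: rng.normal(size=d) for w in ["not", "a", "fantastic", "movie"]}

def merge(left, right):
    """Combine two child vectors into a single parent vector."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Follow a hypothetical parse: (not (a (fantastic movie))).
phrase = merge(vec["not"],
               merge(vec["a"],
                     merge(vec["fantastic"], vec["movie"])))
```

The root vector `phrase` is the single fixed-size representation of the whole phrase.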
In the next installment, I’ll show how the linear form of this auto-encoding (as in the first diagram) can be implemented in Theano, with a toy example similar to the 3-bit encoding of 8.