## Recursive Auto-encoders: Example in Theano

Okay, time to get our hands dirty with some code! I’ve written an example in Theano that encodes a stream of one-hot vectors, and it’s the example I’ll walk through in this post.

As a quick recap of what was covered in the previous post, here’s a diagram:

The hidden units here are the ones labelled $h_1$ to $h_3$, and the formula that gives us that is simply a linear combination of:

• the current hidden state
• the current input

passed through a non-linearity function, in this case, $\tanh$. In Greek, that’s,
$$h_i = \tanh\left(x_i {}^{\top} W^{(1)} + h_{i-1} {}^{\top} W^{(2)} + b_\text{hidden}\right)$$
To reconstruct our $x_i$ and $h_i$, we need a few more equations. So, with some notation abuse, I’ll use $W^{(-1)}$ and $W^{(-2)}$ to signify weights that do the opposite of what $W^{(1)}$ and $W^{(2)}$ do:
$$\hat{x_i} = h_i {}^{\top} W^{(-1)} + b_\text{input recon}$$

$$\hat{h_{i-1}} = h_i {}^{\top} W^{(-2)} + b_\text{hidden recon}$$

Looking at these equations, it’s obvious we’ll need 4 weight matrices and 3 bias vectors.
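One full step of these equations can be sketched in plain NumPy. Everything here is illustrative: the sizes, the zero biases, and the 0.1 weight scale are placeholder assumptions, not values from the actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 24

# Hypothetical small random weights standing in for W^(1), W^(2),
# W^(-1) and W^(-2); the 0.1 scale is a placeholder assumption.
W1     = rng.normal(scale=0.1, size=(input_size, hidden_size))
W2     = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h    = np.zeros(hidden_size)
W_neg1 = rng.normal(scale=0.1, size=(hidden_size, input_size))
b_xr   = np.zeros(input_size)
W_neg2 = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_hr   = np.zeros(hidden_size)

x_i    = np.eye(input_size)[0]    # a one-hot input
h_prev = np.zeros(hidden_size)    # h_{i-1}

# h_i = tanh(x_i W^(1) + h_{i-1} W^(2) + b_hidden)
h_i = np.tanh(x_i @ W1 + h_prev @ W2 + b_h)

# Reconstructions of the current input and the previous hidden state
x_hat = h_i @ W_neg1 + b_xr
h_hat = h_i @ W_neg2 + b_hr
```

Note the row-vector convention: inputs multiply the weight matrices from the left, which is why the matrices are shaped `(input_size, hidden_size)` rather than the other way around.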

```python
X = T.dmatrix('X')

W_input_to_hidden  = U.create_shared(U.initial_weights(input_size, hidden_size))
W_hidden_to_hidden = U.create_shared(U.initial_weights(hidden_size, hidden_size))
b_hidden           = U.create_shared(U.initial_weights(hidden_size))
initial_hidden     = U.create_shared(U.initial_weights(hidden_size))  # for the very first hidden reconstruction
W_hidden_to_hidden_reproduction = U.create_shared(U.initial_weights(hidden_size, hidden_size))
b_hidden_reproduction           = U.create_shared(U.initial_weights(hidden_size))
W_hidden_to_input_reproduction  = U.create_shared(U.initial_weights(hidden_size, input_size))
b_input_reproduction            = U.create_shared(U.initial_weights(input_size))

parameters = [
    W_input_to_hidden,
    W_hidden_to_hidden,
    b_hidden,
    initial_hidden,
    W_hidden_to_hidden_reproduction,
    b_hidden_reproduction,
    W_hidden_to_input_reproduction,
    b_input_reproduction,
]
```

Now that we have all the pieces, we need to put them together in the right configuration. There’s a little dance we need to do with theano.scan() that’s not particularly obvious, but if you’re familiar with the API, notice that the parameters of the function step() correspond to the sequences, outputs_info and non_sequences arguments. The None values in outputs_info behave as a sort of ‘mask’, so that only the first of the three return values is passed to the next iteration (it doesn’t really work like that internally, but it’s a good way to think about it).
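The argument-ordering convention can be made concrete with a tiny pure-Python stand-in for scan. This is a toy mimic I wrote for illustration, not Theano’s actual implementation:

```python
def scan_like(step, sequences, outputs_info, non_sequences):
    """Toy stand-in for theano.scan: step receives, in order, the current
    element of each sequence, the previous value of each recurrent output
    (the outputs_info entries that are not None), then the non-sequences.
    Outputs whose outputs_info entry is None are collected but never fed
    back into the next iteration."""
    carried = [o for o in outputs_info if o is not None]
    collected = [[] for _ in outputs_info]
    for elems in zip(*sequences):
        results = step(*elems, *carried, *non_sequences)
        carried = [r for r, o in zip(results, outputs_info) if o is not None]
        for buf, r in zip(collected, results):
            buf.append(r)
    return collected

# Running sum: the first output recurs (like the hidden state); the
# second is just collected (like the reconstructions).
out = scan_like(
    step=lambda x, acc, scale: (acc + x, scale * x),
    sequences=[[1, 2, 3]],
    outputs_info=[0, None],
    non_sequences=[2],
)
# out == [[1, 3, 6], [2, 4, 6]]
```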

```python
def make_rae(inputs, W1_i, W1_m, b_h, i_h, W2_m, b2_m, W2_i, b2_i):
    def step(inputs, hidden_1, W1_m, W1_i, b_h, W2_m, b2_m, W2_i, b2_i):
        hidden = T.tanh(
            T.dot(hidden_1, W1_m) +
            T.dot(inputs, W1_i) +
            b_h
        )
        reproduction_m = T.dot(hidden, W2_m) + b2_m
        reproduction_i = T.dot(hidden, W2_i) + b2_i
        return hidden, reproduction_m, reproduction_i

    [hidden_, reproduction_m_, reproduction_i_], _ = theano.scan(
        step,
        sequences     = [inputs],
        outputs_info  = [i_h, None, None],
        non_sequences = [W1_m, W1_i, b_h, W2_m, b2_m, W2_i, b2_i]
    )
    return hidden_, reproduction_m_, reproduction_i_
```

The unroll() function does the exact opposite. Given an encoded sequence, which is the last hidden representation of the sequence, it attempts to use the $W^{(-1)}$ and $W^{(-2)}$ matrices to ‘unroll’ it into the full sequence again:

```python
def unroll(final_rep, W2_m, b2_m, W2_i, b2_i, n_steps):
    def step(curr_rep, W2_m, b2_m, W2_i, b2_i):
        next_rep  = T.dot(curr_rep, W2_m) + b2_m
        input_rep = T.dot(curr_rep, W2_i) + b2_i
        return next_rep, input_rep

    [_, recon], _ = theano.scan(
        step,
        outputs_info  = [final_rep, None],
        non_sequences = [W2_m, b2_m, W2_i, b2_i],
        n_steps       = n_steps
    )
    return recon
```
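Stripped of Theano, the unrolling loop is just this: repeatedly apply the decoder step to walk the hidden state backwards, collecting a reconstructed input at each step. The decoder below is a made-up toy, purely to show the control flow:

```python
def unroll(final_rep, step, n_steps):
    """Pure-Python sketch of the unrolling loop: step maps the current
    hidden representation to (previous representation, reconstructed
    input). Reconstructions come out newest-first."""
    recon = []
    rep = final_rep
    for _ in range(n_steps):
        rep, input_rep = step(rep)
        recon.append(input_rep)
    return recon

# Toy decoder: halve the "hidden state", emit its negation as the input.
out = unroll(8, lambda r: (r / 2, -r / 2), n_steps=3)
# out == [-4.0, -2.0, -1.0]
```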

Okay, so we’ve defined how, given a sequence, we can compute the sequence encoding, and how we can unroll that back into its original form. But how do we go about training the network to actually do this task? The U.initial_weights() function simply creates a matrix of the given size filled with random Gaussian noise, so using the network as it stands will only give gibberish output.
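The body of U.initial_weights isn’t shown in this post, but based on that description a minimal stand-in might look like the following. The 0.1 scale is my assumption, not taken from the actual utility module:

```python
import numpy as np

def initial_weights(*dims, scale=0.1):
    # Small Gaussian noise of the requested shape; the scale
    # here is a placeholder assumption.
    return np.random.randn(*dims) * scale

W = initial_weights(8, 24)
```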

Well, that’s where the $\hat{h_i}$ and $\hat{x_i}$ come in. As I explained in the previous post, at each time step, we want to make sure that these reconstructions of the inputs and the current hidden state are recreated as accurately as possible. Slightly more formally, we can say we want to minimise the reconstruction error, which brings us to the problem of defining said error. So let’s just reach for a simple squared error and run with that,
$$E = \sum_i \left[ (x_i - \hat{x_i})^2 + (h_{i-1} - \hat{h_{i-1}})^2 \right]$$

We have to be careful here or we’ll fall into an off-by-one bug and end up minimising $(h_{i-1} - \hat{h_i})^2$ instead:

```python
def build_error(X, hidden, hidden1_reproduction, input_reproduction):
    input_reproduction_sqerror  = T.sum((X - input_reproduction) ** 2)
    hidden_reproduction_sqerror = T.sum((hidden[:-1] - hidden1_reproduction[1:]) ** 2)
    return input_reproduction_sqerror + hidden_reproduction_sqerror
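To see why `hidden[:-1]` lines up with `hidden1_reproduction[1:]`: the reconstruction produced at step $i$ targets $h_{i-1}$, so it must be compared to the hidden state one row earlier. A toy NumPy check with made-up numbers:

```python
import numpy as np

# Toy hidden states h_0 .. h_3, one per row.
hidden = np.arange(12.0).reshape(4, 3)

# Pretend the network reconstructs the previous hidden state perfectly:
# row i of the reconstructions is exactly row i-1 of the hidden states
# (row 0 is a dummy, reconstructed from the initial hidden state).
hidden1_reproduction = np.vstack([np.zeros(3), hidden[:-1]])

# Correct alignment: a perfect reconstruction gives zero error.
err = np.sum((hidden[:-1] - hidden1_reproduction[1:]) ** 2)

# The off-by-one bug: comparing h_i against the reconstruction of
# h_{i-1} reports error even for a perfect reconstruction.
err_wrong = np.sum((hidden[1:] - hidden1_reproduction[1:]) ** 2)
```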

Okay, we’ve got the error down, but how do we go about minimising it? As usual, we’ll just do a simple stochastic gradient descent. Theano is able to do symbolic differentiation using theano.grad(error,wrt=parameters).
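The update itself is just $p \leftarrow p - \eta \, \partial E / \partial p$ for each parameter $p$. Here is that rule on a one-parameter toy problem, with the gradient written out by hand in place of what theano.grad would compute symbolically (the learning rate and iteration count are arbitrary):

```python
# Minimise E = (w*x - y)^2 in a single weight w by gradient descent.
w, x, y, lr = 0.0, 1.0, 2.0, 0.1
for _ in range(100):
    grad = 2 * (w * x - y) * x   # dE/dw, done by hand here
    w = w - lr * grad            # the same update Theano will apply
# w converges towards y/x == 2.0
```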

And that’s all there is to it! Once that’s done, we just have to make the call to theano.function to compile the training functions and a test function to see what our trained model actually does.

```python
if __name__ == '__main__':
    X, parameters, hidden, hidden1_reproduction, input_reproduction, unrolled = build_network(8, 24)
    f = theano.function(
        inputs  = [X],
        outputs = [hidden, hidden1_reproduction, input_reproduction, unrolled]
    )

    error     = build_error(X, hidden, hidden1_reproduction, input_reproduction)
    gradients = T.grad(error, wrt=parameters)

    updates = [(p, p - 0.0001 * g) for p, g in zip(parameters, gradients)]
    train = theano.function(
        inputs  = [X],
        updates = updates,
        outputs = error
    )

    example = np.eye(8)
    for _ in xrange(1000000):
        np.random.shuffle(example)
        print train(example)

    np.random.shuffle(example)
    hidden, hidden_rep, input_rep, unrlld = f(example)

    print_arr(hidden)
    print_arr(input_rep)
    print_arr(unrlld)
```

The full code is here.

Let’s give it a whirl and see what happens:

The top matrix shows the input into the network, while the bottom one shows the reproduced input in reverse.

You can see that it manages to somewhat recall the last 3 inputs of the sequence, but then muddles up the rest. It probably hasn’t been optimised far enough, but I think we may be able to use a few tricks to make the training go a little faster.

We’ll take a look at those in the next installment.

1. Thanks for the article. I’m interested in your Theano implementation of recursive autoencoders. Have you tried adding a softmax layer for classification? I would be very interested in seeing how this works for a less trivial example. It would be great if you were able to test this on the movie review dataset used in the paper.

• What I’ve done is just a linear recursion, whereas the paper uses the parse tree of the sentence to combine vectors of sub-trees together. I haven’t figured out a nice way to do that with Theano though. If you have any ideas I’m interested to hear them!

• I wonder whether there exists a method that dynamically combines the sub-trees using the Theano library.

• It is. Every time step $t$ depends on the state of time step $t-1$. If what you are saying is that it isn’t similar to Richard Socher’s implementation, then: yes, you’re right. It isn’t the same, but it is the simpler linear version of the recurrent autoencoder first suggested by Pollack.