Recursive Auto-encoders: Example in Theano
Okay, time to get our hands dirty with some code! I’ve written an example in Theano that encodes a stream of one-hot encodings, and this is the example I’ll run through in this post.
As a quick recap of what was covered in the previous post, here’s a diagram:
The hidden units here are the ones labelled $h_1$ to $h_3$, and each of them is computed from a linear combination of:
- the current hidden state
- the current input
passed through a non-linearity, in this case $\tanh$. In Greek, that’s:
$$
h_i = \tanh\left(x_i {}^{\top} W^{(1)} + h_{i-1} {}^{\top} W^{(2)} + b_\text{hidden}\right)
$$
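Before wiring this up in Theano, here’s a one-step sketch of that equation in plain numpy, just to make the shapes concrete (the sizes and variable names here are mine, for illustration only):

import numpy as np

input_size, hidden_size = 8, 24
x_i    = np.eye(input_size)[0]                           # a one-hot input vector
h_prev = np.zeros(hidden_size)                           # h_{i-1}
W1     = 0.1 * np.random.randn(input_size, hidden_size)  # W^(1)
W2     = 0.1 * np.random.randn(hidden_size, hidden_size) # W^(2)
b_h    = np.zeros(hidden_size)

h_i = np.tanh(x_i.dot(W1) + h_prev.dot(W2) + b_h)        # shape (hidden_size,)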
To reconstruct our $x_i$ and $h_i$, we need a few more equations. So, with some notation abuse, I’ll use $W^{(-1)}$ and $W^{(-2)}$ to signify weights that do the opposite of what $W^{(1)}$ and $W^{(2)}$ do:
$$
\hat{x_i} = h_i {}^{\top} W^{(-1)} + b_\text{input recon}
$$
$$
\hat{h_{i-1}} = h_i {}^{\top} W^{(-2)} + b_\text{hidden recon}
$$
Looking at these equations, it’s obvious we’ll need 4 sets of weights and 3 bias vectors:
import numpy as np
import theano
import theano.tensor as T

X = T.dmatrix('X')

# encoder weights: input -> hidden and hidden -> hidden
W_input_to_hidden = U.create_shared(U.initial_weights(input_size,hidden_size))
W_hidden_to_hidden = U.create_shared(U.initial_weights(hidden_size,hidden_size))
b_hidden = U.create_shared(U.initial_weights(hidden_size))
initial_hidden = U.create_shared(U.initial_weights(hidden_size)) # for the very first hidden reconstruction

# decoder ("reproduction") weights: hidden -> previous hidden and hidden -> input
W_hidden_to_hidden_reproduction = U.create_shared(U.initial_weights(hidden_size,hidden_size))
b_hidden_reproduction = U.create_shared(U.initial_weights(hidden_size))
W_hidden_to_input_reproduction = U.create_shared(U.initial_weights(hidden_size,input_size))
b_input_reproduction = U.create_shared(U.initial_weights(input_size))
parameters = [
    W_input_to_hidden,
    W_hidden_to_hidden,
    b_hidden,
    initial_hidden,
    W_hidden_to_hidden_reproduction,
    b_hidden_reproduction,
    W_hidden_to_input_reproduction,
    b_input_reproduction,
]
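A quick note on U: it’s just the small utilities module from the full code linked at the end of the post. As a rough idea of what its two helpers do, here’s a minimal sketch under my own assumptions: initial_weights fills the requested shape with Gaussian noise (as mentioned further down), and create_shared wraps the result in a Theano shared variable, which is what lets the training updates later on modify the parameters in place. This is not the author’s exact implementation.

import numpy as np
import theano

def initial_weights(*dimensions):
    # small random Gaussian noise of the requested shape
    return 0.1 * np.random.randn(*dimensions)

def create_shared(value, name=None):
    # shared variables can be updated in place by a compiled Theano function
    return theano.shared(value, name=name)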
Now that we have all the pieces, we need to put them together in the right configuration. There’s a little dance we need to do with theano.scan() that isn’t particularly obvious, but if you’re familiar with the API, notice that the parameters of the step() function correspond to the sequences, outputs_info and non_sequences arguments. The None values in outputs_info act as a sort of “mask”, so that only the first of the three return values is passed on to the next iteration (it doesn’t really work like that, but it’s a good way to think about it).
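To see that None-mask behaviour in isolation, here’s a tiny, self-contained theano.scan() example that has nothing to do with the RAE: the running sum sits in the non-None slot of outputs_info and gets fed back into step(), while the squared values occupy the None slot and are only collected as an output.

import theano
import theano.tensor as T

xs = T.dvector('xs')

def step(x, running_sum):
    # first return value feeds the next iteration, second is only collected
    return running_sum + x, x ** 2

[sums, squares], _ = theano.scan(
        step,
        sequences    = [xs],
        outputs_info = [T.zeros_like(xs[0]), None],
    )

f = theano.function([xs], [sums, squares])
# f([1., 2., 3.]) gives [1., 3., 6.] and [1., 4., 9.]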
def make_rae(inputs,W1_i,W1_m,b_h,i_h,W2_m,b2_m,W2_i,b2_i):
    def step(
            inputs,
            hidden_1,
            W1_m,W1_i,b_h,W2_m,b2_m,W2_i,b2_i):
        # h_i = tanh(x_i W^(1) + h_{i-1} W^(2) + b_hidden)
        hidden = T.tanh(
                T.dot(hidden_1,W1_m) +\
                T.dot(inputs,W1_i) +\
                b_h
            )
        # reconstructions of the previous hidden state and the current input
        reproduction_m = T.dot(hidden,W2_m) + b2_m
        reproduction_i = T.dot(hidden,W2_i) + b2_i
        return hidden,reproduction_m,reproduction_i
    [hidden_,reproduction_m_,reproduction_i_],_ = theano.scan(
            step,
            sequences     = [inputs],
            outputs_info  = [i_h,None,None],
            non_sequences = [W1_m,W1_i,b_h,W2_m,b2_m,W2_i,b2_i]
        )
    return hidden_,reproduction_m_,reproduction_i_
The unroll() function does the exact opposite. Given the encoding of a sequence, which is the last hidden representation produced above, it attempts to use the $W^{(-1)}$ and $W^{(-2)}$ matrices to “unroll” it back into the full sequence again:
def unroll(final_rep,W2_m,b2_m,W2_i,b2_i,n_steps):
    def step(curr_rep,W2_m,b2_m,W2_i,b2_i):
        # from the current representation, reconstruct the previous hidden
        # state and the input that was seen at that step
        next_rep = T.dot(curr_rep,W2_m) + b2_m
        input_rep = T.dot(curr_rep,W2_i) + b2_i
        return next_rep,input_rep
    [_,recon],_ = theano.scan(
            step,
            outputs_info  = [final_rep,None],
            non_sequences = [W2_m,b2_m,W2_i,b2_i],
            n_steps       = n_steps
        )
    return recon
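One thing to note before we move on: the training script further down calls build_network(8,24), which isn’t shown in this post; it lives in the full code linked at the end. It essentially just bundles the pieces above, so a rough sketch of how it might be wired up (my assumption, not the author’s exact code) would be:

def build_network(input_size, hidden_size):
    # the same shared variables as before, parameterised by the two sizes
    X = T.dmatrix('X')
    W_input_to_hidden               = U.create_shared(U.initial_weights(input_size,hidden_size))
    W_hidden_to_hidden              = U.create_shared(U.initial_weights(hidden_size,hidden_size))
    b_hidden                        = U.create_shared(U.initial_weights(hidden_size))
    initial_hidden                  = U.create_shared(U.initial_weights(hidden_size))
    W_hidden_to_hidden_reproduction = U.create_shared(U.initial_weights(hidden_size,hidden_size))
    b_hidden_reproduction           = U.create_shared(U.initial_weights(hidden_size))
    W_hidden_to_input_reproduction  = U.create_shared(U.initial_weights(hidden_size,input_size))
    b_input_reproduction            = U.create_shared(U.initial_weights(input_size))
    parameters = [
        W_input_to_hidden, W_hidden_to_hidden, b_hidden, initial_hidden,
        W_hidden_to_hidden_reproduction, b_hidden_reproduction,
        W_hidden_to_input_reproduction, b_input_reproduction,
    ]

    # encode the whole sequence, keeping the per-step reconstructions for training
    hidden, hidden1_reproduction, input_reproduction = make_rae(
        X,
        W_input_to_hidden, W_hidden_to_hidden, b_hidden, initial_hidden,
        W_hidden_to_hidden_reproduction, b_hidden_reproduction,
        W_hidden_to_input_reproduction, b_input_reproduction,
    )

    # decode the final hidden state back into (reversed) inputs
    unrolled = unroll(
        hidden[-1],
        W_hidden_to_hidden_reproduction, b_hidden_reproduction,
        W_hidden_to_input_reproduction, b_input_reproduction,
        X.shape[0],
    )
    return X, parameters, hidden, hidden1_reproduction, input_reproduction, unrolled

Note that unroll() is seeded with hidden[-1], the encoding of the entire sequence, and asked to take as many steps as there were inputs.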
Okay, so we’ve defined how, given a sequence, we can compute its encoding, and how we can unroll that encoding back into its original form. But how do we go about training the network to actually do this task? The U.initial_weights() function simply fills a matrix of the given size with random Gaussian noise, so the network as it stands will only give gibberish output.
Well, that’s where the $\hat{h_{i-1}}$ and $\hat{x_i}$ come in. As I explained in the previous post, at each time step we want the current input and the previous hidden state to be reconstructed as accurately as possible. Slightly more formally, we want to minimise the reconstruction error, which brings us to the problem of defining said error. Let’s just reach for a simple squared error and run with that:
$$
E = \sum_i \left[ (x_i - \hat{x_i})^2 + (h_{i-1} - \hat{h_{i-1}})^2 \right]
$$
We have to be careful here, or we’ll fall into an off-by-one bug and end up minimising $(h_{i-1} - \hat{h_i})^2$ instead:
def build_error(X,hidden,hidden1_reproduction,input_reproduction):
    input_reproduction_sqerror = T.sum((X - input_reproduction)**2)
    # hidden1_reproduction[t] is a reconstruction of hidden[t-1], hence the shift
    hidden_reproduction_sqerror = T.sum((hidden[:-1] - hidden1_reproduction[1:])**2)
    return input_reproduction_sqerror + hidden_reproduction_sqerror
Okay, we’ve got the error down, but how do we go about minimising it? As usual, we’ll just do simple stochastic gradient descent. Theano can do the symbolic differentiation for us with theano.grad(error,wrt=parameters).
And that’s all there is to it! Once that’s done, we just have to call theano.function to compile the training function, plus a test function to see what our trained model actually does.
if __name__ == '__main__':
    X,parameters,hidden,hidden1_reproduction,input_reproduction,unrolled = build_network(8,24)

    # test function: run the network and look at what it produces
    f = theano.function(
            inputs  = [X],
            outputs = [hidden,hidden1_reproduction,input_reproduction,unrolled]
        )

    # training function: one step of gradient descent on the reconstruction error
    error = build_error(X,hidden,hidden1_reproduction,input_reproduction)
    gradients = T.grad(error,wrt=parameters)
    updates = [ (p, p - 0.0001*g) for p,g in zip(parameters,gradients) ]
    train = theano.function(
            inputs  = [X],
            updates = updates,
            outputs = error
        )

    # train on shuffled one-hot sequences, then inspect one example
    example = np.eye(8)
    for _ in xrange(1000000):
        np.random.shuffle(example)
        print train(example)

    np.random.shuffle(example)
    hidden, hidden_rep, input_rep, unrlld = f(example)
    print_arr(hidden)
    print_arr(input_rep)
    print_arr(unrlld)
The full code is here.
Let’s give it a whirl and see what happens:
The top matrix shows the input into the network, while the bottom one shows the reproduced input in reverse.
You can see that it manages to somewhat recall the last 3 inputs of the sequence, but then muddles up the rest. It probably hasn’t been optimised far enough, but I think we may be able to use a few tricks to make the training go a little faster.
We’ll take a look at those in the next installment.
@misc{tan2014-05-12,
    title        = {Recursive Auto-encoders: Example in Theano},
    author       = {Tan, Shawn},
    howpublished = {\url{https://blog.wtf.sg/2014/05/12/recursive-auto-encoders-example-in-theano/}},
    year         = {2014}
}