## Recursive Auto-encoders: Momentum

In the previous post, we wrote the code for an RAE using the Theano library, but it wasn't successful at the simple task of reversing a randomised sequence of the digits 1 to 8. One of the tricks we can use when dealing with time-sequence data is a small learning rate combined with momentum. I'll discuss what momentum is, and show a simple way it can be implemented in Theano.

## Momentum

Stochastic gradient descent on its own makes very sharp moves in the direction of the gradient calculated for that particular mini-batch. In cases where the error surface has "ridges", following the immediate gradient causes the parameters to bounce back and forth across the ridge, resulting in a lot of redundant movement and only small improvements in the direction of low error.

One way to get around this is through the use of momentum. To develop an intuition for this, imagine a ball rolling down a "ridge" like the one described above. In the physical world, the ball would over time gain momentum in the direction along the ridge, rather than continuously rolling from side to side. The same effect can be achieved by using a moving average of gradients. In this way, directions in which the gradient is consistent get accumulated, and the parameters move further in those directions.

In Greek:

$$\begin{aligned} v_{t+1} &= \mu v_t - \varepsilon \nabla f(\theta_t) \\ \theta_{t+1} &= \theta_t + v_{t+1} \end{aligned}$$
where $\mu$ and $\varepsilon$ are variables controlling the momentum and learning rate respectively. We can think of $v_t$ as the "velocity" of the ball at time $t$.
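To see the update rule in action, here's a minimal NumPy sketch (illustrative only, not the Theano code from this series): plain SGD and SGD with momentum on an ill-conditioned quadratic $f(x, y) = \frac{1}{2}(x^2 + 100y^2)$, whose steep walls in $y$ form exactly the kind of "ridge" described above.

```python
import numpy as np

def grad(theta):
    # gradient of f(x, y) = 0.5 * (x^2 + 100 * y^2)
    return np.array([theta[0], 100.0 * theta[1]])

def run(mu, eps=0.009, steps=200):
    theta = np.array([10.0, 1.0])  # start away from the minimum at (0, 0)
    v = np.zeros(2)
    for _ in range(steps):
        v = mu * v - eps * grad(theta)  # v_{t+1} = mu * v_t - eps * grad f(theta_t)
        theta = theta + v               # theta_{t+1} = theta_t + v_{t+1}
    return np.abs(theta).max()          # distance from the minimum

plain = run(mu=0.0)           # mu = 0 reduces to plain SGD
with_momentum = run(mu=0.9)
```

With the same learning rate and step budget, the momentum run ends up much closer to the minimum: the $y$ oscillations across the ridge average out, while the consistent $x$ gradient accumulates speed along it.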

Ilya Sutskever's paper "On the importance of initialization and momentum in deep learning" has a diagram of what is being calculated when we do this, giving the sense that the current gradient at time $t$ just nudges the direction in which the parameters are already going.

And that’s the gist of it! To vim!

## Implementation

The paper makes recommendations as to how the two variables $\mu$ and $\varepsilon$ should be set and adjusted over time, and I’ve implemented them in my code.

We first need to define two new variables for $\mu$ and $\varepsilon$. We also need to keep track of $\theta_t$ and $v_t$ since we can compute the values at $t+1$ once we have the corresponding gradients.

```python
# declare the variables
eps = T.dscalar('eps')
mu  = T.dscalar('mu')

# create the shared variables for the velocity
deltas = [ U.create_shared(np.zeros(p.get_value().shape)) for p in parameters ]

# update velocity given the new gradient
delta_nexts = [ mu*delta + eps*grad for delta, grad in zip(deltas, gradients) ]

# store the new velocity
delta_updates = [ (delta, delta_next) for delta, delta_next in zip(deltas, delta_nexts) ]

# update the parameters in the direction of velocity
param_updates = [ (param, param - delta_next) for param, delta_next in zip(parameters, delta_nexts) ]

train = theano.function(
    inputs  = [X, eps, mu],
    updates = delta_updates + param_updates,
    outputs = error
)
```

So now, every time we call `train`, we also need to supply values for $\mu$ and $\varepsilon$. This is useful because we typically want to reduce the learning rate while increasing the momentum as more data has been seen.
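As a sketch of what such a schedule might look like (the constants and the helper name here are illustrative assumptions, not taken from the code above), the momentum ramp below follows the shape suggested in Sutskever et al., where $\mu$ steps towards a maximum value every 250 updates:

```python
import numpy as np

def schedule(epoch, eps0=0.01, mu_max=0.99):
    # learning rate decays slowly as more data is seen (decay constant is an assumption)
    eps = eps0 / (1.0 + 0.001 * epoch)
    # momentum ramps up towards mu_max: mu = 1 - 0.5 / n, n = epoch // 250 + 1,
    # capped at mu_max as recommended in the paper
    mu = min(mu_max, 1.0 - 0.5 ** (1.0 + np.log2(epoch // 250 + 1)))
    return eps, mu

# each training call would then look something like:
#   cost = train(batch, *schedule(epoch))
```

At epoch 0 this gives $\mu = 0.5$, rising to $0.75$ at epoch 250, $0.875$ at 750, and so on, approaching but never exceeding `mu_max`.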

If you’re not convinced by the theoretical mumbo jumbo above, here’s a graph of the error over time for this toy example we’re playing with. The red graph shows the error without momentum, and the blue one with momentum turned on.

There's a lot of oscillation going on in the red graph, most likely due to "ridges" in the error surface: the parameters repeatedly travel across them rather than along them.

With the momentum method, training converges much faster, and the network can now reliably reproduce the sequence of 8 digits. This is done with 64 hidden units in the recurrent network. I've tried smaller sizes, but the reproductions don't get further than the last three items of the sequence.

My hope is to be able to do this type of encoding for text data. I’ll be running some experiments and hopefully have something positive to write about in the next installment!