The end of this post (I can’t find the article any more) had a diagram showing the improvements of AdaDelta over standard SGD and AdaGrad, so I decided to look up what AdaDelta actually does. The details are written in the paper, including its “derivation”. It’s basically an improvement over AdaGrad: it replaces AdaGrad’s accumulated sum of squared gradients with a rolling average, and also multiplies each step by the RMS of the rolling average of previous changes to the weights.
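For reference, here is the update as I read it from the AdaDelta paper. The notation ($x_t$ for the parameters, $g_t$ for the gradient at step $t$, $E[\cdot]$ for a rolling average) is my assumption about how best to write it down here, so check it against the paper before relying on it:

$$
\begin{aligned}
E[g^2]_t &= \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2 \\
\Delta x_t &= -\frac{\sqrt{E[\Delta x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}}\; g_t \\
E[\Delta x^2]_t &= \rho\, E[\Delta x^2]_{t-1} + (1-\rho)\, \Delta x_t^2 \\
x_{t+1} &= x_t + \Delta x_t
\end{aligned}
$$

The two square roots are the “RMS” terms: the numerator tracks how large recent steps were, the denominator how large recent gradients were.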

It’s not hard to implement, but you now have to store the average of the squared $\Delta$s, and also the average of the squared gradients. Thanks to Theano, the code doesn’t look too far from the pseudocode in the paper:

    import numpy as np
    import theano.tensor as T
    from itertools import izip
    # U is a utility module that is not shown here; U.create_shared is
    # presumably a thin wrapper that turns an array into a Theano shared variable.

    def updates(parameters,gradients,rho,eps):
        # create variables to store intermediate updates
        gradients_sq = [ U.create_shared(np.zeros(p.get_value().shape)) for p in parameters ]
        deltas_sq = [ U.create_shared(np.zeros(p.get_value().shape)) for p in parameters ]

        # calculates the new "average" squared gradient for this iteration
        gradients_sq_new = [ rho*g_sq + (1-rho)*(g**2) for g_sq,g in izip(gradients_sq,gradients) ]

        # calculates the step in direction. The square root is an approximation to getting the RMS for the average value
        deltas = [ (T.sqrt(d_sq+eps)/T.sqrt(g_sq+eps))*grad for d_sq,g_sq,grad in izip(deltas_sq,gradients_sq_new,gradients) ]

        # calculates the new "average" deltas for the next step.
        deltas_sq_new = [ rho*d_sq + (1-rho)*(d**2) for d_sq,d in izip(deltas_sq,deltas) ]

        # Prepare it as a list of (shared variable, new value) pairs
        gradient_sq_updates = zip(gradients_sq,gradients_sq_new)
        deltas_sq_updates = zip(deltas_sq,deltas_sq_new)
        parameters_updates = [ (p,p - d) for p,d in izip(parameters,deltas) ]
        return gradient_sq_updates + deltas_sq_updates + parameters_updates

The nice thing about the AdaDelta update is that there’s no extra learning rate decay policy to fret over. Pick a value for $\rho$ and $\varepsilon$ and you’re good to go. In this case, I used 0.95 and 1e-6.
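To make the role of $\rho$ and $\varepsilon$ concrete, here is a minimal NumPy sketch of the same update rule, without Theano. The function names `adadelta_step` and `minimise_quadratic` and the toy objective $f(x) = \sum_i x_i^2$ are made up for illustration and are not part of the code used for the experiments in this post.

```python
import numpy as np

def adadelta_step(x, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One AdaDelta step. Returns the new (x, avg_sq_grad, avg_sq_delta)."""
    # Rolling average of the squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # Step: RMS of previous steps divided by RMS of gradients, times the gradient.
    delta = np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    # Rolling average of the squared steps, used by the next iteration.
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    return x - delta, avg_sq_grad, avg_sq_delta

def minimise_quadratic(x0, steps=2000):
    """Run AdaDelta on the toy objective f(x) = sum(x**2) (illustrative only)."""
    x = np.asarray(x0, dtype=float)
    avg_sq_grad = np.zeros_like(x)
    avg_sq_delta = np.zeros_like(x)
    for _ in range(steps):
        grad = 2.0 * x  # gradient of sum(x**2)
        x, avg_sq_grad, avg_sq_delta = adadelta_step(x, grad, avg_sq_grad, avg_sq_delta)
    return x

if __name__ == "__main__":
    print(minimise_quadratic([1.0, -2.0]))
```

Note that with both averages starting at zero, the very first steps are tiny (on the order of $\sqrt{\varepsilon}$), and the step size then grows as the average of squared steps builds up. That is why $\varepsilon$ matters more than it might look at first.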

I wanted to see if it was good for training RAEs, so I used the sequence of 8 examples again and compared its convergence rate to that of momentum and plain old SGD. Plain ol’ SGD took about 800,000 iterations and still did not get below the threshold error, while momentum got there in 440,000 iterations. AdaDelta, though, managed it in 75,000 iterations.

I think I’ll be using this gradient method from here on.

1. • Haha, you’re right. I have no idea where that article is now.

2. Hello,
If I understand correctly, the red curve is the training loss using AdaDelta + momentum.
How did you combine them? They are two different methods, of which you can use only one at a time. Did you just add the AdaDelta update and the momentum update together to get the final parameter update, or did you do something else? I’m interested to know. Thank you.
Do you have any references for the combination?
