The end of this post (I can no longer find the article) had a diagram showing the improvements of AdaDelta over standard SGD and AdaGrad, so I decided to look up what AdaDelta actually does. The details are written in the paper, including its “derivation”. It’s basically an improvement over AdaGrad: it uses rolling averages of the squared gradients instead of a sum over all past gradients, and it also multiplies each step by the RMS of the rolling average of previous changes to the weights.
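Written out, the update looks roughly like this. This is a sketch in what I understand to be the paper's notation, where $g_t$ is the gradient at step $t$, $x_t$ are the parameters, $E[\cdot]$ denotes a rolling average that starts at zero, $\rho$ is the decay rate, and $\varepsilon$ is a small constant that keeps the square roots away from zero:

$$
\begin{aligned}
E[g^2]_t &= \rho\,E[g^2]_{t-1} + (1-\rho)\,g_t^2 \\
\Delta x_t &= -\frac{\sqrt{E[\Delta x^2]_{t-1} + \varepsilon}}{\sqrt{E[g^2]_t + \varepsilon}}\,g_t \\
E[\Delta x^2]_t &= \rho\,E[\Delta x^2]_{t-1} + (1-\rho)\,\Delta x_t^2 \\
x_{t+1} &= x_t + \Delta x_t
\end{aligned}
$$

The numerator uses the average from the previous step because the current $\Delta x_t$ is not known yet when the step is computed.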
It’s not hard to implement, but you now have to store two rolling averages per parameter: the average of the squared $\Delta$s and the average of the squared gradients. Thanks to Theano, the code doesn’t look too far from the pseudo code in the paper:
import numpy as np
import theano.tensor as T
from itertools import izip  # Python 2 only
# U is a helper module that is not shown here; U.create_shared is assumed to
# wrap a numpy array in a Theano shared variable.
# The function name and signature below are assumed; the snippet is the body
# of a function that builds the list of updates.
def updates(parameters, gradients, rho, eps):
    # create variables to store intermediate updates
    gradients_sq = [ U.create_shared(np.zeros(p.get_value().shape)) for p in parameters ]
    deltas_sq = [ U.create_shared(np.zeros(p.get_value().shape)) for p in parameters ]
    # calculates the new rolling "average" of the squared gradients
    gradients_sq_new = [ rho*g_sq + (1-rho)*(g**2) for g_sq,g in izip(gradients_sq,gradients) ]
    # calculates the step in direction. The square root is an approximation to getting the RMS for the average value
    deltas = [ (T.sqrt(d_sq+eps)/T.sqrt(g_sq+eps))*grad for d_sq,g_sq,grad in izip(deltas_sq,gradients_sq_new,gradients) ]
    # calculates the new rolling "average" of the squared deltas for the next step
    deltas_sq_new = [ rho*d_sq + (1-rho)*(d**2) for d_sq,d in izip(deltas_sq,deltas) ]
    # Prepare the updates as lists of (shared variable, new value) pairs
    gradient_sq_updates = zip(gradients_sq,gradients_sq_new)
    deltas_sq_updates = zip(deltas_sq,deltas_sq_new)
    parameters_updates = [ (p,p - d) for p,d in izip(parameters,deltas) ]
    return gradient_sq_updates + deltas_sq_updates + parameters_updates
The nice thing about the AdaDelta update is that there’s no extra learning rate decay policy to fret over. Pick values for $\rho$ and $\varepsilon$ and you’re good to go. In this case, I used $\rho = 0.95$ and $\varepsilon = 10^{-6}$.
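To get a feel for the update without Theano, here is a minimal plain-Python sketch of the same rule, using the same $\rho$ and $\varepsilon$ as defaults. The names `adadelta_step` and `minimise_quadratic` and the toy objective $f(x) = \sum_i x_i^2$ are made up for illustration; this is not the RAE setup from this post.

```python
import math


def adadelta_step(params, grads, grad_sq, delta_sq, rho=0.95, eps=1e-6):
    """One AdaDelta update on plain lists of floats; returns the new state."""
    new_params, new_grad_sq, new_delta_sq = [], [], []
    for p, g, g_sq, d_sq in zip(params, grads, grad_sq, delta_sq):
        # rolling average of the squared gradients
        g_sq = rho * g_sq + (1 - rho) * g * g
        # step: RMS of previous steps over RMS of gradients, times the gradient
        d = math.sqrt(d_sq + eps) / math.sqrt(g_sq + eps) * g
        # rolling average of the squared steps, used on the next call
        d_sq = rho * d_sq + (1 - rho) * d * d
        new_params.append(p - d)
        new_grad_sq.append(g_sq)
        new_delta_sq.append(d_sq)
    return new_params, new_grad_sq, new_delta_sq


def minimise_quadratic(start, steps=2000):
    """Toy problem (assumed for illustration): f(x) = sum(x_i^2), gradient 2*x_i."""
    params = list(start)
    grad_sq = [0.0] * len(params)
    delta_sq = [0.0] * len(params)
    for _ in range(steps):
        grads = [2.0 * p for p in params]
        params, grad_sq, delta_sq = adadelta_step(params, grads, grad_sq, delta_sq)
    return params


if __name__ == "__main__":
    print(minimise_quadratic([3.0, -2.0]))
```

Both rolling averages start at zero, so the very first steps are tiny (on the order of $\sqrt{\varepsilon}$) and the step size then builds up on its own, which is why no learning rate appears anywhere in the sketch.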
I wanted to see if it was good for training RAEs, so I used the sequence of 8 examples again and compared its convergence rate to that of momentum and plain old SGD.
Plain ol’ SGD took about 800,000 iterations and still did not get below the threshold error, while momentum got there in 440,000 iterations. AdaDelta, though, managed it in 75,000 iterations.
I think I’ll be using this gradient method from here on.