Neural Turing Machines – Copy Task

After much fiddling around with the instability of the training procedure, I still haven't found a recipe that gets it to converge consistently.

I did find, though, that training it on shorter sequences first, before letting it see longer ones, avoids the huge gradients that make the parameters explode into NaNs. That is a huge help. Doing that still does not guarantee convergence, and I only get a good model by chance, like this one I've trained here copying a sequence of length 10:
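Roughly, the two tricks mentioned above, a length curriculum plus gradient clipping, can be sketched like this. The helper names (`make_copy_sequence`, `clip_gradients`, `curriculum_lengths`) are my own, not from the post's code, and the actual NTM training step is elided:

```python
import numpy as np

def make_copy_sequence(length, width=8, rng=np.random):
    """Random binary sequence for the copy task."""
    return (rng.rand(length, width) > 0.5).astype(np.float32)

def clip_gradients(grads, max_norm=10.0):
    """Rescale gradients whose global norm exceeds max_norm,
    to stop the parameters exploding into NaNs."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads

def curriculum_lengths(start=2, stop=20, steps_per_length=1000):
    """Yield training lengths from short to long, so the model
    sees short sequences before longer ones."""
    for length in range(start, stop + 1):
        for _ in range(steps_per_length):
            yield length

# Hypothetical training loop: train_step(seq) would compute the loss
# on one copy-task sequence and return the parameter gradients.
# for length in curriculum_lengths():
#     seq = make_copy_sequence(length)
#     grads = clip_gradients(train_step(seq))
#     apply_update(grads)
```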


It's making mistakes even on sequences of length 10, and as the sequences get longer, the mistakes become more and more apparent. This is what it does for sequences of length 20 and 50:


And 50, as you can probably tell, is an absolute train wreck.

This particular model has decided it likes to store the sequence backwards in memory. There's really no reason why it should start at memory position 0 and work forwards, so this one starts at position 121 and works backwards. I've selected the interesting part of the generated weights here (memory positions 99 to 122):

I feel like I'm missing some important secret sauce to get this to work reliably. This particular set of trained parameters shows that it can be done, but for this to be useful at all I need to be able to reproduce it consistently. My doubts about whether a single set of weights for both writing and reading would work have been cleared up, but a proper training procedure, beyond the bunch of hacks I'm currently resorting to, would really help.



  1. Yo,

    So what are your thoughts on R/W weightings? It looks like from your code you are using a single weighting, but looking at the diagrams in the paper from the copy task it looks like there are two weightings.

  2. So what are your thoughts on whether or not there are multiple read/write heads? From your code it looks like there is a single head, a matrix of weights connected to the hidden layer, but there also seem to be multiple read/write heads in the diagrams explaining the weightings.



    • My current understanding is that there can be multiple heads in the system, and they can act as read or write heads depending on the controller. All you have to do to create new heads is to slap another set of head-parameter prediction units on top of the hidden layer.

  3. Yo,

    I looked at your code and I noticed one thing that may be different from what they do in the paper: you store one set of prev_weights for all the read/write heads, but the paper seems to say that you should store one set of prev_weights per head.

    “The value of g is used to blend between the weighting w_t produced by THE head at the previous time-step …”

