After much fiddling around with the instability of the training procedure, I still haven’t found a recipe that would get it to converge consistently.
I did find, though, that training it on shorter sequences first, before letting it see longer ones, avoids the huge gradients that make the parameters explode into NaNs. That is a huge help. It still does not guarantee convergence, though, and I only get a good model by chance, like this one I’ve trained, shown here copying a sequence of length 10:
It’s making mistakes even on sequences of 10, and as sequences get longer, the mistakes become more and more apparent. This is what it does for sequences of 20 and 50:
And 50, as you can probably tell, is an absolute train wreck.
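To put a number on "mistakes" rather than eyeballing the plots, one option is to threshold the outputs and count mismatched bits per sequence length. This is only a sketch: the names `count_bit_errors` and `bit_error_rate` and the 0.5 threshold are my own assumptions, and it expects the model's outputs as an array of values in [0, 1] with the same shape as the target.

```python
import numpy as np

def count_bit_errors(pred, target, threshold=0.5):
    """Count positions where the thresholded prediction disagrees with the target bit."""
    pred_bits = np.asarray(pred) > threshold      # assumed: outputs are in [0, 1]
    target_bits = np.asarray(target) > 0.5        # targets are 0/1 bits
    return int(np.sum(pred_bits != target_bits))

def bit_error_rate(pred, target, threshold=0.5):
    """Fraction of wrong bits, so lengths 10, 20 and 50 can be compared directly."""
    return count_bit_errors(pred, target, threshold) / np.asarray(target).size
```

Using the rate instead of the raw count makes the comparison fair, since a longer sequence has more bits to get wrong.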
This particular model has decided it likes to store the sequence backwards in memory. Nothing forces it to start at memory position 0 and work its way forward, so it starts at 121 and works backwards. I’ve selected the interesting part of the generated weights here (from memory position 99 to 122):
I feel like I’m missing some important secret sauce to get this to work right. Having this particular set of trained parameters shows that it can be done, but for this to be useful at all I need to be able to do it consistently. Some of my doubts about whether a single set of weights for both writing and reading would work have been cleared up, but finding a proper training procedure to replace the bunch of hacks I’m currently resorting to would be really helpful.
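For reference, the "shorter sequences first" hack mentioned earlier can be sketched roughly like this. It is not my actual training code: the data layout (a delimiter on an extra input channel, blank inputs during recall), the linear schedule, and the names `make_copy_example` and `curriculum_lengths` are all assumptions for illustration, and the training step itself is left out.

```python
import numpy as np

def make_copy_example(seq_len, width=8, rng=None):
    """One copy-task example: show a random bit sequence, then ask for it back."""
    rng = np.random.default_rng() if rng is None else rng
    seq = rng.integers(0, 2, size=(seq_len, width)).astype(np.float32)
    steps = 2 * seq_len + 1                       # presentation + delimiter + recall
    inputs = np.zeros((steps, width + 1), dtype=np.float32)
    inputs[:seq_len, :width] = seq                # presentation phase
    inputs[seq_len, width] = 1.0                  # delimiter on an extra channel (assumed)
    targets = np.zeros((steps, width), dtype=np.float32)
    targets[seq_len + 1:] = seq                   # recall phase: reproduce the sequence
    return inputs, targets

def curriculum_lengths(n_steps, start_len=1, max_len=10, rng=None):
    """Yield one sequence length per training step; the allowed maximum grows linearly."""
    rng = np.random.default_rng() if rng is None else rng
    denom = max(n_steps - 1, 1)
    for step in range(n_steps):
        cap = start_len + (max_len - start_len) * step // denom
        yield int(rng.integers(1, cap + 1))       # uniform in [1, cap]

rng = np.random.default_rng(0)
for seq_len in curriculum_lengths(n_steps=5, rng=rng):
    x, y = make_copy_example(seq_len, rng=rng)
    # a hypothetical train_step(x, y) would go here
```

Sampling lengths up to the current cap, rather than always training at the cap, keeps short sequences in the mix while longer ones are phased in.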