Neural Turing Machines FAQ

There’s been some interest in the Neural Turing Machines paper, and I’ve been getting questions about my implementation via e-mail and in the comments section of this blog. I plan to keep coming back to this post and updating it with answers to these questions as they come up, so do check back!

$$
\newcommand{\memory}{\mathbf{M}}
\newcommand{\read}{\mathbf{r}}
\newcommand{\erase}{\mathbf{e}}
\newcommand{\add}{\mathbf{a}}
\newcommand{\weight}{\mathbf{w}}
\newcommand{\key}{\mathbf{k}}
\newcommand{\shift}{\mathbf{s}}
\newcommand{\Shift}{\mathbf{S}}
$$

Controller heads. What? How? Huh?

This is one of the important questions that comes up, and also one of the vaguer points in the entire paper. The paper describes how the head parameters are used to compute the read/write updates to the memory, but says little about where those parameters come from.

My guess, and this is the current state of the code, is that each head can be a read OR a write head, and the controller can decide which action to perform.

The basic way the parameters are “predicted” is with an affine transformation of the controller’s hidden layer. Each output is then passed through the relevant non-linearity to meet the hard constraints required of that parameter. To reiterate from one of the previous posts, here they are:
$$
\begin{align}
\key &= \hat{\key} &\\
\beta &= \log\left(e^{\hat{\beta}} + 1 \right) &\Longrightarrow &\beta > 0 \\
g &= \sigma\left(\hat{g}\right) &\Longrightarrow & g \in (0,1) \\
(\shift)_i &= \frac{\exp((\hat{\shift})_i)}{\sum_j \exp((\hat{\shift})_j)} &\Longrightarrow
& (\shift)_i \in (0,1),\sum_i (\shift)_i = 1 \\
\gamma &= \log\left(e^{\hat{\gamma}} + 1 \right) + 1 &\Longrightarrow & \gamma \geq 1 \\
\end{align}
$$

In Alex Graves’ previous work, he enforces a positive output with the $\exp$ function, but I’ve found that it introduces a lot of numerical instability during training, and the softplus $\log(e^x + 1)$ is a better option. For those wondering: no, I did not pull this function out of thin air. It’s a well-known non-linearity, the softplus, which is a smooth approximation of the rectifier.
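To make that mapping concrete, here is a minimal sketch, in plain NumPy rather than Theano for readability, of how one head’s raw affine outputs could be squashed into the constrained parameters. The slicing layout and sizes (`mem_width=20`, a shift window of 3) are illustrative assumptions, not the repo’s exact code.

```python
import numpy as np

def softplus(x):
    # log(e^x + 1): enforces positivity without the blow-ups exp() can cause
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def head_params(hidden, W, b, mem_width=20, shift_width=3):
    """Affine transform of the controller's hidden layer, then a per-parameter
    non-linearity to satisfy each hard constraint."""
    raw = W @ hidden + b
    k_hat, rest = raw[:mem_width], raw[mem_width:]
    beta_hat, g_hat, gamma_hat = rest[0], rest[1], rest[2]
    s_hat = rest[3:3 + shift_width]
    return {
        "k":     k_hat,                      # key: unconstrained
        "beta":  softplus(beta_hat),         # key strength: > 0
        "g":     sigmoid(g_hat),             # interpolation gate: in (0, 1)
        "s":     softmax(s_hat),             # shift weighting: sums to 1
        "gamma": softplus(gamma_hat) + 1.0,  # sharpening: >= 1
    }
```

In the Theano code the same mapping is done symbolically; write heads additionally need to produce the erase and add vectors described in the paper.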

So if the controller chooses to write instead of read, it just has to ignore the input from the read head at the next time-step. This is essentially what it does when I train it for the copy task.
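For reference, the memory updates that these head parameters ultimately drive can be sketched as follows, following the equations in the NTM paper: a read is a weighted sum over memory rows, and a write is an erase phase followed by an add phase. This is a hedged sketch; shapes and function names are illustrative rather than the repo’s.

```python
import numpy as np

def read(memory, w):
    # memory: (N, mem_width) matrix, w: (N,) weighting over rows
    # r = sum_i w_i * M_i
    return w @ memory

def write(memory, w, erase, add):
    # erase, add: (mem_width,) vectors emitted by a write head
    memory = memory * (1.0 - np.outer(w, erase))  # erase phase
    memory = memory + np.outer(w, add)            # add phase
    return memory
```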

What are the training targets for the heads?

This is still related to heads, but I think people asking this are sort of missing one of the crucial points of the paper, so it gets its own subsection.

There are no targets. That’s part of what makes it awesome. Supply it with the structure parameters, feed it input and target output sequences, and it’s supposed to learn a procedure that maps one to the other. Of course, this is easier said than done, and I’ve looked at some of Alex Graves’ past work for hints about how he handles certain situations: output heads, gradient clipping, the choice of gradient method, etc.
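To make the point concrete, here is a small sketch of how copy-task training pairs can be generated: all the network ever sees is an input sequence and a target sequence, and nothing tells the heads what to do with the memory. The exact layout, 8-bit random vectors plus a delimiter channel, follows the usual copy-task setup and is an assumption, not necessarily the repo’s format.

```python
import numpy as np

def copy_task_example(seq_len, width=8):
    seq = np.random.randint(0, 2, size=(seq_len, width)).astype(float)

    # Input: the sequence, then a delimiter flag, then blanks while the
    # network is expected to reproduce the sequence from memory.
    inputs = np.zeros((2 * seq_len + 1, width + 1))
    inputs[:seq_len, :width] = seq
    inputs[seq_len, width] = 1.0              # delimiter channel

    # Target: nothing during presentation, then the original sequence.
    # Note there is no supervision of the heads anywhere in here.
    targets = np.zeros((2 * seq_len + 1, width))
    targets[seq_len + 1:, :] = seq
    return inputs, targets
```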

Why don’t you use X for the activation function instead of $\tanh$? Why did you do Y instead of Z?

I chose whatever I felt was simplest. So I started out with a feed-forward network instead of an LSTM, and with sigmoid activations before I tried $\tanh$, switching because it worked better. That’s all. The code’s there for everyone to modify, though there’ve been 24 forks to date and only one person has submitted a pull request. If you find something that works better, please make a pull request, or just share your code on your fork. The best thing we can do right now, since we’re getting radio silence from the authors, is to share what we know.

You can make it faster if you do A.

These are all good suggestions, but while the code I have doesn’t converge consistently, our main goal should be getting a version that does before we do any optimisations for speed. With Theano, I’ve found that most speed optimisations just make the code harder to read, and Theano errors are already hard enough to debug as it is. As Donald Knuth put it, “premature optimisation is the root of all evil.”

Are you going to implement the rest of the tasks?

Hopefully, in due time. Other things have attracted my attention lately, and I haven’t really focused on this project in the last 2 months. My hope was to apply it to NLP tasks like language modeling, but the results I got from the copy task weren’t promising, and I wonder how far I’ll get doing that. Again, pull requests welcome!

The thumbnail on reddit is ugly as hell.

Okay, no one really said this but I think I need to apologise again for it. I’ve changed my gravatar since.
