## Learning to Transduce with Unbounded Memory – The Neural Stack

DeepMind has in the past week released a paper proposing yet another approach to having a memory structure within a neural network. This time, they implement a stack, queue and a deque “data structure” within their models. While this idea is not necessarily new, it incorporates some of the broad ideas seen in the Neural Turing Machines, where they try to have a model that is end-to-end differentiable, rather than have the data structure decoupled from the training process. I have to admit I haven’t read any of these previous papers before, but it’s definitely on my to read list.

In any case, this paper claims that using these memory structures beats having an 8-layered LSTM network trained for the same task. If this is true, this may mean we finally have some justification for these fancier models — simply throwing bigger networks at problems just isn’t as efficient.

I’ve spent some time trying to puzzle out what exactly they’re trying to do here with the neural stack. I suspect once I’ve figured this out, the queue and deque will be pretty similar, so I don’t think I will go through them in the same detail.

$$\newcommand{V}{V}

\newcommand{v}{\mathbf{v}}

\newcommand{s}{\mathbf{s}}

\newcommand{r}{\mathbf{r}}$$

So let’s take a look at the $\V_t$ and $\s_t$:

$\V_t$ is essentially where the stack resides. It’s easier to think of it as a huge matrix at every timestep $t$, and at any point in time, the matrix is only “filled” up to the $t$-th row of matrix $\V_t$. Everything from the first row to the $t-1$-th row is copied from the previous timestep. The current time step will be $\v_t$. If you are familiar with the LSTMs, then $\v_t$ is essentially the vector being “written” into the memory at time $t$. However, this doesn’t mean a *push* behaviour.

That’s where $\s_t$ comes in. If we were to think of values in $\s_t$ as strictly $1$ or $0$, then what it represents is: at time $t$, which of the rows in $\V_t$ represent items that are on the stack. But how do we update $\s_t$?

That’s a little more complicated.

From the formula, it’s easy enough to see, and reason about why, the $t$-th value of $\s_t$ is $d_t$. $d_t$ is the value of a “push signal” from the controller and is in $(0,1)$. Again, if we try to think about this in terms of strictly binary values, we see that if a push action is needed, then whatever $\v_t$ is needs to be *pushed* onto the stack. This would mean that $\s_t[t] = 1$. Conversely, no push, $\s_t[t] = 0$.

So what about the pop behaviour? Let’s stay in the realm of the strictly binary, things are nicer that way. We need to think about what happens when $u_t = 1$. The top most item on the stack corresponds to the element that’s set to 1 in $\s_t$. The authors of this paper have achieved this using equation (2), which I will attempt to explain using my next diagram.

I’d like to start looking at this from the latest elements (when $i = t-1$) and going to the start. If we assume that there is a series of $0$s before we see our first $1$, then we know that in updating $\s_{t-1}$ to $\s_{t}$, all the latest zeros are copied over. The difference is when we see the first $i$ where $\s_{t-1}[i] = 1$. Plugging that into equation (2), we see that when $u_t = 1$, then $\s_{t}[i] = 0$.

Subsequently, every other item can just be copied over since the summation inside the inner max function will always return a $0$ after the first $1$ is seen.

As for equation (3) which governs the $\r_t$ read vector, the idea is similar. The $\min$ term gives the topmost item set to $1$ in $\s_t$. This is then multiplied by $\V_t[i]$ to give a sort of “expected value” of the read vector.

I guess overall, I’m mostly puzzled at the formulation used to achieve these behaviours, as I would have gone for a more multiplicative method that is “probabilistic” (sums to 1). I do wonder if it is some formalised way to convert some programmatic behaviour into these analysable operations for attaining partial derivatives. It would be really interesting if that were true. Perhaps I’ll post an alternative formulation of this stack pushing popping behaviour at a later date.