Neural Turing Machines – A First Look
Some time last week, a paper from Google DeepMind, Neural Turing Machines by Graves, Wayne and Danihelka, caught my attention.
The paper is of particular interest to me because I've been thinking about how a recurrent neural network could learn to have access to an external form of memory. The approach taken here is interesting: it strikes a balance between seeking a location by similarity of content, and shifting relative to that location.

My focus this time will be on some of the details needed for implementation. Some of these specifics are glossed over in the paper, so I'll try to infer whatever I can and, perhaps in the next post, have code (in Theano, what else?) to present.
$$
\newcommand{\memory}{\mathbf{M}}
\newcommand{\read}{\mathbf{r}}
\newcommand{\erase}{\mathbf{e}}
\newcommand{\add}{\mathbf{a}}
\newcommand{\weight}{\mathbf{w}}
\newcommand{\key}{\mathbf{k}}
\newcommand{\shift}{\mathbf{s}}
\newcommand{\Shift}{\mathbf{S}}
$$
Reading and Writing
The memory matrix, $\memory_t$, is modified at every time step, and it is interacted with via *heads*: reading ($\read_t$) and writing, which is split into erase ($\erase_t$) and add ($\add_t$). This is similar to the mechanisms used in Long Short-Term Memory (LSTM) units, with a slight change in naming: *write* becomes *add*, since the old name would clash with the new weighting vector ($\weight_t$).
$$
\begin{align}
\memory_t
&= \left[\begin{array}{ccc}
\memory_t(0)_0 & \cdots & \memory_t(0)_{N-1}\\
\vdots & \ddots & \vdots \\
\memory_t(i)_0 & \cdots & \memory_t(i)_{N-1}\\
\vdots & \ddots & \vdots \\
\memory_t(M-1)_0 & \cdots & \memory_t(M-1)_{N-1}\\
\end{array}\right]
&
\weight_t &= \left[ \begin{array}{c}
w_t(0)\\
\vdots\\
w_t(i)\\
\vdots\\
w_t(M-1)
\end{array}\right]
\\
\read_t
&= ~\left[ \begin{array}{ccc}
~~~r_{0}~~~ & ~~~\cdots~~~ & ~~~r_{N-1}~~~
\end{array}\right]
\\
\erase_t
&= ~\left[ \begin{array}{ccc}
~~~e_{0}~~~ & ~~~\cdots~~~ & ~~~e_{N-1}~~~
\end{array}\right]
\\
\add_t
&= ~\left[ \begin{array}{ccc}
~~~a_{0}~~~ & ~~~\cdots~~~ & ~~~a_{N-1}~~~
\end{array}\right]
\end{align}
$$
Hopefully that gives you an idea of what goes where and, at the same time, a clue as to what their purposes are. The general idea is to use $\weight_t$ as a focusing mechanism that selects a row, while $\read_t$, $\erase_t$ and $\add_t$ do their individual jobs element-wise on $\memory_t(i)$.
Here's where I find this interesting. If we were to write some sort of automaton that does this, we would perform the read/write at exactly one location. However, in order for the whole thing to be "end-to-end" differentiable, the authors make $\weight_t$ a *distribution* over positions, $\sum_i w_t(i) = 1$, so every location gets read and written a little, in proportion to its weight. This technique is used pretty consistently throughout the different mechanisms in the NTM.
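To make that concrete, here is a minimal numpy sketch of the read and erase/add operations (the eventual implementation would be in Theano, but the arithmetic is identical; the function names are mine). The erase-then-add update is the one from the paper: $\memory_t(i) = \memory_{t-1}(i)\left[\mathbf{1} - w_t(i)\erase_t\right] + w_t(i)\add_t$.

```python
import numpy as np

def read(memory, w):
    """Read head: r = sum_i w(i) * M(i), the expected row under the weighting w.
    memory: (M, N) array, w: (M,) distribution over rows."""
    return w @ memory                                 # shape (N,)

def write(memory, w, erase, add):
    """Write head: erase then add, both modulated by the weighting w.
    erase, add: (N,) vectors; erase entries should lie in (0, 1)."""
    memory = memory * (1.0 - np.outer(w, erase))      # M~(i) = M(i) * (1 - w(i) e)
    return memory + np.outer(w, add)                  #  M(i) = M~(i) + w(i) a
```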
Addressing
[Figure: screenshot from the paper, https://blog.wtf.sg/wp-content/uploads/2014/10/Screenshot-from-2014-10-30-125051.png]

Still on the topic of $\weight_t$, let me try and give a brief overview of what the authors are trying to achieve with their very… convoluted (forgive the pun) way of computing it at every time step.
In order for the controller to perform any *useful* action on $\memory_t$, it needs to take in an input from the input sequence and translate (by which I mean predict) that into some sort of key ($\key_t$) that it can look up. Merely picking out the value at that location may not be particularly useful on its own, since we could have just used the prediction directly. We might also need to shift ($\shift_t$) to a location before or after it to find something useful for the computation. Imagine a table where keys and values are stored in succession, $k_0,v_0,\cdots,k_N,v_N$: if we look up $k_i$ and find it at location $2i$, then $v_i$ sits at location $2i+1$, which requires shifting the address by 1.
Let’s now take a closer look at how this $\weight_t$ is computed at each time step.
Content-based addressing: using $\key_t$ and $\beta_t$ to get $\weight_t^c$
So, say our mysterious controller predicts a vector $\key_t$. Again, with the table analogy, we look through $\memory_t$ for the entry most similar to $\key_t$, but we have to do this *probabilistically*. The authors do this by computing a similarity measure (in this case cosine similarity) between $\key_t$ and every row of $\memory_t$, scaling the similarities by a key strength $\beta_t > 0$ that controls how sharply we focus on the best match, and then normalising by slapping a softmax over the scaled values. This gives us $\weight_t^c$, which is just an intermediate value before the final weight vector is computed.
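A numpy sketch of that step (again, the function name is mine; the real thing would be symbolic Theano code):

```python
import numpy as np

def content_weighting(memory, key, beta):
    """w^c = softmax(beta * cosine_similarity(M(i), key)) over the rows.
    memory: (M, N) array, key: (N,) vector, beta: positive scalar."""
    eps = 1e-8                                        # avoid division by zero
    sim = memory @ key / (np.linalg.norm(memory, axis=1)
                          * np.linalg.norm(key) + eps)
    scores = beta * sim
    scores = scores - scores.max()                    # numerically stable softmax
    w_c = np.exp(scores)
    return w_c / w_c.sum()
```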
Shifting: using $\shift_t$ to get $\weight_t$
Now that we have a weighting for content-based addressing, we need to allow the controller to decide whether to use it, or to just perform a shift of the weighting from the previous time step. Predictably, the authors achieve this with an interpolation gate $g_t \in (0,1)$ that forms an "expected" weighting,
$$\weight^g_t = g_t\,\weight^c_t + (1 - g_t)\,\weight_{t-1}$$
so $\weight^g_t$ falls back to the previous weighting as $g_t$ goes to 0.
Okay, so time to shift $\weight^g_t$ *probabilistically*. It goes without saying then that $\sum_i s_t(i) = 1$. The paper describes this shift as a circular convolution,
$$\widetilde{w}_t(i) = \sum^{M-1}_{j=0} w^g_t(j)\cdot s_t((i-j) \bmod M)$$
where the index arithmetic wraps around the $M$ memory locations.
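Interpolation and shift, as a direct numpy transcription of the two equations above (loopy and slow, but it makes the modulo indexing explicit):

```python
import numpy as np

def interpolate(w_prev, w_c, g):
    """w^g = g * w^c + (1 - g) * w_{t-1}, with the gate g in (0, 1)."""
    return g * w_c + (1.0 - g) * w_prev

def shift(w_g, s):
    """Circular convolution: w~(i) = sum_j w^g(j) * s((i - j) mod M)."""
    M = w_g.shape[0]
    w_tilde = np.zeros(M)
    for i in range(M):
        for j in range(M):
            w_tilde[i] += w_g[j] * s[(i - j) % M]
    return w_tilde
```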
There might be a simpler way to see how this works in vectorised form. We could use $\shift_t$ to describe a matrix $\Shift_t$,
$$\Shift_t =
\left[\begin{array}{ccccc}
s_t(0) & s_t(M-1) & \cdots & s_t(2) & s_t(1) \\
s_t(1) & s_t(0) & s_t(M-1) & \cdots & s_t(2) \\
\vdots & s_t(1) & s_t(0) & \ddots & \vdots \\
s_t(M-2) & & \ddots & \ddots & s_t(M-1) \\
s_t(M-1) & s_t(M-2) & \cdots & s_t(1) & s_t(0) \\
\end{array}\right]
$$
in which case the final computation would just be
$$ \widetilde{\weight}_t = \Shift_t\weight^g_t$$
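A small numpy sketch of $\Shift_t$, just to convince ourselves the circulant matrix reproduces the convolution (entry $(i,j)$ is $s_t((i-j) \bmod M)$):

```python
import numpy as np

def shift_matrix(s):
    """Circulant matrix S with S[i, j] = s((i - j) mod M)."""
    M = s.shape[0]
    idx = (np.arange(M)[:, None] - np.arange(M)[None, :]) % M
    return s[idx]

# shift_matrix(s) @ w_g gives the same w~ as the explicit double loop above
```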
As a final step, they raise each element of $\widetilde{\weight}_t$ to the power $\gamma_t \ge 1$ and renormalise to get the final $\weight_t$,
$$ w_t(i) = \frac{\widetilde{w}_t(i)^{\gamma_t}}{\sum_j \widetilde{w}_t(j)^{\gamma_t}} $$
which sharpens the weighting back towards a single location, since the shift would otherwise blur it a little at every time step.
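And the corresponding step in the numpy sketch:

```python
def sharpen(w_tilde, gamma):
    """w(i) = w~(i)^gamma / sum_j w~(j)^gamma, with gamma >= 1."""
    w = w_tilde ** gamma
    return w / w.sum()
```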
The Controller
[Figure: screenshot from the paper, https://blog.wtf.sg/wp-content/uploads/2014/10/Screenshot-from-2014-10-30-125035.png]
So what's this controller that keeps popping up? We can think of it as the CPU of the entire system, deciding what to read from and write to memory. In the paper, the authors experiment with both an LSTM network and a standard feed-forward network for this purpose.

More importantly for us, we need to figure out what the inputs and outputs of this network are in order to implement anything. So, to summarise the above, here are the inputs and outputs, along with whatever constraints are attached to them.
Inputs:
$$
\begin{align}
\mathbf{i}_t,\read_t &\in \mathbb{R}^N \\
\end{align}
$$
Outputs:
$$
\begin{align}
\erase_t,\add_t,\key_t &\in (0,1)^N \\
\shift_t &\in (0,1)^M, & \sum_i& s_t(i) = 1 \\
\beta_t &\in \mathbb{R}^+ & \gamma_t & \in\mathbb{R}^{\ge 1} \\
& & g_t & \in (0,1) \\
\end{align}
$$
At every time step, these are the quantities fed into the controller, and the outputs we then use to manipulate $\memory_t$.
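One practical detail the paper leaves open is how to squeeze the controller's raw output into these ranges. A minimal sketch, assuming we simply slice a single output vector and apply an activation per piece (the slicing layout, the softplus choices for $\beta_t$ and $\gamma_t$, and the function names are my own assumptions, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def split_controller_output(h, N, M):
    """Slice a raw controller output h (length 3N + M + 3) into head parameters,
    applying activations that enforce the constraints listed above."""
    erase = sigmoid(h[0:N])                            # e_t in (0, 1)^N
    add   = sigmoid(h[N:2 * N])                        # a_t in (0, 1)^N
    key   = sigmoid(h[2 * N:3 * N])                    # k_t in (0, 1)^N
    s     = softmax(h[3 * N:3 * N + M])                # s_t in (0, 1)^M, sums to 1
    beta  = np.log(1 + np.exp(h[3 * N + M]))           # softplus -> beta_t > 0
    gamma = 1 + np.log(1 + np.exp(h[3 * N + M + 1]))   # gamma_t >= 1
    g     = sigmoid(h[3 * N + M + 2])                  # gate g_t in (0, 1)
    return erase, add, key, s, beta, gamma, g
```

Any other squashing functions that satisfy the same constraints would do just as well; the paper doesn't prescribe these.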
And I think that’s all we need! I’ll look at implementing this over the weekend, and we’ll see if we can achieve similar results.