Connectionist Temporal Classification (CTC) with Theano
This is the first time I'm presenting code I've written in an IPython notebook. The style's different, but I think I'll switch to this method of presentation for code-intensive posts from now on. A nifty little tool that makes doing this convenient is ipy2wp, which uses WordPress' XML-RPC interface to post the HTML directly to the platform.
In any case, I've started working with the NUS School of Computing speech recognition group, and they've been using deep neural networks to classify audio frames into phonemes. This requires a preprocessing step that aligns the audio frames to phonemes, in order to reduce the task to a simple classification problem.
CTC describes a way to compute the probability of a sequence of phonemes for a sequence of audio frames, accounting for all possible alignments. We can then define an objective function to maximise the probability of the phoneme sequence given the audio frame sequence from training data.
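In the paper's notation, with [latex]\mathcal{B}[/latex] the map that collapses a frame-level path [latex]\pi[/latex] (removing repeats and blanks) down to a labelling [latex]\mathbf{l}[/latex], this is:

[latex]p(\mathbf{l}|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} p(\pi|\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} \prod_{t=1}^{T} y^t_{\pi_t}[/latex]

and training maximises [latex]\ln p(\mathbf{l}|\mathbf{x})[/latex] over the training data.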
For me, the interesting part of implementing this in Theano was that the recurrence relation is naturally described with if-else statements, which are awkward to express symbolically. I eventually encoded it as a matrix instead, which you'll see below.
The method comes from this paper:

Graves, Alex, et al. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.
```python
import theano
import theano.tensor as T
import numpy as np
from theano_toolkit import utils as U
from theano_toolkit import updates
```
We'll create a simple recurrent neural network, and then construct the CTC cost function using the given predictions. The following is pretty standard code for constructing a network with recurrent connections in its hidden layer.
```python
def build_rnn(hidden_inputs, W_hidden_hidden, b_hidden, initial_hidden):
    def step(input_curr, hidden_prev):
        # Recurrent update: previous hidden state fed back through
        # W_hidden_hidden, plus the (pre-projected) input for this frame.
        hidden = T.tanh(
            T.dot(hidden_prev, W_hidden_hidden) +
            input_curr +
            b_hidden
        )
        return hidden

    hidden, _ = theano.scan(
        step,
        sequences=[hidden_inputs],
        outputs_info=[initial_hidden]
    )
    return hidden

def build_model(input_size, hidden_size, output_size):
    X = T.matrix('X')
    W_input_hidden  = U.create_shared(U.initial_weights(input_size, hidden_size))
    W_hidden_hidden = U.create_shared(U.initial_weights(hidden_size, hidden_size))
    W_hidden_output = U.create_shared(U.initial_weights(hidden_size, output_size))
    b_hidden = U.create_shared(U.initial_weights(hidden_size))
    i_hidden = U.create_shared(U.initial_weights(hidden_size))  # initial hidden state
    b_output = U.create_shared(U.initial_weights(output_size))

    params = [W_input_hidden, W_hidden_hidden, W_hidden_output,
              b_hidden, i_hidden, b_output]

    # The input projection T.dot(X, W_input_hidden) is computed for all
    # frames at once, outside the scan loop.
    hidden  = build_rnn(T.dot(X, W_input_hidden), W_hidden_hidden, b_hidden, i_hidden)
    predict = T.nnet.softmax(T.dot(hidden, W_hidden_output) + b_output)
    return X, predict, params
```
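A quick sketch of how this is used (the dimensions and random frames here are arbitrary, just for illustration): the model maps a matrix of frames to one softmax distribution per frame.

```python
# Hypothetical shape check: 20 frames of 10-dimensional features in,
# 20 rows of softmax probabilities over 10 output symbols out.
X, predict, params = build_model(input_size=10, hidden_size=10, output_size=10)
f_predict = theano.function(inputs=[X], outputs=predict)
probs = f_predict(np.random.randn(20, 10).astype(theano.config.floatX))
print(probs.shape)  # (20, 10)
```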
CTC Forward-Backward Algorithm
The paper outlines a dynamic programming algorithm used to compute the sum of probabilities over all paths corresponding to a given labelling. This is troublesome to implement in Theano, because the recurrence branches: the update is slightly odd and differs at alternate positions of the blank-interleaved label sequence. I've decided to encode the recurrence relation as a matrix instead, which makes for much cleaner code.

We can recreate the alternating pattern by summing slices of identity matrices of different sizes to get the recurrence relation we want. So, given a size-11 input:
```python
size = 11
big_I = np.eye(size+2)
forward = np.eye(size) + big_I[2:,1:-1] + big_I[2:,:-2] * (np.arange(size) % 2)
backward = forward.T
forward
```
<div class="output_text output_subarea output_pyout">
<pre>
array([[ 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [ 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.], [ 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0.], [ 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0.], [ 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.], [ 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0.], [ 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], [ 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0.], [ 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0.], [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.], [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
Note that this gives us the relation we want; it's best visualised against Figure 3 of the paper. (One caveat: this matrix always allows the skip transition at label positions, whereas the paper disallows it when the same label repeats, so this construction assumes no immediately repeated labels.) We'll now build the matrix symbolically, so that it is constructed according to the input size.
```python
def recurrence_relation(size):
    big_I = T.eye(size+2)
    return T.eye(size) + big_I[2:,1:-1] + big_I[2:,:-2] * (T.arange(size) % 2)
```
`path_probs` performs the forward pass, calculating the probabilities of all possible prefix paths using `theano.scan`. You can see the recurrence relation matrix in action here: it completely does away with the need for pesky if-else statements. It should be noted that the computation is initialised with a [latex][1,0,\cdots,0][/latex] vector, so the two initial probabilities described in the paper are produced by the first application of our recurrence relation.
```python
def path_probs(predict, Y):
    # P[t,s]: probability of symbol Y[s] at time t
    P = predict[:, Y]
    rr = recurrence_relation(Y.shape[0])

    def step(p_curr, p_prev):
        # Propagate prefix probabilities through the recurrence matrix,
        # then weight by this frame's symbol probabilities.
        return p_curr * T.dot(p_prev, rr)

    probs, _ = theano.scan(
        step,
        sequences=[P],
        outputs_info=[T.eye(Y.shape[0])[0]]  # [1, 0, ..., 0]
    )
    return probs
```
So in order to compute both the forward and backward passes, we simply reverse the inputs to get the computation we want, and then reverse the result back. We can then perform the computation in equation (14) of the paper to get the probability over all possible paths, and it is this log-probability that we maximise in order to train the network.
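For reference, equation (14) of the paper states that, for any time step [latex]t[/latex] (where [latex]\mathbf{l}'[/latex] is the blank-interleaved labelling):

[latex]p(\mathbf{l}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{l}'|} \frac{\alpha_t(s)\beta_t(s)}{y^t_{\mathbf{l}'_s}}[/latex]

Since this identity holds at every [latex]t[/latex], summing it over all time steps (as the code below does) only scales the probability by the number of frames, which shifts the log-probability by a constant and leaves the gradients with respect to the parameters unchanged.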
```python
def ctc_cost(predict, Y):
    forward_probs  = path_probs(predict, Y)
    # Running the same forward pass on the reversed sequences and
    # reversing the result gives the backward probabilities.
    backward_probs = path_probs(predict[::-1], Y[::-1])[::-1, ::-1]
    # Equation (14): alpha_t(s) * beta_t(s) / y^t_{l'_s}
    probs = forward_probs * backward_probs / predict[:, Y]
    total_prob = T.sum(probs)
    return -T.log(total_prob)
```
Finally, just to be sure, let's check that this compute graph plays nicely with `T.grad`.
```python
Y = T.ivector('Y')
X, predict, params = build_model(10, 10, 10)
cost = ctc_cost(predict, Y)

# Differentiable
grad = T.grad(cost, wrt=params)

f = theano.function(
    inputs=[X, Y],
    outputs=cost
)

f(np.eye(11)[:, :10], np.arange(10, dtype=np.int32))
```
<div class="output_text output_subarea output_pyout">
<pre>
array(23.35882877759107)
Looks like everything's fine. Train away!
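From here, a training step is just a `theano.function` with parameter updates. The imports above pull in theano_toolkit's `updates` helpers for this, but since their exact API isn't shown in this post, the sketch below wires up plain SGD updates by hand, reusing the gradients computed above:

```python
# Minimal (hypothetical) SGD training step; the learning rate is an
# arbitrary choice, cast so the updates keep the parameters' dtype.
lr = np.asarray(0.001, dtype=theano.config.floatX)
train = theano.function(
    inputs=[X, Y],
    outputs=cost,
    updates=[(p, p - lr * g) for p, g in zip(params, grad)]
)
# Each call updates the weights in place and returns the current cost:
# train(frames_matrix, blank_interleaved_labels)
```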
```
@misc{tan2014-10-05,
  title        = {Connectionist Temporal Classification (CTC) with Theano},
  author       = {Tan, Shawn},
  howpublished = {\url{https://blog.wtf.sg/2014/10/06/connectionist-temporal-classification-ctc-with-theano/}},
  year         = {2014}
}
```