Connectionist Temporal Classification (CTC) with Theano

This is the first time I'm presenting code I've written in an IPython notebook. The style is different, but I think I'll switch permanently to this method of presentation for code-intensive posts. A nifty little tool that makes doing this convenient is ipy2wp, which uses WordPress' XML-RPC interface to post the HTML directly to the platform.

In any case, I've started working with the NUS School of Computing speech recognition group, who have been using deep neural networks to classify audio frames into phonemes. This requires a preprocessing step that aligns the audio frames to phonemes, reducing the task to a simple classification problem.

CTC describes a way to compute the probability of a sequence of phonemes for a sequence of audio frames, accounting for all possible alignments. We can then define an objective function that maximises the probability of the correct phoneme sequence given the audio frame sequence in the training data.
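Concretely, the maximum-likelihood objective from the paper minimises the negative log-probability of the correct labellings over the training set $S$:

$$O^{ML} = -\sum_{(\mathbf{x},\mathbf{z}) \in S} \ln p(\mathbf{z}|\mathbf{x})$$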

For me, the interesting part of implementing this in Theano was the difficulty of describing the recurrence relation, which is usually written in terms of if-else statements. I eventually constructed a matrix for it, as you'll see below.

The code is below; the relevant paper is:

Graves, Alex, et al. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.

In [9]:
import theano
import theano.tensor as T
import numpy         as np
from theano_toolkit import utils as U
from theano_toolkit import updates

We'll create a simple recurrent neural network, then construct the CTC cost function from its predictions. The following is pretty standard code for a network with recurrent connections in its hidden layer.

In [10]:
def build_rnn(hidden_inputs,W_hidden_hidden,b_hidden,initial_hidden):
    def step(input_curr,hidden_prev):
        hidden = T.tanh(
                T.dot(hidden_prev,W_hidden_hidden) +\
                input_curr +\
                b_hidden
            )
        return hidden
    hidden,_ = theano.scan(
        step,
        sequences = [hidden_inputs],
        outputs_info = [initial_hidden]
    )
    return hidden



def build_model(input_size,hidden_size,output_size):
    X = T.matrix('X')
    W_input_hidden  = U.create_shared(U.initial_weights(input_size,hidden_size))
    W_hidden_hidden = U.create_shared(U.initial_weights(hidden_size,hidden_size))
    W_hidden_output = U.create_shared(U.initial_weights(hidden_size,output_size))
    b_hidden = U.create_shared(U.initial_weights(hidden_size))
    i_hidden = U.create_shared(U.initial_weights(hidden_size))
    b_output = U.create_shared(U.initial_weights(output_size))
    
    params = [W_input_hidden,W_hidden_hidden,W_hidden_output,b_hidden,i_hidden,b_output]
    
    hidden = build_rnn(T.dot(X,W_input_hidden),W_hidden_hidden,b_hidden,i_hidden)
    
    predict = T.nnet.softmax(T.dot(hidden,W_hidden_output) + b_output)
    return X,predict,params

CTC Forward-Backward Algorithm

The paper outlines a dynamic programming algorithm for computing the sum of probabilities over all paths corresponding to a given labelling. This is troublesome to implement directly in Theano, so I've used a matrix to define the recurrence relations, which makes for much cleaner code.

The computation is slightly odd: the recurrence differs at alternating positions of the blank-extended label sequence. We can recreate this by summing slices of identity matrices of different sizes to get the recurrence relation we want. So, for a label sequence of size 11:

In [11]:
size = 11
big_I = np.eye(size+2)
forward = np.eye(size) + big_I[2:,1:-1] + big_I[2:,:-2] * (np.arange(size) % 2)
backward = forward.T
forward
Out[11]:
array([[ 1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.]])

Note that this gives us the relation we want; it's best visualised against Figure 3 of the paper. We'll build this symbolically so that the matrix is constructed according to the input size.

In [12]:
def recurrence_relation(size):
    big_I = T.eye(size+2)
    return T.eye(size) + big_I[2:,1:-1] + big_I[2:,:-2] * (T.arange(size) % 2)
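
As a quick sanity check, we can confirm on the numpy forward matrix built above that right-multiplying a vector of previous probabilities sums each position with its allowed predecessors (the values in p_prev below are purely illustrative stand-ins):

p_prev = np.arange(1.0, size + 1)  # stand-in for the previous step's path probabilities
summed = np.dot(p_prev, forward)
assert summed[0] == p_prev[0]                          # first position: only from itself
assert summed[1] == p_prev[0] + p_prev[1]              # from itself or the position before
assert summed[3] == p_prev[1] + p_prev[2] + p_prev[3]  # label positions may also skip over a blank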

path_probs performs a forward pass, calculating the probabilities of all possible prefix paths using theano.scan. You can see the recurrence relation matrix in action here; it completely does away with the need for pesky if-else statements. Note that the computation is initialised with a $[1,0,\cdots,0]$ vector, so the two initial probabilities described in the paper are produced by the first application of our recurrence relation.

In [13]:
def path_probs(predict,Y):
    P = predict[:,Y]   # probability of each label in Y at each time step
    rr = recurrence_relation(Y.shape[0])
    def step(p_curr,p_prev):
        # dot(p_prev,rr) sums each position with its allowed predecessors
        return p_curr * T.dot(p_prev,rr)
    probs,_ = theano.scan(
            step,
            sequences = [P],
            outputs_info = [T.eye(Y.shape[0])[0]]   # start from [1,0,...,0]
        )
    return probs

So in order to compute the backward pass as well, we simply reverse the inputs to get the computation we want, and then reverse the result back. We can then perform the computation in equation (14) of the paper to get the probability over all possible paths, and we maximise this log-probability to train the network.
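For reference, equation (14) expresses the total labelling probability at any time step $t$ in terms of the forward and backward variables:

$$p(\mathbf{l}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{l}'|} \frac{\alpha_t(s)\beta_t(s)}{y^t_{\mathbf{l}'_s}}$$

The code below sums this quantity over every time step as well, which just scales the total by the number of frames; that's a constant factor with respect to the parameters, so the gradient is unaffected.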

In [14]:
def ctc_cost(predict,Y):
    forward_probs  = path_probs(predict,Y)
    # run the same forward pass on the reversed inputs, then reverse the result
    backward_probs = path_probs(predict[::-1],Y[::-1])[::-1,::-1]
    # equation (14): alpha * beta / y, at every time step and position
    probs = forward_probs * backward_probs / predict[:,Y]
    total_prob = T.sum(probs)
    return -T.log(total_prob)

Finally, just to be sure, let's check if this compute graph plays nicely with T.grad.

In [15]:
Y = T.ivector('Y')
X,predict,params = build_model(10,10,10)
cost = ctc_cost(predict,Y)

# Differentiable
grad = T.grad(cost,wrt=params)

f = theano.function(
        inputs = [X,Y],
        outputs = cost
    )

f(np.eye(11)[:,:10],np.arange(10,dtype=np.int32))
Out[15]:
array(23.35882877759107)

Looks like everything's fine. Train away!
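
From here, a minimal training step might look like the following sketch. It just wires the gradients above into plain stochastic gradient descent; the learning rate, and the choice of vanilla SGD over the fancier rules in the imported updates module, are illustrative assumptions rather than anything fixed by this post:

learning_rate = 0.001  # illustrative value
sgd_updates = [ (p, p - learning_rate * g) for p,g in zip(params,grad) ]
train = theano.function(
        inputs  = [X,Y],
        outputs = cost,
        updates = sgd_updates
    )
train(np.eye(11)[:,:10],np.arange(10,dtype=np.int32))  # one gradient step on the toy input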


Comments

  1. This is a slick Theano implementation of CTC, but I'm running into underflow issues for any input sequence longer than about 100 time steps. I'm thinking there's some way to exponentiate the probabilities and take the log later, but I haven't figured it out yet. I'm wondering if you know of a solution to this that you could send or post?

    BTW… this issue seems to be coming from the theano.scan loop in path_probs

  2. I created the recurrence matrix in numpy and modified it to handle the case where the same phone occurs next to itself, though I have to loop over the sequence length :(. I then pass this matrix in to the computation.

    for i in range(1, len(labels)-3):
        if labels[i] == 3:
            continue
        if labels[i] == labels[i+2]:
            relation[i, i+2] = 0

    Hope this solves the problem of adding the forward probabilities twice.

    • There's probably a way to avoid constructing a new matrix on the CPU side and passing it over for every new sequence. In any case, I don't really think the recurrence matrix is the way to go once huge probability distributions are involved: its sparsity isn't exploited, so the computation can be quite expensive.

