## Constraining Hidden Layers for Interpretability (eventually, hopefully…)

I haven’t written much this past year, so as a parting post for 2015, I’ll talk a little bit about the poster I presented at ASRU 2015. The bulk of the material is in the paper, and I’m still kind of unsure about the legality of putting stuff that’s *in* the paper on this blog, so I’ll talk about the other things that didn’t make it in.

Firstly, the initial idea of constraining hidden layers to improve their interpretability came from my supervisor, Sim Khe Chai, and his ex-PhD advisor, Mark Gales. It was mainly to see if we could make the activations in the hidden layers a little bit more structured; perhaps have similarly functioning units sit “close together” in some way.

The first thing we did was to assume every hidden layer’s neurons (I know people will hate the terminology here, but… just bear with me) are arranged arbitrarily on a 2D grid. Since we use 1024 units in our hidden layers, it’s easy to just define a 32 x 32 grid.

We first tried applying an additional penalty term that imposes a kind of adjacency constraint: neighbouring cells on the grid get a penalty if they fire differently, where $c_i$ is the activation of cell $i$:

$$\sum_{(i,j) \in \text{Neighbours}} (c_i - c_j)^2$$

This achieved the effect we wanted, but didn’t really provide much by way of “interpreting” anything.
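For the curious, here’s a minimal NumPy sketch of what that adjacency penalty could look like. It’s not the code from the paper: I’m assuming 4-connected neighbours (up/down/left/right, each pair counted once) and a row-major reshape of the 1024 activations onto the 32 x 32 grid, and the name `adjacency_penalty` is just made up for illustration.

```python
import numpy as np

def adjacency_penalty(c, grid_shape=(32, 32)):
    """Sum of squared differences between neighbouring cells on the grid.

    Assumptions (not from the paper): 4-connected neighbours, each
    neighbouring pair counted once, activations laid out row-major.
    """
    g = np.asarray(c, dtype=float).reshape(grid_shape)
    # Differences between each cell and the cell directly below it.
    vertical = np.sum((g[1:, :] - g[:-1, :]) ** 2)
    # Differences between each cell and the cell directly to its right.
    horizontal = np.sum((g[:, 1:] - g[:, :-1]) ** 2)
    return vertical + horizontal

rng = np.random.default_rng(0)
noisy = rng.random(1024)      # unstructured activations
smooth = np.full(1024, 0.5)   # every cell fires identically

print("noisy :", adjacency_penalty(noisy))
print("smooth:", adjacency_penalty(smooth))
```

A layer where every cell fires identically pays no penalty, while unstructured activations pay a lot, which is what pushes neighbouring units towards behaving alike.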

We eventually settled on “forcing” the regions to appear where we want, by defining regions as Gaussian-shaped surfaces, and then using the KL-divergence between the “ideal” Gaussian surface and the actual surface as the penalty on the hidden layers. As in the previous case, this KL-divergence term is added as a penalty during training.
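Roughly, the idea can be sketched like this in NumPy. Treat it as an illustration under a few assumptions rather than what’s in the paper: I’m assuming the activations are non-negative (e.g. sigmoid outputs) and get normalised to sum to one so they can be compared to the Gaussian surface as a distribution, I’ve picked the direction KL(ideal ‖ actual), and the names `gaussian_surface`, `region_penalty` and the value of `sigma` are all hypothetical.

```python
import numpy as np

def gaussian_surface(centre, sigma, grid_shape=(32, 32)):
    """Normalised 2D Gaussian bump over the grid, centred at `centre` (row, col)."""
    rows, cols = np.meshgrid(
        np.arange(grid_shape[0]), np.arange(grid_shape[1]), indexing="ij"
    )
    sq_dist = (rows - centre[0]) ** 2 + (cols - centre[1]) ** 2
    surface = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return surface / surface.sum()

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two surfaces; values are clipped to avoid log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def region_penalty(activations, centre, sigma=3.0, grid_shape=(32, 32), eps=1e-8):
    """Penalty for a layer whose active region should sit around `centre`.

    Assumptions: non-negative activations, normalised to sum to one,
    and the KL direction is ideal-to-actual.
    """
    a = np.asarray(activations, dtype=float).reshape(grid_shape)
    actual = a / (a.sum() + eps)
    ideal = gaussian_surface(centre, sigma, grid_shape)
    return kl_divergence(ideal, actual)

# Activations already shaped like the target region vs. a bump in the wrong place.
on_target = gaussian_surface((10, 10), 3.0).ravel()
off_target = gaussian_surface((25, 25), 3.0).ravel()
print("on target :", region_penalty(on_target, centre=(10, 10)))
print("off target:", region_penalty(off_target, centre=(10, 10)))
```

The penalty is near zero when the layer lights up where the “ideal” surface says it should, and grows as the active region drifts elsewhere on the grid.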

There’s a video here of the hidden layer as it reads in an input sequence of fMLLR frames:

Since the way I calculated the Gaussian surfaces in Theano required defining a grid of coordinates ((0,0),(0,1),(0,2)…), it wasn’t too hard to see if I could initialise them randomly and train the entire system from scratch. The hope was to use a different *tuned* grid for each layer, and then we’d be able to look at the 2D coordinates learnt, and see how well the network discriminates the different classes as it goes up the hidden layers.

Unfortunately, there were problems with getting this to work right. The coordinates always wound up close to the centre of the grid, and so did not differ much from layer to layer. I did have slight success using a symmetric version of the KL-divergence penalty (basically adding both terms, $D_{\mathrm{KL}}(P\|Q) + D_{\mathrm{KL}}(Q\|P)$), but when visualising the activations later, everything was still too close to the centre, without the clearly distinguishable regions seen in the video.
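In case the symmetric version isn’t obvious, here’s a small self-contained sketch. KL-divergence isn’t symmetric, so swapping the arguments generally gives a different number; the symmetric penalty just adds the two directions together. The distributions `p` and `q` below are made-up toy values, and `symmetric_kl` is a hypothetical helper name.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q), with clipping to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def symmetric_kl(p, q, eps=1e-8):
    """D_KL(P || Q) + D_KL(Q || P)."""
    return kl_divergence(p, q, eps) + kl_divergence(q, p, eps)

# Toy distributions: one peaked, one uniform.
p = np.array([0.9, 0.05, 0.05])
q = np.array([1.0, 1.0, 1.0]) / 3.0

print("KL(p||q) :", kl_divergence(p, q))
print("KL(q||p) :", kl_divergence(q, p))
print("symmetric:", symmetric_kl(p, q))
```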

The code for generating the Gaussian surfaces and constraints is here, but it’s quite a mess, and you’ll have to tease out what I’m using and what I’m not.

Feel free to e-mail or comment with questions!