## Learning Gaussian Feature Extractors

While playing around with the MNIST dataset and the example code, I tried to visualise the weights of the connections from the input to the hidden layer. These can be thought of as feature extractors of the input.

If you’ve trained a denoising auto-encoder, you typically get a plot that looks something like this:

(taken from http://ufldl.stanford.edu)

What I noticed is that the white pixels tend to cluster together into strokes and blobs. So I wondered whether each blob could be described by just the parameters of a 2D Gaussian function, instead of by one weight per pixel.

As it turns out, it can be done pretty easily, but I’m not sure if there’s any useful application for this.

$$

\newcommand{\precisionmat}{\mathbf{B}}

\newcommand{\pixel}{\mathbf{p}}

$$

We want to learn feature extractors for the input in terms of Gaussian filters with varying widths, heights and rotations.

Let the Gaussian function be,

$$ g(\pixel;\precisionmat,\boldsymbol{\mu}) = \exp\left(-{\left\| \precisionmat (\pixel - \boldsymbol{\mu}) \right\|}^2\right) $$

Then we define a weight matrix $\mathbf{W}$ between layers such that,

$$ \underbrace{\mathbf{W}}_{(n,h)} = \underbrace{\mathbf{G}}_{(n,k)} \underbrace{\mathbf{M}}_{(k,h)}$$

where $\mathbf{M}$ is a standard transformation matrix freely tuned by gradient descent, and $\mathbf{G}$ is a matrix with each column representing a Gaussian filter:

$$(\mathbf{G})_{ij} = g(\left[\text{row}(i),\text{col}(i)\right];\precisionmat_j,\boldsymbol{\mu}_j)$$

where $\text{row}(i)$ and $\text{col}(i)$ give the row and column of input $i$. As a result, the free variables for tuning the matrix $\mathbf{G}$ are the matrices $\precisionmat_j$ and the centres $\boldsymbol{\mu}_j$.

In this way, $\mathbf{W}$ represents a layer in which features are captured in the form of linear combinations of Gaussian-shaped feature extractors from the input, which is assumed to have some form of two-dimensional topology.

So if we were to visualise a column from $\mathbf{G}$ in its 2D representation:

<img class="alignnone" src="https://blog.wtf.sg/wp-content/uploads/2014/12/wpid-Learning_Gaussian_Feature_Extractors1.png" alt="" width="505" height="497" />
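The construction of $\mathbf{G}$ and $\mathbf{W}$ described above can be sketched in NumPy as follows. This is only a minimal illustration, not the code used for the plots: the function name `gaussian_filter_bank`, the random initialisation, and the sizes $k = 40$, $h = 500$ on a 28 by 28 input are assumptions made for the example.

```python
import numpy as np

def gaussian_filter_bank(B, mu, height=28, width=28):
    """Build G with shape (height*width, k) from B (k, 2, 2) and mu (k, 2).

    Hypothetical helper: the name and argument layout are assumptions.
    """
    # [row(i), col(i)] for every flattened input index i.
    rows, cols = np.divmod(np.arange(height * width), width)
    pixels = np.stack([rows, cols], axis=1).astype(float)   # (n, 2)
    # Offsets p - mu_j for every pixel i and component j.
    diff = pixels[:, None, :] - mu[None, :, :]               # (n, k, 2)
    # Apply B_j to each offset: B_j (p - mu_j).
    proj = np.einsum('kab,nkb->nka', B, diff)                # (n, k, 2)
    # g = exp(-||B_j (p - mu_j)||^2)
    return np.exp(-np.sum(proj ** 2, axis=-1))               # (n, k)

rng = np.random.default_rng(0)
k, h = 40, 500                                # assumed sizes
B = 0.2 * rng.standard_normal((k, 2, 2))      # assumed initialisation
mu = rng.uniform(0, 28, size=(k, 2))          # centres inside the image
M = 0.01 * rng.standard_normal((k, h))        # freely tuned mixing matrix

G = gaussian_filter_bank(B, mu)               # (784, 40)
W = G @ M                                     # (784, 500)
```

In a real model $\precisionmat_j$, $\boldsymbol{\mu}_j$ and $\mathbf{M}$ would all be updated by gradient descent, with $\mathbf{G}$ recomputed from them at each step.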

We can then use this matrix $\mathbf{W}$ as we would in an autoencoder. For the reconstruction step, I use the transpose of $\mathbf{W}$ (tied weights). This means that whatever reconstruction it creates has to be made up of the $k$ Gaussian components that it has learnt. It learns to do this pretty well. Here is an example:

<img class="alignnone" src="https://blog.wtf.sg/wp-content/uploads/2014/12/wpid-Learning_Gaussian_Feature_Extractors2.png" alt="" width="729" height="386" />
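To see why the reconstruction is forced to be built from the $k$ Gaussian components, here is a rough sketch of a forward pass that reuses $\mathbf{W}^\top$ for decoding. The sigmoid activations, the absence of biases, and the random stand-in for $\mathbf{G}$ are all assumptions for illustration, since the original setup does not spell them out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, k, h = 784, 40, 500                     # assumed sizes

G = rng.uniform(0, 1, size=(n, k))         # stand-in for the Gaussian filter bank
M = 0.01 * rng.standard_normal((k, h))
W = G @ M                                  # (n, h)

x = rng.uniform(0, 1, size=(1, n))         # one flattened 28x28 input

# Encode with W.
hidden = sigmoid(x @ W)                    # (1, h)

# Decode with W^T = M^T G^T, done in two steps to expose the structure:
coeffs = hidden @ M.T                      # (1, k): one coefficient per Gaussian
pre_activation = coeffs @ G.T              # (1, n): weighted sum of the k components
reconstruction = sigmoid(pre_activation)   # assumed output nonlinearity
```

Before the output nonlinearity, the reconstruction is exactly a weighted sum of the columns of $\mathbf{G}$, with `coeffs` holding the weights.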

We can also visualise all of the columns in the $\mathbf{G}$ matrix, and see what components it has learnt:

<img class="alignnone" src="https://blog.wtf.sg/wp-content/uploads/2014/12/wpid-Learning_Gaussian_Feature_Extractors3.png" alt="" width="741" height="474" />

To make the plot, each column of $\mathbf{G}$ is drawn as a 28 by 28 square. The 40 squares are then tiled into a single image of 5 rows with 8 squares per row.
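The tiling itself is just a reshape and a transpose. A possible sketch, assuming $\mathbf{G}$ is stored as a `(784, 40)` NumPy array (the helper name `tile_filters` and the random stand-in for $\mathbf{G}$ are made up for this example):

```python
import numpy as np

def tile_filters(G, grid_rows=5, grid_cols=8, height=28, width=28):
    """Arrange the columns of G into a grid_rows x grid_cols mosaic."""
    n, k = G.shape
    if n != height * width or k != grid_rows * grid_cols:
        raise ValueError("G does not match the requested grid")
    # (k, n) -> (grid_rows, grid_cols, height, width)
    tiles = G.T.reshape(grid_rows, grid_cols, height, width)
    # Interleave so that each tile's pixel rows sit inside its grid row.
    mosaic = tiles.transpose(0, 2, 1, 3)
    return mosaic.reshape(grid_rows * height, grid_cols * width)

rng = np.random.default_rng(0)
G = rng.uniform(0, 1, size=(784, 40))   # stand-in for the learnt filter bank
image = tile_filters(G)                 # (140, 224)
```

The resulting array can be passed straight to `matplotlib.pyplot.imshow`.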

Notice that there are long blobs learnt for the right, top and bottom parts of the image. This means it’s noticed that there are big continuous regions of activations, or lack of activations, in those areas, and has dedicated a component to modelling each of them.

I’m not sure what this could be useful for, but at the moment it has reduced the original model, whose first weight matrix has 784 x 500 = 392,000 parameters, to 40 x 6 + 40 x 500 = 20,240 parameters (each Gaussian component has 6 parameters: the four entries of $\precisionmat_j$ and the two entries of $\boldsymbol{\mu}_j$). That is far fewer parameters than the original.
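As a quick check of the arithmetic, using the sizes quoted above (the variable names are just for illustration):

```python
n, h, k = 784, 500, 40          # input pixels, hidden units, Gaussian components
params_per_gaussian = 4 + 2     # 2x2 matrix B_j plus 2D centre mu_j

dense = n * h                                 # original first weight matrix
factored = k * params_per_gaussian + k * h    # G's parameters plus M

print(dense, factored, round(dense / factored, 1))
```

So the factored layer uses roughly a nineteenth of the parameters of the dense one.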

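One other thing to note is that we can now scale up the size of the input just by scaling the $\precisionmat$ and $\boldsymbol{\mu}$ appropriately. Each filter is a continuous function of pixel position rather than a set of per-pixel weights, so the same components can be rendered on a larger grid.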
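As a small sanity check of that idea, the sketch below renders one filter at 28 by 28 and again at 56 by 56, multiplying $\boldsymbol{\mu}$ by the scale factor and dividing $\precisionmat$ by it. The particular values of `B` and `mu` are arbitrary, and pixel coordinates are assumed to be plain integer indices starting at 0.

```python
import numpy as np

def gaussian_filter(B, mu, height, width):
    """Render g(p; B, mu) = exp(-||B (p - mu)||^2) on a height x width grid."""
    rows, cols = np.mgrid[0:height, 0:width]
    diff = np.stack([rows - mu[0], cols - mu[1]], axis=-1)   # (H, W, 2)
    proj = diff @ B.T                                        # B applied to each offset
    return np.exp(-np.sum(proj ** 2, axis=-1))

B = np.array([[0.30, 0.10],
              [0.00, 0.15]])      # arbitrary example values
mu = np.array([10.0, 17.0])
scale = 2

small = gaussian_filter(B, mu, 28, 28)
# Doubling the resolution: centres move outwards, widths double.
large = gaussian_filter(B / scale, mu * scale, 28 * scale, 28 * scale)

# Every second pixel of the large filter matches the small one.
print(np.allclose(large[::scale, ::scale], small))
```

This works because $\frac{1}{s}\precisionmat\,(s\pixel - s\boldsymbol{\mu}) = \precisionmat(\pixel - \boldsymbol{\mu})$, so the rescaled filter takes the same values at the rescaled pixel positions.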