Learning Gaussian Feature Extractors
While playing around with the MNIST dataset and the example code, I tried to visualise the weights of the connections from the input to the hidden layer. These can be thought of as feature extractors of the input.
If you’ve trained a denoising auto-encoder, you typically get a plot that looks something like this
(taken from http://ufldl.stanford.edu)
What I noticed is that there are strokes and blobs of white pixels clustered together. So I wondered whether the parameters needed to describe these blobs could be reduced to the parameters of a 2D Gaussian function.
As it turns out, it can be done pretty easily, but I’m not sure if there’s any useful application for this.
$$
\newcommand{\precisionmat}{\mathbf{B}}
\newcommand{\pixel}{\mathbf{p}}
$$
We want to learn feature extractors for the input in terms of Gaussian filters with varying widths, heights and rotations.
Let the Gaussian function be

$$ g(\pixel;\precisionmat,\boldsymbol{\mu}) = \exp\left(-{\left\| \precisionmat (\pixel - \boldsymbol{\mu}) \right\|}^2\right) $$
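As a minimal sketch of how this function might be evaluated (the NumPy implementation and names here are my own, not from the original code), with $\pixel$ and $\boldsymbol{\mu}$ as 2D pixel coordinates and $\precisionmat$ as a $2 \times 2$ matrix:

```python
import numpy as np

def gaussian_filter_value(p, B, mu):
    """g(p; B, mu) = exp(-||B (p - mu)||^2) for a single pixel coordinate p."""
    d = B @ (np.asarray(p, dtype=float) - mu)   # B (p - mu)
    return np.exp(-d @ d)                       # exp of the negative squared norm

# Example: a blob centred at (14, 14) with an axis-aligned precision matrix
B  = np.array([[0.3, 0.0],
               [0.0, 0.1]])
mu = np.array([14.0, 14.0])
print(gaussian_filter_value((14, 20), B, mu))
```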
Then we define a weight matrix $\mathbf{W}$ between layers such that

$$ \underbrace{\mathbf{W}}_{(n,h)} = \underbrace{\mathbf{G}}_{(n,k)} \underbrace{\mathbf{M}}_{(k,h)} $$
where $\mathbf{M}$ is a standard transformation matrix freely tuned by gradient descent, and $\mathbf{G}$ is a matrix with each column representing a Gaussian filter:

$$(\mathbf{G})_{ij} = g\left(\left[\text{row}(i),\text{col}(i)\right];\precisionmat_j,\boldsymbol{\mu}_j\right)$$
where $\text{row}(i)$ and $\text{col}(i)$ give the row and column of input $i$. As a result, the free variables for tuning the matrix $\mathbf{G}$ are the matrices $\precisionmat_j$ and the centres $\boldsymbol{\mu}_j$ of the $k$ Gaussian filters.
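A rough sketch of how $\mathbf{G}$ and $\mathbf{W}$ could be assembled for a 28 by 28 input (the shapes $n = 784$, $k = 40$, $h = 500$ match the numbers mentioned later; the initialisation values and names are placeholders of my own):

```python
import numpy as np

n_rows, n_cols = 28, 28
n, k, h = n_rows * n_cols, 40, 500

# row(i), col(i) for every input unit i, stacked into an (n, 2) array
coords = np.array([(i // n_cols, i % n_cols) for i in range(n)], dtype=float)

def build_G(Bs, mus):
    """Bs: (k, 2, 2) matrices, mus: (k, 2) centres -> G of shape (n, k)."""
    G = np.empty((n, k))
    for j in range(k):
        d = (coords - mus[j]) @ Bs[j].T        # rows are B_j (p_i - mu_j)
        G[:, j] = np.exp(-np.sum(d * d, axis=1))
    return G

# Free parameters: B_j and mu_j for each Gaussian, plus the mixing matrix M
Bs  = np.stack([np.eye(2) * 0.2 for _ in range(k)])
mus = np.random.uniform(0, n_rows - 1, size=(k, 2))
M   = np.random.randn(k, h) * 0.01

W = build_G(Bs, mus) @ M                       # (n, h) weight matrix
```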
In this way, $\mathbf{W}$ represents a layer in which features are captured in the form of linear combinations of Gaussian-shaped feature extractors from the input, which is assumed to have some form of two-dimensional topology.
So if we were to visualise a column from $\mathbf{G}$ in its 2D representation:
[Figure: a single column of $\mathbf{G}$ visualised as a 28 by 28 image (https://blog.wtf.sg/wp-content/uploads/2014/12/wpid-Learning_Gaussian_Feature_Extractors1.png)]
We can then use this matrix $\mathbf{W}$ as we would in an autoencoder; during training, I also use its transpose, as with tied weights. This means that whatever reconstruction it creates has to be made up of the $k$ Gaussian components that it has learnt. It learns to do this pretty well. Here is an example:
[Figure: an example reconstruction (https://blog.wtf.sg/wp-content/uploads/2014/12/wpid-Learning_Gaussian_Feature_Extractors2.png)]
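A minimal sketch of that tied-weight autoencoder pass, assuming a sigmoid nonlinearity and no biases (both assumptions of mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x, W):
    """x: (batch, n) inputs; W: (n, h) Gaussian-composed weight matrix."""
    hidden = sigmoid(x @ W)         # encode
    recon  = sigmoid(hidden @ W.T)  # decode with the transpose: the pre-sigmoid
    return recon                    # output is a combination of the k Gaussian components
```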
We can also visualise all of the columns in the $\mathbf{G}$ matrix, and see what components it has learnt:
[Figure: all 40 columns of $\mathbf{G}$, the learnt Gaussian components (https://blog.wtf.sg/wp-content/uploads/2014/12/wpid-Learning_Gaussian_Feature_Extractors3.png)]
The plot is a combined image of each column of $\mathbf{G}$ plotted on a 28 by 28 square. The 40 plots are then combined into one image of 5 rows with 8 plots per row.
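The grid can be reproduced with something along these lines (assuming a `G` of shape (784, 40) as in the sketch above; the plotting details are my own guess):

```python
import matplotlib.pyplot as plt

def plot_components(G, rows=5, cols=8, side=28):
    """Show each column of G as a side x side image, tiled into a rows x cols grid."""
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for j, ax in enumerate(axes.ravel()):
        ax.imshow(G[:, j].reshape(side, side), cmap="gray")
        ax.axis("off")
    plt.show()
```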
Notice that there are long blobs learnt for the right, top and bottom parts of the image. This means the model has noticed that there are big continuous regions of activations (or lack of activations) in those areas, and has assigned some of its components to modelling them.
I'm not sure what this could be useful for, but at the present moment it has reduced the original model, which has 784 x 500 = 392,000 parameters in its first weight matrix, to 40 x 6 + 40 x 500 = 20,240 parameters (each Gaussian component has 6 parameters: 4 for $\precisionmat$ and 2 for $\boldsymbol{\mu}$). This is far fewer parameters than the original.
One other thing to note is that we can now scale up the size of the input just by scaling $\precisionmat$ and $\boldsymbol{\mu}$ appropriately.
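For example, doubling the input resolution from 28 by 28 to 56 by 56 only requires the centres to be doubled and the precision matrices halved, so each blob keeps the same shape relative to the image. Continuing the `Bs` and `mus` from the earlier sketch (this specific scaling convention is my own reading of the claim):

```python
# Continuing the Bs (k, 2, 2) and mus (k, 2) from the earlier sketch
scale = 2.0                 # e.g. going from a 28x28 input to 56x56
mus_scaled = mus * scale    # centres move with the enlarged image
Bs_scaled  = Bs / scale     # blobs widen by the same factor, keeping their shape
```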
@misc{tan2014-12-06,
title = {Learning Gaussian Feature Extractors},
author = {Tan, Shawn},
howpublished = {\url{https://blog.wtf.sg/2014/12/06/learning-gaussian-feature-extractors/}},
year = {2014}
}