categories:
- Uncategorized
tags:
- Math
- numpy
- Vectorisation
---
Having played around with and written up some simple machine learning algorithms over the last couple of months, I’ve decided to write about some useful tricks of the trade that I’ve learnt. These are not specific to any algorithm, but things I’ve found useful regardless of the approach you intend to take. Many machine learning techniques can be expressed as a series of matrix operations. As a result, it isn’t uncommon to see the term “vectorisation” thrown about in machine learning papers. This essentially means reframing the problem as a series of matrix operations, which results in some performance gains.
There has been plenty of work done to speed up matrix operations, implemented as libraries, with bindings written in different languages to take full advantage of them.
## A simple example
Let me try to paint a clearer picture of why vectorisation gains you much better performance:
Say you have a linear model [latex]f(\mathbf{x}) = w_1x_1 + w_2x_2 + \cdots + w_nx_n[/latex], feature vectors [latex]\mathbf{x_i}[/latex] in the form [latex](x_1,x_2,\cdots,x_n)[/latex], and a weight for each of these variables in the form [latex](w_1,w_2,\cdots,w_n)[/latex]. Say you wanted to compute [latex]f(\mathbf{x_i})[/latex] for each of these feature vectors; one way to do that would be the following:
    result = []
    for X_i in XList:
        total = 0                      # running dot product of X_i and w
        for j in range(len(X_i)):
            total += X_i[j] * w[j]
        result.append(total)
In numpy, this can be achieved with a single `numpy.dot(w,X)` call. That way, the entire computation is done in native code and runs much, much faster.
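For instance, here is a minimal sketch of the vectorised version, assuming the feature vectors in `XList` are stacked as the rows of a 2-D array `X` (the stacked array is mine, for illustration):

    import numpy as np

    X = np.array(XList)       # shape (m, n): one feature vector per row
    w = np.array(w)           # shape (n,)

    # a single call computes f(x_i) for every row at once
    result = np.dot(X, w)     # shape (m,)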
## Broadcasting
Before we get into the tricky part, let’s first discuss how we can construct an operation that does the following:
$$
\begin{align*}
\left[\begin{matrix}
a\\
b\\
c
\end{matrix}\right] +
\left[\begin{matrix}
x_1 & x_2 & x_3 \\
y_1 & y_2 & y_3 \\
z_1 & z_2 & z_3
\end{matrix}\right] =
\left[\begin{matrix}
x_1 + a & x_2 + a & x_3 + a \\
y_1 + b & y_2 + b & y_3 + b \\
z_1 + c & z_2 + c & z_3 + c
\end{matrix}\right]
\end{align*}
$$
Another feature of numpy is broadcasting. However, the way it works depends on how the dimensions of the two operands line up. At the most basic level, adding a scalar to a 1-dimensional numpy array gives you the following:
$$
\begin{align*}
a +
\left[
\begin{matrix}
x & y & z
\end{matrix}
\right] = \left[
\begin{matrix}
x + a & y + a & z + a
\end{matrix}
\right]
\end{align*}
$$
This should make our job relatively easy, right? Since what we want is for each element in the vector to broadcast across a row of the matrix, all we would have to do is add the two together. However, by simply defining `numpy.array([a,b,c])`, we’re effectively only defining a single-row matrix, which would broadcast down the columns instead. In order to make it a column vector, we’re going to have to “reshape” the array. What this effectively does is create a new view onto the memory region that stores the original array, with the dimensions redefined. This way, we can turn the row matrix into a column vector, and then do the addition.
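For example, here is a minimal sketch of this reshape-then-broadcast pattern (the values are just illustrative):

    import numpy as np

    v = np.array([1.0, 2.0, 3.0])      # shape (3,): behaves like a row
    M = np.arange(9.0).reshape(3, 3)   # a 3x3 matrix

    # reshape v into a (3, 1) column so element i is added to row i of M
    col = v.reshape(3, 1)
    print(col + M)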
Broadcasting is covered in more detail in the [numpy docs][1].
## The logsumexp trick
In many machine learning computations, the values involved are minute probabilities or extremely large exponentiated numbers. A common trick is to bring these values into the logarithmic scale and substitute all multiplicative operations with additions.
So then the question arises: what do you do with additive operations? The obvious way would be to re-exponentiate the values, sum them up, then take the log again. Unfortunately, re-exponentiating the values may bring back the original problem of overflow or underflow.
Here’s what we want to find: [latex]\log T[/latex], where [latex]T[/latex] is the sum of the values we are trying to add up, [latex]x_1,x_2,\cdots,x_n[/latex]. The [latex]x_i[/latex] are either too large or too small, so we need to find a way to bring everything down a notch. Say we find a value [latex]L[/latex] such that for all the [latex]x_i[/latex], [latex]\frac{x_i}{L}[/latex] would not cause any floating point errors. Then,
$$
\begin{align*}
T &= \sum^{n}_{i=1} x_i\\
\text{Scale everything down by $L$}\\
\frac{T}{L} &= \sum^{n}_{i=1} \frac{x_i}{L}\\
\text{and log both sides}\\
\log T - \log L &= \log \sum^{n}_{i=1} \frac{x_i}{L}\\
\log T - \log L &= \log \sum^{n}_{i=1} \exp\left(\log \frac{x_i}{L} \right)\\
\log T - \log L &= \log \sum^{n}_{i=1} \exp\left(\log x_i - \log L \right)\\
\log T &= \log L + \log \sum^{n}_{i=1} \exp\left(\log x_i - \log L \right)
\end{align*}
$$
And that’s how we compute [latex]\log T[/latex] without any of these calculations exploding out to magnitudes beyond our control. Oh, and [latex]L[/latex] is usually defined as:
$$
L = \max_{i} |x_i|
$$
The same idea applies to vectors and matrices: once everything is in log space, the standard `numpy.dot` operation can no longer be used directly. We can write something to replace it; it won’t be as fast, but it’s still faster than computing the same thing iteratively. As for the logsumexp trick itself, scipy.misc has an implementation which I use.
Alex Smola talks about [this][2] on his blog.
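To make the trick concrete, here is a minimal hand-rolled sketch for a plain 1-D array of log-values, using only numpy (scipy’s `misc.logsumexp` does the same thing, with more care taken over edge cases):

    import numpy as np

    def logsumexp_1d(log_x):
        """Compute log(sum(exp(log_x))) without overflowing or underflowing."""
        log_L = np.max(log_x)  # log L, the largest log-value
        return log_L + np.log(np.sum(np.exp(log_x - log_L)))

    log_x = np.array([-1000.0, -1001.0, -1002.0])  # exp() of these underflows to 0
    print(logsumexp_1d(log_x))                     # roughly -999.59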
## logsumexpdot
So, to sum it all up: combining the two ideas above, we can perform dot-product-style operations on matrices with very large or very small values.
    from scipy import misc

    # reshape loga into a column, broadcast-add over logM, logsumexp each column
    misc.logsumexp(loga.reshape(loga.shape + (1,)) + logM, axis=0)
1. Reshape the original row matrix into a column vector
$$
\begin{align*}
\left[\begin{matrix}
x_1 & \cdots & x_n
\end{matrix}\right]
\Rightarrow
\left[\begin{matrix}
x_1 \\
\vdots \\
x_n
\end{matrix}\right]
\end{align*}
$$
2. Perform the logsumexp addition between vector and matrix
$$
\begin{align*}
\left[\begin{matrix}
x_1 \\
\vdots \\
x_n
\end{matrix}\right]
+
\left[\begin{matrix}
y_{1,1} & \cdots & y_{1,m} \\
\vdots & \ddots & \vdots \\
y_{n,1} & \cdots & y_{n,m} \\
\end{matrix}\right]
=
\left[\begin{matrix}
y_{1,1} + x_1 & \cdots & y_{1,m} + x_1\\
\vdots & \ddots & \vdots \\
y_{n,1} + x_n & \cdots & y_{n,m} + x_n \\
\end{matrix}\right]
\end{align*}
$$
3. Sum up the matrix column-wise using logsumexp
$$\begin{align*}
\sum_{\downarrow}
\left[\begin{matrix}
z_{1,1} & \cdots & z_{1,m} \\
\vdots & \ddots & \vdots \\
z_{n,1} & \cdots & z_{n,m} \\
\end{matrix}\right]
=\left[
\begin{matrix}
\displaystyle\sum^n_{i=1} z_{i,1}
&
\cdots
&
\displaystyle\sum^n_{i=1} z_{i,m}
\end{matrix}\right]
\end{align*}$$
And there we have it!
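As a quick sanity check, here is a hedged end-to-end sketch with values small enough to also compute directly; `loga` and `logM` are assumed to be the element-wise logs of a vector `a` and a matrix `M` (in newer scipy versions `logsumexp` lives in `scipy.special` rather than `scipy.misc`):

    import numpy as np
    from scipy import misc

    a = np.array([0.2, 0.5, 0.3])
    M = np.array([[0.1, 0.4],
                  [0.3, 0.2],
                  [0.6, 0.4]])

    loga, logM = np.log(a), np.log(M)

    # log-space dot product: log(a . M), computed without leaving log space
    log_result = misc.logsumexp(loga.reshape(loga.shape + (1,)) + logM, axis=0)

    print(np.exp(log_result))   # ~ [0.35, 0.3]
    print(np.dot(a, M))         # same values, computed directly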
[1]: http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
[2]: http://blog.smola.org/post/987977550/log-probabilities-semirings-and-floating-point-numbers