## Yes it's just doing compression. No it's not the diss you think it is.

In Ted Chiang’s New Yorker article, he likened language models to “a blurry JPEG”. JPEG is a ~~lossless~~ lossy (Edit: I meant to quote Ted Chiang, but slipped up) compression method for images. And some people absolutely hated this comparison. I’m going to attempt to convince you that the objective of maximising log-likelihood is optimising for compression. And then I’m going to cover something perhaps a little more controversial: compression and understanding aren’t antithetical concepts.
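A minimal sketch of the log-likelihood/compression link, assuming a toy character-level unigram model (the string, the two models and the function name are illustrative only). Under a model $q$, an ideal entropy coder spends about $-\log_2 q(x)$ bits on symbol $x$, so the total code length of a text is its negative log-likelihood measured in bits. Raising the likelihood therefore shortens the code.

```python
import math

def code_length_bits(text, model):
    """Ideal code length of `text` in bits under `model` (symbol -> probability)."""
    return sum(-math.log2(model[ch]) for ch in text)

text = "abracadabra"
symbols = set(text)

# Hypothetical model 1: uniform over the distinct symbols in the text.
uniform = {ch: 1 / len(symbols) for ch in symbols}

# Hypothetical model 2: maximum-likelihood unigram model fitted to the text.
fitted = {ch: text.count(ch) / len(text) for ch in symbols}

uniform_bits = code_length_bits(text, uniform)
fitted_bits = code_length_bits(text, fitted)

# Negative log-likelihood in nats is the same quantity up to a factor of ln 2.
fitted_nll_nats = fitted_bits * math.log(2)

print(f"uniform model: {uniform_bits:.2f} bits")
print(f"fitted model:  {fitted_bits:.2f} bits")
```

The fitted model has the higher likelihood and, equivalently, the shorter code. A real compressor would pair the model with something like arithmetic coding, which is not implemented here.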

## The New XOR Problem

In 1969, Marvin Minsky and Seymour Papert published *Perceptrons: An Introduction to Computational Geometry*. In it, they showed that a single-layer perceptron cannot compute the XOR function. The main argument relies on linear separability: perceptrons are linear classifiers, which essentially means drawing a line to separate the inputs that should result in 1 from those that should result in 0. You can do it in the OR and AND cases, but not for XOR. Of course, we’re way past that now: neural networks with one hidden layer can solve that problem.
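The separability argument can be checked numerically. The sketch below is an illustration under stated assumptions rather than anything from the book: it brute-forces a small grid of weights for a single threshold unit (the grid and the helper names are hypothetical), then hand-sets the weights of a one-hidden-layer network whose hidden units compute OR and AND.

```python
import itertools
import numpy as np

def step(z):
    """Threshold activation: 1 where z > 0, else 0."""
    return (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or = np.array([0, 1, 1, 1])
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

def perceptron_can_fit(X, y, grid=np.linspace(-2, 2, 9)):
    """Search a small grid of (w1, w2, b) for a single unit that reproduces y."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        if np.array_equal(step(X @ np.array([w1, w2]) + b), y):
            return True
    return False

# One hidden layer with hand-set weights: the first hidden unit computes OR,
# the second computes AND, and the output fires for "OR but not AND".
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
w2 = np.array([1.0, -1.0])
b2 = -0.5

hidden = step(X @ W1 + b1)
xor_pred = step(hidden @ w2 + b2)

print("single unit fits OR: ", perceptron_can_fit(X, y_or))
print("single unit fits AND:", perceptron_can_fit(X, y_and))
print("single unit fits XOR:", perceptron_can_fit(X, y_xor))
print("hidden-layer XOR output:", xor_pred)
```

The grid search failing on XOR is not a proof on its own, but it agrees with the linear-separability argument: no line separates $\{(0,1),(1,0)\}$ from $\{(0,0),(1,1)\}$.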

## Learning to Dequantise with Truncated Flows

A pet idea that I’ve been coming back to time and again is doing autoregressive language modelling with “stochastic embeddings”. Each word would have a distribution over the embedding that represents it, instead of a deterministic embedding. The thought is that modelling word embeddings in this way would better represent the ability of word meanings to overlap without one completely subsuming the other, or in some cases to have multi-modal representations because of the distinct word senses in which they are used (‘bank’ to refer to the ‘land alongside a body of water’ or ‘a financial institution’).

## Vectorising The Inside Algorithm

This one goes by several names: CYK, Inside, Matrix chain ordering problem. Whatever you call it, the “shape” of the algorithm looks the same, and it’s ultimately used to enumerate all possible full binary trees. In the Matrix chain ordering problem, the tree defines the pairwise order in which matrices are multiplied, e.g. $$(A(BC))(DE)$$ while CYK constructs a tree from the bottom up with Context-Free Grammar rules that would generate the observed sentence.
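To make that shared “shape” concrete, here is a minimal, unvectorised sketch of the matrix chain ordering dynamic programme (function names and the example dimensions are illustrative). The three nested loops over span length, span start and split point are the same ones CYK and the inside algorithm use. Only the quantity combined at each split differs.

```python
def matrix_chain_order(dims):
    """Matrix i has shape dims[i] x dims[i + 1]. Returns (cost, split) tables."""
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    split = [[None] * n for _ in range(n)]
    for length in range(2, n + 1):            # span length
        for i in range(n - length + 1):       # span start
            j = i + length - 1                # span end (inclusive)
            best = None
            for k in range(i, j):             # split point: (i..k)(k+1..j)
                c = cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                if best is None or c < best:
                    best = c
                    split[i][j] = k
            cost[i][j] = best
    return cost, split

def parenthesise(split, names, i, j):
    """Read the best binary tree off the split table."""
    if i == j:
        return names[i]
    k = split[i][j]
    return "(" + parenthesise(split, names, i, k) + parenthesise(split, names, k + 1, j) + ")"

# A is 10x30, B is 30x5, C is 5x60.
dims = [10, 30, 5, 60]
cost, split = matrix_chain_order(dims)
best_cost = cost[0][2]
best_tree = parenthesise(split, "ABC", 0, 2)
print(best_cost, best_tree)  # 4500 ((AB)C)
```

Replacing the `min` over split points with a sum, and the multiplication cost with rule probabilities, gives the inside algorithm's recursion over the same chart.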

## Smoothing With Backprop

If you’ve ever implemented forward-backward in an HMM (likely for a class assignment), you know this is an annoying exercise fraught with off-by-one errors or underflow issues. A fun fact that has since been made concrete by Jason Eisner’s tutorial paper in 2016 is that backpropagation is forward-backward: if you implement the forward pass for marginalisation in an HMM, then performing backpropagation will net you the result of forward-backward, i.e. the smoothing result.
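A small numerical check of this claim, under stated assumptions: the HMM below is random and tiny, the function names are illustrative, and central finite differences stand in for backpropagation so that the sketch needs only NumPy. The gradient of $\log p(x_{1:T})$ with respect to the log emission score at $(t, k)$ should equal the smoothed posterior $p(z_t = k \mid x_{1:T})$.

```python
import numpy as np

def forward_loglik(pi, A, B):
    """log p(x_1..x_T) via the forward pass. B[t, k] = p(x_t | z_t = k)."""
    alpha = pi * B[0]
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]
    return np.log(alpha.sum())

def forward_backward(pi, A, B):
    """Smoothed posteriors p(z_t = k | x_1..x_T), computed the classic way."""
    T, K = B.shape
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

def grad_wrt_log_emissions(pi, A, B, eps=1e-6):
    """Finite-difference gradient of the log-likelihood w.r.t. log B[t, k]."""
    log_B = np.log(B)
    grad = np.zeros_like(B)
    for idx in np.ndindex(*B.shape):
        up, down = log_B.copy(), log_B.copy()
        up[idx] += eps
        down[idx] -= eps
        grad[idx] = (forward_loglik(pi, A, np.exp(up))
                     - forward_loglik(pi, A, np.exp(down))) / (2 * eps)
    return grad

# Random toy HMM: K states, T time steps (no rescaling, so keep T small).
rng = np.random.default_rng(0)
K, T = 3, 5
pi = rng.dirichlet(np.ones(K))          # initial distribution
A = rng.dirichlet(np.ones(K), size=K)   # A[i, j] = p(z_{t+1} = j | z_t = i)
B = rng.uniform(0.1, 1.0, size=(T, K))  # emission likelihoods of the observations

posteriors = forward_backward(pi, A, B)
gradients = grad_wrt_log_emissions(pi, A, B)
print(np.abs(posteriors - gradients).max())
```

With an autodiff framework the `grad_wrt_log_emissions` helper would be replaced by a single backward pass through `forward_loglik`, which is the point of the observation: only the forward pass has to be written by hand.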

## Traversing Connectionist Trees

We’ve just put our paper “Recursive Top-Down Production for Sentence Generation with Latent Trees” up on arXiv. The code is here. The paper has been accepted to EMNLP Findings (slightly disappointing for me, but such is life). This has been an interesting project to work on. Automata theory has been an interesting topic for me, coming up close behind machine learning. Context-free grammars (CFGs), in particular, come up often when studying language, and grammars are often written as CFG rewriting rules.

## On “Better Exploiting Latent Variables in Text Modeling”

I’ve been working on latent variable language models for some time, and intend to make it the topic of my PhD. So when Google Scholar recommended “Better Exploiting Latent Variables in Text Modeling”, I was naturally excited to see that this work has continued beyond Bowman’s paper on VAE language models. Of course, since then, there have been multiple improvements on the original model. More recently, Yoon Kim from Harvard has been publishing papers on this topic that have been particularly interesting.

## Can Chinese Rooms Think?

There’s a tendency as a machine learning or CS researcher to get into a philosophical debate about whether machines will ever be able to think like humans. This argument goes so far back that the people who started the field had to grapple with it. It’s also fun to think about, especially with sci-fi always portraying AI vs human world-ending/apocalypse showdowns, and humans always prevailing because of love or friendship or humanity.

But there’s a tendency for people in such a debate to wind up talking past each other.

Consider a matrix $\mathbf{X}$ of shape $(n, d)$ whose rows are the datapoints $\mathbf{x}_i$. The matrix $\mathbf{M}$ is made up of the means $\boldsymbol{\mu}_j$ of $k$ different Gaussian components. The task is to compute the log probability of each of the $n$ data points under each of these $k$ components.

```python
import theano
import theano.tensor as T
import numpy as np
import time

X = T.matrix('X')
M = T.
```
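A NumPy sketch of the computation described above, under one loud assumption: the excerpt does not give the covariances, so every component is taken to be an isotropic Gaussian with identity covariance. The function name is illustrative and this is not the Theano code from the post. Broadcasting builds all $n \times k$ differences at once instead of looping over components.

```python
import numpy as np

def gaussian_log_probs(X, M):
    """Log density of each row of X under each unit-covariance Gaussian mean in M.

    X: (n, d) datapoints. M: (k, d) component means. Returns an (n, k) array.
    Assumption: identity covariance for every component.
    """
    n, d = X.shape
    diff = X[:, None, :] - M[None, :, :]   # (n, k, d) via broadcasting
    sq_dist = (diff ** 2).sum(axis=-1)     # (n, k) squared distances
    return -0.5 * sq_dist - 0.5 * d * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))   # n = 4 points in d = 2 dimensions
M = rng.normal(size=(3, 2))   # k = 3 component means
log_p = gaussian_log_probs(X, M)
print(log_p.shape)  # (4, 3)
```

The intermediate `(n, k, d)` array can get large. Expanding the square as $\|\mathbf{x}\|^2 - 2\,\mathbf{x}^\top\boldsymbol{\mu} + \|\boldsymbol{\mu}\|^2$ avoids it at the cost of some numerical precision.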