Learning to Dequantise with Truncated Flows

A pet idea that I’ve been coming back to time and again is doing autoregressive language modelling with “stochastic embeddings”. Each word would have a distribution over embeddings rather than a single deterministic embedding. The thought is that modelling word embeddings this way would better capture how word meanings can overlap without one completely subsuming the other, or, in some cases, how a word can have a multi-modal representation because of the distinct senses in which it is used (‘bank’ can refer to the ‘land alongside a body of water’ or ‘a financial institution’).

Continue reading

Vectorising The Inside Algorithm

This one goes by several names: CYK, Inside, Matrix chain ordering problem. Whatever you call it, the “shape” of the algorithm looks the same, and it’s ultimately used to enumerate all possible full binary trees. In the matrix chain ordering problem, the tree defines the pairwise order in which matrices are multiplied, e.g. $(A(BC))(DE)$, while CYK constructs a tree from the bottom up using context-free grammar rules that would generate the observed sentence.
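The shared “shape” can be sketched as a dynamic programme over spans. This is a minimal, unvectorised illustration and not the implementation from the post; the function names `matrix_chain_cost` and `count_binary_trees` are made up for this example. Both functions fill the same table over spans and split points, and only the operations used to combine and reduce the sub-results differ.

```python
import numpy as np


def matrix_chain_cost(dims):
    """Minimum number of scalar multiplications for a matrix chain.

    Matrix i has shape (dims[i], dims[i + 1]).
    """
    n = len(dims) - 1  # number of matrices in the chain
    cost = np.zeros((n, n))
    for length in range(2, n + 1):        # span length
        for i in range(n - length + 1):   # span start
            j = i + length - 1            # span end (inclusive)
            # Try every split point k: left span (i..k), right span (k+1..j).
            cost[i, j] = min(
                cost[i, k] + cost[k + 1, j]
                + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )
    return cost[0, n - 1]


def count_binary_trees(n):
    """Number of full binary trees over n leaves, using the same loop shape.

    Replacing (min, +) with (sum, *) turns the optimisation into a count.
    """
    count = np.zeros((n, n), dtype=np.int64)
    for i in range(n):
        count[i, i] = 1  # a single leaf is one tree
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            count[i, j] = sum(
                count[i, k] * count[k + 1, j] for k in range(i, j)
            )
    return int(count[0, n - 1])
```

For example, `matrix_chain_cost([10, 30, 5, 60])` gives 4500 (multiplying the first two matrices first), and `count_binary_trees(5)` gives 14, the number of ways to bracket a five-matrix chain such as $(A(BC))(DE)$.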

Continue reading

Smoothing With Backprop

If you’ve ever implemented forward-backward in an HMM (likely for a class assignment), you know this is an annoying exercise fraught with off-by-one errors or underflow issues. A fun fact that has since been made concrete by Jason Eisner’s tutorial paper in 2016 is that backpropagation is forward-backward — if you implemented the forward pass for marginalisation for an HMM, then performing backpropagation will net you the result of forward-backward, or the smoothing result.
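A small numerical check of this claim can be sketched in NumPy. Everything here is an assumption made for illustration: the toy HMM is random, the function names are made up, and finite differences stand in for backpropagation so the example needs no autodiff library. The gradient of the forward log-likelihood with respect to the log emission scores should match the posterior state marginals computed by an explicit forward-backward pass.

```python
import numpy as np


def logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def forward_log_likelihood(log_init, log_trans, log_emit):
    """log p(observations); log_emit[t, s] = log p(obs_t | state s)."""
    alpha = log_init + log_emit[0]
    for t in range(1, log_emit.shape[0]):
        # Sum over the previous state, then add the current emission score.
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_emit[t]
    return logsumexp(alpha, axis=0)


def forward_backward(log_init, log_trans, log_emit):
    """Posterior marginals p(state_t = s | all observations), shape (T, S)."""
    T, S = log_emit.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = log_init + log_emit[0]
    for t in range(1, T):
        alpha[t] = logsumexp(alpha[t - 1][:, None] + log_trans, axis=0) + log_emit[t]
    for t in range(T - 2, -1, -1):
        beta[t] = logsumexp(log_trans + (log_emit[t + 1] + beta[t + 1])[None, :], axis=1)
    log_z = logsumexp(alpha[-1], axis=0)
    return np.exp(alpha + beta - log_z)


def numerical_grad(f, x, eps=1e-5):
    # Central finite differences, used here in place of backpropagation.
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        bump = np.zeros_like(x)
        bump[idx] = eps
        grad[idx] = (f(x + bump) - f(x - bump)) / (2 * eps)
    return grad


# Toy HMM with random parameters (hypothetical, for the check only).
rng = np.random.default_rng(0)
T, S = 5, 3
log_init = np.log(rng.dirichlet(np.ones(S)))
log_trans = np.log(rng.dirichlet(np.ones(S), size=S))  # rows: p(next | current)
log_emit = np.log(rng.uniform(0.1, 1.0, size=(T, S)))

posteriors = forward_backward(log_init, log_trans, log_emit)
grads = numerical_grad(
    lambda e: forward_log_likelihood(log_init, log_trans, e), log_emit
)
print(np.allclose(posteriors, grads, atol=1e-6))
```

With an autodiff framework, `numerical_grad` would be replaced by a single gradient call on `forward_log_likelihood`, and the hand-written backward recursion would not be needed at all.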

Continue reading

Traversing Connectionist Trees

We’ve just put our paper “Recursive Top-Down Production for Sentence Generation with Latent Trees” up on arXiv. The code is here. The paper has been accepted to EMNLP Findings (slightly disappointing for me, but such is life). This has been an interesting project to work on. Automata theory has been an interesting topic for me, coming in close behind machine learning. Context-free grammars (CFGs), in particular, come up often when studying language, and grammars are often written as CFG rewriting rules.

Continue reading

On “Better Exploiting Latent Variables in Text Modeling”

I’ve been working on latent variable language models for some time, and intend to make it the topic of my PhD. So when Google Scholar recommended “Better Exploiting Latent Variables in Text Modeling”, I was naturally excited to see that this work has continued beyond Bowman’s paper on VAE language models. Of course, since then, there have been multiple improvements on the original model. More recently, Yoon Kim from Harvard has been publishing papers on this topic that have been particularly interesting.

Continue reading

Can Chinese Rooms Think?

There’s a tendency among machine learning and CS researchers to get into philosophical debates about whether machines will ever be able to think like humans. This argument goes so far back that the people who started the field had to grapple with it. It’s also fun to think about, especially with sci-fi always portraying AI vs human world-ending/apocalypse showdowns, and humans always prevailing because of love or friendship or humanity.

But there’s a tendency for people in such a debate to wind up talking past each other.

Continue reading

Computing Log Normal for Isotropic Gaussians

Consider a matrix $\mathbf{X}$ of shape $(n, d)$ whose rows are the datapoints $\mathbf{x}_i$. The matrix $\mathbf{M}$ is made up of the means $\boldsymbol{\mu}_j$ of $k$ different Gaussian components. The task is to compute the log probability of each data point under each of these $k$ components, for all $n$ data points.

    import theano
    import theano.tensor as T
    import numpy as np
    import time
    X = T.matrix('X')
    M = T.
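The computation described here can be sketched in plain NumPy rather than Theano. This is an illustrative version under stated assumptions, not the post's code: each component $j$ is assumed to have its own scalar standard deviation $\sigma_j$ (the excerpt does not say how the variances are parameterised), and the function name is hypothetical. It uses the expansion $\|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 = \|\mathbf{x}_i\|^2 - 2\,\mathbf{x}_i^\top\boldsymbol{\mu}_j + \|\boldsymbol{\mu}_j\|^2$ so that all $n \times k$ squared distances come from one matrix product.

```python
import numpy as np


def isotropic_gaussian_log_prob(X, M, sigma):
    """Log density of every row of X under every isotropic Gaussian component.

    X:     (n, d) data points
    M:     (k, d) component means
    sigma: (k,)   per-component standard deviation (assumed parameterisation)
    Returns an (n, k) array of log N(x_i | mu_j, sigma_j^2 I).
    """
    n, d = X.shape
    # Pairwise squared distances via ||x||^2 - 2 x.mu + ||mu||^2, shape (n, k).
    sq_dist = (
        np.sum(X ** 2, axis=1)[:, None]
        - 2.0 * X @ M.T
        + np.sum(M ** 2, axis=1)[None, :]
    )
    log_norm = -0.5 * d * np.log(2.0 * np.pi) - d * np.log(sigma)
    return log_norm[None, :] - sq_dist / (2.0 * sigma[None, :] ** 2)
```

The expansion avoids building an $(n, k, d)$ tensor of differences, at the cost of some numerical precision when points lie very close to a mean.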

Continue reading

Details of the Hierarchical VAE (Part 2)

So, as a recap, here’s a (badly drawn) diagram of the architecture:

Architecture diagram

Continue reading

Details of the Hierarchical VAE (Part 1)

The motivation to use a hierarchical architecture for this task was two-fold:

  1. A vanilla encoder-decoder architecture would be the basic deep learning go-to model for such a task. However, if we perform maximum likelihood, the noise is modelled only at the pixel level. This seems inappropriate as it implies there is one “right answer”, with some pixel colour variations, given an outer context. The hierarchical VAE models different uncertainties at different levels of abstraction, so it seems like a good fit.
  2. I wanted to investigate how the hierarchical factorisation of the latent variables affects learning in such a model. It turns out certain layers overfit, or at least I would diagnose the problem as overfitting, and I’m unsure how to remedy those problems.

Continue reading

Samples from the Hierarchical VAE

Each of the following plots shows samples from the conditional VAE that I’m using for the inpainting task. As expected with results from a VAE, they’re blurry. However, the fun thing about having a hierarchy of latent variables is that I can freeze all the layers except for one, and vary just that one to see the type of noise it models. The pictures are generated by using the mean $\mu_{z_l}(z_{l-1})$ for all layers except for the $i$-th layer.


Continue reading