We’ve just put our paper “Recursive Top-Down Production for Sentence Generation with Latent Trees” up on ArXiv.
The code is here.
The paper has been accepted to EMNLP Findings (slightly disappointing for me, but such is life).
This has been an interesting project to work on.
Automata theory has been an interesting topic for me, coming in close behind machine learning.
Context-free grammars (CFGs), in particular, come up often when studying language, and grammars are often written as CFG rewriting rules.
Probabilistic CFGs (PCFGs) augment CFGs by assigning a probability distribution over the rewriting rules of each non-terminal.
The upshot is that you can now sample from a PCFG, or assign a probability value to a sequence generated by it.
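As a rough illustration, sampling from a PCFG can be sketched as a recursive expansion. The toy grammar below is a made-up example, not one from the paper:

```python
import random

# Toy PCFG (hypothetical example): each non-terminal maps to
# (right-hand sides, probabilities). Lowercase symbols are terminals.
rules = {
    "S":  ([("NP", "VP")], [1.0]),
    "NP": ([("the", "N")], [1.0]),
    "VP": ([("V", "NP"), ("V",)], [0.6, 0.4]),
    "N":  ([("dog",), ("cat",)], [0.5, 0.5]),
    "V":  ([("saw",), ("barked",)], [0.5, 0.5]),
}

def sample(symbol):
    """Recursively expand a symbol into a list of terminal tokens."""
    if symbol not in rules:  # terminal: emit it
        return [symbol]
    rhss, probs = rules[symbol]
    rhs = random.choices(rhss, weights=probs)[0]
    return [tok for s in rhs for tok in sample(s)]

print(" ".join(sample("S")))  # e.g. "the dog saw the cat"
```

Scoring a derivation works the same way: multiply the probability of every rule used along the expansion.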
One of the fun algorithms related to CFGs is the Cocke-Younger-Kasami (CYK) algorithm.
The algorithm takes an input sequence and a set of CFG rewriting rules converted to Chomsky normal form, and finds all parses of the input sequence that the grammar allows.
More interestingly, it does this in $O(N^3)$ time (ignoring factors that depend on the size of the grammar).
The same algorithm can be modified to marginalise over all possible parse trees that could generate a sequence given a PCFG.
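That marginalisation, the Inside algorithm, can be sketched roughly as follows on a toy grammar in Chomsky normal form. The grammar and probabilities here are made up for illustration:

```python
import collections

# Toy PCFG in Chomsky normal form (hypothetical example).
# binary[A] = list of ((B, C), prob); lex[A] = {word: prob}.
binary = {
    "S": [(("A", "B"), 1.0)],
    "A": [(("A", "A"), 0.5)],
    "B": [(("B", "B"), 0.5)],
}
lex = {
    "A": {"a": 0.5},
    "B": {"b": 0.5},
}

def inside(words):
    n = len(words)
    # chart[i][j][A] = total probability that A yields words[i:j]
    chart = [[collections.defaultdict(float) for _ in range(n + 1)]
             for _ in range(n + 1)]
    for i, w in enumerate(words):               # width-1 spans from the lexicon
        for A, emissions in lex.items():
            if w in emissions:
                chart[i][i + 1][A] += emissions[w]
    for span in range(2, n + 1):                # span length
        for i in range(0, n - span + 1):        # span start
            j = i + span
            for k in range(i + 1, j):           # split point: O(N^3) overall
                for A, rhss in binary.items():
                    for (B, C), p in rhss:
                        chart[i][j][A] += p * chart[i][k][B] * chart[k][j][C]
    return chart[0][n]["S"]  # marginal over all parses rooted at S

print(inside(["a", "a", "b"]))  # → 0.0625
```

The three nested loops over span length, start, and split point are where the cubic time comes from.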
It’s surprising that this can be done in cubic time.
If we consider how many full binary trees of $N$ leaves there can be and how this number grows:
$$\begin{aligned}
N &= 2: &\quad& 1, \\
N &= 3: && 2, \\
N &= 4: && 5, \\
N &= 5: && 14, \\
N &= 6: && 42.
\end{aligned}$$
This sequence corresponds to the $(N-1)$-th Catalan number.
The $k$-th Catalan number has the closed form:

$$C_k = \frac{1}{k+1}\binom{2k}{k}.$$
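As a quick sanity check on the counts above, the closed form $C_k = \frac{1}{k+1}\binom{2k}{k}$ can be computed directly (a minimal sketch):

```python
from math import comb

def catalan(k):
    # k-th Catalan number: C_k = binom(2k, k) / (k + 1)
    return comb(2 * k, k) // (k + 1)

# The number of full binary trees with N leaves is C_{N-1}:
print([catalan(n - 1) for n in range(2, 7)])  # [1, 2, 5, 14, 42]
```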
For any production from a non-terminal, the choice of rule is independent of the non-terminal's parents. This independence is what makes CYK work and allows for the cubic-time algorithm.
This is where we get to the method described in our paper.
In our model, we wanted a production function that continuously expanded a root representation into a tree structure where its leaves emitted the sequence under consideration.
This meant the independence assumption of PCFGs was gone, and since these intermediate representations were continuous, marginalising over them as the Inside algorithm does was no longer possible.
The result was the algorithm based on successive leaves described in the paper.
While it ran in exponential time, it was an improvement over the factorial time that a naive method (enumerating every possible permutation of full binary trees) would take.
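To get a feel for why naive enumeration blows up, here is a rough sketch that enumerates every full binary tree over a fixed sequence (illustrative only, not the paper's algorithm). Even before permuting the leaves, the count already grows as the Catalan numbers:

```python
# Brute-force enumeration of every full binary tree over a sequence,
# representing internal nodes as (left, right) pairs.
def trees(tokens):
    if len(tokens) == 1:
        return [tokens[0]]
    out = []
    for k in range(1, len(tokens)):          # every split point
        for left in trees(tokens[:k]):
            for right in trees(tokens[k:]):
                out.append((left, right))
    return out

print(len(trees(list("abcde"))))  # 14 trees for 5 leaves
```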
This was an interesting takeaway from this project for me, though I'm not sure how useful the knowledge will be.
This blog post is about the complexity of the problem, but the benefits of using a tree-structured prior for a systematic generalisation task like SCAN are also an interesting takeaway.
Since getting these results, there have been better-performing methods on the SCAN dataset specifically (the LANE paper comes to mind), but our goal with this method is to extend it to natural language tasks.
For now, the exponential complexity is still a huge hindrance to the scalability of this method, but… we’re working on it!