Yes it's just doing compression. No it's not the diss you think it is.
In Ted Chiang’s New Yorker article, he likened language models to “a blurry JPEG”. JPEG is a lossy compression method for images. And some people absolutely hated this comparison.
I’m going to attempt to convince you that the objective of maximising log-likelihood is optimising for compression. And then I’m going to cover something perhaps a little more controversial: compression and understanding aren’t antithetical concepts.
Or in fewer words: “The loss is the sauce.”
Say you have a text file of a bunch of random words from a dictionary of $N$ words, and you want to transmit these words with some binary encoding. A simple approach would be to assign a binary code to each word, each of length $\lceil \log_2 N \rceil$. Say there are 4 words, and we make the assignment, $$A: 00,~B: 01,~C: 10,~D: 11$$
Can we do better? If you know the frequency of each word, and the frequencies aren’t uniform, you absolutely can.
Consider the case where the dictionary only has words with the following frequencies:
\begin{align*}
P(w=A) &= 0.99 \\
P(w=B) &= 0.006 \\
P(w=C) &= 0.002 \\
P(w=D) &= 0.002 \\
\end{align*}
This should give you a text sequence with repeated $A$s interspersed with a few $B$s, $C$s, and $D$s. As an example, a tiny snippet of the file might look like this: $$AAAAAAAAAAAAABAABDAAAA$$ Should words with higher frequencies get longer codes or shorter ones? If we made them equal in length,
$$00000000000000000000000000010000011100000000 \sim 44 \text{bits}$$
we’d be missing out on an opportunity to use fewer bits for the words we know occur more often, compressing the text file in the process. If we used this substitution, $$A: 0,~B: 10,~C: 110,~D:111$$
it would result in, $$00000000000001000101110000 \sim 26 \text{bits}$$
We can quantify this with the expected (average) bit-length: $$\sum_w P(w) \cdot \mathrm{EncodingLength}(w)$$
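Plugging the two codes above into this formula (just arithmetic on the frequencies we assumed earlier):
$$\begin{align*}
\text{fixed-length:}&\quad 0.99 \cdot 2 + 0.006 \cdot 2 + 0.002 \cdot 2 + 0.002 \cdot 2 = 2~\text{bits per word}\\
\text{variable-length:}&\quad 0.99 \cdot 1 + 0.006 \cdot 2 + 0.002 \cdot 3 + 0.002 \cdot 3 \approx 1.01~\text{bits per word}
\end{align*}$$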
So, having a better idea of what words you’ll see gives you a better chance of compressing a large body of words.
Shannon’s source coding theorem
In 1948, the OG Claude E. Shannon wrote A Mathematical Theory of Communication, where he showed that the shortest possible average bit-length is about the entropy of that word distribution, $$\sum_w P(w) \cdot \mathrm{EncodingLength}(w) \approx -\sum_w P(w) \log_2 P(w) = H(w)$$ We can kinda guess from the equation that $\mathrm{EncodingLength}(w) \approx -\log_2 P(w)$. If we had a uniform distribution over words (all words appearing with equal frequency), we’d be back where we began. Consider again the words $A$, $B$, $C$, $D$, $$-\sum_w P(w) \cdot\log_2 P(w) =-\sum_w \frac{1}{4} \cdot \log_2 \frac{1}{4} = 2~\text{bits}$$ This tells us that with a uniform distribution over 4 words, 2 bits per word is the best we can do!
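Here’s a quick sketch of that entropy calculation in Python, using the two distributions from above (the function name is my own):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: the shortest achievable average bits per word."""
    return -sum(p * math.log2(p) for p in probs.values())

uniform = {w: 0.25 for w in "ABCD"}
skewed = {"A": 0.99, "B": 0.006, "C": 0.002, "D": 0.002}

print(entropy_bits(uniform))  # 2.0 bits per word, as computed above
print(entropy_bits(skewed))   # ~0.09 bits per word: the more skewed the
                              # distribution, the more room there is to compress
```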
But how do you go from knowing the frequency of each word, to assigning an actual binary code to each word?
Huffman coding
In 1951, Robert Fano and Shannon were working on this very problem. In a Good Will Hunting-esque twist, Fano, like all professors, assigned homework based on his own research. He told his students, among them David Huffman, that they would be exempted from the final exam if they could solve a coding challenge. Not knowing that it was an open problem at the time, Huffman eventually found a solution and proved that it was optimal, days before the exam. Today, Huffman coding remains one of the most commonly used encoding schemes.
Huffman’s method is surprisingly simple:
Constructing the Huffman Tree
- Start with $N$ nodes, with each node representing the word and its frequency $n_w = (w, P(w))$
- Start merging nodes to form a tree:
- Find the two nodes with the smallest frequencies $n_a$, $n_b$
- Merge them to form a new node $n_{a+b} = ((n_a, n_b), P(a) + P(b))$
- Repeat until only one node is left.
To encode: For any given word, the encoding is the path from the root to the word. A left branch encodes a $0$, while a right branch encodes a $1$.
To decode: Given a binary string,
- Follow the code down the branches of the Huffman tree.
- Once you get to a leaf,
- Decode the word, and
- Start at the root again for the next word.
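Here’s a minimal Python sketch of the whole procedure (the function names are mine, and ties between equal frequencies are broken arbitrarily, so the exact bit patterns may differ from the ones above, but the code lengths come out the same):

```python
import heapq

def build_huffman_tree(probs):
    """probs: {word: frequency}. Returns the root of the Huffman tree, where a
    leaf is a word and an internal node is a (left, right) tuple."""
    heap = [(p, i, w) for i, (w, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)                                     # tie-breaker for equal frequencies
    while len(heap) > 1:
        p_a, _, a = heapq.heappop(heap)                 # two least-frequent nodes...
        p_b, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (p_a + p_b, tie, (a, b)))  # ...merged into one
        tie += 1
    return heap[0][2]

def codebook(tree, prefix=""):
    """Left branch encodes a 0, right branch a 1; a word's code is its root-to-leaf path."""
    if not isinstance(tree, tuple):                     # leaf
        return {tree: prefix or "0"}
    return {**codebook(tree[0], prefix + "0"),
            **codebook(tree[1], prefix + "1")}

def decode(bits, tree):
    out, node = [], tree
    for b in bits:
        node = node[0] if b == "0" else node[1]         # walk down the tree
        if not isinstance(node, tuple):                 # hit a leaf: emit the word,
            out.append(node)
            node = tree                                 # then restart at the root
    return "".join(out)

probs = {"A": 0.99, "B": 0.006, "C": 0.002, "D": 0.002}
tree = build_huffman_tree(probs)
code = codebook(tree)                                   # code lengths: 1, 2, 3, 3
text = "AAAAAAAAAAAAABAABDAAAA"
bits = "".join(code[w] for w in text)
assert decode(bits, tree) == text
print(len(bits))                                        # 26 bits for the snippet above
```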
Since then, things have gotten significantly more advanced, and today (as far as I know) we often see Asymmetric Numeral Systems (ANS) being used.
What does any of this have to do with Language Models?
Everything above assumed that the words in the document were drawn independently and identically distributed (iid) across the file. But words in language never occur in this way.
Language models are trained to predict the next word, given all of the prior context.
This means that, for every position in a given text file, the model assigns a probability to every possible word in the dictionary, and its objective is to maximise the probability of the right word,
$$\begin{align*}
&\arg\max_\theta \log P_\theta(w_t| w_{t-1}, w_{t-2}, \cdots, w_1) \\
={}&\arg\min_\theta -\log P_\theta(w_t| w_{t-1}, w_{t-2}, \cdots, w_1)
\end{align*}$$
which means we’re trying to minimise the _negative log-likelihood_ for each word.
Recall that in the iid word case, the best code would have an encoding length of about $-\log_2 P(w)$.
But we can also use this *conditional* probability for compression, just like before.
The fact that we have a probability distribution remains the same, only now, the distribution is _informed_ by the fact that we know all the words coming before.
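To make the link concrete: the negative log-likelihood of each correct word, taken in base 2, is just the code length we would assign it. A tiny sketch, with made-up probabilities:

```python
import math

# Hypothetical probabilities a model assigns to the *correct* next word at
# each position of some text (numbers invented purely for illustration).
p_correct = [0.9, 0.5, 0.25, 0.8]

code_lengths = [-math.log2(p) for p in p_correct]  # -log2 P(w_t | context) = bits for w_t
print(code_lengths)                                # [0.15..., 1.0, 2.0, 0.32...]
print(sum(code_lengths) / len(code_lengths))       # average bits per word = the loss, in base 2
```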
How would this really work in practice? Let’s say we used Huffman coding.
To encode: We have a body of text $w_1, \cdots, w_T$ we want to encode. We’ve seen $t-1$ words so far:
- Predict the distribution over words at timestep $t$ using the language model: $$P_\theta(w_t| w_{t-1}, w_{t-2}, \cdots, w_1)$$
- Construct the Huffman Tree using the distribution.
- Encode the actual $w_t$ using the given Huffman Tree.
- Feed $w_t$ into the language model and go to step 1.
To decode: We have a long binary string we want to decode. We’ve seen $t-1$ words so far:
- Predict the distribution over words at timestep $t$ using the language model: $$P_\theta(w_t| w_{t-1}, w_{t-2}, \cdots, w_1)$$
- Construct the Huffman Tree using the distribution.
- Walk the Huffman tree using the next available bits on the incoming stream.
- Once you hit a leaf, decode the word $w_t$.
- Feed $w_t$ into the language model and go to step 1.
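Here’s a toy sketch of this loop, reusing `build_huffman_tree` and `codebook` from the Huffman sketch earlier. The `toy_lm` below is a stand-in I made up, not a real language model; any function that maps a context to a next-word distribution would slot into its place.

```python
def toy_lm(context):
    """A made-up 'language model': a next-word distribution given the context."""
    if context and context[-1] == "B":            # pretend B is usually followed by D
        return {"A": 0.05, "B": 0.01, "C": 0.04, "D": 0.90}
    return {"A": 0.97, "B": 0.01, "C": 0.01, "D": 0.01}

def lm_encode(words, lm):
    bits, context = "", []
    for w in words:
        tree = build_huffman_tree(lm(context))    # 1. predict, 2. build the tree
        bits += codebook(tree)[w]                 # 3. encode the actual word
        context.append(w)                         # 4. feed it back and repeat
    return bits

def lm_decode(bits, n_words, lm):
    words, context = [], []
    for _ in range(n_words):
        tree = build_huffman_tree(lm(context))    # same prediction as the sender
        node = tree
        while isinstance(node, tuple):            # walk the tree with incoming bits
            node = node[0] if bits[0] == "0" else node[1]
            bits = bits[1:]
        words.append(node)
        context.append(node)                      # feed the decoded word back
    return words

text = list("AAAABDAAAA")
bits = lm_encode(text, toy_lm)
assert lm_decode(bits, len(text), toy_lm) == text
print(len(bits))                                  # 12 bits for these 10 words
```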
The better you can predict what words are coming next, the better you can encode any given body of text into fewer bits.
And for the final piece of evidence: Fabrice Bellard, the GOAT solo engineer behind ffmpeg and qemu (and a whole host of other interesting projects), has used neural networks for text compression in the Large Text Compression Benchmark (LTCB). As of this writing, his entry, NNCP, holds the top position on the leaderboard. It’s rather undeniable: optimising next-word prediction (in terms of log-likelihood) is compression.
Compression as Understanding
On the flip side, people have also been using Ted Chiang’s piece as an argument for why LLMs will never be able to reason: After all, it is only compression.
Marcus Hutter, an AI researcher, runs a data compression challenge known as the Hutter Prize, held explicitly because he believes that “text compression and AI are equivalent problems”. The LTCB compression challenge is actually a less-constrained version of the Hutter Prize. Jurgen Schmidhuber also (yep.) discusses ideas connecting information theory and “curiosity”. Gregory Chaitin, one of the pioneers of Algorithmic Information Theory, also views comprehension as a kind of data compression.
It sounds like a bit of quackery at first, but: What does it mean to understand?
Chiang does mention,
To grasp the proposed relationship between compression and understanding, imagine that you have a text file containing a million examples of addition, subtraction, multiplication, and division. Although any compression algorithm could reduce the size of this file, the way to achieve the greatest compression ratio would probably be to derive the principles of arithmetic and then write the code for a calculator program. Using a calculator, you could perfectly reconstruct not just the million examples in the file but any other example of arithmetic that you might encounter in the future. The same logic applies to the problem of compressing a slice of Wikipedia. If a compression program knows that force equals mass times acceleration, it can discard a lot of words when compressing the pages about physics because it will be able to reconstruct them. Likewise, the more the program knows about supply and demand, the more words it can discard when compressing the pages about economics, and so forth.
Let’s run with this. If the text file contained only arithmetic statements of the form `A <operator> B = C`, an efficient way to transmit the file would be to send only `A <operator> B`, with whatever (Huffman) coding method we have. On the receiver’s end, we’d simply evaluate the expression to obtain `C`.
Or we could view it this way: $$P(w_t = C~|~A ~\lt\text{operator}\gt~ B~) = 1$$ Which means the resulting bit-length is 0 — the result is entirely predictable, no need to even code for it. What does this mean for the training objective? Getting the arithmetic right means lowering the loss.
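As a toy illustration of this (my own construction): the “compressor” below just drops the answer, and the “decompressor” recomputes it, which only works because the answer is entirely determined by the expression.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def compress(lines):
    """Keep only 'A <operator> B'; the answer C is fully predictable."""
    return [line.split("=")[0].strip() for line in lines]

def decompress(exprs):
    """Recompute C exactly on the receiver's end."""
    out = []
    for expr in exprs:
        a, op, b = expr.split()
        out.append(f"{expr} = {OPS[op](int(a), int(b))}")
    return out

lines = ["12 + 7 = 19", "6 * 9 = 54", "100 - 58 = 42"]
assert decompress(compress(lines)) == lines
```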
Sending programs instead of codebooks
If the data is sufficiently structured, instead of a lookup table to encode/decode each word, I could send you a program that, when executed, generates the content of the text file on the receiver’s end.
It helps here to start realising that code is data, and that data can also be code. Consider the case of an infinite sequence of digits, like those of $\pi$. We could transmit a program that generates these digits. The program would be finite, but the data it generates on execution is theoretically infinite. Algorithmic Information Theory is concerned with these ideas: the capabilities and limits of such programs, and how to measure their complexity.
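As a concrete example, here is Gibbons’ unbounded spigot algorithm for $\pi$ in Python: a finite program whose output, given unlimited time, is an unbounded stream of digits.

```python
from itertools import islice

def pi_digits():
    """Gibbons' unbounded spigot algorithm: yields the decimal digits of pi forever."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n                                   # the next digit is settled
            q, r, n = 10 * q, 10 * (r - n * t), 10 * (3 * q + r) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

print(list(islice(pi_digits(), 10)))                  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```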
The notion of “understanding” is hard to pin down.
Should a program that can compress `A <operator> B = C` in the way we did above be considered to have “understood” arithmetic?
Yes, a calculator can perform the same feat. But examine its inner workings, and we find that it has hand-coded algorithms to perform the very computation needed to arrive at the answers, algorithms which relied on the understanding of the engineers who wrote them.
We do not have the same level of understanding of the inner-workings of a neural net (yet.), but what if we did?
Consider a program that was able to compress the entire Wikipedia series on Microeconomics, but at a far better rate than NNCP. And when we peek inside, we find that the only concepts coded for were ‘supply’ and ‘demand’ (whatever that might look like).
Would this not pass a practical bar for understanding?
Recap, Bits, and Bobs
There’s a sort of middle ground to be found here between “Compressor gonna compress / It’s only predicting the next word” and “AGI’s coming, wait for it. / Paper clip factories!”. I hope I’ve managed to connect the ideas of maximum likelihood to information theory to compression, and their links (tenuous or not) to notions of “understanding”, allowing you to better make sense of some of the strong opinions on either side of this. Hopefully you’ll find some of the links and references above useful.
In any case, for the sake of simplicity, I’ve omitted some details:
- In this discussion, I’ve left out the fact that the size of the codebook is often included in the size of the compressed dataset. Without accounting for this, we’re implicitly assuming that both sender and receiver have the algorithm and codebook necessary to perform the encoding or decoding, which means that any receiver will need to have, and be able to run, an LLM if you actually want to use something like this for text compression.
- While maximum likelihood can serve as a good objective for achieving “understanding”, there are other considerations. Some examples:
- Does the data have enough examples in order for the learning process to exploit the pattern? It may be easier to memorise a single case of a problem than to figure out the algorithm to generalise for future instances of said problem.
- Does the model have sufficient expressive power to compute the ‘right’ answer?
@misc{tan2023-06-05,
title = {Yes it's just doing compression. No it's not the diss you think it is.},
author = {Tan, Shawn},
howpublished = {\url{https://blog.wtf.sg/posts/2023-06-05-yes-its-just-doing-compression.-no-its-not-the-diss-you-think-it-is./}},
year = {2023}
}