NHacker Next
Six (and a half) intuitions for KL divergence (perfectlynormal.co.uk)
abetusk 2 hours ago [-]
Here's my explanation:

Let's say you provide an internet connection to a business. The business trusts you, so its traffic is only compressed over the wire, not encrypted, and you know the compression scheme it uses to send its bits. You charge the business a premium for the line you manage, but you also lease that line yourself, so it's in your interest to re-compress what they give you as well as possible to make a profit.

Say the business's compression scheme is imperfect. They have a Huffman coding built from their (imperfect) model of the tokens they send, call it q(x) (that is, they think token x shows up with probability q(x)). You've determined the true distribution, p(x) (token x shows up with actual probability p(x)).

The business has tokens that show up with probability p(x), but they encode each token x with -lg(q(x)) bits, giving an average token bit size of:

    -\sum _ x p(x) lg(q(x))
If you then use an optimal Huffman encoding, you will send tokens with average bit length of:

    -\sum _ x p(x) lg(p(x))
How many bits, on average, do you save? Just the difference:

    -\sum _ x p(x) lg(q(x)) + \sum _ x p(x) lg(p(x)) = \sum _ x p(x) lg(p(x)/q(x))
Which is the Kullback-Leibler divergence.
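The saving can be checked numerically. A minimal sketch with made-up toy distributions (the particular p and q below are just for illustration):

```python
import math

# Toy token distributions (hypothetical): p is the true distribution,
# q is the sender's imperfect model.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

# Average bits per token when codeword lengths are -lg(q(x)) (cross-entropy).
bits_with_q = -sum(p[x] * math.log2(q[x]) for x in p)

# Average bits per token with the optimal code, lengths -lg(p(x)) (entropy).
bits_with_p = -sum(p[x] * math.log2(p[x]) for x in p)

# The saving is exactly the KL divergence D(p || q).
kl = sum(p[x] * math.log2(p[x] / q[x]) for x in p)

assert abs((bits_with_q - bits_with_p) - kl) < 1e-12
```

(Real Huffman codes are restricted to integer codeword lengths, so -lg(q(x)) is the idealized arithmetic-coding limit, but the identity holds for the averages above.)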

To me, this is a much more intuitive explanation. I made a blog post about it [0], if anyone cares.

[0] https://mechaelephant.com/dev/Kullback-Leibler-Divergence.ht...

dist-epoch 35 minutes ago [-]
To rephrase what you wrote in plain English: you are Amazon, a client pays by the byte to store .zip files in an S3 bucket, you re-compress and store the data as .7z files, and the KL divergence corresponds to zip_file_size - 7z_file_size, your "win".
notrealyme123 31 minutes ago [-]
Wow, this is really great. I just realised last week that MLE can be motivated by the KL divergence between the true distribution and the approximation. My mind was blown at how obvious that connection was.
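Spelling that connection out as a quick sketch: for a model family q_\theta and true distribution p,

    KL(p || q_\theta) = \sum _ x p(x) lg(p(x)) - \sum _ x p(x) lg(q_\theta(x))

The first term doesn't depend on \theta, so minimizing the KL divergence over \theta is the same as maximizing \sum _ x p(x) lg(q_\theta(x)), which the empirical average (1/N) \sum _ i lg(q_\theta(x_i)) over samples x_i ~ p estimates, i.e. the maximum-likelihood objective.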
ttul 4 hours ago [-]
This is great. I had only ever seen the expected surprise explanation. The others help to fill in the gaps.
dist-epoch 32 minutes ago [-]
For those wondering where this is practically relevant: this is the basic metric used to compare quantizations of LLM models, e.g. what is the KL divergence of a 4-bit quantization versus an 8-bit one versus the original 16-bit one.
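A minimal sketch of that comparison for a single next-token distribution. The logits here are made-up numbers standing in for a full-precision model and a hypothetical quantized copy of it:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_bits(p, q):
    # D(p || q) in bits: expected extra bits from modeling p with q.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token logits (illustration only, not real model output).
logits_fp16 = [2.0, 1.0, 0.1, -1.0]
logits_q4 = [1.9, 1.1, 0.0, -0.8]

p = softmax(logits_fp16)
q = softmax(logits_q4)

d = kl_bits(p, q)  # > 0; the larger it is, the more the quantization hurt
```

In practice this is averaged over many token positions on a test corpus, so the score reflects how closely the quantized model's output distribution tracks the original's.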
RickHull 3 hours ago [-]
Is there a gentler intro to this topic?
jey 3 hours ago [-]
Try the textbook Elements of Information Theory by Cover and Thomas (2006)
darioterror 3 hours ago [-]
[dead]