Updated post with more efficient and simple encoding!
(As usual for these mathy posts, inspiration was an XKCD forums thread. Join us!)
Let's say you wanted to store a list of every single prime. Cool! Problem is, primes start to get fairly big after a while. You don't want to use any particular fixed-width binary encoding (it'll waste a lot of space on small ones, and be unable to store large ones), and even with a variable-width encoding you're wasting a lot of space. Primes tend to be really close to each other, so by listing each one explicitly you're repeating most of the data!
There's a lot of ways to throw compression at this problem, but an easy one to talk about is to instead store a list of the gaps between primes. For example, there's a 2-gap after 3, since the next prime is 5; there's a 4-gap after 7, since the next prime is 11. Primes actually cluster pretty closely together - there's an infinite number of "twin primes" with a 2-gap, and the average gap at any point is equal to the density of the primes, which is the natural log (in other words, it grows very slowly). So the gaps are much smaller than the primes, especially as numbers get bigger - for example, per this MathWorld article, the maximal gap between primes 18 digits or less is only 1220 (and the average gap is around 40 at that point).
So now we have to talk about encoding those gaps, in some reasonably efficient way. I've come up with a scheme that I think works pretty well, derived from UTF-8:
- 2 and 3 are assumed. This will start listing the gaps from 3 onwards.
- Because all the gaps are even, we'll actually be encoding the halfGap.
- Find the halfGap's binary expansion.
- Determine how many output bytes you'll need by dividing the number of bits halfGap uses by 7. (If the answer is > 8, this is an exceptional case. Skip to step 7 for these cases.)
- In the first output byte, set the high (bytes - 1) bits to 1, then the next bit to 0. In the remaining bits of this byte, and the remaining output bytes, write the halfGap directly.
- Occasionally, whenever you feel is appropriate, encode an "i-frame": write out a byte of all 1s, then encode the next prime into however many subsequent bytes it takes. Use only the lower 7 bits of each byte, leaving the high bit 0. When finished, write out another all-1s byte.
- For exceptional cases (gap would need more than 8 bytes to encode), write out the prime as an i-frame instead.
Judicious use of "i-frames" (video encoding terminology!) means you don't have to process the entire preceding list to find a given prime, just back to the most recent i-frame. It also gives you a resync point, in case you lose track of where you are in the stream for whatever reason. (UTF-8 resyncs with every character, but it pays for that with lower efficiency.)
This encodes every gap with just 1/7 overhead. The average gap fits into a single byte until around e^2^8, or approximately 10^111; into two bytes until around e^2^15; three bytes until e^2^22; etc. At about e^2^57 (approximately 10^10^17) it finally reaches the point where the average gap is "exceptional" and requires an i-frame, and you have to pay a 2-byte tax for each one. Luckily, at that point each prime takes up about 10 petabytes of storage just to write out, so the 2 extra bytes aren't very noticeable. ^_^