**Updated post with more efficient and simple encoding!**

(As usual for these mathy posts, inspiration was an XKCD forums thread. Join us!)

Let's say you wanted to store a list of every single prime. Cool! Problem is, primes start to get fairly big after a while. You don't want to use any particular fixed-width binary encoding (it'll waste a lot of space on small ones, and be unable to store large ones), and even with a variable-width encoding you're wasting a lot of space. Primes tend to be really close to each other, so by listing each one explicitly you're repeating most of the data!

There's a lot of ways to throw compression at this problem, but an easy one to talk about is to instead store a list of the *gaps between primes*. For example, there's a 2-gap after 3, since the next prime is 5; there's a 4-gap after 7, since the next prime is 11. Primes actually cluster pretty closely together - there's an infinite number of "twin primes" with a 2-gap, and the *average* gap at any point is equal to the density of the primes, which is the natural log (in other words, it grows *very* slowly). So the gaps are *much* smaller than the primes, especially as numbers get bigger - for example, per this MathWorld article, the *maximal* gap between primes 18 digits or less is only **1220** (and the average gap is around 40 at that point).

So now we have to talk about encoding those gaps, in some reasonably efficient way. I've come up with a scheme that I think works pretty well, derived from UTF-8:

- 2 and 3 are assumed. This will start listing the gaps from 3 onwards.
- Because all the gaps are even, we'll actually be encoding the halfGap.
- Find the halfGap's binary expansion.
- Determine how many output bytes you'll need by dividing the number of bits halfGap uses by 7. (If the answer is > 8, this is an exceptional case. Skip to step 7 for these cases.)
- In the first output byte, set the high (bytes - 1) bits to 1, then the next bit to 0. In the remaining bits of this byte, and the remaining output bytes, write the halfGap directly.
- Occasionally, whenever you feel is appropriate, encode an "i-frame": write out a byte of all 1s, then encode the next prime into however many subsequent bytes it takes. Use only the lower 7 bits of each byte, leaving the high bit 0. When finished, write out another all-1s byte.
- For exceptional cases (gap would need more than 8 bytes to encode), write out the prime as an i-frame instead.

Judicious use of "i-frames" (video encoding terminology!) means you don't have to process the *entire* preceding list to find a given prime, just back to the most recent i-frame. It also gives you a resync point, in case you lose track of where you are in the stream for whatever reason. (UTF-8 resyncs with every character, but it pays for that with lower efficiency.)

This encodes every gap with just 1/7 overhead. The *average* gap fits into a single byte until around e^2^8, or approximately 10^111; into two bytes until around e^2^15; three bytes until e^2^22; etc. At about e^2^57 (approximately 10^10^17) it finally reaches the point where the average gap is "exceptional" and requires an i-frame, and you have to pay a 2-byte tax for each one. Luckily, at that point each prime takes up about 10 petabytes of storage just to write out, so the 2 extra bytes aren't very noticeable. ^_^

Martin M.:If 2 and 3 are assumed you could transcribe the primes into the form 2n + 3. Then you can distill the next 8 primes (1, 2, 4, 5, 7, 8, 10 13...) into their gaps (1, 2, 1, 2, 1, 2, 3) and save a term.

## Reply?

Tab Atkins Jr.:From what I can tell that's... exactly what I already described?

## Reply?