More Efficient Encoding of Primes

Earlier I talked about efficiently encoding lists of primes via encoding the gaps between them, in a UTF-8 inspired encoding. That's great, but there's a much simpler, even more efficient way to encode a list of all primes - use bitfields!

In its simplest, least efficient form, you just associate each successive bit with each successive integer - the first bit is for 1, the second is for 2, etc. The bit is 1 if the associated number is prime, and 0 if it's composite.

This trivial, stupid encoding has an efficiency of 8 numbers per byte, which means it's equivalent to the previous gap-encoding as long as the average gap is less than 8 (which covers quite a lot of numbers!). But we can, of course, make it much better.

The first and most obvious improvement is to throw out all the even numbers. Past 2, every prime is odd, so all the even bits are guaranteed to be 0 in the naive scheme. That's wasted space; we can just assume they're composite and spend our actual data budget on the odd numbers. So now the Nth bit corresponds to the (2N + 1)th number. This *doubles our efficiency, encoding 16 numbers per byte, and roughly doubling the number of digits for which this is more efficient than the gap-encoding (covering the primes up to e^16, or ~9 million).

At this point we should pause and recognize the additional benefits we get from this encoding. Not only is it more efficient for quite a lot of primes (the primes up to 9M fill about 500kB of memory, and allow efficient prime testing/factorizing of numbers up to nearly 100 trillion), but it also gives us constant-time random access to the primality information. If you want to find out whether a random N is prime or not, you just look up the (N/16)th byte, and then check the (N/2 + 1)%8th bit. For some algorithms this is really important!

Okay, back to efficiency. Excluding all the even numbers is great, but it's of course also true that, once you get past 3, all the numbers divisible by 3 are composite. Combining that with the previous "no evens" insight, we can chop down the relevant set to just two out of every six numbers - only the values that are equal to 1mod6 or 5mod6. With this, we can encode 50% more numbers in a given number of bits! To be specific, it gives us 24 numbers per byte, exceeding the efficiency of gap-encoding up to e^24, or about 20 billion.

And we can go further! We can additionally exclude the numbers divisible by 5, leaving us with only 8 possibilities out of every 30 numbers (1, 7, 11, 13, 17, 19, 23, 29; mod 30), for an extra 25% density, winning the efficiency race up to ~10 trillion. This has some additional benefits, too - each "set" of possibles is 8 large, same as the number of bits in a byte, which simplifies the math considerably - you look in the N/30th byte, then grab the corresponding bit from the group. This limit also exceeds what we can reasonably use - storing the numbers up to 10 trillion would take about a third of a terabyte, greatly exceeding the RAM of desktop machines. So even in extreme cases, you'll get to store/use more primes with this method than the gap-encoding, and you get the random-access benefit to boot.

We could go further and exclude the multiples of 7, but the benefits are fairly small (encode about 16% more in a given space) and the math becomes a lot more annoying (49 possible primes spread across 210 numbers) and loses the nice symmetries of the previous values, so it's best to probably just stop there.

One can write a(n) = 1 if (6n - 3 - (-1)^n)/4 is prime, otherwise a(n) = 0, getting rid of the evens and the multiples of 3, and still getting the set into 1's and 0's. (OEIS A296028)

Reply?

Yes, that's what I wrote in the paragraph starting with "Okay, back to efficiency.". And then in the next paragraph I point out that excluding the multiples of 5 as well gives you exactly 8 possible primes for every 30 numbers, letting you fit a set of 30 into a byte for really easy decoding.

Thanks for the article! Minor correction: If we exclude multiples of 2, 3, 5 and 7, we need 48 bits (not 49) to encode a range of 210 numbers. See e.g. https://stackoverflow.com/a/2617794/131160 Since 48 is a multiple of 8, we do get efficient random-access etc. (But in many situations, the added complexity probably still isn't worth it.)