Tab Completion

I'm Tab Atkins Jr, and I wear many hats. I work for Google on the Chrome browser as a Web Standards Hacker. I'm also a member of the CSS Working Group, and am either a member or contributor to several other working groups in the W3C. You can contact me here.
Listing of All Posts

More Efficient Encoding of Primes

Last updated:

Earlier I talked about efficiently encoding lists of primes via encoding the gaps between them, in a UTF-8 inspired encoding. That's great, but there's a much simpler, even more efficient way to encode a list of all primes - use bitfields!

In its simplest, least efficient form, you just associate each successive bit with each successive integer - the first bit is for 1, the second is for 2, etc. The bit is 1 if the associated number is prime, and 0 if it's composite.

This trivial, stupid encoding has an efficiency of 8 numbers per byte, which means it's equivalent to the previous gap-encoding as long as the average gap is less than 8 (which covers quite a lot of numbers!). But we can, of course, make it much better.

The first and most obvious improvement is to throw out all the even numbers. Past 2, every prime is odd, so all the even bits are guaranteed to be 0 in the naive scheme. That's wasted space; we can just assume they're composite and spend our actual data budget on the odd numbers. So now the Nth bit corresponds to the (2N + 1)th number. This *doubles our efficiency, encoding 16 numbers per byte, and roughly doubling the number of digits for which this is more efficient than the gap-encoding (covering the primes up to e^16, or ~9 million).

At this point we should pause and recognize the additional benefits we get from this encoding. Not only is it more efficient for quite a lot of primes (the primes up to 9M fill about 500kB of memory, and allow efficient prime testing/factorizing of numbers up to nearly 100 trillion), but it also gives us constant-time random access to the primality information. If you want to find out whether a random N is prime or not, you just look up the (N/16)th byte, and then check the (N/2 + 1)%8th bit. For some algorithms this is really important!

Okay, back to efficiency. Excluding all the even numbers is great, but it's of course also true that, once you get past 3, all the numbers divisible by 3 are composite. Combining that with the previous "no evens" insight, we can chop down the relevant set to just two out of every six numbers - only the values that are equal to 1mod6 or 5mod6. With this, we can encode 50% more numbers in a given number of bits! To be specific, it gives us 24 numbers per byte, exceeding the efficiency of gap-encoding up to e^24, or about 20 billion.

And we can go further! We can additionally exclude the numbers divisible by 5*, leaving us with only 8 possibilities out of every 30 numbers (1, 7, 11, 13, 17, 19, 23, 29; mod 30), for an extra 25% density, winning the efficiency race up to ~10 trillion. This has some additional benefits, too - each "set" of possibles is 8 large, same as the number of bits in a byte, which simplifies the math considerably - you look in the N/30th byte, then grab the corresponding bit from the group. This limit also exceeds what we can reasonably use - storing the numbers up to 10 trillion would take about a third of a terabyte, greatly exceeding the RAM of desktop machines. So even in extreme cases, you'll get to store/use *more primes with this method than the gap-encoding, and you get the random-access benefit to boot.

We could go further and exclude the multiples of 7, but the benefits are fairly small (encode about 16% more in a given space) and the math becomes a lot more annoying (49 possible primes spread across 210 numbers) and loses the nice symmetries of the previous values, so it's best to probably just stop there.

If I Had A Time Machine

Last updated:

An incomplete and growing list of mistakes that I'd fix if I had a time machine:

Languages

Make all languages be written vertically, and stack their lines left-to-right. Everyone being on one writing direction is A+, and vertical writing gives more time for the ink to dry before your hand has to touch it when you're left-handed.

Figure out a universal script that everyone can use. Doesn't have to be too complicated; tons of languages with different sounds use the Latin alphabet fine. I'm kinda in love with the Korean script, tho - it's alphabetic, but is written like it's ideographic, with the letters arranged into syllable-blocks in a standard way. Something like that would be good.

Putting everyone on a single language would be good too, but probably really hard to maintain. Just making sure everyone's writing in a mutually intelligible way is a good start.

Numbers

Switch everyone to base 6 counting. It's got some great reasons behind it!

  1. Each of your hands is a single base-6 digit (can represent the values 0-5), so you could count to 35 instead of just 10.
  2. It's got good divisibility - /2 and /3 (and /6 of course) are just "check the last digit", /4 and /9 are "check the last two digits", /5 is "sum the digits and check if they're /5" (same as /9 in base 10), /7 is "alternately add/subtract the digits and check if they're /7" (same as /11 in base 10).
  3. The multiplication table is really trivial, almost insultingly so. Similar to the divisibility, multiplying by 2 and 3 are now super easy (like multiplying by 2 and 5 are in base 10).
* | 1| 2| 3| 4| 5| 10|
----------------------
 1| 1| 2| 3| 4| 5| 10|
 2| 2| 4|10|12|14| 20|
 3| 3|10|13|20|23| 30|
 4| 4|12|20|24|32| 40|
 5| 5|14|23|32|41| 50|
10|10|20|30|40|50|100|

Numbers in base 6 are about 50% longer than in base 10, which isn't a huge loss. We can compensate for it by making sure the names for the digits are single-syllable, as that boosts your ability to remember strings of digits. Grouping digits into sets of 3 would also feel more "natural" - we might even be able to go up to sets of 6 instead.

I guess when talking about computers, we'd generally use octal (like we use hexadecimal in base 10) - 012345TE. Octal is a lot easier to learn than hex, too.

Dates

The Gregorian calendar is stupid. We can't control the length of a day or year, so we can't do anything perfect, but we can at least improve the situation by switching to a consistent 30-day month, with 6 5-day weeks. 12 of these months leaves us with a single 5-day week left over at the end, which should be treated as intercalary and placed at the end. Leap days also go at the end. This makes days consistent across months and years (the 2nd of every month is always the same day of the week, etc), and gives us much easier math in general, in both base 10 and base 6. In base 6, we have 20 months, each containing 10 5-day weeks, or 50 days total. 👌

Time

Keeping time is... fine. It's not great; our time system is clearly based on the dozenal roots in some of our old numbering systems (a day is two dozen hours, an hour is 5 dozen minutes, a minute is 5 dozen seconds), but we can still do better.

We're lucky in that, unlike calendars stuck with the unchangeable physical constants of "day" and "year" (and their terrible ratios), time-keeping is only beholden to the day - the rest of the units are human-made, and can be adjusted as we see fit. So we can switch to proper metric time! In base 10, I prefer dividing the day into 100 "hours", each equivalent to about about 15 of our old minutes, and then further subdividing into 100 "seconds", each very close to our old seconds. 15-minute chunks are very tractable, I feel - we often informally divide the hour into quarter-hours already, because it's a convenient length of time, neither too short to get anything done nor too long to keep track of.

In base 6 I'd do similar, but you can fit in three divisions - 100 hours (each about 40 of our old minutes), split into 100 minutes (each about equal to our old minute), split into 100 seconds (each about 2 of our old seconds).

Time Zones

Time zones, on the other hand, can get fucked. We don't have "season zones" for dates - when it's March 1st, it's March 1st everywhere in the world, regardless of whether that means it's snowy or cool or starting to warm up. Instead we have much fuzzier "seasons" to correspond to the weather, and those do fluctuate based on where you are.

Time should work the same. We should have a single consistent global time, and then pair it with a set of fuzzier "moments" that describe roughly what the light-level/meal time is. Then it's easy to describe when a meeting will happen (at hour 34, or something), but also tell people it's too early (it's still dawn over here at that time), same as we can tell people when we'll vacation (January) and what it'll be like (middle of summer, because we're doing a beach vacation in New Zealand).

I Understand Platonic Solids A Little Better Now

Last updated:

So I was watching this experimental gameplay video for Miegakure, a 4d indie video game. It's like Fez, which was a 2d platformer in a 3d world that involved "rotating" things in 3d to solve puzzles, except lifted one level, so it's a 3d game in a 4d world.

Anyway, it mentioned the 24-cell, a weird 4d shape that I've always wanted to understand better, so I started on a Wikipedia binge. This immediately led me back to the platonic solids, when I realized something.

It always struck me as weird that there are an infinite number of regular 2d polygons, one for every number of sides - equilateral triangle, square, regular pentagon, regular hexagon, etc - but there were only five regular 3d polyhedrons - with dice names, the d4, d6, d8, d12, and d20. Going from infinite to finite is always interesting, and I didn't get it before, but I got to thinking about the patterns in the shapes, and suddenly it clicked - there just isn't enough space for any more regular polyhedrons!

See, the dice shapes can be split into two lines. The d4, d6, and d12 form the "increasing polygon side-count" line, while the d4, d8, and d20 form the "increasing number of triangles" line. (Yes, the d4 is the base of both.)

The "increasing polygon side-count" line goes from triangle (3) sides, to square (4) sides, to pentagon (5) sides. As you increase the number of sides, the angles get larger, so you have lay the shapes out "flatter" to get them to fit together snugly, reducing the curvature and increasing the size of the die. But what happens when you fit hexagons together? They tile the plane. Put hexagons together snugly, and they perfectly fill up a flat 2d plane - 120° × 3 = 360°. You can't curve them at all, there's not enough space. So there's no way to make a polyhedron with a regular hexagon - it can't curve around to form a ball!

The same argument happens with "increasing numbers of triangles". It's easy to fit together three triangles - you get the d4. Spread them out a little and you make space for a fourth, getting the d8. A little more flattening makes room for a fifth, making the d20. You can still squeeze in one more triangle, putting six together snugly, but only by flattening them into a plane again. It's the same problem as the hexagon tiling - 60° × 6 = 360° again. The triangles don't curve around at all, so you once again can't form a ball!

7 (whether 7-sided shapes, or 7 triangles) is of course right out. You can't fit those together no matter how you try.

I presume this is why there are only three regular solids in 5+ dimension: the simplex (analogue of the d4), the measure complex (analogue of the d6), and the cross complex (analogue of the d8). In higher dimensions the volumes take up more "room", so you just can't fit as many together, and aren't able to go up to the "5" shapes (analogues of the five-sided or 5-triangled shapes).

This still doesn't answer what the fuck is up with four dimensions, tho. 4d has six regular solids - 5 analogues of the 3d solids, and 1 special. There's the three made out of increasing-size dice: the 5-cell, made out of d4s; the 8-cell, made out of d6s; and the 120-cell, made out of d12s. There's the three made out of tetrahedrons: the 5-cell, where 4 meet at each vertex; the 16-cell, where 5 meet at each vertex; and the 600-cell, where 6 meet at each vertex.

Then there's the 24-cell, made out of d8s. What the fuck. Geometry is weird.

My Password Strategy, and So Can You!

Last updated:

Everyone knows they're supposed to use strong, random passwords, and use different passwords on every single site. For most of us, tho, that's impossible - we just can't remember that many. Some of us give up and just remember one strong password (there goes your security when something is breached) or use a password locker like 1Password (hope you're backing up appropriately, and have access to the vault from every single place you'll need a password).

None of these are good strategies. A few years ago I thought about this a bit, and put together a tool to help me handle all this properly and safely. I've evangelized it a bit informally, but more people really need to know about this. So, here goes:

  1. Memorize one good, strong password. Make it long and random, as secure as you can possibly make it and still be sure to remember it. This is the only random password you'll be required to remember ever again.
  2. Whenever you need a password for something, visit https://tabatkins.github.io/password/. Put your master password from Step 1 into the master password slot, and enter a memorable "site tag" for that slot. This does not need to be secure in any way, so focus on making it as easy to remember as possible, like the domain name of the site.
  3. Hit the "Long" button and copy the generated password out. DONE.
  4. If you're using a terrible website like a bank that applies password limits, the "Short" button usually works - it gives you a 12-char password. If you need more control, the "More Options" section lets you customize the password thoroughly, which should satisfy whatever idiotic demands they call for. Try to avoid using this if possible, just because it means more memorization.
  5. Store the site tag (and the custom options, if necessary) somewhere accessible. I just use a Google Doc, which I can access from anywhere on multiple devices. This is not secure information, so don't worry about it being exposed - as long as your Master Password is good, you're safe.

And that's it! This method has several benefits over a traditional password locker:

  • Same amount of memorization - one master password.
  • No need to "back up" anything - you can probably remember the site tag anyway, and if you do record them somewhere it doesn't have to be securely stored.
  • Accessible from anywhere - as long as you can touch the internet, you can reach this - no need for a browser extension or a separate program that won't be installed on public computers.
  • Works offline - the site is totally self-contained in a single file, so you can save it to your device and use it locally without any internet connection at all.
  • Totally independent - nothing can stop working because some company was acquired or went out of business. If you save a local version of the file or host it on your own site, even me removing my site won't stop you.
  • No chance of losing passwords - no chance that your password file can be "corrupted" and impossible to decrypt, because there is no password file - it's just a hash function run on your two inputs.

If you're paranoid, feel free to audit the code on your own - it's unobfuscated HTML and JS that makes zero network calls and saves no information. The entire operation is done locally, my version of the file is served over HTTPS, and you can run a local version if you're really paranoid about code changes.

I've been using this for years, and it changed my life around passwords.

Compactly Encoding All Primes via Gaps

Last updated:

Updated post with more efficient and simple encoding!

(As usual for these mathy posts, inspiration was an XKCD forums thread. Join us!)

Let's say you wanted to store a list of every single prime. Cool! Problem is, primes start to get fairly big after a while. You don't want to use any particular fixed-width binary encoding (it'll waste a lot of space on small ones, and be unable to store large ones), and even with a variable-width encoding you're wasting a lot of space. Primes tend to be really close to each other, so by listing each one explicitly you're repeating most of the data!

There's a lot of ways to throw compression at this problem, but an easy one to talk about is to instead store a list of the gaps between primes. For example, there's a 2-gap after 3, since the next prime is 5; there's a 4-gap after 7, since the next prime is 11. Primes actually cluster pretty closely together - there's an infinite number of "twin primes" with a 2-gap, and the average gap at any point is equal to the density of the primes, which is the natural log (in other words, it grows very slowly). So the gaps are much smaller than the primes, especially as numbers get bigger - for example, per this MathWorld article, the maximal gap between primes 18 digits or less is only 1220 (and the average gap is around 40 at that point).

So now we have to talk about encoding those gaps, in some reasonably efficient way. I've come up with a scheme that I think works pretty well, derived from UTF-8:

  1. 2 and 3 are assumed. This will start listing the gaps from 3 onwards.
  2. Because all the gaps are even, we'll actually be encoding the halfGap.
  3. Find the halfGap's binary expansion.
  4. Determine how many output bytes you'll need by dividing the number of bits halfGap uses by 7. (If the answer is > 8, this is an exceptional case. Skip to step 7 for these cases.)
  5. In the first output byte, set the high (bytes - 1) bits to 1, then the next bit to 0. In the remaining bits of this byte, and the remaining output bytes, write the halfGap directly.
  6. Occasionally, whenever you feel is appropriate, encode an "i-frame": write out a byte of all 1s, then encode the next prime into however many subsequent bytes it takes. Use only the lower 7 bits of each byte, leaving the high bit 0. When finished, write out another all-1s byte.
  7. For exceptional cases (gap would need more than 8 bytes to encode), write out the prime as an i-frame instead.

Judicious use of "i-frames" (video encoding terminology!) means you don't have to process the entire preceding list to find a given prime, just back to the most recent i-frame. It also gives you a resync point, in case you lose track of where you are in the stream for whatever reason. (UTF-8 resyncs with every character, but it pays for that with lower efficiency.)

This encodes every gap with just 1/7 overhead. The average gap fits into a single byte until around e^2^8, or approximately 10^111; into two bytes until around e^2^15; three bytes until e^2^22; etc. At about e^2^57 (approximately 10^10^17) it finally reaches the point where the average gap is "exceptional" and requires an i-frame, and you have to pay a 2-byte tax for each one. Luckily, at that point each prime takes up about 10 petabytes of storage just to write out, so the 2 extra bytes aren't very noticeable. ^_^