Creating ASCII-Sortable Strings from Numbers, Redux

Last updated: Wednesday, November 30 2016

(I wrote about this a while ago. I'm revisiting it for fun, and to run thru it in a more reasonable order.)

Sometimes you have a lot of files with numerical names, and you want to name them such that a stupid ASCIIbetical sort (that is, something that sorts letter-by-letter according to ASCII ordering) will sort them in order. Just numbering them won't work right - "10.txt" will come between "1.txt" and "2.txt". Zero-padding works for most practical purposes, but it requires you to guess how many entries there will be ahead of time, and come on now.

What if you had no idea how many files there were going to be, but maybe it'll be a whole bunch? Then there are a few schemes that'll work.

Unary Numbers

First, let's talk about unary numbers. Normal numbers are "positional" - the value of a digit depends on its position in the number. Unary numbers are non-positional, so digits are worth the same no matter where they are. You already know of one common type - Roman numerals. An X is always worth 10 no matter where it shows up.

It turns out that, if they satisfy two criteria, numbers written in unary always sort correctly, no matter how long they are. These criteria are:

The number must be written in "greedy form", where you always use as many of the largest possible digit first, before trying any smaller digits. For example, in Roman numerals 16 must be written XVI, not VVVI or XIIIIII.
The digits must sort correctly themselves. Roman numerals do not do this - I, V, and X sort correctly, but C (100) goes before all of them.

Here's a simple system that does satisfy those two constraints: the letters A-Z, where A stands for 1, B for 2, etc until Z is for 26. Then a number like 100 is written ZZZV. I'll use this for the rest of the post.

The problem with unary numbers is that they grow linearly with the size of the number. If I double the number, I double the length of the number too. The length of positional numbers grow with the log of the value, which is much much smaller - doubling the value of a binary number only increases its length by 1.

So we can't quite have everything. Positional numbers grow slowly, but don't sort right. Unary numbers sort right, but grow quickly. In combination, tho!

Unary Prefix

Positional numbers do sort correctly as long as they have the same length. All 2-digit numbers (10-99) will correctly sort themselves ASCIIbetically, for example. This is why zero-padding works - it changes all the numbers to be the same length.

By taking advantage of this, we can get the best of both worlds. The first way to write numbers that'll sort correctly as they get infinitely large is to write the number normally, then prefix it with an unary number listing its length.

For this to work correctly, the unary number must use a different set of digits from the "normal" number, and those digits must sort after the "normal" digits. These are both true of the A-Z system.

So, the number 5 is written "A5". The number 99 is written "B99", while 100 is "C100". All 2-digit numbers will sort after the 1-digit numbers (because B > A) but before the 3-digit numbers (because B < C), and whenever 2-digit numbers are compared, they'll match on their B digit and then sort correctly on their remaining value.

This system adds a single character of overhead for all numbers below 10^27, or two for more numbers below 10^53, more than long enough for all practical purposes. But what if we want really big numbers? A 100-digit number will have 4 digits of overhead. A 1000-digit number will have 39. A million-digit number will have almost 40 thousand characters of overhead. Can we do better?

Of course.

MOAR POSITIONALS

So our problem is that the prefix grows linearly with the length of the number (the log of the value). This is fine for normal numbers, but if you get really big numbers, the prefix gets too big. Unary numbers just aren't good at representing large numbers, you need positionals for that. But how do we integrate more positional values without running into the sorting problem again?

Recall what the purpose of the prefix was - to correctly sort the strings so that they were grouped with same-length prefixes that sorted properly even tho they were positional. Assigning each set of suffixes to a different group can be done fine with any number, so let's use positional numbers for it instead, so 1 million (1000000, 7 digits long) is written as "71000000".

Now we're back in the positional hole, because a 10-digit number like 1 billion will start with a "10" prefix, and sort before 1 million. So let's return to unary, and use it to sort the prefixes into correctly-sorting same-length groups - if the prefix is 1 digit, further prefix it with an A, if the prefix is 2 digits, further prefix it with a B, etc. With this, 1 million is written "A71000000", and 1 billion is written "B101000000000", and everything sorts correctly again.

This forces us to pay a little bit extra prefix tax for small numbers - ten is written "A210", whereas before it was just "B10". But it quickly overcomes the previous scheme - previously, a 10^105 was "ZZZZA10000..." (5 chars of prefix) while now it's "C10510000..." (4 chars of prefix). A million-digit number previously had about 40 thousand chars of prefix, while now it has 8 ("G1000000"). That's great! But what if our numbers are really big - can we do even better?

Of course.

Stacks On Stacks

A million digit number (like 10^million) is handled great by the previous form, and is so stupendously large that there's no possible way any larger sorting scheme could be needed. But we're just playing around here - what if we wanted to encode a number like 10^10^million? Then the positional prefix is a million characters long, and the unary part is another 40 thousand.

The obvious solution here is to just repeat the previous step: have the second prefix be positional as well, and then have the final unary prefix be the length of the second prefix. So our 10^10^million number's prefix would be "A710000...", a million and two characters long.

But that's kinda inelegant. It's thematically similar to zero-padding - you have to guess ahead of time how many levels of prefix you need. If you guess too low, the excess gets soaked up into a big unary prefix (10^10^10^million would still have a 40k unary prefix). If you guess too high, you end up saddling "low" numbers with several levels of unnecessary prefix (ten is written as "A1210").

Instead, let's produce prefix levels dynamically. Start with the original prefix - the length of the number, written positionally. If this prefix is 2 or more chars long, add another prefix level indicating how long it is. Repeat as needed until you get to a single-char prefix. At the end we still need an unary prefix to make it all work, but we'll encode the number of levels with it. It's a little non-obvious, but you can show without too much difficulty that this correctly sorts all numbers.

So, a few of the numbers we were talking about previously:

ten: "A210" (2 prefix chars)
billion: "B2101000000000" (4 prefix chars)
10^million: "B71000000100000..." (9 prefix chars)
10^10^million: "C71000000100000...10000....." (1000009 prefix chars)

For any finite number of levels, the total prefix length is: log^1 + log^2 + ... + (log^N)*2, where N is the number of levels.

In our final scheme, the length is instead: log^1 + log^2 + ... + log^N + N, where N is the log* of the number. log* is the iterated log, the number of times you have to apply the log function before you hit <=1. log* stays ridiculously small for all numbers you can conceive of.

Can we go further? Yeah, of course. For unbelievably huge numbers, the number of levels might get large enough for the unary prefix to start mattering. In that case, you can extend it to use multiple positional levels to indicate the number of levels, with the unary prefix indicating the number of those super-levels. But extensions are very mechanical at this point, and only matter for numbers so large that they're difficult to even write down. Stopping at this level is good enough for anything ever in reality.

Purely additive Roman numerals of course sort by ASCII order if the letter for 50 is changed from L to Y, Z, a or b and all lower numbers use uppercase letter, all higher numbers use lowercase letters: I < V < X < {Y, Z, b} < c < d < m.

Reply?

One cannot use the convention to make a final, non-initial I into a J for better readability, because 2 IJ would then sort after 3 IIJetc., but changing every non-initial I works, I < IJ < IJJ < IJJJ < V … I and J or double-Jcan ligate to U from the left, too, I < U< UJ < UJJ or UU … Although UUcannot further fuse into W, that does work for VU. For months in dates or clock faces, we therefore can replace standard I, II, III, IIII(or subtractive IV), V, VI, VII, VIII, VIIII (IX), X, XI and XII by I Jan, U Feb, UJ Mar, UU Apr, V May, VJ Jun, W Jul, WJ Aug, WU Sep, X Oct, XJ Nov and XU Dec and have 2 “digits” max. Very intuitive.

Reply?

Of course, if you're willing to change Roman numerals, might as well change them to be more optimal from the get-go.

I purposely didn't put any thought into optimizing the unary prefix, instead opting for simplicity. You can't improve it by more than a linear factor, anyway.

Reply?