More About ASCII-Sortable Strings

In my previous post about ASCII-sortable strings, I presented a fairly simple method of generating a short prefix that would ensure that your numbers sorted correctly.

Astute readers may have noticed, though, that the method I presented is, frankly, ridiculous in practice. It's optimized for infinity, not for numbers we'll actually deal with. There are several much simpler and shorter methods we can use which have ranges that extend far past what we'll ever need. Let's look at a few.

A High-Base Unary Prefix

Zero-padding is basically using an unary prefix, but it counts down instead of up, so it has a maximum range. If we reverse it so that the prefix character sorts after digits and the prefix gets longer as the number of digits increase, we get a very simple method for ensuring sorting that works forever.

The problem with this, though, is that the overhead gets longer rather quickly as the numbers get bigger, and there are a lot more big numbers than small ones. Reducing the overhead by a single character on numbers with N digits can be balanced by an increase of nine characters on numbers with N-1 digits, on average.

We can reduce the pain of this by using a high-base unary. Rather than just repeating a bunch of "A" characters, we can first count up A to Z, then use ZA to ZZ, then ZZA to ZZZ, etc. This is still an unary system (more properly, an additive system, like Roman numerals), but it increases much more slowly. For numbers below 10^28 (that's ten thousand trillion trillion) we have only a single character of overhead! This is already far past the point at which Javascript starts losing precision with integers (that happens at 2^53, or approximately 10^16 (nine quadrillion)), so we're more than fine here.

Here's the simple version of the code, which doesn't try to handle numbers outside of Javascript's limits of integer precision. This is all you will ever need in nearly all cases - if you try to go past this, you'll need to use something other than JS's built in numbers. This is what I actually use to number my blog comment files.

function sortableStringFromNum(num) {
  if(num < 10) return ''+num;
  return String.fromCharCode(63 + (''+num).length) + num;
}
function numFromSortableString(str) {
  if(str.length == 1) return +str;
  return +str.substr(1);
}

Here's the general-purpose version of the code, which can handle numbers of any size. Like I said, this won't actually be useful in JS unless/until we introduce value types that let us use a BigInt class like it was a normal number, but it can be adapted for use in other languages.

function sortableStringFromNum(num) {
  for(var len = 0, counter = 10;counter <= num;len += 1, counter *= 10){}
  for(var prefix = '';len > 0;len -= 26) 
    prefix += String.fromCharCode(64+Math.min(len, 26));
  return prefix + num;
}
function numFromSortableString(str) {
  return +/[0-9].*/.exec(str)[0];
}

(Note the somewhat odd loop on the first line to find the length. This was necessary because after a certain point, JS starts serializing numbers in "1e20" notation, which throws off the naive string-length approach I use in the simple code. I can't do a nice Math.log() either, because I start running into *float* precision issues somewhere around 10^26 that give me wrong answers. The loop I ended up with is the best compromise I could come up with.)

"Zero-Padding" the Digit Stack

The technique in the previous post used a "digit-stack" technique to compress the prefix. Rather than immediately jumping to an unary prefix, it first prefixed the number with the number of digits.

This of course runs into the same sorting problem as soon as you hit 10 digits, because "10-1000000000" sorts before "2-10".
To fix this, it introduced a "stack" of digit-length numbers. Each level of the digit stack partitions the numbers so that the following level sorts correctly, until we come to the number itself.

However, just like how numbers only sort correctly when compared with numbers of the same length, digit stacks only compare correctly when compared with stacks of the same number of levels. If you need your system to extend to infinity, you eventually need to fall back to an unary prefix, like the method in the previous post did. If you're okay with limiting your range to a merely ridiculous maximum instead of infinity, though, you can get away with a variant of zero-padding. Just decide on how many stacks you want, and always include that number of them. The stacks that would ordinary be "missing" instead just have the value "1".

This method lets you get ridiculously large numbers while still avoiding some of the extra overhead. Limiting yourself to just two stacks will work for all numbers with less than a billion digits. Here's the code:

function sortableStringFromNum(num) {
  var digits = (''+num).length;
  return (''+digits).length + digits + '-' + num;
}
function numFromSortableString(str) {
  return +str.substr(str.indexOf('-')+1);
}

Increasing the Efficiency?

This post has all been about increasing the ease and efficiency for "low" numbers, at the expense of either not handling "high" numbers at all or at least making them longer.

Can we go the opposite direction, though? Can we optimize more powerfully for very large numbers, reducing their overhead at the expense of more overhead for small numbers?

Yes!

One easy target is the unary prefix. If you're going to infinity, you must, at some point, include some unary, because only unary numbers sort perfectly against each other regardless of length. However, the length of an unary number grows linearly, which is exponentially faster than the growth of normal positional numbers.

The method in the previous post was basically just "an unary prefix equal to the number of digits", except that we interposed a digit stack between the prefix and the number, so that the prefix instead referred to the number of levels in the digit stack. Since this number grew much slower than the length of the number, we got a nice win.

We can improve this, though, by interposing another stack between the unary prefix and the rest of the number. Just like the existing digit stack, we can add a level stack, which counts how many levels are in the digit stack. The level stack then builds up multiple levels itself in the same way as the digit stack, and the unary prefix now counts the number of levels in the level stack. This adds some complexity, but it slows down the growth of the unary prefix by another double-exponential factor, I think.

I believe you can extend this out infinitely, adding more and more meta-stacks and pushing the unary out further every time. You have to get to astronomical numbers before these strategies start paying dividends, but when you've got infinity to work with, that's easy.

Does anyone know of more interesting methods of reducing the overhead in the long term?

Let's write ten thousand trillion trillion comments!

This being said, testing for numbers when sorting file names doesn't seem such a hard problem. A simple regexp should do the trick.

On my Linux machine, the default file explorer, Dolphin, does the check. I guess it annoyed someone, and he solved the issue.

Now, take a Mac. The Finder doesn't check for numbers, and we have to wait for Apple to do something about it.

I acknowledge that it is hard to know what features a particular software is meant to have. I remember having read a Chromium bug about Google Chrome not having a "Save as Wallpaper" item in the context menu. (Chrome still doesn't have this feature.) A lot of people, however, considered this a fundamental feature that every browser should have.

But still, at least in open-source I can fix the issue. And get my pull request refused.

Reply?

It has to be ten thousand trillion trillion on a single post, mind you. And I suspect my server would fall over somewhere short of a single trillion trillion, let alone ten thousand.

Yeah, you can just use a good natural sort, like http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html, but I'm optimizing for the ftp program that I use, and the '<' operator in PHP and JS.