Tab Completion

I'm Tab Atkins Jr, and I wear many hats. I work for Google on the Chrome browser as a Web Standards Hacker. I'm also a member of the CSS Working Group, and am either a member of or a contributor to several other working groups in the W3C.

Strings Shouldn't Be Iterable By Default

Most programming languages I use, particularly those that are more "dynamic", have made the same, annoying mistake, which has a pretty high chance of causing bugs for very little benefit: they all make strings iterable by default.

By that I mean that you can use a string as the sequence value in a loop, like for(let x of someString){...}. This is a Mistake, for several reasons, and I don't think there's any excuse to perpetuate it in future languages: even in the cases where you do intend to loop over a string, the default behavior is frequently the wrong one.

Strings are Rarely Collections

The first problem with strings being iterable by default is that, in your program's semantics, strings are rarely actually collections. Something being a collection means that the important part of it is that it's a sequence of individual things, each of which is important to your program. An array of user data, for example, is semantically a collection of user data.

Your average string, however, is not a "collection of single characters" in your program's semantics. It's very rare for a program to actually want to interact with the individual characters of a string as significant entities; instead, a string is almost always treated as a single item, like an integer or a normal object.

The consequence of this is that it's very easy to accidentally write buggy code that nonetheless runs, just incorrectly. For example, you might have a function that's intended to take a sequence as one of its arguments, which it'll loop over; if the user accidentally passes a single integer, the function will throw an error since integers aren't iterable, but if the user accidentally passes a single string, the function will successfully loop over the characters of the string, likely not doing what was expected.

For example, this commonly happens to me when initializing sets in Python. set() is supposed to take an iterable, whose elements it adds to itself. If I need to initialize it with a single string, it's easy to accidentally type set("foo"), which then initializes the set to contain the strings "f" and "o", definitely not what I intended! Had I incorrectly initialized it with a number, like set(1), it would have immediately thrown an informative error telling me that 1 isn't iterable, rather than just waiting for a later part of my program to work incorrectly because the set doesn't contain what I expect.
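A quick way to see the difference, as a minimal sketch (the variable names are just for illustration):

```python
# A lone string is silently split into its characters.
chars = set("foo")
print(chars)  # a set holding "f" and "o", not "foo"

# A lone integer fails loudly instead.
try:
    set(1)
except TypeError as err:
    error_message = str(err)
    print(error_message)

# What I actually meant: a set containing one string.
names = {"foo"}
print(names)
```

The string case never raises, so the bug only surfaces later, wherever the set's contents are first checked.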

As a result, you often have to write code that defensively tests whether an input is a string before looping over it. There's not even a useful affirmative test for looping appropriateness; testing isinstance(arg, collections.abc.Sequence) returns True for strings! Strings are, in almost all cases, the only sequence type that requires this sort of special handling; virtually every other object that implements Sequence is intended to be treated as a sequence.
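The defensive check tends to look something like the sketch below. The helper name as_items is hypothetical, and treating bytes the same way as str is my assumption about what callers usually want:

```python
from collections.abc import Iterable, Sequence


def as_items(arg):
    """Return arg as a list of items, treating a lone string as ONE item."""
    # Strings (and bytes) must be special-cased before the generic check,
    # because they pass every "is this iterable?" test.
    if isinstance(arg, (str, bytes)):
        return [arg]
    if isinstance(arg, Iterable):
        return list(arg)
    raise TypeError(f"expected an iterable of items, got {type(arg).__name__}")


# The affirmative test does not help: a string counts as a Sequence.
print(isinstance("foo", Sequence))  # True

print(as_items("foo"))       # ['foo']
print(as_items(["a", "b"]))  # ['a', 'b']
```

Note that the string test has to come first; swapping the two isinstance checks brings the bug straight back.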

There's No "Correct" Way to Iterate a String

Another big issue is that there are so many ways to divide up a string, any of which might be correct in a given situation. You might want to divide it up by codepoints (like Python), grapheme clusters (like Swift), UTF-16 code units (like JS in some circumstances), UTF-8 bytes (Python bytestrings, if encoded in UTF-8), or more. For each of these, you might want to have the string normalized into one of the Unicode Normalization Forms first, too.
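To make that concrete, here is a small sketch that splits one short string several ways using only the Python standard library. The sample text is my own choice: an "e" followed by a combining acute accent, then an emoji outside the Basic Multilingual Plane. Grapheme clusters are left out because the standard library has no segmentation function for them.

```python
import unicodedata

# "e" + COMBINING ACUTE ACCENT, then GRINNING FACE (U+1F600).
text = "e\u0301\U0001F600"

# Codepoints: what iterating a Python str gives you.
codepoints = list(text)

# UTF-8 bytes: iterating the encoded bytes yields integers.
utf8_bytes = list(text.encode("utf-8"))

# UTF-16 code units: decode the encoded bytes two at a time.
raw = text.encode("utf-16-le")
utf16_units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

# Codepoints again, after normalizing to NFC (the accent composes into U+00E9).
nfc_codepoints = list(unicodedata.normalize("NFC", text))

print(len(codepoints))      # 3
print(len(utf8_bytes))      # 7
print(len(utf16_units))     # 4
print(len(nfc_codepoints))  # 2
```

A reader would see two "characters" here, yet the four counts above all differ, and which one is right depends entirely on what the code is doing with the string.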

None of these choices are broadly "correct". (Well, UTF-16 code units is almost always incorrect, but that's legacy JS for you.) Each has its benefits depending on your situation. None of them are appropriate to select as a "default" iteration method; the author of the code should really select the correct method for their particular usage. (Strings are actually super complicated! People should think about them more!)

Infinite Descent Shouldn't Be Thrown Around Casually

A further problem is that strings are the only built-in sequence type that is, by default, infinitely recursively iterable. By that I mean, strings are iterable, yielding individual characters. But these individual characters are actually still strings, just length-1 strings, which are still iterable, yielding themselves again.
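In Python this is easy to observe directly. The sketch below also shows bytes for contrast, since iterating a bytes object yields integers and so the descent stops there:

```python
ch = "a"

# Iterating a length-1 string yields an equal length-1 string.
inner = next(iter(ch))
print(inner == ch, type(inner) is str)  # True True

# So indexing can go as deep as you like without ever bottoming out.
deep = ch[0][0][0][0]
print(deep)  # a

# Contrast: iterating bytes yields integers, which are not iterable.
byte_item = next(iter(b"a"))
print(byte_item)  # 97
```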

This means that if you write code that processes a generic nested data structure by iterating over its values and recursing whenever it finds another iterable (not uncommon when dealing with JSON), you have to handle strings specially; otherwise you'll infinite-loop on them (or blow your stack). Again, this isn't something you need to worry about for any other builtin sequence type, nor for virtually any custom sequence you write; strings are pretty singular in this regard.
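Here is a sketch of that failure in Python, followed by the usual special-cased fix. Both function names are hypothetical, and the choice to walk a dict's values rather than its keys is an assumption about what a JSON-style flatten should do:

```python
from collections.abc import Iterable


def flatten_naive(value):
    """Recurse into anything iterable. Never terminates on strings."""
    if isinstance(value, Iterable):
        result = []
        for item in value:
            result.extend(flatten_naive(item))
        return result
    return [value]


def flatten(value):
    """Same idea, but strings are treated as leaves."""
    if isinstance(value, (str, bytes)):
        return [value]
    if isinstance(value, dict):
        value = value.values()  # assumption: we want values, not keys
    if isinstance(value, Iterable):
        result = []
        for item in value:
            result.extend(flatten(item))
        return result
    return [value]


try:
    flatten_naive(["ab"])
    naive_failed = False
except RecursionError:
    naive_failed = True

print(naive_failed)  # True
print(flatten({"a": ["x", ["yz", 1]], "b": "hi"}))  # ['x', 'yz', 1, 'hi']
```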

(And an obvious "fix" for this is worse than the original problem: Common Lisp says that strings are composed of characters, a totally different type, which doesn't implement the same methods and has to be handled specially. It's really annoying.)

The Solution

The fix for all this is easy: just make strings non-iterable by default. Instead, give them several methods that return iterators over them, like .codepoints() or what-have-you. (Similar to .keys()/.values()/.items() on dicts in Python.)
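As a rough sketch of what that could feel like, here is a hypothetical Text wrapper in Python. The class and its method names are invented for illustration; no existing language or library is being described:

```python
class Text:
    """Hypothetical string type that is NOT iterable by default."""

    def __init__(self, value: str):
        self._value = value

    def __str__(self) -> str:
        return self._value

    # Iteration is opt-in, and the caller has to say which kind they want.
    def codepoints(self):
        return iter(self._value)

    def utf8_bytes(self):
        return iter(self._value.encode("utf-8"))


name = Text("foo")

# The accidental case now fails immediately, just like set(1) does.
try:
    set(name)
    accidental_error = None
except TypeError as err:
    accidental_error = str(err)
print(accidental_error)

# The intentional case is explicit about how the string is divided.
print(list(name.codepoints()))  # ['f', 'o', 'o']
print(list(name.utf8_bytes()))  # [102, 111, 111]
```

Because the class defines neither __iter__ nor __getitem__, Python refuses to loop over it, which is exactly the behavior being argued for.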

This avoids whole classes of bugs, as described in the first and third sections. It also forces authors, in the rare cases they actually do want to loop over a string, to affirmatively decide on how they want to iterate it.

So, uh, if you're planning on making a new programming language, maybe consider this?
