Storing Passwords Properly

So you've made a cool webapp, and you want to have a login system so people can save data on your server. You slap something together, but seem concerned by the fact that you're just storing the passwords in the database alongside all the other data. How can you make this better?

Well, your first mistake was to store the passwords at all. Use a federated login system like OpenID, so you don't have to worry about it. But, given that you're stuck with your current situation, let's look at the attacks you want to guard yourself against and how to do so.

There are four basic attacks:

Network sniffing
Just, you know, looking at it
Quick hack
Brute force

We can never completely remove the risk, but we can at least greatly mitigate each of these attacks through some simple security measures.

Network Sniffing

tl;dr: Use HTTPS all the time.

If you have your login page on a normal HTTP address, then anyone with a packet sniffer can grab the login information of your users as they log in. This has gotten much worse as more people use wifi, and tools like FireSheep have arrived which automate the info-stealing process. Forcing the login page to be accessed over HTTPS means that the connection is encrypted, which makes it much safer.

Preferably, do your entire application in HTTPS. It's good for the web.

Just, you know, looking at it

tl;dr: Hash your passwords before storing them.

This is probably an info-grab from someone in your organization, or maybe just a lucky hacker that got into your DB.

Now, if the attacker is a fellow programmer on your team, you're screwed. They've got access to the codebase, they can do whatever they want. But if you're part of a larger organization, and you want to make sure that Joe in Tech Support who knows a bit about networks can't get into the production db and sell a bunch of passwords to a Russian cartel, you can take some simple steps.

Basically, you don't want to store passwords in plain text. If anything ever happens to the database, they'll be spilled all over the place and the jig is up. You need to somehow mask the password so that it can't be deciphered by itself.

This sounds like a job for encryption, right? Almost, but not quite. Encryption's failing here is that it's reversible. Using appropriate keys you can turn the encrypted password back into a plaintext one. If they have access to the DB, they may also have the key; for example, if they grabbed a copy of your source code quickly before their access was shut down.

The key insight is that you don't actually need to know what the password is, just that it matches what the user provided. Encryption transforms the stored password, and then un-transforms it to check if it matches the string the user provided. Obviously, though, there's a second option - just encrypt what the user provides as well, and then check the encrypted string against the encrypted string stored in the database. If they were the same to start with, they'll be the same after encryption, so the equality check will still work.

But if we can just do this, then we don't need the ability to reverse the transformation at all. We can use a hash function, which is a one-way transformation. Technically, it's impossible to make a transformation actually one-way, but you can make it much more difficult to reverse the transformation than it is to transform the string originally, which is sufficient - if it takes you a fraction of a second to make the transformation but years to reverse it, that's a good tradeoff.. Like encryption, you shouldn't ever try to make up your own hash function, because it will almost certainly have significant weaknesses. New cryptographic functions requires a ton of review from the professional community before they're accepted as usable.

So, hash all your passwords using a currently widely-accepted hash function. This isn't enough by itself, but it's a start. The next two sections elaborate on this.

Quick Hack

tl;dr: Salt your passwords before hashing.

Okay, so you've hashed all your passwords. Even if the attacker gets away with all the info in your database and your source code, they still just get a load of gibberish and can't use it to actually log in, right? Well, not quite, or else this article would already be over.

You see, hash functions are designed to be hard to reverse. But they can be reversed, a little bit at a time, by just hashing stuff and seeing what pops out. If you record what you used and what the result was, you can now reverse that hash function for that particular value. If you spend enough time doing this, you can build up a substantial database of strings and their hashes, so that reversing a hash is just a matter of looking in the database. These are called rainbow tables, and they're bought and sold for use in attacking password databases just like this. Unfortunately, most people's passwords are simple dictionary words, which you can construct a rainbow table for in almost no time. A rainbow table of every possible password of 8 lowercase letters or less, which will cover the vast majority of people, can fit on a few terabyte drives (there's only about 200 billion of them).

It's infeasible to construct a rainbow table for any large set of strings - if you restrict yourself to 8 characters or less, but also mix in uppercase letters, numbers, and some punctuation (say, another 10 characters worth), you increase the total set of possible strings to about 700 trillion, a 3 or 4 thousand-fold increase. But can we do better? Of course!

Remember, the reason we could use a hash is because we didn't actually need to know what the original password was, we just needed to know how we transformed it, so we could transform the string the user provides in the same way. Well, if you're already transforming it unrecognizably, transforming it more won't hurt anything. Rainbow tables only work because the set of strings we're hashing is relatively small, because people choose bad passwords. We can fix that weakness, making the set of strings much larger and thus making rainbow tables completely infeasible, by just making the strings longer.

We don't have to involve the user in this process at all. You just have to create a long random string, called a salt, which we'll append to each password before we hash it. Again, as long as we do the exact same thing to the stored password and the one the user provides, we'll get the same result, so we can still test for equality. Just adding a 30 character salt to everything balloons the necessary size of the rainbow table to around 1e67, a number so large that we exhaust the set of common names for numbers. Suffice it to say that a hard drive large enough to hold a rainbow table for this set would be more massive than the Earth itself.

That said, this still doesn't get us all the way there. Sure, making a rainbow table for all possible salted strings is impossible. But if you have the source code you'll be able to find out what the salt is, and then you can make a rainbow table for that specific database. It's more expensive than just buying a ready-made one, but still quite doable on modern hardware.

We can fix this, though. We don't need the salt to be the same for everyone. As long as we remember what salt we used, we can use a different one for every single entry, actually. Just generate a fresh random string for every new user, and use it as their personal salt. You can even store it plainly in the database, because it doesn't help the attacker much to know what it is.

So, we've killed the possibility of rainbow tables being used against us. Great! But there's still one final attack, which has become more and more feasible recently, and now is actually a significant concern if you're not doing things right...

Brute Force

tl;dr: Use a slow hash function.

In the end, the attacker can always just try every possible string, hashing them one by one and comparing them to the value in the database. Salting doesn't help you here, as the attacker knows the salt, and can just append it to every possible string. This returns us to a similar situation we were in earlier, where the fact that most people use simple dictionary words makes us easy to attack. The attacker now has to bruteforce every password individually, but if people start with weak passwords, this process will be quick and easy. With a current industrial supercomputer, which can be rented for a mere few thousand dollars an hour, an attacker can process every possible 6-character string and a selection of common longer words through the MD5 hash function in less than a second. With a $200 GPU the same attack can be carried out in an hour; this is longer, but still way too short for security, and easily within the price range of any random teenage hacker.

The problem here is that many hash functions are designed to be fast. Hashes are used for much more than security; for example, if you're regularly updating some large set of data pulled from another server, you can hash your data and then ask a server for the hash of their current data; if the hashes are the same, the data hasn't changed and you don't need to waste a ton of bandwidth. This sort of thing isn't very security-conscious, so it just wants some function that is fast to compute and likely to change if the input changes. An attacker, though, also loves a fast hash function, because it means they can compute more hashes in less time and their bruteforce attack will go faster.

There's an asymmetry we can exploit here to help us - we don't need to hash things very often (just whenever a user first signs up, and whenever they log in), but an attacker using a brute-force attack needs to hash a lot of things. A hash that takes a leisurely half-second is just fine for our purposes, but it changes the dynamic radically for the attacker - now, hashing all possible 6-character strings takes years per attempt. This effectively kills the bruteforce attack completely - the attacker can still probably sweep up people just using dictionary words, as that set is only a few thousand strings in size, but some simple password requirements will help mitigate that.

I'm not familiar enough with slow hashes to make good recommendations, but I've heard good things about bcrypt. It can also be adjusted to be slower as time goes on and computers get faster. According to RFC 2898, just taking a standard hash like SHA1 and iterating it 50 thousand times or so should work as well, and of course this can also be adjusted to be as slow as desired.

Conclusion

Using these tips won't make you totally safe. There are still plenty of attack vectors, like simple malware or social engineering. But they're all simple to implement, and will stop a decent class of attacks before they start, so they're very worthwhile to implement in your own projects.