Sunday, June 17, 2012

How to calculate Password Strength (Part III)

This is the conclusion to a three-part series on calculating password strength using a brand new algorithm that I've been teasing for a while.

Read Part I and Part II for the earlier bits to this story.

If you are a programmer and just want the tl;dr nitty-gritty (i.e. the source code to my algorithm), then you will need to download the SSO Server and Client and extract the code from 'server/support/functions.php'. The two relevant functions are:

SSO_GetNISTNumBits($password, $repeatcalc = false);

SSO_IsStrongPassword($password, $minbits = 18, $usedict = false, $minwordlen = 4);

The algorithm I developed essentially attempts to break a password within a small time budget (less than 1/4 sec). But how does one do that? The first step is to calculate the entropy of the password. NIST has done some work in this regard, but they only published a set of suggestions, not actual recommendations. The next step is to apply a threshold at some acceptable bit level that rejects bad passwords. Displaying a password strength meter is not sufficient.

When calculating entropy, it is important to not hand out bits. Be stingy. The NIST algorithm gets to be reasonably accurate when a password gets past 20 characters in length. Its estimates for shorter passwords tend to be less accurate. This is where the "suggestion" part likely comes from. What I've done is use the NIST algorithm as a starting point - a fast pre-filter. I've got both the original and a modified version of that algorithm that reduces character reuse values by 75% per character (e.g. the letter 'a' repeated many times is not a strong password).
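To make the pre-filter concrete, here is a minimal sketch in Python (not the PHP from 'server/support/functions.php'). It uses the per-character values from the NIST guideline: 4 bits for the first character, 2 bits each for characters 2-8, 1.5 bits each for characters 9-20, and 1 bit for every character after that. The NIST bonus bits for mixed character classes are left out. The name `nist_bits` and the `decay` parameter are made up for this example, and the default of 0.75 is only one reading of "reduces character reuse values by 75%"; pass 0.25 if you read it the other way.

```python
def nist_bits(password, penalize_repeats=False, decay=0.75):
    """Estimate password entropy in bits using the NIST per-position values.

    With penalize_repeats=True, every repeat of a character is worth
    `decay` times what the previous occurrence of it was worth
    (an assumed interpretation, not the author's exact code).
    """
    bits = 0.0
    weight = {}  # current multiplier for each character seen so far
    for i, ch in enumerate(password):
        if i == 0:
            base = 4.0   # first character
        elif i < 8:
            base = 2.0   # characters 2-8
        elif i < 20:
            base = 1.5   # characters 9-20
        else:
            base = 1.0   # characters 21 and up
        if penalize_repeats:
            bits += base * weight.get(ch, 1.0)
            weight[ch] = weight.get(ch, 1.0) * decay
        else:
            bits += base
    return bits


print(nist_bits("abcdefgh"))                          # 18.0
print(nist_bits("aaaaaaaa"))                          # 18.0, no penalty
print(nist_bits("aaaaaaaa", penalize_repeats=True))   # well under 18
```

The plain version gives 'aaaaaaaa' the same score as 'abcdefgh', which is exactly the kind of free bits the repeat penalty is meant to take away.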

The next step in my algorithm is to check for keyboard layout passwords and "tricks" (shift by one). For example, 'qwertyuiop' is not a strong password because it uses aspects of the keyboard's layout to come up with a password.
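A rough Python sketch of both ideas follows. It assumes a US QWERTY layout and ignores the shift key; `ROWS`, `longest_row_run`, and `slide` are hypothetical names for illustration and are not taken from the SSO Server source. The first function measures how many consecutive characters walk along a keyboard row, and the second undoes a "shift by one" trick by sliding every key one position along its row.

```python
# Unshifted US QWERTY rows (an assumption; other layouts need other tables).
ROWS = ["`1234567890-=", "qwertyuiop[]\\", "asdfghjkl;'", "zxcvbnm,./"]
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def longest_row_run(password):
    """Length of the longest run of keys that are neighbors on one row."""
    best = run = 0
    prev = None
    for ch in password.lower():
        cur = POS.get(ch)
        if cur and prev and cur[0] == prev[0] and abs(cur[1] - prev[1]) == 1:
            run += 1
        else:
            run = 1 if cur else 0
        best = max(best, run)
        prev = cur
    return best


def slide(password, offset):
    """Move each key `offset` positions along its row; keep it if impossible."""
    out = []
    for ch in password.lower():
        cur = POS.get(ch)
        if cur and 0 <= cur[1] + offset < len(ROWS[cur[0]]):
            ch = ROWS[cur[0]][cur[1] + offset]
        out.append(ch)
    return "".join(out)


print(longest_row_run("qwertyuiop"))  # 10
print(slide("[sddeptf", -1))          # password
```

A long run relative to the password length, or a slid variant that turns out to be a common word, is a reason to score the password lower than its raw character count suggests.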

The final step is optional but highly recommended. The relevant SSO Server Generic Login module checks the password and keyboard sliding variations against a 300,000 word English dictionary. It only needs to check for the first matching word up to the minimum entropy level because "correct horse battery staple" is actually a strong password (a password phrase). Sentences generally make for secure passwords, so the Generic Login module encourages such behavior. Passwords should be able to be any length.
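The dictionary lookup itself can be sketched as a scan for the first substring that is a known word. This Python example is an illustration only: `first_dictionary_word` is a hypothetical helper, the tiny word set stands in for the 300,000 word dictionary, and `min_word_len` mirrors the `$minwordlen = 4` default from the function signature above. How a match then lowers the bit estimate is not spelled out in this post, so that part is left out.

```python
def first_dictionary_word(password, words, min_word_len=4):
    """Return the first substring of at least min_word_len characters
    that appears in `words` (a set of lowercase words), or None."""
    lowered = password.lower()
    n = len(lowered)
    for start in range(n - min_word_len + 1):
        for end in range(start + min_word_len, n + 1):
            if lowered[start:end] in words:
                return lowered[start:end]
    return None


words = {"horse", "battery", "staple", "correct"}   # stand-in dictionary
print(first_dictionary_word("x9Horse!", words))     # horse
print(first_dictionary_word("zq7#kp", words))       # None
```

The same scan can be run against the keyboard-slid variations of the password, so a shifted dictionary word is caught as well.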

At each step of the process, the goal is to attempt to find a lower number of bits of entropy so that the password fails to pass the tests. Each test takes a little bit longer than the previous test. Still, rejecting 99% of all bad passwords at 18 bits of entropy (under the NIST estimate, 18 bits is roughly 8 characters: 4 bits for the first character plus 2 bits for each of the next seven) is a fantastic solution to the problem of users selecting weak passwords.

The most important thing with calculating password strength is to do something with the calculation. Letting a user know that their password is weak won't do squat. You have to actually enforce it. Requiring a minimum number of bits of entropy is called thresholding. Any password that doesn't meet a minimum number of bits of entropy must be rejected if we want to rid the world of poorly selected passwords (a good thing). Using my algorithm with dictionary checks enabled, I've got a good set of rules in Part II of this series.
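Thresholding, with the cheapest test first, can be sketched in a few lines of Python. `passes_threshold` and the toy estimators are hypothetical; in practice each estimator would be one of the steps described above (NIST pre-filter, keyboard checks, dictionary checks), each returning a number of bits.

```python
def passes_threshold(password, estimators, min_bits=18):
    """Reject as soon as any estimator scores the password below min_bits.

    `estimators` is a list of functions ordered from cheapest to most
    expensive; later ones only run if the earlier ones did not reject.
    """
    for estimate in estimators:
        if estimate(password) < min_bits:
            return False
    return True


# Toy estimators for demonstration only.
cheap = lambda p: 2.0 * len(p)
stingy = lambda p: 2.0 * len(set(p))   # counts distinct characters only

print(passes_threshold("aaaaaaaaaa", [cheap]))          # True
print(passes_threshold("aaaaaaaaaa", [cheap, stingy]))  # False
```

The point of the design is that a password only has to fail one test to be rejected, so the expensive tests run only on passwords that already look acceptable.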

So there you have it, the anticlimactic conclusion of this nerdtastic series. Now get out there and do some password thresholding! Save people from themselves!

Friday, June 15, 2012

The correct way to validate an e-mail address

If you are using regular expressions, in general and not just e-mail, you are Doing It Wrong(TM).

Every single time I've ever seen preg_match() or the equivalent function in another language used, not just for e-mail addresses, I know that the code in that location is wrong. The regular expression will miss something important and either be too strict or not strict enough. This is especially true for e-mail address validation. I have yet to find a circumstance where a regex pattern match is a valid solution. It acts as a blacklist and blacklists are constant maintenance nightmares. Regular expression string replacement, however, acts as a simple whitelist. preg_match() = bad, preg_replace() = good. But preg_replace() is not what programmers use to validate e-mail addresses, nor is it a good idea.

The correct way to validate an e-mail address is to do exactly what the RFCs say to do: Implement a state engine that parses the address one character at a time using the complex grammar specified by the RFCs.

I know what you are going to say next, "But a regular expression parser implements a state engine!" Sure it does, but can your regular expression actually correct an invalid e-mail address? Didn't think so. And can you fully implement the complex grammars in the RFCs in your regex parser in a readable way? Not unless you're using something recent, but I see that as a hack for already broken software. I have never once seen a regex that did exactly what its author intended without breaking on some input. In addition, when you control the state engine, you also get to define how the input string is parsed and can even correct invalid inputs in some cases where there is an obvious mistake and only one logical path to take.

Of course, I have a solution already built that passes a very large test suite with flying colors:

Ultimate E-mail Toolkit

The toolkit parses addresses backwards because extracting the domain portion is the easy part, leaving me with the mess in front of the '@' to deal with. But it is equally valid to use an IP address instead of a domain name, so it can get a little messy trying to figure things out even for something seemingly simple. And don't forget that comments can appear in an e-mail address in certain places. I'm not sure what the rationale was for allowing comments, but since they can exist, it is important to handle them too (usually by removing them - again, something a pattern matching regex can't do).
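To show what "one character at a time" looks like, here is a deliberately tiny Python sketch of a state engine. It is not the Ultimate E-mail Toolkit and it is nowhere near the full RFC grammar: it reads forwards rather than backwards, does not validate the domain or IP address, and only tracks three pieces of state (comment depth, quoted string, backslash escape). The function name `strip_comments_and_split` is made up for this example. What it does show is a parser removing comments and finding the real '@', which a pattern match alone cannot do.

```python
def strip_comments_and_split(address):
    """Return (local_part, domain) with (comments) removed, or None."""
    out = []
    depth = 0          # nesting level of ( ... ) comments
    in_quotes = False  # inside a "quoted string"
    escaped = False    # previous character was a backslash
    at_index = -1      # position in `out` of the last unquoted '@'
    for ch in address:
        if escaped:                       # character after a backslash
            escaped = False
            if depth == 0:
                out.append(ch)
        elif ch == "\\" and (in_quotes or depth > 0):
            escaped = True
            if depth == 0:
                out.append(ch)
        elif in_quotes:                   # keep everything inside quotes
            out.append(ch)
            if ch == '"':
                in_quotes = False
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                return None               # unbalanced ')'
            depth -= 1
        elif depth > 0:
            continue                      # drop comment text
        else:
            if ch == '"':
                in_quotes = True
            elif ch == "@":
                at_index = len(out)
            out.append(ch)
    if depth or in_quotes or escaped:
        return None                       # unterminated comment or quote
    if at_index <= 0 or at_index == len(out) - 1:
        return None                       # missing local part or domain
    return "".join(out[:at_index]), "".join(out[at_index + 1:])


print(strip_comments_and_split("john(work account)@example.com"))
# ('john', 'example.com')
print(strip_comments_and_split('"a@b"@example.com'))
# ('"a@b"', 'example.com')
```

Because the code owns the state, it can both reject input (an unclosed comment returns None) and repair it (a comment is silently dropped), which is the distinction being drawn above.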

Anyway, I've said what I wanted to say. Parsing e-mail addresses is hard and regular expressions don't cut it.

Friday, June 08, 2012

LinkedIn, eHarmony, and Last.FM hacked - How to not be the next victim

If you have been following the news lately, you know that business social media giant LinkedIn, the popular dating site eHarmony, and the used-to-be-popular-before-the-merger music site Last.FM were hacked and part or all of their databases were stolen and passwords cracked. This is what happens if you are a beginning programmer who writes a login system. These sites were mentioned in the news because they are larger data breaches, but there are thousands of compromised sites every day that you don't hear about.

Don't be next. Instead of authoring a login system, you should be using a product written by someone who has spent the time researching industry best-practices and carefully and painstakingly crafted each aspect of a modern login system.

http://barebonescms.com/documentation/sso/

That is an enterprise-grade Single Sign-On system (written in PHP, but the client is portable to other languages). I'd mention other products but, as of the time of this writing, they simply don't exist. That's right, other than the above piece of software, no one to date has developed a solid self-hosted login system that other people can use in their software applications. I scoured Google, SourceForge, Google Code, and GitHub before and after developing this product and came up empty-handed. There are some libraries (OpenID, OAuth, and HybridAuth) but they effectively require writing a login system - again, nothing prepackaged. There are also cheesy little scripts here and there and everywhere but they are all badly broken security-wise - written by newbie programmers, not industry veterans, and therefore code you shouldn't even touch.

We've been doing dynamic website development for what? Fifteen years? So, roughly 15 years of web development have passed and there is absolutely NOTHING, NADA, ZIP, ZILCH filling this remarkably vacant space. Surprised? I'm not - because writing a login system is a "rite of passage" (or something like that) for each web programmer out there. This practice needs to stop right now because it is the source of the problem. The security breaches of the last couple years should be a massive wake-up call for the entire industry to put together legitimate solutions to this serious security problem. I've got the first and currently only product and that's just...so incredibly pathetic.

Writing a login system is a fine exercise for a programmer, but don't use it in a production environment - there is a VERY good chance you will miss something critical. A login system is generally a website's primary security mechanism. If it is flawed, then there are serious problems with the entire website. Leave writing login systems to those who know what they are doing and use prepackaged solutions wherever possible. And by "wherever possible", I mean "everywhere".