Friday, June 15, 2012

The correct way to validate an e-mail address

If you are using regular expressions, in general and not just for e-mail, you are Doing It Wrong(TM).

Every single time I see preg_match(), or the equivalent function in another language, used for anything and not just for e-mail addresses, I know that the code in that location is wrong. The regular expression will miss something important and will be either too strict or not strict enough. This is especially true for e-mail address validation. I have yet to find a circumstance where a regex pattern match is a valid solution. It acts as a blacklist, and blacklists are constant maintenance nightmares. Regular expression string replacement, however, acts as a simple whitelist. preg_match() = bad, preg_replace() = good. But preg_replace() is not what programmers use to validate e-mail addresses, nor would it be a good idea for that job.
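To make the match-versus-replace distinction concrete, here is a minimal sketch in Python, with re.fullmatch() standing in for an anchored preg_match() and re.sub() standing in for preg_replace(). The function names and the digits-only example are made up for illustration; nothing here is specific to e-mail.

```python
import re

def looks_like_digits(value: str) -> bool:
    # Pattern match (the preg_match() style): a yes/no verdict on the whole input.
    return re.fullmatch(r"[0-9]+", value) is not None

def keep_digits(value: str) -> str:
    # Replacement (the preg_replace() style): drop every character
    # that is not on the whitelist and keep whatever remains.
    return re.sub(r"[^0-9]", "", value)

print(looks_like_digits("555-1234"))
print(keep_digits("555-1234"))
```

The first function can only reject "555-1234", while the second turns it into something usable. That is the whole difference, and it is also why neither one is a parser.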

The correct way to validate an e-mail address is to do exactly what the RFCs say to do: Implement a state engine that parses the address one character at a time using the complex grammar specified by the RFCs.
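As a rough illustration of what "one character at a time" means, here is a small state engine in Python. It is a sketch under heavy assumptions: it only understands a simplified subset of the grammar (a dot-separated local part or a single quoted string, followed by a plain hostname), it ignores comments, folding whitespace, IP literals, and length limits, and the state names and parse_address() itself are invented for this example. A real implementation of the RFC grammar is much larger.

```python
from typing import Tuple

LETTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS = "0123456789"
ATEXT = set(LETTERS + DIGITS + "!#$%&'*+-/=?^_`{|}~")  # unquoted local-part characters
ALNUM = set(LETTERS + DIGITS)

def parse_address(address: str) -> Tuple[str, str]:
    """Walk the input one character at a time and return (local, domain)."""
    state = "local_start"
    local, domain = [], []
    for ch in address:
        if state == "local_start":
            # Start of the local part, or of the piece that follows a dot.
            if ch == '"' and not local:
                state = "quoted"
            elif ch in ATEXT:
                state = "local"
            else:
                raise ValueError(f"unexpected {ch!r} in local part")
            local.append(ch)
        elif state == "local":
            if ch == "@":
                state = "domain_start"
            elif ch == ".":
                state = "local_start"  # a dot must be followed by more text
                local.append(ch)
            elif ch in ATEXT:
                local.append(ch)
            else:
                raise ValueError(f"unexpected {ch!r} in local part")
        elif state == "quoted":
            local.append(ch)
            if ch == "\\":
                state = "quoted_escape"
            elif ch == '"':
                state = "quoted_end"
        elif state == "quoted_escape":
            local.append(ch)  # the escaped character is taken literally
            state = "quoted"
        elif state == "quoted_end":
            if ch != "@":
                raise ValueError("expected '@' after the quoted string")
            state = "domain_start"
        elif state == "domain_start":
            # Start of the domain, or of the label that follows a dot.
            if ch not in ALNUM:
                raise ValueError(f"unexpected {ch!r} at start of a domain label")
            domain.append(ch)
            state = "domain"
        else:  # state == "domain"
            if ch == ".":
                state = "domain_start"
            elif ch not in ALNUM and ch != "-":
                raise ValueError(f"unexpected {ch!r} in domain")
            domain.append(ch)
    if state != "domain":
        raise ValueError("address ended too early")
    return "".join(local), "".join(domain)

print(parse_address("john.doe@example.com"))
print(parse_address('"a@b"@example.com'))
```

Because the code owns every state transition, each failure can say exactly which character was wrong and where, and any one of those error branches could be changed to repair the input instead of rejecting it.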

I know what you are going to say next: "But a regular expression parser implements a state engine!" Sure it does, but can your regular expression actually correct an invalid e-mail address? Didn't think so. And can you fully implement the complex grammars in the RFCs in your regex parser in a readable way? Not unless you're using something recent, and even then I see that as a hack for already broken software. Not once have I seen a regex that holds up and does what its author actually intended. In addition, when you control the state engine, you also get to define how the input string is parsed, and you can even correct invalid inputs in the cases where there is an obvious mistake and only one logical path to take.

Of course, I have a solution already built that passes a very large test suite with flying colors:

Ultimate E-mail Toolkit

The toolkit parses addresses backwards because extracting the domain portion is the easy part, leaving me with the mess in front of the '@' to deal with. But it is equally valid to use an IP address instead of a domain name, so it can get a little messy trying to figure things out even for something seemingly simple. And don't forget that comments can appear in an e-mail address in certain places. I'm not sure what the rationale was for allowing comments, but since they can exist, it is important to handle them too (usually by removing them - again, something a pattern matching regex can't do).
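The following Python sketch shows the general shape of that approach: strip comments first, split at the last '@', then decide whether the right-hand side is an IP literal or a hostname. It is not the toolkit's code. The function names are hypothetical, the comment handling only covers parentheses, quoted strings, and backslash escapes, and the local part is returned without any further checking.

```python
import ipaddress
from typing import Tuple

def strip_comments(text: str) -> str:
    """Remove (possibly nested) parenthesized comments outside quoted strings."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False
    escaped = False    # True when the previous character was a backslash
    for ch in text:
        if escaped:
            if depth == 0:
                out.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
            if depth == 0:
                out.append(ch)
        elif in_quotes:
            out.append(ch)
            if ch == '"':
                in_quotes = False
        elif ch == '"' and depth == 0:
            in_quotes = True
            out.append(ch)
        elif ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)

def split_from_right(address: str) -> Tuple[str, str]:
    """Split at the LAST '@', since a quoted local part may contain one too."""
    cleaned = strip_comments(address).strip()
    local, sep, domain = cleaned.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("expected text on both sides of an '@'")
    return local, domain

def domain_kind(domain: str) -> str:
    """Tell a bracketed IP literal apart from a hostname."""
    if domain.startswith("[") and domain.endswith("]"):
        literal = domain[1:-1]
        if literal.lower().startswith("ipv6:"):
            literal = literal[5:]
        ipaddress.ip_address(literal)  # raises ValueError if malformed
        return "ip-literal"
    return "hostname"

local, domain = split_from_right('"a@b"(odd but legal)@example.com (home)')
print(local, domain, domain_kind(domain))
```

Working from the right means the '@' inside the quoted local part never gets mistaken for the separator, and the comments are gone before anything else has to look at the string.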

Anyway, I've said what I wanted to say. Parsing e-mail addresses is hard and regular expressions don't cut it.
