In this week's episode of Terrible Ideas Taken Too Far, we'll be exploring the delicate art of determining whether the names people provide are valid, using regular expressions (regex), the commonly used pattern-matching syntax. Doing something like this is not uncommon in apps with social functionality, where you might want to enforce some constraints on the publicly visible names people can set for themselves.

Of course, the most pragmatic way to validate names is to not even try. It's a hard problem; people's names have a dangerous combination of extreme character set variability and strongly associated emotions. Maybe it's best to just stay in bed, and let the user set $^%*$($)';DROP TABLE users; as their name, rather than having to deal with validating characters from every language.

But let's consider the possibility that your boss doesn't buy the argument for doing nothing, and you're instructed to "do something" instead. What are your options? If regex is your hammer, you're probably going to do something like one of the following, each of which we'll examine in greater detail below:

  1. /^[^~!@#$%^&*()+=;"{}><,[\]\/\?\.]+$/ Exclude a list of characters.
  2. /^[a-z0-9 '_-]+$/i Allow a range of alpha-numeric characters from the ASCII Latin character set, plus spaces, apostrophe, underscore and dash.
  3. /^[\pL\pN\pM\pZ'_-]+$/u Allow characters from the Unicode character set with certain properties, plus apostrophe, underscore and dash.
  4. /^(?:[\pL\pN\pM]+[\pZ'_-])*[\pL\pN\pM]+$/u Allow the same Unicode characters as above, but only allow single separators between groups of one or more letters and numbers.

Excluding characters

It could be tempting to solve the problem by maintaining a list of characters that can't be used: ~!@#$%^&*()+=;"{}><,[]/?. in the example given. And maybe that's reasonable for your use case; maybe you're only concerned with disallowing characters that might be used for cross-site scripting (XSS) attacks and SQL injection (hopefully not in lieu of properly parameterized queries and encoding untrusted strings with htmlentities at render time). However, assuming you do care about users having cryptic names consisting of weird characters, is this a viable way to ensure sanitized names?

I'm sure there's a well-funded study showing that the vast majority of teenagers give up on setting names consisting entirely of special characters after exhausting those displayed on their keyboards. It's that 1% that'll get you though, the ones that keep finding new characters with which to mildly annoy their fellow community members, condemning you to a life of maintaining an ever-expanding list of banned characters. Maybe this isn't the best approach.
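To see the gap in practice, here is a minimal sketch in TypeScript, assuming a JavaScript regex engine rather than PCRE. The pattern is the first one from the list above; the helper name passesDenyList and the sample names are made up for illustration.

```typescript
// Pattern 1 from the list above: a name passes only if it contains
// none of the deny-listed characters.
const denyListPattern = /^[^~!@#$%^&*()+=;"{}><,[\]\/\?\.]+$/;

// Hypothetical helper name, used only for this illustration.
function passesDenyList(name: string): boolean {
  return denyListPattern.test(name);
}

const denyListSamples: string[] = [
  "Jérôme",                      // ordinary name: passes
  "$^%*$($)';DROP TABLE users;", // keyboard symbols on the list: rejected
  "★彡☆彡★",                    // symbols nobody thought to list: passes
  "|||:::|||",                   // pipe and colon were never listed: passes
];

denyListSamples.forEach((name) => {
  console.log(JSON.stringify(name), passesDenyList(name));
});
```

The last two samples are the 1% at work: nothing in them appears on the list, so they pass until someone adds yet another character to it.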

Allowing a range of ASCII Latin characters

Maybe your community caters only to users of Anglo-Saxon ancestry, with names as plain as the nose on your face, and you don't see any problems with limiting people's names to letters from A-Z in God's alphabet (Romans', whatever), plus numbers, spaces, underscores, dashes... and since we're feeling generous, we'll throw the Irish a bone and allow apostrophes. Everything might seem fine, until the day management pivots to catering to the French, and you've got Jérôme complaining that he can't create an account. Maybe this wasn't such a hot idea after all.
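Jérôme's complaint is easy to reproduce. The sketch below runs the second pattern from the list in TypeScript (again assuming a JavaScript regex engine); passesAsciiAllowList and the sample names are hypothetical.

```typescript
// Pattern 2 from the list above: ASCII letters and digits, plus space,
// apostrophe, underscore and dash. The i flag makes a-z case-insensitive.
const asciiPattern = /^[a-z0-9 '_-]+$/i;

// Hypothetical helper name, used only for this illustration.
function passesAsciiAllowList(name: string): boolean {
  return asciiPattern.test(name);
}

const asciiSamples: string[] = [
  "O'Brien",      // apostrophe is allowed: passes
  "Mary-Jane_42", // dash, underscore and digits are allowed: passes
  "Jérôme",       // é and ô are outside a-z: rejected
  "李小龍",        // no ASCII letters at all: rejected
];

asciiSamples.forEach((name) => {
  console.log(JSON.stringify(name), passesAsciiAllowList(name));
});
```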

Allowing Unicode characters with certain properties

So after realizing that this is a bigger problem than anticipated, you might be worried about maintaining lists of every acceptable character in any language. Well, put that worry to rest, because the creators of the Unicode character set specification have gone through the trouble of classifying all the characters for you, using a list of properties. Unicode characters have properties like letter (\pL), which also applies to ideographs such as Chinese Han characters, number (\pN) or separator (\pZ), and we can filter on them using regex with the /u Unicode flag.

In the example given, we're allowing characters that are either a letter, a number, a mark (\pM: combining characters, such as accents and vowel signs, needed in certain languages) or a separator (\pZ: spaces in Latin and most other scripts), all in any language, plus apostrophe, underscore and dash. That's going to allow all but the latest trending rapper with a name consisting mostly of dollar signs to enter their name without issue.
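Here is a rough TypeScript sketch of the third pattern. One assumption to flag loudly: JavaScript regexes with the u flag require the braced form \p{L}, whereas the PCRE shorthand \pL used in this article is a syntax error there, so the pattern is transliterated. The helper name and sample names are invented for illustration, and lettersOnlyPattern exists only to show what the mark property buys you.

```typescript
// Pattern 3 from the list above, transliterated for JavaScript:
// \pL becomes \p{L}, and so on (an adaptation, not the article's PCRE form).
const unicodeNamePattern = /^[\p{L}\p{N}\p{M}\p{Z}'_-]+$/u;

// Comparison pattern (not from the article): letters only, no marks.
const lettersOnlyPattern = /^\p{L}+$/u;

// Hypothetical helper name, used only for this illustration.
function passesUnicodeAllowList(name: string): boolean {
  return unicodeNamePattern.test(name);
}

// "Jérôme" written with combining accents (e + U+0301, o + U+0302),
// which is how some keyboards and systems produce it.
const decomposedJerome = "Je\u0301ro\u0302me";

console.log(passesUnicodeAllowList("Jérôme"));         // true
console.log(passesUnicodeAllowList("李小龍"));          // true
console.log(passesUnicodeAllowList(decomposedJerome)); // true
console.log(lettersOnlyPattern.test(decomposedJerome)); // false: marks are not letters
console.log(passesUnicodeAllowList("Ke$ha"));          // false: $ is a currency symbol
```

The decomposed sample is the reason marks are on the list: without \p{M}, a name can be rejected purely because of how its accents happen to be encoded.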

Allowing the Unicode characters above, but controlling spaces and other separators

So you've rolled out your international-friendly name validation, and you're feeling super proud of your adherence to the Unicode standard, and then some *&%^$*%$ kid creates a user name consisting of seven spaces enclosed in some apostrophes, and people are freaking out. Something must be done to rein in the use of spaces!

It's actually pretty simple to modify the regex pattern above to only allow spaces, underscores, apostrophes and dashes between groups of letters, numbers and marks, as in the fourth example. With this, we've got a pretty good chance of keeping even the most persistent of teens with all the time in the world from making annoyingly named users, at least to the extent we're willing to do something about it.
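A short TypeScript sketch comparing the third and fourth patterns on the seven-spaces name. As an assumption, both are written with the braced \p{L} form that JavaScript requires instead of the PCRE shorthand \pL; the constant and helper names are hypothetical.

```typescript
// Pattern 3: any mix of letters, numbers, marks, separators, ' _ and -.
const loosePattern = /^[\p{L}\p{N}\p{M}\p{Z}'_-]+$/u;

// Pattern 4: groups of letters/numbers/marks, joined by exactly one
// separator character, with no separator at the start or the end.
const strictPattern = /^(?:[\p{L}\p{N}\p{M}]+[\p{Z}'_-])*[\p{L}\p{N}\p{M}]+$/u;

// Hypothetical helper name, used only for this illustration.
function passesStrict(name: string): boolean {
  return strictPattern.test(name);
}

// The troublemaker's name: seven spaces enclosed in apostrophes.
const sevenSpaces = "'" + " ".repeat(7) + "'";

console.log(loosePattern.test(sevenSpaces));    // true: pattern 3 lets it through
console.log(passesStrict(sevenSpaces));         // false: no letter group anywhere
console.log(passesStrict("Mary-Jane O'Brien")); // true: single separators between groups
console.log(passesStrict("Jean  Luc"));         // false: two spaces in a row
console.log(passesStrict("_leading"));          // false: starts with a separator
```

Because the group class and the separator class share no characters, there is only one way to split a name into groups, so the nested repetition here should not invite runaway backtracking.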

The Unicode-based approaches above were inspired by the way Laravel / Illuminate validates its alpha_dash rule, commonly used for sanitizing names. A reference listing the regex Unicode property filters is handy in any PCRE regex environment.

"*&%^$*%$ kid" is a fictional character used to illustrate a point. Boss is very much real, however.