Friday, November 14, 2014

Regex for Unicode Variations

Ever had a spammer try to get around a spam rule for a subject line like "Account Verification", which would otherwise be picked up by your regular spam filter? They usually try something sneaky like "Acc0unt Verification", replacing o's with zeros (0), or they insert colons between letters so your pattern matching won't pick up subject lines like "A:ccount Verification".

Worse, they may even substitute vowels with subtle Unicode variations, such as é for e, which makes pattern matching extremely difficult: "Accøünt Vérificatiøn".

Regular expressions can definitely make your life easier, but Unicode makes things more interesting. The first thing you will need is a basic pattern that allows for the stray characters spammers insert between letters to throw off whole-word matches:

^(?i).*(a.?c.?c.?o.?u.?n.?t.?[\s]*v.?e.?r.?i.?f.?i.?c.?a.?t.?i.?o.?n)+.*$
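As a rough sketch, the same idea can be tried out in Python. Two assumptions here are mine rather than the original rule's: the case-insensitive flag is passed as `re.IGNORECASE` (recent Python versions reject an inline `(?i)` that is not at the very start of the pattern), and `search` stands in for the leading and trailing `.*`. The helper name `is_suspicious` is hypothetical.

```python
import re

# Each letter may be followed by one stray character (".?");
# any amount of whitespace is allowed between the two words.
LOOSE = re.compile(
    r"a.?c.?c.?o.?u.?n.?t.?\s*v.?e.?r.?i.?f.?i.?c.?a.?t.?i.?o.?n",
    re.IGNORECASE,
)

def is_suspicious(subject: str) -> bool:
    """Return True if the subject contains a loosely spelled 'account verification'."""
    return LOOSE.search(subject) is not None

if __name__ == "__main__":
    for subject in ["A:ccount Verification", "Acc0unt Verification", "Your invoice"]:
        print(subject, "->", is_suspicious(subject))
```

Note that this version still misses "Acc0unt Verification", because the zero replaces a letter rather than sitting between two letters.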

You will probably want to add some basic variations for the vowels, which will catch any spammer replacing o's with zeros (0), e's with threes (3), i's with ones (1) or lowercase L's, or a's with at symbols (@):

^(?i).*([a@].?c.?c.?[o0].?u.?n.?t.?[\s]*v.?[e3].?r.?[il1].?f.?[il1].?c.?[a@].?t.?[il1].?[o0].?n)+.*$
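Writing these patterns by hand gets tedious, so one option is to generate them from the phrase. The following is only a sketch under my own assumptions: the `LOOKALIKES` table and the `lookalike_pattern` function are hypothetical names, the table holds just the substitutions used in the pattern above, and the flag is again passed as `re.IGNORECASE` instead of an inline `(?i)`.

```python
import re

# Look-alike classes taken from the pattern above; extend as needed.
LOOKALIKES = {"a": "[a@]", "e": "[e3]", "i": "[il1]", "o": "[o0]"}

def lookalike_pattern(phrase: str) -> re.Pattern:
    """Build a loose, case-insensitive pattern for the given phrase."""
    word_patterns = []
    for word in phrase.lower().split():
        # Swap in a character class where one is defined, and allow
        # one stray character after every letter.
        pieces = [LOOKALIKES.get(ch, re.escape(ch)) + ".?" for ch in word]
        word_patterns.append("".join(pieces))
    # Words may be separated by any amount of whitespace, including none.
    return re.compile(r"\s*".join(word_patterns), re.IGNORECASE)

if __name__ == "__main__":
    rule = lookalike_pattern("Account Verification")
    print(rule.pattern)
    print(bool(rule.search("Acc0unt Ver1f1cation")))
```

The generated pattern allows a stray character after the final letter as well, which the handwritten version does not, but that does not change which subjects are caught.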

And if you really want to be pedantic, I've found a good table of Unicode characters and have collected the look-alikes for each vowel below, so you can replace the vowel matching with the following Unicode variations.

[a@À-Åà-åĀ-ąǍǎǞ-ǡǺǻȀ-ȃȦȧȺɑΆΑάαаӐ-ӓᗅᶏᶐḀḁẚẠ-ặἈ-ἏÅᾈ-ᾏᾸ-ᾼ₳]
[e3È-Ëè-ëĒ-ěȄ-ȇȨȩɆɇΈΕеѐёҼ-ҿӖӗᴇḔ-ḝẸ-ệⴹἘ-ἝῈΈ]
[il1Ì-Ïì-ïĨ-ıĺļľŀłƖƗǏǐȈ-ȋɨ-ɭΊΙіїӀḬ-ḯḷ-ḽỈ-ịἰ-Ἷὶίῐ-Ί]
[o0Ò-Øð-øŌ-őǑǒǪ-ǭǾǿȌ-ȏȪ-ȱɸɵΌΟθоѲѳӦ-ӫ০੦௦ᴏṌ-ṓỌ-ợὀ-Ὅὸό]
[uµÙ-Üù-üŨ-ųǓ-ǜȔ-ȗɄᴜṲ-ṻỤ-ựὐ-ὗὺύῠ-ΰ]
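To see the effect of swapping in these classes, here is a small sketch that takes the earlier pattern and replaces each simple vowel class with a Unicode-aware one. It is deliberately limited: only the leading Latin-1 portion of each class above is used, to keep the example short, and the names `ASCII_PATTERN`, `UNICODE_CLASSES`, and `widen` are hypothetical. The full classes can be dropped in the same way, provided your regex engine accepts every range in them.

```python
import re

# The earlier pattern, without the inline flag and surrounding ".*".
ASCII_PATTERN = (
    r"[a@].?c.?c.?[o0].?u.?n.?t.?\s*"
    r"v.?[e3].?r.?[il1].?f.?[il1].?c.?[a@].?t.?[il1].?[o0].?n"
)

# Shortened versions of the vowel classes listed above (assumption:
# only the first ranges of each class, for readability).
UNICODE_CLASSES = {
    "[a@]": "[a@À-Åà-å]",
    "[e3]": "[e3È-Ëè-ë]",
    "[il1]": "[il1Ì-Ïì-ï]",
    "[o0]": "[o0Ò-Øð-ø]",
    "u.?": "[uµÙ-Üù-ü].?",
}

def widen(pattern: str) -> str:
    """Replace each simple vowel class with its Unicode-aware class."""
    for old, new in UNICODE_CLASSES.items():
        pattern = pattern.replace(old, new)
    return pattern

ascii_rule = re.compile(ASCII_PATTERN, re.IGNORECASE)
unicode_rule = re.compile(widen(ASCII_PATTERN), re.IGNORECASE)

if __name__ == "__main__":
    subject = "Accøünt Vérificatiøn"
    print(bool(ascii_rule.search(subject)), bool(unicode_rule.search(subject)))
```

Plain string replacement works here only because each simple class appears verbatim in the pattern; for anything larger, generating the pattern from the phrase is less fragile.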

There are also a few common consonants that spammers substitute to evade pattern matching:

[nŃ-ŋƝƞǸǹɳΝᶇṄ-ṋἠ-ἧ]
[tŢ-ŧƫ-ƮȚțȾᴛṪ-ṱ]
[y¥ÝýÿŶ-ŸƳƴȲȳɎɏΫγϒ-ϔўҮ-ұӮ-ӳẎẏẙỲ-ỹỾỿὙ-ὟⲨⲩ]

The resulting regex pattern may look long and complicated, but it can prevent many spam emails from slipping through the cracks.