UTF-8 matchers: Letters, Marks, Punctuation etc.
Matching letters in different alphabets
The examples below are in Ruby, but the same matchers should be available in any modern language.
Let’s say we have the string "AℵNaïve", produced by Messy Artificial Intelligence. It consists entirely of letters, but the generic \w matcher won’t match much of it:
▶ "AℵNaïve"[/\w+/]
#⇒ "A"
The correct way to match a Unicode letter together with its combining marks is to use \X, which matches a grapheme cluster. There is a caveat for Ruby, though: Onigmo, Ruby’s regex engine, still uses the old definition of a grapheme cluster and has not yet been updated to the Extended Grapheme Cluster defined in Unicode Standard Annex #29.
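As a rough illustration of what \X does, the sketch below splits a word into grapheme clusters. The sample word is built with an explicit \u0308 escape (an assumption about how the ï is encoded) so that the combining diaeresis is unambiguous. A base letter followed by a combining mark is a single cluster under both the legacy and the extended definition, so this particular case should not depend on the engine version.

```ruby
# "naïve" spelled with a decomposed ï: "i" followed by U+0308 COMBINING DIAERESIS.
word = "nai\u0308ve"

# \X consumes one grapheme cluster per match.
clusters = word.scan(/\X/)

puts word.length      # 6 code points
puts clusters.length  # 5 grapheme clusters: the "i" and its diaeresis stay together
```

Note that \X is not limited to letters: digits, punctuation and whitespace each form a cluster too, so it answers "what is one user-perceived character" rather than "what is a letter".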
So, for Ruby, we can use a workaround: \p{L} does almost fine, except that it stops at the combining diacritical mark on the i:
▶ "AℵNaïve"[/\p{L}+/]
#⇒ "AℵNai"
By adding the Mark category (\p{M}) to the character class, we can finally match everything:
▶ "AℵNaïve"[/[\p{L}\p{M}]+/]
#⇒ "AℵNaïve"
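To make the difference easier to reproduce, here is a minimal sketch that wraps the combined class in a helper. The names LETTERS_WITH_MARKS and letter_runs are hypothetical, chosen for illustration only, and the sample text is written with explicit escapes (\u2135 for ℵ, \u0308 for the combining diaeresis), which is an assumption about how the original string is encoded.

```ruby
# Hypothetical constant: one or more letters (\p{L}) or combining marks (\p{M}).
LETTERS_WITH_MARKS = /[\p{L}\p{M}]+/

# Hypothetical helper: returns every run of letters, keeping their combining marks.
def letter_runs(text)
  text.scan(LETTERS_WITH_MARKS)
end

sample = "A\u2135Nai\u0308ve, 42 caf\u00E9s!"

p sample[/\p{L}+/]      # letters only: stops before the combining diaeresis
p letter_runs(sample)   # letters plus marks: digits and punctuation are skipped
```

One design caveat: because \p{M} sits in the same character class as \p{L}, a stray combining mark with no base letter would also be matched. If that matters, a pattern such as (?:\p{L}\p{M}*)+ requires each mark to follow a letter.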