Regular Expressions

Character classes

Remarks#

Simple classes

Regex Matches
[abc] Any of the following characters: a, b, or c
[a-z] Any character from a to z, inclusive (this is called a range)
[0-9] Any digit from 0 to 9, inclusive

Common classes

Some groups/ranges of characters are so often used, they have special abbreviations:

Regex Matches
\w Alphanumeric characters plus the underscore (also referred to as “word characters”)
\W Non-word characters (same as [^\w])
\d Digits (wider than [0-9] since include Persian digits, Indian ones etc.)
\D Non-digits (shorter than [^0-9] since reject Persian digits, Indian ones etc.)
\s Whitespace characters (spaces, tabs, etc…) Note: may vary depending on your engine/context
\S Non-whitespace characters

Negating classes

A caret (^) after the opening square bracket works as a negation of the characters that follow it. This will match all characters that are not in the character class.

Negated character classes also match line break characters, therefore if these are not to be matched, the specific line break characters must be added to the class (\r and/or \n).

Regex Matches
[^AB] Any character other than A and B
[^\d] Any character, except digits

The basics

Suppose we have a list of teams, named like this: Team A, Team B, …, Team Z. Then:

  • Team [AB]: This will match either either Team A or Team B
  • Team [^AB]: This will match any team except Team A or Team B

We often need to match characters that “belong” together in some context or another (like letters from A through Z), and this is what character classes are for.

Match different, similar words

Consider the character class [aeiou]. This character class can be used in a regular expression to match a set of similarly spelled words.

b[aeiou]t matches:

  • bat
  • bet
  • bit
  • bot
  • but

It does not match:

  • bout
  • btt
  • bt

Character classes on their own match one and only one character at a time.

Non-alphanumerics matching (negated character class)

[^0-9a-zA-Z]

This will match all characters that are neither numbers nor letters (alphanumerical characters). If the underscore character _ is also to be negated, the expression can be shortened to:

[^\w]

Or:

\W

In the following sentences:

  1. Hi, what’s up?

  2. I can’t wait for 2017!!!

The following characters match:

  1. ,, , ', ? and the end of line character.

  2. ', , ! and the end of line character.

UNICODE NOTE
Note that some flavors with Unicode character properties support may interpret \w and \W as [\p{L}\p{N}_] and [^\p{L}\p{N}_] which means other Unicode letters and numeric characters will be included as well (see PCRE docs). Here is a PCRE \w test:enter image description here

In .NET, \w = [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Mn}\p{Nd}\p{Pc}], and note it does not match \p{Nl} and \p{No} unlike PCRE (see the \w .NET documentation):

Picture

Note that for some reason, Unicode 3.1 lowercase letters (like 𝐚𝒇𝓌𝔨𝕨𝗐𝛌𝛚) are not matched.

Java’s (?U)\w will match a mix of what \w matches in PCRE and .NET:enter image description here

Non-digits matching (negated character class)

[^0-9]

This will match all characters that are not ASCII digits.

If Unicode digits are also to be negated, the following expression can be used, depending on your flavor/language settings:

[^\d]

This can be shortened to:

\D

You may need to enable Unicode character properties support explicitly by using the u modifier or programmatically in some languages, but this may be non-obvious. To convey the intent explicitly, the following construct can be used (when support is available):

\P{N}

Which by definition means: any character which is not a numeric character in any script. In a negated character range, you may use:

[^\p{N}]

In the following sentences:

  1. Hi, what’s up?

  2. I can’t wait for 2017!!!

The following characters will be matched:

  1. ,, , ', ?, the end of line character and all letters (lowercase and uppercase).

  2. ', , !, the end of line character and all letters (lowercase and uppercase).

Character class and common problems faced by beginner

1. Character Class

Character class is denoted by []. Content inside a character class is treated as single character separately. e.g. suppose we use

[12345]

In the example above, it means match 1 or 2 or 3 or 4 or 5 . In simple words, it can be understood as or condition for single characters (stress on single character)

1.1 Word of caution

  • In character class, there is no concept of matching a string. So, if you are using regex [cat], it does not mean that it should match the word cat literally but it means that it should match either c or a or t. This is a very common misunderstanding existing among people who are newer to regex.
  • Sometimes people use | (alternation) inside character class thinking it will act as OR condition which is wrong. e.g. using [a|b] actually means match a or | (literally) or b.

2. Range in character class

Range in character class is denoted using - sign. Suppose we want to find any character within English alphabets A to Z. This can be done by using the following character class

[A-Z]

This could be done for any valid ASCII or unicode range. Most commonly used ranges include [A-Z], [a-z] or [0-9]. Moreover these ranges can be combined in character class as

[A-Za-z0-9]

This means that match any character in the range A to Z or a to z or 0 to 9. The ordering can be anything. So the above is equivalent to [a-zA-Z0-9] as long as the range you define is correct.

2.1 Word of caution

  • Sometimes when writing ranges for A to Z people write it as [A-z]. This is wrong in most cases because we are using z instead of Z. So this denotes match any character from ASCII range 65 (of A) to 122 (of z) which includes many unintended character after ASCII range 90 (of Z). HOWEVER, [A-z] can be used to match all [a-zA-Z] letters in POSIX-style regex when collation is set for a particular language.

[[ "ABCEDEF[]_abcdef" =~ ([A-z]+) ]] && echo "${BASH_REMATCH[1]}" on Cygwin with LC_COLLATE="en_US.UTF-8" yields ABCEDF. If you set LC_COLLATE to C (on Cygwin, done with export), it will give the expected ABCEDEF[]_abcdef.

  • Meaning of - inside character class is special. It denotes range as explained above. What if we want to match - character literally? We can’t put it anywhere otherwise it will denote ranges if it is put between two characters. In that case we have to put - in starting of character class like [-A-Z] or in end of character class like [A-Z-] or escape it if you want to use it in middle like [A-Z\-a-z].

3. Negated character class

Negated character class is denoted by [^..]. The caret sign ^ denotes match any character except the one present in character class. e.g.

[^cat]

means match any character except c or a or t.

3.1 Word of caution

  • The meaning of caret sign ^ maps to negation only if its in the starting of character class. If its anywhere else in character class it is treated as literal caret character without any special meaning.
  • Some people write regex like [^]. In most regex engines, this gives an error. The reason being when you are using ^ in the starting position, it expects at least one character that should be negated. In JavaScript though, this is a valid construct matching anything but nothing, i.e. matches any possible symbol (but diacritics, at least in ES5).

POSIX Character classes

POSIX character classes are predefined sequences for a certain set of characters.

Character class Description
[:alpha:] Alphabetic characters
[:alnum:] Alphabetic characters and digits
[:digit:] Digits
[:xdigit:] Hexadecimal digits
[:blank:] Space and Tab
[:cntrl:] Control characters
[:graph:] Visible characters (anything except spaces and control characters)
[:print:] Visible characters and spaces
[:lower:] Lowercase letters
[:upper:] Uppercase letters
[:punct:] Punctuation and symbols
[:space:] All whitespace characters, including line breaks

Additional character classes my be available depending on the implementation and/or locale.

Character class Description
[:<:] Beginning of word
[:>:] End of word
[:ascii:] ASCII Characters
[:word:] Letters, digits and underscore. Equivalent to \w

To use the inside a bracket sequence (aka. character class), you should also include the square brackets. Example:

[[:alpha:]]

This will match one alphabetic character.

[[:digit:]-]{2}

This will match 2 characters, that are either digits or -. The following will match:

  • --
  • 11
  • -2
  • 3-

More information is available on: Regular-expressions.info


This modified text is an extract of the original Stack Overflow Documentation created by the contributors and released under CC BY-SA 3.0 This website is not affiliated with Stack Overflow