Regular Expressions

Regex modifiers (flags)

Introduction#

Regular expression patterns are often used with modifiers (also called flags) that redefine regex behavior. Regex modifiers can be regular (e.g. /abc/i) and inline (or embedded) (e.g. (?i)abc). The most common modifiers are global, case-insensitive, multiline and dotall modifiers. However, regex flavors differ in the number of supported regex modifiers and their types.

Remarks#

PCRE Modifiers

Modifier Inline Description
PCRE_CASELESS (?i) Case insensitive match
PCRE_MULTILINE (?m) Multiple line matching
PCRE_DOTALL (?s) . matches new lines
PCRE_ANCHORED (?A) Meta-character ^ matches only at the start
PCRE_EXTENDED (?x) White-spaces are ignored
PCRE_DOLLAR_ENDONLY n/a Meta-character $ matches only at the end
PCRE_EXTRA (?X) Strict escape parsing
PCRE_UTF8 Handles UTF-8 characters
PCRE_UTF16 Handles UTF-16 characters
PCRE_UTF32 Handles UTF-32 characters
PCRE_UNGREEDY (?U) Sets the engine to lazy matching
PCRE_NO_AUTO_CAPTURE (?:) Disables auto-capturing groups

Java Modifiers

Modifier (Pattern.###) Value Description
UNIX_LINES 1 Enables Unix lines mode.
CASE_INSENSITIVE 2 Enables case-insensitive matching.
COMMENTS 4 Permits whitespace and comments in a pattern.
MULTILINE 8 Enables multiline mode.
LITERAL 16 Enables literal parsing of the pattern.
DOTALL 32 Enables dotall mode.
UNICODE_CASE 64 Enables Unicode-aware case folding.
CANON_EQ 128 Enables canonical equivalence.
UNICODE_CHARACTER_CLASS 256 Enables the Unicode version of Predefined character classes and POSIX character classes.

DOTALL modifier

A regex pattern where a DOTALL modifier (in most regex flavors expressed with s) changes the behavior of . enabling it to match a newline (LF) symbol:

/cat (.*?) dog/s

This Perl-style regex will match a string like "cat fled from\na dog" capturing "fled from\na" into Group 1.

An inline version: (?s) (e.g. (?s)cat (.*?) dog)

Note: In Ruby, the DOTALL modifier equivalent is m, Regexp::MULTILINE modifier (e.g. /a.*b/m).

Note: JavaScript does not provide a DOTALL modifier, so a . can never be allowed to match a newline character. In order to achieve the same effect, a workaround is necessary, e. g. substituting all the .s with a catch-all character class like [\S\s], or a not nothing character class [^] (however, this construct will be treated as an error by all other engines, and is thus not portable).

MULTILINE modifier

Another example is a MULTILINE modifier (usually expressed with m flag (not in Oniguruma (e.g. Ruby) that uses m to denote a DOTALL modifier)) that makes ^ and $ anchors match the start/end of a line, not the start/end of the whole string.

/^My Line \d+$/gm

will find all lines that start with My Line, then contain a space and 1+ digits up to the line end.

An inline version: (?m) (e.g. (?m)^My Line \d+$)

NOTE: In Oniguruma (e.g. in Ruby), and also in almost any text editors supporting regexps, the ^ and $ anchors denote line start/end positions by default. You need to use \A to define the whole document/string start and \z to denote the document/string end. The difference between the \Z and \z is that the former can match before the final newline (LF) symbol at the end of the string (e.g. /\Astring\Z/ will find a match in "string\n") (except Python, where \Z behavior is equal to \z and \z anchor is not supported).

IGNORE CASE modifier

The common modifier to ignore case is i:

/fog/i

will match Fog, foG, etc.

The inline version of the modifier looks like (?i).

Notes:

In Java, by default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this (CASE_INSENSITIVE) flag. (e.g. Pattern p = Pattern.compile("YOUR_REGEX", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);). Some more on this can be found at Case-Insensitive Matching in Java RegEx. Also, UNICODE_CHARACTER_CLASS can be used to make matching Unicode aware.

VERBOSE / COMMENT / IgnorePatternWhitespace modifier

The modifier that allows using whitespace inside some parts of the pattern to format it for better readability and to allow comments starting with #:

/(?x)^          # start of string
  (?=\D*\d)     # the string should contain at least 1 digit 
  (?!\d+$)      # the string cannot consist of digits only
  \#            # the string starts with a hash symbol
  [a-zA-Z0-9]+ # the string should have 1 or more alphanumeric symbols
  $             # end of string
/

Example of a string: #word1here. Note the # symbol is escaped to denote a literal # that is part of a pattern.

Unescaped white space in the regular expression pattern is ignored, escape it to make it a part of the pattern.

Usually, the whitespace inside character classes ([...]) is treated as a literal whitespace, except in Java.

Also, it is worth mentioning that in PCRE, .NET, Python, Ruby Oniguruma, ICU, Boost regex flavors one can use (?#:...) comments inside the regex pattern.

Explicit Capture modifier

This is a .NET regex specific modifier expressed with n. When used, unnamed groups (like (\d+)) are not captured. Only valid captures are explicitly named groups (e.g. (?<name> subexpression)).

(?n)(\d+)-(\w+)-(?<id>\w+)

will match the whole 123-1_abc-00098, but (\d+) and (\w+) won’t create groups in the resulting match object. The only group will be ${id}. See demo.

UNICODE modifier

The UNICODE modifier, usually expressed as u (PHP, Python) or U (Java), makes the regex engine treat the pattern and the input string as Unicode strings and patterns, make the pattern shorthand classes like \w, \d, \s, etc. Unicode-aware.

/\A\p{L}+\z/u

is a PHP regex to match strings that consist of 1 or more Unicode letters. See the regex demo.

Note that in PHP, the /u modifier enables the PCRE engine to handle strings as UTF8 strings (by turning on PCRE_UTF8 verb) and make the shorthand character classes in the pattern Unicode aware (by enabling PCRE_UCP verb, see more at pcre.org).

Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

In Python 2.x, the re.UNICODE only affects the pattern itself: Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.

An inline version: (?u) in Python, (?U) in Java. For example:

print(re.findall(ur"(?u)\w+", u"Dąb")) # [u'D\u0105b']
print(re.findall(r"\w+", u"Dąb"))      # [u'D', u'b']

System.out.println("Dąb".matches("(?U)\\w+")); // true
System.out.println("Dąb".matches("\\w+"));     // false

PCRE_DOLLAR_ENDONLY modifier

The PCRE-compliant PCRE_DOLLAR_ENDONLY modifier that makes the $ anchor match at the very end of the string (excluding the position before the final newline in the string).

/^\d+$/D

is equal to

/^\d+\z/

and matches a whole string that consists of 1 or more digits and will not match "123\n", but will match "123".

PCRE_ANCHORED modifier

Another PCRE-compliant modifier expressed with /A modifier. If this modifier is set, the pattern is forced to be “anchored”, that is, it is constrained to match only at the start of the string which is being searched (the “subject string”). This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.

/man/A

is the same as

/^man/

PCRE_UNGREEDY modifier

The PCRE-compliant PCRE_UNGREEDY flag expressed with /U. It switches greediness inside a pattern: /a.*?b/U = /a.*b/ and vice versa.

PCRE_INFO_JCHANGED modifier

One more PCRE modifier that allows the use of duplicate named groups.

NOTE: only inline version is supported - (?J), and must be placed at the start of the pattern.

If you use

/(?J)\w+-(?:new-(?<val>\w+)|\d+-empty-(?<val>[^-]+)-collection)/

the “val” group values will be never empty (will always be set). A similar effect can be achieved with branch reset though.

PCRE_EXTRA modifier

A PCRE modifier that causes an error if any backslash in a pattern is followed by a letter that has no special meaning. By default, a backslash followed by a letter with no special meaning is treated as a literal.

E.g.

/big\y/

will match bigy, but

/big\y/X

will throw an exception.

Inline version: (?X)


This modified text is an extract of the original Stack Overflow Documentation created by the contributors and released under CC BY-SA 3.0 This website is not affiliated with Stack Overflow