Regular Expressions

Atomic Grouping

Introduction

Regular non-capturing groups allow the engine to backtrack into the group and try something different, such as another alternative of an alternation, or fewer characters when a greedy quantifier is used.

Atomic groups differ from regular non-capturing groups in that backtracking into the group is forbidden. Once the group exits, all backtracking information recorded inside it is discarded, so no alternate matches of the group can be attempted.

Remarks

A possessive quantifier behaves like an atomic group in that the engine cannot backtrack into the quantified token or group once it has matched.

The following patterns are functionally equivalent, although some will be faster than others:

a*+abc
(?>a*)abc
(?:a+)*+abc
(?:a)*+abc
(?:a*)*+abc
(?:a*)++abc

Grouping with (?>)

Using an Atomic Group

Atomic groups have the format (?>...), with ?> immediately after the opening parenthesis.

Consider the following sample text:

ABC

The regex will attempt to match starting at position 0 of the text, which is before the A in ABC.

If a case-insensitive expression (?>a*)abc were used, the (?>a*) would match 1 A character, leaving

BC

as the remaining text to match. The (?>a*) group is exited, and abc is attempted on the remaining text, which fails to match.

The engine cannot backtrack into the atomic group, so the current attempt fails. The engine then moves to the next position in the text, position 1, which is after the A and before the B of ABC.

The regex (?>a*)abc is attempted again, and (?>a*) matches A 0 times, leaving

BC

as the remaining text to match. The (?>a*) group is exited and abc is attempted, which fails.

Again, the engine cannot backtrack into the atomic group, so the current attempt fails. The regex continues to fail at each remaining position until all positions in the text have been exhausted, and no match is found.

Using a Non-Atomic Group

Regular non-capturing groups have the format (?:...), with ?: immediately after the opening parenthesis.

Given the same sample text, but with the case-insensitive expression (?:a*)abc instead, a match occurs because the engine is allowed to backtrack into the group.

At first, (?:a*) will consume the letter A in the text

ABC

leaving

BC

as the remaining text to match. The (?:a*) group is exited, and abc is attempted on the remaining text, which fails to match.

The engine backtracks into the (?:a*) group and attempts to match 1 fewer character: instead of matching 1 A character, it matches 0 A characters, and the (?:a*) group is exited. This leaves

ABC

as the remaining text to match. The rest of the pattern, abc, is now able to successfully match the remaining text.
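The outcome of this backtracking can be observed in Python. In the sketch below, the second search swaps the non-capturing group for a capturing one; that swap is an assumption made only so the group's final contents can be inspected, and it does not change what the pattern matches.

```python
import re

# Non-capturing version from the text.
m = re.search(r"(?:a*)abc", "ABC", re.IGNORECASE)
print(m.group(), m.span())  # ABC (0, 3)

# Same pattern with a capturing group, used here only as a probe.
probe = re.search(r"(a*)abc", "ABC", re.IGNORECASE)
print(repr(probe.group(1)))  # '' because the group gave the A back
```

The empty capture shows that the group ended up matching 0 A characters, leaving the A for the literal abc that follows.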

Other Example Text

Consider this sample text, with both atomic and non-atomic groups (again, case-insensitive):

AAAABC

The regex will attempt to match starting at position 0 of the text, which is before the first A in AAAABC.

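The pattern using the atomic group (?>a*)abc will be unable to match, behaving almost identically to the atomic ABC example above: starting at position 0, all 4 of the A characters are first matched with (?>a*) (leaving BC as the remaining text to match), and abc is unable to match on that text. The engine cannot backtrack into the group, so this attempt fails. Attempts from each later position fail for the same reason, because (?>a*) always consumes every remaining A before abc is tried, so no match is found.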

The pattern using the non-atomic group (?:a*)abc will be able to match, behaving similarly to the non-atomic ABC example above: all 4 of the A characters are first matched with (?:a*) (leaving BC as the remaining text to match), and abc is unable to match on that text. The engine can backtrack into the group, so one fewer A is attempted: 3 A characters are matched instead of 4 (leaving ABC as the remaining text to match), and abc is able to successfully match on that text.


This modified text is an extract of the original Stack Overflow Documentation created by the contributors and released under CC BY-SA 3.0.