Regular Expressions

Backtracking

What causes Backtracking?

To find a match, the regex engine will consume characters one by one. When a partial match begins, the engine will remember the start position so it can go back in case the following characters don’t complete the match.

  • If the match is complete, the is no backtracking
  • If the match isn’t complete, the engine will backtrack the string (like when you rewind an old tape) to try to find a whole match.

For example: \d{3}[a-z]{2} against the string abc123def will be browsed as such:

abc123def
^ Does not match \d
abc123def
 ^ Does not match \d
abc123def
  ^ Does not match \d
abc123def
   ^ Does match \d (first one)
abc123def
    ^ Does match \d (second one)
abc123def
     ^ Does match \d (third one)
abc123def
      ^ Does match [a-z] (first one)
abc123def
       ^ Does match [a-z] (second one)
           MATCH FOUND

Now lets change the regex to \d{2}[a-z]{2} against the same string (abc123def):

abc123def
^ Does not match \d
abc123def
 ^ Does not match \d
abc123def
  ^ Does not match \d
abc123def
   ^ Does match \d (first one)
abc123def
    ^ Does match \d (second one)
abc123def
     ^ Does not match [a-z]
abc123def
    ^ BACKTRACK to catch \d{2} => (23)
abc123def
      ^ Does match [a-z] (first one)
abc123def
       ^ Does match [a-z] (second one)
           MATCH FOUND

Why can backtracking be a trap?

Backtracking can be caused by optional quantifiers or alternation constructs, because the regex engine will try to explore every path. If you run the regex a+b against aaaaaaaaaaaaaa there is no match and the engine will find it pretty fast.

But if you change the regex to (aa*)+b the number of combinations will grow pretty fast, and most (not optimized) engines will try to explore all the paths and will take an eternity to try to find a match or throw a timeout exception. This is called catastrophic backtracking.

Of course, (aa*)+b seems a newbie error but it’s here to illustrate the point and sometimes you’ll end up with the same issue but with more complicated patterns.

A more extreme case of catastrophic backtracking occurs with the regex (x+x+)+y (you’ve probably seen it before here and here), which needs exponential time to figure out that a string that contains xs and nothing else (e.g xxxxxxxxxxxxxxxxxxxx) don’t match it.

How to avoid it?

Be as specific as possible, reduce as much as possible the possible paths. Note that some regex matchers are not vulnerable to backtracking, such as those included in awk or grep because they are based on Thompson NFA.


This modified text is an extract of the original Stack Overflow Documentation created by the contributors and released under CC BY-SA 3.0 This website is not affiliated with Stack Overflow