How does non-greedy matching work with regular expressions?

Solution Unverified - Updated -

Environment

  • Red Hat Enterprise linux
  • pcregrep

Issue

In some cases, regular expression with non-greedy quantifiers seem to return greedy results

Resolution

Non-greedy quantifiers, don't guarantee the shortest result. They merely hint the regular expression library, which path to try next.

Some examples (note that -o or --only-matching only prints out the match itself)

$ echo "abc def ghi" | pcregrep -o 'abc.*ghi'
abc def ghi
$ echo "abc def ghi" | pcregrep -o 'abc.*?ghi'
abc def ghi

No surprises here, the string matches both the greedy and non-greedy regexp variants, and both matches are the same.

$ echo "abc def ghi ghi" | pcregrep -o 'abc.*ghi'
abc def ghi ghi
$ echo "abc def ghi ghi" | pcregrep -o 'abc.*?ghi'
abc def ghi

This is the greedy versus the non-greedy matches in action. The greedy variant matches as much as possible, and backtracks from the end of the string, until it finds a match for ghi. The non-greedy variant returns a shorter match. In reality, it returns the match it encounters first, too. The non-greedy modifier, makes the regexp engine explore the path that continues with g first. It sees a match, and returns it.

$ echo "abc def abc ghi" | pcregrep -o 'abc.*ghi'
abc def abc ghi
$ echo "abc def abc ghi" | pcregrep -o 'abc.*?ghi'
abc def abc ghi

Surprisingly, the whole string is matched in both cases. This seems to be wrong, but really isn't. Let's see what happens here:
First, the string abc is found, right at the start of the string. Then, the .*? starts matching, until an occurence of ghi is found. Coincidentally, what's matched by .*? also contains another occurence of abc, but what is matched is the shortest match for .*?.

Let's take this one step further:

$ echo "abc def abc ghi ghi" | pcregrep -o 'abc.*ghi'
abc def abc ghi ghi
$ echo "abc def abc ghi ghi" | pcregrep -o 'abc.*?ghi'
abc def abc ghi

Here, it becomes clear, that the match for .*? doesn't include the first occurence of ghi.

if you would need to find the shortest match, you will have to exclude bits of the first part in the middle part of the expression. A common way of doing this, is to have a negated character class.

$ echo "abc def abc def ghi ghi" | pcregrep -o 'abc[^ag]*ghi'
abc def ghi

Root Cause

Non-greedy matching is often understood to return the shortest match, but this is not the case. The non-greedy quantifier merely changes the search order for different paths.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments