How are nested capturing groups numbered in regular expressions?

Java.NetRegexPerlLanguage Agnostic

Java Problem Overview


Is there a defined behavior for how regular expressions should handle the capturing behavior of nested parentheses? More specifically, can you reasonably expect that different engines will capture the outer parentheses in the first position, and nested parentheses in subsequent positions?

Consider the following PHP code (using PCRE regular expressions)

<?php
  $test_string = 'I want to test sub patterns';
  preg_match('{(I (want) (to) test) sub (patterns)}', $test_string, $matches);
  print_r($matches);
?>

Array
(
	[0] => I want to test sub patterns	//entire pattern
	[1] => I want to test			//entire outer parenthesis
	[2] => want				//first inner
	[3] => to				//second inner
	[4] => patterns				//next parentheses set
)

The entire parenthesized expression is captured first (I want to test), and then the inner parenthesized patterns are captured next ("want" and "to"). This makes logical sense, but I could see an equally logical case being made for first capturing the sub parentheses, and THEN capturing the entire pattern.

So, is this "capture the entire thing first" defined behavior in regular expression engines, or is it going to depend on the context of the pattern and/or the behavior of the engine (PCRE being different than C#'s being different than Java's being different than etc.)?

Java Solutions


Solution 1 - Java

From perlrequick

> If the groupings in a regex are > nested, $1 gets the group with the > leftmost opening parenthesis, $2 the > next opening parenthesis, etc.

Caveat: Excluding non-capture group opening parenthesis (?=)

Update

I don't use PCRE much, as I generally use the real thing ;), but PCRE's docs show the same as Perl's:

> SUBPATTERNS > > > >2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain number for the capturing subpatterns. > > For example, if the string "the red king" is matched against the pattern > > the ((red|white) (king|queen)) > > the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.

If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined--"Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.

Solution 2 - Java

Yeah, this is all pretty much well defined for all the languages you're interested in:

The  first  pair  of  integers, ovector[0] and ovector[1], identify the
portion of the subject string matched by the entire pattern.  The next
pair  is  used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
has  been  set.  For example, if two substrings have been captured, the
returned value is 3. If there are no capturing subpatterns, the  return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.

  • Perl's different - <http://perldoc.perl.org/perlre.html#Capture-buffers>
    $1, $2 etc. match capturing groups as you'd expect (i.e. by occurrence of opening bracket), however $0 returns the program name, not the entire query string - to get that you use $& instead.

You'll more than likely find similar results for other languages (Python, Ruby, and others).

You say that it's equally logical to list the inner capture groups first and you're right - it's just be a matter of indexing on closing, rather than opening, parens. (if I understand you correctly). Doing this is less natural though (for example it doesn't follow reading direction convention) and so makes it more difficult (probably not significantly) to determine, by insepection, which capturing group will be at a given result index.

Putting the entire match string being in position 0 also makes sense - mostly for consistency. It allows the entire matched string to remain at the same index regardless of the number capturing groups from regex to regex and regardless of the number of capturing groups that actually match anything (Java for example will collapse the length of the matched groups array for each capturing group does not match any content (think for example something like "a (.*)pattern"). You could always inspect capturing_group_results[capturing_group_results_length - 2], but that doesn't translate well to languages to Perl which dynamically create variables ($1, $2 etc.) (Perl's a bad example of course, since it uses $& for the matched expression, but you get the idea :).

Solution 3 - Java

Every regex flavor I know numbers groups by the order in which the opening parentheses appear. That outer groups are numbered before their contained sub-groups is just a natural outcome, not explicit policy.

Where it gets interesting is with named groups. In most cases, they follow the same policy of numbering by the relative positions of the parens--the name is merely an alias for the number. However, in .NET regexes the named groups are numbered separately from numbered groups. For example:

Regex.Replace(@"one two three four", 
              @"(?<one>\w+) (\w+) (?<three>\w+) (\w+)",
              @"$1 $2 $3 $4")

// result: "two four one three"

In effect, the number is an alias for the name; the numbers assigned to named groups start where the "real" numbered groups leave off. That may seem like a bizarre policy, but there's a good reason for it: in .NET regexes you can use the same group name more than once in a regex. That makes possible regexes like the one from this thread for matching floating-point numbers from different locales:

^[+-]?[0-9]{1,3}
(?:
    (?:(?<thousand>\,)[0-9]{3})*
    (?:(?<decimal>\.)[0-9]{2})?
|
    (?:(?<thousand>\.)[0-9]{3})*
    (?:(?<decimal>\,)[0-9]{2})?
|
    [0-9]*
    (?:(?<decimal>[\.\,])[0-9]{2})?
)$

If there's a thousands separator, it will be saved in group "thousand" no matter which part of the regex matched it. Similarly, the decimal separator (if there is one) will always be saved in group "decimal". Of course, there are ways to identify and extract the separators without reusable named groups, but this way is so much more convenient, I think it more than justifies the weird numbering scheme.

And then there's Perl 5.10+, which gives us more control over capturing groups than I know what to do with. :D

Solution 4 - Java

The order of capturing in the order of the left paren is standard across all the platforms I've worked in. (perl, php, ruby, egrep)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionAlan StormView Question on Stackoverflow
Solution 1 - JavadaotoadView Answer on Stackoverflow
Solution 2 - JavaAlan DonnellyView Answer on Stackoverflow
Solution 3 - JavaAlan MooreView Answer on Stackoverflow
Solution 4 - JavaDevin CeartasView Answer on Stackoverflow