Can regular expressions be used to match nested patterns?

Regex Problem Overview

Is it possible to write a regular expression that matches a nested pattern that occurs an unknown number of times? Specifically, can a regular expression match an opening and closing brace when there is an unknown number of open/close braces nested within the outer braces?

For example:

public MyMethod()
{
  if (test)
  {
    // More { }
  }

  // More { }
} // End

Should match:

{
  if (test)
  {
    // More { }
  }

  // More { }
}

Regex Solutions

Solution 1 - Regex

No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

You can match nested/paired elements up to a fixed depth, where the depth is limited only by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e. a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: general context-free parsing is O(n^3) in the worst case versus O(n) for a finite automaton, with n = length(input), although LL and LR parsers themselves run in linear time on the grammars they accept.

There are many parser generators available, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.
For more background: Automata Theory at Wikipedia
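
When only one kind of bracket is involved, the extra memory a push-down automaton provides can be as small as a depth counter. The following is a minimal Python sketch of that idea, not a real parser: it assumes braces inside comments and string literals should be counted like any others, and the function name `outer_block` is invented for illustration.

```python
def outer_block(text, opener="{", closer="}"):
    """Return the first balanced opener...closer span in text, or None."""
    depth = 0
    start = None
    for index, char in enumerate(text):
        if char == opener:
            if depth == 0:
                start = index  # remember where the outermost block begins
            depth += 1
        elif char == closer and depth > 0:
            depth -= 1
            if depth == 0:
                return text[start:index + 1]
    return None  # no opener found, or the block never closes


source = """public MyMethod()
{
  if (test)
  {
    // More { }
  }

  // More { }
} // End"""

print(outer_block(source))
```

On the example from the question this prints the block from the first `{` through the `}` before `// End`. A stray closer that appears before any opener is ignored, which is a design choice of this sketch rather than a rule.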

Solution 2 - Regex

With a regex engine that supports recursive subpattern calls (PCRE, for example), checking for nested patterns is very easy.

'/(\((?>[^()]+|(?1))*\))/'
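
The `(?1)` construct calls capture group 1 recursively, which Python's built-in `re` module does not support. The same recursion can be written out by hand. The following is a rough Python sketch of what the pattern does, not a drop-in replacement; the names `match_group` and `find_groups` are invented for illustration.

```python
def match_group(text, pos=0):
    """Return the index just past the balanced group starting at pos, or -1."""
    if pos >= len(text) or text[pos] != "(":
        return -1
    pos += 1
    while pos < len(text):
        char = text[pos]
        if char == ")":
            return pos + 1  # closing parenthesis of this group
        if char == "(":
            pos = match_group(text, pos)  # plays the role of (?1)
            if pos == -1:
                return -1
        else:
            pos += 1  # plays the role of [^()]+
    return -1  # input ended before the group closed


def find_groups(text):
    """Yield every outermost balanced (...) group found in text."""
    pos = 0
    while pos < len(text):
        end = match_group(text, pos) if text[pos] == "(" else -1
        if end == -1:
            pos += 1
        else:
            yield text[pos:end]
            pos = end


print(list(find_groups("a (b (c) d) e (f) (g")))
```

Because the sketch recurses once per nesting level, very deeply nested input would hit Python's recursion limit; an explicit stack or counter avoids that.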

Solution 3 - Regex

Probably working Perl solution, if the string is on one line:

my $NesteD ;
$NesteD = qr/ \{( [^{}] | (??{ $NesteD }) )* \} /x ;

if ( $Stringy =~ m/\b( \w+$NesteD )/x ) {
    print "Found: $1\n" ;
}

HTH

EDIT: check:

http://dev.perl.org/perl6/rfc/145.html
ruby information: http://www.ruby-forum.com/topic/112084
more perl: http://www.perlmonks.org/?node_id=660316
even more perl: https://metacpan.org/pod/Text::Balanced
perl, perl, perl: http://perl.plover.com/yak/regex/samples/slide083.html

And one more thing by Torsten Marek (who correctly pointed out that it's not a regex anymore):

http://coding.derkeiler.com/Archive/Perl/comp.lang.perl.misc/2008-03/msg01047.html

Solution 4 - Regex

Yes, if it is the .NET regex engine. The .NET engine supports balancing groups, which in effect supply the finite state machine with an external stack.

Solution 5 - Regex

The Pumping lemma for regular languages is the reason why you can't do that.

The generated automaton will have a finite number of states, say k, so a string of k+1 opening braces is bound to have a state repeated somewhere (as the automaton processes the characters). The part of the string between the two visits to that state can be repeated any number of times and the automaton will not know the difference.

In particular, if it accepts k+1 opening braces followed by k+1 closing braces (which it should), it will also accept the pumped number of opening braces followed by an unchanged k+1 closing braces (which it shouldn't).
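
To make the state-reuse argument concrete, here is a small Python sketch of one particular finite automaton that tries to track brace depth. The saturating-counter design and the name `make_depth_dfa` are assumptions chosen for illustration; the lemma says that every finite automaton fails in some comparable way, not only this one.

```python
def make_depth_dfa(max_depth):
    """Build a DFA that tracks brace depth with only max_depth + 1 live states."""
    dead = -1  # trap state entered on an unmatched closing brace

    def step(state, char):
        if state == dead:
            return dead
        if char == "{":
            # With no states left, the automaton has to reuse one.
            return min(state + 1, max_depth)
        if char == "}":
            return state - 1 if state > 0 else dead
        return state

    def accepts(text):
        state = 0
        for char in text:
            state = step(state, char)
        return state == 0

    return accepts


accepts = make_depth_dfa(max_depth=3)
print(accepts("{" * 3 + "}" * 3))  # True: balanced and within the state budget
print(accepts("{" * 5 + "}" * 3))  # True: unbalanced, yet accepted
print(accepts("{" * 4 + "}" * 4))  # False: balanced, yet rejected
```

Once the depth exceeds the number of states, this automaton can no longer tell three opening braces from five, so it both accepts an unbalanced string and rejects a balanced one.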

Solution 6 - Regex

Proper regular expressions cannot do it, because you would leave the realm of regular languages and land in context-free language territory.

Nevertheless, the "regular expression" packages that many languages offer are strictly more powerful.

For example, Lua patterns have the "%b()" recognizer that will match balanced parentheses. In your case you would use "%b{}".

Another sophisticated tool similar to sed is gema, where you will match balanced curly braces very easily with {#}.

So, depending on the tools you have at your disposal, your "regular expression" (in a broader sense) may be able to match nested parentheses.

Solution 7 - Regex

YES

...assuming that there is some maximum number of nestings you'd be happy to stop at.

Let me explain.

@torsten-marek is right that a regular expression cannot check for nested patterns like this, BUT it is possible to define a nested regex pattern which will allow you to capture nested structures like this up to some maximum depth. I created one to capture EBNF-style comments, like:

(* This is a comment (* this is nested inside (* another level! *) hey *) yo *)

The regex (for single-depth comments) is the following:

m{1} = \(+\*+(?:[^*(]|(?:\*+[^)*])|(?:\(+[^*(]))*\*+\)+

This could easily be adapted for your purposes by replacing the \(+\*+ and \*+\)+ with { and } and replacing everything in between with a simple [^{}]:

p{1} = \{(?:[^{}])*\}

To nest, just allow this pattern within the block itself:

p{2} = \{(?:(?:p{1})|(?:[^{}]))*\}
  ...or...
p{2} = \{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\}

To find triple-nested blocks, use:

p{3} = \{(?:(?:p{2})|(?:[^{}]))*\}
  ...or...
p{3} = \{(?:(?:\{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\})|(?:[^{}]))*\}

A clear pattern has emerged. To find blocks nested to a depth of N, simply use the regex:

p{N} = \{(?:(?:p{N-1})|(?:[^{}]))*\}

  where N > 1 and
  p{1} = \{(?:[^{}])*\}

A script could be written to recursively generate these regexes, but that's beyond the scope of what I need this for. (This is left as an exercise for the reader.)
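
One possible take on that exercise, sketched in Python. It builds p{N} directly from the recurrence above; the function name `nested_pattern` is made up for illustration, and the patterns are only checked here against Python's `re` module.

```python
import re


def nested_pattern(depth):
    """Build p{depth}: a regex for brace blocks nested at most `depth` levels."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        return r"\{(?:[^{}])*\}"
    # p{N} wraps p{N-1} as one alternative inside the block body.
    return r"\{(?:(?:" + nested_pattern(depth - 1) + r")|(?:[^{}]))*\}"


p2 = re.compile(nested_pattern(2))
print(nested_pattern(2))
print(bool(p2.fullmatch("{a{b}c}")))    # two levels: within reach of p{2}
print(bool(p2.fullmatch("{a{b{c}}}")))  # three levels: too deep for p{2}
```

Each extra level adds a fixed number of characters, so the pattern length grows linearly with the maximum depth.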

Solution 8 - Regex

Using recursive matching in the PHP regex engine is massively faster than procedural matching of brackets, especially with longer strings.

http://php.net/manual/en/regexp.reference.recursive.php

e.g.

$patt = '!\( (?: (?: (?>[^()]+) | (?R) )* ) \)!x';

preg_match_all( $patt, $str, $m );

vs.

matchBrackets( $str );

function matchBrackets ( $str, $offset = 0 ) {

	$matches = array();

	list( $opener, $closer ) = array( '(', ')' );

	// Return early if there's no match
	if ( false === ( $first_offset = strpos( $str, $opener, $offset ) ) ) {
		return $matches;
	}

	// Step through the string one character at a time storing offsets
	$paren_score = -1;
	$inside_paren = false;
	$match_start = 0;
	$offsets = array();

	for ( $index = $first_offset; $index < strlen( $str ); $index++ ) {
		$char = $str[ $index ];

		if ( $opener === $char ) {
			if ( ! $inside_paren ) {
				$paren_score = 1;
				$match_start = $index;
			}
			else {
				$paren_score++;
			}
			$inside_paren = true;
		}
		elseif ( $closer === $char ) {
			$paren_score--;
		}

		if ( 0 === $paren_score ) {
			$inside_paren = false;
			$paren_score = -1;
			$offsets[] = array( $match_start, $index + 1 );
		}
	}

	while ( $offset = array_shift( $offsets ) ) {

		list( $start, $finish ) = $offset;

		$match = substr( $str, $start, $finish - $start );
		$matches[] = $match;
	}

	return $matches;
}

Solution 9 - Regex

No, you are getting into the realm of Context Free Grammars at that point.

Solution 10 - Regex

As Zsolt mentioned, some regex engines support recursion. Of course, these are typically the ones that use a backtracking algorithm, so it won't be particularly efficient. Example: /(?>[^{}]*){(?>[^{}]*)(?R)*(?>[^{}]*)}/sm

Solution 11 - Regex

This seems to work: /(\{(?:\{.*\}|[^\{])*\})/m
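
A quick check in Python suggests this pattern is worth testing against unbalanced input before relying on it. The sketch below uses the pattern as given, with the surrounding `/` delimiters and the `m` flag dropped (the pattern has no `^` or `$`, so the flag should not matter here); the sample strings are made up.

```python
import re

# The pattern from this answer, without the / delimiters and m flag.
pattern = re.compile(r"(\{(?:\{.*\}|[^\{])*\})")

balanced = pattern.search("{a{b}c} tail").group(1)
unbalanced = pattern.search("{a}} tail").group(1)

print(balanced)    # {a{b}c}
print(unbalanced)  # {a}}
```

The second result includes the stray closing brace because `[^\{]` also matches `}`, so the pattern appears to over-match rather than verify that the braces are balanced.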

Content Type	Original Author	Original Content on Stackoverflow
Question	Richard Dorman	View Question on Stackoverflow
Solution 1 - Regex	Torsten Marek	View Answer on Stackoverflow
Solution 2 - Regex	MichaelRushton	View Answer on Stackoverflow
Solution 3 - Regex	Zsolt Botykai	View Answer on Stackoverflow
Solution 4 - Regex	Pavlush	View Answer on Stackoverflow
Solution 5 - Regex	Rafał Dowgird	View Answer on Stackoverflow
Solution 6 - Regex	Remo.D	View Answer on Stackoverflow
Solution 7 - Regex	awwsmm	View Answer on Stackoverflow
Solution 8 - Regex	Pete B	View Answer on Stackoverflow
Solution 9 - Regex	Craig H	View Answer on Stackoverflow
Solution 10 - Regex	sirnotappearingonthissite	View Answer on Stackoverflow
Solution 11 - Regex	Sean Huber	View Answer on Stackoverflow