Regex Group in Perl: how to capture elements into array from regex group that matches unknown number of/multiple/variable occurrences from a string?

Regex, Perl, Grouping, Match

Regex Problem Overview


In Perl, how can I use one regex grouping to capture more than one occurrence that matches it, into several array elements?

For example, for a string:

var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello

to process this with code:

$string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";
   
my @array = $string =~ <regular expression here>
   
for ( my $i = 0; $i < scalar( @array ); $i++ )
{
  print $i.": ".$array[$i]."\n";
}

I would like to see as output:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello

What would I use as a regex?

The commonality between things I want to match here is an assignment string pattern, so something like:

my @array = $string =~ m/(\w+=[\w\"\,\s]+)*/;

Where the * indicates zero or more occurrences matching the group.

(I discounted using a split() as some matches contain spaces within themselves (i.e. var3...) and would therefore not give desired results.)

With the above regex, I only get:

0: var1=100 var2

Is this possible in a single regex, or is additional code required?

I have already looked at existing answers when searching for "perl regex multiple group", but they did not give enough clues.

Regex Solutions


Solution 1 - Regex

my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

while($string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) {
        print "<$1> => <$2>\n";
}

Prints:

<var1> => <100>
<var2> => <90>
<var5> => <hello>
<var3> => <"a, b, c">
<var7> => <test>
<var3> => <hello>

Explanation:

Last piece first: the g flag at the end means that you can apply the regex to the string multiple times. The second time it will continue matching where the last match ended in the string.

Now for the regex: (?:^|\s+) matches either the beginning of the string or a run of one or more whitespace characters. This is needed so that, when the regex is applied the next time, it skips the whitespace between the key/value pairs. The ?: means that the content of the parentheses is not captured as a group (we don't need the whitespace, only the key and the value). \S+ matches the variable name. Then we skip any amount of whitespace and the equals sign in between. Finally, ("[^"]*"|\S*) matches either two quotes with any number of characters in between, or any number of non-space characters for the value. Note that the quote matching is pretty fragile and won't handle escaped quotes properly: for example, "\"quoted\"" would be captured as just "\".
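To make the escaped-quote limitation concrete, here is a minimal sketch that runs the same pattern in a while loop and collects each key/value pair. The input string, with its hypothetical var8 and var9 entries, is an assumption made up for illustration and is not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical input: var8 contains backslash-escaped quotes.
my $input = 'var1=100 var8="say \"hi\"" var9=1';

my @pairs;
while ($input =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) {
    push @pairs, [$1, $2];    # one [key, value] pair per successful match
}

# The value of var8 is cut off at the first escaped quote.
printf "<%s> => <%s>\n", @$_ for @pairs;
```

The quoted alternative stops at the first " it sees, so the value of var8 is truncated, and the next match only resumes at the whitespace before var9.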

EDIT:

Since you really want to get the whole assignment, and not the single keys/values, here's a one-liner that extracts those:

my @list = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"[^"]*"|\S*))/g;

Solution 2 - Regex

With regular expressions, use a technique that I like to call tack-and-stretch: anchor on features you know will be there (tack) and then grab what's between (stretch).

In this case, you know that a single assignment matches

\b\w+=.+

and you have many of these repeated in $string. Remember that \b means word boundary:

> A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.

The values in the assignments can be a little tricky to describe with a regular expression, but you also know that each value will terminate with whitespace—although not necessarily the first whitespace encountered!—followed by either another assignment or end-of-string.

To avoid repeating the assignment pattern, compile it once with qr// and reuse it in your pattern along with a look-ahead assertion (?=...) to stretch the match just far enough to capture the entire value while also preventing it from spilling into the next variable name.

Matching against your pattern in list context with m//g gives the following behavior:

> The /g modifier specifies global pattern matching—that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.

The pattern $assignment uses non-greedy .+? to cut off the value as soon as the look-ahead sees another assignment or end-of-line. Remember that the match returns the substrings from all capturing subpatterns, so the look-ahead's alternation uses non-capturing (?:...). An interpolated qr// object does not add a capturing group either, so the pattern contains no capturing parentheses at all and m//g returns each whole match, as described in the quote above.

#! /usr/bin/perl

use warnings;
use strict;

my $string = <<'EOF';
var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello
EOF

my $assignment = qr/\b\w+ = .+?/x;
my @array = $string =~ /$assignment (?= \s+ (?: $ | $assignment))/gx;

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
  print $i.": ".$array[$i]."\n";
}

Output:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello
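The list-context behavior of m//g quoted above can be seen in isolation with a small sketch. The sample string and variable names here are illustrative assumptions, not part of the original answer.

```perl
use strict;
use warnings;

my $text = "a=1 b=2 c=3";

# No capturing parentheses: m//g in list context returns every whole match.
my @whole = $text =~ /\w+=\d+/g;

# Two capturing groups: the captures from every match come back as one flat list.
my @parts = $text =~ /(\w+)=(\d+)/g;

# An interpolated qr// object adds no capturing group of its own.
my $pair   = qr/\w+=\d+/;
my @via_qr = $text =~ /$pair/g;

print "whole:  @whole\n";     # whole:  a=1 b=2 c=3
print "parts:  @parts\n";     # parts:  a 1 b 2 c 3
print "via qr: @via_qr\n";    # via qr: a=1 b=2 c=3
```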

Solution 3 - Regex

I'm not saying this is what you should do, but what you're trying to do is write a Grammar. Now your example is very simple for a Grammar, but Damian Conway's module Regexp::Grammars is really great at this. If you have to grow this at all, you'll find it will make your life much easier. I use it quite a bit here - it is kind of perl6-ish.

use Regexp::Grammars;
use Data::Dumper;
use strict;
use warnings;

my $parser = qr{
	<[pair]>+
	<rule: pair>     <key>=(?:"<list>"|<value=literal>)
	<token: key>     var\d+
	<rule: list>     <[MATCH=literal]> ** (,)
	<token: literal> \S+

}xms;

q[var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello] =~ $parser;
die Dumper {%/};

Output:

$VAR1 = {
          '' => 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello',
          'pair' => [
                      {
                        '' => 'var1=100',
                        'value' => '100',
                        'key' => 'var1'
                      },
                      {
                        '' => 'var2=90',
                        'value' => '90',
                        'key' => 'var2'
                      },
                      {
                        '' => 'var5=hello',
                        'value' => 'hello',
                        'key' => 'var5'
                      },
                      {
                        '' => 'var3="a, b, c"',
                        'key' => 'var3',
                        'list' => [
                                    'a',
                                    'b',
                                    'c'
                                  ]
                      },
                      {
                        '' => 'var7=test',
                        'value' => 'test',
                        'key' => 'var7'
                      },
                      {
                        '' => 'var3=hello',
                        'value' => 'hello',
                        'key' => 'var3'
                      }
                    ]
        };

Solution 4 - Regex

A bit over the top maybe, but an excuse for me to look into http://p3rl.org/Parse::RecDescent. How about making a parser?

#!/usr/bin/perl

use strict;
use warnings;

use Parse::RecDescent;

use Regexp::Common;

my $grammar = <<'_EOGRAMMAR_'
INTEGER: /[-+]?\d+/
STRING: /\S+/
QSTRING: /$Regexp::Common::RE{quoted}/

VARIABLE: /var\d+/
VALUE: ( QSTRING | STRING | INTEGER )

assignment: VARIABLE "=" VALUE /[\s]*/ { print "$item{VARIABLE} => $item{VALUE}\n"; }

startrule: assignment(s)
_EOGRAMMAR_
;

$Parse::RecDescent::skip = '';
my $parser = Parse::RecDescent->new($grammar);

my $code = q{var1=100 var2=90 var5=hello var3="a, b, c" var7=test var8=" haha \" heh " var3=hello};
$parser->startrule($code);

yields:

var1 => 100
var2 => 90
var5 => hello
var3 => "a, b, c"
var7 => test
var8 => " haha \" heh "
var3 => hello

PS. Note the double var3, if you want the latter assignment to overwrite the first one you can use a hash to store the values, and then use them later.

PPS. My first thought was to split on '=', but that would fail if a string contained '='. Since regexps are almost always bad for parsing, I ended up trying a parser instead, and it works.

Edit: Added support for escaped quotes inside quoted strings.
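The hash idea from the PS could look roughly like the sketch below. To stay dependency-free it reuses the plain regex loop from Solution 1 rather than the Parse::RecDescent grammar, and the name %value_of is a hypothetical choice.

```perl
use strict;
use warnings;

my $string = 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello';

# Later assignments to the same key overwrite earlier ones.
my %value_of;
while ($string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) {
    $value_of{$1} = $2;
}

printf "%s => %s\n", $_, $value_of{$_} for sort keys %value_of;
```

With this input, var3 ends up as hello because the second assignment replaces the quoted list.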

Solution 5 - Regex

I've recently had to parse x509 certificates "Subject" lines. They had similar form to the one you have provided:

echo 'Subject: C=HU, L=Budapest, O=Microsec Ltd., CN=Microsec e-Szigno Root CA 2009/emailAddress=info@e-szigno.hu' | \
  perl -wne 'my @a = m/(\w+\=.+?)(?=(?:, \w+\=|$))/g; print "$_\n" foreach @a;'

C=HU
L=Budapest
O=Microsec Ltd.
CN=Microsec e-Szigno Root CA 2009/emailAddress=info@e-szigno.hu

Short description of the regex:

(\w+\=.+?) - capture words followed by '=' and any subsequent symbols in non greedy mode
(?=(?:, \w+\=|$)) - which are followed by either another , KEY=val or end of line.

The interesting parts of the regex are:

  • .+? - Non greedy mode
  • (?:pattern) - Non capturing mode
  • (?=pattern) - Zero-width positive look-ahead assertion
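To see why the non-greedy quantifier matters here, the following sketch compares the pattern above with a greedy variant on a shortened subject string. The shortened string and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $subject = 'C=HU, L=Budapest, O=Microsec Ltd.';

# Non-greedy .+? stops as soon as the look-ahead sees ", KEY=" or end of line.
my @lazy = $subject =~ /(\w+=.+?)(?=(?:, \w+=|$))/g;

# Greedy .+ runs to the end of the line first, where the look-ahead also succeeds.
my @greedy = $subject =~ /(\w+=.+)(?=(?:, \w+=|$))/g;

print scalar(@lazy),   " lazy matches:\n";
print "  $_\n" for @lazy;
print scalar(@greedy), " greedy match:\n";
print "  $_\n" for @greedy;
```

The non-greedy version yields three separate fields, while the greedy version swallows the whole line as a single match.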

Solution 6 - Regex

This one also handles the common backslash escaping inside double quotes, as in, for example, var3="a, \"b, c".

@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g;

In action:

echo 'var1=100 var2=90 var42="foo\"bar\\" var5=hello var3="a, b, c" var7=test var3=hello' |
perl -nle '@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g; $,=","; print @a'
var1=100,var2=90,var42="foo\"bar\\",var5=hello,var3="a, b, c",var7=test,var3=hello
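The same pattern may be easier to follow when spread out with the /x modifier. This is only a commented restatement of the regex above, run as a script instead of a one-liner. The variable names are assumptions, and the input is placed in a single-quoted heredoc so that the backslashes stay literal.

```perl
use strict;
use warnings;

# Single-quoted heredoc: backslashes are kept exactly as typed.
my $line = <<'EOF';
var1=100 var2=90 var42="foo\"bar\\" var5=hello var3="a, b, c" var7=test var3=hello
EOF
chomp $line;

my $assignment = qr/
    \w+ =                       # key followed by an equals sign
    (?:
        \w+                     # either a bare word value
      |
        "                       # or an opening double quote,
        (?:
            [^\\"]*             #   a run of ordinary characters,
            (?: \\. [^\\"]* )*  #   then escaped characters, each followed by ordinary ones
        )*
        "                       # and the closing double quote
    )
/x;

my @a = $line =~ /($assignment)/g;
print join(",", @a), "\n";
```

The key piece is \\. which consumes a backslash together with whatever character follows it, so an escaped quote can never be mistaken for the closing quote.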

Solution 7 - Regex

#!/usr/bin/perl

use strict; use warnings;

use Text::ParseWords;
use YAML;

my $string =
    "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my @parts = shellwords $string;
print Dump \@parts;

@parts = map { { split /=/ } } @parts;

print Dump \@parts;

Solution 8 - Regex

You asked for a RegEx solution or other code. Here is a (mostly) non regex solution using only core modules. The only regex is \s+ to determine the delimiter; in this case one or more spaces.

use strict; use warnings;
use Text::ParseWords;
my $string="var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";  

my @array = quotewords('\s+', 0, $string);

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
  print $i.": ".$array[$i]."\n";
}

The output is:

0: var1=100
1: var2=90
2: var5=hello
3: var3=a, b, c
4: var7=test
5: var3=hello

If you really want a regex solution, Alan Moore's comment linking to his code on IDEone is the gas!
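The output above loses the double quotes around the var3 value, because the second argument to quotewords (the keep flag) is 0. As a small variation, assuming you want the quotes preserved as in the question's desired output, the flag can be set to a true value:

```perl
use strict;
use warnings;
use Text::ParseWords;

my $string = 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello';

# A true keep flag leaves quotes (and backslashes) in the returned tokens.
my @array = quotewords('\s+', 1, $string);

printf "%d: %s\n", $_, $array[$_] for 0 .. $#array;
```

With the keep flag set, element 3 comes back as var3="a, b, c", quotes included.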

Solution 9 - Regex

It is possible to do this with regexes; however, it's fragile.

my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my $regexp = qr/( (?:\w+=[\w\,]+) | (?:\w+=\"[^\"]*\") )/x;
my @matches = $string =~ /$regexp/g;
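A short sketch of what this returns, and of one way the fragility shows up. The second input string, with its hypothetical path entry, is an assumption added for illustration.

```perl
use strict;
use warnings;

my $regexp = qr/( (?:\w+=[\w\,]+) | (?:\w+=\"[^\"]*\") )/x;

# The string from the question: all six assignments are found.
my $string = 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello';
my @matches = $string =~ /$regexp/g;
print scalar(@matches), " matches: ", join(" | ", @matches), "\n";

# Hypothetical input: an unquoted value that starts with a character outside [\w,].
my $tricky  = 'a=1 path=/tmp/x b=2';
my @partial = $tricky =~ /$regexp/g;
print scalar(@partial), " matches: ", join(" | ", @partial), "\n";
```

In the second case the path assignment is skipped without any warning, because neither alternative can match a value beginning with a slash.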

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | therobyouknow | View Question on Stackoverflow
Solution 1 - Regex | jkramer | View Answer on Stackoverflow
Solution 2 - Regex | Greg Bacon | View Answer on Stackoverflow
Solution 3 - Regex | Evan Carroll | View Answer on Stackoverflow
Solution 4 - Regex | nicomen | View Answer on Stackoverflow
Solution 5 - Regex | Delian Krustev | View Answer on Stackoverflow
Solution 6 - Regex | Hynek -Pichi- Vychodil | View Answer on Stackoverflow
Solution 7 - Regex | Sinan Ünür | View Answer on Stackoverflow
Solution 8 - Regex | dawg | View Answer on Stackoverflow
Solution 9 - Regex | szbalint | View Answer on Stackoverflow