Why is `\s+` so much faster than `\s\s*` in this Perl regex?

RegexPerlRegex Greedy

Regex Problem Overview


Why does replacing \s* (or even \s\s*) with \s+ result in such a speedup for this input?

use Benchmark qw(:all);
$x=(" " x 100000) . "_\n";
$count = 100;
timethese($count, {
    '/\s\s*\n/' => sub { $x =~ /\s\s*\n/ },
    '/\s+\n/' => sub { $x =~ /\s+\n/ },
});

Link to online version

I noticed a slow regex s/\s*\n\s*/\n/g in my code - when given a 450KB input file consisting of lots of spaces with a few non-spaces here and there, and a final newline at the end - the regex hung and never finished.

I intuitively replaced the regex with s/\s+\n/\n/g; s/\n\s+/\n/g; and all was well.

But why is it so much faster? After using re Debug => "EXECUTE" I noticed the \s+ version is somehow optimised to run in only one iteration: http://pastebin.com/0Ug6xPiQ

Matching REx "\s*\n" against "       _%n"
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d ][{non-utf8-latin1-all}{unicode_all}] against "       _%n" (9 bytes)
   0 <> <       _%n>         |  1:STAR(3)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...
   1 < > <      _%n>         |  1:STAR(3)
                                  SPACE can match 6 times out of 2147483647...
                                  failed...
   2 <  > <     _%n>         |  1:STAR(3)
                                  SPACE can match 5 times out of 2147483647...
                                  failed...
   3 <   > <    _%n>         |  1:STAR(3)
                                  SPACE can match 4 times out of 2147483647...
                                  failed...
   4 <    > <   _%n>         |  1:STAR(3)
                                  SPACE can match 3 times out of 2147483647...
                                  failed...
   5 <     > <  _%n>         |  1:STAR(3)
                                  SPACE can match 2 times out of 2147483647...
                                  failed...
   6 <      > < _%n>         |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   8 <       _> <%n>         |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
   8 <       _> <%n>         |  3:  EXACT <\n>(5)
   9 <       _%n> <>         |  5:  END(0)
Match successful!

Matching REx "\s+\n" against "       _%n"
Matching stclass SPACE against "       _" (8 bytes)
   0 <> <       _%n>         |  1:PLUS(3)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...

I know Perl 5.10+ will immediately fail the regex (without running it) if a newline is not present. I suspect it is using the location of the newline to reduce the amount of searching it does. For all cases above it seems to cleverly reduce the backtracking involved (usually /\s*\n/ against a string of spaces would take exponential-time). Can anyone offer insight into why the \s+ version is so much faster?

Also note that \s*? does not offer any speedup.

Regex Solutions


Solution 1 - Regex

First, even if the resulting regex will not keep the same meaning, let's reduces regexes to \s*0 and \s+0 and use (" " x 4) . "_0" as an input. For the sceptics, you can see here that the lag is still present.

Now let's consider the following code:

$x = (" " x 4) . "_ 0";
$x =~ /\s*0/; # The slow line 
$x =~ /\s+0/; # The fast line

Digging a bit with use re debugcolor; we get the following output:

Guessing start of match in sv for REx "\s*0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 0 check_at: 5 s: 0 endpos: 6 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
   0 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
   1 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 3 times out of 2147483647...
                                  failed...
   2 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 2 times out of 2147483647...
                                  failed...
   3 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 1 times out of 2147483647...
                                  failed...
   5 <    _0>|  1:STAR(3)
                                  POSIXD[\s] can match 0 times out of 2147483647...
   5 <    _0>|  3:  EXACT <0>(5)
   6 <    _0>|  5:  END(0)
Match successful!

-----------------------

Guessing start of match in sv for REx "\s+0" against "    _0"
Found floating substr "0" at offset 5...
start_shift: 1 check_at: 5 s: 0 endpos: 5 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes)
   0 <    _0>|  1:PLUS(3)
                                  POSIXD[\s] can match 4 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed

Perl seems to be optimized for failure. It will first look for constant strings (which only consume O(N)). Here, it'll look for 0 : Found floating substr "0" at offset 5...

Then it'll start with the variable part of the regex, respectivly \s* and \s+, against the whole minimum string to check:

Matching REx "\s*0" against "    _0"
Matching stclass ANYOF_SYNTHETIC[\x09-\x0d 0\x85\xa0][{unicode_all}] against "    _0" (6 bytes)
Matching REx "\s+0" against "    _0"
Matching stclass POSIXD[\s] against "    _" (5 bytes) # Only 5 bytes because there should be at least 1 "\s" char

After that it'll look for the first position meeting the stclass requirement, here at position 0.

  • \s*0:
    • starts at 0, find 4 spaces then fail;
    • starts at 1, find 3 spaces then fail;
    • starts at 2, find 2 spaces then fail;
    • starts at 3, find 1 spaces then fail;
    • starts at 4, find 0 spaces then doesn't fail;
    • Find an exact 0
  • \s+0:
    • starts at 0, find 4 spaces then fail. As the minimum number of spaces is not matched, the regex fails instantly.

If you want to have fun with Perl regex optimization, you can consider the following regexes / *\n and / * \n. At first glance, they look the same, have the same meaning... But if you run its against (" " x 40000) . "_\n" the first one will check all possibilities while the second one will look for " \n" and fail immediately.

In a vanilla, non-optimized regex engine, both regex can cause catastrophic backtracking, since they need to retry the pattern as it bumps along. However, in the example above, the second doesn't fail with Perl because it have been optimized to find floating substr "0%n"


You can see another example on Jeff Atwood's blog.

Note also that the issue is not about \s consideration but any pattern where xx* is used instead of x+ see example with 0s and also regex explosive quantifiers

With such shorter example the behavior is "findable", but if you start to play with complicated patterns, it's far from easy to spot, for example: https://stackoverflow.com/questions/11522954/regular-expression-hangs-program-100-cpu-usage/11523255#11523255

Solution 2 - Regex

When there is a "plus" node (e.g. \s+) at the beginning of a pattern and the node fails to match, the regex engine skips forward to the point of failure and tries again; with \s*, on the other hand, the engine only advances one character at a time.

Yves Orton explains this optimization nicely here:

>The start class optimisation has two modes, "try every valid start position" (doevery) and "flip flop mode" (!doevery) where it trys only the first valid start position in a sequence. > >Consider /(\d+)X/ and the string "123456Y", now we know that if we fail to match X after matching "123456" then we will also fail to match after "23456" (assuming no evil tricks are in place, which disable the optimisation anyway), so we know we can skip forward until the check /fails/ and only then start looking for a real match. This is flip-flop mode.

/\s+/ triggers flip-flop mode; /\s*/, /\s\s*/, and /\s\s+/ don't. This optimization can't be applied to "star" nodes like \s* because they can match zero characters, so a failure at one point in a sequence isn't indicative of failure later in the same sequence.


You can see this in the debug output for each regex. I've highlighted the skipped characters with ^. Compare this (skips four characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d+x/'
   ...
   0 <> <123 456 78>         |  1:PLUS(3)
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   4 <123 > <456 789 x>      |  1:PLUS(3)
      ^^^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   8 <23 456 > <789 x>       |  1:PLUS(3)
         ^^^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...

to this (skips one or two characters at a time):

$ perl -Mre=Debug,MATCH -e'"123 456 789 x" =~ /\d*x/'
   ...
   0 <> <123 456 78>         |  1:STAR(3)
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   1 <1> <23 456 789>        |  1:STAR(3)
      ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
   2 <12> <3 456 789 >       |  1:STAR(3)
       ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
   4 <123 > <456 789 x>      |  1:STAR(3)
        ^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   5 <123 4> <56 789 x>      |  1:STAR(3)
          ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
   6 <23 45> <6 789 x>       |  1:STAR(3)
          ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
   8 <23 456 > <789 x>       |  1:STAR(3)
           ^^
                                  POSIXD[\d] can match 3 times out of 2147483647...
                                  failed...
   9 <23 456 7> <89 x>       |  1:STAR(3)
             ^
                                  POSIXD[\d] can match 2 times out of 2147483647...
                                  failed...
  10 <23 456 78> <9 x>       |  1:STAR(3)
              ^
                                  POSIXD[\d] can match 1 times out of 2147483647...
                                  failed...
  12 <23 456 789 > <x>       |  1:STAR(3)
               ^^
                                  POSIXD[\d] can match 0 times out of 2147483647...
  12 <23 456 789 > <x>       |  3:  EXACT <x>(5)
  13 <23 456 789 x> <>       |  5:  END(0)

Note that the optimization isn't applied to /\s\s+/, because \s+ isn't at the beginning of the pattern. Both /\s\s+/ (logically equivalent to /\s{2,}/) and /\s\s*/ (logically equivalent to /\s+/) probably could be optimized, though; it might make sense to ask on perl5-porters whether either would be worth the effort.


In case you're interested, "flip-flop mode" is enabled by setting the PREGf_SKIP flag on a regex when it's compiled. See the code around lines 7344 and 7405 in regcomp.c and line 1585 in regexec.c in the 5.24.0 source.

Solution 3 - Regex

The \s+\n requires that the character preceding the \n is a SPACE.

According to use re qw(debug) the compilation establishes that it needs a straight string of a known number of spaces, up to the substring \n, which is first checked for in the input. Then it checks the fixed-length space-only substring against the remaining part of input, failing as it comes to _. It's a single possibility to check, regardless of how many spaces the input has. (When there are more _\n each is found to fail equally directly, per debug output.)

Looked at it this way, it's an optimization you'd almost expect, utilizing a rather specific search pattern and lucking out with this input. Except when taken in comparison with other engines, which clearly don't do this kind of analysis.

With \s*\n this is not the case. Once the \n is found and the previous character is not a space, the search hasn't failed since \s* allows nothing (zero characters). There are no fixed-length substrings either, and it's in the backtracking game.

Solution 4 - Regex

I am not sure of the internals of the regular expression engine, but it looks like it does not recognize that \s+ is in some way the same as \s\s*, since in the second one it matches a space, and then tries to match ever growing numbers of spaces, while in the first, it immediately concludes that there is no match.

The output using use re qw( Debug ); clearly shows this, using a much shorter string:

test_re.pl

#!/usr/bin/env perl
use re qw(debug);

$x=(" " x 10) . "_\n";
print '-'x50 . "\n";
$x =~ /\s+\n/;
print '-'x50 . "\n";
$x =~ /\s\s*\n/;
print '-'x50 . "\n";

Output

Compiling REx "\s+\n"
Final program:
    1: PLUS (3)
    2:   SPACE (0)
    3: EXACT <\n> (5)
    5: END (0)
floating "%n" at 1..2147483647 (checking floating) stclass SPACE plus minlen 2
Compiling REx "\s\s*\n"
Final program:
    1: SPACE (2)
    2: STAR (4)
    3:   SPACE (0)
    4: EXACT <\n> (6)
    6: END (0)
floating "%n" at 1..2147483647 (checking floating) stclass SPACE minlen 2
--------------------------------------------------
Guessing start of match in sv for REx "\s+\n" against "          _%n"
Found floating substr "%n" at offset 11...
    start_shift: 1 check_at: 11 s: 0 endpos: 11
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s+\n" against "          _%n"
Matching stclass SPACE against "          _" (11 bytes)
   0 <> <          >         |  1:PLUS(3)
                                  SPACE can match 10 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed
--------------------------------------------------
Guessing start of match in sv for REx "\s\s*\n" against "          _%n"
Found floating substr "%n" at offset 11...
    start_shift: 1 check_at: 11 s: 0 endpos: 11
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\s\s*\n" against "          _%n"
Matching stclass SPACE against "          _" (11 bytes)
   0 <> <          >         |  1:SPACE(2)
   1 < > <         _>        |  2:STAR(4)
                                  SPACE can match 9 times out of 2147483647...
                                  failed...
   1 < > <         _>        |  1:SPACE(2)
   2 <  > <        _>        |  2:STAR(4)
                                  SPACE can match 8 times out of 2147483647...
                                  failed...
   2 <  > <        _>        |  1:SPACE(2)
   3 <   > <       _%n>      |  2:STAR(4)
                                  SPACE can match 7 times out of 2147483647...
                                  failed...
   3 <   > <       _%n>      |  1:SPACE(2)
   4 <    > <      _%n>      |  2:STAR(4)
                                  SPACE can match 6 times out of 2147483647...
                                  failed...
   4 <    > <      _%n>      |  1:SPACE(2)
   5 <     > <     _%n>      |  2:STAR(4)
                                  SPACE can match 5 times out of 2147483647...
                                  failed...
   5 <     > <     _%n>      |  1:SPACE(2)
   6 <      > <    _%n>      |  2:STAR(4)
                                  SPACE can match 4 times out of 2147483647...
                                  failed...
   6 <      > <    _%n>      |  1:SPACE(2)
   7 <       > <   _%n>      |  2:STAR(4)
                                  SPACE can match 3 times out of 2147483647...
                                  failed...
   7 <       > <   _%n>      |  1:SPACE(2)
   8 <        > <  _%n>      |  2:STAR(4)
                                  SPACE can match 2 times out of 2147483647...
                                  failed...
   8 <        > <  _%n>      |  1:SPACE(2)
   9 <         > < _%n>      |  2:STAR(4)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   9 <         > < _%n>      |  1:SPACE(2)
  10 <          > <_%n>      |  2:STAR(4)
                                  SPACE can match 0 times out of 2147483647...
                                  failed...
Contradicts stclass... [regexec_flags]
Match failed
--------------------------------------------------
Freeing REx: "\s+\n"
Freeing REx: "\s\s*\n"

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionrjhView Question on Stackoverflow
Solution 1 - RegexThomas AyoubView Answer on Stackoverflow
Solution 2 - RegexThisSuitIsBlackNotView Answer on Stackoverflow
Solution 3 - RegexzdimView Answer on Stackoverflow
Solution 4 - RegexxxfelixxxView Answer on Stackoverflow