Split string by delimiter, but not if it is escaped

Tags: Php, Regex, Preg Split

Php Problem Overview


How can I split a string by a delimiter, but not if it is escaped? For example, I have a string:

1|2\|2|3\\|4\\\|4

The delimiter is | and an escaped delimiter is \|. Furthermore I want to ignore escaped backslashes, so in \\| the | would still be a delimiter.

So with the above string the result should be:

[0] => 1
[1] => 2\|2
[2] => 3\\
[3] => 4\\\|4

Php Solutions


Solution 1 - Php

Use dark magic:

$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);

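\\\\. matches a backslash followed by any single character (the s modifier lets . match newlines as well), (*SKIP)(*FAIL) discards that match and keeps the regex from retrying inside it, and \| matches the delimiter.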
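As a quick check, the pattern can be run against the sample string from the question. This is only a minimal sketch; the variable names are illustrative.

```php
<?php
// The question's sample string: 1|2\|2|3\\|4\\\|4
// (every literal backslash is doubled inside the PHP string literal).
$string = '1|2\\|2|3\\\\|4\\\\\\|4';

// A backslash plus the character after it is consumed and rejected as a
// split point via (*SKIP)(*FAIL); only a remaining bare | splits the string.
$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);

print_r($array);
```

The printed fields should be 1, 2\|2, 3\\ and 4\\\|4, which is the result asked for in the question. The escape sequences are left in place; nothing is unescaped.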

Solution 2 - Php

Instead of split(...), it's IMO more intuitive to use some sort of "scan" function that operates like a lexical tokenizer. In PHP that would be the preg_match_all function. You simply say you want to match:

  1. something other than a \ or |
  2. or a \ followed by any character (which covers \\ and \|)
  3. repeat #1 or #2 at least once

The following demo:

$input = "1|2\\|2|3\\\\|4\\\\\\|4";
echo $input . "\n\n";
preg_match_all('/(?:\\\\.|[^\\\\|])+/', $input, $parts);
print_r($parts[0]);

will print:

1|2\|2|3\\|4\\\|4

Array
(
    [0] => 1
    [1] => 2\|2
    [2] => 3\\
    [3] => 4\\\|4
)
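One behavioural difference from splitting is worth keeping in mind: because the pattern needs at least one character per match, preg_match_all never reports empty fields, whereas preg_split keeps them. The sketch below uses a made-up input with two adjacent delimiters to show this; the variable names are illustrative.

```php
<?php
// Hypothetical input with an empty field between two delimiters.
$input = 'a||b';

// Scanning approach: the + quantifier requires at least one character,
// so the empty field between the two pipes is not reported.
preg_match_all('/(?:\\\\.|[^\\\\|])+/', $input, $parts);
$scanned = $parts[0];

// Splitting approach from Solution 1: the empty field is preserved.
$split = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $input);

print_r($scanned); // two fields: a, b
print_r($split);   // three fields: a, (empty string), b
```

Which behaviour is preferable depends on whether empty fields carry meaning in the data being parsed.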

Solution 3 - Php

Recently I devised a solution:

$array = preg_split('~ ((?<!\\\\)|(?<=[^\\\\](\\\\\\\\)+)) \| ~x', $string);

But the dark magic solution (Solution 1) is still three times faster.

Solution 4 - Php

For future readers, here is a universal solution. It is based on NikiC's idea with (*SKIP)(*FAIL):

function split_escaped($delimiter, $escaper, $text)
{
    $d = preg_quote($delimiter, "~");
    $e = preg_quote($escaper, "~");
    $tokens = preg_split(
        '~' . $e . '(' . $e . '|' . $d . ')(*SKIP)(*FAIL)|' . $d . '~',
        $text
    );
    $escaperReplacement = str_replace(['\\', '$'], ['\\\\', '\\$'], $escaper);
    $delimiterReplacement = str_replace(['\\', '$'], ['\\\\', '\\$'], $delimiter);
    return preg_replace(
        ['~' . $e . $e . '~', '~' . $e . $d . '~'],
        [$escaperReplacement, $delimiterReplacement],
        $tokens
    );
}

Try it out:

// the base situation:
$text = "asdf\\,fds\\,ddf,\\\\,f\\,,dd";
$delimiter = ",";
$escaper = "\\";
print_r(split_escaped($delimiter, $escaper, $text));

// other signs:
$text = "dk!%fj%slak!%df!!jlskj%%dfl%isr%!%%jlf";
$delimiter = "%";
$escaper = "!";
print_r(split_escaped($delimiter, $escaper, $text));

// delimiter with multiple characters:
$text = "aksd()jflaksd())jflkas(('()j()fkl'()()as()d('')jf";
$delimiter = "()";
$escaper = "'";
print_r(split_escaped($delimiter, $escaper, $text));

// escaper is same as delimiter:
$text = "asfl''asjf'lkas'''jfkl''d'jsl";
$delimiter = "'";
$escaper = "'";
print_r(split_escaped($delimiter, $escaper, $text));

Output:

Array
(
    [0] => asdf,fds,ddf
    [1] => \
    [2] => f,
    [3] => dd
)
Array
(
    [0] => dk%fj
    [1] => slak%df!jlskj
    [2] => 
    [3] => dfl
    [4] => isr
    [5] => %
    [6] => jlf
)
Array
(
    [0] => aksd
    [1] => jflaksd
    [2] => )jflkas((()j
    [3] => fkl()
    [4] => as
    [5] => d(')jf
)
Array
(
    [0] => asfl'asjf
    [1] => lkas'
    [2] => jfkl'd
    [3] => jsl
)

Note: There is a theoretical problem: implode('::', ['a:', ':b']) and implode('::', ['a', '', 'b']) produce the same string, 'a::::b'. Imploding can therefore be an interesting problem of its own.
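The ambiguity can be reproduced in a few lines. This is a small sketch with illustrative variable names; it only demonstrates the problem and does not attempt to solve it.

```php
<?php
// Two different arrays joined with the multi-character delimiter "::".
$first  = implode('::', ['a:', ':b']);
$second = implode('::', ['a', '', 'b']);

// Both calls produce the same string, so the original array
// cannot be recovered from the joined string alone.
var_dump($first === $second);

// Splitting the joined string yields only one of the two candidates.
$parts = explode('::', $first);
print_r($parts);
```

Here explode returns the three-element form (a, empty string, b), so the two-element input is lost.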

Solution 5 - Php

Regular expressions can be slow. An alternative is to replace the escaped sequences with temporary markers before splitting, then put them back afterwards:

$foo = 'a,b|,c,d||,e';

function splitEscaped($str, $delimiter, $escapeChar = '\\') {
    // Temporary markers, assumed never to appear in the original string
    $double = "\0\0\0_doub";
    $escaped = "\0\0\0_esc";
    // Hide escaped escape characters first, then escaped delimiters
    $str = str_replace($escapeChar . $escapeChar, $double, $str);
    $str = str_replace($escapeChar . $delimiter, $escaped, $str);

    $split = explode($delimiter, $str);
    // Restore the hidden sequences in their unescaped form
    foreach ($split as &$val) {
        $val = str_replace([$double, $escaped], [$escapeChar, $delimiter], $val);
    }
    unset($val); // break the reference left over from the loop
    return $split;
}

print_r(splitEscaped($foo, ',', '|'));

This splits on "," unless it is escaped with "|". Double escaping is also supported, so "||" becomes a single "|" after the split:

Array
(
    [0] => a
    [1] => b,c
    [2] => d|
    [3] => e
)
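For comparison with the regex answers, the same marker technique can be applied to the sample string from the question. The sketch below is a compact restatement under the same assumption that the markers never occur in the input; split_with_markers is a hypothetical name chosen here so it does not clash with splitEscaped.

```php
<?php
// Hypothetical helper: a compact restatement of the marker technique.
function split_with_markers(string $str, string $delimiter, string $escapeChar = '\\'): array
{
    // Markers assumed not to occur in the input.
    $double  = "\0\0\0_doub";
    $escaped = "\0\0\0_esc";

    // str_replace with arrays works through the pairs in order, so escaped
    // escape characters are hidden before escaped delimiters.
    $str = str_replace(
        [$escapeChar . $escapeChar, $escapeChar . $delimiter],
        [$double, $escaped],
        $str
    );

    // Split on the remaining delimiters, then restore the hidden sequences.
    return array_map(
        function ($part) use ($double, $escaped, $escapeChar, $delimiter) {
            return str_replace([$double, $escaped], [$escapeChar, $delimiter], $part);
        },
        explode($delimiter, $str)
    );
}

// The question's sample string: 1|2\|2|3\\|4\\\|4
$result = split_with_markers('1|2\\|2|3\\\\|4\\\\\\|4', '|');
print_r($result);
```

Unlike the preg_split and preg_match_all answers, this approach returns the fields already unescaped: 1, 2|2, 3\ and 4\|4.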

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type      Original Author   Original Content on Stackoverflow
Question          Anton             View Question on Stackoverflow
Solution 1 - Php  NikiC             View Answer on Stackoverflow
Solution 2 - Php  Bart Kiers        View Answer on Stackoverflow
Solution 3 - Php  Anton             View Answer on Stackoverflow
Solution 4 - Php  Dávid Horváth     View Answer on Stackoverflow
Solution 5 - Php  Tom B             View Answer on Stackoverflow