Why is "asdf".replace(/.*/g, "x") == "xx"?

JavascriptRegex

Javascript Problem Overview


I stumbled across a surprising (to me) fact.

console.log("asdf".replace(/.*/g, "x"));

Why two replacements? It seems any non-empty string without newlines will produce exactly two replacements for this pattern. Using a replacement function, I can see that the first replacement is for the entire string, and the second is for an empty string.

Javascript Solutions


Solution 1 - Javascript

As per the ECMA-262 standard, String.prototype.replace calls RegExp.prototype[@@replace], which says:

11. Repeat, while done is false
  a. Let result be ? RegExpExec(rx, S).
  b. If result is null, set done to true.
  c. Else result is not null,
    i. Append result to the end of results.
    ii. If global is false, set done to true.
    iii. Else,
      1. Let matchStr be ? ToString(? Get(result, "0")).
      2. If matchStr is the empty String, then
        a. Let thisIndex be ? ToLength(? Get(rx, "lastIndex")).
        b. Let nextIndex be AdvanceStringIndex(S, thisIndex, fullUnicode).
        c. Perform ? Set(rx, "lastIndex", nextIndex, true).

where rx is /.*/g and S is 'asdf'.

See 11.c.iii.2.b: > b. Let nextIndex be AdvanceStringIndex(S, thisIndex, fullUnicode).

Therefore in 'asdf'.replace(/.*/g, 'x') it is actually:

  1. result (undefined), results = [], lastIndex = 0
  2. result = 'asdf', results = [ 'asdf' ], lastIndex = 4
  3. result = '', results = [ 'asdf', '' ], lastIndex = 4, AdvanceStringIndex, set lastIndex to 5
  4. result = null, results = [ 'asdf', '' ], return

Therefore there are 2 matches.

Solution 2 - Javascript

Together in an offline chat with yawkat, we found an intuitive way of seeing why "abcd".replace(/.*/g, "x") exactly produces two matches. Note that we haven't checked whether it completely equals the semantics imposed by the ECMAScript standard, hence just take it as a rule of thumb.

Rules of Thumb
  • Consider the matches as a list of tuples (matchStr, matchIndex) in chronological order that indicate which string parts and indices of the input string have already been eaten up.
  • This list is continuously built up starting from the left of the input string for the regex.
  • Parts already eaten up cannot be matched anymore
  • Replacement is done at indices given by matchIndex overwriting the substring matchStr at that position. If matchStr = "", then the "replacement" is effectively insertion.

Formally, the act of matching and replacement is described as a loop as seen in the other answer.

Easy Examples
  1. "abcd".replace(/.*/g, "x") outputs "xx":
  • The match list is [("abcd", 0), ("", 4)]

    Notably, it does not include the following matches one could have thought of for the following reasons:

    • ("a", 0), ("ab", 0): the quantifier * is greedy
    • ("b", 1), ("bc", 1): due to the previous match ("abcd", 0), the strings "b" and "bc" are already eaten up
    • ("", 4), ("", 4) (i.e. twice): the index position 4 is already eaten up by the first apparent match
  • Hence, the replacement string "x" replaces the found match strings exactly at those positions: at position 0 it replaces the string "abcd" and at position 4 it replaces "".

    Here you can see that replacement can act as true replacement of a previous string or just as insertion of a new string.

  1. "abcd".replace(/.*?/g, "x") with a lazy quantifier *? outputs "xaxbxcxdx"
  • The match list is [("", 0), ("", 1), ("", 2), ("", 3), ("", 4)]

    In contrast to the previous example, here ("a", 0), ("ab", 0), ("abc", 0), or even ("abcd", 0) are not included due to the quantifier's laziness that strictly limits it to find the shortest possible match.

  • Since all match strings are empty, no actual replacement occurs, but instead insertions of x at positions 0, 1, 2, 3, and 4.

  1. "abcd".replace(/.+?/g, "x") with a lazy quantifier +? outputs "xxxx"
  • The match list is [("a", 0), ("b", 1), ("c", 2), ("d", 3)]
  1. "abcd".replace(/.{2,}?/g, "x") with a lazy quantifier [2,}? outputs "xx"
  • The match list is [("ab", 0), ("cd", 2)]
  1. "abcd".replace(/.{0}/g, "x") outputs "xaxbxcxdx" by the same logic as in example 2.
Harder Examples

We can consistently exploit the idea of insertion instead of replacement if we just always match an empty string and control the position where such matches happen to our advantage. For example, we can create regular expressions matching the empty string at every even position to insert a character there:

  1. "abcdefgh".replace(/(?<=^(..)*)/g, "_")) with a positive lookbehind (?<=...) outputs "_ab_cd_ef_gh_" (only supported in Chrome so far)
  • The match list is [("", 0), ("", 2), ("", 4), ("", 6), ("", 8)]
  1. "abcdefgh".replace(/(?=(..)*$)/g, "_")) with a positive lookahead (?=...) outputs "_ab_cd_ef_gh_"
  • The match list is [("", 0), ("", 2), ("", 4), ("", 6), ("", 8)]

Solution 3 - Javascript

The first match is obviously "asdf" (Position [0,4]). Because the global flag (g) is set, it continues searching. At this point (Position 4), it finds a second match, an empty string (Position [4,4]).

Remember that * matches zero or more elements.

Solution 4 - Javascript

simply, the first x is for the replacement of matching asdf.

second x for the empty string after asdf. Search terminates when empty.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionrecursiveView Question on Stackoverflow
Solution 1 - JavascriptAlan LiangView Answer on Stackoverflow
Solution 2 - JavascriptComFreekView Answer on Stackoverflow
Solution 3 - JavascriptDavid SKView Answer on Stackoverflow
Solution 4 - JavascriptNilanka ManojView Answer on Stackoverflow