Regular Expression to split on spaces unless in quotes

C#, .NET, Regex

C# Problem Overview


I would like to use the .NET Regex.Split method to split this input string into an array. It must split on whitespace unless the whitespace is enclosed in quotes.

Input: Here is "my string"    it has "six  matches"

Expected output:

  1. Here
  2. is
  3. my string
  4. it
  5. has
  6. six  matches

What pattern do I need? Also do I need to specify any RegexOptions?

C# Solutions


Solution 1 - C#

No options required

Regex:

\w+|"[\w\s]*"

C#:

Regex regex = new Regex(@"\w+|""[\w\s]*""");

Or if you need to exclude " characters:

    Regex
        .Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""")
        .Cast<Match>()
        .Select(m => m.Groups["match"].Value)
        .ToList()
        .ForEach(s => Console.WriteLine(s));

Solution 2 - C#

Lieven's solution gets most of the way there, and as he states in his comments, it's just a matter of changing the ending to Bartek's solution. The end result is the following working regex:

(?<=")\w[\w\s]*(?=")|\w+|"[\w\s]*"

Input: Here is "my string" it has "six matches"

Output:

  1. Here
  2. is
  3. "my string"
  4. it
  5. has
  6. "six matches"

Unfortunately, it includes the quotes. If you instead use the following:

(("((?<token>.*?)(?<!\\)")|(?<token>[\w]+))(\s)*)

And explicitly capture the "token" matches as follows:

    RegexOptions options = RegexOptions.None;
    Regex regex = new Regex( @"((""((?<token>.*?)(?<!\\)"")|(?<token>[\w]+))(\s)*)", options );
    string input = @"   Here is ""my string"" it has   "" six  matches""   ";
    var result = (from Match m in regex.Matches( input ) 
                  where m.Groups[ "token" ].Success
                  select m.Groups[ "token" ].Value).ToList();

    for ( int i = 0; i < result.Count; i++ )
    {
        Debug.WriteLine( string.Format( "Token[{0}]: '{1}'", i, result[ i ] ) );
    }

Debug output:

Token[0]: 'Here'
Token[1]: 'is'
Token[2]: 'my string'
Token[3]: 'it'
Token[4]: 'has'
Token[5]: ' six  matches'

Solution 3 - C#

The top answer doesn't quite work for me. I was trying to split this sort of string by spaces, but it looks like it splits on the dots ('.') as well.

"the lib.lib" "another lib".lib

I know the question asks about regexes, but I ended up writing a non-regex function to do this:

    /// <summary>
    /// Splits the string passed in by the delimiters passed in.
    /// Quoted sections are not split, and all tokens have whitespace
    /// trimmed from the start and end.
    /// </summary>
    public static List<string> split(string stringToSplit, params char[] delimiters)
    {
        List<string> results = new List<string>();

        bool inQuote = false;
        StringBuilder currentToken = new StringBuilder();
        for (int index = 0; index < stringToSplit.Length; ++index)
        {
            char currentCharacter = stringToSplit[index];
            if (currentCharacter == '"')
            {
                // When we see a ", we need to decide whether we are
                // at the start or end of a quoted section...
                inQuote = !inQuote;
            }
            else if (delimiters.Contains(currentCharacter) && inQuote == false)
            {
                // We've come to the end of a token, so we find the token,
                // trim it and add it to the collection of results...
                string result = currentToken.ToString().Trim();
                if (result != "") results.Add(result);

                // We start a new token...
                currentToken = new StringBuilder();
            }
            else
            {
                // We've got a 'normal' character, so we add it to
                // the current token...
                currentToken.Append(currentCharacter);
            }
        }

        // We've come to the end of the string, so we add the last token...
        string lastResult = currentToken.ToString().Trim();
        if (lastResult != "") results.Add(lastResult);

        return results;
    }

Solution 4 - C#

I was using Bartek Szabat's answer, but I needed to capture more than just "\w" characters in my tokens. To solve the problem, I modified his regex slightly, similar to Grzenio's answer:

Regular Expression: (?<match>[^\s"]+)|(?<match>"[^"]*")

C# String:          (?<match>[^\\s\"]+)|(?<match>\"[^\"]*\")

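Bartek's code becomes the following. Note that his original pattern returned tokens stripped of their enclosing quotes, whereas this pattern places the quotes inside the match group, so quoted tokens keep them: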

    Regex
        .Matches(input, "(?<match>[^\\s\"]+)|(?<match>\"[^\"]*\")")
        .Cast<Match>()
        .Select(m => m.Groups["match"].Value)
        .ToList()
        .ForEach(s => Console.WriteLine(s));
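
As a rough check of the pattern above, the following sketch wraps it in a helper and removes the enclosing quotes afterwards with Trim, since the quotes sit inside the match group. The class name NonWordTokenDemo and the method name Tokenize are made up for this illustration and are not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NonWordTokenDemo
{
    // Hypothetical helper: applies the pattern from this answer and then
    // strips the quotes that the second alternative captures.
    public static List<string> Tokenize(string input)
    {
        return Regex.Matches(input, "(?<match>[^\\s\"]+)|(?<match>\"[^\"]*\")")
            .Cast<Match>()
            // The quotes are part of the captured group, so remove them here.
            .Select(m => m.Groups["match"].Value.Trim('"'))
            .ToList();
    }
}
```

Calling NonWordTokenDemo.Tokenize on the input from Solution 3, "the lib.lib" "another lib".lib, should yield the three tokens the lib.lib, another lib and .lib, so the dots no longer cause a split.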

Solution 5 - C#

I have found the regex in this answer to be quite useful. To make it work in C# you will have to use the MatchCollection class.

    // \s must be written as \\s in a regular (non-verbatim) string literal
    string pattern = "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'";

    MatchCollection parsedStrings = Regex.Matches(line, pattern);

    for (int i = 0; i < parsedStrings.Count; i++)
    {
        // print parsed strings
        Console.Write(parsedStrings[i].Value + " ");
    }
    Console.WriteLine();

Solution 6 - C#

This regex will split based on the case you have given above, although it does not strip the quotes or extra spaces, so you may want to do some post processing on your strings. This should correctly keep quoted strings together though.

"[^"]+"|\s?\w+?\s

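The post-processing mentioned above could look roughly like the sketch below. It is only an illustration: the names PostProcessDemo and Tokenize are invented here, and appending a trailing space to the input is an assumption of this sketch, added because the unquoted alternative requires whitespace after each word and would otherwise miss a bare word at the very end.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PostProcessDemo
{
    // Hypothetical helper: runs the pattern from this answer, then cleans
    // up each raw match.
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();

        // Assumption of this sketch: pad the input with one space so that a
        // final unquoted word is still followed by whitespace.
        foreach (Match m in Regex.Matches(input + " ", @"""[^""]+""|\s?\w+?\s"))
        {
            // Strip surrounding whitespace first, then the enclosing quotes.
            tokens.Add(m.Value.Trim().Trim('"'));
        }

        return tokens;
    }
}
```

With the input from the question, this should return Here, is, my string, it, has and six  matches, with the double space inside the last token preserved.
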
Solution 7 - C#

With a little bit of messiness, regular languages can keep track of even/odd counting of quotes, but if your data can include escaped quotes (\") then you're in real trouble producing or comprehending a regular expression that will handle that correctly.
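
To give a feel for how quickly the pattern grows, here is a minimal sketch that handles only the simplest convention, a backslash placed before a quote inside a quoted section. It is not a general answer to the concern above: doubled quotes, unbalanced quotes and escapes outside quoted sections are not covered, and the names EscapedQuoteTokenizer and Tokenize are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapedQuoteTokenizer
{
    // Quoted token: an opening quote, then any mix of backslash-escaped
    // characters (\\.) and ordinary characters that are neither a quote nor
    // a backslash, then the closing quote.
    // Bare token: a run of characters containing no whitespace and no quotes.
    private static readonly Regex TokenPattern =
        new Regex(@"""(?:\\.|[^""\\])*""|[^\s""]+");

    // Returns the raw matches; quotes and backslashes are left in place.
    public static List<string> Tokenize(string input)
    {
        return TokenPattern.Matches(input)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

For the input say "he said \"hi\"" now, this should produce three tokens, with the whole quoted section, including its escaped inner quotes, kept together as the second one.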

Solution 8 - C#

EDIT: Sorry for my previous post, this is obviously possible.

To handle all the non-alphanumeric characters you need something like this:

    MatchCollection matchCollection = Regex.Matches(input, @"(?<match>[^""\s]+)|\""(?<match>[^""]*)""");
    foreach (Match match in matchCollection)
    {
        yield return match.Groups["match"].Value;
    }

You can make the foreach smarter if you are using a .NET version later than 2.0.
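
One way to read "smarter" is to replace the loop with LINQ, which is available from .NET 3.5 onward. The sketch below is an interpretation rather than the answer's own code, and the names LinqTokenizer and Tokenize are invented for it.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinqTokenizer
{
    // Hypothetical helper: same pattern as above, with the foreach/yield
    // loop replaced by a lazily evaluated LINQ projection.
    public static IEnumerable<string> Tokenize(string input)
    {
        return Regex.Matches(input, @"(?<match>[^""\s]+)|""(?<match>[^""]*)""")
            .Cast<Match>()
            // Both alternatives write to the group named "match"; for quoted
            // tokens the quotes stay outside the group.
            .Select(m => m.Groups["match"].Value);
    }
}
```

For example, LinqTokenizer.Tokenize applied to a.b "c d" e-f should enumerate a.b, c d and e-f.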

Solution 9 - C#

Shaun,

I believe the following regex should do it

(?<=")\w[\w\s]*(?=")|\w+  

Regards,
Lieven

Solution 10 - C#

Take a look at LSteinle's "Split Function that Supports Text Qualifiers" over at CodeProject.

Here is the snippet from his project that you’re interested in.

using System.Text.RegularExpressions;

public string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
{
    string _Statement = String.Format("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))", 
                        Regex.Escape(delimiter), Regex.Escape(qualifier));

    RegexOptions _Options = RegexOptions.Compiled | RegexOptions.Multiline;
    if (ignoreCase) _Options = _Options | RegexOptions.IgnoreCase;

    Regex _Expression = new Regex(_Statement, _Options);
    return _Expression.Split(expression);
}

Just watch out for calling this in a loop, as it's creating and compiling the Regex statement every time you call it. So if you need to call it more than a handful of times, I would look at creating a Regex cache of some kind.
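
A cache of that kind could be sketched as follows. This is one possible approach rather than part of LSteinle's project: the class name QualifiedSplitter, the ConcurrentDictionary and the string cache key are all assumptions made for the illustration.

```csharp
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

public static class QualifiedSplitter
{
    // Assumed design: compiled Regex instances keyed by options plus pattern,
    // so each delimiter/qualifier combination is compiled only once.
    private static readonly ConcurrentDictionary<string, Regex> Cache =
        new ConcurrentDictionary<string, Regex>();

    public static string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
    {
        // Same pattern as the snippet above: a delimiter followed by an even
        // number of qualifiers up to the end of the input.
        string pattern = string.Format(
            "{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
            Regex.Escape(delimiter), Regex.Escape(qualifier));

        RegexOptions options = RegexOptions.Compiled | RegexOptions.Multiline;
        if (ignoreCase) options |= RegexOptions.IgnoreCase;

        // The key combines the numeric options value and the pattern text.
        string key = ((int)options) + ":" + pattern;
        Regex regex = Cache.GetOrAdd(key, _ => new Regex(pattern, options));

        return regex.Split(expression);
    }
}
```

Splitting Here is "my string" it on a single space with a double quote as the qualifier should return Here, is, "my string" (quotes kept) and it. Repeated calls with the same delimiter and qualifier reuse the cached instance.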

Solution 11 - C#

If you'd like to take a look at a general solution to this problem in the form of a free, open-source JavaScript object, you can visit http://splitterjsobj.sourceforge.net/ for a live demo (and download). The object has the following features:

  • Pairs of user-defined quote characters can be used to escape the delimiter (prevent a split inside quotes). The quotes can be escaped with a user-defined escape char, and/or by "double quote escape." The escape char can be escaped (with itself). In one of the 5 output arrays (properties of the object), output is unescaped. (For example, if the escape char = /, "a///"b" is unescaped as a/"b)
  • Split on an array of delimiters; parse a file in one call. (The output arrays will be nested.)
  • All escape sequences recognized by JavaScript can be evaluated during the split process and/or in a preprocess.
  • Callback functionality
  • Cross-browser consistency

The object is also available as a jQuery plugin, but as a new user at this site I can only include one link in this message.

Solution 12 - C#

I need to support nesting so none of these worked for me. I gave up trying to do it via Regex and just coded:

  public static Argument[] ParseCmdLine(string args) {
    List<string> ls = new List<string>();
    StringBuilder sb = new StringBuilder(128);

    // support quoted text nesting up to 8 levels deep
    // (slot 0 is never written; it only marks "not inside a quote")
    Span<char> quoteChar = stackalloc char[9];
    int quoteLevel = 0;
      
    for (int i = 0; i < args.Length; ++i) {
      char ch = args[i];
      switch (ch) {
        case ' ':
          if (quoteLevel == 0) {
            ls.Add(sb.ToString());
            sb.Clear();
            break;
          } 
          goto default; 
        case '"':
        case '\'':
          if (quoteChar[quoteLevel] == ch) {
            --quoteLevel;
          } else {
            quoteChar[++quoteLevel] = ch;
          }
          goto default; 
        default:
          sb.Append(ch);
          break;
      }
    }
    if (sb.Length > 0) { ls.Add(sb.ToString()); sb.Clear(); }

    return Arguments.ParseCmdLine(ls.ToArray());
  }

And here's some additional code to parse the command line arguments to objects:

  public struct Argument {
    public string Prefix;
    public string Name;
    public string Eq;
    public string QuoteType;
    public string Value;

    public string[] ToArray() => this.Eq == " " ? new string[] { $"{Prefix}{Name}", $"{QuoteType}{Value}{QuoteType}" } : new string[] { this.ToString() };
    public override string ToString() => $"{Prefix}{Name}{Eq}{QuoteType}{Value}{QuoteType}";
  }

  private static readonly Regex RGX_MatchArg = new Regex(@"^(?<prefix>-{1,2}|\/)(?<name>[a-zA-Z][a-zA-Z_-]*)(?<assignment>(?<eq>[:= ]|$)(?<quote>[""'])?(?<value>.+?)(?:\k<quote>|\s*$))?");
  private static readonly Regex RGX_MatchQuoted = new Regex(@"(?<quote>[""'])?(?<value>.+?)(?:\k<quote>|\s*$)");

  public static Argument[] ParseCmdLine(string[] rawArgs) {
    int count = 0;
    Argument[] pairs = new Argument[rawArgs.Length];

    int i = 0;
    while(i < rawArgs.Length) {
      string current = rawArgs[i];
      i+=1;
      Match matches = RGX_MatchArg.Match(current);
      Argument arg = new Argument();
      arg.Prefix = matches.Groups["prefix"].Value;
      arg.Name = matches.Groups["name"].Value;
      arg.Value = matches.Groups["value"].Value;
      if(!string.IsNullOrEmpty(arg.Value)) {
        arg.Eq = matches.Groups["eq"].Value;
        arg.QuoteType = matches.Groups["quote"].Value;
      } else if ((i < rawArgs.Length) && !rawArgs[i].StartsWith('-') && !rawArgs[i].StartsWith('/')) {
        arg.Eq = " ";
        Match quoted = RGX_MatchQuoted.Match(rawArgs[i]);
        arg.QuoteType = quoted.Groups["quote"].Value;
        arg.Value = quoted.Groups["value"].Value;
        i+=1;
      }
      if(string.IsNullOrEmpty(arg.QuoteType) && arg.Value.IndexOfAny(new char[] { ' ', '/', '\\', '-', '=', ':' }) >= 0) {
        arg.QuoteType = "\"";
      }
      pairs[count++] = arg;
    }

    return pairs[0..count];
  }

  public static ILookup<string, Argument> ToLookup(this Argument[] args) => args.ToLookup((arg) => arg.Name, StringComparer.OrdinalIgnoreCase);
}

It's able to parse all different kinds of argument variants:

-test -environment staging /DEqTest=avalue /Dcolontest:anothervalue /DwithSpaces="heys: guys" /slashargflag -action="Do: 'The Thing'" -action2 "do: 'Do: \"The Thing\"'" -init

Nested quotes just need to be alternated between different quote types.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author        Original Content on Stackoverflow
Question            Shaun Bowe             View Question on Stackoverflow
Solution 1 - C#     Bartek Szabat          View Answer on Stackoverflow
Solution 2 - C#     Timothy Walters        View Answer on Stackoverflow
Solution 3 - C#     Richard Shepherd       View Answer on Stackoverflow
Solution 4 - C#     Boinst                 View Answer on Stackoverflow
Solution 5 - C#     Syed Ali               View Answer on Stackoverflow
Solution 6 - C#     John Conrad            View Answer on Stackoverflow
Solution 7 - C#     Liudvikas Bukys        View Answer on Stackoverflow
Solution 8 - C#     Grzenio                View Answer on Stackoverflow
Solution 9 - C#     Lieven Keersmaekers    View Answer on Stackoverflow
Solution 10 - C#    Adam Larsen            View Answer on Stackoverflow
Solution 11 - C#    Brian W                View Answer on Stackoverflow
Solution 12 - C#    Derek Ziemba           View Answer on Stackoverflow