Converting a MatchCollection to string array

C#, Arrays, Regex

C# Problem Overview


Is there a better way than this to convert a MatchCollection to a string array?

MatchCollection mc = Regex.Matches(strText, @"\b[A-Za-z-']+\b");
string[] strArray = new string[mc.Count];
for (int i = 0; i < mc.Count; i++)
{
    strArray[i] = mc[i].Groups[0].Value;
}

P.S.: mc.CopyTo(strArray, 0) throws an exception: "At least one element in the source array could not be cast down to the destination array type."
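
For context, CopyTo copies the Match objects themselves rather than their text, so the destination array has to be able to hold Match instances. A string[] cannot, hence the exception (the exact exception type may differ between runtimes). A minimal sketch, assuming you still want to go through CopyTo: copy into a Match[] first and convert afterwards. The names CopyToExample and ToStringArray are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CopyToExample
{
    // Hypothetical helper: copies the matches into a Match[] (which CopyTo accepts),
    // then projects each Match to its matched text.
    public static string[] ToStringArray(MatchCollection mc)
    {
        var matches = new Match[mc.Count];
        mc.CopyTo(matches, 0);
        return Array.ConvertAll(matches, m => m.Value);
    }
}
```

Usage would look like: string[] strArray = CopyToExample.ToStringArray(Regex.Matches(strText, @"\b[A-Za-z-']+\b"));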

C# Solutions


Solution 1 - C#

Try:

var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Solution 2 - C#

Dave Bish's answer is good and works properly.

It's worth noting, though, that replacing Cast<Match>() with OfType<Match>() sped things up in my test.

The code would become:

var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
    .OfType<Match>()
    .Select(m => m.Groups[0].Value)
    .ToArray();

The result is exactly the same (and it addresses the OP's issue in exactly the same way), but for huge strings it was faster in my test.

Test code:

// put it in a console application
static void Test()
{
    Stopwatch sw = new Stopwatch();
    StringBuilder sb = new StringBuilder();
    string strText = "this will become a very long string after my code has done appending it to the stringbuilder ";

    Enumerable.Range(1, 100000).ToList().ForEach(i => sb.Append(strText));
    strText = sb.ToString();

    sw.Start();
    var arr = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
              .OfType<Match>()
              .Select(m => m.Groups[0].Value)
              .ToArray();
    sw.Stop();

    Console.WriteLine("OfType: " + sw.ElapsedMilliseconds.ToString());
    sw.Reset();

    sw.Start();
    var arr2 = Regex.Matches(strText, @"\b[A-Za-z-']+\b")
              .Cast<Match>()
              .Select(m => m.Groups[0].Value)
              .ToArray();
    sw.Stop();
    Console.WriteLine("Cast: " + sw.ElapsedMilliseconds.ToString());
}

Output follows:

OfType: 6540
Cast: 8743

For very long strings, Cast() was therefore slower in this run.
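
Speed aside, the two operators also differ in behavior when a sequence contains elements of another type. That never happens with a MatchCollection, where every element is a Match, but it explains why they are not interchangeable in general: OfType skips elements that do not fit, while Cast throws. A small sketch to illustrate, with CastVsOfType and its methods being hypothetical names:

```csharp
using System;
using System.Collections;
using System.Linq;

public static class CastVsOfType
{
    // OfType<T> silently skips elements that are not of type T.
    public static int CountWithOfType(IEnumerable items) =>
        items.OfType<string>().Count();

    // Cast<T> throws InvalidCastException when it reaches an element that is not a T.
    public static bool CastThrows(IEnumerable items)
    {
        try
        {
            items.Cast<string>().ToArray();
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }
}
```

With var mixed = new ArrayList { "a", 1, "b" }; the first method counts 2 strings and the second reports that Cast threw.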

Solution 3 - C#

I ran the exact same benchmark that Alex posted and found that sometimes Cast was faster and sometimes OfType was faster, but the difference between the two was negligible. However, while ugly, the for loop was consistently faster than both.

Stopwatch sw = new Stopwatch();
StringBuilder sb = new StringBuilder();
string strText = "this will become a very long string after my code has done appending it to the stringbuilder ";
Enumerable.Range(1, 100000).ToList().ForEach(i => sb.Append(strText));
strText = sb.ToString();

// First two benchmarks (OfType and Cast) go here, as in Alex's test code

sw.Start();
MatchCollection mc = Regex.Matches(strText, @"\b[A-Za-z-']+\b");
var matches = new string[mc.Count];
for (int i = 0; i < matches.Length; i++)
{
    matches[i] = mc[i].ToString();
}
sw.Stop();

Results:

OfType: 3462
Cast: 3499
For: 2650
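
One likely reason the numbers move around so much between runs: Regex.Matches evaluates lazily, so each timing above also includes the regex scan itself, which probably outweighs the conversion cost. A rough sketch, assuming you want to time only the conversion: force the scan first by reading Count, then time each approach on the already-populated collection. ConversionTimer and its members are hypothetical names, and no output is shown because timings depend on machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public static class ConversionTimer
{
    // Hypothetical helper: runs a conversion once and returns the elapsed milliseconds.
    public static long Time(Func<string[]> convert)
    {
        var sw = Stopwatch.StartNew();
        convert();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static string[] WithCast(MatchCollection mc) =>
        mc.Cast<Match>().Select(m => m.Value).ToArray();

    public static string[] WithOfType(MatchCollection mc) =>
        mc.OfType<Match>().Select(m => m.Value).ToArray();

    public static string[] WithFor(MatchCollection mc)
    {
        var result = new string[mc.Count];
        for (int i = 0; i < result.Length; i++)
            result[i] = mc[i].Value;
        return result;
    }

    public static void Run(string text)
    {
        MatchCollection mc = Regex.Matches(text, @"\b[A-Za-z-']+\b");
        int count = mc.Count; // reading Count forces the lazy collection to finish scanning

        Console.WriteLine("Matches: " + count);
        Console.WriteLine("Cast:    " + Time(() => WithCast(mc)));
        Console.WriteLine("OfType:  " + Time(() => WithOfType(mc)));
        Console.WriteLine("For:     " + Time(() => WithFor(mc)));
    }
}
```

Even then, a single Stopwatch run is noisy; repeating each measurement several times after a warm-up pass would give steadier numbers.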

Solution 4 - C#

One could also use this extension method to deal with the annoyance of MatchCollection not being generic. Not that it's a big deal, but it is almost certainly more performant than OfType or Cast, because it only enumerates, which both of those have to do anyway.

(Side note: I wonder if it would be possible for the .NET team to make MatchCollection inherit generic versions of ICollection and IEnumerable in the future? Then we wouldn't need this extra step to immediately have LINQ transforms available).

public static IEnumerable<Match> ToEnumerable(this MatchCollection mc)
{
    if (mc != null) {
        foreach (Match m in mc)
            yield return m;
    }
}
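
Regarding the side note above: as far as I know, newer runtimes (.NET Core 2.0 and later, including .NET 5+) do have MatchCollection implement IEnumerable<Match>, so the extension method is unnecessary there and LINQ can be applied directly. A small sketch, assuming such a runtime (it will not compile against .NET Framework); ModernMatches and Words are illustrative names only.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class ModernMatches
{
    // Assumes a runtime where MatchCollection implements IEnumerable<Match>,
    // so Select can be called without Cast<Match>() or OfType<Match>().
    public static string[] Words(string text) =>
        Regex.Matches(text, @"\b[A-Za-z-']+\b")
             .Select(m => m.Value)
             .ToArray();
}
```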

Solution 5 - C#

Consider the following code...

var emailAddress = "joe@sad.com; joe@happy.com; joe@elated.com";
List<string> emails = Regex.Matches(emailAddress, @"([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})")
                .Cast<Match>()
                .Select(m => m.Groups[0].Value)
                .ToList();

Solution 6 - C#

If you need a recursive capture, e.g. for tokenizing math equations:

// INPUT (I need this tokenized to do math)
string sTests = "(1234+5678)/ (56.78-   1234   )";

Regex splitter = new Regex(@"([\d,\.]+|\D)+");
Match match = splitter.Match(sTests.Replace(" ", ""));
string[] captures = (from capture in match.Groups.Cast<Group>().Last().Captures.Cast<Capture>()
                     select capture.Value).ToArray();

...because you need to go after the last captures in the last group.
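
To unpack that: the pattern has a single capturing group that is repeated by the trailing +, so one Match holds every token as a separate Capture of that group. A minimal sketch of the same idea, assuming the group is addressed by index 1 instead of via Last(); Tokenizer and Tokenize are hypothetical names.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    public static string[] Tokenize(string expression)
    {
        // Group 1 matches either a run of digits, commas and dots, or one non-digit character.
        var splitter = new Regex(@"([\d,\.]+|\D)+");
        Match match = splitter.Match(expression.Replace(" ", ""));

        // Each repetition of group 1 is kept as its own Capture, in order.
        return match.Groups[1].Captures
                    .Cast<Capture>()
                    .Select(c => c.Value)
                    .ToArray();
    }
}
```

For the input above, this should yield the eleven tokens (, 1234, +, 5678, ), /, (, 56.78, -, 1234 and ).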

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Vildan | View Question on Stackoverflow
Solution 1 - C# | Dave Bish | View Answer on Stackoverflow
Solution 2 - C# | Alex | View Answer on Stackoverflow
Solution 3 - C# | David DeMar | View Answer on Stackoverflow
Solution 4 - C# | Nicholas Petersen | View Answer on Stackoverflow
Solution 5 - C# | gpmurthy | View Answer on Stackoverflow
Solution 6 - C# | mike | View Answer on Stackoverflow