Why are Where and Select outperforming just Select?

C#, LINQ

C# Problem Overview


I have a class, like this:

public class MyClass
{
    public int Value { get; set; }
    public bool IsValid { get; set; }
}

In actual fact it's much larger, but this recreates the problem (weirdness).

I want to get the sum of the Value, where the instance is valid. So far, I've found two solutions to this.

The first one is this:

int result = myCollection.Where(mc => mc.IsValid).Select(mc => mc.Value).Sum();

The second one, however, is this:

int result = myCollection.Select(mc => mc.IsValid ? mc.Value : 0).Sum();

I want to get the most efficient method. At first, I thought the second one would be more efficient. Then the theoretical part of me started going: "Well, one is O(n + m + m), the other one is O(n + n). The first one should perform better with more invalids, while the second one should perform better with fewer." I thought they would perform equally. EDIT: And then @Marcin pointed out that the Where and the Select were combined, so it should actually be O(m + n). However, if you look below, it seems this is not related.


So I put it to the test.

(It's 100+ lines, so I thought it was better to post it as a Gist.)
The results were... interesting.

With 0% tie tolerance:

The scales tip in favour of Where + Select, by about 30 points.

How much do you want to be the disambiguation percentage?
0
Starting benchmarking.
Ties: 0
Where + Select: 65
Select: 36

With 2% tie tolerance:

It's the same, except that for some they were within 2%. I'd say that's a minimum margin of error. Select and Where now have just a ~20 point lead.

How much do you want to be the disambiguation percentage?
2
Starting benchmarking.
Ties: 6
Where + Select: 58
Select: 37

With 5% tie tolerance:

This is what I'd say to be my maximum margin of error. It makes it a bit better for the Select, but not much.

How much do you want to be the disambiguation percentage?
5
Starting benchmarking.
Ties: 17
Where + Select: 53
Select: 31

With 10% tie tolerance:

This is way out of my margin of error, but I'm still interested in the result, because it shows the Select and Where keeping the twenty-point lead they've had for a while now.

How much do you want to be the disambiguation percentage?
10
Starting benchmarking.
Ties: 36
Where + Select: 44
Select: 21

With 25% tie tolerance:

This is way, way out of my margin of error, but I'm still interested in the result, because the Select and Where still (nearly) keep their 20-point lead. It seems like it outclasses the Select in a distinct few cases, and that's what's giving it the lead.

How much do you want to be the disambiguation percentage?
25
Starting benchmarking.
Ties: 85
Where + Select: 16
Select: 0


Now, I'm guessing that the 20 point lead came from the middle, where they're both bound to get around the same performance. I could try and log it, but it would be a whole load of information to take in. A graph would be better, I guess.

So that's what I did.

(Graph: Select vs. Select + Where.)

It shows that the Select line keeps steady (expected) and that the Select + Where line climbs up (expected). However, what puzzles me is why it doesn't meet with the Select at 50 or earlier: in fact I was expecting earlier than 50, as an extra enumerator had to be created for the Select and Where. I mean, this shows the 20-point lead, but it doesn't explain why. This, I guess, is the main point of my question.

Why does it behave like this? Should I trust it? If not, should I use the other one or this one?


As @KingKong mentioned in the comments, you can also use Sum's overload that takes a lambda. So my two options are now changed to this:

First:

int result = myCollection.Where(mc => mc.IsValid).Sum(mc => mc.Value);

Second:

int result = myCollection.Sum(mc => mc.IsValid ? mc.Value : 0);

I'm going to keep this one a bit shorter:

How much do you want to be the disambiguation percentage?
0
Starting benchmarking.
Ties: 0
Where: 60
Sum: 41
How much do you want to be the disambiguation percentage?
2
Starting benchmarking.
Ties: 8
Where: 55
Sum: 38
How much do you want to be the disambiguation percentage?
5
Starting benchmarking.
Ties: 21
Where: 49
Sum: 31
How much do you want to be the disambiguation percentage?
10
Starting benchmarking.
Ties: 39
Where: 41
Sum: 21
How much do you want to be the disambiguation percentage?
25
Starting benchmarking.
Ties: 85
Where: 16
Sum: 0

The twenty-point lead is still there, meaning it doesn't have to do with the Where and Select combination pointed out by @Marcin in the comments.

Thanks for reading through my wall of text! Also, if you're interested, here's the modified version that logs the CSV that Excel takes in.

C# Solutions


Solution 1 - C#

Select iterates once over the entire set and, for each item, performs a conditional branch (checking for validity) and a + operation.

Where+Select creates an iterator that skips invalid elements (doesn't yield them), performing a + only on the valid items.

So, the cost for a Select is:

> t(s) = n * ( cost(check valid) + cost(+) )

And for Where+Select:

> t(ws) = n * ( cost(check valid) + p(valid) * (cost(yield) + cost(+)) )

Where:

  • p(valid) is the probability that an item in the list is valid.
  • cost(check valid) is the cost of the branch that checks for validity.
  • cost(yield) is the cost of constructing the new state of the Where iterator, which is more complex than the simple iterator the Select version uses.

As you can see, for a given n, the Select version is a constant, whereas the Where+Select version is a linear equation with p(valid) as a variable. The actual values of the costs determine the intersection point of the two lines, and since cost(yield) can be different from cost(+), they don't necessarily intersect at p(valid)=0.5.
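To see how the intersection point moves, here is a small sketch that plugs made-up cost values into the two formulas above and solves for the break-even p(valid). All the numbers are assumptions for illustration, not measurements:

```csharp
using System;

class CostModel
{
    static void Main()
    {
        // Hypothetical per-item costs in arbitrary units (assumed, not measured):
        double check = 1.0;  // cost(check valid)
        double add   = 1.0;  // cost(+)
        double yield = 0.5;  // cost(yield) for the Where iterator

        // t(s)  = n * (check + add)
        // t(ws) = n * (check + p * (yield + add))
        // Setting them equal:  add = p * (yield + add)
        //                  =>  p   = add / (yield + add)
        double breakEven = add / (yield + add);

        // With these made-up costs the lines cross at p(valid) ~ 0.667,
        // i.e. above 50%, which matches the shape of the observed results.
        Console.WriteLine(breakEven);
    }
}
```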

Solution 2 - C#

Here's an in-depth explanation of what's causing the timing-differences.


The Sum() function for IEnumerable<int> looks like this:

public static int Sum(this IEnumerable<int> source)
{
    int sum = 0;
    foreach(int item in source)
    {
        sum += item;
    }
    return sum;
}

In C#, foreach is just syntactic sugar for .Net's version of an iterator, IEnumerator<T> (not to be confused with IEnumerable<T>). So the above code is actually translated to this:

public static int Sum(this IEnumerable<int> source)
{
    int sum = 0;

    IEnumerator<int> iterator = source.GetEnumerator();
    while(iterator.MoveNext())
    {
        int item = iterator.Current;
        sum += item;
    }
    return sum;
}

Remember, the two lines of code you're comparing are the following

int result1 = myCollection.Where(mc => mc.IsValid).Sum(mc => mc.Value);
int result2 = myCollection.Sum(mc => mc.IsValid ? mc.Value : 0);

Now here's the kicker:

LINQ uses deferred execution. Thus, while it may appear that result1 iterates over the collection twice, it actually only iterates over it once. The Where() condition is actually applied during the Sum(), inside of the call to MoveNext() (This is possible thanks to the magic of yield return).

This means that, for result1, the code inside of the while loop,

{
    int item = iterator.Current;
    sum += item;
}

is only executed once for each item with mc.IsValid == true. By comparison, result2 will execute that code for every item in the collection. That is why result1 is generally faster.

(Though, note that calling the Where() condition within MoveNext() still has some small overhead, so if most/all of the items have mc.IsValid == true, result2 will actually be faster!)
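The two pipelines therefore boil down to roughly these hand-written loops; this is a sketch of the effective work after deferred execution is applied, not the actual BCL code:

```csharp
using System;
using System.Collections.Generic;

class MyClass
{
    public int Value { get; set; }
    public bool IsValid { get; set; }
}

class Desugared
{
    static void Main()
    {
        var myCollection = new List<MyClass>
        {
            new MyClass { Value = 3, IsValid = true },
            new MyClass { Value = 5, IsValid = false },
            new MyClass { Value = 7, IsValid = true },
        };

        // result1: Where streams into Sum; += runs only for valid items.
        int sum1 = 0;
        foreach (var mc in myCollection)
            if (mc.IsValid)
                sum1 += mc.Value;

        // result2: Select streams into Sum; += runs for every item.
        int sum2 = 0;
        foreach (var mc in myCollection)
            sum2 += mc.IsValid ? mc.Value : 0;

        Console.WriteLine(sum1); // 10
        Console.WriteLine(sum2); // 10
    }
}
```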


Hopefully now it's clear why result2 is usually slower. Now I'd like to explain why I stated in the comments that these LINQ performance comparisons don't matter.

Creating a LINQ expression is cheap. Calling delegate functions is cheap. Allocating and looping over an iterator is cheap. But it's even cheaper to not do these things. Thus, if you find that a LINQ statement is the bottleneck in your program, in my experience rewriting it without LINQ will always make it faster than any of the various LINQ methods.

So, your LINQ workflow should look like this:

  1. Use LINQ everywhere.
  2. Profile.
  3. If the profiler says LINQ is the cause of a bottleneck, rewrite that piece of code without LINQ.

Fortunately, LINQ bottlenecks are rare. Heck, bottlenecks are rare. I've written hundreds of LINQ statements in the last few years, and have ended up replacing <1%. And most of those were due to LINQ2EF's poor SQL optimization, rather than being LINQ's fault.

So, like always, write clear and sensible code first, and wait until after you've profiled to worry about micro-optimizations.

Solution 3 - C#

Funny thing. Do you know how Sum(this IEnumerable<TSource> source, Func<TSource, int> selector) is defined? It uses the Select method!

public static int Sum<TSource>(this IEnumerable<TSource> source, Func<TSource, int> selector)
{
    return source.Select(selector).Sum();
}

So actually, it all should work nearly the same. I did a quick benchmark of my own, and here are the results:

Where -- mod: 1 result: 0, time: 371 ms
WhereSelect -- mod: 1  result: 0, time: 356 ms
Select -- mod: 1  result 0, time: 366 ms
Sum -- mod: 1  result: 0, time: 363 ms
-------------
Where -- mod: 2 result: 4999999, time: 469 ms
WhereSelect -- mod: 2  result: 4999999, time: 429 ms
Select -- mod: 2  result 4999999, time: 362 ms
Sum -- mod: 2  result: 4999999, time: 358 ms
-------------
Where -- mod: 3 result: 9999999, time: 441 ms
WhereSelect -- mod: 3  result: 9999999, time: 452 ms
Select -- mod: 3  result 9999999, time: 371 ms
Sum -- mod: 3  result: 9999999, time: 380 ms
-------------
Where -- mod: 4 result: 7500000, time: 571 ms
WhereSelect -- mod: 4  result: 7500000, time: 501 ms
Select -- mod: 4  result 7500000, time: 406 ms
Sum -- mod: 4  result: 7500000, time: 397 ms
-------------
Where -- mod: 5 result: 7999999, time: 490 ms
WhereSelect -- mod: 5  result: 7999999, time: 477 ms
Select -- mod: 5  result 7999999, time: 397 ms
Sum -- mod: 5  result: 7999999, time: 394 ms
-------------
Where -- mod: 6 result: 9999999, time: 488 ms
WhereSelect -- mod: 6  result: 9999999, time: 480 ms
Select -- mod: 6  result 9999999, time: 391 ms
Sum -- mod: 6  result: 9999999, time: 387 ms
-------------
Where -- mod: 7 result: 8571428, time: 489 ms
WhereSelect -- mod: 7  result: 8571428, time: 486 ms
Select -- mod: 7  result 8571428, time: 384 ms
Sum -- mod: 7  result: 8571428, time: 381 ms
-------------
Where -- mod: 8 result: 8749999, time: 494 ms
WhereSelect -- mod: 8  result: 8749999, time: 488 ms
Select -- mod: 8  result 8749999, time: 386 ms
Sum -- mod: 8  result: 8749999, time: 373 ms
-------------
Where -- mod: 9 result: 9999999, time: 497 ms
WhereSelect -- mod: 9  result: 9999999, time: 494 ms
Select -- mod: 9  result 9999999, time: 386 ms
Sum -- mod: 9  result: 9999999, time: 371 ms

For following implementations:

result = source.Where(x => x.IsValid).Sum(x => x.Value);
result = source.Select(x => x.IsValid ? x.Value : 0).Sum();
result = source.Sum(x => x.IsValid ? x.Value : 0);
result = source.Where(x => x.IsValid).Select(x => x.Value).Sum();

mod means that one in every mod items is invalid: for mod == 1 every item is invalid, for mod == 2 odd items are invalid, and so on. The collection contains 10,000,000 items.
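For reference, here is one plausible way to build a collection following that scheme. This is a sketch based on my reading of the description; the answer doesn't show its actual generator:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class MyClass
{
    public int Value { get; set; }
    public bool IsValid { get; set; }
}

class Generator
{
    static void Main()
    {
        const int count = 100;  // the answer used 10,000,000
        const int mod = 2;

        // One in every `mod` items is invalid; for mod == 1, every item is.
        List<MyClass> source = Enumerable.Range(0, count)
            .Select(i => new MyClass { Value = i, IsValid = i % mod != 0 })
            .ToList();

        // For mod == 2, half of the 100 items are invalid.
        Console.WriteLine(source.Count(x => !x.IsValid)); // 50
    }
}
```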

(Chart: timings above for the 10,000,000-item collection.)

And results for collection with 100000000 items:

(Chart: timings for the 100,000,000-item collection.)

As you can see, the Select and Sum results are quite consistent across all mod values; however, the Where and Where+Select results are not.

Solution 4 - C#

My guess is that the version with Where filters out the invalid items so they never reach Sum (i.e. you are not executing the addition for them). This is of course a guess, since I cannot explain how executing an additional lambda expression and calling multiple methods outperforms a simple addition of a 0.

A friend of mine suggested that summing the extra 0s may cause a severe performance penalty because of overflow checks. It would be interesting to see how this would perform in an unchecked context.
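For what it's worth, wrapping the LINQ call site in unchecked wouldn't change anything, because Enumerable.Sum is compiled with checked arithmetic internally; testing the guess requires hand-rolling the loop. A minimal sketch:

```csharp
using System;
using System.Collections.Generic;

class MyClass
{
    public int Value { get; set; }
    public bool IsValid { get; set; }
}

class UncheckedSum
{
    static void Main()
    {
        var myCollection = new List<MyClass>
        {
            new MyClass { Value = 3, IsValid = true },
            new MyClass { Value = 5, IsValid = false },
        };

        // Hand-rolled loop: `unchecked` here actually removes the overflow
        // check, unlike wrapping a call to LINQ's (internally checked) Sum.
        int sum = 0;
        unchecked
        {
            foreach (var mc in myCollection)
                sum += mc.IsValid ? mc.Value : 0;
        }
        Console.WriteLine(sum); // 3
    }
}
```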

Solution 5 - C#

Running the following sample, it becomes clear to me that the only time Where+Select can outperform Select is when it discards a good amount (roughly half, in my informal tests) of the potential items in the list. In the small example below, I get roughly the same numbers from both samples when the Where skips about 4 million items out of 10 million. I ran in Release mode and reordered the execution of Where+Select vs. Select, with the same results.

static void Main(string[] args)
{
    int total = 10000000;
    Random r = new Random();
    var list = Enumerable.Range(0, total).Select(i => r.Next(0, 5)).ToList();
    for (int i = 0; i < 4000000; i++)
        list[i] = 10;

    var sw = new Stopwatch();
    sw.Start();

    int sum = 0;

    sum = list.Where(i => i < 10).Select(i => i).Sum();

    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);

    sw.Reset();
    sw.Start();
    sum = list.Select(i => i).Sum();

    sw.Stop();

    Console.WriteLine(sw.ElapsedMilliseconds);
}

Solution 6 - C#

If you need speed, just doing a straightforward loop is probably your best bet. And doing for tends to be better than foreach (assuming your collection is random-access of course).

Here are the timings I got with 10% of elements being invalid:

Where + Select + Sum:   257
Select + Sum:           253
foreach:                111
for:                    61

And with 90% invalid elements:

Where + Select + Sum:   177
Select + Sum:           247
foreach:                105
for:                    58

And here is my benchmark code...

public class MyClass {
    public int Value { get; set; }
    public bool IsValid { get; set; }
}

class Program {

    static void Main(string[] args) {

        const int count = 10000000;
        const int percentageInvalid = 90;

        var rnd = new Random();
        var myCollection = new List<MyClass>(count);
        for (int i = 0; i < count; ++i) {
            myCollection.Add(
                new MyClass {
                    Value = rnd.Next(0, 50),
                    IsValid = rnd.Next(0, 100) > percentageInvalid
                }
            );
        }

        var sw = new Stopwatch();
        sw.Restart();
        int result1 = myCollection.Where(mc => mc.IsValid).Select(mc => mc.Value).Sum();
        sw.Stop();
        Console.WriteLine("Where + Select + Sum:\t{0}", sw.ElapsedMilliseconds);

        sw.Restart();
        int result2 = myCollection.Select(mc => mc.IsValid ? mc.Value : 0).Sum();
        sw.Stop();
        Console.WriteLine("Select + Sum:\t\t{0}", sw.ElapsedMilliseconds);
        Debug.Assert(result1 == result2);

        sw.Restart();
        int result3 = 0;
        foreach (var mc in myCollection) {
            if (mc.IsValid)
                result3 += mc.Value;
        }
        sw.Stop();
        Console.WriteLine("foreach:\t\t{0}", sw.ElapsedMilliseconds);
        Debug.Assert(result1 == result3);

        sw.Restart();
        int result4 = 0;
        for (int i = 0; i < myCollection.Count; ++i) {
            var mc = myCollection[i];
            if (mc.IsValid)
                result4 += mc.Value;
        }
        sw.Stop();
        Console.WriteLine("for:\t\t\t{0}", sw.ElapsedMilliseconds);
        Debug.Assert(result1 == result4);

    }

}

BTW, I concur with Stilgar's guess: the relative speeds of your two cases vary depending on the percentage of invalid items, simply because the amount of work Sum needs to do varies in the "Where" case.

Solution 7 - C#

Rather than try to explain via description, I'm going to take a more mathematical approach.

Given the code below, which should approximate what LINQ is doing internally (N is the total number of items, M the number of valid items, d the cost of a delegate invocation, and a the cost of an addition), the relative costs are as follows:

Select only: Nd + Na
Where+Select: Nd + Md + Ma

To figure out the point where they will cross, we need to do a little algebra:

Nd + Md + Ma = Nd + Na  =>  M(d + a) = Na  =>  M/N = a/(d + a)

What this means is that for the inflection point to be at 50%, the cost of a delegate invocation must be roughly the same as the cost of an addition. Since we know the actual inflection point was at about 60%, we can work backwards and determine that the cost of a delegate invocation for @It'sNotALie was actually about 2/3 the cost of an addition, which is surprising, but that's what his numbers say.

static void Main(string[] args)
{
    var set = Enumerable.Range(1, 10000000)
                        .Select(i => new MyClass {Value = i, IsValid = i%2 == 0})
                        .ToList();

    Func<MyClass, int> select = i => i.IsValid ? i.Value : 0;
    Console.WriteLine(
        Sum(                        // Cost: N additions
            Select(set, select)));  // Cost: N delegate
    // Total cost: N * (delegate + addition) = Nd + Na

    Func<MyClass, bool> where = i => i.IsValid;
    Func<MyClass, int> wSelect = i => i.Value;
    Console.WriteLine(
        Sum(                        // Cost: M additions
            Select(                 // Cost: M delegate
                Where(set, where),  // Cost: N delegate
                wSelect)));
    // Total cost: N * delegate + M * (delegate + addition) = Nd + Md + Ma
}

// Cost: N delegate calls
static IEnumerable<T> Where<T>(IEnumerable<T> set, Func<T, bool> predicate)
{
    foreach (var mc in set)
    {
        if (predicate(mc))
        {
            yield return mc;
        }
    }
}

// Cost: N delegate calls
static IEnumerable<int> Select<T>(IEnumerable<T> set, Func<T, int> selector)
{
    foreach (var mc in set)
    {
        yield return selector(mc);
    }
}

// Cost: N additions
static int Sum(IEnumerable<int> set)
{
    unchecked
    {
        var sum = 0;
        foreach (var i in set)
        {
            sum += i;
        }

        return sum;
    }
}

Solution 8 - C#

I think it's interesting that MarcinJuraszek's result is different from It'sNotALie's. In particular, MarcinJuraszek's results start out with all four implementations at the same place, while It'sNotALie's results cross over around the middle. I will explain how this works from the source.

Let us assume that there are n total elements, and m valid elements.

The Sum function is pretty simple. It just loops through the enumerator: http://typedescriptor.net/browse/members/367300-System.Linq.Enumerable.Sum(IEnumerable%601)

For simplicity, let's assume the collection is a list. Both the Select and the Where+Select versions create a WhereSelectListIterator, which means the actual iterators generated are the same. In both cases, Sum loops over that WhereSelectListIterator. The most interesting part of the iterator is its MoveNext method.

Since the iterators are the same, the loops are the same. The only difference is in the body of the loops.

The bodies of these lambdas have very similar cost. The Where clause returns a field value, and the ternary predicate also returns a field value. The Select clause returns a field value, and the two branches of the ternary operator return either a field value or a constant. The combined Select clause has its branch inside the ternary operator, whereas the Where+Select version branches inside MoveNext.

However, all of these operations are fairly cheap. The most expensive operation so far is the branch, where a wrong prediction will cost us.

Another expensive operation here is the Invoke. Invoking a function takes quite a bit longer than adding a value, as Branko Dimitrijevic has shown.

Also weighing in is the checked accumulation in Sum. If the processor does not have an arithmetic overflow flag, then this could be costly to check as well.
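For reference, this is what checked accumulation guards against; a minimal demonstration of the semantics, not the BCL source:

```csharp
using System;

class CheckedDemo
{
    static void Main()
    {
        int big = int.MaxValue;

        // Unchecked arithmetic wraps around silently.
        int wrapped = unchecked(big + 1);
        Console.WriteLine(wrapped); // -2147483648

        // LINQ's Sum accumulates with checked arithmetic, so an
        // overflow throws instead of wrapping; that extra check is
        // the per-addition cost discussed above.
        try
        {
            int boom = checked(big + 1);
            Console.WriteLine(boom);
        }
        catch (OverflowException)
        {
            Console.WriteLine("overflow");
        }
    }
}
```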

Hence, the interesting costs are:

  1. (n + m) * Invoke + m * checked+=
  2. n * Invoke + n * checked+=

Thus, if the cost of Invoke is much higher than the cost of checked accumulation, then case 2 is always better. If they're about even, then we will see a balance when about half of the elements are valid.

It looks like checked+= has negligible cost on MarcinJuraszek's system, but significant cost on It'sNotALie's and Branko Dimitrijevic's systems. It appears most expensive on It'sNotALie's system, since the break-even point is much higher. It doesn't look like anyone has posted results from a system where the accumulation costs much more than the Invoke.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: It'sNotALie. (View Question on Stackoverflow)
Solution 1 - C#: Alex (View Answer on Stackoverflow)
Solution 2 - C#: BlueRaja - Danny Pflughoeft (View Answer on Stackoverflow)
Solution 3 - C#: MarcinJuraszek (View Answer on Stackoverflow)
Solution 4 - C#: Stilgar (View Answer on Stackoverflow)
Solution 5 - C#: DavidN (View Answer on Stackoverflow)
Solution 6 - C#: Branko Dimitrijevic (View Answer on Stackoverflow)
Solution 7 - C#: Jon Norton (View Answer on Stackoverflow)
Solution 8 - C#: John Tseng (View Answer on Stackoverflow)