Huge performance difference (26x faster) when compiling for 32 and 64 bits

Tags: C#, Performance, 32-bit, 64-bit

C# Problem Overview


I was trying to measure the performance difference between using a for loop and a foreach loop when accessing lists of value types and reference types.

I used the following class to do the profiling.

public static class Benchmarker
{
    public static void Profile(string description, int iterations, Action func)
    {
        Console.Write(description);

        // Warm up
        func();

        Stopwatch watch = new Stopwatch();

        // Clean up
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        watch.Start();
        for (int i = 0; i < iterations; i++)
        {
            func();
        }
        watch.Stop();

        Console.WriteLine(" average time: {0} ms", watch.Elapsed.TotalMilliseconds / iterations);
    }
}

I used double as my value type, and I created this 'fake class' to test reference types:

class DoubleWrapper
{
    public double Value { get; set; }

    public DoubleWrapper(double value)
    {
        Value = value;
    }
}

Finally I ran this code and compared the time differences.

static void Main(string[] args)
{
    int size = 1000000;
    int iterationCount = 100;

    var valueList = new List<double>(size);
    for (int i = 0; i < size; i++) 
        valueList.Add(i);

    var refList = new List<DoubleWrapper>(size);
    for (int i = 0; i < size; i++) 
        refList.Add(new DoubleWrapper(i));

    double dummy;

    Benchmarker.Profile("valueList for: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < valueList.Count; i++)
        {
             unchecked
             {
                 var temp = valueList[i];
                 result *= temp;
                 result += temp;
                 result /= temp;
                 result -= temp;
             }
        }
        dummy = result;
    });

    Benchmarker.Profile("valueList foreach: ", iterationCount, () =>
    {
        double result = 0;
        foreach (var v in valueList)
        {
            var temp = v;
            result *= temp;
            result += temp;
            result /= temp;
            result -= temp;
        }
        dummy = result;
    });

    Benchmarker.Profile("refList for: ", iterationCount, () =>
    {
        double result = 0;
        for (int i = 0; i < refList.Count; i++)
        {
            unchecked
            {
                var temp = refList[i].Value;
                result *= temp;
                result += temp;
                result /= temp;
                result -= temp;
            }
        }
        dummy = result;
    });

    Benchmarker.Profile("refList foreach: ", iterationCount, () =>
    {
        double result = 0;
        foreach (var v in refList)
        {
            unchecked
            {
                var temp = v.Value;
                result *= temp;
                result += temp;
                result /= temp;
                result -= temp;
            }
        }

        dummy = result;
    });

    SafeExit();
}

I selected Release and Any CPU options, ran the program and got the following times:

valueList for:  average time: 483,967938 ms
valueList foreach:  average time: 477,873079 ms
refList for:  average time: 490,524197 ms
refList foreach:  average time: 485,659557 ms
Done!

Then I selected Release and x64 options, ran the program and got the following times:

valueList for:  average time: 16,720209 ms
valueList foreach:  average time: 15,953483 ms
refList for:  average time: 19,381077 ms
refList foreach:  average time: 18,636781 ms
Done!

Why is the x64 version so much faster? I expected some difference, but not one this big.

I do not have access to other computers. Could you please run this on your machines and tell me the results? I'm using Visual Studio 2015 and I have an Intel Core i7 930.

Here's the SafeExit() method, so you can compile/run by yourself:

private static void SafeExit()
{
    Console.WriteLine("Done!");
    Console.ReadLine();
    System.Environment.Exit(1);
}

As requested, using double? instead of my DoubleWrapper:

Any CPU

valueList for:  average time: 482,98116 ms
valueList foreach:  average time: 478,837701 ms
refList for:  average time: 491,075915 ms
refList foreach:  average time: 483,206072 ms
Done!

x64

valueList for:  average time: 16,393947 ms
valueList foreach:  average time: 15,87007 ms
refList for:  average time: 18,267736 ms
refList foreach:  average time: 16,496038 ms
Done!

Last but not least: creating an x86 profile gives me almost the same results as using Any CPU.

C# Solutions


Solution 1 - C#

I can reproduce this on .NET 4.5.2. No RyuJIT here. Both the x86 and x64 disassemblies look reasonable: range checks and so on are the same, the same basic structure, and no loop unrolling.

x86 uses a different set of float instructions. The performance of these instructions seems to be comparable with the x64 instructions except for the division:

  1. The 32 bit x87 float instructions use 10 byte precision internally.
  2. Extended precision division is super slow.

The division operation makes the 32 bit version extremely slow. Removing the division equalizes performance to a large degree (32 bit drops from 430 ms to 3.25 ms).
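
For reference, here is what "removing the division" amounts to, as a sketch against the loop from the question (it reuses valueList, iterationCount and dummy from the original Main):

Benchmarker.Profile("valueList for (no division): ", iterationCount, () =>
{
    double result = 0;
    for (int i = 0; i < valueList.Count; i++)
    {
        var temp = valueList[i];
        result *= temp;
        result += temp;
        // result /= temp;   // removed: per the measurement above, this single line
        //                   // accounts for the ~430 ms vs ~3.25 ms gap in 32 bit
        result -= temp;
    }
    dummy = result;
});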

Peter Cordes points out that the instruction latencies of the two floating point units are not that dissimilar. Maybe some of the intermediate results are denormalized numbers or NaN. These might trigger a slow path in one of the units. Or, maybe the values diverge between the two implementations because of 10 byte vs. 8 byte float precision.

Peter Cordes also points out that all intermediate results are NaN... Removing this problem (by using valueList.Add(i + 1) so that no divisor is zero) mostly equalizes the results. Apparently, the 32 bit code does not like NaN operands at all. Printing some intermediate values with if (i % 1000 == 0) Console.WriteLine(result); confirms that the data is now sane.
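
Concretely, the two tweaks just mentioned look like this (a sketch: the first snippet replaces the list initialization in Main, the second goes inside the profiled loop body):

// In Main: fill the list with 1..size instead of 0..size-1,
// so no element is zero and 0.0 / 0.0 never happens.
for (int i = 0; i < size; i++)
    valueList.Add(i + 1);

// Inside the profiled loop body: print the occasional running value
// to confirm the data stays sane (no NaN, no infinity).
if (i % 1000 == 0)
    Console.WriteLine(result);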

When benchmarking, you need to use a realistic workload. But who would have thought that an innocent division could mess up your benchmark?!

Try simply summing the numbers to get a better benchmark.
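
For instance, a sum-only variant of the value-type benchmark (a sketch reusing valueList, iterationCount and dummy from the question):

Benchmarker.Profile("valueList sum: ", iterationCount, () =>
{
    double result = 0;
    for (int i = 0; i < valueList.Count; i++)
    {
        result += valueList[i];   // addition only: no division, no NaNs
    }
    dummy = result;
});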

Division and modulo are always very slow. If you modify the BCL Dictionary code so that it does not use the modulo operator to compute the bucket index, performance measurably improves. That is how slow division is.
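
To illustrate the kind of change meant here (a hypothetical sketch, not the actual BCL source), a bucket index can be computed with a bitwise AND instead of the modulo operator when the bucket count is a power of two:

// Hypothetical helpers; Dictionary<TKey,TValue> does not actually look like this.
static int BucketIndexModulo(int hashCode, int bucketCount)
{
    // Conceptually what the BCL does: one integer division per lookup.
    return (hashCode & 0x7FFFFFFF) % bucketCount;
}

static int BucketIndexMasked(int hashCode, int bucketCountPowerOfTwo)
{
    // No division at all; requires the bucket count to be a power of two.
    return (hashCode & 0x7FFFFFFF) & (bucketCountPowerOfTwo - 1);
}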

Here's the 32 bit code:

[screenshot of the x86 (x87 floating point) disassembly]

64 bit code (same structure, fast division):

[screenshot of the x64 (SSE floating point) disassembly]

This is not vectorized despite SSE instructions being used.

Solution 2 - C#

valueList[i] = i, starting from i = 0, so the first loop iteration does 0.0 / 0.0. That produces NaN, and since result carries over from one iteration to the next, every operation in your entire benchmark is done with NaNs.
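
Tracing the first iteration of the question's loop body makes this explicit (result starts at 0.0 and temp is valueList[0], which is 0.0):

double result = 0.0;
double temp = 0.0;    // valueList[0]

result *= temp;       // 0.0 * 0.0 -> 0.0
result += temp;       // 0.0 + 0.0 -> 0.0
result /= temp;       // 0.0 / 0.0 -> NaN
result -= temp;       // NaN - 0.0 -> NaN
// result is NaN from here on, and NaN combined with anything stays NaN,
// so every later iteration operates entirely on NaNs.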

As @usr showed in disassembly output, the 32bit version used x87 floating point, while 64bit used SSE floating point.

I'm not an expert on performance with NaNs, or the difference between x87 and SSE for this, but I think this explains the 26x perf difference. I bet your results will be a lot closer between 32 and 64bit if you initialize valueList[i] = i+1. (update: usr confirmed that this made 32 and 64bit performance fairly close.)

Division is very slow compared to other operations. See my comments on @usr's answer. Also see http://agner.org/optimize/ for tons of great stuff about hardware, and optimizing asm and C/C++, some of it relevant to C#. He has instruction tables of latency and throughput for most instructions for all recent x86 CPUs.

However, 10B x87 fdiv isn't much slower than SSE2's 8B double precision divsd, for normal values. IDK about perf differences with NaNs, infinities, or denormals.

They have different controls for what happens with NaNs and other FPU exceptions, though. The x87 FPU control word is separate from the SSE rounding / exception control register (MXCSR). If x87 is getting a CPU exception for every division, but SSE isn't, that easily explains the factor of 26. Or maybe there's just a performance difference that big when handling NaNs. The hardware is not optimized for churning through NaN after NaN.

IDK if the SSE controls for avoiding slowdowns with denormals will come into play here, since I believe result will be NaN all the time. IDK if C# sets the denormals-are-zero flag in the MXCSR, or the flush-to-zero flag (which writes zeroes in the first place, instead of treating denormals as zero when read back).

I found an Intel article about SSE floating point controls, contrasting it with the x87 FPU control word. It doesn't have much to say about NaN, though. It ends with this:

> Conclusion
>
> To avoid serialization and performance issues due to denormals and underflow numbers, use the SSE and SSE2 instructions to set Flush-to-Zero and Denormals-Are-Zero modes within the hardware to enable highest performance for floating-point applications.

IDK if this helps any with divide-by-zero.

for vs. foreach

It might be interesting to test a loop body that is throughput-limited, rather than just being one single loop-carried dependency chain. As it is, all of the work depends on previous results; there's nothing for the CPU to do in parallel (other than bounds-check the next array load while the mul/div chain is running).
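
One way to make the body throughput-limited (a sketch of my own, not from the original code) is to split the work across independent accumulators, so consecutive iterations don't all feed one serial chain:

// Two independent dependency chains that the CPU can overlap.
// (A trailing element is skipped if the count is odd, which is fine for a benchmark.)
double result0 = 0, result1 = 0;
for (int i = 0; i + 1 < valueList.Count; i += 2)
{
    result0 += valueList[i];
    result1 += valueList[i + 1];
}
double result = result0 + result1;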

You might see more difference between the methods if the "real work" occupied more of the CPU's execution resources. Also, on pre-Sandybridge Intel, there's a big difference between a loop that fits in the 28-uop loop buffer and one that doesn't. You get instruction decode bottlenecks if it doesn't, especially when the average instruction length is longer (which happens with SSE). Instructions that decode to more than one uop also limit decoder throughput, unless they come in a pattern that's nice for the decoders (e.g. 2-1-1). So extra loop-overhead instructions can make the difference between a loop fitting in the 28-uop loop buffer or not, which is a big deal on Nehalem, and sometimes helpful on Sandybridge and later.

Solution 3 - C#

We have the observation that 99.9% of all the floating point operations will involve NaNs, which is at least highly unusual (first found by Peter Cordes). We have another experiment by usr, who found that removing the division instructions makes the time difference almost completely go away.

The fact, however, is that the NaNs are only generated because the very first division calculates 0.0 / 0.0, which gives the initial NaN. If the divisions are not performed, result will always be 0.0: we always calculate 0.0 * temp -> 0.0, 0.0 + temp -> temp, temp - temp -> 0.0. So removing the division did not only remove the division itself, it also removed the NaNs. I would expect that the NaNs are actually the problem, and that one implementation handles NaNs very slowly while the other one doesn't have the problem.

It would be worthwhile starting the loop at i = 1 and measuring again. The four operations result * temp, + temp, / temp, - temp effectively add (1 - temp) so we wouldn't have any unusual numbers (0, infinity, NaN) for most of the operations.
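
A sketch of that measurement, reusing valueList, iterationCount and dummy from the question (the list holds valueList[i] == i, so skipping index 0 skips the only zero divisor):

Benchmarker.Profile("valueList for, starting at index 1: ", iterationCount, () =>
{
    double result = 0;
    for (int i = 1; i < valueList.Count; i++)   // start at 1: temp is never 0.0
    {
        var temp = valueList[i];
        result *= temp;
        result += temp;
        result /= temp;   // ((result * temp) + temp) / temp == result + 1
        result -= temp;   // net effect of the four operations: result += 1 - temp
    }
    dummy = result;
});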

The only problem could be that the division always gives an integer result, and some division implementations have shortcuts when the correct result doesn't use many bits. For example, dividing 310.0 / 31.0 gives 10.0 as the first four bits with a remainder of 0.0, and some implementations can stop evaluating the remaining 50 or so bits while others can't. If there is a significant difference, then starting the loop with result = 1.0 / 3.0 would make a difference.

Solution 4 - C#

There may be several reasons why this is executing faster in 64bit on your machine. The reason I asked which CPU you were using was because when 64bit CPUs first made their appearance, AMD and Intel had different mechanisms to handle 64bit code.

Processor architecture:

Intel's CPU architecture was purely 64bit. In order to execute 32bit code, the 32bit instructions needed to be converted (inside the CPU) to 64bit instructions before execution.

AMD's CPU architecture was to build 64bit right on top of their 32bit architecture; that is, it was essentially a 32bit architecture with 64bit extensions - there was no code conversion process.

This was obviously a few years ago now, so I've no idea if/how the technology has changed, but essentially, you would expect 64bit code to perform better on a 64bit machine since the CPU is able to work with double the amount of bits per instruction.

.NET JIT

It's argued that .NET (and other managed languages like Java) are capable of outperforming languages like C++ because of the way the JIT compiler is able to optimize your code according to your processor architecture. In this respect, you might find that the JIT compiler is utilizing something in 64bit architecture that possibly wasn't available or required a workaround when executed in 32bit.

Note:

Rather than using DoubleWrapper, have you considered using Nullable<double> (or the shorthand syntax double?)? I'd be interested to see if that has any impact on your tests.
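
For reference, a sketch of what that substitution would look like in the question's setup (reusing size, iterationCount and dummy; double? is just shorthand for Nullable<double>):

// Nullable<double> in place of the DoubleWrapper reference type.
var nullableList = new List<double?>(size);
for (int i = 0; i < size; i++)
    nullableList.Add(i);

Benchmarker.Profile("nullableList for: ", iterationCount, () =>
{
    double result = 0;
    for (int i = 0; i < nullableList.Count; i++)
    {
        var temp = nullableList[i].Value;   // .Value unwraps the Nullable<double>
        result *= temp;
        result += temp;
        result /= temp;
        result -= temp;
    }
    dummy = result;
});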

Note 2: Some people seem to be conflating my comments about 64bit architecture with IA-64. Just to clarify, in my answer, 64bit refers to x86-64 and 32bit refers to x86-32. Nothing here referenced IA-64!

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Trauer (View Question on Stackoverflow)
Solution 1 - C#: usr (View Answer on Stackoverflow)
Solution 2 - C#: Peter Cordes (View Answer on Stackoverflow)
Solution 3 - C#: gnasher729 (View Answer on Stackoverflow)
Solution 4 - C#: Matthew Layton (View Answer on Stackoverflow)