Does using xor reg, reg give an advantage over mov reg, 0?

Assembly, X86, Micro Optimization

Assembly Problem Overview


There are two well-known ways to set an integer register to zero on x86.

Either

mov reg, 0

or

xor reg, reg

There's an opinion that the second variant is better since the value 0 is not stored in the code and that saves several bytes of produced machine code. This is definitely good - less instruction cache is used and this can sometimes allow for faster code execution. Many compilers produce such code.

However, there's formally an inter-instruction dependency between the xor instruction and whatever earlier instruction changes the same register. Since there's a dependency, the xor has to wait until that earlier instruction completes, which could leave the processor's execution units underused and hurt performance.

add reg, 17
;do something else with reg here
xor reg, reg

It's obvious that the result of xor will be exactly the same regardless of the initial register value. But is the processor able to recognize this?

I tried the following test in VC++7:

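#include <windows.h>
#include <tchar.h>

// 10 * 1000 * 1000 * 1000 does not fit in a 32-bit int, so use one billion iterations.
const int Count = 1000 * 1000 * 1000;
int _tmain(int argc, _TCHAR* argv[])
{
    int i;
    DWORD start = GetTickCount();
    for( i = 0; i < Count ; i++ ) {
        __asm {
            mov eax, 10
            xor eax, eax
        };
    }
    DWORD xorTime = GetTickCount() - start; // milliseconds for the xor variant
    start = GetTickCount();
    for( i = 0; i < Count ; i++ ) {
        __asm {
            mov eax, 10
            mov eax, 0
        };
    }
    DWORD movTime = GetTickCount() - start; // milliseconds for the mov variant
    return 0;
}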

With optimizations off, both loops take exactly the same time. Does this reasonably prove that the processor recognizes that the xor reg, reg instruction has no dependency on the earlier mov eax, 10 instruction? What could be a better test to check this?

Assembly Solutions


Solution 1 - Assembly

An actual answer for you:

Intel 64 and IA-32 Architectures Optimization Reference Manual

Section 3.5.1.7 is where you want to look.

In short there are situations where an xor or a mov may be preferred. The issues center around dependency chains and preservation of condition codes.

> In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instruction to clear register content to zero.

> In contexts where the condition codes must be preserved, move 0 into the register instead.
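The condition-code point can be made concrete with a small sketch. It assumes a GCC or Clang compiler with GNU-style inline asm on x86 (not the VC++ `__asm` syntax used in the question), and the helper names `belowAfterMov` and `belowAfterXor` are made up for this illustration. Each helper compares two values, zeroes a scratch register, and only then reads the carry flag left by the compare. On other platforms the fallback merely emulates the behaviour described above.

```cpp
#include <cstdint>

// Hypothetical helper names, for illustration only.
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))

// Zero the scratch register with mov: the flags set by cmp survive.
inline unsigned belowAfterMov(std::uint32_t a, std::uint32_t b) {
    std::uint32_t scratch;
    unsigned char below;
    __asm__("cmpl %3, %2\n\t"   // AT&T operand order: sets CF when a < b (unsigned)
            "movl $0, %0\n\t"   // zero scratch; mov leaves the flags alone
            "setb %1"           // below = CF
            : "=r"(scratch), "=q"(below)
            : "r"(a), "r"(b)
            : "cc");
    return below;
}

// Zero the scratch register with xor: the flags from cmp are overwritten.
inline unsigned belowAfterXor(std::uint32_t a, std::uint32_t b) {
    std::uint32_t scratch;
    unsigned char below;
    __asm__("cmpl %3, %2\n\t"   // same compare as above
            "xorl %0, %0\n\t"   // zero scratch; xor rewrites the flags and clears CF
            "setb %1"           // below = CF, which xor has just cleared
            : "=r"(scratch), "=q"(below)
            : "r"(a), "r"(b)
            : "cc");
    return below;
}

#else

// Assumption: no GNU-style x86 inline asm available, so emulate the outcome.
inline unsigned belowAfterMov(std::uint32_t a, std::uint32_t b) { return a < b ? 1u : 0u; }
inline unsigned belowAfterXor(std::uint32_t, std::uint32_t) { return 0u; }

#endif
```

With the mov version the compare result is still visible after the zeroing; with the xor version it is lost, which is the situation where the manual suggests mov instead.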

Solution 2 - Assembly

On modern CPUs the XOR pattern is preferred. It is smaller, and faster.

Smaller actually does matter because on many real workloads one of the main factors limiting performance is i-cache misses. This wouldn't be captured in a micro-benchmark comparing the two options, but in the real world it will make code run slightly faster.

And, ignoring the reduced i-cache misses, XOR on any CPU in the last many years is the same speed or faster than MOV. What could be faster than executing a MOV instruction? Not executing any instruction at all! On recent Intel processors the dispatch/rename logic recognizes the XOR pattern, 'realizes' that the result will be zero, and just points the register at a physical zero-register. It then throws away the instruction because there is no need to execute it.

The net result is that the XOR pattern uses zero execution resources and can, on recent Intel CPUs, 'execute' four instructions per cycle. MOV tops out at three instructions per cycle.

For details see this blog post that I wrote:

https://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/

Most programmers shouldn't be worrying about this, but compiler writers do have to worry, and it's good to understand the code that is being generated, and it's just frickin' cool!

Solution 3 - Assembly

x86 has variable-length instructions. In 32-bit code, MOV EAX, 0 takes five bytes (opcode B8 plus a 32-bit immediate), while XOR EAX, EAX takes two, so the MOV form costs three more bytes of code space.
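To make the size difference tangible, the 32-bit encodings can be written out by hand. This is only an illustrative sketch: the array names are invented here, and the byte for xor is the `31 C0` form (some assemblers emit the equivalent `33 C0` instead).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical names, for illustration only.
// mov eax, 0   ->  opcode B8 followed by a 32-bit immediate (all zero bytes here)
constexpr std::array<std::uint8_t, 5> kMovEaxZero = {0xB8, 0x00, 0x00, 0x00, 0x00};
// xor eax, eax ->  opcode 31 plus a ModRM byte C0 selecting eax, eax
constexpr std::array<std::uint8_t, 2> kXorEaxEax = {0x31, 0xC0};

// Number of code bytes saved by choosing the xor form over the mov form.
constexpr std::size_t bytesSaved() {
    return kMovEaxZero.size() - kXorEaxEax.size();
}

static_assert(bytesSaved() == 3, "xor eax, eax is three bytes shorter than mov eax, 0");
```

The saving comes entirely from the immediate: the mov form has to carry four zero bytes in the instruction stream, while the xor form carries none.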

Solution 4 - Assembly

I stopped being able to fix my own cars after I sold my 1966 HR station wagon. I'm in a similar fix with modern CPUs :-)

It really will depend on the underlying microcode or circuitry. It's quite possible that the CPU could recognise "XOR Rn,Rn" and simply zero all bits without worrying about the contents. But of course, it may do the same thing with a "MOV Rn, 0". A good compiler will choose the best variant for the target platform anyway so this is usually only an issue if you're coding in assembler.

If the CPU is smart enough, your XOR dependency disappears since it knows the value is irrelevant and will set it to zero anyway (again this depends on the actual CPU being used).

However, I'm long past caring about a few bytes or a few clock cycles in my code - this seems like micro-optimisation gone mad.

Solution 5 - Assembly

I think on earlier architectures the mov eax, 0 instruction used to take a little longer than the xor eax, eax as well... cannot recall exactly why. Unless you have many more movs however I would imagine you're not likely to cause cache misses due to that one literal stored in the code.

Also note that from memory the status of the flags is not identical between these methods, but I may be misremembering this.

Solution 6 - Assembly

Are you writing a compiler?

And on a second note, your benchmarking probably won't work, since you have a branch in there that probably takes all the time anyway. (unless your compiler unrolls the loop for you)

Another reason that you can't benchmark a single instruction in a loop is that all your code will be cached (unlike real code). So you have taken much of the size difference between mov eax,0 and xor eax,eax out of the picture by having it L1-cached the whole time.

My guess is that any measurable performance difference in the real world would be due to the size difference eating up the cache, and not due to execution time of the two options.
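If the goal is to probe the dependency question rather than loop overhead, one possible approach is to put a slow dependency chain in front of the zeroing instruction, so that the loop branch no longer dominates. The sketch below is an assumption-laden illustration, not a calibrated benchmark: it uses GNU-style inline asm (GCC or Clang on x86) instead of VC++ `__asm`, and the names `ZeroingTiming`, `timeXorZeroing`, `timeMovZeroing` and `ZEROING_CHAIN4` are invented here. Each iteration runs eight dependent multiplies on a register and then zeroes that same register.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical names throughout; this is a sketch, not a calibrated benchmark.
struct ZeroingTiming {
    std::uint32_t finalValue;  // register value after the last iteration
    double nanoseconds;        // wall-clock time for the whole loop
};

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))

// Four multiplies, each depending on the result of the previous one.
#define ZEROING_CHAIN4 "imull %0, %0\n\t imull %0, %0\n\t imull %0, %0\n\t imull %0, %0\n\t"

inline ZeroingTiming timeXorZeroing(std::uint64_t iterations) {
    std::uint32_t x = 3;
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // Eight dependent multiplies, then zero the same register with xor.
        __asm__ volatile(ZEROING_CHAIN4 ZEROING_CHAIN4 "xorl %0, %0" : "+r"(x) : : "cc");
    }
    auto stop = std::chrono::steady_clock::now();
    return {x, std::chrono::duration<double, std::nano>(stop - start).count()};
}

inline ZeroingTiming timeMovZeroing(std::uint64_t iterations) {
    std::uint32_t x = 3;
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // Eight dependent multiplies, then zero the same register with mov.
        __asm__ volatile(ZEROING_CHAIN4 ZEROING_CHAIN4 "movl $0, %0" : "+r"(x) : : "cc");
    }
    auto stop = std::chrono::steady_clock::now();
    return {x, std::chrono::duration<double, std::nano>(stop - start).count()};
}

#else

// Assumption: no GNU-style x86 inline asm; these stand-ins only keep the sketch compiling.
inline ZeroingTiming timeXorZeroing(std::uint64_t iterations) {
    auto start = std::chrono::steady_clock::now();
    volatile std::uint32_t x = 3;
    for (std::uint64_t i = 0; i < iterations; ++i) { x = 0; }
    auto stop = std::chrono::steady_clock::now();
    return {x, std::chrono::duration<double, std::nano>(stop - start).count()};
}
inline ZeroingTiming timeMovZeroing(std::uint64_t iterations) {
    return timeXorZeroing(iterations);
}

#endif
```

The reasoning is that mov never reads its destination, so successive iterations of the mov loop can overlap. If xor really had to wait for the multiply chain, the xor loop should come out clearly slower; similar times for a large iteration count would suggest the CPU treats the xor as independent. Results will still vary with compiler flags and the CPU in use, and the instruction-cache effect discussed above stays invisible in a loop this small.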

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | sharptooth | View Question on Stackoverflow
Solution 1 - Assembly | Mark | View Answer on Stackoverflow
Solution 2 - Assembly | Bruce Dawson | View Answer on Stackoverflow
Solution 3 - Assembly | ajs410 | View Answer on Stackoverflow
Solution 4 - Assembly | paxdiablo | View Answer on Stackoverflow
Solution 5 - Assembly | jerryjvl | View Answer on Stackoverflow
Solution 6 - Assembly | Thomas | View Answer on Stackoverflow