Using AVX CPU instructions: Poor performance without "/arch:AVX"

Tags: C++, Performance, Visual Studio 2010, SSE, AVX

C++ Problem Overview


My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX.

To use AVX, it is necessary to include this:

#include "immintrin.h"

and then you can use AVX intrinsic functions like _mm256_mul_ps, _mm256_add_ps, etc. The problem is that, by default, VS2010 produces code that runs very slowly and shows the warning:

> warning C4752: found Intel(R) Advanced Vector Extensions; consider using /arch:AVX

It seems VS2010 actually does not use AVX instructions, but instead emulates them. I added /arch:AVX to the compiler options and got good results. But this option tells the compiler to use AVX instructions everywhere when possible. So my code may crash on a CPU that does not support AVX!

So the question is how to make the VS2010 compiler produce AVX code, but only where I specify AVX intrinsics directly. For SSE this works: I just use SSE intrinsic functions and it produces SSE code without any compiler option like /arch:SSE. But for AVX this does not work for some reason.
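
For reference, the AVX code path looks roughly like this (an illustrative sketch, not the original code):

#include <immintrin.h>
#include <cstddef>

// Illustrative AVX code path (assume n is a multiple of 8): compiled without
// /arch:AVX, VS2010 emits warning C4752 for the 256-bit intrinsics below.
void mul_add_avx(const float* a, const float* b, float* out, size_t n)
{
    for (size_t i = 0; i < n; i += 8)
    {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_mul_ps(va, vb), va));
    }
}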

C++ Solutions


Solution 1 - C++

2021 update: Modern versions of MSVC don't need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did.


The behavior that you are seeing is the result of expensive state-switching.

See page 102 of Agner Fog's manual:

http://www.agner.org/optimize/microarchitecture.pdf

Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high penalty of roughly 70 clock cycles.

When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)

Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.

It sounds like you're trying to make multiple code paths: one for SSE, and one for AVX. For this, I suggest you separate your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and make a dispatcher that chooses based on what hardware it's running on.
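
A minimal sketch of such a dispatcher, assuming the two implementations (multiply_sse and multiply_avx are hypothetical names) live in separate source files and only the AVX one is built with /arch:AVX:

#include <intrin.h>      // __cpuid
#include <immintrin.h>   // _xgetbv, _XCR_XFEATURE_ENABLED_MASK
#include <cstddef>

// Implemented in separate .cpp files; only the AVX file is built with /arch:AVX.
void multiply_sse(const float* a, const float* b, float* out, size_t n);
void multiply_avx(const float* a, const float* b, float* out, size_t n);

static bool cpu_has_avx()
{
    int info[4];
    __cpuid(info, 1);
    const bool osxsave = (info[2] & (1 << 27)) != 0;  // OS exposes XGETBV
    const bool avx     = (info[2] & (1 << 28)) != 0;  // CPU supports AVX
    if (!osxsave || !avx)
        return false;
    // The OS must also save/restore the full YMM state (XCR0 bits 1 and 2 set).
    unsigned long long xcr0 = _xgetbv(_XCR_XFEATURE_ENABLED_MASK);
    return (xcr0 & 0x6) == 0x6;
}

void multiply(const float* a, const float* b, float* out, size_t n)
{
    static const bool use_avx = cpu_has_avx();   // detected once
    if (use_avx)
        multiply_avx(a, b, out, n);
    else
        multiply_sse(a, b, out, n);
}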

If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
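
For example, a minimal sketch of that pattern within one function (assume n is a multiple of 8; the scale/offset computation is purely illustrative):

#include <immintrin.h>
#include <cstddef>

void scale_then_offset(float* data, size_t n, float scale, float offset)
{
    // AVX part: 256-bit multiplies.
    __m256 vscale = _mm256_set1_ps(scale);
    for (size_t i = 0; i < n; i += 8)
        _mm256_storeu_ps(data + i, _mm256_mul_ps(_mm256_loadu_ps(data + i), vscale));

    _mm256_zeroupper();   // clear the upper YMM halves before legacy-SSE code runs

    // SSE part: 128-bit adds (legacy-encoded when this file is built without /arch:AVX).
    __m128 voffset = _mm_set1_ps(offset);
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(data + i, _mm_add_ps(_mm_loadu_ps(data + i), voffset));
}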

Solution 2 - C++

tl;dr for old versions of MSVC only

Use _mm256_zeroupper(); or _mm256_zeroall(); around sections of code using AVX (before or after depending on function arguments). Only use option /arch:AVX for source files with AVX rather than for an entire project to avoid breaking support for legacy-encoded SSE-only code paths.

In modern MSVC (and the other mainstream compilers, GCC/clang/ICC), the compiler knows when to use a vzeroupper asm instruction. Forcing extra vzerouppers with intrinsics can hurt performance when inlining. See https://stackoverflow.com/questions/68736527/do-i-need-to-use-mm256-zeroupper-in-2021

Cause

I think the best explanation is in the Intel article, "Avoiding AVX-SSE Transition Penalties" (PDF). The abstract states:

> Transitioning between 256-bit Intel® AVX instructions and legacy Intel® SSE instructions within a program may cause performance penalties because the hardware must save and restore the upper 128 bits of the YMM registers.

Separating your AVX and SSE code into different compilation units may NOT help if you switch between calling code from both SSE-enabled and AVX-enabled object files, because the transition may occur when AVX instructions or assembly are mixed with any of (from the Intel paper):

  • 128-bit intrinsic instructions
  • SSE inline assembly
  • C/C++ floating point code that is compiled to Intel® SSE
  • Calls to functions or libraries that include any of the above

This means there may even be penalties when linking with external code using SSE.

Details

There are 3 processor states defined by the AVX instructions, and one of the states is where all of the YMM registers are split, allowing the lower half to be used by SSE instructions. The Intel document "Intel® AVX State Transitions: Migrating SSE Code to AVX" provides a diagram of these states:

(Diagram from the Intel document: the three AVX/SSE processor states and the transitions between them.)

When in state B (AVX-256 mode), all bits of the YMM registers are in use. When an SSE instruction is called, a transition to state C must occur, and this is where there is a penalty. The upper half of all YMM registers must be saved into an internal buffer before SSE can start, even if they happen to be zeros. The cost of the transitions is on the "order of 50-80 clock cycles on Sandy Bridge hardware". There is also a penalty going from C -> A, as diagrammed in Figure 2.

You can also find details about the state switching penalty causing this slowdown on page 130, Section 9.12, "Transitions between VEX and non-VEX modes" in Agner Fog's optimization guide (of version updated 2014-08-07), referenced in Mystical's answer. According to his guide, any transition to/from this state takes "about 70 clock cycles on Sandy Bridge". Just as the Intel document states, this is an avoidable transition penalty.

Skylake has a different dirty-upper mechanism that causes false dependencies for legacy-SSE with dirty uppers, rather than one-time penalties. https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake

Resolution

To avoid the transition penalties you can either remove all legacy SSE code, instruct the compiler to convert all SSE instructions to their VEX-encoded 128-bit form (if the compiler is capable), or put the YMM registers in a known zero state before transitioning between AVX and SSE code. Essentially, to maintain the separate SSE code path, you must zero out the upper 128 bits of all 16 YMM registers (by issuing a VZEROUPPER instruction) after any code that uses AVX instructions. Zeroing these bits manually forces a transition to state A, and avoids the expensive penalty since the YMM values do not need to be stored in an internal buffer by the hardware. The intrinsic that performs this instruction is _mm256_zeroupper. The description for this intrinsic is very informative:

> This intrinsic is useful to clear the upper bits of the YMM registers when transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions. There is no transition penalty if an application clears the upper bits of all YMM registers (sets to ‘0’) via VZEROUPPER, the corresponding instruction for this intrinsic, before transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions.

In Visual Studio 2010+ (maybe even older), you get this intrinsic with immintrin.h.

Note that zeroing out the bits with other methods does not eliminate the penalty - the VZEROUPPER or VZEROALL instructions must be used.

One automatic solution implemented by the Intel Compiler is to insert a VZEROUPPER at the beginning of each function containing Intel AVX code if none of the arguments are a YMM register or __m256/__m256d/__m256i datatype, and at the end of functions if the returned value is not a YMM register or __m256/__m256d/__m256i datatype.
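
Written out by hand, that convention looks roughly like this (a sketch; sum_avx is a hypothetical function whose arguments and return value are not YMM types, built with /arch:AVX):

#include <immintrin.h>
#include <cstddef>

// No __m256 argument or return value, so the upper YMM state is cleared on
// entry and again before returning, mirroring the convention described above.
float sum_avx(const float* data, size_t n)    // assume n is a multiple of 8
{
    _mm256_zeroupper();                       // entry: start from a clean upper state

    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));

    // Horizontal sum of the 8 accumulator lanes.
    __m128 half = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    half = _mm_add_ps(half, _mm_movehl_ps(half, half));
    half = _mm_add_ss(half, _mm_shuffle_ps(half, half, 1));
    float result = _mm_cvtss_f32(half);

    _mm256_zeroupper();                       // exit: leave no dirty upper halves behind
    return result;
}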

In the wild

This VZEROUPPER solution is used by FFTW to generate a library with both SSE and AVX support. See simd-avx.h:

/* Use VZEROUPPER to avoid the penalty of switching from AVX to SSE.
   See Intel Optimization Manual (April 2011, version 248966), Section
   11.3 */
#define VLEAVE _mm256_zeroupper

Then VLEAVE(); is called at the end of every function using intrinsics for AVX instructions.
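
So a function in FFTW's AVX code path ends roughly like this (an illustrative sketch, not actual FFTW source):

static void avx_kernel(float* out, const float* in)
{
     __m256 v = _mm256_loadu_ps(in);
     _mm256_storeu_ps(out, _mm256_add_ps(v, v));
     VLEAVE();   /* expands to _mm256_zeroupper(), clearing the upper YMM halves */
}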

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Mike | View Question on Stackoverflow
Solution 1 - C++ | Mysticial | View Answer on Stackoverflow
Solution 2 - C++ | chappjc | View Answer on Stackoverflow