Do current x86 architectures support non-temporal loads (from "normal" memory)?

C++ Problem Overview

I am aware of multiple questions on this topic; however, I haven't seen any clear answers or any benchmark measurements. I therefore created a simple program that works with two arrays of integers. The first array a is very large (64 MiB) and the second array b is small enough to fit into L1 cache. The program iterates over a and adds its elements to corresponding elements of b in a modular sense (when the end of b is reached, the program starts from its beginning again). The measured numbers of L1 cache misses for different sizes of b are as follows:

[![enter image description here][1]][1]

The measurements were made on a Xeon E5 2680v3 (Haswell) CPU with a 32 kiB L1 data cache. Therefore, in all cases, b fit into the L1 cache. However, the number of misses grew considerably at around 16 kiB of b memory footprint. This might be expected, since at this point the loads of both a and b start to cause eviction of cache lines from the beginning of b.
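The 16 kiB threshold can be reproduced with a back-of-the-envelope model. The sketch below is only an illustration: it assumes the Haswell L1D geometry of 8 ways and 64-byte lines (the question states only the 32 kiB capacity), true LRU replacement, and no hardware prefetching, none of which is guaranteed on the real CPU. The function name is made up for this example. Under these assumptions each way holds 4 kiB, so between two uses of the same line of b its set is touched by m lines of a and by the other m - 1 lines of b, where m is the size of b in 4 kiB units. The line survives only if those 2m - 1 lines fit into the remaining 7 ways.

```cpp
#include <cassert>
#include <cstdint>

// Assumed L1D geometry (Haswell): 32 KiB capacity, 8 ways, 64-byte lines.
constexpr std::uint64_t kCacheBytes = 32 * 1024;
constexpr std::uint64_t kWays = 8;
constexpr std::uint64_t kWayBytes = kCacheBytes / kWays;  // 4 KiB per way

// Hypothetical helper: under true LRU, is a line of b still cached
// when the loop wraps around and touches it again?
bool b_line_survives_lru(std::uint64_t b_bytes) {
    assert(b_bytes > 0);
    // Lines of b that map to one set (rounded up for sizes that are
    // not a multiple of the way size, which makes this an upper bound).
    const std::uint64_t m = (b_bytes + kWayBytes - 1) / kWayBytes;
    // Between two uses of one b line, the same set sees m lines of a
    // and the other m - 1 lines of b.
    const std::uint64_t intervening = 2 * m - 1;
    // The b line is evicted once more than kWays - 1 other lines intervene.
    return intervening <= kWays - 1;
}
```

With these numbers, `b_line_survives_lru(16 * 1024)` holds (7 intervening lines) while `b_line_survives_lru(20 * 1024)` does not (9 intervening lines), which matches the point where the measured misses start to grow.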

There is absolutely no reason to keep elements of a in cache; they are used only once. I therefore ran a program variant with non-temporal loads of a data, but the number of misses did not change. I also ran a variant with non-temporal prefetching of a data, but got the very same results.

My benchmark code is as follows (variant w/o non-temporal prefetching shown):

#include <cstdint>
#include <cstdlib>      // posix_memalign, atol, free
#include <iostream>
#include <immintrin.h>  // AVX2 intrinsics
#include <papi.h>       // PAPI_L1_DCM and the PAPI counter API

int main(int argc, char* argv[])
{
   if (argc < 2) {
      std::cerr << "usage: " << argv[0] << " <size of b in kiB>" << std::endl;
      return 1;
   }

   // a: 64 MiB, streamed through exactly once
   uint64_t* a;
   const uint64_t a_bytes = 64 * 1024 * 1024;
   const uint64_t a_count = a_bytes / sizeof(uint64_t);
   posix_memalign((void**)(&a), 64, a_bytes);

   // b: size given in kiB on the command line, meant to fit into L1
   uint64_t* b;
   const uint64_t b_bytes = atol(argv[1]) * 1024;
   const uint64_t b_count = b_bytes / sizeof(uint64_t);
   posix_memalign((void**)(&b), 64, b_bytes);

   // initialize a with non-temporal stores
   __m256i ones = _mm256_set1_epi64x(1UL);
   for (uint64_t i = 0; i < a_count; i += 4)
       _mm256_stream_si256((__m256i*)(a + i), ones);

   // load b into L1 cache
   for (uint64_t i = 0; i < b_count; i++)
       b[i] = 0;

   int papi_events[1] = { PAPI_L1_DCM };
   long long papi_values[1];
   PAPI_start_counters(papi_events, 1);

   uint64_t* a_ptr = a;
   const uint64_t* a_ptr_end = a + a_count;
   uint64_t* b_ptr = b;
   const uint64_t* b_ptr_end = b + b_count;

   // measured loop: b[i mod b_count] += a[i], four 64-bit elements at a time
   while (a_ptr < a_ptr_end) {
#ifndef NTLOAD
      __m256i aa = _mm256_load_si256((__m256i*)a_ptr);
#else
      __m256i aa = _mm256_stream_load_si256((__m256i*)a_ptr);
#endif
      __m256i bb = _mm256_load_si256((__m256i*)b_ptr);
      bb = _mm256_add_epi64(aa, bb);
      _mm256_store_si256((__m256i*)b_ptr, bb);

      a_ptr += 4;
      b_ptr += 4;
      if (b_ptr >= b_ptr_end)
         b_ptr = b;  // wrap around to the beginning of b
   }

   PAPI_stop_counters(papi_values, 1);
   std::cout << "L1 cache misses: " << papi_values[0] << std::endl;

   free(a);
   free(b);
}

What I wonder is whether CPU vendors support, or are going to support, non-temporal loads / prefetching or any other way to label some data as not to be kept in cache (e.g., to tag them as LRU). There are situations, e.g., in HPC, where similar scenarios are common in practice. For example, in sparse iterative linear solvers / eigensolvers, matrix data are usually very large (larger than cache capacities), but vectors are sometimes small enough to fit into L3 or even L2 cache. Then, we would like to keep them there at all costs. Unfortunately, loading matrix data can evict cache lines of the vectors, especially of the x-vector, even though in each solver iteration matrix elements are used only once and there is no reason to keep them in cache after they have been processed.

UPDATE

I just did a similar experiment on an Intel Xeon Phi KNC, measuring runtime instead of L1 misses (I haven't found a way to measure them reliably; PAPI and VTune gave weird metrics). The results are here:

[![enter image description here][2]][2]

The orange curve represents ordinary loads and it has the expected shape. The blue curve represents loads with the so-called eviction hint (EH) set in the instruction prefix, and the gray curve represents a case where each cache line of a was manually evicted; both of these tricks, which KNC makes possible, clearly worked as intended for b larger than 16 kiB. The code of the measured loop is as follows:

while (a_ptr < a_ptr_end) {
#ifdef NTLOAD
   __m512i aa = _mm512_extload_epi64((__m512i*)a_ptr,
      _MM_UPCONV_EPI64_NONE, _MM_BROADCAST64_NONE, _MM_HINT_NT);
#else
   __m512i aa = _mm512_load_epi64((__m512i*)a_ptr);
#endif
   __m512i bb = _mm512_load_epi64((__m512i*)b_ptr);
   bb = _mm512_or_epi64(aa, bb);
   _mm512_store_epi64((__m512i*)b_ptr, bb);

#ifdef EVICT
   _mm_clevict(a_ptr, _MM_HINT_T0);
#endif

   a_ptr += 8;
   b_ptr += 8;
   if (b_ptr >= b_ptr_end)
       b_ptr = b;
}

UPDATE 2

On Xeon Phi, icpc generated the following prefetch of a_ptr for the normal-load variant (orange curve):

400e93:       62 d1 78 08 18 4c 24    vprefetch0 [r12+0x80]

When I manually (by hex-editing the executable) modified this to:

400e93:       62 d1 78 08 18 44 24    vprefetchnta [r12+0x80]

I got the desired results, even better than the blue/gray curves. However, I was not able to force the compiler to generate non-temporal prefetching for me, even by using #pragma prefetch a_ptr:_MM_HINT_NTA before the loop :(

[1]: https://i.stack.imgur.com/LgpsB.png
[2]: https://i.stack.imgur.com/CU2dY.png

C++ Solutions

Solution 1 - C++

To answer specifically the headline question:

Yes, recent¹ mainstream Intel CPUs support non-temporal loads on normal² memory - but only "indirectly" via non-temporal prefetch instructions, rather than directly using non-temporal load instructions like movntdqa. This is in contrast to non-temporal stores where you can just use the corresponding non-temporal store instructions³ directly.

The basic idea is that you issue a prefetchnta to the cache line before any normal loads, and then issue loads as normal. If the line wasn't already in the cache, it will be loaded in a non-temporal fashion. The exact meaning of "non-temporal fashion" depends on the architecture, but the general pattern is that the line is loaded into at least the L1 and perhaps some higher cache levels. Indeed, for a prefetch to be of any use it needs to cause the line to load into at least some cache level for consumption by a later load. The line may also be treated specially in the cache, for example by flagging it as high priority for eviction or by restricting the ways in which it can be placed.
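As a rough illustration of that pattern, here is a minimal sketch of the question's kernel with a prefetchnta issued ahead of the ordinary loads of a. The function name and the default prefetch distance of 8 cache lines are assumptions made for this example, not values taken from the question or from any tuning; the right distance depends on the hardware and has to be measured. Whether the prefetch reduces pollution at all depends on the microarchitecture.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // _mm_prefetch, _MM_HINT_NTA

// Hypothetical helper: b[i mod b_count] += a[i], with a non-temporal
// prefetch issued `prefetch_lines` cache lines ahead of the loads of a.
void accumulate_with_nta_prefetch(const std::uint64_t* a, std::size_t a_count,
                                  std::uint64_t* b, std::size_t b_count,
                                  std::size_t prefetch_lines = 8) {
    assert(b_count > 0);
    constexpr std::size_t kPerLine = 64 / sizeof(std::uint64_t);  // 8 elements per line
    std::size_t j = 0;
    for (std::size_t i = 0; i < a_count; ++i) {
        if (i % kPerLine == 0) {
            // One prefetch per cache line of a; stay inside the array.
            const std::size_t ahead = i + prefetch_lines * kPerLine;
            if (ahead < a_count)
                _mm_prefetch(reinterpret_cast<const char*>(a + ahead), _MM_HINT_NTA);
        }
        b[j] += a[i];  // ordinary load of a; the NTA hint was given earlier
        if (++j == b_count)
            j = 0;     // wrap around to the beginning of b
    }
}
```

The loads themselves stay plain loads; only the earlier prefetch carries the non-temporal hint, which is the "indirect" support described above.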

The upshot of all this is that while non-temporal loads are supported in a sense, they are really only partly non-temporal, unlike stores where you really leave no trace of the line in any of the cache levels. Non-temporal loads will cause some cache pollution, but generally less than regular loads. The exact details are architecture specific, and I've included some details below for modern Intel. You can find a slightly longer writeup in this answer to the question "Non-temporal loads and the hardware prefetcher, do they work together?".

Skylake Client

Based on the tests in this answer, it seems that the behavior of prefetchnta on Skylake is to fetch normally into the L1 cache, to skip the L2 entirely, and to fetch in a limited way into the L3 cache (probably into only 1 or 2 ways, so the total amount of L3 available to nta prefetches is limited).

This was tested on Skylake client, but I believe this basic behavior probably extends backwards to Sandy Bridge and earlier (based on wording in the Intel optimization guide), and also forwards to Kaby Lake and later architectures based on Skylake client. So unless you are using a Skylake-SP or Skylake-X part, or an extremely old CPU, this is probably the behavior you can expect from prefetchnta.

Skylake Server

The only recent Intel chip known to have different behavior is Skylake server (used in Skylake-X, Skylake-SP and a few other lines). This has a considerably changed L2 and L3 architecture, and the L3 is no longer inclusive of the much larger L2. For this chip, it seems that prefetchnta skips both the L2 and L3 caches, so on this architecture cache pollution is limited to the L1.

This behavior was reported by user Mysticial in a comment. The downside, as pointed out in those comments, is that this makes prefetchnta much more brittle: if you get the prefetch distance or timing wrong (especially easy when hyperthreading is involved and the sibling logical core is active), and the data gets evicted from L1 before you use it, you go all the way back to main memory rather than to the L3 as on earlier architectures.

¹ Recent here probably means anything in the last decade or so, but I don't mean to imply that earlier hardware didn't support non-temporal prefetch: it's possible that support goes right back to the introduction of prefetchnta but I don't have the hardware to check that and can't find an existing reliable source of information on it.

² Normal here just means WB (writeback) memory, which is the memory type you are dealing with at the application level the overwhelming majority of the time.

³ Specifically, the NT store instructions are movnti for general purpose registers and the movntd* and movntp* families for SIMD registers.
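For comparison with the load side, the store instructions from footnote 3 are directly reachable from C++ through intrinsics. The sketch below uses `_mm_stream_si128`, which maps to movntdq, to fill a buffer; the function name is made up for this example, and the 16-byte alignment and even element count are requirements of this sketch rather than of the technique in general.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_set1_epi64x, _mm_sfence

// Hypothetical helper: fill dst with `value` using non-temporal stores.
// dst must be 16-byte aligned and count must be even.
void fill_streaming(std::uint64_t* dst, std::size_t count, std::uint64_t value) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(count % 2 == 0);
    const __m128i v = _mm_set1_epi64x(static_cast<long long>(value));
    for (std::size_t i = 0; i < count; i += 2)
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);  // movntdq
    // Order the weakly-ordered NT stores before any later stores.
    _mm_sfence();
}
```

The question's benchmark uses the 256-bit equivalent, `_mm256_stream_si256`, to initialize a in the same way.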

Solution 2 - C++

I am answering my own question since I found the following post on the Intel Developer Forum, which makes sense to me. It was written by John McCalpin:

> The results for the mainstream processors are not surprising -- in the absence of true "scratchpad" memory, it is not clear that it is possible to design an implementation of "non-temporal" behavior that is not subject to nasty surprises. Two approaches that have been used in the past are (1) loading the cache line, but marking it LRU instead of MRU, and (2) loading the cache line into one specific "set" of the set-associative cache. In either case it is relatively easy to generate situations in which the cache drops the data before the processor completes reading it.
>
> Both of these approaches risk performance degradation in cases operating on more than a small number of arrays, and are made much more difficult to implement without "gotchas" when HyperThreading is considered.
>
> In other contexts I have argued for the implementation of "load multiple" instructions that would guarantee that the entire contents of a cache line would be copied to registers atomically. My reasoning is that the hardware absolutely guarantees that the cache line is moved atomically and that the time required to copy the remainder of the cache line to registers was so small (an extra 1-3 cycles, depending on the processor generation) that it could be safely implemented as an atomic operation.
>
> Starting with Haswell, the core can read 64 Bytes in a single cycle (2 256-bit aligned AVX reads), so the exposure to unintended side effects becomes even lower.
>
> Starting with KNL, full-cache-line (aligned) loads should be "naturally" atomic, since the transfers from the L1 Data Cache to the core are full cache lines and all of the data is placed into the target AVX-512 register. (This does not mean that Intel guarantees atomicity in the implementation! We don't have visibility into the horrible corner cases that the designers have to account for, but it is reasonable to conclude that most of the time aligned 512-bit loads will occur atomically.) With this "natural" 64-Byte atomicity, some of the tricks used in the past for reducing cache pollution due to "non-temporal" loads may deserve another look....

> The MOVNTDQA instruction is intended primarily for reading from address ranges that are mapped as "Write-Combining" (WC), and not for reading from normal system memory that is mapped "Write-Back" (WB). The description in Volume 2 of the SWDM says that an implementation "may" do something special with MOVNTDQA for WB regions, but the emphasis is on the behavior for the WC memory type.
>
> The "Write-Combining" memory type is almost never used for "real" memory --- it is used almost exclusively for Memory-Mapped IO regions.

See here for the whole post: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/597075
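Approach (1) from the quoted post, loading a line but marking it LRU instead of MRU, can be explored with a toy cache model. The sketch below is purely illustrative and is not a description of how any Intel CPU implements non-temporal behavior: it assumes a 32 kiB, 8-way, 64-byte-line cache with true LRU replacement and no prefetching, and all names in it are made up. It replays the question's access pattern at cache-line granularity and counts misses on b only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kLineBytes = 64, kSets = 64, kWays = 8;  // assumed geometry

struct Way {
    std::uint64_t line = UINT64_MAX;  // no line cached yet
    std::uint64_t last_use = 0;       // 0 means "oldest possible"
};

struct ToyCache {
    std::vector<Way> ways = std::vector<Way>(kSets * kWays);
    std::uint64_t tick = 0;

    // Returns true on a miss. With insert_as_lru, a missing line is installed
    // as the next eviction victim instead of as the most recently used line.
    bool access(std::uint64_t addr, bool insert_as_lru = false) {
        const std::uint64_t line = addr / kLineBytes;
        Way* set = &ways[(line % kSets) * kWays];
        Way* victim = set;
        ++tick;
        for (std::uint64_t w = 0; w < kWays; ++w) {
            if (set[w].line == line) {  // hit: refresh the LRU timestamp
                set[w].last_use = tick;
                return false;
            }
            if (set[w].last_use < victim->last_use)
                victim = &set[w];       // remember the least recently used way
        }
        victim->line = line;
        victim->last_use = insert_as_lru ? 0 : tick;
        return true;
    }
};

// Streams a_bytes of a while cycling over b_bytes of b, one cache line per
// step, and returns the number of misses on b.
std::uint64_t b_misses(std::uint64_t a_bytes, std::uint64_t b_bytes, bool nt_loads_of_a) {
    assert(b_bytes > 0 && b_bytes % kLineBytes == 0);
    ToyCache cache;
    const std::uint64_t a_base = std::uint64_t{1} << 32;  // arbitrary, disjoint from b
    for (std::uint64_t off = 0; off < b_bytes; off += kLineBytes)
        cache.access(off);  // warm b, as the benchmark does
    std::uint64_t misses = 0, b_off = 0;
    for (std::uint64_t a_off = 0; a_off < a_bytes; a_off += kLineBytes) {
        cache.access(a_base + a_off, nt_loads_of_a);
        misses += cache.access(b_off) ? 1 : 0;
        b_off = (b_off + kLineBytes) % b_bytes;
    }
    return misses;
}
```

In this model, a 24 kiB b misses on nearly every access when lines of a are inserted normally and not at all when they are inserted as LRU, e.g. comparing `b_misses(1 << 20, 24 * 1024, false)` with `b_misses(1 << 20, 24 * 1024, true)`. The model has exactly one stream and one thread, so it cannot show the failure modes the quote warns about, such as data being dropped before it is read when several arrays or hyperthreads compete for the same set.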

Content Type	Original Author	Original Content on Stackoverflow
Question	Daniel Langr	View Question on Stackoverflow
Solution 1 - C++	BeeOnRope	View Answer on Stackoverflow
Solution 2 - C++	Daniel Langr	View Answer on Stackoverflow