C++ cache-aware programming

Tags: C++, Optimization, Caching, CPU Cache

C++ Problem Overview


Is there a way in C++ to determine the CPU's cache size? I have an algorithm that processes a lot of data, and I'd like to break this data down into chunks such that they fit into the cache. Is this possible? Can you give me any other hints on programming with cache size in mind (especially in regard to multithreaded/multicore data processing)?

Thanks!

C++ Solutions


Solution 1 - C++

According to "What Every Programmer Should Know About Memory" by Ulrich Drepper, you can do the following on Linux:

> Once we have a formula for the memory requirement we can compare it with the cache size. As mentioned before, the cache might be shared with multiple other cores. Currently {There definitely will sometime soon be a better way!} the only way to get correct information without hardcoding knowledge is through the /sys filesystem. In Table 5.2 we have seen what the kernel publishes about the hardware. A program has to find the directory:
>
> /sys/devices/system/cpu/cpu*/cache

This is listed in Section 6: What Programmers Can Do.

He also describes a short test, right under Figure 6.5, which can be used to determine the L1d cache size if you can't get it from the OS.

There is one more thing I ran across in his paper: on Linux, sysconf(_SC_LEVEL2_CACHE_SIZE) is supposed to return the L2 cache size, although it doesn't seem to be well documented.

Solution 2 - C++

C++ itself doesn't "care" about CPU caches, so there's no support for querying cache sizes built into the language. If you are developing for Windows, there's the GetLogicalProcessorInformation() function, which can be used to query information about the CPU caches.

Solution 3 - C++

Preallocate a large array, then access each element sequentially and record the time for each access. Ideally there will be a jump in access time when a cache miss occurs; from the array size at which that happens you can estimate the size of your L1 cache. It might not work, but it's worth trying.

Solution 4 - C++

Read the CPUID of the CPU (x86) and then determine the cache size from a lookup table. The table has to be filled with the cache sizes that the CPU manufacturer publishes in its programming manuals.

Solution 5 - C++

Depending on what you're trying to do, you might also leave it to some library. Since you mention multicore processing, you might want to have a look at Intel Threading Building Blocks.

TBB includes cache-aware memory allocators. More specifically, check tbb::cache_aligned_allocator (it's in the reference documentation; I couldn't find any direct link).

Solution 6 - C++

Interestingly enough, I wrote a program to do this a while ago (in C, though, but I'm sure it will be easy to incorporate into C++ code).

http://github.com/wowus/CacheLineDetection/blob/master/Cache%20Line%20Detection/cache.c

The get_cache_line function is the interesting one: it returns the location right before the biggest spike in the timing data of the array accesses. It guessed correctly on my machine! If nothing else, it can help you make your own.

It's based on this article, which originally piqued my interest: http://igoro.com/archive/gallery-of-processor-cache-effects/

Solution 7 - C++

You can see this thread: http://software.intel.com/en-us/forums/topic/296674

The short answer is in this other thread (http://software.intel.com/en-us/forums/topic/292721):

> On modern IA-32 hardware, the cache line size is 64. The value 128 is a legacy of the Intel Netburst Microarchitecture (e.g. Intel Pentium D) where 64-byte lines are paired into 128-byte sectors. When a line in a sector is fetched, the hardware automatically fetches the other line in the sector too. So from a false sharing perspective, the effective line size is 128 bytes on the Netburst processors.

Solution 8 - C++

IIRC, GCC has a __builtin_prefetch hint.

http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html

has an excellent section on this. Basically, it suggests:

__builtin_prefetch (&array[i + LookAhead], rw, locality);

where rw is 0 (prepare for a read) or 1 (prepare for a write), and locality is a value from 0 to 3, where 0 means no temporal locality and 3 means very strong temporal locality.

Both are optional. LookAhead would be the number of elements to look ahead. If a memory access takes 100 cycles and the unrolled loop iterations are two cycles apart, LookAhead could be set to 50 or 51.

Solution 9 - C++

There are two cases that need to be distinguished. Do you need to know the cache sizes at compile time or at runtime?

Determining the cache-size at compile-time

For some applications, you know the exact architecture that your code will run on, for example, if you can compile the code directly on the host machine. In that case, simply looking up the size and hard-coding it is an option (this could be automated in the build system). On most machines today, the L1 cache line should be 64 bytes.

If you want to avoid that complexity, or if you need to support compilation on unknown architectures, you can use the C++17 feature std::hardware_constructive_interference_size as a good fallback. It provides a compile-time estimate of the cache line size, but be aware of its limitations: the compiler cannot guess perfectly when it creates the binary, since the actual cache line size is, in general, architecture dependent.

Determining the cache-size at runtime

At runtime, you have the advantage that you know the exact machine, but you will need platform-specific code to read the information from the OS. A good starting point is the code snippet from this answer, which supports the major platforms (Windows, Linux, macOS). In a similar fashion, you can also read the L2 cache size at runtime.

I would advise against trying to guess the cache line by running benchmarks at startup and measuring which one performed best. It might well work, but it is also error-prone if the CPU is used by other processes.

Combining both approaches

If you have to ship one binary, and the machines that it will later run on feature a range of different architectures with varying cache sizes, you could create specialized code parts for each cache size and then dynamically (at application startup) choose the best-fitting one.

Solution 10 - C++

The cache will usually do the right thing. The only real worry for a normal programmer is false sharing, and you can't take care of that at runtime because it requires compiler directives.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
|---|---|---|
| Question | Mat | View Question on Stackoverflow |
| Solution 1 - C++ | Robert S. Barnes | View Answer on Stackoverflow |
| Solution 2 - C++ | kusma | View Answer on Stackoverflow |
| Solution 3 - C++ | ben | View Answer on Stackoverflow |
| Solution 4 - C++ | Tobias Langner | View Answer on Stackoverflow |
| Solution 5 - C++ | Stéphane Bonniez | View Answer on Stackoverflow |
| Solution 6 - C++ | Clark Gaebel | View Answer on Stackoverflow |
| Solution 7 - C++ | Daniel Munoz | View Answer on Stackoverflow |
| Solution 8 - C++ | Max | View Answer on Stackoverflow |
| Solution 9 - C++ | Philipp Claßen | View Answer on Stackoverflow |
| Solution 10 - C++ | Charles Eli Cheese | View Answer on Stackoverflow |