Which ordering of nested loops for iterating over a 2D array is more efficient

Tags: C, Performance, For Loop, CPU Cache

C Problem Overview


Which of the following orderings of nested loops to iterate over a 2D array is more efficient in terms of time (cache performance)? Why?

int a[100][100];
int i, j;

for(i=0; i<100; i++)
{
   for(j=0; j<100; j++)
   {
       a[i][j] = 10;
   }
}

or

for(i=0; i<100; i++)
{
   for(j=0; j<100; j++)
   {
      a[j][i] = 10;    
   }
}

C Solutions


Solution 1 - C

The first method is slightly better, as the cells being assigned to lie next to each other in memory.

First method:

[ ][ ][ ][ ][ ] ....
^1st assignment
   ^2nd assignment
[ ][ ][ ][ ][ ] ....
^101st assignment

Second method:

[ ][ ][ ][ ][ ] ....
^1st assignment
   ^101st assignment
[ ][ ][ ][ ][ ] ....
^2nd assignment

Solution 2 - C

  1. For array[100][100] they are both the same, if the L1 cache is larger than 100 * 100 * sizeof(int) == 10000 * sizeof(int) == [usually] 40000 bytes. Note that on Sandy Bridge, 100 * 100 integers should be enough elements to see a difference, since the L1 cache is only 32k.

  2. Compilers will probably optimize this code all the same.

  3. Assuming no compiler optimizations, and that the matrix does not fit in the L1 cache, the first code is better due to cache performance [usually]. Every time an element is not found in cache, you get a cache miss and need to go to the RAM or L2 cache [which are much slower]. Bringing elements from RAM into the cache [a cache fill] is done in blocks [usually 8/16 bytes], so in the first code you get a miss rate of at most 1/4 [assuming 16-byte cache blocks and 4-byte ints, each miss brings in 4 ints at once, so at most one miss per four accesses], while in the second code it is unbounded and can even be 1. In the second snippet, elements that were already in cache [inserted during the cache fill for adjacent elements] are evicted before they are used, and you get a redundant cache miss.

    • This is closely related to the principle of locality, which is the general assumption used when implementing the cache system. The first code follows this principle while the second doesn't, so the cache performance of the first will be better than that of the second.

Conclusion: For all cache implementations I am aware of, the first will be no worse than the second. They might perform the same if there is no cache at all, if the whole array fits completely in the cache, or due to compiler optimization.

Solution 3 - C

This sort of micro-optimization is platform-dependent so you'll need to profile the code in order to be able to draw a reasonable conclusion.
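For instance, a minimal benchmark along these lines can compare both orders on your machine (a sketch, not from the original answer: the array size N, the use of clock(), and the function name fill_seconds are illustrative choices):

#include <stdio.h>
#include <time.h>

#define N 1000

static int a[N][N];

/* Time one full pass over the array; column_major selects the a[j][i] order. */
static double fill_seconds(int column_major)
{
    clock_t start = clock();
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            if (column_major)
                a[j][i] = 10;
            else
                a[i][j] = 10;
        }
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("row-major:    %f s\n", fill_seconds(0));
    printf("column-major: %f s\n", fill_seconds(1));
    return 0;
}

Run it several times and at different optimization levels; as other answers note, an aggressive optimizer may interchange or vectorize the loops and shrink the difference.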

Solution 4 - C

In your second snippet, the change in j in each iteration produces a pattern with low spatial locality. Remember that behind the scenes, an array reference a[y][x] computes the element's flat offset as:

(y * row_width) + x

where row_width is the number of elements per row (100 here).
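To make that concrete, here is a small sketch (not from the original answer; the assert and the pointer cast are just to demonstrate the offset formula) showing that a[y][x] in a contiguous int a[100][100] sits at flat offset y * 100 + x:

#include <assert.h>

int a[100][100];

int main(void)
{
    int y = 3, x = 7;
    /* &a[y][x] and the flat offset y * 100 + x name the same element */
    assert(&a[y][x] == (int *)a + (y * 100 + x));
    return 0;
}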

Consider a simplified L1 cache that has enough space for only 50 rows of our array. For the first 50 iterations, you will pay the unavoidable cost of 50 cache misses, but then what happens? For each iteration from 50 to 99, you will still miss the cache and have to fetch from L2 (and/or RAM, etc.). Then x changes to 1 and y starts over, leading to another cache miss because the first row of your array has been evicted from the cache, and so forth.

The first snippet does not have this problem. It accesses the array in row-major order, which achieves better locality: you only have to pay for a cache miss at most once per row (if a row of your array is not present in the cache at the time the loop starts).

That being said, this is a very architecture-dependent question, so you would have to take into consideration the specifics (L1 cache size, cache line size, etc.) to draw a conclusion. You should also measure both ways and keep track of hardware events to have concrete data to draw conclusions from.
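On Linux, for example, the relevant hardware events can be counted with perf (cache-references and cache-misses are standard perf event names; ./loops is a placeholder for your compiled binary):

perf stat -e cache-references,cache-misses ./loops

Comparing the cache-misses counts for the two loop orders gives the kind of concrete data this answer calls for.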

Solution 5 - C

Since C/C++ stores arrays in row-major order, I believe the first method is going to be a bit faster. In memory, a 2D array is laid out as a single-dimensional array, and performance depends on whether you access it in row-major or column-major order.
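For example, a small int a[2][3] is laid out in memory as:

a[0][0] a[0][1] a[0][2] a[1][0] a[1][1] a[1][2]

so stepping the last index moves to the adjacent address, while stepping the first index jumps a whole row ahead.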

Solution 6 - C

This is a classic cache-locality problem.

Most of the time the first one is better, but I think the exact answer is: IT DEPENDS; a different architecture may give a different result.

Solution 7 - C

In the second method there are more cache misses, because the cache stores contiguous data; hence the first method is more efficient than the second.

Solution 8 - C

In your case (filling the whole array with one value), this will be faster:

   int *p = (int *)a;    /* flat view of the contiguous 100x100 block */
   for(j = 0; j < 100 * 100; j++){
      p[j] = 10;
   }

and you can still treat a as a two-dimensional array.

EDIT: As Binyamin Sharet pointed out, this only works if a is contiguous in memory, i.e. declared as int a[100][100]. It would not work if a were a jagged pointer-to-pointer array allocated row by row, as in:

int **a = new int*[100];
for(int i = 0; i < 100; i++){
    a[i] = new int[100];
}

since each of the 100 rows is a separate allocation.

Solution 9 - C

In general, better locality (noted by most responders) is only the first performance advantage of loop #1.

The second (but related) advantage is that for loops like #1, the compiler is normally capable of efficiently auto-vectorizing the code, thanks to the stride-1 memory access pattern (stride-1 means that consecutive iterations access contiguous array elements one by one). On the contrary, for loops like #2, auto-vectorization will not normally work well, because there is no consecutive stride-1 access to contiguous blocks in memory.

Well, my answer is general. For very simple loops exactly like #1 or #2, even more aggressive compiler optimizations could be applied (washing out any difference), and the compiler will normally also be able to auto-vectorize #2 with stride-1 for the outer loop (especially with #pragma simd or similar).
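As a quick way to check whether a given loop was vectorized, GCC can print its vectorization decisions (the file name loops.c is a placeholder):

gcc -O3 -fopt-info-vec-optimized -c loops.c

Clang offers a similar report with -Rpass=loop-vectorize.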

Solution 10 - C

The first option is better because a[i] can be stored in a temporary variable in the outer loop and then indexed with j in the inner loop; in this sense the row pointer acts like a cached value.
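A minimal sketch of that idea (the variable name row is illustrative, not from the original answer):

int a[100][100];
int i, j;

for(i = 0; i < 100; i++)
{
   int *row = a[i];    /* look up the row once, outside the inner loop */
   for(j = 0; j < 100; j++)
   {
      row[j] = 10;
   }
}

Most optimizing compilers perform this hoisting themselves, but it makes the access pattern explicit.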

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Sachin Mhetre (View Question on Stackoverflow)
Solution 1 - C: MByD (View Answer on Stackoverflow)
Solution 2 - C: amit (View Answer on Stackoverflow)
Solution 3 - C: Luchian Grigore (View Answer on Stackoverflow)
Solution 4 - C: Michael Foukarakis (View Answer on Stackoverflow)
Solution 5 - C: Habib (View Answer on Stackoverflow)
Solution 6 - C: llj098 (View Answer on Stackoverflow)
Solution 7 - C: Parag (View Answer on Stackoverflow)
Solution 8 - C: IProblemFactory (View Answer on Stackoverflow)
Solution 9 - C: zam (View Answer on Stackoverflow)
Solution 10 - C: Himanshu Goel (View Answer on Stackoverflow)