Is it better to avoid using the mod operator when possible?

Tags: C, Performance, Optimization, Modulo

C Problem Overview


I assume that calculating the modulus of a number is a somewhat expensive operation, at least compared to simple arithmetic tests (such as seeing if a number exceeds the length of an array). If this is indeed the case, is it more efficient to replace, for example, the following code:

res = array[(i + 1) % len];

with the following? :

res = array[(i + 1 == len) ? 0 : i + 1];

The first one is easier on the eyes, but I wonder if the second might be more efficient. If so, might I expect an optimizing compiler to replace the first snippet with the second when a compiled language is used?

Of course, this "optimization" (if it is indeed an optimization) doesn't work in all cases (in this case, it only works if i+1 is never more than len).

C Solutions


Solution 1 - C

My general advice is as follows. Use whichever version you think is easier on the eye, and then profile your entire system. Only optimize those parts of the code that the profiler flags up as bottlenecks. I'll bet my bottom dollar that the modulo operator isn't going to be among them.

As far as the specific example goes, only benchmarking can tell which is faster on your specific architecture using your specific compiler. You are potentially replacing modulo with branching, and it's anything but obvious which would be faster.

Solution 2 - C

Some simple measurement:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int test = atoi(argv[1]);        /* 0 = modulo version, 1 = comparison version */
    int divisor = atoi(argv[2]);
    int iterations = atoi(argv[3]);

    int a = 0;

    if (test == 0) {
        for (int i = 0; i < iterations; i++)
            a = (a + 1) % divisor;
    } else if (test == 1) {
        for (int i = 0; i < iterations; i++)
            a = a + 1 == divisor ? 0 : a + 1;
    }

    printf("%d\n", a);
}

Compiling with either gcc or clang at -O3, and running time ./a.out 0 42 1000000000 (modulo version) or time ./a.out 1 42 1000000000 (comparison version) results in:

  • 6.25 seconds user runtime for the modulo version,
  • 1.03 seconds for the comparison version.

(using gcc 5.2.1 or clang 3.6.2; Intel Core i5-4690K @ 3.50GHz; 64-bit Linux)

This means that it is probably a good idea to use the comparison version.

Solution 3 - C

Well, have a look at 2 ways to get the next value of a "modulo 3" cyclic counter.

int next1(int n) {
    return (n + 1) % 3;
}

int next2(int n) {
    return n == 2 ? 0 : n + 1;
}

I compiled it with gcc and the -O3 option (for the common x64 architecture), and -S to get the assembly code.

The code for the first function does some seemingly inexplicable magic (*) to avoid a division, using a multiplication instead:

addl	$1, %edi
movl	$1431655766, %edx
movl	%edi, %eax
imull	%edx
movl	%edi, %eax
sarl	$31, %eax
subl	%eax, %edx
leal	(%rdx,%rdx,2), %eax
subl	%eax, %edi
movl	%edi, %eax
ret

It is much longer (and I bet slower) than the code for the second function:

leal	1(%rdi), %eax
cmpl	$2, %edi
movl	$0, %edx
cmove	%edx, %eax
ret

So it is not always true that "the (modern) compiler does a better job than you anyway".

Interestingly, the same experiment with 4 instead of 3 leads to AND-masking for the first function:

addl	$1, %edi
movl	%edi, %edx
sarl	$31, %edx
shrl	$30, %edx
leal	(%rdi,%rdx), %eax
andl	$3, %eax
subl	%edx, %eax
ret

but it is still, by and large, inferior to the second version.

Being more explicit about the proper way to do things

int next3(int n) {
    return (n + 1) & 3;
}

yields much better results:

leal	1(%rdi), %eax
andl	$3, %eax
ret

(*) Well, it's not that complicated: it is multiplication by the reciprocal. Compute the integer constant K = (2^N)/3 for some large enough value of N. Now, when you want the value of X/3, instead of dividing by 3, compute X*K and shift the result N positions to the right.
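
For the unsigned case the trick can be written out directly in C. The following is only an illustrative sketch, assuming 32-bit unsigned values: the constant 0xAAAAAAAB is ceil(2^33 / 3), and the signed sequence GCC emits above additionally needs the sign correction performed by the sarl/subl pair.

#include <stdio.h>
#include <stdint.h>

/* Sketch of division/modulo by 3 via multiplication by the reciprocal:
   for every uint32_t x, x / 3 == (x * 0xAAAAAAAB) >> 33, where the
   multiplication is carried out in 64 bits. */
static uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

static uint32_t mod3(uint32_t x)
{
    return x - 3u * div3(x);   /* remainder recovered from the quotient */
}

int main(void)
{
    for (uint32_t x = 0; x < 10; x++)
        printf("%u %% 3 = %u\n", (unsigned)x, (unsigned)mod3(x));
    return 0;
}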

Solution 4 - C

Here is an additional benchmark. Note that I also added a branchless version:

#include <iostream>
#include <array>
#include <algorithm>
#include <random>
#include <chrono>
using namespace std::chrono;

constexpr size_t iter = 1e8;

int main() {
  std::minstd_rand rnd_engine{1234};
  std::uniform_int_distribution<int> dist {-1000, 1000};
  auto gen = [&]() { return dist(rnd_engine); };

  std::array<int, 10> a;
  std::generate( a.begin(), a.end(), gen);

  for (size_t size = 2; size < 10; size++) {
    std::cout << "Modulus size = " << size << '\n';
  
    {
      std::cout << "operator%  ";
      long sum = 0;
      size_t x = 0;
      auto start = high_resolution_clock::now();
      for (size_t i = 0; i < iter; ++i) {
        sum += a[x];
        x = (x + 1) % size;
      }
      auto stop = high_resolution_clock::now();
      std::cout << duration_cast<microseconds>(stop - start).count()*0.001
                << "ms\t(sum = " << sum << ")\n";
    }
  
    {
      std::cout << "ternary    ";
      long sum = 0;
      size_t x = 0;
      auto start = high_resolution_clock::now();
      for (size_t i = 0; i < iter; ++i) {
        sum += a[x];
        x = ((x + 1) == size) ? 0 : x + 1;
      }
      auto stop = high_resolution_clock::now();
      std::cout << duration_cast<microseconds>(stop - start).count()*0.001
                << "ms\t(sum = " << sum << ")\n";
    }
    
    {
      std::cout << "branchless ";
      long sum = 0;
      size_t x = 1;
      auto start = high_resolution_clock::now();
      for (size_t i = 0; i < iter; ++i) {
        sum += a[x-1];
        x = ( x != size ) * x + 1;
      }
      auto stop = high_resolution_clock::now();
      std::cout << duration_cast<microseconds>(stop - start).count()*0.001
                << "ms\t(sum = " << sum << ")\n";
    }

  }
  return 0;
}

And here is the output on my i7-4870HQ

$ g++ -Ofast test.cpp && ./a.out
Modulus size = 2
operator%  904.249ms	(sum = -4200000000)
ternary    137.04ms	    (sum = -4200000000)
branchless 169.182ms	(sum = -4200000000)
Modulus size = 3
operator%  914.911ms	(sum = -31533333963)
ternary    113.384ms	(sum = -31533333963)
branchless 167.614ms	(sum = -31533333963)
Modulus size = 4
operator%  877.3ms   	(sum = -36250000000)
ternary    97.265ms	    (sum = -36250000000)
branchless 167.215ms	(sum = -36250000000)
Modulus size = 5
operator%  891.295ms	(sum = -30700000000)
ternary    88.562ms	    (sum = -30700000000)
branchless 167.087ms	(sum = -30700000000)
Modulus size = 6
operator%  903.644ms	(sum = -39683333196)
ternary    83.433ms	    (sum = -39683333196)
branchless 167.778ms	(sum = -39683333196)
Modulus size = 7
operator%  908.096ms	(sum = -34585713678)
ternary    79.703ms	    (sum = -34585713678)
branchless 166.849ms	(sum = -34585713678)
Modulus size = 8
operator%  869ms	    (sum = -39212500000)
ternary    76.972ms	    (sum = -39212500000)
branchless 167.29ms	    (sum = -39212500000)
Modulus size = 9
operator%  875.003ms	(sum = -36500000580)
ternary    75.011ms	    (sum = -36500000580)
branchless 172.356ms	(sum = -36500000580)

In this particular case the ternary operator looks far superior, and it becomes even more so once the branch predictor ramps up. Note however that this is a very particular case: if we were incrementing the index by a non-constant value, using the more general operator% would be straightforward, while the other two methods could become very intricate.

I would like to stress this very underrated comment:

> if len is a compile-time constant a recent GCC compiler (with -O2) is usually doing clever things, often avoiding the modulus machine instruction of the target processor. (Basile Starynkevitch)

For instance, removing the loop over the size variable and declaring it as const size_t size = 4; gives:

g++ -Ofast test.cpp && ./a.out
Modulus size = 4
operator%  62.103ms 	(sum = -36250000000)
ternary    93.674ms 	(sum = -36250000000)
branchless 166.774ms	(sum = -36250000000)

Conclusions

The execution time of the branchless version is pretty stable across the various scenarios. The ternary is consistently better than the branchless version over the considered cases, especially when the branch predictor kicks in. Finally, operator%, while more general and significantly slower, has a chance to be optimized into the fastest version, as in the case of particular constant values of the right-hand side.

Of course, this is completely platform dependent; who knows how this will behave on an Arduino :)

Solution 5 - C

If 'len' in your code is big enough, then the conditional will be faster, as the branch predictors will nearly always guess correctly.

If not, then I believe this is closely connected to circular queues, where it is often the case that the length is a power of 2. This will enable the compiler to substitute modulo with a simple AND.

The code is the following:

#include <stdio.h>
#include <stdlib.h>

#define modulo   /* comment this line out to test the conditional version */

int main()
{
	int iterations = 1000000000;
	int size = 16;   /* set to 15 or 16 to reproduce the timings below */
	int a[size];
	unsigned long long res = 0;
	int i, j;

	for (i=0;i<size;i++)
		a[i] = i;

	for (i=0,j=0;i<iterations;i++)
	{
		j++;
		#ifdef modulo
			j %= size;
		#else
			if (j >= size)
				j = 0;
		#endif
		res += a[j];
	}

	printf("%llu\n", res);
}

size=15:

  • modulo: 4.868 s
  • cond: 1.291 s

size=16:

  • modulo: 1.067 s
  • cond: 1.599 s

Compiled with gcc 7.3.0 and -O3 optimization. The machine is an i7 920.

Solution 6 - C

I read an article on making a fast hash map. A bottleneck can be the modulus operator used to find the hash bucket. The suggestion was to make the number of buckets a power of 2: apparently, taking a modulus by a power of two amounts to just looking at the last n bits.
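
As an illustrative sketch (the names and bucket count here are made up, not taken from any particular hash map implementation), with a power-of-two bucket count the bucket index can be computed with a bit mask instead of a modulus:

#include <stdint.h>

#define NUM_BUCKETS 64u   /* must be a power of two for the mask trick to work */

/* Both functions return the same bucket index when NUM_BUCKETS is a power
   of two: the mask simply keeps the low log2(NUM_BUCKETS) bits of the hash. */
uint32_t bucket_mod(uint32_t hash)  { return hash % NUM_BUCKETS; }
uint32_t bucket_mask(uint32_t hash) { return hash & (NUM_BUCKETS - 1u); }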

Solution 7 - C

The modulo operator is expensive, but division is expensive too, so converting your code from the modulo operator to a division is not going to optimize it.

  (i + 1) % len

To optimize the above code, use a comparison instead:

if ((i + 1) == len) {
    return 0;
} else {
    return i + 1;
}

Solution 8 - C

Modulo can be done with a single processor instruction on most architectures (e.g., DIV on x86). However, it's likely a premature optimization for what you need.
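
As a small side note, when both the quotient and the remainder of the same division are needed, compilers typically reuse a single division instruction for the pair. The helper below is just an illustrative sketch of that, not code from the original answer:

/* Illustrative only: at -O1 and above, gcc and clang usually compute this
   quotient/remainder pair with one division instruction (e.g. one IDIV on x86). */
void divmod(int x, int d, int *quot, int *rem)
{
    *quot = x / d;
    *rem  = x % d;
}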

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type     Original Author    Original Content on Stackoverflow
Question         limp_chimp         View Question on Stackoverflow
Solution 1 - C   NPE                View Answer on Stackoverflow
Solution 2 - C   dpi                View Answer on Stackoverflow
Solution 3 - C   Michel Billaud     View Answer on Stackoverflow
Solution 4 - C   DarioP             View Answer on Stackoverflow
Solution 5 - C   J. Doe             View Answer on Stackoverflow
Solution 6 - C   steviekm3          View Answer on Stackoverflow
Solution 7 - C   Yilmaz             View Answer on Stackoverflow
Solution 8 - C   seand              View Answer on Stackoverflow