Why is this C++ program so incredibly fast?

C++, Performance, Assembly, Compiler Construction

C++ Problem Overview


I wrote a little benchmark to compare the performance of different interpreters/compilers for Python, Ruby, JavaScript and C++. As expected, it turns out that (optimized) C++ beats the scripting languages, but the factor by which it does so is incredibly high.

The results are:

sven@jet:~/tmp/js$ time node bla.js              # * JavaScript with node *
0

real	0m1.222s
user	0m1.190s
sys	0m0.015s
sven@jet:~/tmp/js$ time ruby foo.rb              # * Ruby *
0

real	0m52.428s
user	0m52.395s
sys	0m0.028s
sven@jet:~/tmp/js$ time python blub.py           # * Python with CPython *
0

real	1m16.480s
user	1m16.371s
sys	0m0.080s

sven@jet:~/tmp/js$ time pypy blub.py             # * Python with PyPy *
0

real	0m4.707s
user	0m4.579s
sys	0m0.028s

sven@jet:~/tmp/js$ time ./cpp_non_optimized 1000 1000000 # * C++ with -O0 (gcc) *
0

real	0m1.702s
user	0m1.699s
sys	0m0.002s
sven@jet:~/tmp/js$ time ./cpp_optimized 1000 1000000     # * C++ with -O3 (gcc) *
0

real	0m0.003s # (!!!) <---------------------------------- WHY?
user	0m0.002s
sys	0m0.002s

I am wondering if anybody can provide an explanation of why the optimized C++ code is between roughly two and four orders of magnitude faster than everything else.

The C++ benchmark uses command-line parameters in order to prevent pre-computing the result at compile time.

Below is the source code for the different language benchmarks, which should be semantically equivalent. I have also included the assembly that gcc produces for the optimized C++ build. Looking at the optimized assembly, it seems that the compiler merged the two loops in the benchmark into a single one, but nevertheless, there IS still a loop!

JavaScript:

var s = 0;
var outer = 1000;
var inner = 1000000;

for (var i = 0; i < outer; ++i) {
	for (var j = 0; j < inner; ++j) {
		++s;
	}
	s -= inner;
}
console.log(s);

Python:

s = 0
outer = 1000
inner = 1000000

for _ in xrange(outer):
    for _ in xrange(inner):
        s += 1
    s -= inner
print s

Ruby:

s = 0
outer = 1000
inner = 1000000

outer_end = outer - 1
inner_end = inner - 1

for i in 0..outer_end
  for j in 0..inner_end
    s = s + 1
  end
  s = s - inner
end
puts s

C++:

#include <iostream>
#include <cstdlib>
#include <cstdint>

int main(int argc, char* argv[]) {
  uint32_t s = 0;
  uint32_t outer = atoi(argv[1]);
  uint32_t inner = atoi(argv[2]);
  for (uint32_t i = 0; i < outer; ++i) {
    for (uint32_t j = 0; j < inner; ++j)
      ++s;
    s -= inner;
  }
  std::cout << s << std::endl;
  return 0;
}

Assembly (when compiling the above C++ code with gcc -S -O3 -std=c++0x):

	.file	"bar.cpp"
	.section	.text.startup,"ax",@progbits
	.p2align 4,,15
	.globl	main
	.type	main, @function
main:
.LFB1266:
	.cfi_startproc
	pushq	%r12
	.cfi_def_cfa_offset 16
	.cfi_offset 12, -16
	movl	$10, %edx
	movq	%rsi, %r12
	pushq	%rbp
	.cfi_def_cfa_offset 24
	.cfi_offset 6, -24
	pushq	%rbx
	.cfi_def_cfa_offset 32
	.cfi_offset 3, -32
	movq	8(%rsi), %rdi
	xorl	%esi, %esi
	call	strtol
	movq	16(%r12), %rdi
	movq	%rax, %rbp
	xorl	%esi, %esi
	movl	$10, %edx
	call	strtol
	testl	%ebp, %ebp
	je	.L6
	movl	%ebp, %ebx
	xorl	%eax, %eax
	xorl	%edx, %edx
	.p2align 4,,10
	.p2align 3
.L3:                             # <--- Here is the loop
	addl	$1, %eax             # <---
	cmpl	%eax, %ebx           # <---
	ja	.L3                      # <---
.L2:
	movl	%edx, %esi
	movl	$_ZSt4cout, %edi
	call	_ZNSo9_M_insertImEERSoT_
	movq	%rax, %rdi
	call	_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_
	popq	%rbx
	.cfi_remember_state
	.cfi_def_cfa_offset 24
	popq	%rbp
	.cfi_def_cfa_offset 16
	xorl	%eax, %eax
	popq	%r12
	.cfi_def_cfa_offset 8
	ret
.L6:
	.cfi_restore_state
	xorl	%edx, %edx
	jmp	.L2
	.cfi_endproc
.LFE1266:
	.size	main, .-main
	.p2align 4,,15
	.type	_GLOBAL__sub_I_main, @function
_GLOBAL__sub_I_main:
.LFB1420:
	.cfi_startproc
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movl	$_ZStL8__ioinit, %edi
	call	_ZNSt8ios_base4InitC1Ev
	movl	$__dso_handle, %edx
	movl	$_ZStL8__ioinit, %esi
	movl	$_ZNSt8ios_base4InitD1Ev, %edi
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	jmp	__cxa_atexit
	.cfi_endproc
.LFE1420:
	.size	_GLOBAL__sub_I_main, .-_GLOBAL__sub_I_main
	.section	.init_array,"aw"
	.align 8
	.quad	_GLOBAL__sub_I_main
	.local	_ZStL8__ioinit
	.comm	_ZStL8__ioinit,1,1
	.hidden	__dso_handle
	.ident	"GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
	.section	.note.GNU-stack,"",@progbits

C++ Solutions


Solution 1 - C++

The optimizer has worked out that the inner loop along with the subsequent line is a no-op, and eliminated it. Unfortunately it hasn't managed to eliminate the outer loop as well.

Note that the node.js example is faster than the unoptimized C++ example, indicating that V8 (node's JIT compiler) has managed to eliminate at least one of the loops. However, its optimization has some overhead, as (like any JIT compiler) it must balance the opportunities for optimization and profile-guided re-optimization against the cost of doing so.
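
If the goal is to benchmark the loops themselves rather than the optimizer, one common workaround is to make the accumulator observable, for example by declaring it volatile, so the compiler cannot prove that the increments are dead. A minimal sketch (this of course changes what is being measured):

#include <iostream>
#include <cstdlib>
#include <cstdint>

int main(int argc, char* argv[]) {
  volatile uint32_t s = 0;     // volatile: every ++s is an observable access and cannot be removed
  uint32_t outer = atoi(argv[1]);
  uint32_t inner = atoi(argv[2]);
  for (uint32_t i = 0; i < outer; ++i) {
    for (uint32_t j = 0; j < inner; ++j)
      ++s;
    s -= inner;
  }
  std::cout << s << std::endl; // still prints 0, but the work now actually happens
  return 0;
}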

Solution 2 - C++

I didn't do a complete analysis of the assembly, but it looks like it unrolled the inner loop and figured out that, together with the subtraction of inner, it is a no-op.

The assembly seems to retain only the outer loop, which just increments a counter until outer is reached. The compiler could have optimized that away as well, but it seems it didn't.
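
Put differently, the optimized binary behaves roughly like the following hand-written reduction, a sketch of what the assembly above does rather than actual compiler output:

#include <iostream>
#include <cstdlib>
#include <cstdint>

int main(int argc, char* argv[]) {
  uint32_t outer = atoi(argv[1]);
  (void)atoi(argv[2]);         // argv[2] is still parsed, but its value is no longer needed
  for (uint32_t i = 0; i < outer; ++i)
    ;                          // empty loop: only the outer counter survives
  std::cout << 0 << std::endl; // s is known to be zero at this point
  return 0;
}

Note that even the empty loop in this sketch is something an optimizer may remove entirely, which is exactly the further step gcc apparently did not take here.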

Solution 3 - C++

> Is there a way to cache the JIT compiled code after it optimizes it, or does it have to re-optimize the code every time the program is run?

If I were writing this in Python, I'd try to reduce the code in size to get an "overhead" view of what it is doing. For example, try writing it like this (much easier to read, IMO):

s = 0
for _ in xrange(outer):
    innerS = sum(1 for _ in xrange(inner))
    s += innerS
    s -= innerS

or even s = sum(inner - inner for _ in xrange(outer))

Solution 4 - C++

for (uint32_t i = 0; i < outer; ++i) {
    for (uint32_t j = 0; j < inner; ++j)
        ++s;
    s -= inner;
}

The inner loop is equivalent to "s += inner; j = inner;", a transformation a good optimising compiler can perform. Since the variable j is not used after the loop, the whole code is equivalent to

for (uint32_t i = 0; i < outer; ++i) {
    s += inner;
    s -= inner;
}

Again, a good optimising compiler can remove the two cancelling changes to s, then remove the variable i, and there's nothing left whatsoever. Judging by the assembly in the question, that is essentially what happened here, except that the (now empty) outer loop itself survived.

Now it's up to you to decide how often an optimisation like this happens, and whether it is of any real-life benefit.
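
One place where this kind of cancellation does pay off in real code is after inlining, when paired operations that undo each other only become visible to the optimizer once small helper functions have been merged into their caller. A sketch of that situation (the helper names here are invented for illustration):

#include <iostream>
#include <cstdint>

// Hypothetical helpers: each one looks useful on its own.
static inline void add_offset(uint32_t& total, uint32_t amount)    { total += amount; }
static inline void remove_offset(uint32_t& total, uint32_t amount) { total -= amount; }

uint32_t process(uint32_t n) {
  uint32_t total = 0;
  for (uint32_t i = 0; i < n; ++i) {
    add_offset(total, 5);      // after inlining, these two calls cancel each other out...
    remove_offset(total, 5);   // ...so the loop body is dead and the whole loop can go
  }
  return total;                // a good optimizer reduces this function to "return 0;"
}

int main() {
  std::cout << process(1000000u) << std::endl;   // prints 0; with -O3 no loop is executed
  return 0;
}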

Solution 5 - C++

Even though the loops have plenty of iterations, the programs probably still aren't long-running enough to escape the overhead of interpreter/JVM/shell/etc. startup times. In some environments these can vary massively - in some cases (*cough* Java *cough*) taking several seconds before getting anywhere near your actual code.

Ideally you would time the execution within each piece of code. It may be tricky to do this accurately across all languages, but even printing out clock time in ticks before and after would be better than using time, and would do the job since you probably aren't concerned with super-accurate timing here.
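
As a sketch of that idea for the C++ version, the loop nest could be timed with std::chrono so that process startup and the final output are excluded. Note that with -O3 the measured time would still be close to zero here, because the loops themselves are optimized away:

#include <iostream>
#include <cstdlib>
#include <cstdint>
#include <chrono>

int main(int argc, char* argv[]) {
  uint32_t s = 0;
  uint32_t outer = atoi(argv[1]);
  uint32_t inner = atoi(argv[2]);

  auto start = std::chrono::steady_clock::now();  // start the clock just before the work
  for (uint32_t i = 0; i < outer; ++i) {
    for (uint32_t j = 0; j < inner; ++j)
      ++s;
    s -= inner;
  }
  auto stop = std::chrono::steady_clock::now();   // stop it right after the loops

  auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
  std::cout << s << std::endl;
  std::cout << "loop time: " << us << " us" << std::endl;
  return 0;
}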

(I guess this doesn't really relate to why the C++ example is so much faster - but it could explain some of the variability in the other results. :) ).

Attributions

All content on this page is sourced from the original question and its answers on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type     | Original Author | Original Content on Stackoverflow
Question         | Sven Hager      | View Question on Stackoverflow
Solution 1 - C++ | ecatmur         | View Answer on Stackoverflow
Solution 2 - C++ | wich            | View Answer on Stackoverflow
Solution 3 - C++ | aoeu256         | View Answer on Stackoverflow
Solution 4 - C++ | gnasher729      | View Answer on Stackoverflow
Solution 5 - C++ | Tom             | View Answer on Stackoverflow