How fast is D compared to C++?

C++ Problem Overview

I like some features of D, but would be interested if they come with a runtime penalty?

To compare, I implemented a simple program that computes scalar products of many short vectors both in C++ and in D. The result is surprising:

D: 18.9 s [see below for final runtime]
C++: 3.8 s

Is C++ really almost five times as fast or did I make a mistake in the D program?

I compiled C++ with g++ -O3 (gcc-snapshot 2011-02-19) and D with dmd -O (dmd 2.052) on a moderate recent linux desktop. The results are reproducible over several runs and standard deviations negligible.

Here the C++ program:

#include <iostream>
#include <random>
#include <chrono>
#include <string>

#include <vector>
#include <array>

typedef std::chrono::duration<long, std::ratio<1, 1000>> millisecs;
template <typename _T>
long time_since(std::chrono::time_point<_T>& time) {
      long tm = std::chrono::duration_cast<millisecs>( std::chrono::system_clock::now() - time).count();
  time = std::chrono::system_clock::now();
  return tm;
}

const long N = 20000;
const int size = 10;

typedef int value_type;
typedef long long result_type;
typedef std::vector<value_type> vector_t;
typedef typename vector_t::size_type size_type;

inline value_type scalar_product(const vector_t& x, const vector_t& y) {
  value_type res = 0;
  size_type siz = x.size();
  for (size_type i = 0; i < siz; ++i)
    res += x[i] * y[i];
  return res;
}

int main() {
  auto tm_before = std::chrono::system_clock::now();

  // 1. allocate and fill randomly many short vectors
  vector_t* xs = new vector_t [N];
  for (int i = 0; i < N; ++i) {
    xs[i] = vector_t(size);
      }
  std::cerr << "allocation: " << time_since(tm_before) << " ms" << std::endl;
  
  std::mt19937 rnd_engine;
  std::uniform_int_distribution<value_type> runif_gen(-1000, 1000);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < size; ++j)
      xs[i][j] = runif_gen(rnd_engine);
  std::cerr << "random generation: " << time_since(tm_before) << " ms" << std::endl;

  // 2. compute all pairwise scalar products:
  time_since(tm_before);
  result_type avg = 0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) 
      avg += scalar_product(xs[i], xs[j]);
  avg = avg / N*N;
  auto time = time_since(tm_before);
  std::cout << "result: " << avg << std::endl;
  std::cout << "time: " << time << " ms" << std::endl;
}

And here the D version:

import std.stdio;
import std.datetime;
import std.random;

const long N = 20000;
const int size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_t;
alias uint size_type;

value_type scalar_product(const ref vector_t x, const ref vector_t y) {
  value_type res = 0;
  size_type siz = x.length;
  for (size_type i = 0; i < siz; ++i)
    res += x[i] * y[i];
  return res;
}

int main() {   
  auto tm_before = Clock.currTime();

  // 1. allocate and fill randomly many short vectors
  vector_t[] xs;
  xs.length = N;
  for (int i = 0; i < N; ++i) {
    xs[i].length = size;
  }
  writefln("allocation: %i ", (Clock.currTime() - tm_before));
  tm_before = Clock.currTime();

  for (int i = 0; i < N; ++i)
    for (int j = 0; j < size; ++j)
      xs[i][j] = uniform(-1000, 1000);
  writefln("random: %i ", (Clock.currTime() - tm_before));
  tm_before = Clock.currTime();

  // 2. compute all pairwise scalar products:
  result_type avg = cast(result_type) 0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) 
      avg += scalar_product(xs[i], xs[j]);
  avg = avg / N*N;
  writefln("result: %d", avg);
  auto time = Clock.currTime() - tm_before;
  writefln("scalar products: %i ", time);

  return 0;
}

C++ Solutions

Solution 1 - C++

To enable all optimizations and disable all safety checks, compile your D program with the following DMD flags:

-O -inline -release -noboundscheck

EDIT: I've tried your programs with g++, dmd and gdc. dmd does lag behind, but gdc achieves performance very close to g++. The commandline I used was gdmd -O -release -inline (gdmd is a wrapper around gdc which accepts dmd options).

Looking at the assembler listing, it looks like neither dmd nor gdc inlined scalar_product, but g++/gdc did emit MMX instructions, so they might be auto-vectorizing the loop.

Solution 2 - C++

One big thing that slows D down is a subpar garbage collection implementation. Benchmarks that don't heavily stress the GC will show very similar performance to C and C++ code compiled with the same compiler backend. Benchmarks that do heavily stress the GC will show that D performs abysmally. Rest assured, though, this is a single (albeit severe) quality-of-implementation issue, not a baked-in guarantee of slowness. Also, D gives you the ability to opt out of GC and tune memory management in performance-critical bits, while still using it in the less performance-critical 95% of your code.

I've put some effort into improving GC performance lately and the results have been rather dramatic, at least on synthetic benchmarks. Hopefully these changes will be integrated into one of the next few releases and will mitigate the issue.

Solution 3 - C++

This is a very instructive thread, thanks for all the work to the OP and helpers.

One note - this test is not assessing the general question of abstraction/feature penalty or even that of backend quality. It focuses on virtually one optimization (loop optimization). I think it's fair to say that gcc's backend is somewhat more refined than dmd's, but it would be a mistake to assume that the gap between them is as large for all tasks.

Solution 4 - C++

Definitely seems like a quality-of-implementation issue.

I ran some tests with the OP's code and made some changes. I actually got D going faster for LDC/clang++, operating on the assumption that arrays must be allocated dynamically (xs and associated scalars). See below for some numbers.

Questions for the OP

Is it intentional that the same seed be used for each iteration of C++, while not so for D?

Setup

I have tweaked the original D source (dubbed scalar.d) to make it portable between platforms. This only involved changing the type of the numbers used to access and modify the size of arrays.

After this, I made the following changes:

Used uninitializedArray to avoid default inits for scalars in xs (probably made the biggest difference). This is important because D normally default-inits everything silently, which C++ does not.
Factored out printing code and replaced writefln with writeln
Changed imports to be selective
Used pow operator (^^) instead of manual multiplication for final step of calculating average
Removed the size_type and replaced appropriately with the new index_type alias

...thus resulting in scalar2.cpp (pastebin):

    import std.stdio : writeln;
    import std.datetime : Clock, Duration;
    import std.array : uninitializedArray;
    import std.random : uniform;

    alias result_type = long;
    alias value_type = int;
    alias vector_t = value_type[];
    alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint

    immutable long N = 20000;
    immutable int size = 10;

    // Replaced for loops with appropriate foreach versions
    value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
      value_type res = 0;
      for(index_type i = 0; i < size; ++i)
        res += x[i] * y[i];
      return res;
    }

    int main() {
      auto tm_before = Clock.currTime;
      auto countElapsed(in string taskName) { // Factor out printing code
        writeln(taskName, ": ", Clock.currTime - tm_before);
        tm_before = Clock.currTime;
      }

      // 1. allocate and fill randomly many short vectors
      vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
      for(index_type i = 0; i < N; ++i)
        xs[i] = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
      countElapsed("allocation");

      for(index_type i = 0; i < N; ++i)
        for(index_type j = 0; j < size; ++j)
          xs[i][j] = uniform(-1000, 1000);
      countElapsed("random");

      // 2. compute all pairwise scalar products:
      result_type avg = 0;
      for(index_type i = 0; i < N; ++i)
        for(index_type j = 0; j < N; ++j)
          avg += scalar_product(xs[i], xs[j]);
      avg /= N ^^ 2;// Replace manual multiplication with pow operator
      writeln("result: ", avg);
      countElapsed("scalar products");

      return 0;
    }

After testing scalar2.d (which prioritized optimization for speed), out of curiousity I replaced the loops in main with foreach equivalents, and called it scalar3.d (pastebin):

    import std.stdio : writeln;
    import std.datetime : Clock, Duration;
    import std.array : uninitializedArray;
    import std.random : uniform;

    alias result_type = long;
    alias value_type = int;
    alias vector_t = value_type[];
    alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint

    immutable long N = 20000;
    immutable int size = 10;

    // Replaced for loops with appropriate foreach versions
    value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
      value_type res = 0;
      for(index_type i = 0; i < size; ++i)
        res += x[i] * y[i];
      return res;
    }

    int main() {
      auto tm_before = Clock.currTime;
      auto countElapsed(in string taskName) { // Factor out printing code
        writeln(taskName, ": ", Clock.currTime - tm_before);
        tm_before = Clock.currTime;
      }

      // 1. allocate and fill randomly many short vectors
      vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
      foreach(ref x; xs)
        x = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
      countElapsed("allocation");

      foreach(ref x; xs)
        foreach(ref val; x)
          val = uniform(-1000, 1000);
      countElapsed("random");

      // 2. compute all pairwise scalar products:
      result_type avg = 0;
      foreach(const ref x; xs)
        foreach(const ref y; xs)
          avg += scalar_product(x, y);
      avg /= N ^^ 2;// Replace manual multiplication with pow operator
      writeln("result: ", avg);
      countElapsed("scalar products");

      return 0;
    }

I compiled each of these tests using an LLVM-based compiler, since LDC seems to be the best option for D compilation in terms of performance. On my x86_64 Arch Linux installation I used the following packages:

clang 3.6.0-3
ldc 1:0.15.1-4
dtools 2.067.0-2

I used the following commands to compile each:

C++: clang++ scalar.cpp -o"scalar.cpp.exe" -std=c++11 -O3
D: rdmd --compiler=ldc2 -O3 -boundscheck=off <sourcefile>

Results

The results (screenshot of raw console output) of each version of the source as follows:

scalar.cpp (original C++):

 allocation: 2 ms

 random generation: 12 ms

 result: 29248300000

 time: 2582 ms

C++ sets the standard at 2582 ms.

scalar.d (modified OP source):

 allocation: 5 ms, 293 μs, and 5 hnsecs 

 random: 10 ms, 866 μs, and 4 hnsecs 

 result: 53237080000

 scalar products: 2 secs, 956 ms, 513 μs, and 7 hnsecs

This ran for ~2957 ms. Slower than the C++ implementation, but not too much.

scalar2.d (index/length type change and uninitializedArray optimization):

 allocation: 2 ms, 464 μs, and 2 hnsecs

 random: 5 ms, 792 μs, and 6 hnsecs

 result: 59

 scalar products: 1 sec, 859 ms, 942 μs, and 9 hnsecs

In other words, ~1860 ms. So far this is in the lead.

scalar3.d (foreaches):

 allocation: 2 ms, 911 μs, and 3 hnsecs

 random: 7 ms, 567 μs, and 8 hnsecs

 result: 189

 scalar products: 2 secs, 182 ms, and 366 μs

~2182 ms is slower than scalar2.d, but faster than the C++ version.

Conclusion

With the correct optimizations, the D implementation actually went faster than its equivalent C++ implementation using the LLVM-based compilers available. The current gap between D and C++ for most applications seems only to be based on limitations of current implementations.

Solution 5 - C++

dmd is the reference implementation of the language and thus most work is put into the frontend to fix bugs rather than optimizing the backend.

"in" is faster in your case cause you are using dynamic arrays which are reference types. With ref you introduce another level of indirection (which is normally used to alter the array itself and not only the contents).

Vectors are usually implemented with structs where const ref makes perfect sense. See smallptD vs. smallpt for a real-world example featuring loads of vector operations and randomness.

Note that 64-Bit can also make a difference. I once missed that on x64 gcc compiles 64-Bit code while dmd still defaults to 32 (will change when the 64-Bit codegen matures). There was a remarkable speedup with "dmd -m64 ...".

Solution 6 - C++

Whether C++ or D is faster is likely to be highly dependent on what you're doing. I would think that when comparing well-written C++ to well-written D code, they would generally either be of similar speed, or C++ would be faster, but what the particular compiler manages to optimize could have a big effect completely aside from the language itself.

However, there are a few cases where D stands a good chance of beating C++ for speed. The main one which comes to mind would be string processing. Thanks to D's array slicing capabalities, strings (and arrays in general) can be processed much faster than you can readily do in C++. For D1, Tango's XML processor is extremely fast, thanks primarily to D's array slicing capabilities (and hopefully D2 will have a similarly fast XML parser once the one that's currently being worked on for Phobos has been completed). So, ultimately whether D or C++ is going to be faster is going to be very dependent on what you're doing.

Now, I am suprised that you're seeing such a difference in speed in this particular case, but it is the sort of thing that I would expect to improve as dmd improves. Using gdc might yield better results and would likely be a closer comparison of the language itself (rather than the backend) given that it's gcc-based. But it wouldn't surprise me at all if there are a number of things which could be done to speed up the code that dmd generates. I don't think that there's much question that gcc is more mature than dmd at this point. And code optimizations are one of the prime fruits of code maturity.

Ultimately, what matters is how well dmd performs for your particular application, but I do agree that it would definitely be nice to know how well C++ and D compare in general. In theory, they should be pretty much the same, but it really depends on the implementation. I think that a comprehensive set of benchmarks would be required to really test how well the two presently compare however.

Solution 7 - C++

You can write C code is D so as far as which is faster, it will depend on a lot of things:

What compiler you use
What feature you use
how aggressively you optimize

Differences in the first aren't fair to drag in. The second might give C++ an advantage as it, if anything, has fewer heavy features. The third is the fun one: D code in some ways is easier to optimize because in general it is easier to understand. Also it has the ability to do a large degree of generative programing allowing things like verbose and repetitive but fast code to be written in a shorter forms.

Solution 8 - C++

Seems like a quality of implementation issue. For example, here's what I've been testing with:

import std.datetime, std.stdio, std.random;

version = ManualInline;

immutable N = 20000;
immutable Size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_type;

result_type scalar_product(in vector_type x, in vector_type y)
in
{
    assert(x.length == y.length);
}
body
{
    result_type result = 0;
    
    foreach(i; 0 .. x.length)
        result += x[i] * y[i];
    
    return result;
}

void main()
{   
    auto startTime = Clock.currTime();

    // 1. allocate vectors
    vector_type[] vectors = new vector_type[N];
    foreach(ref vec; vectors)
        vec = new value_type[Size];
    
    auto time = Clock.currTime() - startTime;
    writefln("allocation: %s ", time);
    startTime = Clock.currTime();
    
    // 2. randomize vectors
    foreach(ref vec; vectors)
        foreach(ref e; vec)
            e = uniform(-1000, 1000);

    time = Clock.currTime() - startTime;
    writefln("random: %s ", time);
    startTime = Clock.currTime();

    // 3. compute all pairwise scalar products
    result_type avg = 0;
    
    foreach(vecA; vectors)
        foreach(vecB; vectors)
        {
            version(ManualInline)
            {
                result_type result = 0;
    
                foreach(i; 0 .. vecA.length)
                    result += vecA[i] * vecB[i];
            
                avg += result;
            }
            else
            {
                avg += scalar_product(vecA, vecB);
            }
        }
    
    avg = avg / (N * N);
    
    time = Clock.currTime() - startTime;
    writefln("scalar products: %s ", time);
    writefln("result: %s", avg);
}

With ManualInline defined I get 28 seconds, but without I get 32. So the compiler isn't even inlining this simple function, which I think it's clear it should be.

(My command line is dmd -O -noboundscheck -inline -release ....)

Content Type	Original Author	Original Content on Stackoverflow
Question	Lars	View Question on Stackoverflow
Solution 1 - C++	Vladimir Panteleev	View Answer on Stackoverflow
Solution 2 - C++	dsimcha	View Answer on Stackoverflow
Solution 3 - C++	Andrei Alexandrescu	View Answer on Stackoverflow
Solution 4 - C++	Erich Gubler	View Answer on Stackoverflow
Solution 5 - C++	Trass3r	View Answer on Stackoverflow
Solution 6 - C++	Jonathan M Davis	View Answer on Stackoverflow
Solution 7 - C++	BCS	View Answer on Stackoverflow
Solution 8 - C++	GManNickG	View Answer on Stackoverflow