Why has std::reduce been added in C++17?


C++ Problem Overview


I was looking for a thorough explanation of the meaning of the "Return value" description for std::reduce, which, according to cppreference.com, is:

[Screenshot of the "Return value" section from the cppreference.com documentation for std::reduce]

Maybe after I properly understand this section, I can better determine when I should choose std::reduce over std::accumulate.

C++ Solutions


Solution 1 - C++

Since you asked for a thorough explanation, and the previous answer covers only basics, I'm taking the liberty of adding a more thorough one.

std::reduce is intended to perform the second major step of the MapReduce programming model. The basic idea is that the platform (in this case, C++ implementation) provides these two primitive operations map and reduce, and the programmer provides callback operations for each of the two that perform the "actual work".

Basically, the callback for the map operation maps an input value to an intermediate value. The callback for reduce combines two intermediate values into one intermediate value. The last intermediate value left becomes the output of the whole MapReduce. It seems to be a very restrictive model per se, but still, it has a large range of applications.

The platform has to do more, of course, such as "shuffling" (distributing the inputs, usually in groups, to different processing units) and scheduling, but this is of little concern to the application programmer.

By the way, in the C++ standard library, the "map" operation is called transform. It has received parallelism support in C++17 as well, but I'll get into parallelism later.

Here's a made-up example: Let's say we have a function that converts an integer to a string representation. Now, given a list of integers, we want the textual representation with the greatest ratio of consonants to vowels. E.g.

  • Input: 1, 17, 22, 4, 8
  • Output: twenty-two

(Try it for yourself if you don't believe this result.)

We could use MapReduce here by using our int-to-text function as the callback to map (resp. std::transform), and a function that counts the number of consonants and vowels and then selects either the left-hand or right-hand argument accordingly. There are some inefficiencies here, in particular, the "ratio" should be cached, but these are details.
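As a rough illustration, and assuming a hypothetical to_words function that only covers the numbers in this example, the pipeline could look something like this (a sketch, not a tuned implementation; the caching of the ratio mentioned above is omitted):

    #include <algorithm>
    #include <cctype>
    #include <numeric>
    #include <string>
    #include <vector>

    // Hypothetical "map" callback: integer to English words (covers only the inputs above).
    std::string to_words(int n) {
        switch (n) {
            case 1:  return "one";
            case 4:  return "four";
            case 8:  return "eight";
            case 17: return "seventeen";
            case 22: return "twenty-two";
            default: return "unknown";
        }
    }

    // Ratio of consonants to vowels in a string (non-letter characters are ignored).
    double consonant_vowel_ratio(const std::string& s) {
        int vowels = 0, consonants = 0;
        for (unsigned char c : s) {
            if (!std::isalpha(c)) continue;
            if (std::string("aeiou").find(static_cast<char>(std::tolower(c))) != std::string::npos)
                ++vowels;
            else
                ++consonants;
        }
        return vowels == 0 ? consonants : static_cast<double>(consonants) / vowels;
    }

    int main() {
        std::vector<int> input{1, 17, 22, 4, 8};

        // "map" step
        std::vector<std::string> words(input.size());
        std::transform(input.begin(), input.end(), words.begin(), to_words);

        // "reduce" step: keep whichever operand has the greater consonant/vowel ratio
        std::string best = std::reduce(words.begin() + 1, words.end(), words.front(),
            [](const std::string& a, const std::string& b) {
                return consonant_vowel_ratio(a) >= consonant_vowel_ratio(b) ? a : b;
            });
        // best == "twenty-two"
    }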

Now your question may, and possibly should, be: Why should I care? After all, so far you haven't gained much from using e.g. std::transform and std::reduce in this way, and you could have used std::accumulate in place of the latter as well. The answer, of course, given a sufficiently large number of input values, is execution policies – the former two standard function templates have overloads that allow implicit parallelism. But that still raises the question of why one would use MapReduce rather than a thread pool or std::async and a bunch of hand-written loops. First, especially for "pure" computations on large vectors or other containers, with no I/O, it is often more convenient to write the two MapReduce callbacks because you don't have to deal with all the details of how the input data is spread around to different threads and then combined.
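For reference, the execution-policy overloads from <execution> look like this in use (a minimal sketch; whether the work actually runs in parallel is up to the implementation):

    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v(1'000'000, 1);

        // Strict left fold, always evaluated in order on one thread.
        long long a = std::accumulate(v.begin(), v.end(), 0LL);

        // Reduction with an execution policy; the implementation may split the
        // range across threads and combine the partial results in any order.
        long long r = std::reduce(std::execution::par, v.begin(), v.end(), 0LL);

        // Here a == r == 1'000'000, because adding these small values is
        // associative and commutative in practice (no overflow, no rounding).
    }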

Second, MapReduce encourages a discipline of structuring your computations in a way that can be parallelized very effectively. Of course, in a programming language that supports the imperative paradigm, such as C++, you can still mess things up by locking a bunch of mutexes, or whatever way you may have of interfering with other threads. But the MapReduce paradigm encourages writing independent mapping callbacks. As a simple rule of thumb, if these tasks share data at all, then it should be read-only so that copies of it can be stored in multiple CPU caches at the same time. Well-written tasks, combined with an efficient platform implementation of the underlying algorithms, can scale to hundreds or even thousands of CPU cores, as is shown by the MapReduce platforms already in common use (like Apache Hadoop, but take this only as a necessary example and not as gratuitous advertisement).

But the question was mainly about std::reduce – we could still perform this highly scalable mapping and then run std::accumulate on the intermediate results, right? And this is where we get to what François Andrieux previously wrote. accumulate performs what mathematicians call a left fold. If you view the operations and their operands as a binary tree, then this would be a left-leaning tree, e.g. to add 1, 2, 3 and 4:

      +
     / \
    +   4
   / \
  +   3
 / \
1   2

As you can see, the result of each operation is one of the arguments of the next operation. This means there is a linear chain of data dependencies, and that is the bane of all parallelism. To add a million numbers, you need a million consecutive operations, which will block a single core, and you cannot make any use of additional cores. They will have nothing to do most of the time, and the communication overhead will greatly outweigh the cost of the computations. (It's actually worse than that because modern CPUs can perform multiple simple calculations per clock, but this doesn't work when there are so many data dependencies, so most of the ALUs or FPUs go unused.)
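To make the dependency chain concrete: std::accumulate is essentially specified as the following loop (a simplified sketch that ignores the move-semantics refinement added in C++20), so each application of op has to wait for the result of the previous one:

    template <class InputIt, class T, class BinaryOp>
    T accumulate_like(InputIt first, InputIt last, T init, BinaryOp op) {
        // Every iteration consumes the result of the previous one:
        // a strict left fold, with no opportunity to overlap the operations.
        for (; first != last; ++first)
            init = op(init, *first);
        return init;
    }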

By lifting the restriction that a left fold must be used, std::reduce allows the platform to make more efficient use of computational resources. Even if you use the single-threaded overload, the platform could, for example, use SIMD to add a million integers in much less than a million operations, and the number of data dependencies will be greatly reduced. A 10x speedup on a well-written integer addition reduce would not surprise me. Granted, this particular special case could probably be optimized under the as-if rule because the C++ implementation "knows" that integer addition is (almost, see below) associative.

But reduce goes further than that, as was mentioned, by supporting execution policies, i.e. in most cases multi-core parallelism. In theory, a balanced binary tree of operations could be used. (Remember that a tree is balanced if its depth is less than two, or if the depths of its left and right subtrees differ by at most 1 and both subtrees are themselves balanced.) Such a tree has only logarithmic depth. If we have a million integers, the minimum tree depth is 20, so – theoretically – given enough cores and no communication overhead, we could achieve a speedup by a factor of 50,000 over even the optimized single-threaded calculation. Of course, in practice, that's a load of wishful thinking, but we can still expect large speedups.
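To make the tree idea concrete, here is a hand-written pairwise reduction (just an illustration of how the dependency chain shrinks to logarithmic depth, not how any particular implementation of std::reduce works):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Pairwise (tree-shaped) sum: each pass halves the number of values,
    // and all additions within a pass are independent of each other.
    double pairwise_sum(std::vector<double> v) {
        if (v.empty()) return 0.0;
        while (v.size() > 1) {
            std::vector<double> next;
            next.reserve((v.size() + 1) / 2);
            for (std::size_t i = 0; i + 1 < v.size(); i += 2)
                next.push_back(v[i] + v[i + 1]);   // independent, parallelizable
            if (v.size() % 2 != 0)
                next.push_back(v.back());          // odd element carried over
            v = std::move(next);
        }
        return v.front();
    }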

All that said, let me add a quick disclaimer/reminder that performance is not the same as efficiency. Using 64 cores for 100ms each means much higher performance than using one core for 1,000ms, but much less CPU efficiency. Another way to put it is that performance is efficiency in the sense of minimizing the elapsed time, but there are other measures of efficiency – total CPU time used, RAM used, energy used, and so on. The primary motivation of parallel MapReduce is to provide higher performance. Whether it reduces CPU time or energy consumption is unclear, and it will very likely increase peak RAM usage.

To top it all off, here are some caveats. As was mentioned, reduce is non-deterministic if the operations are not associative or not commutative. Fortunately, the most important arithmetic operations such as addition and multiplication are associative and commutative, right? We all know that integer and floating-point addition, for example, have both of these properties. And of course, I am being facetious. In C++, neither signed integer addition nor floating-point addition is associative. For floating-point numbers, this is due to possible differences in rounding of intermediate results. This is easily visualized if we, as an example, pick a simple decimal floating-point format with two significant digits, and consider the sum 10 + 0.4 + 0.4. If this is done by the normal C++ syntax rules as a left fold, we get (10 + 0.4) + 0.4 = 10 + 0.4 = 10 because each time the result is rounded back down to 10. But if we do it as 10 + (0.4 + 0.4), the first intermediate result is 0.8, and 10 + 0.8 is then rounded up to 11.

Also, rounding errors can become greatly magnified by a high depth of the operation tree, so doing a left fold is actually one of the worst things one could do when it comes to accuracy. There are several ways to deal with this behavior, ranging from sorting and grouping the inputs to using an increased intermediate precision, but when it comes to reduce, there may simply be no way to get 100% run-to-run consistency.
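The same effect is easy to reproduce with binary floating point; the well-known 0.1 + 0.2 + 0.3 example shows that the grouping alone changes the result on typical IEEE-754 implementations:

    #include <cstdio>

    int main() {
        double a = 0.1, b = 0.2, c = 0.3;

        double left  = (a + b) + c;  // the grouping accumulate is required to use
        double right = a + (b + c);  // a grouping reduce is allowed to use

        // On typical IEEE-754 platforms these differ in the last bit
        // (roughly 0.6000000000000001 vs 0.6), because the intermediate
        // results are rounded differently.
        std::printf("%.17g\n%.17g\n", left, right);
    }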

The other, possibly more surprising, observation is that signed integer addition is not associative in C++. The reason for this is the possibility of overflow, to put it bluntly: (-1) + 1 + INT_MAX. According to the normal syntax rules, or accumulate, the result is INT_MAX. But if you use reduce, this might get rewritten as (-1) + (1 + INT_MAX) which contains an integer overflow and hence undefined behavior. In fact, because reduce may arbitrarily change the order of operands, this is even true if the inputs are { INT_MAX, -1, 1 }.
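A sketch of that difference; the problematic reduce grouping is only shown in a comment, because actually evaluating it would be undefined behavior, which is exactly the point:

    #include <climits>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v{-1, 1, INT_MAX};

        // Left fold: ((0 + (-1)) + 1) + INT_MAX == INT_MAX, well defined.
        int ok = std::accumulate(v.begin(), v.end(), 0);

        // std::reduce(v.begin(), v.end()) might instead evaluate something like
        // (-1) + (1 + INT_MAX); the subexpression 1 + INT_MAX overflows a signed
        // int, which is undefined behavior. Whether that actually happens depends
        // on the implementation and, with an execution policy, possibly on the run.
        (void)ok;
    }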

My recommendation here is to ensure that the callback to reduce cannot produce an overflow. This could be done by limiting the allowed range of inputs (e.g. if you add 1000 ints, make sure that none of them is greater than INT_MAX / 1000 or less than INT_MIN / 1000, rounded up), by using a larger integer type, or, as an absolute last resort (because this is both expensive and hard to handle correctly), by putting additional checks in the reduce callback. In most practical cases, reduce is only marginally less safe regarding integer overflow than accumulate, however.
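One way to apply the "larger integer type" suggestion is to widen every element before it is ever combined, e.g. with std::transform_reduce (a sketch; note that merely passing a long long init to std::reduce is not necessarily enough, because the implementation is allowed to add two int elements directly):

    #include <execution>
    #include <functional>
    #include <numeric>
    #include <vector>

    long long wide_sum(const std::vector<int>& v) {
        // Every element is converted to long long first, so every addition the
        // implementation performs is a long long addition, which cannot overflow
        // unless the mathematical result itself exceeds roughly 9.2e18.
        return std::transform_reduce(std::execution::par, v.begin(), v.end(),
                                     0LL, std::plus<>{},
                                     [](int x) { return static_cast<long long>(x); });
    }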

Solution 2 - C++

std::accumulate iterates over the container in order, whereas std::reduce might not. Because the order is not guaranteed, std::reduce introduces extra requirements:

> The behavior is non-deterministic if binary_op is not associative or not commutative. The behavior is undefined if binary_op modifies any element or invalidates any iterator in [first; last], including the end iterator.

However, std::reduce provides overloads that support parallelization, which are not available with std::accumulate. std::reduce allows you to automatically parallelize your operation, provided it meets the requirements described above.

Solution 3 - C++

  • Allowing parallelism is the main reason for the addition of std::reduce.

  • Also, one needs to make sure that the operation to be used with std::reduce is both associative and commutative.

  • For example, addition is associative and gives the same results when the accumulation is done in parallel using std::reduce:
    100 + 50 + 40 + 10 = 200 and (100 + 40) + (50 + 10) = 200

  • But since subtraction is not associative, std::reduce may give incorrect results (see the sketch after this list):
    100 - 50 - 40 - 10 = 0, which is not the same as (100 - 40) - (50 - 10) = 20

  • Efficiency:

        #include <execution>
        #include <numeric>
        #include <vector>

        std::vector<double> v(100000, 0.1);
        double r1 = std::accumulate(v.begin(), v.end(), 0.0);
        double r2 = std::reduce(std::execution::par, v.begin(), v.end()); // potentially faster
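As a small illustration of the subtraction pitfall above (a sketch; std::minus<> is neither associative nor commutative, so the std::reduce result is not guaranteed to match the left fold):

    #include <functional>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v{50, 40, 10};

        // Left fold starting from 100: ((100 - 50) - 40) - 10 == 0.
        int a = std::accumulate(v.begin(), v.end(), 100, std::minus<>{});

        // Non-deterministic: std::reduce may regroup and reorder the operands,
        // e.g. as (100 - 40) - (50 - 10) == 20, so the result is unspecified.
        int r = std::reduce(v.begin(), v.end(), 100, std::minus<>{});

        (void)a; (void)r;
    }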

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

  • Question: Aria Pahlavan (View Question on Stackoverflow)
  • Solution 1 - C++: Arne Vogel (View Answer on Stackoverflow)
  • Solution 2 - C++: François Andrieux (View Answer on Stackoverflow)
  • Solution 3 - C++: kbagal (View Answer on Stackoverflow)