Should I always use a parallel stream when possible?

Tags: Java, Parallel Processing, Java 8, Java Stream

Java Problem Overview


With Java 8 and lambdas it's easy to iterate over collections as streams, and just as easy to use a parallel stream. Two examples from the docs, the second one using parallelStream:

myShapesCollection.stream()
    .filter(e -> e.getColor() == Color.RED)
    .forEach(e -> System.out.println(e.getName()));

myShapesCollection.parallelStream() // <-- This one uses parallel
    .filter(e -> e.getColor() == Color.RED)
    .forEach(e -> System.out.println(e.getName()));

As long as I don't care about the order, would it always be beneficial to use the parallel version? One would think it would be faster to divide the work across more cores.

Are there other considerations? When should a parallel stream be used, and when a non-parallel one?

(This question is asked to trigger a discussion about how and when to use parallel streams, not because I think always using them is a good idea.)

Java Solutions


Solution 1 - Java

A parallel stream has a much higher overhead compared to a sequential one. Coordinating the threads takes a significant amount of time. I would use sequential streams by default and only consider parallel ones if

  • I have a massive amount of items to process (or the processing of each item takes time and is parallelizable)

  • I have a performance problem in the first place

  • I don't already run the process in a multi-thread environment (for example: in a web container, if I already have many requests to process in parallel, adding an additional layer of parallelism inside each request could have more negative than positive effects)

In your example, performance will in any case be dominated by the synchronized access to System.out.println(), so making the process parallel will have no effect, or even a negative one.

Moreover, remember that parallel streams don't magically solve all the synchronization problems. If a shared resource is used by the predicates and functions used in the process, you'll have to make sure that everything is thread-safe. In particular, side effects are things you really have to worry about if you go parallel.
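
To illustrate, here is a minimal sketch (class and variable names are mine) contrasting an unsafe shared side effect with a collector, which accumulates safely even in parallel:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class SideEffectDemo {
        public static void main(String[] args) {
            // UNSAFE: ArrayList is not thread-safe, so mutating it from a
            // parallel forEach can lose elements or even throw.
            List<Integer> unsafe = new ArrayList<>();
            IntStream.range(0, 100_000).parallel().forEach(unsafe::add);
            System.out.println("unsafe size: " + unsafe.size()); // often < 100000

            // SAFE: let the stream accumulate results with a collector, which
            // merges per-thread containers instead of sharing mutable state.
            List<Integer> safe = IntStream.range(0, 100_000).parallel()
                    .boxed()
                    .collect(Collectors.toList());
            System.out.println("safe size: " + safe.size()); // always 100000
        }
    }

Collectors exist precisely so the library can manage the per-thread accumulation and merging for you.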

In any case, measure, don't guess! Only a measurement will tell you if the parallelism is worth it or not.
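
For instance, a crude timing sketch along these lines (all names are mine; a serious benchmark should use a harness such as JMH, which handles JIT warm-up properly):

    import java.util.stream.LongStream;

    public class MeasureDemo {
        // Crude wall-clock timing; a serious benchmark should use JMH.
        static void time(String label, Runnable task) {
            long start = System.nanoTime();
            task.run();
            System.out.println(label + ": " + (System.nanoTime() - start) / 1_000_000 + " ms");
        }

        public static void main(String[] args) {
            long n = 100_000_000L;
            // Repeat a few times so the JIT compiler warms up before you
            // trust the numbers.
            for (int i = 0; i < 3; i++) {
                time("sequential", () -> LongStream.rangeClosed(1, n).sum());
                time("parallel  ", () -> LongStream.rangeClosed(1, n).parallel().sum());
            }
        }
    }

Shrink n and the coordination overhead quickly eats the gain, which is exactly what the answers below quantify with the NQ model.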

Solution 2 - Java

The Stream API was designed to make it easy to write computations in a way that was abstracted away from how they would be executed, making switching between sequential and parallel easy.

However, just because it's easy doesn't mean it's always a good idea. In fact, it is a bad idea to drop .parallel() all over the place simply because you can.

First, note that parallelism offers no benefits other than the possibility of faster execution when more cores are available. A parallel execution will always involve more work than a sequential one, because in addition to solving the problem, it also has to perform dispatching and coordinating of sub-tasks. The hope is that you'll be able to get to the answer faster by breaking up the work across multiple processors; whether this actually happens depends on a lot of things, including the size of your data set, how much computation you are doing on each element, the nature of the computation (specifically, does the processing of one element interact with processing of others?), the number of processors available, and the number of other tasks competing for those processors.

Further, note that parallelism also often exposes nondeterminism in the computation that is often hidden by sequential implementations; sometimes this doesn't matter, or can be mitigated by constraining the operations involved (i.e., reduction operators must be stateless and associative.)
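
A small sketch of that nondeterminism (class name mine), using a reduction that deliberately violates the associativity and identity requirements:

    import java.util.stream.IntStream;

    public class AssociativityDemo {
        public static void main(String[] args) {
            // Subtraction is not associative (and 0 is not its identity),
            // so this reduction breaks the contract of reduce().
            int seq = IntStream.rangeClosed(1, 100)
                    .reduce(0, (a, b) -> a - b);

            // Sequentially the operator happens to be applied left-to-right,
            // but a parallel reduce combines partial results of sub-ranges,
            // so the same pipeline can give a different, varying answer.
            int par = IntStream.rangeClosed(1, 100)
                    .parallel()
                    .reduce(0, (a, b) -> a - b);

            System.out.println("sequential: " + seq); // -5050
            System.out.println("parallel:   " + par); // usually something else
        }
    }

With an associative operator such as + the two results would always agree; that is exactly the constraint the specification imposes.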

In reality, sometimes parallelism will speed up your computation, sometimes it will not, and sometimes it will even slow it down. It is best to develop first using sequential execution and then apply parallelism where

(A) you know that there's actually benefit to increased performance and

(B) that it will actually deliver increased performance.

(A) is a business problem, not a technical one. If you are a performance expert, you'll usually be able to look at the code and determine (B), but the smart path is to measure. (And, don't even bother until you're convinced of (A); if the code is fast enough, better to apply your brain cycles elsewhere.)

The simplest performance model for parallelism is the "NQ" model, where N is the number of elements, and Q is the computation per element. In general, you need the product NQ to exceed some threshold before you start getting a performance benefit. For a low-Q problem like "add up numbers from 1 to N", you will generally see a breakeven between N=1000 and N=10000. With higher-Q problems, you'll see breakevens at lower thresholds.

But the reality is quite complicated. So until you achieve experthood, first identify when sequential processing is actually costing you something, and then measure if parallelism will help.

Solution 3 - Java

I watched one of the presentations by Brian Goetz (Java Language Architect and specification lead for lambda expressions). He explains in detail the following four points to consider before parallelizing:

  • Splitting / decomposition costs
    – Sometimes splitting is more expensive than just doing the work!

  • Task dispatch / management costs
    – A thread can do a lot of work in the time it takes to hand work off to another thread.

  • Result combination costs
    – Sometimes combination involves copying lots of data. For example, adding numbers is cheap, whereas merging sets is expensive.

  • Locality
    – The elephant in the room. This is an important point that everyone may miss. You should consider cache misses: if a CPU waits for data because of cache misses, you won't gain anything by parallelization. That's why array-based sources parallelize best: the next indices (near the current index) are cached, so the CPU is less likely to experience a cache miss.

He also mentions a relatively simple formula to estimate the chance of a parallel speedup.

NQ Model:

N x Q > 10000

where,
N = number of data items
Q = amount of work per item
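
To make the splitting and locality points concrete, here is a rough sketch (class name and sizes are mine, and the timings are only illustrative) comparing an array-backed source with a LinkedList source:

    import java.util.Arrays;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class SourceShapeDemo {
        public static void main(String[] args) {
            int n = 1_000_000;

            // Array-backed source: splits cheaply into equal halves and has
            // good cache locality, so it parallelizes well.
            int[] array = IntStream.range(0, n).toArray();
            long t0 = System.nanoTime();
            long arraySum = Arrays.stream(array).parallel().asLongStream().sum();
            System.out.printf("array:      sum=%d in %d ms%n",
                    arraySum, (System.nanoTime() - t0) / 1_000_000);

            // LinkedList source: splitting must walk the nodes, and the nodes
            // are scattered in memory, so parallel gains are much smaller.
            List<Integer> linked = IntStream.range(0, n).boxed()
                    .collect(Collectors.toCollection(LinkedList::new));
            long t1 = System.nanoTime();
            long linkedSum = linked.parallelStream().mapToLong(Integer::longValue).sum();
            System.out.printf("linkedlist: sum=%d in %d ms%n",
                    linkedSum, (System.nanoTime() - t1) / 1_000_000);
        }
    }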

Solution 4 - Java

Other answers have already covered profiling to avoid premature optimization and overhead cost in parallel processing. This answer explains the ideal choice of data structures for parallel streaming.

> As a rule, performance gains from parallelism are best on streams over ArrayList, HashMap, HashSet, and ConcurrentHashMap instances; arrays; int ranges; and long ranges. What these data structures have in common is that they can all be accurately and cheaply split into subranges of any desired sizes, which makes it easy to divide work among parallel threads. The abstraction used by the streams library to perform this task is the spliterator, which is returned by the spliterator method on Stream and Iterable.
>
> Another important factor that all of these data structures have in common is that they provide good-to-excellent locality of reference when processed sequentially: sequential element references are stored together in memory. The objects referred to by those references may not be close to one another in memory, which reduces locality-of-reference. Locality-of-reference turns out to be critically important for parallelizing bulk operations: without it, threads spend much of their time idle, waiting for data to be transferred from memory into the processor's cache. The data structures with the best locality of reference are primitive arrays because the data itself is stored contiguously in memory.

Source: Item 48, "Use Caution When Making Streams Parallel", Effective Java, 3rd Edition, by Joshua Bloch.
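
As a small, hedged illustration of the spliterator abstraction that quote mentions (class name mine), an array-backed list reports its size exactly and splits into balanced halves:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Spliterator;

    public class SpliteratorDemo {
        public static void main(String[] args) {
            List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

            // An array-backed spliterator knows its size exactly, which lets
            // the framework divide work evenly among threads.
            Spliterator<Integer> whole = list.spliterator();
            System.out.println("size:   " + whole.estimateSize()); // 8
            System.out.println("SIZED?  " + whole.hasCharacteristics(Spliterator.SIZED));

            // trySplit() hands off a prefix for another worker thread;
            // here it cleanly yields two halves of four elements each.
            Spliterator<Integer> prefix = whole.trySplit();
            System.out.println("prefix: " + prefix.estimateSize()); // 4
            System.out.println("rest:   " + whole.estimateSize());  // 4
        }
    }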

Solution 5 - Java

Never parallelize an infinite stream with a limit. Here is what happens:

    public static void main(String[] args) {
        // let's count to 1 in parallel
        System.out.println(
            IntStream.iterate(0, i -> i + 1)
                .parallel()
                .skip(1)
                .findFirst()
                .getAsInt());
    }

Result

    Exception in thread "main" java.lang.OutOfMemoryError
        at ...
        at java.base/java.util.stream.IntPipeline.findFirst(IntPipeline.java:528)
        at InfiniteTest.main(InfiniteTest.java:24)
    Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.stream.SpinedBuffer$OfInt.newArray(SpinedBuffer.java:750)
        at ...

The same happens if you use .limit(...).

Explanation here: https://stackoverflow.com/questions/30825708/java-8-using-parallel-in-a-stream-causes-oom-error

Similarly, don't use parallel if the stream is ordered and has many more elements than you want to process, e.g.

public static void main(String[] args) {
    // find the first element after skipping 100, in parallel
    System.out.println(
            IntStream.range(1, 1000_000_000)
                    .parallel()
                    .skip(100)
                    .findFirst()
                    .getAsInt());
}

This may run much longer, because the parallel threads work on many number ranges instead of the crucial initial one, so it can take a very long time.
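
If you do need the first few elements of a huge ordered source, here is a sketch of two safer shapes (class name mine):

    import java.util.stream.IntStream;

    public class SkipAlternativesDemo {
        public static void main(String[] args) {
            // Option 1: keep the pipeline sequential; findFirst() on an
            // ordered source stops as soon as it sees the first survivor.
            int sequential = IntStream.range(1, 1000_000_000)
                    .skip(100)
                    .findFirst()
                    .getAsInt();

            // Option 2: if possible, encode the skip in the source itself
            // rather than asking an ordered parallel stream to discard rows.
            int shifted = IntStream.range(1 + 100, 1000_000_000)
                    .findFirst()
                    .getAsInt();

            System.out.println(sequential + " == " + shifted); // 101 == 101
        }
    }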

Solution 6 - Java

Collection.parallelStream() is a great way to do work in parallel. However, you need to keep in mind that it effectively uses a common thread pool with only a few worker threads internally (by default, the pool's parallelism is one less than the number of available CPU cores); see ForkJoinPool.commonPool(). If some of the pool's tasks are long-running, I/O-bound work, then other, potentially fast, parallelStream calls will get stuck waiting for free pool threads. This obviously means that fork-join tasks should be non-blocking and short or, in other words, CPU-bound. For a better understanding of the details, I strongly recommend carefully reading the java.util.concurrent.ForkJoinTask javadoc; here are some relevant quotes:

>The efficiency of ForkJoinTasks stems from ... their main use as computational tasks calculating pure functions or operating on purely isolated objects.

>Computations should ideally avoid synchronized methods or blocks, and should minimize other blocking synchronization

>Subdividable tasks should also not perform blocking I/O

These indicate that the main purpose of parallelStream() tasks is short computations over isolated in-memory structures. I also recommend checking out the article Common parallel stream pitfalls.
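
A widely used, though undocumented, workaround is to run the whole pipeline inside a dedicated ForkJoinPool, so that long-running stream work does not tie up the common pool. A minimal sketch (pool size and class name are mine, and the behavior it relies on is an implementation detail rather than a documented guarantee):

    import java.util.concurrent.ForkJoinPool;
    import java.util.stream.IntStream;

    public class CustomPoolDemo {
        public static void main(String[] args) throws Exception {
            // Tasks forked from inside a ForkJoinPool run in that pool, so
            // the common pool stays free for other parallelStream users.
            // Note: this relies on implementation behavior of the stream
            // library, not on a documented guarantee.
            ForkJoinPool pool = new ForkJoinPool(4);
            try {
                long sum = pool.submit(() ->
                        IntStream.rangeClosed(1, 1_000_000)
                                .parallel()
                                .asLongStream()
                                .sum()
                ).get();
                System.out.println(sum); // 500000500000
            } finally {
                pool.shutdown();
            }
        }
    }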

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Matsemann (View Question on Stackoverflow)
Solution 1 - Java: JB Nizet (View Answer on Stackoverflow)
Solution 2 - Java: Brian Goetz (View Answer on Stackoverflow)
Solution 3 - Java: Ram Patra (View Answer on Stackoverflow)
Solution 4 - Java: ruhong (View Answer on Stackoverflow)
Solution 5 - Java: tkruse (View Answer on Stackoverflow)
Solution 6 - Java: Roman Sinyakov (View Answer on Stackoverflow)