Why is there no implicit parallelism in Haskell?

Haskell Concurrency Parallel Processing Compiler Optimization

Haskell Problem Overview

Haskell is functional and pure, so basically it has all the properties needed for a compiler to be able to tackle implicit parallelism.

Consider this trivial example:

f = do
  a <- Just 1
  b <- Just $ Just 2
  -- ^ The above line does not utilize an `a` variable, so it can be safely
  -- executed in parallel with the preceding line
  c <- b
  -- ^ The above line references a `b` variable, so it can only be executed
  -- sequentially after it
  return (a, c)
  -- On the exit from a monad scope we wait for all computations to finish and 
  -- gather the results

Schematically the execution plan can be described as:

               do
                |
      +---------+---------+
      |                   |
  a <- Just 1      b <- Just $ Just 2
      |                   |
      |                 c <- b
      |                   |
      +---------+---------+
                |
           return (a, c)

Why is there no such functionality implemented in the compiler with a flag or a pragma yet? What are the practical reasons?

Haskell Solutions

Solution 1 - Haskell

This is a long studied topic. While you can implicitly derive parallelism in Haskell code, the problem is that there is too much parallelism, at too fine a grain, for current hardware.

So you end up spending effort on book keeping, not running things faster.

Since we don't have infinite parallel hardware, it is all about picking the right granularity -- too coarse and there will be idle processors, too ﬁne and the overheads will be unacceptable.

What we have is more coarse grained parallelism (sparks) suitable for generating thousands or millions of parallel tasks (so not at the instruction level), which map down onto the mere handful of cores we typically have available today.

Note that for some subsets (e.g. array processing) there are fully automatic parallelization libraries with tight cost models.

For background on this see Feedback Directed Implicit Parallelism, where they introduce an automated approach to the insertion of par in arbitrary Haskell programs.

Solution 2 - Haskell

While your code block may not be the best example due to implicit data dependence between the a and b, it is worth noting that these two bindings commute in that

f = do
  a <- Just 1
  b <- Just $ Just 2
  ...

will give the same results

f = do
  b <- Just $ Just 2
  a <- Just 1
  ...

so this could still be parallelized in a speculative fashion. It is worth noting that this does not need to have anything to do with monads. We could, for instance, evaluate all independent expressions in a let-block in parallel or we could introduce a version of let that would do so. The lparallel library for Common Lisp does this.

Now, I am by no means an expert on the subject, but this is my understanding of the problem. A major stumbling block is determining when it is advantageous to parallelize the evaluation of multiple expressions. There is overhead associated with starting the separate threads for evaluation, and, as your example shows, it may result in wasted work. Some expressions may be too small to make parallel evaluation worth the overhead. As I understand it, coming up will a fully accurate metric of the cost of an expression would amount to solving the halting problem, so you are relegated to using an heuristic approach to determining what to evaluate in parallel.

Then it is not always faster to throw more cores at a problem. Even when explicitly parallelizing a problem with the many Haskell libraries available, you will often not see much speedup just by evaluating expressions in parallel due to heavy memory allocation and usage and the strain this puts on the garbage collector and CPU cache. You end up needing a nice compact memory layout and to traverse your data intelligently. Having 16 threads traverse linked lists will just bottleneck you at your memory bus and could actually make things slower.

At the very least, what expressions can be effectively parallelized is something that is not obvious to many programmers (at least it isn't to this one), so getting a compiler to do it effectively is non-trivial.

Solution 3 - Haskell

Short answer: Sometimes running stuff in parallel turns out to be slower, not faster. And figuring out when it is and when it isn't a good idea is an unsolved research problem.

However, you still can be "suddenly utilizing all those cores without ever bothering with threads, deadlocks and race conditions". It's not automatic; you just need to give the compiler some hints about where to do it! :-D

Solution 4 - Haskell

One of the reason is because Haskell is non-strict and it does not evaluate anything by default. In general the compiler does not know that computation of a and b terminates hence trying to compute it would be waste of resources:

x :: Maybe ([Int], [Int])
x = Just undefined
y :: Maybe ([Int], [Int])
y = Just (undefined, undefined)
z :: Maybe ([Int], [Int])
z = Just ([0], [1..])
a :: Maybe ([Int], [Int])
a = undefined
b :: Maybe ([Int], [Int])
b = Just ([0], map fib [0..])
    where fib 0 = 1
          fib 1 = 1
          fib n = fib (n - 1) + fib (n - 2)

Consider it for the following functions

main1 x = case x of
              Just _ -> putStrLn "Just"
              Nothing -> putStrLn "Nothing"

(a, b) part does not need to be evaluated. As soon as you get that x = Just _ you can proceed to branch - hence it will work for all values but a

main2 x = case x of
              Just (_, _) -> putStrLn "Just"
              Nothing -> putStrLn "Nothing"

This function enforces evaluation of tuple. Hence x will terminate with error while rest will work.

main3 x = case x of
              Just (a, b) -> print a >> print b
              Nothing -> putStrLn "Nothing"

This function will first print first list and then second. It will work for z (resulting in printing infinite stream of numbers but Haskell can deal with it). b will eventually run out of memory.

Now in general you don't know if computation terminates or not and how many resources it will consume. Infinite lists are perfectly fine in Haskell:

main = maybe (return ()) (print . take 5 . snd) b -- Prints first 5 Fibbonacci numbers

Hence spawning threads to evaluate expression in Haskell might try to evaluate something which is not meant to be evaluated fully - say list of all primes - yet programmers use as part of structure. The above examples are very simple and you may argue that compiler could notice them - however it is not possible in general due to Halting problem (you cannot write program which takes arbitrary program and its input and check if it terminates) - therefore it is not safe optimization.

In addition - which was mentioned by other answers - it is hard to predict whether the overhead of additional thread are worth engaging. Even though GHC doesn't spawn new threads for sparks using green threading (with fixed number of kernel threads - setting aside a few exceptions) you still need to move data from one core to another and synchronize between them which can be quite costly.

However Haskell do have guided parallelization without breaking the purity of language by par and similar functions.

Solution 5 - Haskell

Actually there was such attempt but not on common hardware due to the low available quantity of cores. The project is called Reduceron. It runs Haskell code with a high level of parallelism. In case it was ever released as a proper 2 GHz ASIC core, we'd have a serious breakthrough in Haskell execution speed.

Content Type	Original Author	Original Content on Stackoverflow
Question	Nikita Volkov	View Question on Stackoverflow
Solution 1 - Haskell	Don Stewart	View Answer on Stackoverflow
Solution 2 - Haskell	sabauma	View Answer on Stackoverflow
Solution 3 - Haskell	MathematicalOrchid	View Answer on Stackoverflow
Solution 4 - Haskell	Maciej Piechotka	View Answer on Stackoverflow
Solution 5 - Haskell	polkovnikov.ph	View Answer on Stackoverflow