What optimizations can GHC be expected to perform reliably?

OptimizationHaskellGhc

Optimization Problem Overview


GHC has a lot of optimizations that it can perform, but I don't know what they all are, nor how likely they are to be performed and under what circumstances.

My question is: what transformations can I expect it to apply every time, or nearly so? If I look at a piece of code that's going to be executed (evaluated) frequently and my first thought is "hmm, maybe I should optimize that", in which cases should my second thought be, "don't even think about it, GHC got this"?

I was reading the paper Stream Fusion: From Lists to Streams to Nothing at All, and the technique they used of rewriting list processing into a different form which GHC's normal optimizations would then reliably optimize down into simple loops was novel to me. How can I tell when my own programs are eligible for that kind of optimization?

There's some information in the GHC manual, but it only goes part of the way towards answering the question.

EDIT: I'm starting a bounty. What I would like is a list of lower-level transformations like lambda/let/case-floating, type/constructor/function argument specialization, strictness analysis and unboxing, worker/wrapper, and whatever else significant GHC does that I've left out, along with explanations and examples of input and output code, and ideally illustrations of situations when the total effect is more than the sum of its parts. And ideally some mention of when transformations won't happen. I'm not expecting novel-length explanations of every transformation, a couple of sentences and inline one-liner code examples could be enough (or a link, if it's not to twenty pages of scientific paper), as long as the big picture is clear by the end of it. I want to be able to look at a piece of code and be able to make a good guess about whether it will compile down to a tight loop, or why not, or what I would have to change to make it. (I'm not interested so much here in the big optimization frameworks like stream fusion (I just read a paper about that); more in the kind of knowledge that people who write these frameworks have.)

Optimization Solutions


Solution 1 - Optimization

This GHC Trac page also explains the passes fairly well. This page explains the optimization ordering, though, like the majority of the Trac Wiki, it is out of date.

For specifics, the best thing to do is probably to look at how a specific program is compiled. The best way to see which optimizations are being performed is to compile the program verbosely, using the -v flag. Taking as an example the first piece of Haskell I could find on my computer:

Glasgow Haskell Compiler, Version 7.4.2, stage 2 booted by GHC version 7.4.1
Using binary package database: /usr/lib/ghc-7.4.2/package.conf.d/package.cache
wired-in package ghc-prim mapped to ghc-prim-0.2.0.0-7d3c2c69a5e8257a04b2c679c40e2fa7
wired-in package integer-gmp mapped to integer-gmp-0.4.0.0-af3a28fdc4138858e0c7c5ecc2a64f43
wired-in package base mapped to base-4.5.1.0-6e4c9bdc36eeb9121f27ccbbcb62e3f3
wired-in package rts mapped to builtin_rts
wired-in package template-haskell mapped to template-haskell-2.7.0.0-2bd128e15c2d50997ec26a1eaf8b23bf
wired-in package dph-seq not found.
wired-in package dph-par not found.
Hsc static flags: -static
*** Chasing dependencies:
Chasing modules from: *SleepSort.hs
Stable obj: [Main]
Stable BCO: []
Ready for upsweep
  [NONREC
      ModSummary {
         ms_hs_date = Tue Oct 18 22:22:11 CDT 2011
         ms_mod = main:Main,
         ms_textual_imps = [import (implicit) Prelude, import Control.Monad,
                            import Control.Concurrent, import System.Environment]
         ms_srcimps = []
      }]
*** Deleting temp files:
Deleting: 
compile: input file SleepSort.hs
Created temporary directory: /tmp/ghc4784_0
*** Checking old interface for main:Main:
[1 of 1] Compiling Main             ( SleepSort.hs, SleepSort.o )
*** Parser:
*** Renamer/typechecker:
*** Desugar:
Result size of Desugar (after optimization) = 79
*** Simplifier:
Result size of Simplifier iteration=1 = 87
Result size of Simplifier iteration=2 = 93
Result size of Simplifier iteration=3 = 83
Result size of Simplifier = 83
*** Specialise:
Result size of Specialise = 83
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = False}) = 95
*** Float inwards:
Result size of Float inwards = 95
*** Simplifier:
Result size of Simplifier iteration=1 = 253
Result size of Simplifier iteration=2 = 229
Result size of Simplifier = 229
*** Simplifier:
Result size of Simplifier iteration=1 = 218
Result size of Simplifier = 218
*** Simplifier:
Result size of Simplifier iteration=1 = 283
Result size of Simplifier iteration=2 = 226
Result size of Simplifier iteration=3 = 202
Result size of Simplifier = 202
*** Demand analysis:
Result size of Demand analysis = 202
*** Worker Wrapper binds:
Result size of Worker Wrapper binds = 202
*** Simplifier:
Result size of Simplifier = 202
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = True}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = True}) = 210
*** Common sub-expression:
Result size of Common sub-expression = 210
*** Float inwards:
Result size of Float inwards = 210
*** Liberate case:
Result size of Liberate case = 210
*** Simplifier:
Result size of Simplifier iteration=1 = 206
Result size of Simplifier = 206
*** SpecConstr:
Result size of SpecConstr = 206
*** Simplifier:
Result size of Simplifier = 206
*** Tidy Core:
Result size of Tidy Core = 206
writeBinIface: 4 Names
writeBinIface: 28 dict entries
*** CorePrep:
Result size of CorePrep = 224
*** Stg2Stg:
*** CodeGen:
*** CodeOutput:
*** Assembler:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-I.' '-c' '/tmp/ghc4784_0/ghc4784_0.s' '-o' 'SleepSort.o'
Upsweep completely successful.
*** Deleting temp files:
Deleting: /tmp/ghc4784_0/ghc4784_0.c /tmp/ghc4784_0/ghc4784_0.s
Warning: deleting non-existent /tmp/ghc4784_0/ghc4784_0.c
link: linkables are ...
LinkableM (Sat Sep 29 20:21:02 CDT 2012) main:Main
   [DotO SleepSort.o]
Linking SleepSort ...
*** C Compiler:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-c' '/tmp/ghc4784_0/ghc4784_0.c' '-o' '/tmp/ghc4784_0/ghc4784_0.o' '-DTABLES_NEXT_TO_CODE' '-I/usr/lib/ghc-7.4.2/include'
*** C Compiler:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-c' '/tmp/ghc4784_0/ghc4784_0.s' '-o' '/tmp/ghc4784_0/ghc4784_1.o' '-DTABLES_NEXT_TO_CODE' '-I/usr/lib/ghc-7.4.2/include'
*** Linker:
'/usr/bin/gcc' '-fno-stack-protector' '-Wl,--hash-size=31' '-Wl,--reduce-memory-overheads' '-o' 'SleepSort' 'SleepSort.o' '-L/usr/lib/ghc-7.4.2/base-4.5.1.0' '-L/usr/lib/ghc-7.4.2/integer-gmp-0.4.0.0' '-L/usr/lib/ghc-7.4.2/ghc-prim-0.2.0.0' '-L/usr/lib/ghc-7.4.2' '/tmp/ghc4784_0/ghc4784_0.o' '/tmp/ghc4784_0/ghc4784_1.o' '-lHSbase-4.5.1.0' '-lHSinteger-gmp-0.4.0.0' '-lgmp' '-lHSghc-prim-0.2.0.0' '-lHSrts' '-lm' '-lrt' '-ldl' '-u' 'ghczmprim_GHCziTypes_Izh_static_info' '-u' 'ghczmprim_GHCziTypes_Czh_static_info' '-u' 'ghczmprim_GHCziTypes_Fzh_static_info' '-u' 'ghczmprim_GHCziTypes_Dzh_static_info' '-u' 'base_GHCziPtr_Ptr_static_info' '-u' 'base_GHCziWord_Wzh_static_info' '-u' 'base_GHCziInt_I8zh_static_info' '-u' 'base_GHCziInt_I16zh_static_info' '-u' 'base_GHCziInt_I32zh_static_info' '-u' 'base_GHCziInt_I64zh_static_info' '-u' 'base_GHCziWord_W8zh_static_info' '-u' 'base_GHCziWord_W16zh_static_info' '-u' 'base_GHCziWord_W32zh_static_info' '-u' 'base_GHCziWord_W64zh_static_info' '-u' 'base_GHCziStable_StablePtr_static_info' '-u' 'ghczmprim_GHCziTypes_Izh_con_info' '-u' 'ghczmprim_GHCziTypes_Czh_con_info' '-u' 'ghczmprim_GHCziTypes_Fzh_con_info' '-u' 'ghczmprim_GHCziTypes_Dzh_con_info' '-u' 'base_GHCziPtr_Ptr_con_info' '-u' 'base_GHCziPtr_FunPtr_con_info' '-u' 'base_GHCziStable_StablePtr_con_info' '-u' 'ghczmprim_GHCziTypes_False_closure' '-u' 'ghczmprim_GHCziTypes_True_closure' '-u' 'base_GHCziPack_unpackCString_closure' '-u' 'base_GHCziIOziException_stackOverflow_closure' '-u' 'base_GHCziIOziException_heapOverflow_closure' '-u' 'base_ControlziExceptionziBase_nonTermination_closure' '-u' 'base_GHCziIOziException_blockedIndefinitelyOnMVar_closure' '-u' 'base_GHCziIOziException_blockedIndefinitelyOnSTM_closure' '-u' 'base_ControlziExceptionziBase_nestedAtomically_closure' '-u' 'base_GHCziWeak_runFinalizzerBatch_closure' '-u' 'base_GHCziTopHandler_flushStdHandles_closure' '-u' 'base_GHCziTopHandler_runIO_closure' '-u' 'base_GHCziTopHandler_runNonIO_closure' '-u' 'base_GHCziConcziIO_ensureIOManagerIsRunning_closure' '-u' 'base_GHCziConcziSync_runSparks_closure' '-u' 'base_GHCziConcziSignal_runHandlers_closure'
link: done
*** Deleting temp files:
Deleting: /tmp/ghc4784_0/ghc4784_1.o /tmp/ghc4784_0/ghc4784_0.s /tmp/ghc4784_0/ghc4784_0.o /tmp/ghc4784_0/ghc4784_0.c
*** Deleting temp dirs:
Deleting: /tmp/ghc4784_0

Looking from the first *** Simplifier: to the last, where all the optimization phases happen, we see quite a lot.

First of all, the Simplifier runs between almost all the phases. This makes writing many passes much easier. For example, when implementing many optimizations, they simply create rewrite rules to propagate the changes instead of having to do it manually. The simplifier encompasses a number of simple optimizations, including inlining and fusion. The main limitation of this that I know is that GHC refuses to inline recursive functions, and that things have to be named correctly for fusion to work.

Next, we see a full list of all the optimizations performed:

  • Specialise

    The basic idea of specialization is to remove polymorphism and overloading by identifying places where the function is called and creating versions of the function that aren't polymorphic - they are specific to the types they are called with. You can also tell the compiler to do this with the SPECIALISE pragma. As an example, take a factorial function:

     fac :: (Num a, Eq a) => a -> a
     fac 0 = 1
     fac n = n * fac (n - 1)
    

    As the compiler doesn't know any properties of the multiplication that is to be used, it cannot optimize this at all. If however, it sees that it is used on an Int, it now can create a new version, differing only in the type:

     fac_Int :: Int -> Int
     fac_Int 0 = 1
     fac_Int n = n * fac_Int (n - 1)
    

    Next, rules mentioned below can fire, and you end up with something working on unboxed Ints, which is much faster than the original. Another way to look at specialisation is partial application on type class dictionaries and type variables.

    The source here has a load of notes in it.

  • Float out

    EDIT: I apparently misunderstood this before. My explanation has completely changed.

    The basic idea of this is to move computations that shouldn't be repeated out of functions. For example, suppose we had this:

     \x -> let y = expensive in x+y
    

    In the above lambda, every time the function is called, y is recomputed. A better function, which floating out produces, is

     let y = expensive in \x -> x+y
    

    To facilitate the process, other transformations may be applied. For example, this happens:

      \x -> x + f 2
      \x -> x + let f_2 = f 2 in f_2
      \x -> let f_2 = f 2 in x + f_2
      let f_2 = f 2 in \x -> x + f_2
    

    Again, repeated computation is saved.

    The source is very readable in this case.

    At the moment bindings between two adjacent lambdas are not floated. For example, this does not happen:

     \x y -> let t = x+x in ...
    

    going to

      \x -> let t = x+x in \y -> ...
    
  • Float inwards

    Quoting the source code,

    The main purpose of floatInwards is floating into branches of a case, so that we don't allocate things, save them on the stack, and then discover that they aren't needed in the chosen branch.

    As an example, suppose we had this expression:

     let x = big in
         case v of
             True -> x + 1
             False -> 0
    

    If v evaluates to False, then by allocating x, which is presumably some big thunk, we have wasted time and space. Floating inwards fixes this, producing this:

     case v of
         True -> let x = big in x + 1
         False -> let x = big in 0
    

    , which is subsequently replaced by the simplifier with

     case v of
         True -> big + 1
         False -> 0
    

    This paper, although covering other topics, gives a fairly clear introduction. Note that despite their names, floating in and floating out don't get in an infinite loop for two reasons:

    1. Float in floats lets into case statements, while float out deals with functions.
    2. There is a fixed order of passes, so they shouldn't be alternating infinitely.

Solution 2 - Optimization

Laziness

It's not a "compiler optimisation", but it's something guaranteed by the language specification, so you can always count on it happening. Essentially, this means that work is not performed until you "do something" with the result. (Unless you do one of several things to deliberately turn off laziness.)

This, obviously, is an entire topic in its own right, and SO has lots of questions and answers about it already.

In my limited experience, making your code too lazy or too strict has vastly larger performance penalties (in time and space) than any of the other stuff I'm about to talk about...

Strictness analysis

Laziness is about avoiding work unless it's necessary. If the compiler can determine that a given result will "always" be needed, then it won't bother storing the calculation and performing it later; it'll just perform it directly, because that is more efficient. This is so-called "strictness analysis".

The gotcha, obviously, is that the compiler cannot always detect when something could be made strict. Sometimes you need to give the compiler little hints. (I'm not aware of any easy way to determine whether strictness analysis has done what you think it has, other than wading through the Core output.)

Inlining

If you call a function, and the compiler can tell which function you're calling, it may try to "inline" that function - that is, to replace the function call with a copy of the function itself. The overhead of a function call is usually pretty small, but inlining often enables other optimisations to happen which wouldn't have happened otherwise, so inlining can be a big win.

Functions are only inlined if they are "small enough" (or if you add a pragma specifically asking for inlining). Also, functions can only be inlined if the compiler can tell what function you're calling. There are two main ways that the compiler could be unable to tell:

  • If the function you're calling is passed in from somewhere else. E.g., when the filter function is compiled, you can't inline the filter predicate, because it's a user-supplied argument.

  • If the function you're calling is a class method and the compiler doesn't know what type is involved. E.g., when the sum function is compiled, the compiler can't inline the + function, because sum works with several different number types, each of which has a different + function.

In the latter case, you can use the {-# SPECIALIZE #-} pragma to generate versions of a function that are hard-coded to a particular type. E.g., {-# SPECIALIZE sum :: [Int] -> Int #-} would compile a version of sum hard-coded for the Int type, meaning that + can be inlined in this version.

Note, though, that our new special-sum function will only be called when the compiler can tell that we're working with Int. Otherwise the original, polymorphic sum gets called. Again, the actual function call overhead is fairly small. It's the additional optimisations that inlining can enable which are beneficial.

Common subexpression elimination

If a certain block of code calculates the same value twice, the compiler may replace that with a single instance of the same computation. For example, if you do

(sum xs + 1) / (sum xs + 2)

then the compiler might optimise this to

let s = sum xs in (s+1)/(s+2)

You might expect that the compiler would always do this. However, apparently in some situations this can result in worse performance, not better, so GHC does not always do this. Frankly, I don't really understand the details behind this one. But the bottom line is, if this transformation is important to you, it's not hard to do it manually. (And if it's not important, why are you worrying about it?)

Case expressions

Consider the following:

foo (0:_ ) = "zero"
foo (1:_ ) = "one"
foo (_:xs) = foo xs
foo (  []) = "end"

The first three equations all check whether the list is non-empty (among other things). But checking the same thing thrice is wasteful. Fortunately, it's very easy for the compiler to optimise this into several nested case expressions. In this case, something like

foo xs =
  case xs of
    y:ys ->
      case y of
        0 -> "zero"
        1 -> "one"
        _ -> foo ys
    []   -> "end"

This is rather less intuitive, but more efficient. Because the compiler can easily do this transformation, you don't have to worry about it. Just write your pattern matching in the most intuitive way possible; the compiler is very good at reordering and rearranging this to make it as fast as possible.

Fusion

The standard Haskell idiom for list processing is to chain together functions that take one list and produce a new list. The canonical example being

map g . map f

Unfortunately, while laziness guarantees skipping unecessary work, all the allocations and deallocations for the intermediate list sap performance. "Fusion" or "deforestation" is where the compiler tries to eliminate these intermediate steps.

The trouble is, most of these functions are recursive. Without the recursion, it would be an elementary exercise in inlining to squish all the functions into one big code block, run the simplifier over it and produce really optimal code with no intermediate lists. But because of the recursion, that won't work.

You can use {-# RULE #-} pragmas to fix some of this. For example,

{-# RULES "map/map" forall f g xs. map f (map g xs) = map (f.g) xs #-}

Now every time GHC sees map applied to map, it squishes it into a single pass over the list, eliminating the intermediate list.

Trouble is, this works only for map followed by map. There are many other possibilities - map followed by filter, filter followed by map, etc. Rather than hand-code a solution for each of them, so-called "stream fusion" was invented. This is a more complicated trick, which I won't describe here.

The long and short of it is: These are all special optimisation tricks written by the programmer. GHC itself knows nothing about fusion; it's all in the list librarys and other container libraries. So what optimisations happen depends on how your container libraries are written (or, more realistically, which libraries you choose to use).

For example, if you work with Haskell '98 arrays, don't expect any fusion of any kind. But I understand that the vector library has extensive fusion capabilities. It's all about the libraries; the compiler just provides the RULES pragma. (Which is extremely powerful, by the way. As a library author, you can use it to rewrite client code!)


Meta:

  • I agree with the people saying "code first, profile second, optimise third".

  • I also agree with the people saying "it is useful to have a mental model for how much cost a given design decision has".

Balance in all things, and all that...

Solution 3 - Optimization

If a let binding v = rhs is used in only one place you can count on the compiler to inline it, even if rhs is big.

The exception (that almost isn't one in the context of the current question) is lambdas risking work duplication. Consider:

let v = rhs
    l = \x-> v + x
in map l [1..100]

there inlining v would be dangerous because the one (syntactic) use would translate into 99 extra evaluations of rhs. However, in this case, you would be very unlikely to want to inline it manually either. So essentially you can use the rule:

If you'd consider inlining a name that only appears once, the compiler will do it anyway.

As a happy corollary, using a let binding simply to decompose a long statement (with hope of gaining clarity) is essentially free.

This comes from community.haskell.org/~simonmar/papers/inline.pdf which includes a lot more information about inlining.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionglaebhoerlView Question on Stackoverflow
Solution 1 - OptimizationgereeterView Answer on Stackoverflow
Solution 2 - OptimizationMathematicalOrchidView Answer on Stackoverflow
Solution 3 - OptimizationDanielView Answer on Stackoverflow