The cost of nested methods

ScalaMethods

Scala Problem Overview


In Scala one might define methods inside other methods. This limits their scope of use to inside of definition block. I use them to improve readability of code that uses several higher-order functions. In contrast to anonymous function literals, this allows me to give them meaningful names before passing them on.

For example:

class AggregatedPerson extends HashSet[PersonRecord] {
  def mostFrequentName: String = {
    type NameCount = (String, Int)
    def moreFirst(a: NameCount, b: NameCount) = a._2 > b._2
    def countOccurrences(nameGroup: (String, List[PersonRecord])) =
      (nameGroup._1, nameGroup._2.size) 

    iterator.toList.groupBy(_.fullName).
      map(countOccurrences).iterator.toList.
      sortWith(moreFirst).head._1
  }
}

Is there any runtime cost because of the nested method definition I should be aware of?

Does the answer differ for closures?

Scala Solutions


Solution 1 - Scala

During compilaton, the nested functions are moveFirst and countOccurences are moved out to the same level as mostFrequentName. They get compiler synthesized names: moveFirst$1 and countOccurences$1.

In addition, when you refer to one of these methods without an argument list, it is lifted into a function. So map(countOccurences) is the same as writing map((a: (String, List[PersonRecord])) => countOccurences(a)). This anonymous function is compiled to a separate class AggregatedPerson$$anonfun$mostFrequentName$2, which does nothing more than forward to countOccurences$.

As a side note, the process of lifting the method to a function is called Eta Expansion. It is triggered if you omit the argument list in a context where a function type is expected (as in your example), or if you use _ in place of the entire argument list, or in place of each argument (val f1 = countOccurences _ ; val f2 = countOccurences(_).

If the code was directly in the closure, you would have one fewer method call in your stack, and one fewer synthetic method generated. The performance impact of this is likely to be zero in most cases.

I find it to be a fantastically useful tool to structure code, and consider your example very idiomatic Scala.

Another useful tool is using small blocks to initialize a val:

val a = {
   val temp1, temp2 = ...
   f(temp1, temp2)
}

You can use scalac -print to see exactly how Scala code is translated into a form ready for the JVM. Heres the output from your program:

[[syntax trees at end of cleanup]]// Scala source: nested-method.scala
package <empty> {

  class AggregatedPerson extends scala.collection.mutable.HashSet with ScalaObject {
    def mostFrequentName(): java.lang.String = AggregatedPerson.this.iterator().toList().groupBy({
      (new AggregatedPerson$$anonfun$mostFrequentName$1(AggregatedPerson.this): Function1)
    }).map({
      {
        (new AggregatedPerson$$anonfun$mostFrequentName$2(AggregatedPerson.this): Function1)
      }
    }, collection.this.Map.canBuildFrom()).$asInstanceOf[scala.collection.MapLike]().iterator().toList().sortWith({
      {
        (new AggregatedPerson$$anonfun$mostFrequentName$3(AggregatedPerson.this): Function2)
      }
    }).$asInstanceOf[scala.collection.IterableLike]().head().$asInstanceOf[Tuple2]()._1().$asInstanceOf[java.lang.String]();
    final def moreFirst$1(a: Tuple2, b: Tuple2): Boolean = scala.Int.unbox(a._2()).>(scala.Int.unbox(b._2()));
    final def countOccurrences$1(nameGroup: Tuple2): Tuple2 = new Tuple2(nameGroup._1(), scala.Int.box(nameGroup._2().$asInstanceOf[scala.collection.SeqLike]().size()));
    def this(): AggregatedPerson = {
      AggregatedPerson.super.this();
      ()
    }
  };

  @SerialVersionUID(0) @serializable final <synthetic> class AggregatedPerson$$anonfun$mostFrequentName$1 extends scala.runtime.AbstractFunction1 {
    final def apply(x$1: PersonRecord): java.lang.String = x$1.fullName();
    final <bridge> def apply(v1: java.lang.Object): java.lang.Object = AggregatedPerson$$anonfun$mostFrequentName$1.this.apply(v1.$asInstanceOf[PersonRecord]());
    def this($outer: AggregatedPerson): AggregatedPerson$$anonfun$mostFrequentName$1 = {
      AggregatedPerson$$anonfun$mostFrequentName$1.super.this();
      ()
    }
  };

  @SerialVersionUID(0) @serializable final <synthetic> class AggregatedPerson$$anonfun$mostFrequentName$2 extends scala.runtime.AbstractFunction1 {
    final def apply(nameGroup: Tuple2): Tuple2 = AggregatedPerson$$anonfun$mostFrequentName$2.this.$outer.countOccurrences$1(nameGroup);
    <synthetic> <paramaccessor> private[this] val $outer: AggregatedPerson = _;
    final <bridge> def apply(v1: java.lang.Object): java.lang.Object = AggregatedPerson$$anonfun$mostFrequentName$2.this.apply(v1.$asInstanceOf[Tuple2]());
    def this($outer: AggregatedPerson): AggregatedPerson$$anonfun$mostFrequentName$2 = {
      if ($outer.eq(null))
        throw new java.lang.NullPointerException()
      else
        AggregatedPerson$$anonfun$mostFrequentName$2.this.$outer = $outer;
      AggregatedPerson$$anonfun$mostFrequentName$2.super.this();
      ()
    }
  };
  @SerialVersionUID(0) @serializable final <synthetic> class AggregatedPerson$$anonfun$mostFrequentName$3 extends scala.runtime.AbstractFunction2 {
    final def apply(a: Tuple2, b: Tuple2): Boolean = AggregatedPerson$$anonfun$mostFrequentName$3.this.$outer.moreFirst$1(a, b);
    <synthetic> <paramaccessor> private[this] val $outer: AggregatedPerson = _;
    final <bridge> def apply(v1: java.lang.Object, v2: java.lang.Object): java.lang.Object = scala.Boolean.box(AggregatedPerson$$anonfun$mostFrequentName$3.this.apply(v1.$asInstanceOf[Tuple2](), v2.$asInstanceOf[Tuple2]()));
    def this($outer: AggregatedPerson): AggregatedPerson$$anonfun$mostFrequentName$3 = {
      if ($outer.eq(null))
        throw new java.lang.NullPointerException()
      else
        AggregatedPerson$$anonfun$mostFrequentName$3.this.$outer = $outer;
      AggregatedPerson$$anonfun$mostFrequentName$3.super.this();
      ()
    }
  }
}

Solution 2 - Scala

There is a small runtime cost. You can observe it here (apologies for the long code):

object NestBench {
  def countRaw() = {
    var sum = 0
    var i = 0
    while (i<1000) {
      sum += i
      i += 1
      var j = 0
      while (j<1000) {
        sum += j
        j += 1
        var k = 0
        while (k<1000) {
          sum += k
          k += 1
          sum += 1
        }
      }
    }
    sum
  }
  def countClosure() = {
    var sum = 0
    var i = 0
    def sumI {
      sum += i
      i += 1
      var j = 0
      def sumJ {
        sum += j
        j += 1
        var k = 0
        def sumK {
          def sumL { sum += 1 }
          sum += k
          k += 1
          sumL
        }
        while (k<1000) sumK
      }
      while (j<1000) sumJ
    }
    while (i<1000) sumI
    sum
  }
  def countInner() = {
    var sum = 0
    def whileI = {
      def whileJ = {
        def whileK = {
          def whileL() = 1
          var ksum = 0
          var k = 0
          while (k<1000) { ksum += k; k += 1; ksum += whileL }
          ksum
        }
        var jsum = 0
        var j = 0
        while (j<1000) {
          jsum += j; j += 1
          jsum += whileK
        }
        jsum
      }
      var isum = 0
      var i = 0
      while (i<1000) {
        isum += i; i += 1
        isum += whileJ
      }
      isum
    }
    whileI
  }
  def countFunc() = {
    def summer(f: => Int)() = {
      var sum = 0
      var i = 0
      while (i<1000) {
        sum += i; i += 1
        sum += f
      }
      sum
    }
    summer( summer( summer(1) ) )()
  }
  def nsPerIteration(f:() => Int): (Int,Double) = {
    val t0 = System.nanoTime
    val result = f()
    val t1 = System.nanoTime
    (result , (t1-t0)*1e-9)
  }
  def main(args: Array[String]) {
    for (i <- 1 to 5) {
      val fns = List(countRaw _, countClosure _, countInner _, countFunc _)
      val labels = List("raw","closure","inner","func")
      val results = (fns zip labels) foreach (fl => {
        val x = nsPerIteration( fl._1 )
        printf("Method %8s produced %d; time/it = %.3f ns\n",fl._2,x._1,x._2)
      })
    }
  }
}

There are four different methods for summing integers:

  • A raw while loop ("raw")
  • While loop inner methods that are closures over the sum variable
  • While loop inner methods that return the partial sum
  • A while-and-sum nested function

And we see the results on my machine in terms of nanoseconds taken in the inner loop:

scala> NestBench.main(Array[String]())
Method      raw produced -1511174132; time/it = 0.422 ns
Method  closure produced -1511174132; time/it = 2.376 ns
Method    inner produced -1511174132; time/it = 0.402 ns
Method     func produced -1511174132; time/it = 0.836 ns
Method      raw produced -1511174132; time/it = 0.418 ns
Method  closure produced -1511174132; time/it = 2.410 ns
Method    inner produced -1511174132; time/it = 0.399 ns
Method     func produced -1511174132; time/it = 0.813 ns
Method      raw produced -1511174132; time/it = 0.411 ns
Method  closure produced -1511174132; time/it = 2.372 ns
Method    inner produced -1511174132; time/it = 0.399 ns
Method     func produced -1511174132; time/it = 0.813 ns
Method      raw produced -1511174132; time/it = 0.411 ns
Method  closure produced -1511174132; time/it = 2.370 ns
Method    inner produced -1511174132; time/it = 0.399 ns
Method     func produced -1511174132; time/it = 0.815 ns
Method      raw produced -1511174132; time/it = 0.412 ns
Method  closure produced -1511174132; time/it = 2.357 ns
Method    inner produced -1511174132; time/it = 0.400 ns
Method     func produced -1511174132; time/it = 0.817 ns

So, bottom line is: nesting functions really doesn't hurt you at all in simple cases--the JVM will figure out that the call can be inlined (thus raw and inner give the same times). If you take a more functional approach, the function call can't be completely neglected, but the time taken is vanishingly small (approximately 0.4 ns extra per call). If you use a lot of closures, then closing them gives an overhead of something like 1 ns per call at least in this case where a single mutable variable is written to.

You can modify the code above to find answers to other questions, but the bottom line is that it is all very fast, ranging between "no penalty whatsoever" through to "only worry about in the very tightest inner loops that otherwise have minimal work to do".

(P.S. For comparison, the creation of a single small object takes ~4 ns on my machine.)

Solution 3 - Scala

Current as of January 2014

The current benchmark is ~3 years old and Hotspot and the compiler have significantly evolved. I am also using Google Caliper to perform the benchmarks.

Code

import com.google.caliper.SimpleBenchmark

class Benchmark extends SimpleBenchmark {
	def timeRaw(reps: Int) = {
		var i = 0
		var result = 0L
		while (i < reps) {
			result += 0xc37e ^ (i * 0xd5f3)
			i = i + 1
		}
		result
	}

	def normal(i: Int): Long = 0xc37e ^ (i * 0xd5f3)
	def timeNormal(reps: Int) = {
		var i = 0
		var result = 0L
		while (i < reps) {
			result += normal(i)
			i = i + 1
		}
		result
	}

	def timeInner(reps: Int) = {
		def inner(i: Int): Long = 0xc37e ^ (i * 0xd5f3)

		var i = 0
		var result = 0L
		while (i < reps) {
			result += inner(i)
			i = i + 1
		}
		result
	}

	def timeClosure(reps: Int) = {
		var i = 0
		var result = 0L
		val closure = () => result += 0xc37e ^ (i * 0xd5f3)
		while (i < reps) {
			closure()
			i = i + 1
		}
		result
	}

	def normal(i: Int, j: Int, k: Int, l: Int): Long = i ^ j ^ k ^ l 
	def timeUnboxed(reps: Int) = {
		var i = 0
		var result = 0L
		while (i < reps) {
			result += normal(i,i,i,i)
			i = i + 1
		}
		result
	}

	val closure = (i: Int, j: Int, k: Int, l: Int) => (i ^ j ^ k ^ l).toLong 
	def timeBoxed(reps: Int) = {
		var i = 0
		var result = 0L
		while (i < reps) {
			closure(i,i,i,i)
			i = i + 1
		}
		result
	}
}

Results

benchmark     ns linear runtime
   Normal  0.576 =
      Raw  0.576 =
    Inner  0.576 =
  Closure  0.532 =
  Unboxed  0.893 =
    Boxed 15.210 ==============================

What is very surprising is that closure test was completed about 4ns faster than the others. This seems to be an idiosyncrasy of Hotspot instead of the execution environment, multiple runs have returned the same trend.

Using a closure that performs boxing is a huge performance hit, it takes approx 3.579ns to perform one unboxing and reboxing, enough to do a lot of primitive math operations. In this specific position, things might get better with the work being done on a new optimizer. In the general case, boxing might be alleviated by miniboxing.

Edit: The new optimizer doesn't really help here, it makes Closure 0.1 ns slower and Boxed 0.1 ns faster:

 benchmark     ns linear runtime
       Raw  0.574 =
    Normal  0.576 =
     Inner  0.575 =
   Closure  0.645 =
   Unboxed  0.889 =
     Boxed 15.107 ==============================

Performed with 2.11.0-20131209-220002-9587726041 from magarciaEPFL/scala

Versions

JVM

java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

Scalac

Scala compiler version 2.10.3 -- Copyright 2002-2013, LAMP/EPFL

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionPalimondoView Question on Stackoverflow
Solution 1 - ScalaretronymView Answer on Stackoverflow
Solution 2 - ScalaRex KerrView Answer on Stackoverflow
Solution 3 - Scalauser60561View Answer on Stackoverflow