Java program runs slower when code that is never executed is commented out
JavaRecursionCompiler OptimizationJava Problem Overview
I observed some strange behaviour in one of my Java programs. I have tried to strip the code down as much as possible while still being able to replicate the behaviour. Code in full below.
public class StrangeBehaviour {
static boolean recursionFlag = true;
public static void main(String[] args) {
long startTime = System.nanoTime();
for (int i = 0; i < 10000; i ++) {
functionA(6, 0);
}
long endTime = System.nanoTime();
System.out.format("%.2f seconds elapsed.\n", (endTime - startTime) / 1000.0 / 1000 / 1000);
}
static boolean functionA(int recursionDepth, int recursionSwitch) {
if (recursionDepth == 0) { return true; }
return functionB(recursionDepth, recursionSwitch);
}
static boolean functionB(int recursionDepth, int recursionSwitch) {
for (int i = 0; i < 16; i++) {
if (StrangeBehaviour.recursionFlag) {
if (recursionSwitch == 0) {
if (functionA(recursionDepth - 1, 1 - recursionSwitch)) return true;
} else {
if (!functionA(recursionDepth - 1, 1 - recursionSwitch)) return false;
}
} else {
// This block is never entered into.
// Yet commenting out one of the lines below makes the program run slower!
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
}
}
return false;
}
}
I have two functions, functionA()
and functionB()
that call each other recursively. Both functions take a recursionDepth
parameter that controls the termination of the recursion. functionA()
calls functionB()
a maximum of once with recursionDepth
unchanged. functionB()
calls functionA()
16 times with recursionDepth - 1
. The recursion terminates when functionA()
is called with a recursionDepth
of 0
.
functionB()
has a code block with a number of System.out.println()
calls. This block is never entered into, as entry is controlled by a boolean recursionFlag
variable that is set to true
and never changed during program execution. However, commenting out even one of the println()
calls causes the program to run slower. On my machine, execution time is <0.2s with all the println()
calls present, and >2s when one of the calls is commented out.
What could be causing this behaviour? My only guess is that there is some naive compiler optimisation that is being triggered by a parameter related to the length of the code block (or the number of function calls etc.). Any further insight into this will be much appreciated!
Edit: I'm using JDK 1.8.
Java Solutions
Solution 1 - Java
The commented code affects how inlining is handled. If functionB gets longer/bigger (more bytecode instructions) it will not be inlined into functionA.
> So @J3D1 was able to use VMOptions to manually switch off inlining for functionB(): -XX:CompileCommand=dontinline,com.jd.benchmarking.StrangeBehaviour::functionB
This appears to eliminate the delay with the shorter function.
with the vm options you can display inlining
-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
##The bigger Version, wont inline functionB
@ 8 StrangeBehaviour::functionB (326 bytes) callee is too large
@ 21 StrangeBehaviour::functionA (12 bytes)
@ 8 StrangeBehaviour::functionB (326 bytes) callee is too large
@ 35 StrangeBehaviour::functionA (12 bytes)
@ 8 StrangeBehaviour::functionB (326 bytes) callee is too large
##The shorter Version will try to inline functionB, causing a few of further attempts.
@ 8 StrangeBehaviour::functionB (318 bytes) inline (hot)
@ 21 StrangeBehaviour::functionA (12 bytes) inline (hot)
@ 8 StrangeBehaviour::functionB (318 bytes) inline (hot)
@ 35 StrangeBehaviour::functionA (12 bytes) recursive inlining is too deep
@ 35 StrangeBehaviour::functionA (12 bytes) inline (hot)
@ 8 StrangeBehaviour::functionB (318 bytes) inline (hot)
@ 21 StrangeBehaviour::functionA (12 bytes) recursive inlining is too deep
@ 35 StrangeBehaviour::functionA (12 bytes) recursive inlining is too deep
@ 21 StrangeBehaviour::functionA (12 bytes) inline (hot)
@ 8 StrangeBehaviour::functionB (318 bytes) inline (hot)
@ 35 StrangeBehaviour::functionA (12 bytes) inline (hot)
@ 8 StrangeBehaviour::functionB (318 bytes) recursive inlining is too deep
@ 35 StrangeBehaviour::functionA (12 bytes) inline (hot)
@ 8 StrangeBehaviour::functionB (318 bytes) inline (hot)
@ 21 StrangeBehaviour::functionA (12 bytes) inline (hot)
@ 8 StrangeBehaviour::functionB (318 bytes) recursive inlining is too deep
@ 35 StrangeBehaviour::functionA (12 bytes) inline (hot)
@ 8 StrangeBehaviour::functionB (318 bytes) recursive inlining is too deep
Mostly guessing, but the bigger/inlined bytecode will cause problems with branch prediction and caching
Solution 2 - Java
The complete answer is a combination of k5_ and Tony's answers.
The code that the OP posted omits a warmup loop to trigger HotSpot compilation before doing the benchmark; hence the 10-fold (on my computer) speedup when the print statements are included, combines both the time spent in HotSpot to compile the bytecode to CPU instructions, as well as the actual running of the CPU instructions.
If I add a separate warmup loop before the timing loop, there is only a 2.5-fold speedup with the print statement.
That indicates that both the HotSpot/JIT compilation takes longer when the method is inlined (as Tony explained) as well as that the running of the code takes longer, probably because of worse cache or branch-prediction/pipelining performance, as k5_ showed.
public static void main(String[] args) {
// Added the following warmup loop before the timing loop
for (int i = 0; i < 50000; i++) {
functionA(6, 0);
}
long startTime = System.nanoTime();
for (int i = 0; i < 50000; i++) {
functionA(6, 0);
}
long endTime = System.nanoTime();
System.out.format("%.2f seconds elapsed.\n", (endTime - startTime) / 1000.0 / 1000 / 1000);
}
Solution 3 - Java
I am with @k5_, it seems that there exists a threshold to determine whether to inline a function. And if JIT compiler decided to inline it, it will cause a lot of work and time to do it as -XX:+PrintCompilation
shows:
task-id
158 32 3 so_test.StrangeBehaviour::functionB (326 bytes) made not entrant
159 35 3 java.lang.String::<init> (82 bytes)
160 36 s 1 java.util.Vector::size (5 bytes)
1878 37 % 3 so_test.StrangeBehaviour::main @ 6 (65 bytes)
1898 38 3 so_test.StrangeBehaviour::main (65 bytes)
2665 39 3 java.util.regex.Pattern::has (15 bytes)
2667 40 3 sun.misc.FDBigInteger::mult (64 bytes)
2668 41 3 sun.misc.FDBigInteger::<init> (30 bytes)
2668 42 3 sun.misc.FDBigInteger::trimLeadingZeros (57 bytes)
2.51 seconds elapsed.
The upper is info without comment, the following is with a comment which reduce the method size from 326 byte to 318 byte. And you can notice the task id in the column 1 of output, is so much larger in latter, which causes more time.
task-id
126 35 4 so_test.StrangeBehaviour::functionA (12 bytes)
130 33 3 so_test.StrangeBehaviour::functionA (12 bytes) made not entrant
131 36 s 1 java.util.Vector::size (5 bytes)
14078 37 % 3 so_test.StrangeBehaviour::main @ 6 (65 bytes)
14296 38 3 so_test.StrangeBehaviour::main (65 bytes)
14296 39 % 4 so_test.StrangeBehaviour::functionB @ 2 (318 bytes)
14300 40 4 so_test.StrangeBehaviour::functionB (318 bytes)
14304 34 3 so_test.StrangeBehaviour::functionB (318 bytes) made not entrant
14628 41 3 java.util.regex.Pattern::has (15 bytes)
14631 42 3 sun.misc.FDBigInteger::mult (64 bytes)
14632 43 3 sun.misc.FDBigInteger::<init> (30 bytes)
14632 44 3 sun.misc.FDBigInteger::trimLeadingZeros (57 bytes)
14.50 seconds elapsed.
And if you change the code into following (add two line and commnet out a print line), you can see that the code size change into 326 byte, and runs faster now:
if (StrangeBehaviour.recursionFlag) {
int a = 1;
int b = 1;
if (recursionSwitch == 0) {
if (functionA(recursionDepth - 1, 1 - recursionSwitch)) return true;
} else {
if (!functionA(recursionDepth - 1, 1 - recursionSwitch)) return false;
}
} else {
// This block is never entered into.
// Yet commenting out one of the lines below makes the program run slower!
System.out.println("...");
System.out.println("...");
System.out.println("...");
//System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
System.out.println("...");
}
New time and JIT compiler info:
140 34 3 so_test.StrangeBehaviour::functionB (326 bytes) made not entrant
145 36 3 java.lang.String::<init> (82 bytes)
148 37 s 1 java.util.Vector::size (5 bytes)
162 38 4 so_test.StrangeBehaviour::functionA (12 bytes)
163 33 3 so_test.StrangeBehaviour::functionA (12 bytes) made not entrant
1916 39 % 3 so_test.StrangeBehaviour::main @ 6 (65 bytes)
1936 40 3 so_test.StrangeBehaviour::main (65 bytes)
2686 41 3 java.util.regex.Pattern::has (15 bytes)
2689 42 3 sun.misc.FDBigInteger::mult (64 bytes)
2690 43 3 sun.misc.FDBigInteger::<init> (30 bytes)
2690 44 3 sun.misc.FDBigInteger::trimLeadingZeros (57 bytes)
2.55 seconds elapsed.
In conclusion:
- When method size exceeds some limits, JIT will not inline this function;
- And if we comment out a line, which reduce to a size below the threshold, JIT decide to inline it;
- Inlining that function causes a lot of JIT tasks which slows down the program.
update:
Accroding to my latest trial, the answer for this question is not so easy:
As my code sample shows, a normal inline optimization will
- accelerates program
- and not costs much compiler work (in my test, it even cost less work when inline happens).
But in this problem, the code causes much JIT work and slows down program which seems a bug of JIT. And it is still not clear why it causes so much work of JIT.