Why does my application spend 24% of its life doing a null check?

C# · Performance · Optimization · IL · Micro-Optimization

C# Problem Overview


I've got a performance critical binary decision tree, and I'd like to focus this question on a single line of code. The code for the binary tree iterator is below with the results from running performance analysis against it.

        public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
        {
0.2%        ScTreeNode node = RootNodes[rootIndex].TreeNode;

24.6%       while (node.BranchData != null)
            {
0.2%            BranchNodeData b = node.BranchData;
0.5%            node = b.Child2;
12.8%           if (inputs[b.SplitInputIndex] <= b.SplitValue)
0.8%                node = b.Child1;
            }

0.4%        return node;
        }

BranchData is a field, not a property. I did this to prevent the risk of it not being inlined.

The BranchNodeData class is as follows:

public sealed class BranchNodeData
{
    /// <summary>
    /// The index of the data item in the input array on which we need to split
    /// </summary>
    internal int SplitInputIndex = 0;

    /// <summary>
    /// The value that we should split on
    /// </summary>
    internal float SplitValue = 0;

    /// <summary>
    /// The node's children
    /// </summary>
    internal ScTreeNode Child1;
    internal ScTreeNode Child2;
}

As you can see, the while loop / null check is a massive hit on performance. The tree is massive, so I would expect searching for a leaf to take a while, but I'd like to understand the disproportionate amount of time spent on that one line.

I've tried:

  • Separating the Null check from the while - it's the Null check that's the hit.
  • Adding a boolean field to the object and checking against that, it made no difference. It doesn't matter what's being compared, it's the comparison that's the issue.
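For reference, the boolean-field variant from the second bullet might have looked something like this. This is a sketch with minimal stand-in types; `IsBranch` is a hypothetical flag that is not in the original code:

```csharp
// Minimal stand-in types for the sketch. IsBranch is a hypothetical
// flag, assumed to be kept in sync with BranchData on construction.
public sealed class ScTreeNode
{
    public BranchNodeData BranchData;  // null for leaf nodes
    public bool IsBranch;              // hypothetical: mirrors BranchData != null
}

public sealed class BranchNodeData
{
    public int SplitInputIndex;
    public float SplitValue;
    public ScTreeNode Child1;
    public ScTreeNode Child2;
}

public static class TreeSearch
{
    // Same traversal as the original, but looping on the flag instead of
    // the null check. The cost profile is unchanged: the first (cold)
    // load from the node object dominates, not the comparison itself.
    public static ScTreeNode GetLeaf(ScTreeNode node, float[] inputs)
    {
        while (node.IsBranch)
        {
            BranchNodeData b = node.BranchData;
            node = inputs[b.SplitInputIndex] <= b.SplitValue ? b.Child1 : b.Child2;
        }
        return node;
    }
}
```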

Is this a branch prediction issue? If so, what can I do about it? If anything?

I won't pretend to understand the CIL, but I'll post it for anyone who does, so they can try to scrape some information from it.

.method public hidebysig
instance class OptimalTreeSearch.ScTreeNode GetNodeForState (
    int32 rootIndex,
    float32[] inputs
) cil managed
{
    // Method begins at RVA 0x2dc8
    // Code size 67 (0x43)
    .maxstack 2
    .locals init (
        [0] class OptimalTreeSearch.ScTreeNode node,
        [1] class OptimalTreeSearch.BranchNodeData b
    )

    IL_0000: ldarg.0
    IL_0001: ldfld class [mscorlib]System.Collections.Generic.List`1<class OptimalTreeSearch.ScRootNode> OptimalTreeSearch.ScSearchTree::RootNodes
    IL_0006: ldarg.1
    IL_0007: callvirt instance !0 class [mscorlib]System.Collections.Generic.List`1<class OptimalTreeSearch.ScRootNode>::get_Item(int32)
    IL_000c: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.ScRootNode::TreeNode
    IL_0011: stloc.0
    IL_0012: br.s IL_0039
    // loop start (head: IL_0039)
        IL_0014: ldloc.0
        IL_0015: ldfld class OptimalTreeSearch.BranchNodeData OptimalTreeSearch.ScTreeNode::BranchData
        IL_001a: stloc.1
        IL_001b: ldloc.1
        IL_001c: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.BranchNodeData::Child2
        IL_0021: stloc.0
        IL_0022: ldarg.2
        IL_0023: ldloc.1
        IL_0024: ldfld int32 OptimalTreeSearch.BranchNodeData::SplitInputIndex
        IL_0029: ldelem.r4
        IL_002a: ldloc.1
        IL_002b: ldfld float32 OptimalTreeSearch.BranchNodeData::SplitValue
        IL_0030: bgt.un.s IL_0039

        IL_0032: ldloc.1
        IL_0033: ldfld class OptimalTreeSearch.ScTreeNode OptimalTreeSearch.BranchNodeData::Child1
        IL_0038: stloc.0

        IL_0039: ldloc.0
        IL_003a: ldfld class OptimalTreeSearch.BranchNodeData OptimalTreeSearch.ScTreeNode::BranchData
        IL_003f: brtrue.s IL_0014
    // end loop

    IL_0041: ldloc.0
    IL_0042: ret
} // end of method ScSearchTree::GetNodeForState

Edit: I decided to do a branch prediction test, I added an identical if within the while, so we have

while (node.BranchData != null)

and

if (node.BranchData != null)

inside that. I then ran performance analysis against that, and it took six times longer to execute the first comparison than it did to execute the second comparison, which always returned true. So it looks like it is indeed a branch prediction issue - and I'm guessing there's nothing I can do about it?!
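The shape of that experiment can be sketched with a hypothetical linked `Node` type standing in for the tree node (`Next` plays the role of `BranchData`):

```csharp
// Hypothetical Node type standing in for ScTreeNode; Next stands in
// for the BranchData reference (null terminates the walk).
public sealed class Node { public Node Next; }

public static class HopCounter
{
    public static int CountHops(Node node)
    {
        int hops = 0;
        // Profilers attribute the cost here: the first load of Next from
        // a freshly visited node is cold and may miss the cache.
        while (node.Next != null)
        {
            // This identical check is far cheaper: Next is now hot in L1
            // and the branch is always taken.
            if (node.Next != null)
                hops++;
            node = node.Next;
        }
        return hops;
    }
}
```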

Another Edit

The above result would also occur if node.BranchData had to be loaded from RAM for the while check - it would then be cached for the if statement.


This is my third question on a similar topic. This time I'm focusing on a single line of code. My other questions on this subject are:

C# Solutions


Solution 1 - C#

> The tree is massive

By far the most expensive thing a processor ever does is not executing instructions; it is accessing memory. The execution core of a modern CPU is many times faster than the memory bus. It's a problem related to distance: the further an electrical signal has to travel, the harder it gets to deliver that signal to the other end of the wire without it being corrupted. The only cure for that problem is to make it go slower. That's a big problem with the wires that connect the CPU to the RAM in your machine; pop the case and you can see the wires.

Processors have a counter-measure for this problem: they use caches, buffers that store a copy of the bytes in RAM. An important one is the L1 cache, typically 16 kilobytes for data and 16 kilobytes for instructions. It is small, which allows it to be close to the execution engine. Reading bytes from the L1 cache typically takes 2 or 3 CPU cycles. Next up is the L2 cache, bigger and slower. Upscale processors also have an L3 cache, bigger and slower yet. As process technology improves, those buffers take less space and automatically become faster as they get closer to the core, a big reason why newer processors are better and how they manage to use an ever increasing number of transistors.

Those caches are however not a perfect solution. The processor will still stall on a memory access if the data is not available in one of the caches. It cannot continue until the very slow memory bus has supplied the data. Losing a fat hundred CPU cycles is possible on a single instruction.

Tree structures are a problem: they are not cache friendly. Their nodes tend to be scattered throughout the address space. The fastest way to access memory is to read from sequential addresses. The unit of storage for the L1 cache is the cache line, 64 bytes. In other words, once the processor reads one byte, the next 63 are very fast since they'll be present in the cache.

Which makes an array by far the most efficient data structure. Also the reason that the .NET List<> class isn't a list at all, it uses an array for storage. The same for other collection types, like Dictionary, structurally not remotely similar to an array, but internally implemented with arrays.
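The two access patterns can be contrasted in a sketch like the following (the timing gap only becomes visible on data sets larger than the caches, so no numbers are claimed here):

```csharp
// A node type for the pointer-chasing variant.
public sealed class IntNode
{
    public int Value;
    public IntNode Next;
}

public static class AccessPatterns
{
    // Sequential reads: each 64-byte cache line serves 16 consecutive ints,
    // and the hardware prefetcher can stay ahead of the loop.
    public static long SumArray(int[] values)
    {
        long sum = 0;
        for (int i = 0; i < values.Length; i++)
            sum += values[i];
        return sum;
    }

    // Pointer chasing: every hop dereferences a reference that may land
    // anywhere on the heap, risking a cache miss per element.
    public static long SumList(IntNode head)
    {
        long sum = 0;
        for (IntNode n = head; n != null; n = n.Next)
            sum += n.Value;
        return sum;
    }
}
```

Both compute the same result; only the memory traffic differs.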

So your while() statement is very likely to be suffering from CPU stalls because it is dereferencing a pointer to access the BranchData field. The next statement is very cheap because the while() statement already did the heavy lifting of retrieving the value from memory. Assigning the local variable is cheap, a processor uses a buffer for writes.

This is not otherwise a simple problem to solve; flattening your tree into arrays is very likely to be impractical, not least because you typically cannot predict in which order the nodes of the tree will be visited. A red-black tree might help; it isn't clear from the question. So a simple conclusion to draw is that it is already running as fast as you can hope for. And if you need it to go faster, then you'll need better hardware with a faster memory bus. DDR4 is going mainstream this year.
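If the tree were static once built, one conceivable flattening (a sketch under that assumption, not the answer's recommendation) would store nodes in a single struct array and use indices for children, so traversal touches one contiguous allocation instead of chasing object references:

```csharp
// One node per array slot; children are array indices. A negative
// value encodes a leaf id as its bitwise complement (~id), so any
// non-negative index means "keep descending".
public struct FlatNode
{
    public int SplitInputIndex;
    public float SplitValue;
    public int Child1;   // array index, or ~leafId if a leaf
    public int Child2;
}

public static class FlatTree
{
    public static int FindLeaf(FlatNode[] nodes, int root, float[] inputs)
    {
        int i = root;
        while (i >= 0)  // negative index marks a leaf
        {
            FlatNode n = nodes[i];
            i = inputs[n.SplitInputIndex] <= n.SplitValue ? n.Child1 : n.Child2;
        }
        return ~i;       // decode the leaf id
    }
}
```

Struct nodes are 16 bytes here, so four of them share a 64-byte cache line; whether that helps in practice depends on the visit order, as the answer notes.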

Solution 2 - C#

To complement Hans' great answer about memory cache effects, I add a discussion of virtual memory to physical memory translation and NUMA effects.

On a computer with virtual memory (all current computers), each memory access requires translating a virtual memory address to a physical memory address. This is done by the memory management hardware using a translation table. This table is managed by the operating system for each process, and it is itself stored in RAM. For each page of virtual memory there is an entry in this translation table mapping a virtual page to a physical page. Remember Hans' discussion about memory accesses being expensive: if each virtual-to-physical translation needed a memory lookup, every memory access would cost twice as much. The solution is a cache for the translation table, called the translation lookaside buffer (TLB for short). TLBs are not large (12 to 4096 entries), and the typical page size on the x86-64 architecture is only 4 KB, which means there is at most 16 MB directly accessible with TLB hits (it is probably even less than that, Sandy Bridge having a TLB size of 512 entries). To reduce the number of TLB misses, you can have the operating system and the application work together to use a larger page size like 2 MB, leading to a much larger memory space accessible with TLB hits. This page explains how to use large pages with Java, which can greatly speed up memory accesses.
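The TLB-reach arithmetic above can be sketched as follows (figures taken from the answer; real processors have separate, smaller TLBs for large pages, so the large-page line is illustrative only):

```csharp
public static class TlbMath
{
    // Memory reachable without a TLB miss = number of entries * page size.
    public static long ReachBytes(long entries, long pageSizeBytes)
        => entries * pageSizeBytes;
}

// 512 entries  * 4 KB pages = 2 MB of TLB reach (the Sandy Bridge figure)
// 4096 entries * 4 KB pages = 16 MB (the upper bound quoted above)
// 512 entries  * 2 MB pages = 1 GB (illustrative; large-page TLB sizes differ)
```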

If your computer has many sockets, it is probably a NUMA architecture. NUMA means Non-Uniform Memory Access. In these architectures, some memory accesses cost more than others. As an example, on a 2-socket computer with 32 GB of RAM, each socket probably has 16 GB of RAM. On this example computer, local memory accesses are cheaper than accesses to the memory of the other socket (remote accesses are 20 to 100% slower, maybe even more). If, on such a computer, your tree uses 20 GB of RAM, at least 4 GB of your data is on the other NUMA node, and if remote accesses are 50% slower, NUMA slows down your memory accesses by up to 10%. In addition, if you only have free memory on a single NUMA node, all the processes needing memory on the starved node will be allocated memory from the other node, whose accesses are more expensive. Even worse, the operating system could think it is a good idea to swap out part of the memory of the starved node, which would cause even more expensive memory accesses. This is explained in more detail in The MySQL “swap insanity” problem and the effects of the NUMA architecture, where some solutions are given for Linux (spreading memory accesses on all NUMA nodes, biting the bullet on remote NUMA accesses to avoid swapping). I can also think of allocating more RAM to a socket (24 and 8 GB instead of 16 and 16 GB) and making sure your program is scheduled on the larger NUMA node, but this needs physical access to the computer and a screwdriver ;-).

Solution 3 - C#

This is not an answer per se but rather an emphasis on what Hans Passant wrote about delays in the memory system.

Really high-performance software - such as computer games - is not only written to implement the game itself, it is also adapted so that code and data structures make the most of the cache and memory systems, i.e. treat them as a limited resource. When I deal with cache issues I typically assume that the L1 will deliver in 3 cycles if the data is present there. If it isn't and I have to go to L2, I assume 10 cycles. For L3, 30 cycles, and for RAM, 100.
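Using those per-level figures, a crude cost model for a single tree descent might look like this (a sketch; real traversals resolve at a mix of levels, so these are bounds rather than predictions):

```csharp
public static class CostModel
{
    // Cycles for a descent of the given depth if every node access
    // resolves at one cache level (3 = L1, 10 = L2, 30 = L3, 100 = RAM).
    public static long DescentCycles(int depth, int cyclesPerAccess)
        => (long)depth * cyclesPerAccess;
}

// A 25-level descent costs about 75 cycles if every node is in L1,
// but about 2500 cycles if every access goes to RAM - a 33x gap from
// memory placement alone, with identical instructions executed.
```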

There is an additional memory-related action that - if you need to use it - imposes an even greater penalty, and that is a bus lock. Bus locks are called critical sections if you use the Windows NT functionality. If you use a home-grown variety you might call it a spinlock. Whatever the name, it synchronizes down to the slowest bus-mastering device in the system before the lock is in place. The slowest bus-mastering device might be a classic 32-bit PCI card connected at 33 MHz, one hundredth of the frequency of a typical x86 CPU (3.3 GHz). I assume not less than 300 cycles to complete a bus lock, but I know they can take many times that long, so if I see 3000 cycles I won't be surprised.

Novice multi-threading software developers will use bus locks all over the place and then wonder why their code is slow. The trick - as with everything that has to do with memory - is to economize on accesses.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Will Calderwood | View Question on Stackoverflow |
| Solution 1 - C# | Hans Passant | View Answer on Stackoverflow |
| Solution 2 - C# | jfg956 | View Answer on Stackoverflow |
| Solution 3 - C# | Olof Forshell | View Answer on Stackoverflow |