Why does Monitor.PulseAll result in a "stepping stair" latency pattern in signaled threads?

Tags: C#, .NET, Multithreading, Latency

C# Problem Overview


In a library using Monitor.PulseAll() for thread synchronization, I noticed that the latency from the time PulseAll(...) is called to the time a thread is woken up seems to follow a "stepping stair" distribution, with extremely large steps. The woken threads do almost no work and almost immediately go back to waiting on the monitor. For example, on a box with 12 cores and 24 threads waiting on a Monitor (2x Xeon5680/Gulftown; 6 physical cores per processor; HT disabled), the latency between the Pulse and a thread waking up looks like this:

[Graph: Latency using Monitor.PulseAll(); 3rd party library]

The first 12 threads (note we have 12 cores) take between 30 and 60 microseconds to respond. Then we start seeing very large jumps, with plateaus around 700, 1300, 1900, and 2600 microseconds.

I was able to recreate this behavior independent of the 3rd party library using the code below. It launches a large number of threads (change the numThreads parameter) that simply Wait on a Monitor, read a timestamp, log it to a ConcurrentBag, then immediately go back to Waiting. Once a second, PulseAll() wakes up all the threads. It does this 20 times and reports the latencies for the 10th iteration to the console.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.Collections.Concurrent;
using System.Diagnostics;

namespace PulseAllTest
{
    class Program
    {
        static long LastTimestamp;
        static long Iteration;
        static object SyncObj = new object();
        static Stopwatch s = new Stopwatch();
        static ConcurrentBag<Tuple<long, long>> IterationToTicks = new ConcurrentBag<Tuple<long, long>>();

        static void Main(string[] args)
        {
            long numThreads = 32;

            for (int i = 0; i < numThreads; ++i)
            {
                Task.Factory.StartNew(ReadLastTimestampAndPublish, TaskCreationOptions.LongRunning);
            }
        
            s.Start();
            for (int i = 0; i < 20; ++i)
            {
                lock (SyncObj)
                {
                    ++Iteration;
                    LastTimestamp = s.Elapsed.Ticks;
                    Monitor.PulseAll(SyncObj);
                }
                Thread.Sleep(TimeSpan.FromSeconds(1));
            }

            Console.WriteLine(String.Join("\n",
                from n in IterationToTicks where n.Item1 == 10 orderby n.Item2 
                    select ((decimal)n.Item2)/TimeSpan.TicksPerMillisecond));
            Console.Read();
        }

        static void ReadLastTimestampAndPublish()
        {
            while(true)
            {
                lock(SyncObj)
                {
                    Monitor.Wait(SyncObj);
                }
                IterationToTicks.Add(Tuple.Create(Iteration, s.Elapsed.Ticks - LastTimestamp));
            }
        }
    }
}

Using the code above, here is an example of latencies on a box with 8 cores with Hyperthreading enabled (i.e. 16 logical cores in Task Manager) and 32 threads (2x Xeon5550/Gainestown; 4 physical cores per processor; HT enabled):

[Graph: Latency using Monitor.PulseAll(), sample code]

EDIT: To try to take NUMA out of the equation, below is a graph running the sample program with 16 threads on a Core i7-3770 (Ivy Bridge); 4 Physical Cores; HT Enabled:

[Graph: Latency using Monitor.PulseAll(), sample code, no NUMA]

Can anyone explain why Monitor.PulseAll() behaves in this way?

EDIT2:

To try to show that this behavior isn't inherent to waking up a bunch of threads at the same time, I've replicated the test program using events; instead of measuring the latency of PulseAll(), I'm measuring the latency of ManualResetEvent.Set(). The code creates a number of worker threads that all wait on the same ManualResetEvent. When the event is Set(), each thread takes a latency measurement and then immediately waits on its own per-thread AutoResetEvent. Well before the next iteration (500 ms before), the ManualResetEvent is Reset() and each AutoResetEvent is Set() so the threads can go back to waiting on the shared ManualResetEvent.

I hesitated posting this because it could be a giant red herring (I make no claims that Events and Monitors behave similarly), plus it uses some absolutely terrible practices to get an Event to behave like a Monitor (I'd love/hate to see what my co-workers would do if I submitted this to a code review); but I think the results are enlightening.

This test was done on the same machine as the original test; a 2xXeon5680/Gulftown; 6 cores per processor (12 cores total); Hyperthreading disabled.

[Graph: ManualResetEvent latency]

If it's not obvious how radically different this is from Monitor.PulseAll(), here is the first graph overlaid onto the last graph:

[Graph: ManualResetEvent latency vs. Monitor latency]

The code used to generate these measurements is below:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using System.Collections.Concurrent;
using System.Diagnostics;

namespace MRETest
{
    class Program
    {
        static long LastTimestamp;
        static long Iteration;
        static ManualResetEventSlim MRES = new ManualResetEventSlim(false);
        static List<ReadLastTimestampAndPublish> Publishers = 
            new List<ReadLastTimestampAndPublish>();
        static Stopwatch s = new Stopwatch();
        static ConcurrentBag<Tuple<long, long>> IterationToTicks = 
            new ConcurrentBag<Tuple<long, long>>();

        static void Main(string[] args)
        {
            long numThreads = 24;
            s.Start();

            for (int i = 0; i < numThreads; ++i)
            {
                AutoResetEvent ares = new AutoResetEvent(false);
                ReadLastTimestampAndPublish spinner = new ReadLastTimestampAndPublish(ares);
                Task.Factory.StartNew(spinner.Spin, TaskCreationOptions.LongRunning);
                Publishers.Add(spinner);
            }

            for (int i = 0; i < 20; ++i)
            {
                ++Iteration;
                LastTimestamp = s.Elapsed.Ticks;
                MRES.Set();
                Thread.Sleep(500);
                MRES.Reset();
                foreach (ReadLastTimestampAndPublish publisher in Publishers)
                {
                    publisher.ARES.Set();
                }
                Thread.Sleep(500);
            }

            Console.WriteLine(String.Join("\n",
                from n in IterationToTicks where n.Item1 == 10 orderby n.Item2
                    select ((decimal)n.Item2) / TimeSpan.TicksPerMillisecond));
            Console.Read();
        }

        class ReadLastTimestampAndPublish
        {
            public AutoResetEvent ARES { get; private set; }

            public ReadLastTimestampAndPublish(AutoResetEvent ares)
            {
                this.ARES = ares;
            }

            public void Spin()
            {
                while (true)
                {
                    MRES.Wait();
                    IterationToTicks.Add(Tuple.Create(Iteration, s.Elapsed.Ticks - LastTimestamp));
                    ARES.WaitOne();
                }
            }
        }
    }
}

C# Solutions


Solution 1 - C#

One difference between these versions is that in the PulseAll case, the threads immediately repeat the loop, locking the object again.

You have 12 cores, so 12 threads run, execute the loop body, and re-enter the loop, locking the object again (one after another) before entering the wait state. All that time the other threads wait. In the ManualResetEvent case you have two events, so the threads don't immediately repeat the loop but get blocked on their ARES events instead; this allows other threads to take lock ownership faster.
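The reacquisition requirement is the crux of this: Monitor.Wait does not return until the pulsed thread has reentered the monitor, so pulsed waiters resume strictly one at a time behind whoever currently holds the lock. A minimal sketch demonstrating this (class and method names are my own, not from the question's code):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class WaitReacquireDemo
{
    static readonly object SyncObj = new object();
    static readonly Stopwatch Clock = Stopwatch.StartNew();

    public static bool WaiterResumedAfterRelease()
    {
        long waiterResumedAt = -1, lockReleasedAt = -1;
        var inLock = new ManualResetEventSlim(false);

        var waiter = new Thread(() =>
        {
            lock (SyncObj)
            {
                inLock.Set();              // we hold the lock; Wait will release it
                Monitor.Wait(SyncObj);     // blocks until pulsed AND lock reacquired
                waiterResumedAt = Clock.ElapsedTicks;
            }
        });
        waiter.Start();

        inLock.Wait();
        lock (SyncObj)                     // succeeds only once Wait has released the lock
        {
            Monitor.PulseAll(SyncObj);     // waiter moves to the ready queue...
            Thread.Sleep(100);             // ...but cannot resume: we still hold the lock
            lockReleasedAt = Clock.ElapsedTicks;
        }
        waiter.Join();
        return waiterResumedAt >= lockReleasedAt;
    }

    static void Main()
    {
        Console.WriteLine(WaiterResumedAfterRelease()); // prints True
    }
}
```

With 24 waiters instead of one, each woken thread that briefly holds the lock pushes every remaining waiter's wake-up time further out, which is consistent with the staircase in the graphs.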

I simulated similar behavior in the PulseAll version by adding a sleep at the end of the loop in ReadLastTimestampAndPublish. This lets the other threads lock SyncObj faster and seems to improve the numbers I get from the program.

static void ReadLastTimestampAndPublish()
{
    while(true)
    {
        lock(SyncObj)
        {
            Monitor.Wait(SyncObj);
        }
        IterationToTicks.Add(Tuple.Create(Iteration, s.Elapsed.Ticks - LastTimestamp));
        Thread.Sleep(TimeSpan.FromMilliseconds(100));   // <===
    }
}

Solution 2 - C#

To start off, this is not an answer, merely my notes from looking at the SSCLI to find out exactly what is going on. Most of this is well above my head, but interesting nonetheless.

The trip down the rabbit hole starts with a call to Monitor.PulseAll, which is implemented in C#:

clr\src\bcl\system\threading\monitor.cs:

namespace System.Threading
{
    public static class Monitor 
    {
        // other methods omitted

        [MethodImplAttribute(MethodImplOptions.InternalCall)]
        private static extern void ObjPulseAll(Object obj);

        public static void PulseAll(Object obj)
        {
            if (obj==null) {
                throw new ArgumentNullException("obj");
            }

            ObjPulseAll(obj);
        } 
    }
}

InternalCall methods get routed in clr\src\vm\ecall.cpp:

FCFuncStart(gMonitorFuncs)
    FCFuncElement("Enter", JIT_MonEnter)
    FCFuncElement("Exit", JIT_MonExit)
    FCFuncElement("TryEnterTimeout", JIT_MonTryEnter)
    FCFuncElement("ObjWait", ObjectNative::WaitTimeout)
    FCFuncElement("ObjPulse", ObjectNative::Pulse)
    FCFuncElement("ObjPulseAll", ObjectNative::PulseAll)
    FCFuncElement("ReliableEnter", JIT_MonReliableEnter)
FCFuncEnd()

ObjectNative lives in clr\src\vm\comobject.cpp:

FCIMPL1(void, ObjectNative::PulseAll, Object* pThisUNSAFE)
{
    CONTRACTL
    {
        MODE_COOPERATIVE;
        DISABLED(GC_TRIGGERS);  // can't use this in an FCALL because we're in forbid gc mode until we setup a H_M_F.
        THROWS;
        SO_TOLERANT;
    }
    CONTRACTL_END;

    OBJECTREF pThis = (OBJECTREF) pThisUNSAFE;
    HELPER_METHOD_FRAME_BEGIN_1(pThis);
    //-[autocvtpro]-------------------------------------------------------

    if (pThis == NULL)
        COMPlusThrow(kNullReferenceException, L"NullReference_This");

    pThis->PulseAll();

    //-[autocvtepi]-------------------------------------------------------
    HELPER_METHOD_FRAME_END();
}
FCIMPLEND

OBJECTREF is some magic sprinkled on top of Object (the -> operator is overloaded), so OBJECTREF->PulseAll() is actually Object->PulseAll() which is implemented in clr\src\vm\object.h and just forwards the call on to ObjHeader->PulseAll:

class Object
{
  // snip   
  public:
  // snip
    ObjHeader   *GetHeader()
    {
        LEAF_CONTRACT;
        return PTR_ObjHeader(PTR_HOST_TO_TADDR(this) - sizeof(ObjHeader));
    }
  // snip
    void PulseAll()
    {
        WRAPPER_CONTRACT;
        GetHeader()->PulseAll();
    }
  // snip
}

ObjHeader::PulseAll retrieves the SyncBlock, which uses AwareLock for Entering and Exiting the lock on the object. AwareLock (clr\src\vm\syncblk.cpp) uses a CLREvent (clr\src\vm\synch.cpp) created as a MonitorEvent (CLREvent::CreateMonitorEvent(SIZE_T)), which calls UnsafeCreateEvent (clr\src\inc\unsafe.h) or the hosting environment's synchronization methods.

clr\src\vm\syncblk.cpp:

void ObjHeader::PulseAll()
{
    CONTRACTL
    {
        INSTANCE_CHECK;
        THROWS;
        GC_TRIGGERS;
        MODE_ANY;
        INJECT_FAULT(COMPlusThrowOM(););
    }
    CONTRACTL_END;

    //  The following code may cause GC, so we must fetch the sync block from
    //  the object now in case it moves.
    SyncBlock *pSB = GetBaseObject()->GetSyncBlock();

    // GetSyncBlock throws on failure
    _ASSERTE(pSB != NULL);

    // make sure we own the crst
    if (!pSB->DoesCurrentThreadOwnMonitor())
        COMPlusThrow(kSynchronizationLockException);

    pSB->PulseAll();
}

void SyncBlock::PulseAll()
{
    CONTRACTL
    {
        INSTANCE_CHECK;
        NOTHROW;
        GC_NOTRIGGER;
        MODE_ANY;
    }
    CONTRACTL_END;

    WaitEventLink  *pWaitEventLink;

    while ((pWaitEventLink = ThreadQueue::DequeueThread(this)) != NULL)
        pWaitEventLink->m_EventWait->Set();
}

DequeueThread uses a crst (clr\src\vm\crst.cpp), which is a wrapper around Win32 critical sections. m_EventWait is a manual-reset CLREvent.

So, all of this is using OS primitives unless the default hosting provider is overriding things.
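As a rough mental model of that loop in SyncBlock::PulseAll (each waiter parked on its own event, dequeued and Set one at a time), here is a hypothetical C# sketch; SyncBlockModel and its members are my names, not the CLR's:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical model of SyncBlock's waiter queue: every Monitor.Wait call
// parks the thread on its own per-thread event, queued FIFO. PulseAll then
// dequeues and sets each event in turn; the woken threads still have to
// race to reacquire the monitor afterwards.
class SyncBlockModel
{
    private readonly Queue<ManualResetEventSlim> _waiters = new Queue<ManualResetEventSlim>();
    private readonly object _queueLock = new object();

    // Loosely corresponds to enqueuing a WaitEventLink on Monitor.Wait.
    public ManualResetEventSlim Enqueue()
    {
        var ev = new ManualResetEventSlim(false);
        lock (_queueLock) _waiters.Enqueue(ev);
        return ev;
    }

    // Mirrors the loop in SyncBlock::PulseAll: dequeue until empty, Set each.
    public int PulseAll()
    {
        int signaled = 0;
        lock (_queueLock)
        {
            while (_waiters.Count > 0)
            {
                _waiters.Dequeue().Set();
                ++signaled;
            }
        }
        return signaled;
    }

    static void Main()
    {
        var sb = new SyncBlockModel();
        var e1 = sb.Enqueue();
        var e2 = sb.Enqueue();
        Console.WriteLine(sb.PulseAll());        // prints 2
        Console.WriteLine(e1.IsSet && e2.IsSet); // prints True
    }
}
```

Note that signaling the events is the cheap part; the expensive part in the real runtime is that each signaled thread must then win the monitor lock before its Wait returns.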

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Chuu (View Question on Stackoverflow)
Solution 1 - C#: Michael Entin (View Answer on Stackoverflow)
Solution 2 - C#: Mitch (View Answer on Stackoverflow)