How to estimate the thread context switching overhead?

C++ | C | Multithreading | Windows Mobile

C++ Problem Overview


I am trying to improve the performance of a threaded application with real-time deadlines. It runs on Windows Mobile and is written in C / C++. I have a suspicion that the high frequency of thread switching might be causing tangible overhead, but I can neither prove nor disprove it. As everybody knows, absence of proof is not proof of the opposite :).

Thus my question is threefold:

  • If they exist at all, where can I find actual measurements of the cost of switching thread context?

  • Without spending time writing a test application, what are the ways to estimate the thread-switching overhead in an existing application?

  • Does anyone know a way to find out the number of context switches (on / off) for a given thread?

C++ Solutions


Solution 1 - C++

I doubt you can find this overhead somewhere on the web for any existing platform. There are just too many different platforms. The overhead depends on two factors:

  • The CPU, as the necessary operations may be easier or harder on different CPU types
  • The system kernel, as different kernels will have to perform different operations on each switch

Other factors include how the switch takes place. A switch can take place when

  1. the thread has used up its time quantum. When a thread is started, it may run for a given amount of time before it has to return control to the kernel, which will decide who's next.

  2. the thread was preempted. This happens when another thread needs CPU time and has a higher priority. E.g. the thread that handles mouse/keyboard input may be such a thread. No matter which thread owns the CPU right now, when the user types something or clicks something, he doesn't want to wait till the current thread's time quantum has been used up completely; he wants to see the system react immediately. Thus some systems will stop the current thread immediately and return control to some other thread with a higher priority.

  3. the thread doesn't need CPU time anymore, because it's blocking on some operation or has just called sleep() (or similar) to stop running.

These 3 scenarios might have different thread-switching times in theory. E.g. I'd expect the last one to be the slowest, since a call to sleep() means the CPU is given back to the kernel and the kernel needs to set up a wake-up call that will make sure the thread is woken up after about the amount of time it requested to sleep. It then must take the thread out of the scheduling process, and once the thread is woken up, it must add the thread to the scheduling process again. All these steps will take some amount of time, so the actual sleep call might take longer than the time it takes to switch to another thread.

I think if you want to know for sure, you must benchmark. The problem is that you will usually have to either put threads to sleep or synchronize them using mutexes. Sleeping or locking/unlocking mutexes has an overhead of its own, which means your benchmark will include these overheads as well. Without a powerful profiler, it's hard to say afterwards how much CPU time was used for the actual switch and how much for the sleep/mutex call. On the other hand, in a real-life scenario your threads will either sleep or synchronize via locks as well. A benchmark that purely measures the context-switch time is a synthetic benchmark, as it does not model any real-life scenario. Benchmarks are much more "realistic" if they are based on real-life scenarios. Of what use is a GPU benchmark that tells me my GPU can in theory handle 2 billion polygons a second, if this result can never be achieved in a real-life 3D application? Wouldn't it be much more interesting to know how many polygons a real-life 3D application can have the GPU handle per second?

Unfortunately I know nothing of Windows programming. I could write an application for Windows in Java or maybe in C#, but C/C++ on Windows makes me cry. I can only offer you some source code for POSIX.

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>

uint32_t COUNTER;
pthread_mutex_t LOCK;
pthread_mutex_t START;
pthread_cond_t CONDITION;

void * threads (
	void * unused
) {
	// Wait till we may fire away
	pthread_mutex_lock(&START);
	pthread_mutex_unlock(&START);
	
	pthread_mutex_lock(&LOCK);
	// If I'm not the first thread, the other thread is already waiting on
	// the condition, thus I have to wake it up first, otherwise we'll deadlock
	if (COUNTER > 0) {
		pthread_cond_signal(&CONDITION);
	}
	for (;;) {
		COUNTER++;
		pthread_cond_wait(&CONDITION, &LOCK);
		// Always wake up the other thread before processing. The other
		// thread will not be able to do anything as long as I don't go
		// back to sleep first.
		pthread_cond_signal(&CONDITION);
	}
	pthread_mutex_unlock(&LOCK); // Never reached, but keeps the lock balanced
	return NULL;
}

int64_t timeInMS ()
{
	struct timeval t;
	
	gettimeofday(&t, NULL);
	return (
		(int64_t)t.tv_sec * 1000 +
		(int64_t)t.tv_usec / 1000
	);
}


int main (
	int argc,
	char ** argv
) {
	int64_t start;
	pthread_t t1;
	pthread_t t2;
	int64_t myTime;
	
	pthread_mutex_init(&LOCK, NULL);
	pthread_mutex_init(&START, NULL);	
	pthread_cond_init(&CONDITION, NULL);
	
	pthread_mutex_lock(&START);
	COUNTER = 0;
	pthread_create(&t1, NULL, threads, NULL);
	pthread_create(&t2, NULL, threads, NULL);
	pthread_detach(t1);
	pthread_detach(t2);
	// Get start time and fire away
	myTime = timeInMS();
	pthread_mutex_unlock(&START);
	// Wait for about a second
	sleep(1);
	// Stop both threads
	pthread_mutex_lock(&LOCK);
	// Find out how much time has really passed. sleep won't guarantee me that
	// I sleep exactly one second, I might sleep longer since even after being
	// woken up, it can take some time before I gain back CPU time. Further
	// some more time might have passed before I obtained the lock!
	myTime = timeInMS() - myTime;
	// Correct the number of thread switches accordingly
	COUNTER = (uint32_t)(((uint64_t)COUNTER * 1000) / myTime);
	printf("Number of thread switches in about one second was %u\n", COUNTER);
	return 0;
}

Output

Number of thread switches in about one second was 108406

Over 100'000 is not too bad, and that even though we have locking and conditional waits. I'd guess that without all this stuff, at least twice as many thread switches per second would be possible.

Solution 2 - C++

You can't estimate it. You need to measure it. And it's going to vary depending on the processor in the device.

There are two fairly simple ways to measure a context switch. One involves code, the other doesn't.

First, the code way (pseudocode):

DWORD tick;

main()
{
  HANDLE hThread = CreateThread(..., ThreadProc, CREATE_SUSPENDED, ...);
  tick = QueryPerformanceCounter();
  CeSetThreadPriority(hThread, 10); // real high
  ResumeThread(hThread);
  Sleep(10);
}

ThreadProc()
{
  tick = QueryPerformanceCounter() - tick;
  RETAILMSG(TRUE, (_T("ET: %i\r\n"), tick));
}

Obviously doing it in a loop and averaging will be better. Keep in mind that this doesn't just measure the context switch. You're also measuring the call to ResumeThread and there's no guarantee the scheduler is going to immediately switch to your other thread (though the priority of 10 should help increase the odds that it will).

You can get a more accurate measurement with CeLog by hooking into scheduler events, but it's far from simple to do and not very well documented. If you really want to go that route, Sue Loh has several blogs on it that a search engine can find.

The non-code route would be to use Remote Kernel Tracker. Install eVC 4.0 or the eval version of Platform Builder to get it. It will give a graphical display of everything the kernel is doing and you can directly measure a thread context switch with the provided cursor capabilities. Again, I'm certain Sue has a blog entry on using Kernel Tracker as well.

All that said, you're going to find that CE intra-process thread context switches are really, really fast. It's the process switches that are expensive, as they require swapping the active process in RAM and then performing the migration.

Solution 3 - C++

While you said you don't want to write a test application, I did this for a previous test on an ARM9 Linux platform to find out what the overhead is. It was just two threads that would call boost::thread::yield() (or similar) and increment some variable, and after a minute or so (without other running processes, at least none that did anything), the app printed how many context switches it could do per second. Of course this is not really exact, but the point is that both threads yielded the CPU to each other, and it was so fast that it just didn't make sense anymore to think about the overhead. So, simply go ahead and just write a simple test instead of thinking too much about a problem that may be non-existent.

Other than that, you might try like 1800 suggested with performance counters.

Oh, and I remember an application running on Windows CE 4.X, where we also have four threads with intensive switching at times, and never ran into performance issues. We also tried to implement the core threading thing without threads at all, and saw no performance improvement (the GUI just responded much slower, but everything else was the same). Maybe you can try the same, by either reducing the number of context switches or by removing threads completely (just for testing).

Solution 4 - C++

My 50 lines of C++ show, for Linux (QuadCore Q6600), a context-switch time of ~0.9 µs (0.75 µs for 2 threads, 0.95 µs for 50 threads). In this benchmark, threads call yield immediately when they get a quantum of time.

Solution 5 - C++

A context switch is very expensive, not because of the CPU operation itself, but because of cache invalidation. If you have an intensive task running, it will fill the CPU caches, both for instructions and data; the memory prefetcher, TLB and RAM will also optimize their work toward some areas of RAM.

When you change context, all these cache mechanisms are reset and the new thread starts from a "blank" state.

The accepted answer is wrong unless your threads are just incrementing a counter. Of course there is no cache flush involved in that case. There is no point in benchmarking context switching without filling the caches the way real applications do.

Solution 6 - C++

A context switch is expensive; as a rule of thumb it costs about 30 µs of CPU overhead: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

Solution 7 - C++

I've only ever tried to estimate this once, and that was on a 486! The upshot was that a processor context switch took about 70 instructions to complete (note this happened for many OS API calls as well as thread switching). We calculated that it took approx. 30 µs per thread switch (including OS overhead) on a DX3. The few thousand context switches we were doing per second absorbed between 5-10% of the processor time.

How that would translate to a multi-core, multi-GHz modern processor I don't know, but I would guess that unless you are completely going over the top with thread switching, it's a negligible overhead.

Note that thread creation/deletion is a more expensive CPU/OS hog than activating/deactivating threads. A good policy for heavily threaded apps is to use thread pools and activate/deactivate threads as required.

Solution 8 - C++

The problem with context switches is that they have a fixed time cost. GPUs implemented a 1-cycle context switch between threads. The following, for example, cannot be threaded on CPUs:

double * a; 
...
for (i = 0; i < 1000; i ++)
{
    a[i] = a[i] + a[i];
}

because its execution time is much less than the context-switch cost. On a Core i7 this code takes around 1 microsecond (depending on the compiler). So context-switch time does matter, because it defines how small the jobs that can be threaded are. I guess this also provides a method for effectively measuring the context switch. Check how long the array (in the example above) has to be so that two threads from a thread pool start showing a real advantage compared to a single-threaded run. This may easily become 100,000 elements, and therefore the effective context-switch time would be somewhere in the range of 20 µs within the same app.

All the encapsulation used by the thread pool has to be counted toward the thread-switch time, because that is what it all comes down to in the end.

Atmapuri

Solution 9 - C++

I don't know, but do you have the usual performance counters in Windows Mobile? You could look at things like context switches/sec. I don't know if there is one that specifically measures context-switch time, though.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Ignas Limanauskas | View Question on Stackoverflow
Solution 1 - C++ | Mecki | View Answer on Stackoverflow
Solution 2 - C++ | ctacke | View Answer on Stackoverflow
Solution 3 - C++ | OregonGhost | View Answer on Stackoverflow
Solution 4 - C++ | bobah | View Answer on Stackoverflow
Solution 5 - C++ | bokan | View Answer on Stackoverflow
Solution 6 - C++ | Soroush | View Answer on Stackoverflow
Solution 7 - C++ | Tim Ring | View Answer on Stackoverflow
Solution 8 - C++ | Atmapuri | View Answer on Stackoverflow
Solution 9 - C++ | 1800 INFORMATION | View Answer on Stackoverflow