How much overhead is there when creating a thread?

C++ | Pthreads | POSIX

C++ Problem Overview


I just reviewed some really terrible code - code that sends messages on a serial port by creating a new thread to package and assemble the message for every single message sent. Yes, for every message a pthread is created, bits are properly set up, then the thread terminates. I haven't a clue why anyone would do such a thing, but it raises the question - how much overhead is there when actually creating a thread?

C++ Solutions


Solution 1 - C++

To resurrect this old thread, I just did some simple test code:

#include <thread>

int main(int argc, char** argv)
{
  for (volatile int i = 0; i < 500000; i++)
    std::thread([](){}).detach();
  return 0;
}

I compiled it with g++ test.cpp -std=c++11 -lpthread -O3 -o test. I then ran it three times in a row on an old (kernel 2.6.18), heavily loaded (doing a database rebuild), slow laptop (Intel Core i5-2540M). Results from three consecutive runs: 5.647 s, 5.515 s, and 5.561 s. So we're looking at a tad over 10 microseconds per thread on this machine, and probably much less on yours.

That's not much overhead at all, given that serial ports max out at around 1 bit per 10 microseconds. Now, of course there are various additional thread-related costs: passing or captured arguments (although plain function calls impose some of that too), cache slowdowns when multiple threads on different cores fight over the same memory at the same time, and so on. But in general I highly doubt the use case you presented will adversely impact performance at all (and could even provide benefits, depending), despite your having preemptively labeled the concept "really terrible code" without even knowing how much time it takes to launch a thread.

Whether it's a good idea or not depends a lot on the details of your situation. What else is the calling thread responsible for? What precisely is involved in preparing and writing out the packets? How frequently are they written out (with what sort of distribution? uniform, clustered, etc...?) and what's their structure like? How many cores does the system have? Etc. Depending on the details, the optimal solution could be anywhere from "no threads at all" to "shared thread pool" to "thread for each packet".

Note that thread pools aren't magic and can in some cases be slower than unique threads. One of the biggest slowdowns with threads is synchronizing cached memory that multiple threads use at the same time, and a thread pool, by its very nature of having to look for and process work handed over from a different thread, has to do exactly that. So either your primary thread or the child processing thread can get stuck waiting while the processor works out whether the other thread has altered a section of memory. By contrast, in an ideal situation, a unique processing thread for a given task only has to share memory with its calling thread once (when it's launched) and then the two never interfere with each other again.

Solution 2 - C++

I have always been told that thread creation is cheap, especially when compared to the alternative of creating a process. If the program you are talking about does not have a lot of operations that need to run concurrently then threading might not be necessary, and judging by what you wrote this might well be the case. Some literature to back me up:

http://www.personal.kent.edu/~rmuhamma/OpSystems/Myos/threads.htm

> Threads are cheap in the sense that
>
> 1. They only need a stack and storage for registers; therefore, threads are cheap to create.
>
> 2. Threads use very little resources of an operating system in which they are working. That is, threads do not need new address space, global data, program code or operating system resources.
>
> 3. Context switching is fast when working with threads. The reason is that we only have to save and/or restore PC, SP and registers.

More of the same here.

In Operating System Concepts 8th Edition (page 155) the authors write about the benefits of threading:

> Allocating memory and resources for process creation is costly. Because threads share the resource of the process to which they belong, it is more economical to create and context-switch threads. Empirically gauging the difference in overhead can be difficult, but in general it is much more time consuming to create and manage processes than threads. In Solaris, for example, creating a process is about thirty times slower than is creating a thread, and context switching is about five times slower.
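That process-versus-thread gap is easy to sanity-check on a POSIX system. Here is a minimal sketch, not from the book (the loop count and the pthread_create()/fork() pairing are my own choices; compile with g++ -std=c++11 -O2 -lpthread):

// Rough comparison of thread vs. process creation cost (POSIX only).
// Illustrative sanity check, not a rigorous benchmark.
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>

static void* noop(void*) { return nullptr; }

int main()
{
    const int N = 1000;
    using Clock = std::chrono::steady_clock;

    // Time N thread create+join cycles.
    auto t0 = Clock::now();
    for (int i = 0; i < N; i++) {
        pthread_t t;
        pthread_create(&t, nullptr, noop, nullptr);
        pthread_join(t, nullptr);
    }
    auto t1 = Clock::now();

    // Time N process fork+exit+wait cycles.
    for (int i = 0; i < N; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);               // child exits immediately
        waitpid(pid, nullptr, 0);   // parent reaps it
    }
    auto t2 = Clock::now();

    auto us = [](Clock::duration d) {
        return (long long)std::chrono::duration_cast<std::chrono::microseconds>(d).count();
    };
    std::printf("threads  : %lld us total\n", us(t1 - t0));
    std::printf("processes: %lld us total\n", us(t2 - t1));
    return 0;
}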

Solution 3 - C++

> ...sends messages on a serial port ... for every message a pthread is created, bits are properly set up, then the thread terminates. ...how much overhead is there when actually creating a thread?

This is highly system specific. For example, the last time I used VMS, threading was nightmarishly slow (it's been years, but from memory one thread could only create something like 10 more per second, and if you kept that up for a few seconds without threads exiting you'd core), whereas on Linux you can probably create thousands. If you want to know exactly, benchmark it on your system. But knowing that number isn't much use without knowing more about the messages: whether they average 5 bytes or 100k, whether they're sent contiguously or the line idles in between, and what the app's latency requirements are. All of that matters as much to the appropriateness of the code's thread use as any absolute measurement of thread-creation overhead. And performance may not have needed to be the dominant design consideration anyway.

Solution 4 - C++

There is some overhead in thread creation, but compared with the usually slow baud rates of serial ports (19200 bits/sec being the most common), it just doesn't matter: at 19200 baud, a single byte framed with start and stop bits takes roughly 10 / 19200 s, i.e. about 520 microseconds on the wire, far longer than creating a thread.

Solution 5 - C++

You definitely do not want to do this. Create a single thread or a pool of threads and just signal when messages are available. Upon receiving the signal, the thread can perform any necessary message processing.

In terms of overhead, thread creation/destruction, especially on Windows, is fairly expensive: somewhere on the order of tens of microseconds. It should, for the most part, only be done at the start/end of an app, with the possible exception of dynamically resized thread pools.
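A minimal sketch of that signal-and-process pattern, assuming C++11 and a single long-lived worker draining a queue (MessageQueue, send_to_port and the empty-string shutdown signal are illustrative choices, not from the answer):

// Single long-lived worker that drains a message queue (sketch).
// Compile with: g++ -std=c++11 -O2 -lpthread
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class MessageQueue {
public:
    void push(std::string msg) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(msg));
        }
        cv_.notify_one();           // signal the worker that a message is available
    }

    std::string pop() {             // blocks until a message is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        std::string msg = std::move(q_.front());
        q_.pop();
        return msg;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
};

void send_to_port(const std::string&) { /* write to the serial port here */ }

int main() {
    MessageQueue queue;

    // One worker created up front; no per-message thread creation.
    std::thread worker([&queue] {
        for (;;) {
            std::string msg = queue.pop();
            if (msg.empty()) break;   // empty message used as a shutdown signal in this sketch
            send_to_port(msg);
        }
    });

    queue.push("hello");
    queue.push("world");
    queue.push("");                   // tell the worker to stop
    worker.join();
    return 0;
}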

Solution 6 - C++

I used the above "terrible" design in a VoIP app I made. It worked very well: absolutely no latency or missed/dropped packets between locally connected computers. Each time a data packet arrived, a thread was created and handed that data to process and deliver to the output devices. Of course the packets were large, so this caused no bottleneck. Meanwhile the main thread could loop back to wait for and receive the next incoming packet.

I have tried other designs where the threads I need are created in advance, but this creates its own problems. First, you need to design your code properly for threads to retrieve the incoming packets and process them in a deterministic fashion. If you use multiple (pre-allocated) threads, it's possible that the packets may be processed out of order. If you use a single (pre-allocated) thread to loop and pick up the incoming packets, there is a chance that thread might encounter a problem and terminate, leaving no threads to process any data.

Creating a thread to process each incoming data packet works very cleanly, especially on multi-core systems and where incoming packets are large. Also, to answer your question more directly, the alternative to thread creation is to create a run-time process that manages the pre-allocated threads. Being able to synchronize data hand-off and processing, as well as detecting errors, may add just as much overhead as, if not more than, simply creating a new thread. It all depends on your design and requirements.
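For reference, the thread-per-packet hand-off described above might look roughly like this (Packet, handle_packet and on_packet_received are placeholder names, not from the original app):

// Thread-per-packet hand-off (sketch).
#include <cstdint>
#include <thread>
#include <vector>

using Packet = std::vector<std::uint8_t>;

void handle_packet(Packet p) { /* decode and write to the output device here */ }

void on_packet_received(Packet p) {
    // Move the packet into the new thread so the receive loop can continue
    // immediately; detaching means nobody waits for the worker to finish.
    std::thread(handle_packet, std::move(p)).detach();
}

int main() {
    on_packet_received(Packet{0x01, 0x02, 0x03});
    // Sketch only: a real receive loop would keep running; detached threads
    // still in flight when the process exits are simply abandoned.
    return 0;
}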

Solution 7 - C++

Thread creation and computing in a thread is pretty expensive. All data structures need to be set up, the thread must be registered with the kernel, and a thread switch must occur so that the new thread actually gets executed (at an unspecified and unpredictable time). Executing thread.start does not mean that the thread's main function is called immediately. As the article (mentioned by typoking) points out, creating a thread is cheap only compared to creating a process. Overall, it is pretty expensive.

I would never use a thread

  • for a short computation
  • for a computation where I need the result in my flow of code (that is, where I'd start the thread and then wait for it to return the result of its computation)

In your example, it would make sense (as has already been pointed out) to create a thread that handles all of the serial communication and is eternal.

hth

Mario

Solution 8 - C++

For comparison, take a look at the figures for OS X: Link

  • Kernel data structures: approximately 1 KB

  • Stack space: 512 KB (secondary threads), 8 MB (OS X main thread), 1 MB (iOS main thread)

  • Creation time: approximately 90 microseconds

POSIX thread creation elsewhere should also be somewhere around this figure (not far off), I would guess.
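If the per-thread stack reservation is what worries you, POSIX lets you shrink it when creating the thread; a minimal sketch (the 64 KB figure is only an example, not a recommendation from the answer):

// Creating a pthread with a smaller stack (sketch).
#include <pthread.h>
#include <cstdio>

static void* worker(void*) { return nullptr; }

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    // PTHREAD_STACK_MIN is the lower bound; 64 KB here is only an example.
    pthread_attr_setstacksize(&attr, 64 * 1024);

    pthread_t t;
    if (pthread_create(&t, &attr, worker, nullptr) != 0)
        std::perror("pthread_create");
    else
        pthread_join(t, nullptr);

    pthread_attr_destroy(&attr);
    return 0;
}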

Solution 9 - C++

On any sane implementation, the cost of thread creation should be proportional to the number of system calls it involves, and on the same order of magnitude as familiar system calls like open and read. Some casual measurements on my system showed pthread_create taking about twice as much time as open("/dev/null", O_RDWR), which is very expensive relative to pure computation but very cheap relative to any IO or other operations which would involve switching between user and kernel space.
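A rough version of that comparison, assuming a POSIX system (the loop count and the averaging are my own; absolute numbers will differ per machine):

// Rough comparison: pthread_create+join vs. open("/dev/null")+close (sketch).
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>

static void* noop(void*) { return nullptr; }

int main() {
    const int N = 10000;
    using Clock = std::chrono::steady_clock;

    // Time N thread create+join cycles.
    auto t0 = Clock::now();
    for (int i = 0; i < N; i++) {
        pthread_t t;
        pthread_create(&t, nullptr, noop, nullptr);
        pthread_join(t, nullptr);
    }
    auto t1 = Clock::now();

    // Time N open+close cycles on /dev/null.
    for (int i = 0; i < N; i++)
        close(open("/dev/null", O_RDWR));
    auto t2 = Clock::now();

    auto per_call = [N](Clock::duration d) {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(d).count() / double(N);
    };
    std::printf("pthread_create+join : %.0f ns each\n", per_call(t1 - t0));
    std::printf("open+close /dev/null: %.0f ns each\n", per_call(t2 - t1));
    return 0;
}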

Solution 10 - C++

It is indeed very system dependent. I tested @Nafnlaus's code:

#include <thread>

int main(int argc, char** argv)
{
  for (volatile int i = 0; i < 500000; i++)
    std::thread([](){}).detach();
  return 0;
}

On my desktop Ryzen 5 2600:

Windows 10, compiled with MSVC 2019 in Release mode, with std::chrono calls added around the loop to time it. Idle (only Firefox with 217 tabs):

It took around 20 seconds (20.274, 19.910, 20.608), and also ~20 seconds with Firefox closed.

Ubuntu 18.04 compiled with:

g++ main.cpp -std=c++11 -lpthread -O3 -o thread

timed with:

time ./thread

It took around 5 seconds (5.595, 5.230, 5.297)

The same code on my Raspberry Pi 3B, compiled with:

g++ main.cpp -std=c++11 -lpthread -O3 -o thread

timed with:

time ./thread

It took around 15 seconds (16.225, 14.689, 16.235).
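For the Windows runs, where the shell's time command isn't available, the loop was wrapped in std::chrono calls. The exact instrumentation isn't shown in the answer, but it could look roughly like this:

// Timing the thread-spawn loop with std::chrono (sketch of the wrapping
// mentioned above; not the answer's original instrumentation).
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    auto start = std::chrono::steady_clock::now();

    for (volatile int i = 0; i < 500000; i++)
        std::thread([](){}).detach();

    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = stop - start;
    std::printf("%.3f s total, %.1f us per thread\n",
                elapsed.count(), elapsed.count() * 1e6 / 500000);
    return 0;
}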

Solution 11 - C++

Interesting.

I tested with my FreeBSD PCs and got the following results:

FreeBSD 12-STABLE, Core i3-8100T, 8 GB RAM: 9.523 sec
FreeBSD 12.1-RELEASE, Core i5-6600K, 16 GB RAM: 8.045 sec

You need to do

sysctl kern.threads.max_threads_per_proc=500100

though.

The Core i3-8100T is pretty slow, but the results are not very different. Rather, the CPU clock seems to be more relevant: i3-8100T at 3.1 GHz vs. i5-6600K at 3.5 GHz.

Solution 12 - C++

As others have mentioned, this seems to be very OS dependent. On my Core i5-8350U running Windows 10, it took 118 seconds, which indicates an overhead of around 237 us per thread (I suspect the virus scanner and all the other rubbish IT installed is really slowing it down, too). A dual-core Xeon E5-2667 v4 running Windows Server 2016 took 41.4 seconds (82 us per thread), but it's also running a lot of IT garbage in the background, including the virus scanner. I think a better approach is to implement a queue with a thread that continuously processes whatever is in the queue, to avoid the overhead of creating and destroying a thread every time.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | jdt141 | View Question on Stackoverflow
Solution 1 - C++ | Nafnlaus | View Answer on Stackoverflow
Solution 2 - C++ | ubiquibacon | View Answer on Stackoverflow
Solution 3 - C++ | Tony Delroy | View Answer on Stackoverflow
Solution 4 - C++ | ruslik | View Answer on Stackoverflow
Solution 5 - C++ | Michael Goldshteyn | View Answer on Stackoverflow
Solution 6 - C++ | user2074102 | View Answer on Stackoverflow
Solution 7 - C++ | Mario The Spoon | View Answer on Stackoverflow
Solution 8 - C++ | Lunar Mushrooms | View Answer on Stackoverflow
Solution 9 - C++ | R.. GitHub STOP HELPING ICE | View Answer on Stackoverflow
Solution 10 - C++ | sosssego | View Answer on Stackoverflow
Solution 11 - C++ | Hiroshi Nishida | View Answer on Stackoverflow
Solution 12 - C++ | 486DX2-66 | View Answer on Stackoverflow