Is non-blocking I/O really faster than multi-threaded blocking I/O? How?

Multithreading Problem Overview

I searched the web on some technical details about blocking I/O and non blocking I/O and I found several people stating that non-blocking I/O would be faster than blocking I/O. For example in this document.

If I use blocking I/O, then of course the thread that is currently blocked can't do anything else... Because it's blocked. But as soon as a thread starts being blocked, the OS can switch to another thread and not switch back until there is something to do for the blocked thread. So as long as there is another thread on the system that needs CPU and is not blocked, there should not be any more CPU idle time compared to an event based non-blocking approach, is there?

Besides reducing the time the CPU is idle I see one more option to increase the number of tasks a computer can perform in a given time frame: Reduce the overhead introduced by switching threads. But how can this be done? And is the overhead large enough to show measurable effects? Here is an idea on how I can picture it working:

To load the contents of a file, an application delegates this task to an event-based i/o framework, passing a callback function along with a filename
The event framework delegates to the operating system, which programs a DMA controller of the hard disk to write the file directly to memory
The event framework allows further code to run.
Upon completion of the disk-to-memory copy, the DMA controller causes an interrupt.
The operating system's interrupt handler notifies the event-based i/o framework about the file being completely loaded into memory. How does it do that? Using a signal??
The code that is currently run within the event i/o framework finishes.
The event-based i/o framework checks its queue and sees the operating system's message from step 5 and executes the callback it got in step 1.

Is that how it works? If it does not, how does it work? That means that the event system can work without ever having the need to explicitly touch the stack (such as a real scheduler that would need to backup the stack and copy the stack of another thread into memory while switching threads)? How much time does this actually save? Is there more to it?

Multithreading Solutions

Solution 1 - Multithreading

The biggest advantage of nonblocking or asynchronous I/O is that your thread can continue its work in parallel. Of course you can achieve this also using an additional thread. As you stated for best overall (system) performance I guess it would be better to use asynchronous I/O and not multiple threads (so reducing thread switching).

Let's look at possible implementations of a network server program that shall handle 1000 clients connected in parallel:

One thread per connection (can be blocking I/O, but can also be non-blocking I/O).
Each thread requires memory resources (also kernel memory!), that is a disadvantage. And every additional thread means more work for the scheduler.
One thread for all connections.
This takes load from the system because we have fewer threads. But it also prevents you from using the full performance of your machine, because you might end up driving one processor to 100% and letting all other processors idle around.
A few threads where each thread handles some of the connections.
This takes load from the system because there are fewer threads. And it can use all available processors. On Windows this approach is supported by Thread Pool API.

Of course having more threads is not per se a problem. As you might have recognized I chose quite a high number of connections/threads. I doubt that you'll see any difference between the three possible implementations if we are talking about only a dozen threads (this is also what Raymond Chen suggests on the MSDN blog post Does Windows have a limit of 2000 threads per process?).

On Windows using unbuffered file I/O means that writes must be of a size which is a multiple of the page size. I have not tested it, but it sounds like this could also affect write performance positively for buffered synchronous and asynchronous writes.

The steps 1 to 7 you describe give a good idea of how it works. On Windows the operating system will inform you about completion of an asynchronous I/O (WriteFile with OVERLAPPED structure) using an event or a callback. Callback functions will only be called for example when your code calls WaitForMultipleObjectsEx with bAlertable set to true.

Some more reading on the web:

Multiple Threads in the User Interface on MSDN, also shortly handling the cost of creating threads
Section Threads and Thread Pools says "Although threads are relatively easy to create and use, the operating system allocates a significant amount of time and other resources to manage them."
CreateThread documentation on MSDN says "However, your application will have better performance if you create one thread per processor and build queues of requests for which the application maintains the context information.".
Old article Why Too Many Threads Hurts Performance, and What to do About It

Solution 2 - Multithreading

I/O includes multiple kind of operations like reading and writing data from hard drives, accessing network resources, calling web services or retrieving data from databases. Depending on the platform and on the kind of operation, asynchronous I/O will usually take advantage of any hardware or low level system support for performing the operation. This means that it will be performed with as little impact as possible on the CPU.

At application level, asynchronous I/O prevents threads from having to wait for I/O operations to complete. As soon as an asynchronous I/O operation is started, it releases the thread on which it was launched and a callback is registered. When the operation completes, the callback is queued for execution on the first available thread.

If the I/O operation is executed synchronously, it keeps its running thread doing nothing until the operation completes. The runtime doesn't know when the I/O operation completes, so it will periodically provide some CPU time to the waiting thread, CPU time that could have otherwise be used by other threads that have actual CPU bound operations to perform.

So, as @user1629468 mentioned, asynchronous I/O does not provide better performance but rather better scalability. This is obvious when running in contexts that have a limited number of threads available, like it is the case with web applications. Web application usually use a thread pool from which they assign threads to each request. If requests are blocked on long running I/O operations there is the risk of depleting the web pool and making the web application freeze or slow to respond.

One thing I have noticed is that asynchronous I/O isn't the best option when dealing with very fast I/O operations. In that case the benefit of not keeping a thread busy while waiting for the I/O operation to complete is not very important and the fact that the operation is started on one thread and it is completed on another adds an overhead to the overall execution.

You can read a more detailed research I have recently made on the topic of asynchronous I/O vs. multithreading here.

Solution 3 - Multithreading

To presume a speed improvement due to any form of multi-computing you must presume either that multiple CPU-based tasks are being executed concurrently upon multiple computing resources (generally processor cores) or else that not all of the tasks rely upon the concurrent usage of the same resource -- that is, some tasks may depend on one system subcomponent (disk storage, say) while some tasks depend on another (receiving communication from a peripheral device) and still others may require usage of processor cores.

The first scenario is often referred to as "parallel" programming. The second scenario is often referred to as "concurrent" or "asynchronous" programming, although "concurrent" is sometimes also used to refer to the case of merely allowing an operating system to interleave execution of multiple tasks, regardless of whether such execution must take place serially or if multiple resources can be used to achieve parallel execution. In this latter case, "concurrent" generally refers to the way that execution is written in the program, rather than from the perspective of the actual simultaneity of task execution.

It's very easy to speak about all of this with tacit assumptions. For example, some are quick to make a claim such as "Asynchronous I/O will be faster than multi-threaded I/O." This claim is dubious for several reasons. First, it could be the case that some given asynchronous I/O framework is implemented precisely with multi-threading, in which case they are one in the same and it doesn't make sense to say one concept "is faster than" the other.

Second, even in the case when there is a single-threaded implementation of an asynchronous framework (such as a single-threaded event loop) you must still make an assumption about what that loop is doing. For example, one silly thing you can do with a single-threaded event loop is request for it to asynchronously complete two different purely CPU-bound tasks. If you did this on a machine with only an idealized single processor core (ignoring modern hardware optimizations) then performing this task "asynchronously" wouldn't really perform any differently than performing it with two independently managed threads, or with just one lone process -- the difference might come down to thread context switching or operating system schedule optimizations, but if both tasks are going to the CPU it would be similar in either case.

It is useful to imagine a lot of the unusual or stupid corner cases you might run into.

"Asynchronous" does not have to be concurrent, for example just as above: you "asynchronously" execute two CPU-bound tasks on a machine with exactly one processor core.

Multi-threaded execution doesn't have to be concurrent: you spawn two threads on a machine with a single processor core, or ask two threads to acquire any other kind of scarce resource (imagine, say, a network database that can only establish one connection at a time). The threads' execution might be interleaved however the operating system scheduler sees fit, but their total runtime cannot be reduced (and will be increased from the thread context switching) on a single core (or more generally, if you spawn more threads than there are cores to run them, or have more threads asking for a resource than what the resource can sustain). This same thing goes for multi-processing as well.

So neither asynchronous I/O nor multi-threading have to offer any performance gain in terms of run time. They can even slow things down.

If you define a specific use case, however, like a specific program that both makes a network call to retrieve data from a network-connected resource like a remote database and also does some local CPU-bound computation, then you can start to reason about the performance differences between the two methods given a particular assumption about hardware.

The questions to ask: How many computational steps do I need to perform and how many independent systems of resources are there to perform them? Are there subsets of the computational steps that require usage of independent system subcomponents and can benefit from doing so concurrently? How many processor cores do I have and what is the overhead for using multiple processors or threads to complete tasks on separate cores?

If your tasks largely rely on independent subsystems, then an asynchronous solution might be good. If the number of threads needed to handle it would be large, such that context switching became non-trivial for the operating system, then a single-threaded asynchronous solution might be better.

Whenever the tasks are bound by the same resource (e.g. multiple needs to concurrently access the same network or local resource), then multi-threading will probably introduce unsatisfactory overhead, and while single-threaded asynchrony may introduce less overhead, in such a resource-limited situation it too cannot produce a speed-up. In such a case, the only option (if you want a speed-up) is to make multiple copies of that resource available (e.g. multiple processor cores if the scarce resource is CPU; a better database that supports more concurrent connections if the scarce resource is a connection-limited database, etc.).

Another way to put it is: allowing the operating system to interleave the usage of a single resource for two tasks cannot be faster than merely letting one task use the resource while the other waits, then letting the second task finish serially. Further, the scheduler cost of interleaving means in any real situation it actually creates a slowdown. It doesn't matter if the interleaved usage occurs of the CPU, a network resource, a memory resource, a peripheral device, or any other system resource.

Solution 4 - Multithreading

The main reason to use AIO is for scalability. When viewed in the context of a few threads, the benefits are not obvious. But when the system scales to 1000s of threads, AIO will offer much better performance. The caveat is that AIO library should not introduce further bottlenecks.

Solution 5 - Multithreading

One possible implementation of non-blocking I/O is exactly what you said, with a pool of background threads that do blocking I/O and notify the thread of the originator of the I/O via some callback mechanism. In fact, this is how the http://www.gnu.org/s/hello/manual/libc/Asynchronous-I_002fO.html#Asynchronous-I_002fO">AIO</a> module in glibc works. http://www.gnu.org/s/hello/manual/libc/Configuration-of-AIO.html#Configuration-of-AIO">Here</a> are some vague details about the implementation.

While this is a good solution that is quite portable (as long as you have threads), the OS is typically able to service non-blocking I/O more efficiently. http://en.wikipedia.org/wiki/Asynchronous_I/O#Implementation">This Wikipedia article lists possible implementations besides the thread pool.

Solution 6 - Multithreading

I am currently in the process of implementing async io on an embedded platform using protothreads. Non blocking io makes the difference between running at 16000fps and 160fps. The biggest benefit of non blocking io is that you can structure your code to do other things while hardware does its thing. Even initialization of devices can be done in parallel.

Martin

Solution 7 - Multithreading

In Node, multiple threads are being launched, but it's a layer down in the C++ run-time.

>"So Yes NodeJS is single threaded, but this is a half truth, actually it is event-driven and single-threaded with background workers. The main event loop is single-threaded but most of the I/O works run on separate threads, because the I/O APIs in Node.js are asynchronous/non-blocking by design, in order to accommodate the event loop. "

https://codeburst.io/how-node-js-single-thread-mechanism-work-understanding-event-loop-in-nodejs-230f7440b0ea

>"Node.js is non-blocking which means that all functions ( callbacks ) are delegated to the event loop and they are ( or can be ) executed by different threads. That is handled by Node.js run-time."

https://itnext.io/multi-threading-and-multi-process-in-node-js-ffa5bb5cde98 ;

The "Node is faster because it's non-blocking..." explanation is a bit of marketing and this is a great question. It's efficient and scaleable, but not exactly single threaded.

Solution 8 - Multithreading

The improvement as far as I know is that Asynchronous I/O uses ( I'm talking about MS System, just to clarify ) the so called I/O completion ports. By using the Asynchronous call the framework leverage such architecture automatically, and this is supposed to be much more efficient that standard threading mechanism. As a personal experience I can say that you would sensibly feel your application more reactive if you prefer AsyncCalls instead of blocking threads.

Solution 9 - Multithreading

Let me give you a counterexample that asynchronous I/O does not work. I am writing a proxy similar to below-using boost::asio. https://github.com/ArashPartow/proxy/blob/master/tcpproxy_server.cpp

However, the scenario of my case is, incoming (from clients side) messages are fast while outgoing (to server side) is slow for one session, to keep up with the incoming speed or to maximize the total proxy throughput, we have to use multiple sessions under one connection.

Thus this async I/O framework does not work anymore. We do need a thread pool to send to the server by assigning each thread a session.

Content Type	Original Author	Original Content on Stackoverflow
Question	yankee	View Question on Stackoverflow
Solution 1 - Multithreading	Werner Henze	View Answer on Stackoverflow
Solution 2 - Multithreading	Florin Dumitrescu	View Answer on Stackoverflow
Solution 3 - Multithreading	ely	View Answer on Stackoverflow
Solution 4 - Multithreading	fissurezone	View Answer on Stackoverflow
Solution 5 - Multithreading	Miguel Grinberg	View Answer on Stackoverflow
Solution 6 - Multithreading	user2826084	View Answer on Stackoverflow
Solution 7 - Multithreading	SmokestackLightning	View Answer on Stackoverflow
Solution 8 - Multithreading	Felice Pollano	View Answer on Stackoverflow
Solution 9 - Multithreading	Zhidian Du	View Answer on Stackoverflow