How do stackless coroutines differ from stackful coroutines?

C++, Concurrency, Coroutine, Boost Coroutine

C++ Problem Overview


Background:

I'm asking this because I currently have an application with many (hundreds to thousands of) threads. Most of those threads are idle a great portion of the time, waiting on work items to be placed in a queue. When a work item becomes available, it is processed by calling some arbitrarily complex existing code. On some operating system configurations, the application bumps up against kernel parameters governing the maximum number of user processes, so I'd like to experiment with ways to reduce the number of worker threads.

My proposed solution:

It seems like a coroutine-based approach, where I replace each worker thread with a coroutine, would help to accomplish this. I can then have a work queue backed by a pool of actual (kernel) worker threads. When an item is placed in a particular coroutine's queue for processing, an entry would be placed into the thread pool's queue. It would then resume the corresponding coroutine, process its queued data, and then suspend it again, freeing up the worker thread to do other work.
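The scheduling part of this proposal can be sketched independently of any coroutine library. The following is a minimal sketch under stated assumptions, not a definitive implementation: the names Worker, Pool, post and shutdown are hypothetical, work items are plain ints, and a std::function stands in for "resume the coroutine with one item". The scheduled flag ensures that a logical worker is queued on the pool at most once, so two pool threads never run the same worker at the same time.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// One logical worker: a private inbox plus the handler that is "resumed"
// for each item. All names in this sketch are hypothetical.
struct Worker {
    std::function<void(int)> handle;  // stands in for resuming the coroutine
    std::mutex m;
    std::deque<int> inbox;
    bool scheduled = false;           // true while queued on or running in the pool
};

class Pool {
public:
    explicit Pool(unsigned threads) {
        for (unsigned i = 0; i < threads; ++i)
            threads_.emplace_back([this] { run(); });
    }
    ~Pool() { shutdown(); }

    // Queue an item for a worker and schedule the worker if it is idle.
    void post(Worker& w, int item) {
        assert(w.handle && "worker needs a handler");
        bool schedule = false;
        {
            std::lock_guard<std::mutex> lock(w.m);
            w.inbox.push_back(item);
            if (!w.scheduled) { w.scheduled = true; schedule = true; }
        }
        if (schedule) {
            { std::lock_guard<std::mutex> lock(m_); ready_.push_back(&w); }
            cv_.notify_one();
        }
    }

    // Process everything already posted, then join the pool threads.
    void shutdown() {
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : threads_) if (t.joinable()) t.join();
    }

private:
    void run() {
        for (;;) {
            Worker* w;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !ready_.empty(); });
                if (ready_.empty()) return;    // only reached when stopping
                w = ready_.front();
                ready_.pop_front();
            }
            for (;;) {                         // drain this worker's inbox
                int item;
                {
                    std::lock_guard<std::mutex> lock(w->m);
                    if (w->inbox.empty()) { w->scheduled = false; break; }
                    item = w->inbox.front();
                    w->inbox.pop_front();
                }
                w->handle(item);               // runs without holding any lock
            }
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::deque<Worker*> ready_;
    std::vector<std::thread> threads_;
    bool stop_ = false;
};
```

In this sketch each worker must outlive the pool, and a handler that blocks also blocks the pool thread it runs on. Whether the handler is a stackful or a stackless coroutine only changes what handle does internally.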

Implementation details:

In thinking about how I would do this, I'm having trouble understanding the functional differences between stackless and stackful coroutines. I have some experience with stackful coroutines through the Boost.Coroutine library. I find them relatively easy to comprehend at a conceptual level: for each coroutine, the library maintains a copy of the CPU context and stack, and when you switch to a coroutine, it switches to that saved context (just like a kernel-mode scheduler would).

What is less clear to me is how a stackless coroutine differs from this. In my application, the amount of overhead associated with the above-described queuing of work items is very important. Most implementations that I've seen, like the new CO2 library, suggest that stackless coroutines provide much lower-overhead context switches.

Therefore, I'd like to understand the functional differences between stackless and stackful coroutines more clearly. Specifically, I have these questions:

  • References like this one suggest that the distinction lies in where you can yield/resume in a stackful vs. stackless coroutine. Is this the case? Is there a simple example of something that I can do in a stackful coroutine but not in a stackless one?

  • Are there any limitations on the use of automatic storage variables (i.e. variables "on the stack")?

  • Are there any limitations on what functions I can call from a stackless coroutine?

  • If there is no saving of stack context for a stackless coroutine, where do automatic storage variables go when the coroutine is running?

C++ Solutions


Solution 1 - C++

First, thank you for taking a look at CO2 :)

The Boost.Coroutine doc describes the advantage of stackful coroutine well:

> stackfulness
>
> In contrast to a stackless coroutine a stackful coroutine can be suspended from within a nested stackframe. Execution resumes at exactly the same point in the code where it was suspended before. With a stackless coroutine, only the top-level routine may be suspended. Any routine called by that top-level routine may not itself suspend. This prohibits providing suspend/resume operations in routines within a general-purpose library.
>
> first-class continuation
>
> A first-class continuation can be passed as an argument, returned by a function and stored in a data structure to be used later. In some implementations (for instance C# yield) the continuation can not be directly accessed or directly manipulated.
>
> Without stackfulness and first-class semantics, some useful execution control flows cannot be supported (for instance cooperative multitasking or checkpointing).

What does that mean to you? For example, imagine you have a function that takes a visitor:

template<class Visitor>
void f(Visitor& v);

Suppose you want to transform it into an iterator. With a stackful coroutine, you can:

asymmetric_coroutine<T>::pull_type pull_from([](asymmetric_coroutine<T>::push_type& yield)
{
    f(yield);
});

But with a stackless coroutine, there's no way to do so, because f itself cannot suspend:

generator<T> pull_from()
{
    // yield can only be used here, cannot pass to f
    f(???);
}
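To make the limitation concrete, here is a minimal sketch of the usual fallback when only the top-level routine may suspend: run f to completion, buffer what the visitor receives, and hand the values out afterwards. Everything here is an assumption made for illustration: this f is a hypothetical stand-in that visits 10, 20 and 30, and Collect and BufferedPull are invented names. The laziness of the stackful version is lost, since f cannot be paused halfway through.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical visitor-based traversal standing in for f() above.
template <class Visitor>
void f(Visitor& v) {
    for (int i = 1; i <= 3; ++i) v(i * 10);
}

// Visitor that records values instead of suspending.
struct Collect {
    std::vector<int>& out;
    void operator()(int x) { out.push_back(x); }
};

// Pull-style interface whose only "suspension point" is in next() itself.
class BufferedPull {
public:
    BufferedPull() {
        Collect c{buf_};
        f(c);                       // f runs to completion here, eagerly
    }

    // Stores the next buffered value in `value`; returns false when exhausted.
    bool next(int& value) {
        assert(pos_ <= buf_.size());
        if (pos_ == buf_.size()) return false;
        value = buf_[pos_++];
        return true;
    }

private:
    std::vector<int> buf_;
    std::size_t pos_ = 0;
};
```

A caller would loop with while (gen.next(v)) and consume v, but every value has already been computed by the time the first one is pulled.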

In general, a stackful coroutine is more powerful than a stackless one. So why do we want stackless coroutines? Short answer: efficiency.

A stackful coroutine typically needs to allocate a certain amount of memory to accommodate its runtime stack (which must be large enough), and its context switch is more expensive than a stackless one. For example, Boost.Coroutine takes 40 cycles while CO2 takes just 7 cycles on average on my machine, because the only thing that a stackless coroutine needs to restore is the program counter.
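The "program counter" remark can be illustrated with a hand-written state machine. This is a minimal sketch of the general technique, not CO2's actual expansion: the class name Counter and its members are hypothetical. Suspending is an ordinary return, and resuming is an ordinary call followed by a jump on the integer pc_.

```cpp
#include <cassert>

// Hand-written stackless coroutine that yields 0, 1, ..., limit-1.
// Everything that must survive a suspension is a data member; the
// saved "program counter" is just the integer pc_.
class Counter {
public:
    explicit Counter(int limit) : limit_(limit) { assert(limit >= 0); }

    // Resume the coroutine. Returns true and sets `out` if a value was
    // yielded, false once the body has finished.
    bool resume(int& out) {
        switch (pc_) {
        case 0:                       // first entry: start of the body
            for (i_ = 0; i_ < limit_; ++i_) {
                out = i_;
                pc_ = 1;              // remember where to continue
                return true;          // "suspend": a plain function return
        case 1:;                      // resumption lands here, inside the loop
            }
            pc_ = 2;                  // mark the body as finished
        }
        return false;
    }

private:
    int pc_ = 0;
    int i_ = 0;
    int limit_;
};
```

A caller drives it with a loop such as while (c.resume(v)), and no separate stack is allocated for the coroutine at any point.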

That said, with language support, a stackful coroutine could probably also take advantage of a compiler-computed maximum stack size, as long as there is no recursion in the coroutine, so its memory usage could be improved as well.

Speaking of stackless coroutines, bear in mind that "stackless" does not mean there is no runtime stack at all. It only means that the coroutine uses the same runtime stack as the host side. You can therefore call recursive functions as well; the recursion simply happens on the host's runtime stack. In contrast, with a stackful coroutine, recursive calls happen on the coroutine's own stack.

To answer the questions:

  • Are there any limitations on the use of automatic storage variables (i.e. variables "on the stack")?

No. The restriction is a limitation of CO2's emulation. With language support, the automatic storage variables visible to the coroutine will be placed in the coroutine's internal storage. Note my emphasis on "visible to the coroutine": if the coroutine calls a function that uses automatic storage variables internally, those variables will be placed on the runtime stack. More specifically, a stackless coroutine only has to preserve the variables/temporaries that can be used after it is resumed.

To be clear, you can use automatic storage variables in CO2's coroutine body as well:

auto f() CO2_RET(co2::task<>, ())
{
    int a = 1; // not ok
    CO2_AWAIT(co2::suspend_always{});
    {
        int b = 2; // ok
        doSomething(b);
    }
    CO2_AWAIT(co2::suspend_always{});
    int c = 3; // ok
    doSomething(c);
} CO2_END

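This works as long as no await follows the definition within the variable's scope: a is followed by an await and is therefore not allowed, while b and c are fine.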
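One way to see why the rule looks like this is to write the same body by hand. The sketch below assumes a switch-based emulation, which is an assumption about the general technique rather than a description of CO2's macros. In C++ a switch may not jump past an initialized local into its scope, so a variable that is live across a suspension point has to become a member of the coroutine object. The names Task, seen and this doSomething are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for doSomething(): records what it was given.
std::vector<int> seen;
void doSomething(int x) { seen.push_back(x); }

// Hand-written stackless coroutine with two suspension points.
struct Task {
    int pc = 0;
    int a = 0;   // live across a suspension, so it is stored in the object

    // Returns true while the body is suspended, false once it has finished.
    bool resume() {
        assert(pc >= 0 && pc <= 3);
        switch (pc) {
        case 0:
            a = 1;                 // assigned before the first suspension
            pc = 1;
            return true;           // first suspension point
        case 1:
            {
                int b = 2;         // no suspension inside this block: plain local
                doSomething(b);
            }
            pc = 2;
            return true;           // second suspension point
        case 2:
            {
                int c = a + 2;     // a still holds 1 because it is a member
                doSomething(c);
            }
            pc = 3;                // finished
        }
        return false;
    }
};
```

The locals b and c live on the ordinary runtime stack for the duration of a single resume() call, as do any locals inside doSomething.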

  • Are there any limitations on what functions I can call from a stackless coroutine?

No.

  • If there is no saving of stack context for a stackless coroutine, where do automatic storage variables go when the coroutine is running?

As answered above, a stackless coroutine doesn't care about the automatic storage variables used in the functions it calls; they are simply placed on the normal runtime stack.

If you have any doubt, just check CO2's source code, it may help you understand the mechanics under the hood ;)

Solution 2 - C++

What you want are user-land threads/fibers. Usually you want to suspend your code (running in a fiber) deep inside a nested call stack (for instance, while parsing messages from a TCP connection). In this case you cannot use stackless context switching: the application stack is shared between stackless coroutines, so the stack frames of called subroutines would be overwritten.

You can use something like Boost.Fiber, which implements user-land threads/fibers based on Boost.Context.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Jason R | View Question on Stackoverflow
Solution 1 - C++ | Jamboree | View Answer on Stackoverflow
Solution 2 - C++ | olk | View Answer on Stackoverflow