What is the performance penalty of C++11 thread_local variables in GCC 4.8?

C++ Problem Overview


From the GCC 4.8 draft changelog:

> G++ now implements the C++11 thread_local keyword; this differs from the GNU __thread keyword primarily in that it allows dynamic initialization and destruction semantics. Unfortunately, this support requires a run-time penalty for references to non-function-local thread_local variables even if they don't need dynamic initialization, so users may want to continue to use __thread for TLS variables with static initialization semantics.
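
To make the "dynamic initialization and destruction semantics" in the quote concrete, here is a minimal sketch of declarations that thread_local accepts and that __thread is expected to reject. The names Scratch, next_id, scratch_id and session are hypothetical and exist only for illustration; the commented-out lines are not compiled.

```cpp
#include <atomic>
#include <string>

// Hypothetical helper: returns a fresh id on every call.
int next_id() {
    static std::atomic<int> counter{0};
    return ++counter;
}

// Hypothetical type with a non-trivial constructor and destructor.
struct Scratch {
    std::string text;
    int id;
    Scratch() : text("scratch"), id(next_id()) {}
    ~Scratch() { text.clear(); }
};

// Accepted: thread_local supports dynamic initialization and destruction.
thread_local Scratch scratch;
thread_local int session_id = next_id();

// Expected to be rejected by GCC, since __thread requires static
// initialization and a trivial destructor:
// __thread Scratch rejected_a;
// __thread int rejected_b = next_id();

// Accepted by both keywords: a constant initializer, nothing to destroy.
__thread int plain_counter = 0;

// Accessors so that other code reaches the variables through functions.
int scratch_id() { return scratch.id; }
int session() { return session_id; }
```

Each thread that touches scratch or session_id gets its own copy, constructed before that thread's first use and destroyed when the thread exits.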

What precisely is the nature and origin of this run-time penalty?

Obviously to support non-function-local thread_local variables there needs to be a thread initialization phase before the entry to every thread main (just as there is a static initialization phase for global variables), but are they referring to some run-time penalty beyond that?

Roughly speaking, what is the architecture of GCC's new implementation of thread_local?

C++ Solutions


Solution 1 - C++

(Disclaimer: I don't know much about the internals of GCC, so this is also an educated guess.)

Dynamic thread_local initialization was added in commit 462819c. One of the changes is:

> * semantics.c (finish_id_expression): Replace use of thread_local
> variable with a call to its wrapper.

So the run-time penalty is that every reference to the thread_local variable becomes a function call. Let's check with a simple test case:

// 3.cpp
extern thread_local int tls;    
int main() {
    tls += 37;   // line 6
    tls &= 11;   // line 7
    tls ^= 3;    // line 8
    return 0;
}

// 4.cpp

thread_local int tls = 42;

When compiled*, we see that every use of tls becomes a call to the wrapper function _ZTW3tls, which lazily initializes the variable once and returns its address:

00000000004005b0 <main>:
main():
  4005b0:	55                      	push   rbp
  4005b1:	48 89 e5                	mov    rbp,rsp
  4005b4:	e8 26 00 00 00          	call   4005df <_ZTW3tls>    // line 6
  4005b9:	8b 10                   	mov    edx,DWORD PTR [rax]
  4005bb:	83 c2 25                	add    edx,0x25
  4005be:	89 10                   	mov    DWORD PTR [rax],edx
  4005c0:	e8 1a 00 00 00          	call   4005df <_ZTW3tls>    // line 7
  4005c5:	8b 10                   	mov    edx,DWORD PTR [rax]
  4005c7:	83 e2 0b                	and    edx,0xb
  4005ca:	89 10                   	mov    DWORD PTR [rax],edx
  4005cc:	e8 0e 00 00 00          	call   4005df <_ZTW3tls>    // line 8
  4005d1:	8b 10                   	mov    edx,DWORD PTR [rax]
  4005d3:	83 f2 03                	xor    edx,0x3
  4005d6:	89 10                   	mov    DWORD PTR [rax],edx
  4005d8:	b8 00 00 00 00          	mov    eax,0x0              // line 9
  4005dd:	5d                      	pop    rbp
  4005de:	c3                      	ret

00000000004005df <_ZTW3tls>:
_ZTW3tls():
  4005df:	55                      	push   rbp
  4005e0:	48 89 e5                	mov    rbp,rsp
  4005e3:	b8 00 00 00 00          	mov    eax,0x0
  4005e8:	48 85 c0                	test   rax,rax
  4005eb:	74 05                   	je     4005f2 <_ZTW3tls+0x13>
  4005ed:	e8 0e fa bf ff          	call   0 <tls> // initialize the TLS
  4005f2:	64 48 8b 14 25 00 00 00 00 	mov    rdx,QWORD PTR fs:0x0
  4005fb:	48 c7 c0 fc ff ff ff    	mov    rax,0xfffffffffffffffc
  400602:	48 01 d0                	add    rax,rdx
  400605:	5d                      	pop    rbp
  400606:	c3                      	ret

Compare it with the __thread version, which won't have this extra wrapper:

00000000004005b0 <main>:
main():
  4005b0:	55                      	push   rbp
  4005b1:	48 89 e5                	mov    rbp,rsp
  4005b4:	48 c7 c0 fc ff ff ff    	mov    rax,0xfffffffffffffffc // line 6
  4005bb:	64 8b 00                	mov    eax,DWORD PTR fs:[rax]
  4005be:	8d 50 25                	lea    edx,[rax+0x25]
  4005c1:	48 c7 c0 fc ff ff ff    	mov    rax,0xfffffffffffffffc
  4005c8:	64 89 10                	mov    DWORD PTR fs:[rax],edx
  4005cb:	48 c7 c0 fc ff ff ff    	mov    rax,0xfffffffffffffffc // line 7
  4005d2:	64 8b 00                	mov    eax,DWORD PTR fs:[rax]
  4005d5:	89 c2                   	mov    edx,eax
  4005d7:	83 e2 0b                	and    edx,0xb
  4005da:	48 c7 c0 fc ff ff ff    	mov    rax,0xfffffffffffffffc
  4005e1:	64 89 10                	mov    DWORD PTR fs:[rax],edx
  4005e4:	48 c7 c0 fc ff ff ff    	mov    rax,0xfffffffffffffffc // line 8
  4005eb:	64 8b 00                	mov    eax,DWORD PTR fs:[rax]
  4005ee:	89 c2                   	mov    edx,eax
  4005f0:	83 f2 03                	xor    edx,0x3
  4005f3:	48 c7 c0 fc ff ff ff    	mov    rax,0xfffffffffffffffc
  4005fa:	64 89 10                	mov    DWORD PTR fs:[rax],edx
  4005fd:	b8 00 00 00 00          	mov    eax,0x0                // line 9
  400602:	5d                      	pop    rbp
  400603:	c3                      	ret
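
Read back into C++, the _ZTW3tls wrapper in the first listing behaves roughly like the sketch below. This is an approximation under stated assumptions, not GCC's actual generated code: tls_init is a hypothetical function pointer standing in for what appears to be a weak reference to the variable's init function (null in the listing, because tls needs no dynamic initialization), and tls_wrapper and run_main_body are illustrative names.

```cpp
// Stand-in for the variable defined in 4.cpp.
thread_local int tls = 42;

// Hypothetical stand-in for the weak reference to the TLS init function.
// It stays null here because `tls` has a constant initializer.
void (*tls_init)() = nullptr;

// Approximation of the wrapper: run the init function if one exists,
// then return a reference to this thread's copy of the variable.
int& tls_wrapper() {
    if (tls_init != nullptr) {
        tls_init();
    }
    return tls;
}

// The body of main() from 3.cpp, with every use of `tls` routed
// through the wrapper, as in the first disassembly.
int run_main_body() {
    tls_wrapper() += 37;
    tls_wrapper() &= 11;
    tls_wrapper() ^= 3;
    return tls_wrapper();
}
```

The null check and the call are the extra work per access; the __thread version reads the variable directly at its fs-relative offset.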

This wrapper is not needed in every use of thread_local, though, as decl2.c reveals. The wrapper is generated only when:

  • The variable is not function-local, and

    1. It is extern (the example shown above), or
    2. Its type has a non-trivial destructor (which is not allowed for __thread variables), or
    3. It is initialized by a non-constant expression (which is also not allowed for __thread variables).

In all other use cases, it behaves the same as __thread. That means, unless you have some extern __thread variables, you can replace every __thread with thread_local without any loss of performance.
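
The conditions above can be illustrated with one declaration per case. This is a sketch with hypothetical names; the comments restate the rules from this answer and were not checked against the compiler's output.

```cpp
#include <string>

// Case 1: extern declaration. The initializer is not visible in this
// translation unit, so uses would go through the wrapper.
// (Declared only; never used here, so no definition is required.)
extern thread_local int defined_in_another_tu;

// Case 2: the type has a non-trivial destructor.
thread_local std::string name_buffer;

// Case 3: the initializer is not a constant expression.
int compute_seed() { return 7 * 6; }
thread_local int seed = compute_seed();

// No wrapper expected: defined in this translation unit, constant
// initializer, trivial destructor. Should behave like __thread.
thread_local int plain_counter = 0;

// Function-local: outside the rule entirely, because the rule only
// concerns non-function-local variables.
int next_ticket() {
    thread_local int ticket = 100;
    return ++ticket;
}

// Touches the namespace-scope variables so the sketch does something.
int touch_all() {
    name_buffer = "worker";
    plain_counter += 1;
    return seed + plain_counter + static_cast<int>(name_buffer.size());
}
```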


*: I compiled with -O0 because inlining makes the function boundary harder to see. Even at -O3, the initialization checks remain.

Solution 2 - C++

C++11 thread_local has the same runtime effect as the __thread specifier (__thread is a compiler extension, not part of the C standard; thread_local is part of the C++ standard).

The cost of an access depends on where the TLS variable (declared with the __thread specifier) is declared:

  • If the TLS variable is declared in an executable, access is fast.
  • If the TLS variable is declared within shared library code (compiled with the -fPIC compiler option) and the -ftls-model=initial-exec compiler option is specified, access is fast. However, the following limitation applies: the shared library can't be loaded dynamically via dlopen/dlsym; the only way to use it is to link against it at build time (linker option -l<libraryname>).
  • If the TLS variable is declared within a shared library (-fPIC compiler option set) and no TLS model is specified, access is much slower, because the general dynamic TLS model is assumed: each access to a TLS variable results in a call to __tls_get_addr(). This is the default because it places no limits on how the shared library is used.

Source: "ELF Handling For Thread-Local Storage" by Ulrich Drepper, https://www.akkadia.org/drepper/tls.pdf. The paper also lists the code that is generated for each supported target platform.
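
Besides the -ftls-model option, which applies to a whole compilation, GCC and Clang also accept a per-variable tls_model attribute. The sketch below assumes an ELF target and uses hypothetical variable and function names. The same dlopen limitation described above applies to any variable pinned to initial-exec inside a shared library.

```cpp
// Pinned to the initial-exec model regardless of -ftls-model and -fPIC.
__thread int fast_counter __attribute__((tls_model("initial-exec"))) = 0;

// The attribute can be applied to a thread_local variable as well.
thread_local int fast_total __attribute__((tls_model("initial-exec"))) = 0;

// No attribute: the model follows the compiler options. In code built
// with -fPIC for a shared library this defaults to general dynamic.
__thread int default_counter = 0;

// Bumps all three variables and returns their sum.
int bump_all() {
    fast_counter += 1;
    fast_total += 2;
    default_counter += 3;
    return fast_counter + fast_total + default_counter;
}
```

Compiling this with -fPIC and inspecting the disassembly should show which accesses go through __tls_get_addr() and which do not.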

Solution 3 - C++

If the variable is defined in the current TU, the inliner will take care of the overhead. I expect that this will be true of most uses of thread_local.

For extern variables, if the programmer can be sure that no use of the variable in a non-defining TU needs to trigger dynamic initialization (either because the variable is statically initialized, or a use of the variable in the defining TU will be executed before any uses in another TU), they can avoid this overhead with the -fno-extern-tls-init option.
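
As an illustration of the extern case, the sketch below folds what would normally be three files into one so that it compiles on its own. The file names and identifiers are hypothetical. Because request_count has a constant initializer, no use in another TU can need to trigger dynamic initialization, which is the situation in which the answer says -fno-extern-tls-init is safe.

```cpp
// --- would live in a shared header (hypothetical "stats.h") ---
extern thread_local int request_count;

// --- would live in the defining TU (hypothetical "stats.cpp") ---
// Constant initializer: no dynamic initialization is ever needed.
thread_local int request_count = 0;

// --- would live in some other TU (hypothetical "handler.cpp") ---
// That TU would see only the extern declaration from the header.
// Building it with -fno-extern-tls-init tells GCC it can skip the
// initialization check on each access.
int record_request() {
    return ++request_count;
}
```

If request_count were later given a dynamic initializer, the option would no longer be safe unless the defining TU is guaranteed to use the variable first on every thread.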

Attributions

All content on this page is sourced from the original question and answers on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Andrew Tomazos | View Question on Stackoverflow
Solution 1 - C++ | kennytm | View Answer on Stackoverflow
Solution 2 - C++ | MichaelMoser | View Answer on Stackoverflow
Solution 3 - C++ | Jason Merrill | View Answer on Stackoverflow