gcc, strict-aliasing, and horror stories

C Problem Overview

In gcc-strict-aliasing-and-casting-through-a-union I asked whether anyone had encountered problems with union punning through pointers. So far, the answer seems to be No.

This question is broader: Do you have any horror stories about gcc and strict-aliasing?

Background: Quoting from AndreyT's answer in c99-strict-aliasing-rules-in-c-gcc:

>"Strict aliasing rules are rooted in parts of the standard that were present in C and C++ since the beginning of [standardized] times. The clause that prohibits accessing object of one type through a lvalue of another type is present in C89/90 (6.3) as well as in C++98 (3.10/15). ... It is just that not all compilers wanted (or dared) to enforce it or rely on it."

Well, gcc is now daring to do so, with its -fstrict-aliasing switch. And this has caused some problems. See, for example, the excellent article http://davmac.wordpress.com/2009/10/ about a Mysql bug, and the equally excellent discussion in http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html.

Some other less-relevant links:

So to repeat, do you have a horror story of your own? Problems not indicated by -Wstrict-aliasing would, of course, be preferred. And other C compilers are also welcome.

Added June 2nd: The first link in Michael Burr's answer, which does indeed qualify as a horror story, is perhaps a bit dated (from 2003). I did a quick test, but the problem has apparently gone away.

Source:

#include <string.h>
struct iw_event {               /* dummy! */
    int len;
};
char *iwe_stream_add_event(
    char *stream,               /* Stream of events */
    char *ends,                 /* End of stream */
    struct iw_event *iwe,       /* Payload */
    int event_len)              /* Real size of payload */
{
    /* Check if it's possible */
    if ((stream + event_len) < ends) {
            iwe->len = event_len;
            memcpy(stream, (char *) iwe, event_len);
            stream += event_len;
    }
    return stream;
}

The specific complaint is:

>Some users have complained that when the [above] code is compiled without the -fno-strict-aliasing, the order of the write and memcpy is inverted (which means a bogus len is mem-copied into the stream).

Compiled code, using gcc 4.3.4 on CYGWIN wih -O3 (please correct me if I am wrong--my assembler is a bit rusty!):

_iwe_stream_add_event:
        pushl       %ebp
        movl        %esp, %ebp
        pushl       %ebx
        subl        $20, %esp
        movl        8(%ebp), %eax       # stream    --> %eax
        movl        20(%ebp), %edx      # event_len --> %edx
        leal        (%eax,%edx), %ebx   # sum       --> %ebx
        cmpl        12(%ebp), %ebx      # compare sum with ends
        jae L2
        movl        16(%ebp), %ecx      # iwe       --> %ecx
        movl        %edx, (%ecx)        # event_len --> iwe->len (!!)
        movl        %edx, 8(%esp)       # event_len --> stack
        movl        %ecx, 4(%esp)       # iwe       --> stack
        movl        %eax, (%esp)        # stream    --> stack
        call        _memcpy
        movl        %ebx, %eax          # sum       --> retval
L2:
        addl        $20, %esp
        popl        %ebx
        leave
        ret

And for the second link in Michael's answer,

*(unsigned short *)&a = 4;

gcc will usually (always?) give a warning. But I believe a valid solution to this (for gcc) is to use:

#define CAST(type, x) (((union {typeof(x) src; type dst;}*)&(x))->dst)
// ...
CAST(unsigned short, a) = 4;

I've asked SO whether this is OK in gcc-strict-aliasing-and-casting-through-a-union, but so far nobody disagrees.

C Solutions

Solution 1 - C

No horror story of my own, but here are some quotes from Linus Torvalds (sorry if these are already in one of the linked references in the question):

http://lkml.org/lkml/2003/2/26/158: > Date Wed, 26 Feb 2003 09:22:15 -0800 > Subject Re: Invalid compilation without -fno-strict-aliasing > From Jean Tourrilhes <> > > On Wed, Feb 26, 2003 at 04:38:10PM +0100, Horst von Brand wrote: > > Jean Tourrilhes <> said: > > > It looks like a compiler bug to me... > > > Some users have complained that when the following code is > > > compiled without the -fno-strict-aliasing, the order of the write and > > > memcpy is inverted (which mean a bogus len is mem-copied into the > > > stream). > > > Code (from linux/include/net/iw_handler.h) : > > > > > > static inline char * > > > iwe_stream_add_event(char * stream, /* Stream of events / > > > char * ends, / End of stream */ > > > struct iw_event iwe, / Payload / > > > int event_len) / Real size of payload / > > > { > > > / Check if it's possible */ > > > if((stream + event_len) < ends) { > > > iwe->len = event_len; > > > memcpy(stream, (char *) iwe, event_len); > > > stream += event_len; > > > } > > > return stream; > > > } > > > > > > IMHO, the compiler should have enough context to know that the > > > reordering is dangerous. Any suggestion to make this simple code more > > > bullet proof is welcomed. > > > > The compiler is free to assume char *stream and struct iw_event *iwe point > > to separate areas of memory, due to strict aliasing. > > Which is true and which is not the problem I'm complaining about.

(Note with hindsight: this code is fine, but Linux's implementation of memcpy was a macro that cast to long * to copy in larger chunks. With a correctly-defined memcpy, gcc -fstrict-aliasing isn't allowed to break this code. But it means you need inline asm to define a kernel memcpy if your compiler doesn't know how turn a byte-copy loop into efficient asm, which was the case for gcc before gcc7)

> And Linus Torvald's comment on the above: > > Jean Tourrilhes <[email protected]> wrote: > > > > It looks like a compiler bug to me... > > Why do you think the kernel uses "-fno-strict-aliasing"? > > The gcc people are more interested in trying to find out what can be > allowed by the c99 specs than about making things actually work. The > aliasing code in particular is not even worth enabling, it's just not > possible to sanely tell gcc when some things can alias. > > > Some users have complained that when the following code is > >compiled without the -fno-strict-aliasing, the order of the write and > >memcpy is inverted (which mean a bogus len is mem-copied into the > >stream). > > The "problem" is that we inline the memcpy(), at which point gcc won't > care about the fact that it can alias, so they'll just re-order > everything and claim it's out own fault. Even though there is no sane > way for us to even tell gcc about it. > > I tried to get a sane way a few years ago, and the gcc developers really > didn't care about the real world in this area. I'd be surprised if that > had changed, judging by the replies I have already seen. > > I'm not going to bother to fight it. > > Linus

http://www.mail-archive.com/[email protected]/msg01647.html: > Type-based aliasing is stupid. It's so incredibly stupid that it's not even funny. It's broken. And gcc took the broken notion, and made it more so by making it a "by-the-letter-of-the-law" thing that makes no sense. > > ... > > I know for a fact that gcc would re-order write accesses that were clearly to (statically) the same address. Gcc would suddenly think that > > unsigned long a; > > a = 5; > *(unsigned short *)&a = 4; > > could be re-ordered to set it to 4 first (because clearly they don't alias - by reading the standard), and then because now the assignment of 'a=5' was later, the assignment of 4 could be elided entirely! And if somebody complains that the compiler is insane, the compiler people would say "nyaah, nyaah, the standards people said we can do this", with absolutely no introspection to ask whether it made any SENSE.

Solution 2 - C

SWIG generates code that depends on strict aliasing being off, which can cause all sorts of problems.

SWIGEXPORT jlong JNICALL Java_com_mylibJNI_make_1mystruct_1_1SWIG_12(
       JNIEnv *jenv, jclass jcls, jint jarg1, jint jarg2) {
  jlong jresult = 0 ;
  int arg1 ;
  int arg2 ;
  my_struct_t *result = 0 ;
  
  (void)jenv;
  (void)jcls;
  arg1 = (int)jarg1; 
  arg2 = (int)jarg2; 
  result = (my_struct_t *)make_my_struct(arg1,arg2);
  *(my_struct_t **)&jresult = result;              /* <<<< horror*/
  return jresult;
}

Solution 3 - C

gcc, aliasing, and 2-D variable-length arrays: The following sample code copies a 2x2 matrix:

#include <stdio.h>

static void copy(int n, int a[][n], int b[][n]) {
   int i, j;
   for (i = 0; i < 2; i++)    // 'n' not used in this example
      for (j = 0; j < 2; j++) // 'n' hard-coded to 2 for simplicity
         b[i][j] = a[i][j];
}

int main(int argc, char *argv[]) {
   int a[2][2] = {{1, 2},{3, 4}};
   int b[2][2];
   copy(2, a, b);    
   printf("%d %d %d %d\n", b[0][0], b[0][1], b[1][0], b[1][1]);
   return 0;
}

With gcc 4.1.2 on CentOS, I get:

$ gcc -O1 test.c && a.out
1 2 3 4
$ gcc -O2 test.c && a.out
10235717 -1075970308 -1075970456 11452404 (random)

I don't know whether this is generally known, and I don't know whether this a bug or a feature. I can't duplicate the problem with gcc 4.3.4 on Cygwin, so it may have been fixed. Some work-arounds:

Use __attribute__((noinline)) for copy().
Use the gcc switch -fno-strict-aliasing.
Change the third parameter of copy() from b[][n] to b[][2].
Don't use -O2 or -O3.

Further notes:

This is an answer, after a year and a day, to my own question (and I'm a bit surprised there are only two other answers).
I lost several hours with this on my actual code, a Kalman filter. Seemingly small changes would have drastic effects, perhaps because of changing gcc's automatic inlining (this is a guess; I'm still uncertain). But it probably doesn't qualify as a horror story.
Yes, I know you wouldn't write copy() like this. (And, as an aside, I was slightly surprised to see gcc did not unroll the double-loop.)
No gcc warning switches, include -Wstrict-aliasing=, did anything here.
1-D variable-length arrays seem to be OK.

Update: The above does not really answer the OP's question, since he (i.e. I) was asking about cases where strict aliasing 'legitimately' broke your code, whereas the above just seems to be a garden-variety compiler bug.

I reported it to GCC Bugzilla, but they weren't interested in the old 4.1.2, even though (I believe) it is the key to the $1-billion RHEL5. It doesn't occur in 4.2.4 up.

And I have a slightly simpler example of a similar bug, with only one matrix. The code:

static void zero(int n, int a[][n]) {
   int i, j;
   for (i = 0; i < n; i++)
   for (j = 0; j < n; j++)
      a[i][j] = 0;
}

int main(void) {
   int a[2][2] = {{1, 2},{3, 4}};
   zero(2, a);    
   printf("%d\n", a[1][1]);
   return 0;
}

produces the results:

gcc -O1 test.c && a.out
0
gcc -O1 -fstrict-aliasing test.c && a.out
4

It seems it is the combination -fstrict-aliasing with -finline which causes the bug.

Solution 4 - C

here is mine:

http://forum.openscad.org/CGAL-3-6-1-causing-errors-but-CGAL-3-6-0-OK-tt2050.html

it caused certain shapes in a CAD program to be drawn incorrectly. thank goodness for the project's leaders work on creating a regression test suite.

the bug only manifested itself on certain platforms, with older versions of GCC and older versions of certain libraries. and then only with -O2 turned on. -fno-strict-aliasing solved it.

Solution 5 - C

The Common Initial Sequence rule of C used to be interpreted as making it possible to write a function which could work on the leading portion of a wide variety of structure types, provided they start with elements of matching types. Under C99, the rule was changed so that it only applied if the structure types involved were members of the same union whose complete declaration was visible at the point of use.

The authors of gcc insist that the language in question is only applicable if the accesses are performed through the union type, notwithstanding the facts that:

There would be no reason to specify that the complete declaration must be visible if accesses had to be performed through the union type.
Although the CIS rule was described in terms of unions, its primary usefulness lay in what it implied about the way in which structs were laid out and accessed. If S1 and S2 were structures that shared a CIS, there would be no way that a function that accepted a pointer to an S1 and an S2 from an outside source could comply with C89's CIS rules without allowing the same behavior to be useful with pointers to structures that weren't actually inside a union object; specifying CIS support for structures would thus have been redundant given that it was already specified for unions.

Solution 6 - C

The following code returns 10, under gcc 4.4.4. Is anything wrong with the union method or gcc 4.4.4?

int main()
{
  int v = 10;

  union vv {
    int v;
    short q;
  } *s = (union vv *)&v;

  s->v = 1;

  return v;
}

Content Type	Original Author	Original Content on Stackoverflow
Question	Joseph Quinsey	View Question on Stackoverflow
Solution 1 - C	Michael Burr	View Answer on Stackoverflow
Solution 2 - C	paleozogt	View Answer on Stackoverflow
Solution 3 - C	Joseph Quinsey	View Answer on Stackoverflow
Solution 4 - C	don bright	View Answer on Stackoverflow
Solution 5 - C	supercat	View Answer on Stackoverflow
Solution 6 - C	user470617	View Answer on Stackoverflow