In a structure, is it legal to use one array field to access another one?

Tags: C++, C, Arrays, Struct

C++ Problem Overview


As an example, consider the following structure:

struct S {
  int a[4];
  int b[4];
} s;

Would it be legal to write s.a[6] and expect it to be equal to s.b[2]? Personally, I feel that it must be UB in C++, whereas I'm not sure about C. However, I failed to find anything relevant in the standards of C and C++ languages.


Update

There are several answers suggesting ways to make sure there is no padding between fields in order to make the code work reliably. I'd like to emphasize that if such code is UB, then the absence of padding is not enough. If it is UB, then the compiler is free to assume that accesses to s.a[i] and s.b[j] do not overlap, and it is free to reorder such memory accesses. For example,

    int x = s.b[2];
    s.a[6] = 2;
    return x;

can be transformed to

    s.a[6] = 2;
    int x = s.b[2];
    return x;

which always returns 2.
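To make the assumption concrete, here is a minimal sketch in C. The function name `read_b_then_write_a` and the index parameter `i` are hypothetical and not part of the question. For every index with defined behavior, the store into `a` cannot change `b`, which is exactly what would let a compiler move or cache the load of `s->b[2]`.

```c
#include <stddef.h>

struct S {
    int a[4];
    int b[4];
};

/* Reads s->b[2], then stores into s->a[i]. For every i with defined
   behavior (0..3) the store cannot modify b, so the returned value
   is always the value b[2] had on entry. */
int read_b_then_write_a(struct S *s, size_t i)
{
    int x = s->b[2];
    if (i < 4) {          /* keep the access inside a */
        s->a[i] = 2;
    }
    return x;
}
```

Calling it with `i == 6` simply skips the store, so the result is the same whichever order the load and the store are performed in.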

C++ Solutions


Solution 1 - C++

> Would it be legal to write s.a[6] and expect it to be equal to s.b[2]?

No, because accessing an array out of bounds invokes undefined behaviour in C and C++.

C11 J.2 Undefined behavior

> - Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that points just beyond the array object and is used as the operand of a unary * operator that is evaluated (6.5.6).
> - An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression a[1][7] given the declaration int a[4][5]) (6.5.6).

C++ standard draft section 5.7 Additive operators paragraph 5 says:

> When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression. [...] If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
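As an illustration of what the quoted rules do allow, the following sketch forms a pointer one past the last element of `a` and uses it only as a loop bound. The function name `sum_a` is an assumption made for this example.

```c
#include <stddef.h>

struct S {
    int a[4];
    int b[4];
};

/* Sums the four elements of s->a. The pointer `end` points one past the
   last element of a; forming and comparing it is allowed, but it is
   never dereferenced. */
int sum_a(const struct S *s)
{
    const int *end = s->a + 4;   /* one past the end: valid to form */
    int total = 0;
    for (const int *p = s->a; p != end; ++p) {
        total += *p;             /* p always points into a[0..3] */
    }
    return total;
}
```

Anything beyond that, such as `s->a + 6` or `*end`, falls under the "otherwise, the behavior is undefined" clause.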

Solution 2 - C++

Apart from the answer of @rsp (undefined behavior for an array subscript that is out of range), I can add that it is not legal to access b via a because the C language does not specify how much padding there can be between the end of the area allocated for a and the start of b. So even if it runs on a particular implementation, it is not portable.

instance of struct:
+-----------+----------------+-----------+---------------+
|  array a  |  maybe padding |  array b  | maybe padding |
+-----------+----------------+-----------+---------------+

The second padding may well be absent, since the alignment of the struct object is the alignment of a, which is the same as the alignment of b. However, the C language does not require it to be absent either.
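If you want to see what a particular implementation actually does, the padding can be measured with offsetof and sizeof. This is only a sketch; the two function names are hypothetical, and the numbers they return describe one compiler and target, not the language.

```c
#include <stddef.h>

struct S {
    int a[4];
    int b[4];
};

/* Bytes of padding between the end of a and the start of b. */
size_t padding_between_a_and_b(void)
{
    return offsetof(struct S, b) - (offsetof(struct S, a) + sizeof(int[4]));
}

/* Bytes of padding after b, at the end of the struct. */
size_t padding_after_b(void)
{
    return sizeof(struct S) - (offsetof(struct S, b) + sizeof(int[4]));
}
```

Even when both functions return 0, indexing a past its end remains undefined; the check only tells you about layout.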

Solution 3 - C++

a and b are two different arrays, and a is defined as containing 4 elements. Hence, a[6] accesses the array out of bounds and is therefore undefined behaviour. Note that the array subscript a[6] is defined as *(a+6), so the proof of UB is actually given by the section "Additive operators" as it applies to pointers. See the following section of the C11 standard (e.g. this online draft version) describing this aspect:

> 6.5.6 Additive operators
>
> When an expression that has integer type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integer expression. In other words, if the expression P points to the i-th element of an array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N (where N has the value n) point to, respectively, the i+n-th and i-n-th elements of the array object, provided they exist. Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. If the result points one past the last element of the array object, it shall not be used as the operand of a unary * operator that is evaluated.

The same argument applies to C++ (though not quoted here).

Further, though it is clearly undefined behaviour because it exceeds the bounds of a, note that the compiler might introduce padding between members a and b, such that, even if such pointer arithmetic were allowed, a+6 would not necessarily yield the same address as b+2.

Solution 4 - C++

Is it legal? No. As others mentioned, it invokes Undefined Behavior.

Will it work? That depends on your compiler. That's the thing about undefined behavior: it's undefined.

On many C and C++ compilers, the struct will be laid out such that b will immediately follow a in memory and there will be no bounds checking. So accessing a[6] will effectively be the same as b[2] and will not cause any sort of exception.

Given

struct S {
  int a[4];
  int b[4];
} s;

and assuming no extra padding, the structure is really just a way of looking at a block of memory containing 8 integers. You could cast its address to (int*), and ((int*)&s)[6] would point to the same memory as s.b[2].

Should you rely on this sort of behavior? Absolutely not. Undefined means that the compiler doesn't have to support this. The compiler is free to pad the structure which could render the assumption that &(s.b[2]) == &(s.a[6]) incorrect. The compiler could also add bounds checking on the array access (although enabling compiler optimizations would probably disable such a check).

I have experienced the effects of this in the past. It's quite common to have a struct like this

struct Bob {
    char name[16];
    char whatever[64];
} bob;
strcpy(bob.name, "some name longer than 16 characters");

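Now bob.whatever will be " than 16 characters" (which is why you should always use a bounded copy such as strncpy, BTW; note that strncpy does not null-terminate the destination when the source is too long, so you have to terminate it yourself).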
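A bounded copy could look like the sketch below. The helper name `set_name` is hypothetical. The explicit terminator is there because strncpy does not write one when the source fills the whole buffer.

```c
#include <string.h>

struct Bob {
    char name[16];
    char whatever[64];
};

/* Copies src into bob->name, truncating if necessary, and always leaves
   name null-terminated. strncpy alone does not add a terminator when
   src has 16 or more characters, hence the explicit write. */
void set_name(struct Bob *bob, const char *src)
{
    strncpy(bob->name, src, sizeof bob->name - 1);
    bob->name[sizeof bob->name - 1] = '\0';
}
```

With the string from the example above, name ends up holding the first 15 characters and whatever is left untouched.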

Solution 5 - C++

As @MartinJames mentioned in a comment, if you need to guarantee that a and b are in contiguous memory (or at least able to be treated as such, (edit) unless your architecture/compiler uses an unusual memory block size/offset and forced alignment that would require padding to be added), you need to use a union.

union overlap {
    char all[8]; /* all the bytes in sequence */
    struct { /* (anonymous struct so its members can be accessed directly) */
        char a[4]; /* padding may be added after this if the alignment is not a sub-factor of 4 */
        char b[4];
    };
};

You can't directly access b from a (e.g. a[6], like you asked), but you can access the elements of both a and b by using all (e.g. all[6] refers to the same memory location as b[2]).

(Edit: You could replace 8 and 4 in the code above with 2*sizeof(int) and sizeof(int), respectively, to be more likely to match the architecture's alignment, especially if the code needs to be more portable, but then you have to be careful to avoid making any assumptions about how many bytes are in a, b, or all. However, this will work on what are probably the most common (1-, 2-, and 4-byte) memory alignments.)

Here is a simple example:

#include <stdio.h>

union overlap {
    char all[2*sizeof(int)]; /* all the bytes in sequence */
    struct { /* anonymous struct so its members can be accessed directly */
        char a[sizeof(int)]; /* low word */
        char b[sizeof(int)]; /* high word */
    };
};

int main()
{
    union overlap testing;
    testing.a[0] = 'a';
    testing.a[1] = 'b';
    testing.a[2] = 'c';
    testing.a[3] = '\0'; /* null terminator */
    testing.b[0] = 'e';
    testing.b[1] = 'f';
    testing.b[2] = 'g';
    testing.b[3] = '\0'; /* null terminator */
    printf("a=%s\n",testing.a); /* output: a=abc */
    printf("b=%s\n",testing.b); /* output: b=efg */
    printf("all=%s\n",testing.all); /* output: all=abc */
    
    testing.a[3] = 'd'; /* makes printf keep reading past the end of a */
    printf("a=%s\n",testing.a); /* output: a=abcdefg */
    printf("b=%s\n",testing.b); /* output: b=efg */
    printf("all=%s\n",testing.all); /* output: all=abcdefg */

    return 0;
}

Solution 6 - C++

No, since accessing an array out of bounds invokes Undefined Behavior, both in C and C++.

Solution 7 - C++

Short Answer: No. You're in the land of undefined behavior.

Long Answer: No. But that doesn't mean that you can't access the data in other sketchier ways... if you're using GCC you can do something like the following (elaboration of dwillis's answer):

struct __attribute__((packed,aligned(4))) Bad_Access {
    int arr1[3];
    int arr2[3];
};

and then you could access via (Godbolt source+asm):

int x = ((int*)ba_pointer)[4];

But that cast violates strict aliasing so is only safe with g++ -fno-strict-aliasing. You can cast a struct pointer to a pointer to the first member, but then you're back in the UB boat because you're accessing outside the first member.

Alternatively, just don't do that. Save a future programmer (probably yourself) the heartache of that mess.

Also, while we're at it, why not use std::vector? It's not fool-proof, but its at() member function checks bounds and throws std::out_of_range instead of silently reading a neighboring object.

Addendum:

If you're really concerned about performance:

Let's say you have two same-typed pointers that you're accessing. The compiler will more than likely assume that both pointers have the chance to interfere, and will instantiate additional logic to protect you from doing something dumb.

If you solemnly swear to the compiler that you're not trying to alias, the compiler will reward you handsomely: https://stackoverflow.com/questions/1965487/does-the-restrict-keyword-provide-significant-benefits-in-gcc-g
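In C (C99 and later) that promise is spelled with the restrict qualifier. The following is a minimal sketch with a hypothetical function name, `add_into`; whether it actually produces faster code depends on the compiler and flags.

```c
#include <stddef.h>

/* Adds src[i] to dst[i] for i in [0, n). The restrict qualifiers are a
   promise by the caller that dst and src do not overlap; breaking that
   promise is undefined behavior. */
void add_into(int *restrict dst, const int *restrict src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] += src[i];
    }
}
```

Standard C++ has no restrict keyword; GCC and Clang accept `__restrict__` as an extension.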

Conclusion: Don't be evil; your future self and the compiler will thank you.

Solution 8 - C++

Jed Schaaf’s answer is on the right track, but not quite correct. If the compiler inserts padding between a and b, his solution will still fail. If, however, you declare:

typedef struct {
  int a[4];
  int b[4];
} s_t;

typedef union {
  char bytes[sizeof(s_t)];
  s_t s;
} u_t;

You may now access (int*)(bytes + offsetof(s_t, b)) to get the address of s.b, no matter how the compiler lays out the structure. The offsetof() macro is declared in <stddef.h>.
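Here is one way this might be used, as a sketch only. The helper `read_b_via_bytes` is hypothetical, and it reads through memcpy rather than through a cast to int* so that it does not depend on alignment or aliasing rules.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    int a[4];
    int b[4];
} s_t;

typedef union {
    char bytes[sizeof(s_t)];
    s_t s;
} u_t;

/* Reads element i (0..3) of s.b through the byte view of the union.
   The start of b is located with offsetof, so any padding the compiler
   inserted before b is accounted for. */
int read_b_via_bytes(const u_t *u, size_t i)
{
    int value;
    memcpy(&value, u->bytes + offsetof(s_t, b) + i * sizeof(int), sizeof value);
    return value;
}
```

After `u.s.b[2] = 42;`, the call `read_b_via_bytes(&u, 2)` returns 42.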

The expression sizeof(s_t) is a constant expression, legal in an array declaration in both C and C++. It will not give a variable-length array. (Apologies for misreading the C standard before. I thought that sounded wrong.)

In the real world, though, two consecutive arrays of int in a structure are going to be laid out the way you expect. (You might be able to engineer a very contrived counterexample by setting the bound of a to 3 or 5 instead of 4 and then getting the compiler to align both a and b on a 16-byte boundary.) Rather than convoluted methods to try to get a program that makes no assumptions whatsoever beyond the strict wording of the standard, you want some kind of defensive coding, such as static_assert(&both_arrays[4] == &s.b[0], "");. These add no run-time overhead and will fail if your compiler is doing something that would break your program, so long as you don’t trigger UB in the assertion itself.
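A variant of that defensive check can be written purely in terms of offsetof and sizeof, which are constant expressions. The sketch below uses C11's _Static_assert (spelled static_assert in C++); the struct name S and the helper `offset_of_b_elem` are assumptions for illustration. It verifies layout only and does not make an out-of-bounds subscript defined.

```c
#include <stddef.h>

struct S {
    int a[4];
    int b[4];
};

/* Compilation fails if the compiler placed padding between a and b,
   or after b. */
_Static_assert(offsetof(struct S, b) == sizeof(int[4]),
               "unexpected padding between a and b");
_Static_assert(sizeof(struct S) == 2 * sizeof(int[4]),
               "unexpected padding after b");

/* Hypothetical helper: byte offset of element i of b within struct S. */
size_t offset_of_b_elem(size_t i)
{
    return offsetof(struct S, b) + i * sizeof(int);
}
```

When both assertions hold, `offset_of_b_elem(2)` equals `6 * sizeof(int)`, which is the layout the question assumes.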

If you want a portable way to guarantee that both sub-arrays are packed into a contiguous memory range, or split a block of memory the other way, you can copy them with memcpy().
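For instance, the copying could be sketched like this; the function names `pack` and `unpack` are hypothetical. Because each member is copied separately, any padding inside the struct never reaches the flat array.

```c
#include <string.h>

struct S {
    int a[4];
    int b[4];
};

/* Copies a and then b into out, giving one contiguous array of 8 ints
   regardless of any padding inside struct S. */
void pack(const struct S *s, int out[8])
{
    memcpy(out, s->a, sizeof s->a);
    memcpy(out + 4, s->b, sizeof s->b);
}

/* The reverse: splits a contiguous array of 8 ints back into a and b. */
void unpack(struct S *s, const int in[8])
{
    memcpy(s->a, in, sizeof s->a);
    memcpy(s->b, in + 4, sizeof s->b);
}
```

After `pack(&s, all)`, `all[6]` holds the same value as `s.b[2]`, and every subscript stays within the bounds of the array it is applied to.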

Solution 9 - C++

The Standard does not impose any restrictions upon what implementations must do when a program tries to use an out-of-bounds array subscript in one structure field to access a member of another. Out-of-bounds accesses are thus "illegal" in strictly conforming programs, and programs which make use of such accesses cannot simultaneously be 100% portable and free of errors. On the other hand, many implementations do define the behavior of such code, and programs which are targeted solely at such implementations may exploit such behavior.

There are three issues with such code:

  1. While many implementations lay out structures in predictable fashion, the Standard allows implementations to add arbitrary padding before any structure member other than the first. Code could use sizeof or offsetof to ensure that structure members are placed as expected, but the other two issues would remain.

  2. Given something like:

     if (structPtr->array1[x])
      structPtr->array2[y]++;
     return structPtr->array1[x];
    

    it would normally be useful for a compiler to assume that the use of structPtr->array1[x] will yield the same value as the preceding use in the "if" condition, even though it would change the behavior of code that relies upon aliasing between the two arrays.

  3. If array1[] has e.g. 4 elements, a compiler given something like:

     if (x < 4) foo(x);
     structPtr->array1[x]=1;
    

    might conclude that since there would be no defined cases where x isn't less than 4, it could call foo(x) unconditionally.
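One way to keep such code within defined behavior is to make the range check guard the store itself, so the inference in item 3 is actually true. This is only a sketch; the struct name `two_arrays` and the helper `set_array1` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct two_arrays {
    int array1[4];
    int array2[4];
};

/* Stores value into p->array1[x] only when x is a valid index. */
bool set_array1(struct two_arrays *p, size_t x, int value)
{
    if (x >= sizeof p->array1 / sizeof p->array1[0]) {
        return false;        /* out of range: nothing is written */
    }
    p->array1[x] = value;
    return true;
}
```

With this shape, a call with x equal to 6 reports failure and leaves array2 untouched instead of relying on how the two arrays happen to be laid out.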

Unfortunately, while programs can use sizeof or offsetof to ensure that there aren't any surprises with struct layout, there's no way by which they can test whether compilers promise to refrain from the optimizations of types #2 or #3. Further, the Standard is a little vague about what would be meant in a case like:

struct foo {char array1[4],array2[4]; };

int test(struct foo *p, int i, int x, int y, int z)
{
  if (p->array2[x])
  {
    ((char*)p)[x]++;
    ((char*)(p->array1))[y]++;
    p->array1[z]++;
  }
  return p->array2[x];
}

The Standard is pretty clear that behavior would only be defined if z is in the range 0..3, but since the type of p->array1 in that expression is char* (due to decay) it's not clear the cast in the access using y would have any effect. On the other hand, since converting a pointer to the first member of a struct to char* should yield the same result as converting a struct pointer to char*, and the converted struct pointer should be usable to access all bytes therein, it would seem the access using x should be defined for (at minimum) x=0..7 [if the offset of array2 is greater than 4, it would affect the value of x needed to hit members of array2, but some value of x could do so with defined behavior].

IMHO, a good remedy would be to define the subscript operator on array types in a fashion that does not involve pointer decay. In that case, the expressions p->array1[x] and &(p->array1[x]) could invite a compiler to assume that x is 0..3, but p->array1+x and *(p->array1+x) would require a compiler to allow for the possibility of other values. I don't know if any compilers do that, but the Standard doesn't require it.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Nikolai | View Question on Stackoverflow
Solution 1 - C++ | msc | View Answer on Stackoverflow
Solution 2 - C++ | alinsoar | View Answer on Stackoverflow
Solution 3 - C++ | Stephan Lechner | View Answer on Stackoverflow
Solution 4 - C++ | dwilliss | View Answer on Stackoverflow
Solution 5 - C++ | Jed Schaaf | View Answer on Stackoverflow
Solution 6 - C++ | gsamaras | View Answer on Stackoverflow
Solution 7 - C++ | Alex Shirley | View Answer on Stackoverflow
Solution 8 - C++ | Davislor | View Answer on Stackoverflow
Solution 9 - C++ | supercat | View Answer on Stackoverflow