What does it mean to align the stack?

C Problem Overview

I have been a high-level coder, and architectures are pretty new to me, so I decided to read the tutorial on Assembly here:

http://en.wikibooks.org/wiki/X86_Assembly/Print_Version

Far down the tutorial, instructions on how to convert the Hello World! program

#include <stdio.h>

int main(void) {
    printf("Hello, world!\n");
    return 0;
}

into equivalent assembly code was given and the following was generated:

        .text
LC0:
        .ascii "Hello, world!\12\0"
.globl _main
_main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $8, %esp
        andl    $-16, %esp
        movl    $0, %eax
        movl    %eax, -4(%ebp)
        movl    -4(%ebp), %eax
        call    __alloca
        call    ___main
        movl    $LC0, (%esp)
        call    _printf
        movl    $0, %eax
        leave
        ret

For one of the lines,

andl    $-16, %esp

the explanation was:

> This code "and"s ESP with 0xFFFFFFF0, > aligning the stack with the next > lowest 16-byte boundary. An > examination of Mingw's source code > reveals that this may be for SIMD > instructions appearing in the "_main" > routine, which operate only on aligned > addresses. Since our routine doesn't > contain SIMD instructions, this line > is unnecessary.

I do not understand this point. Can someone give me an explanation of what it means to align the stack with the next 16-byte boundary and why it is required? And how is the andl achieving this?

C Solutions

Solution 1 - C

Assume the stack looks like this on entry to _main (the address of the stack pointer is just an example):

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230

Push %ebp, and subtract 8 from %esp to reserve some space for local variables:

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230
|      %ebp       |
+-----------------+  <--- 0xbfff122c
:    reserved     :
:     space       :
+-----------------+  <--- 0xbfff1224

Now, the andl instruction zeroes the low 4 bits of %esp, which may decrease it; in this particular example, it has the effect of reserving an additional 4 bytes:

|    existing     |
|  stack content  |
+-----------------+  <--- 0xbfff1230
|      %ebp       |
+-----------------+  <--- 0xbfff122c
:    reserved     :
:     space       :
+ - - - - - - - - +  <--- 0xbfff1224
:   extra space   :
+-----------------+  <--- 0xbfff1220

The point of this is that there are some "SIMD" (Single Instruction, Multiple Data) instructions (also known in x86-land as "SSE" for "Streaming SIMD Extensions") which can perform parallel operations on multiple words in memory, but require those multiple words to be a block starting at an address which is a multiple of 16 bytes.

In general, the compiler can't assume that particular offsets from %esp will result in a suitable address (because the state of %esp on entry to the function depends on the calling code). But, by deliberately aligning the stack pointer in this way, the compiler knows that adding any multiple of 16 bytes to the stack pointer will result in a 16-byte aligned address, which is safe for use with these SIMD instructions.

Solution 2 - C

This does not sound to be stack specific, but alignment in general. Perhaps think of the term integer multiple.

If you have items in memory that are a byte in size, units of 1, then lets just say they are all aligned. Things that are two bytes in size, then integers times 2 will be aligned, 0, 2, 4, 6, 8, etc. And non-integer multiples, 1, 3, 5, 7 will not be aligned. Items that are 4 bytes in size, integer multiples 0, 4, 8, 12, etc are aligned, 1,2,3,5,6,7, etc are not. Same goes for 8, 0,8,16,24 and 16 16,32,48,64, and so on.

What this means is you can look at the base address for the item and determine if it is aligned.

size in bytes, address in the form of
1, xxxxxxx
2, xxxxxx0
4, xxxxx00
8, xxxx000
16,xxx0000
32,xx00000
64,x000000
and so on

In the case of a compiler mixing in data with instructions in the .text segment it is fairly straightforward to align data as needed (well, depends on the architecture). But the stack is a runtime thing, the compiler cannot normally determine where the stack will be at run time. So at runtime if you have local variables that need to be aligned you would need to have the code adjust the stack programmatically.

Say for example you have two 8 byte items on the stack, 16 total bytes, and you really want them aligned (on 8 byte boundaries). On entry the function would subtract 16 from the stack pointer as usual to make room for these two items. But to align them there would need to be more code. If we wanted these two 8 byte items aligned on 8 byte boundaries and the stack pointer after subtracting 16 was 0xFF82, well the lower 3 bits are not 0 so it is not aligned. The lower three bits are 0b010. In a generic sense we want to subtract 2 from the 0xFF82 to get 0xFF80. How we determine it is a 2 would be by anding with 0b111 (0x7) and subtracting that amount. That means to alu operations an and and a subtract. But we can take a shortcut if we and with the ones complement value of 0x7 (~0x7 = 0xFFFF...FFF8) we get 0xFF80 using one alu operation (so long as the compiler and processor have a single opcode way to do that, if not it may cost you more than the and and subtract).

This appears to be what your program was doing. Anding with -16 is the same as anding with 0xFFFF....FFF0, resulting in an address that is aligned on a 16 byte boundary.

So to wrap this up, if you have something like a typical stack pointer that works its way down memory from higher addresses to lower addresses, then you want to

sp = sp & ((n-1))

where n is the number of bytes to align (must be powers but that is okay most alignment usually involves powers of two). If you have say done a malloc (addresses increase from low to high) and want to align the address of something (remember to malloc more than you need by at least the alignment size) then

if(ptr&((n-)) { ptr = (ptr+n)&(~(n-1)); }

Or if you want just take the if out there and perform the add and mask every time.

many/most non-x86 architectures have alignment rules and requirements. x86 is overly flexible as far as the instruction set goes, but as far as execution goes you can/will pay a penalty for unaligned accesses on an x86, so even though you can do it you should strive to stay aligned as you would with any other architecture. Perhaps that is what this code was doing.

Solution 3 - C

This has to do with byte alignment. Certain architectures require addresses used for a specific set of operations be aligned to specific bit boundaries.

That is, if you wanted 64-bit alignment for a pointer, for example, then you could conceptually divide the entire addressable memory into 64-bit chunks starting at zero. An address would be "aligned" if it fit exactly into one of these chunks, and not aligned if it took part of one chunk and part of another.

A significant feature of byte alignment (assuming the number is a power of 2) is that the least-significant X bits of the address are always zero. This allows the processor to represent more addresses with fewer bits by simply not using the bottom X bits.

Solution 4 - C

Imagine this "drawing"

addresses
xxx0123456789abcdef01234567 ...
[------][------][------] ...
registers

Values at addresses multiple of 8 "slide" easily into (64-bit) registers

addresses
56789abc ...
[------][------][------] ...
registers

Of course registers "walk" in steps of 8 bytes

Now if you want to put the value at address xxx5 into a register is much more difficult :-)

Edit andl -16

-16 is 11111111111111111111111111110000 in binary

when you "and" anything with -16 you get a value with the last 4 bits set to 0 ... or a multitple of 16.

Solution 5 - C

When the processor loads data from memory into a register, it needs to access by a base address and a size. For example, it will fetch 4 bytes from address 10100100. Notice that there are two zeros at the end of that example. That's because the four bytes are stored so that the 101001 leading bits are significant. (The processor really accesses these through a "don't care" by fetching 101001XX.)

So to align something in memory means to rearrange data (usually through padding) so that the desired item's address will have enough zero bytes. Continuing the above example, we can't fetch 4 bytes from 10100101 since the last two bits aren't zero; that would cause a bus error. So we must bump the address up to 10101000 (and waste three address locations in the process).

The compiler does this for you automatically and is represented in the assembly code.

Note that this is manifest as an optimization in C/C++:

struct first {
    char letter1;
    int number;
    char letter2;
};

struct second {
    int number;
    char letter1;
    char letter2;
};

int main ()
{
    cout << "Size of first: " << sizeof(first) << endl;
    cout << "Size of second: " << sizeof(second) << endl;
    return 0;
}

The output is

Size of first: 12
Size of second: 8

Rearranging the two char's means that the int will be aligned properly, and so the compiler doesn't have to bump the base address via padding. That's why the size of the second is smaller.

Solution 6 - C

It should only be at even addresses, not at the odd ones, because there is a performance deficit accessing them.

Content Type	Original Author	Original Content on Stackoverflow
Question	Legend	View Question on Stackoverflow
Solution 1 - C	Matthew Slattery	View Answer on Stackoverflow
Solution 2 - C	old_timer	View Answer on Stackoverflow
Solution 3 - C	tylerl	View Answer on Stackoverflow
Solution 4 - C	pmg	View Answer on Stackoverflow
Solution 5 - C	chrisaycock	View Answer on Stackoverflow
Solution 6 - C	AndreKR	View Answer on Stackoverflow