If registers are so blazingly fast, why don't we have more of them?

Performance History Cpu Registers Assembly

Performance Problem Overview

In 32bit, we had 8 "general purpose" registers. With 64bit, the amount doubles, but it seems independent of the 64bit change itself.
Now, if registers are so fast (no memory access), why aren't there more of them naturally? Shouldn't CPU builders work as many registers as possible into the CPU? What is the logical restriction to why we only have the amount we have?

Performance Solutions

Solution 1 - Performance

There's many reasons you don't just have a huge number of registers:

They're highly linked to most pipeline stages. For starters, you need to track their lifetime, and forward results back to previous stages. The complexity gets intractable very quickly, and the number of wires (literally) involved grows at the same rate. It's expensive on area, which ultimately means it's expensive on power, price and performance after a certain point.
It takes up instruction encoding space. 16 registers takes up 4 bits for source and destination, and another 4 if you have 3-operand instructions (e.g ARM). That's an awful lot of instruction set encoding space taken up just to specify the register. This eventually impacts decoding, code size and again complexity.
There's better ways to achieve the same result...

These days we really do have lots of registers - they're just not explicitly programmed. We have "register renaming". While you only access a small set (8-32 registers), they're actually backed by a much larger set (e.g 64-256). The CPU then tracks the visibility of each register, and allocates them to the renamed set. For example, you can load, modify, then store to a register many times in a row, and have each of these operations actually performed independently depending on cache misses etc. In ARM:

ldr r0, [r4]
add r0, r0, #1
str r0, [r4]
ldr r0, [r5]
add r0, r0, #1
str r0, [r5]

Cortex A9 cores do register renaming, so the first load to "r0" actually goes to a renamed virtual register - let's call it "v0". The load, increment and store happen on "v0". Meanwhile, we also perform a load/modify/store to r0 again, but that'll get renamed to "v1" because this is an entirely independent sequence using r0. Let's say the load from the pointer in "r4" stalled due to a cache miss. That's ok - we don't need to wait for "r0" to be ready. Because it's renamed, we can run the next sequence with "v1" (also mapped to r0) - and perhaps that's a cache hit and we just had a huge performance win.

ldr v0, [v2]
add v0, v0, #1
str v0, [v2]
ldr v1, [v3]
add v1, v1, #1
str v1, [v3]

I think x86 is up to a gigantic number of renamed registers these days (ballpark 256). That would mean having 8 bits times 2 for every instruction just to say what the source and destination is. It would massively increase the number of wires needed across the core, and its size. So there's a sweet spot around 16-32 registers which most designers have settled for, and for out-of-order CPU designs, register renaming is the way to mitigate it.

Edit: The importance of out-of-order execution and register renaming on this. Once you have OOO, the number of registers doesn't matter so much, because they're just "temporary tags" and get renamed to the much larger virtual register set. You don't want the number to be too small, because it gets difficult to write small code sequences. This is a problem for x86-32, because the limited 8 registers means a lot of temporaries end up going through the stack, and the core needs extra logic to forward reads/writes to memory. If you don't have OOO, you're usually talking about a small core, in which case a large register set is a poor cost/performance benefit.

So there's a natural sweet spot for register bank size which maxes out at about 32 architected registers for most classes of CPU. x86-32 has 8 registers and it's definitely too small. ARM went with 16 registers and it's a good compromise. 32 registers is slightly too many if anything - you end up not needing the last 10 or so.

None of this touches on the extra registers you get for SSE and other vector floating point coprocessors. Those make sense as an extra set because they run independently of the integer core, and don't grow the CPU's complexity exponentially.

Solution 2 - Performance

We Do Have More of Them

Because almost every instruction must select 1, 2, or 3 architecturally visible registers, expanding the number of them would increase code size by several bits on each instruction and so reduce code density. It also increases the amount of context that must be saved as thread state, and partially saved in a function's activation record. These operations occur frequently. Pipeline interlocks must check a scoreboard for every register and this has quadratic time and space complexity. And perhaps the biggest reason is simply compatibility with the already-defined instruction set.

But it turns out, thanks to register renaming, we really do have lots of registers available, and we don't even need to save them. The CPU actually has many register sets, and it automatically switches between them as your code exeutes. It does this purely to get you more registers.

Example:

load  r1, a  # x = a
store r1, x
load  r1, b  # y = b
store r1, y

In an architecture that has only r0-r7, the following code may be rewritten automatically by the CPU as something like:

load  r1, a
store r1, x
load  r10, b
store r10, y

In this case r10 is a hidden register that is substituted for r1 temporarily. The CPU can tell that the the value of r1 is never used again after the first store. This allows the first load to be delayed (even an on-chip cache hit usually takes several cycles) without requiring the delay of the second load or the second store.

Solution 3 - Performance

They add registers all of the time, but they are often tied to special purpose instructions (e.g. SIMD, SSE2, etc) or require compiling to a specific CPU architecture, which lowers portability. Existing instructions often work on specific registers and couldn't take advantage of other registers if they were available. Legacy instruction set and all.

Solution 4 - Performance

To add a little interesting info here you'll notice that having 8 same sized registers allows opcodes to maintain consistency with hexadecimal notation. For example the instruction push ax is opcode 0x50 on x86 and goes up to 0x57 for the last register di. Then the instruction pop ax starts at 0x58 and goes up to 0x5F pop di to complete the first base-16. Hexadecimal consistency is maintained with 8 registers per a size.

Content Type	Original Author	Original Content on Stackoverflow
Question	Xeo	View Question on Stackoverflow
Solution 1 - Performance	John Ripley	View Answer on Stackoverflow
Solution 2 - Performance	DigitalRoss	View Answer on Stackoverflow
Solution 3 - Performance	Seth Robertson	View Answer on Stackoverflow
Solution 4 - Performance	user922475	View Answer on Stackoverflow