Do x86 instructions require their own encoding as well as all of their arguments to be present in memory at the same time?


Assembly Problem Overview

I am trying to figure out whether it is possible to run a Linux VM whose RAM is only backed by a single physical page.

To simulate this, I modified the nested page fault handler in KVM to remove the present bit from all nested page table (NPT) entries, except the one corresponding to the currently processed page fault.

While trying to start a Linux guest, I observed that assembly instructions that use memory operands, like

add [rbp+0x820DDA], ebp

lead to a page fault loop until I restore the present bit for the page containing the instruction as well as for the page referenced in the operand (in this example [rbp+0x820DDA]).

I am wondering why this is the case. Shouldn't the CPU access the memory pages sequentially, i.e. first read the instruction and then access the memory operand? Or does x86 require that the instruction page as well as all operand pages are accessible at the same time?

I am testing on AMD Zen 1.

Assembly Solutions

Solution 1 - Assembly

Yes, executing an instruction requires its machine code and all of its memory operands to be present at the same time.

> Shouldn't the CPU access the memory pages sequentially, i.e. first read the instruction and then access the memory operand?

Yes, that's logically what happens, but a page-fault exception interrupts that 2-step process and discards any progress. The CPU has no way to remember which instruction it was in the middle of when the page fault occurred.

When a page-fault handler returns after handling a valid page fault, RIP = the address of the faulting instruction, so the CPU retries executing it from scratch.
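
The retry-from-scratch behaviour can be illustrated with a toy model. This is only a sketch under stated assumptions, not how KVM or the hardware is implemented: `run_instruction`, `resident_limit` and the FIFO eviction policy are hypothetical names and choices made for illustration. The handler maps the faulting page and unmaps the oldest pages beyond the limit, and every fault restarts the instruction at code fetch.

```python
from collections import OrderedDict

def run_instruction(needed_pages, resident_limit, max_faults=20):
    """Toy model: return the number of faults taken before the instruction
    retires, or None if it still has not retired after max_faults faults.

    needed_pages   -- pages touched in order (code fetch first, then operands)
    resident_limit -- how many pages the (hypothetical) handler keeps present
    """
    present = OrderedDict()  # pages with the present bit set, oldest first
    faults = 0
    while faults <= max_faults:
        for page in needed_pages:
            if page not in present:
                # Page fault: the handler maps this page and evicts the oldest
                # pages beyond the limit, then the instruction restarts.
                faults += 1
                present[page] = True
                while len(present) > resident_limit:
                    present.popitem(last=False)
                break  # all progress is discarded; retry from code fetch
        else:
            return faults  # every access hit a present page: instruction retires
    return None

print(run_instruction(["code", "data"], resident_limit=1))  # None: fault loop
print(run_instruction(["code", "data"], resident_limit=2))  # 2
```

With a limit of one page, mapping the operand page unmaps the code page and vice versa, which is the fault loop described in the question.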

It would be legal for the OS to modify the machine code of the faulting instruction and expect it to execute a different instruction after iret from the page-fault handler (or any other exception or interrupt handler). So AFAIK it's architecturally required that the CPU redoes code-fetch from CS:RIP in the case you're talking about. (Assuming it even does return to the faulting CS:RIP instead of scheduling another process while waiting for disk on hard page fault, or delivering a SIGSEGV to a signal handler on an invalid page fault.)

It's probably also architecturally required for hypervisor entry/exit. And even if it's not explicitly forbidden on paper, it's not how CPUs work.

@torek comments that some (CISC) microprocessors partially decode instructions and dump microregister state on a page fault, but x86 is not like that.

A few instructions are interruptible and can make partial progress, like rep movs (memcpy in a can) and other string instructions, or gather loads / scatter stores. These are instructions that do multiple separate data accesses. But the only mechanism is updating architectural registers: RCX / RSI / RDI for string ops, or the destination and mask registers for gathers (see e.g. the manual entry for AVX2 vpgatherdd). The CPU does not keep the opcode or decode results in some hidden internal register and resume from them after iret from a page-fault handler.
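
As a rough illustration of progress being kept only in architectural registers, here is a small model of rep movsb. It is a sketch, not a description of real microcode: the `regs` dictionary, `rep_movsb` and `copy_with_faults` are hypothetical, memory is a plain dict of bytes, and the page holding the instruction itself is ignored.

```python
PAGE_SIZE = 4096

def rep_movsb(regs, mem, present):
    """One execution attempt of a modelled `rep movsb`.

    Returns the faulting address, or None once RCX has reached 0.
    On a fault, regs already reflect every completed iteration.
    """
    while regs["rcx"] > 0:
        for addr in (regs["rsi"], regs["rdi"]):
            if addr // PAGE_SIZE not in present:
                return addr  # fault; the current iteration did not happen
        mem[regs["rdi"]] = mem.get(regs["rsi"], 0)
        regs["rsi"] += 1
        regs["rdi"] += 1
        regs["rcx"] -= 1
    return None

def copy_with_faults(regs, mem):
    """Re-execute the same instruction after each fault, mapping the faulting page."""
    present = set()
    faults = 0
    while True:
        addr = rep_movsb(regs, mem, present)
        if addr is None:
            return faults
        present.add(addr // PAGE_SIZE)
        faults += 1

regs = {"rcx": 4, "rsi": 0x1FFE, "rdi": 0x5000}   # source crosses a page boundary
mem = {0x1FFE: 1, 0x1FFF: 2, 0x2000: 3, 0x2001: 4}
faults = copy_with_faults(regs, mem)
print(faults, regs["rcx"], [mem[0x5000 + i] for i in range(4)])  # 3 0 [1, 2, 3, 4]
```

The third fault arrives with RCX already decremented to 2, so the restarted instruction continues from there instead of redoing the first two bytes.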

Also keep in mind that x86 (like most ISAs) guarantees that instructions are atomic wrt. interrupts / exceptions: they either fully happen, or don't happen at all, before an interrupt. So for example add [mem], reg would be required to discard the load if the store part faulted, even without a lock prefix.

The worst-case number of guest user-space pages that must be present to make forward progress might be 6 (plus separate guest-kernel page-table subtrees for each one):

  • movsq or movsw 2-byte instruction spanning a page boundary, so both pages are needed for it to decode.
  • qword source operand [rsi] also a page-split
  • qword destination operand [rdi] also a page-split

If any of these 6 pages fault, we're back to square one.
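
The count of 6 can be checked with simple page arithmetic. This is a minimal sketch assuming 4 KiB pages; `pages_touched`, `footprint` and the example addresses are made up for illustration.

```python
PAGE_SIZE = 4096

def pages_touched(addr, size):
    """Set of page numbers covered by the byte range [addr, addr + size)."""
    first = addr // PAGE_SIZE
    last = (addr + size - 1) // PAGE_SIZE
    return set(range(first, last + 1))

def footprint(accesses):
    """Union of pages over a list of (address, size) accesses."""
    pages = set()
    for addr, size in accesses:
        pages |= pages_touched(addr, size)
    return pages

# Hypothetical worst case for movsq: everything straddles a page boundary.
worst = [
    (0x400FFF, 2),  # 2-byte instruction encoding
    (0x601FFC, 8),  # qword source [rsi]
    (0x803FF9, 8),  # qword destination [rdi]
]
# Same instruction with nothing crossing a page boundary.
aligned = [(0x400000, 2), (0x601000, 8), (0x803000, 8)]

print(len(footprint(worst)), len(footprint(aligned)))  # 6 3
```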

rep movsd is also a 2-byte instruction, and making progress on one step of it would have the same requirement. Similar cases like push [mem] or pop [mem] could be constructed with a misaligned stack.

One reason for (or side benefit of) making gather loads / scatter stores "interruptible" (updating the mask vector with their progress) is to avoid increasing this minimum footprint needed to execute a single instruction. It also makes handling multiple faults during one gather or scatter more efficient.

@Brandon points out in comments that a guest will need its page tables in memory, and the user-space page splits can also fall on 1GiB boundaries (different page-directory sub-trees) or even 512GiB boundaries, so the two sides are in different sub-trees of the top-level PML4. HW page walk will need to touch all of these guest page-table pages to make progress. A situation this pathological is unlikely to happen by chance.
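
To see at which page-table level the two sides of a split diverge, the virtual address can be broken into its table indices. This sketch assumes ordinary 4-level paging with 4 KiB pages (9 index bits per level); `pt_indices` and `first_differing_level` are hypothetical helper names.

```python
def pt_indices(vaddr):
    """(PML4, PDPT, PD, PT) indices of a virtual address, 4-level paging."""
    return tuple((vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12))

def first_differing_level(a, b):
    """Name of the highest table level whose index differs, or None if same page."""
    names = ("PML4", "PDPT", "PD", "PT")
    for name, ia, ib in zip(names, pt_indices(a), pt_indices(b)):
        if ia != ib:
            return name
    return None

print(first_differing_level(0x1FFF, 0x2000))              # PT
print(first_differing_level(0x3FFFFFFF, 0x40000000))      # PDPT (1 GiB boundary)
print(first_differing_level(0x7FFFFFFFFF, 0x8000000000))  # PML4 (512 GiB boundary)
```

The higher the level at which the two sides diverge, the fewer page-table pages they share, so the more guest page-table pages the hardware walk has to reach for one access.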

The TLB (and page-walker internals) are allowed to cache some of the page-table data, and aren't required to restart page-walk from scratch unless the OS did invlpg or set a new CR3 top-level page directory. Neither of these is necessary when changing a page from not-present to present; x86 on paper guarantees that it's not needed (so "negative caching" of not-present PTEs isn't allowed, at least not visible to software). So the CPU might not VMexit even if some of the guest-physical page-table pages are not actually present.

PMU performance counters can be enabled and configured such that the instruction also requires a perf event to write into a PEBS buffer for that instruction. With a counter's mask configured to count only user-space instructions, not kernel, it could well be that it keeps trying to overflow the counter and store a sample in the buffer every time you return to user-space, producing a page fault.


All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | savvybug | View Question on Stackoverflow
Solution 1 - Assembly | Peter Cordes | View Answer on Stackoverflow