Computers are fascinating machines, and the vehicle for most of the innovation happening today. Most software engineers write application programs in high-level languages and thus don't get to appreciate the layers beneath. This article delineates some details at the boundary of software and hardware. It is inspired by the discussions in the paper-reading session where Vijay, Vinayak, Chirag, Himanshu and I read the paper "A Comparison of Software and Hardware Techniques for x86 Virtualization". This paper was published by two VMware engineers while I was at VMware, and reading it again helped me brush up on the main ideas behind software virtualization, specifically binary translation, and on how processors work in general.
A computer has three prominent parts: (a) CPU (b) Memory (c) I/O. In this article we will study CPU and Memory, focusing on the hardware-software boundary.
CPU - "Fetch and Execute"
A CPU, at its core, has a straightforward life cycle - fetch, decode, execute - instruction after instruction. In the next few paragraphs we elucidate what this means.
In order to understand "fetch-decode-execute", we should visualize the processor as consisting of registers, plus some circuitry (the ALUs - arithmetic logic units) to do operations on the values in those registers. (Often the values to be operated upon come from main memory too.)
An instruction is a sequence of bits. What else could it be! There is, however, an elaborate structure to the bits in an instruction, described by the ISA (Instruction Set Architecture) of the processor. Typically, the first few bits encode the operation (load vs store vs add vs multiply etc.). The next few bits may specify the first operand, and the subsequent bits the second operand.
For instance, consider the following bit sequence: 01001000 10000011 11000000 00100010 (in hex: 48 83 C0 22).
On x86, the assembly corresponding to this machine code is: add rax, 34 - i.e. add 34 to the contents of register rax. Here is the byte-by-byte interpretation:
- The first byte (48) is the REX prefix that enables 64-bit operation.
- The second byte (83) denotes that this is an add instruction with 8-bit immediate.
- The third byte (C0) contains ModR/M information specifying register-direct mode and the rax register.
- The fourth byte (22) is the 8-bit immediate: 0x22, i.e. 34 in decimal.
So, the job of the processor is - fetch the next instruction from the memory, decode it and "execute" it - i.e. do what is specified in the instruction.
Great. So now we know what an instruction is. Along the way we also came to know that there are "registers" in the processor. Registers are like scratch spaces which a programmer can use.
The second question that we needed to answer was "Where are the instructions stored, so we know where to fetch them from?". The answer is that they are stored in the main memory. So, somehow you need to find a sequence of instructions (in the binary format as seen above) and put it in the memory.
We talked about registers but did not discuss what they are. So, here is a quick description of registers.
Registers
Each register is a 64-bit (8-byte) storage space. Registers are of two types: (a) general purpose and (b) special purpose. A general purpose register is really a scratch space to be used by the programmer to achieve some computation. A special purpose register has a specific role - it communicates some specific information to the rest of the processor.
There are 15 general purpose registers: rax, rbx, rcx, rdx, rsi, rdi, rbp, and r8 through r15.
Then there are some special purpose registers. There are many of them, but in this article we will need to know of the following:
- RIP: the instruction pointer - contains the address of the next instruction to be executed
- IR: the instruction register - holds the instruction currently being executed (internal to the processor; unlike the others, it is not directly accessible to programmers)
- RSP: the stack pointer - contains the address of the top of the stack
- RFLAGS: the flags register. It contains single-bit flags (0 or 1) recording the results of other instructions' execution. E.g. if a previous operation overflowed, the overflow bit gets set; another bit gets set based on the result of comparing two values in an instruction
- CR3: contains the page table base address. We will discuss more about this when we discuss MMU (memory management unit) later in this article
There are many more, but for the purpose of our discussion only the above registers will suffice.
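Putting the registers together, the processor state described above can be modeled as a plain C struct. This is an illustrative sketch following the discussion, not the actual hardware layout (the flag bit positions, however, are the real RFLAGS ones):

```c
#include <stdint.h>

/* Toy model of the x86-64 register state discussed above. */
struct cpu_state {
    uint64_t gpr[15];  /* general purpose: rax, rbx, rcx, rdx, rsi, rdi, rbp, r8-r15 */
    uint64_t rip;      /* address of the next instruction to execute */
    uint64_t rsp;      /* address of the top of the stack */
    uint64_t rflags;   /* single-bit flags: zero, sign, overflow, ... */
    uint64_t cr3;      /* page table base address */
};

/* Real RFLAGS bit positions for the flags mentioned in this article. */
enum { FLAG_ZF = 1u << 6, FLAG_SF = 1u << 7, FLAG_OF = 1u << 11 };
```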
The Common Flow of Instructions, and the Instruction Pointer RIP
Now that we have seen how a single instruction executes, it is easy to see that instructions can be stored one after the other in memory, fetched serially and executed. And that is, indeed, most of the execution. Once an instruction execution is done, RIP is incremented by the length of the current instruction and then the processor fetches-decodes-executes the next instruction.
Let's say the content of address 0x12345678 is 01001000 10000011 11000000 00100010. Instead of writing this human-unfriendly bit sequence we will write it as "add rax, 34". Let's also say that at the next address, i.e. 0x1234567C, we have "add rbx, 22". We succinctly write this as:
A: add rax, 34
   add rbx, 22
where A is shorthand for 0x12345678. Initially, the value of RIP is A, and thus "add rax, 34" gets executed. After this instruction executes, RIP changes to A + 4 (the instruction is 4 bytes long), where the next instruction resides. Thus the next instruction gets fetched and executed.
Function Calls and the Stack Pointer (RSP register)
The next concept we need to understand to appreciate fetch-decode-execute lifecycle is how function calls are implemented in the processors and the role of RSP.
If the concept of "function" did not exist in high level programming languages, then perhaps one could do without RSP. However, functions are integral to programming languages - they help us organize our code in logical groupings.
Now suppose that in your high-level language you write a function func. At the time of writing func, you probably won't know all the functions that can call it. Maybe today func1 and func2 call it, and tomorrow someone might write a third function func3 which calls func. func can even call itself (recursion).
So, you have to compile the function func to assembly, and then to machine code, in a way that does not depend on who will call it. Let's say the compiler (or even you!) compiles the function using a lot of registers - rax, rbx, rcx etc. Now, the caller func1 might be using the same registers, and it will need their values back once func finishes. So, if code for func1 is executing and a call to func is made, then before the code of func starts executing, the registers should be saved somewhere, and when func finishes, they should be restored. The only place to save them is memory. And since there can be a function call within a function call within a function call, we also need to track, for each saved set of registers, which address in memory it was saved to. We refer to this set of registers as the "context" - so we say that the context is saved to the stack or restored from the stack.
This whole problem is solved by having a concept of a "stack" in memory. For any given process (what a "process" is will be discussed more concretely in a later section of this article), the kernel allocates some memory, and the address of its top is kept in the register RSP. Then, on each "call" instruction ("call" is the assembly instruction for a function call), the processor automatically pushes the current RIP - the return address - onto the stack at address RSP, updates RSP, and jumps execution to the address provided in the call instruction. Saving and restoring the other registers is done by push/pop instructions that the compiler emits around the call, as dictated by the calling convention. Similarly, on a "ret" instruction ("ret" is the assembly instruction for returning from a function call), the processor pops the saved return address from the top of the stack into RIP (it will be the value of RIP saved at the time of the call) and updates RSP. RSP updates are easy - they are decrements or increments by an amount equal to the size of the data being pushed to or popped from the stack.
The above description should make two things clear. (1) A function call is not a software-only concept - the hardware supports it. Had the hardware not supported it, it could still have been implemented in software (assembly code could have done all the saving/restoring of the context), but programs would have executed more slowly. (2) RSP - the stack pointer - is involved in saving/restoring the context on the stack, and stores the address of the top of the stack.
Details about the Instruction Pointer (RIP)
We had a brief look at RIP earlier, and saw that in linear execution of instructions, RIP always gets incremented by the length of the currently executing instruction. (Note that in some architectures, like x86, instruction length is variable; in others, like ARM, it is fixed.) Now let's look at RIP in detail and see the scenarios where it changes in a non-linear fashion.
The value of RIP changes as per the following logic:
- Unless the current instruction is a jump instruction, RIP gets incremented to RIP + instruction_length as the current instruction executes
- If the current instruction is a jump or conditional jump instruction, then the value of RIP changes as per the result of jump instruction
- At any time an "interrupt" may arrive. If that happens, RIP immediately changes to the address of the interrupt handler
- RIP changes in different ways when a call / ret instruction happens
The first of the cases has been studied already. Let's look at the next scenarios.
Branches
if and for statements of high level languages are translated to a variety of jump instructions.
Consider the following C code from the paper:
int isPrime(int a) {
    for (int i = 2; i < a; i++) {
        if (a % i == 0) return 0;
    }
    return 1;
}
It translates to following assembly:
isPrime:  mov %edi, %ecx    ; %ecx = %edi (a)
          mov $2, %esi      ; i = 2
          cmp %ecx, %esi    ; is i >= a?
          jge prime         ; jump if yes
nexti:    mov %ecx, %eax    ; set %eax = a
          cdq               ; sign-extend
          idiv %esi         ; a % i
          test %edx, %edx   ; is remainder zero?
          jz notPrime       ; jump if yes
          inc %esi          ; i++
          cmp %ecx, %esi    ; is i >= a?
          jl nexti          ; jump if no
prime:    mov $1, %eax      ; return value in %eax
          ret
notPrime: xor %eax, %eax    ; %eax = 0
          ret
In the above example, 'isPrime', 'nexti' etc. are all addresses. When the processor executes "cmp %ecx, %esi", it internally subtracts the values in the registers %ecx and %esi and sets bits in the flags register (the RFLAGS register we met earlier) based on the result - e.g. the zero flag if the values are equal, and the sign and overflow flags based on the sign of the difference. Next, when "jge prime" ("jump if greater than or equal") executes, RIP is either set to the address called 'prime' if the comparison indicated greater-or-equal (this info is now in the flags register), or else incremented by the instruction length. So execution either jumps to some target or continues linearly, depending on the result of the previous comparison.
Interrupt Handling
Another phenomenon in which RIP changes to a value other than RIP + instruction_length is when an interrupt happens. When an interrupt happens (e.g. when you type something on your keyboard, or when disk I/O completes), the current context is saved on the stack (just as in a function call) and RIP becomes the address of the interrupt handler. How does the processor know the address of the interrupt handler? The kernel sets that address at a specific location in memory (on x86, in a table of handler addresses called the interrupt descriptor table) when the machine is starting up.
Call/ret Instructions
We have already looked at call and ret instructions, they also lead to non-linear jumps in the value of RIP.
This ends the description of the fetch-decode-execute cycle of the processors.