# Compilation and Validation I try here to sketch out how a simple compiler of EVM to machine code could work, and why validation is needed for the compiler to work. ## Stack and Register Machines Let's consider two very simple virtual machines -- a stack machine and a register machine. The stack machine has just the following instructions ``` <OPERATOR> PUSH value POP JUMP address JUMPIF address STOP ``` and the register machine has one less ``` <OPERATOR> register_y, register_x STORE register_x, value JUMP address JUMPIF register_x, address STOP ``` where `<OPERATOR>` is one of `ADD, SUB, MUL, DIV`. On the stack machine the instructions use the top stack items and leave the result, if any, on top of the stack. On the register machine they use the given registers and leave the result, if any, in `register_y`. Register indexes, code addresses and data values are `uint64`, encoded MSB-first. The code for`STOP` is `0`. The EVM and real CPUs are of course much more complex, but these are adequate abstractions for our purposes. ## Compile from Stack Code to Register Code Say we want to translate code for the stack machine to code for the register machine. Here is a simple compiler. ``` // traverse stack machine code, emitting register machine code Code::compile(SP) { while opcode = get_s_code() { // get next opcode and zero_s_code() // mark code as compiled switch opcode { case ADD, SUB, MUL, DIV: put_r_code(opcode) put_r_code_data(SP+2) put_r_code_data(SP+1) SP-- case PUSH: value = get_s_code_data() put_r_code(STORE) put_r_code_data(value) SP-- case POP: SP++ case JUMP: s_pc = get_s_code_data() r_pc = s_to_r_code_address(s_pc) put_r_code(JUMP) put_r_code_data(r_pc) case JUMPIF: s_pc = get_s_code_data() r_pc = s_to_r_code_address(s_pc) put_r_code(JUMPIF) put_r_code_data(r_address) SP++ Code(stack_code, register_code, s_pc, r_pc).compile(SP) case STOP: return } } } ``` `Code::compile()` uses an array of stack machine code, addressed by `S_PC` and an array of register machine code, addressed by `R_PC`. It also assumes some helper functions for accessing opcodes and immediate data in code arrays, all of which advance PCs over immediate data, and a helper to adjust addresses in the stack code to new addresses in the register code. (I don't try to define all the functions here). ``` Code { byte stack_code[] byte register_code[] int64 S_PC int64 R_PC int64 SP get_s_code() { return stack_code[S_PC++] } zero_s_code() { stack_code[S_PC] = 0 } get_s_code_data() { ... } put_r_code(opcode) { register_code[R_PC++] = opcode } put_r_code_data(data) { ... } s_to_r_code_address(s_code_address) { ... } validate(int64) compile(int64) } ``` Rather than optimizing register assignment, we simply alias the current data stack pointer `SP` with the current register index. Like the EVM, the data stack is defined as growing towards smaller addresses. Better code is easy enough to generate, if only by a more compact encoding of immediate data. Well-optimized code is of course more difficult, but even unoptimized machine code will perform better than an interpreter. ## Stack Validity and Code Correctness Where is the weakness here? Where we call `zero_s_code()` to mark code as already compiled. We need to do that so that we do not compile the same loop more than once, but _if a loop grows or shrinks the stack_ the code we generate the first time around a loop may fail to index the correct registers on subsequent iterations -- operators meant to use the top of stack might keep using indexes further down. However, if we first validate that the stack height will be constant at each iteration of the loop then the compiler can use the correct indexes. (We can also check for stack underflow and overflow.) These is not an onerous requirement: structured programming languages enforce it by default, even if they include`break`, `continue`, and restricted forms of `goto`. What does the validator look like? Not unlike the compiler. It simply traverses the code, checking the stack height, and marking code as traversed to avoid checking it again. If the stack use is valid the compiler will not fail because of it. ``` // traverse stack machine code, validating stack use Code::validate(int64 SP) { while opcode = get_s_code() { // get next opcode and zero_s_code() // mark code as validated stack_height = state.SP - SP // check stack heights if (stack_heights[S_PC] == 0) stack_heights[S_PC] = stack_height else if stack_heights[S_PC] == height return true else return false switch opcode { // check stack bounds case ADD, SUB, MUL, DIV: SP -= 2 if SP < 0 return false ++SP case PUSH: if --SP < 0 return false case POP: if ++SP > MAX_SP return false case JUMP: s_pc = get_s_code_data() PC = s_pc case JUMPIF: if SP++ > MAX_SP return false // recurse to handle true branch of conditional s_pc = get_s_code_data() r_pc = s_to_r_code_address(s_pc) Code(stack_code, register_code, s_pc, r_pc).validate(SP) case STOP: return true default: return false // invalid opcode } } while state = pop(continuations) } ```