# SP1 AIR Documentation

This is a document made for the Code4rena audit contest for the SP1 Hypercube, as well as documentation on how the SP1 AIRs work in the newest version. We assume familiarity with ZKPs, multi-table circuit designs and interactions, as well as lookup arguments. Some of the fundamentals of SP1 that remain the same can be found in [the audit documentation for SP1 v1](https://hackmd.io/wztOd455QKWf-K8LXh_Fqw). **Note that most AIRs have changed significantly since. The audit scope is the circuits of SP1 Hypercube, not SP1 v1.** This documentation may be edited at any time for clarification, typos, and added details. Part 2 of the documentation, on untrusted programs and mprotect, can be found [here](https://hackmd.io/@rkm0959/rydiLQaqel).

## no-CPU Architecture

One of the biggest changes in this version of the circuits is that there is no CPU chip anymore. This means that each chip not only handles the RISC-V instruction itself, but also handles the "housekeeping" of the CPU state, which includes

- fetching the instruction from the program counter `pc`
- reading and writing to the registers by using the memory argument
- incrementing the timestamp and modifying the program counter

The handling of these can be found in `core/machine/src/adapter/`. First, there's the `CPUState` in `state.rs`, which handles the state of the CPU. The state of the CPU is composed of the `clk` and the `pc`. The `clk` is a value that can be at most `2^48`, and is sent throughout as two 24-bit limbs. This `clk` is used for the memory argument, which requires a strictly increasing timestamp. The `pc` is the program counter, which can also be at most `2^48`, and is represented as three 16-bit limbs.

```rust=
pub fn eval<AB: SP1AirBuilder>(
    builder: &mut AB,
    cols: CPUState<AB::Var>,
    next_pc: [AB::Expr; 3],
    clk_increment: AB::Expr,
    is_real: AB::Expr,
) {
    let clk_high = cols.clk_high::<AB>();
    let clk_low = cols.clk_low::<AB>();
    builder.assert_bool(is_real.clone());
    builder.receive_state(clk_high.clone(), clk_low.clone(), cols.pc, is_real.clone());
    builder.send_state(
        clk_high.clone(),
        clk_low.clone() + clk_increment,
        next_pc,
        is_real.clone(),
    );
    builder.send_byte(
        AB::Expr::from_canonical_u32(ByteOpcode::Range as u32),
        (cols.clk_0_16 - AB::Expr::one()) * AB::F::from_canonical_u8(8).inverse(),
        AB::Expr::from_canonical_u32(13),
        AB::Expr::zero(),
        is_real.clone(),
    );
    builder.slice_range_check_u8(&[cols.clk_16_24.into(), AB::Expr::zero()], is_real.clone());
}
```

Here, the `receive_state` function receives the current CPU state as an interaction, then `send_state` sends the next CPU state based on the `next_pc` value and `clk_increment`, the clock increment. Therefore, it's the circuit's job to constrain `next_pc` and `clk_increment` correctly. There are hidden details in this code, which we discuss later.

Then, there are the `adapter`s that are used to handle register accesses and instruction fetching. There are four adapters: `ALUType`, `RType`, `IType`, and `JType`. This is based on the type of instructions that the chip aims to handle.

- `ALUType`: handles both `RType` and `IType`, used for common ALU opcodes, since they have both versions of instructions; for example, `SLT` and `SLTI` both exist as opcodes.
- `RType`: instructions where `op_a, op_b, op_c` are registers.
- `IType`: instructions where `op_a, op_b` are registers and `op_c` is an immediate.
- `JType`: instructions where `op_a` is a register and `op_b, op_c` are immediates.

Since `ALUType` is the most complex one, we'll just go over that one.
```rust=
pub struct ALUTypeReader<T> {
    pub op_a: T,
    pub op_a_memory: RegisterAccessCols<T>,
    pub op_a_0: T,
    pub op_b: T,
    pub op_b_memory: RegisterAccessCols<T>,
    pub op_c: Word<T>,
    pub op_c_memory: RegisterAccessCols<T>,
    pub imm_c: T,
    pub is_trusted: T,
}

#[allow(clippy::too_many_arguments)]
fn eval_alu_reader<AB: SP1AirBuilder + MemoryAirBuilder + ProgramAirBuilder>(
    builder: &mut AB,
    clk_high: AB::Expr,
    clk_low: AB::Expr,
    pc: [AB::Var; 3],
    opcode: impl Into<AB::Expr> + Clone,
    instr_field_consts: [AB::Expr; 4],
    op_a_write_value: Word<impl Into<AB::Expr> + Clone>,
    cols: ALUTypeReader<AB::Var>,
    is_real: AB::Expr,
) {
    builder.assert_bool(is_real.clone());
    let is_untrusted = is_real.clone() - cols.is_trusted;
    builder.assert_bool(is_untrusted.clone());
    builder.assert_bool(cols.is_trusted);

    // A real row must be executing either a trusted program or untrusted program.
    builder.assert_eq(is_untrusted.clone() + cols.is_trusted, is_real.clone());

    // If the row is running an untrusted program, the page protection checks must be on.
    let public_values = builder.extract_public_values();
    builder.when(is_untrusted.clone()).assert_one(public_values.is_untrusted_programs_enabled);

    // Assert that `imm_c` is zero if the operation is not real.
    // This is to ensure that the `op_c` read multiplicity is zero on padding rows.
    builder.when_not(is_real.clone()).assert_eq(cols.imm_c, AB::Expr::zero());

    let instruction: InstructionCols<AB::Expr> = InstructionCols {
        opcode: opcode.clone().into(),
        op_a: cols.op_a.into(),
        op_b: Word::extend_expr::<AB>(cols.op_b.into()),
        op_c: cols.op_c.map(Into::into),
        op_a_0: cols.op_a_0.into(),
        imm_b: AB::Expr::zero(),
        imm_c: cols.imm_c.into(),
    };
    builder.send_program(pc, instruction.clone(), cols.is_trusted);
    builder.send_instruction_fetch(
        pc,
        instruction,
        instr_field_consts,
        [clk_high.clone(), clk_low.clone()],
        is_untrusted.clone(),
    );

    // Assert that `op_a` is zero if `op_a_0` is true.
    builder.when(cols.op_a_0).assert_word_eq(op_a_write_value.clone(), Word::zero::<AB>());

    builder.eval_register_access_write(
        clk_high.clone(),
        clk_low.clone() + AB::Expr::from_canonical_u32(MemoryAccessPosition::A as u32),
        [cols.op_a.into(), AB::Expr::zero(), AB::Expr::zero()],
        cols.op_a_memory,
        op_a_write_value,
        is_real.clone(),
    );
    builder.eval_register_access_read(
        clk_high.clone(),
        clk_low.clone() + AB::Expr::from_canonical_u32(MemoryAccessPosition::B as u32),
        [cols.op_b.into(), AB::Expr::zero(), AB::Expr::zero()],
        cols.op_b_memory,
        is_real.clone(),
    );

    // Read the `op_c[0]` register only when `imm_c` is zero and `is_real` is true.
    builder.eval_register_access_read(
        clk_high.clone(),
        clk_low.clone() + AB::Expr::from_canonical_u32(MemoryAccessPosition::C as u32),
        [cols.op_c[0].into(), AB::Expr::zero(), AB::Expr::zero()],
        cols.op_c_memory,
        is_real - cols.imm_c,
    );

    // If `op_c` is an immediate, assert that `op_c` value is copied into
    // `op_c_memory.prev_value`.
    builder.when(cols.imm_c).assert_word_eq(cols.op_c_memory.prev_value, cols.op_c);
}
```

We'll defer the discussion of `is_trusted` and `send_instruction_fetch` to later. For now, note that these are for executing untrusted programs, whose code is not part of the ELF itself and therefore has to be decoded directly rather than taken from the preprocessed `ProgramChip`. The column `is_real` denotes whether the row of the table is real or just padding. If `is_real` is true, then the instruction is fetched from the `pc`.
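Before moving on to how this interaction is received, here is a minimal plain-Rust model of the operand behavior just shown; the `Operands` type and `resolve_operands` function are illustrative stand-ins, not SP1 code:

```rust=
/// Illustrative model of ALU-type operand selection (not actual SP1 code).
#[derive(Clone, Copy)]
struct Operands {
    op_b_register_value: u64, // value read from the `op_b` register
    op_c_register_value: u64, // value read from the `op_c[0]` register (if `imm_c == 0`)
    op_c_immediate: u64,      // immediate stored in `op_c` (if `imm_c == 1`)
    imm_c: bool,              // whether `op_c` is an immediate
    op_a_0: bool,             // whether `op_a` is the `x0` register
}

/// Returns the `(b, c)` inputs to the ALU and the value actually written back to `op_a`.
fn resolve_operands(ops: Operands, computed_a: u64) -> (u64, u64, u64) {
    let b = ops.op_b_register_value;
    // `op_c_memory.prev_value` holds either the register read or the copied immediate,
    // so the `c` value can be used uniformly regardless of `imm_c`.
    let c = if ops.imm_c { ops.op_c_immediate } else { ops.op_c_register_value };
    // `x0` is hard-wired to zero, so the written value is forced to zero when `op_a_0` holds.
    let a_written = if ops.op_a_0 { 0 } else { computed_a };
    (b, c, a_written)
}

fn main() {
    // e.g. an `ADDI x0, x5, 7`-style instruction: immediate `op_c`, destination `x0`.
    let ops = Operands {
        op_b_register_value: 10,
        op_c_register_value: 0,
        op_c_immediate: 7,
        imm_c: true,
        op_a_0: true,
    };
    let (b, c, a) = resolve_operands(ops, 17);
    assert_eq!((b, c, a), (10, 7, 0));
}
```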
For the case of `is_trusted = 1`, which is the case we'll discuss for now, this interaction of `send_program` is received by the `ProgramChip`, which holds the program information in preprocessed tables.

```rust=
fn eval(&self, builder: &mut AB) {
    let main = builder.main();
    let preprocessed = builder.preprocessed();
    let prep_local = preprocessed.row_slice(0);
    let prep_local: &ProgramPreprocessedCols<AB::Var> = (*prep_local).borrow();
    let mult_local = main.row_slice(0);
    let mult_local: &ProgramMultiplicityCols<AB::Var> = (*mult_local).borrow();

    // Constrain the interaction with CPU table
    builder.receive_program(prep_local.pc, prep_local.instruction, mult_local.multiplicity);
}
```

Since `op_a` is the only "writeable" register (as in, `op_b, op_c` are guaranteed to be read-only), we use read methods on `op_b, op_c` and the write method on `op_a`. The write value will be provided as `op_a_write_value`, and in the case of instructions where `op_a` is immutable, this value will be `op_a_memory.prev_value` so that the value remains the same. The `op_a_0` flag represents whether `op_a = x0`, and if it is true, the `op_a` value is hard-wired to zero, hence the constraint that `op_a_write_value` is all zeroes.

Finally, note that if `is_real = 0`, then `imm_c` is asserted to be zero. This is because the `op_c` memory access has multiplicity `is_real - imm_c`, which must equal zero. So,

- `is_real = 1, imm_c = 0`: `op_c` is a register, and the multiplicity of the `op_c` memory read is 1.
- `is_real = 1, imm_c = 1`: `op_c` is an immediate, and the multiplicity of the `op_c` memory read is 0.
- `is_real = 0`: this constrains `imm_c = 0`, so the multiplicity of the `op_c` memory read is 0.

For `imm_c = 1`, we assert that the `op_c` value (the immediate) is copied over to `op_c_memory.prev_value`, so it can always be used as the `c` value regardless of `imm_c`.

```rust=
impl<T> ALUTypeReader<T> {
    pub fn prev_a(&self) -> &Word<T> {
        &self.op_a_memory.prev_value
    }

    pub fn b(&self) -> &Word<T> {
        &self.op_b_memory.prev_value
    }

    pub fn c(&self) -> &Word<T> {
        &self.op_c_memory.prev_value
    }
}
```

## RISC-V64 & u16 limbs

Note that `pc` can be at most `2^48`, which wouldn't be needed for RISC-V32. This version of SP1 supports RISC-V64 instead of RISC-V32. However, for performance purposes, we enforce that the address space is `2^48` instead of `2^64`. This is fine for realistic programs. The addresses will be represented with three 16-bit limbs, like `pc`.

Additionally, this version of SP1 represents the `u64` values in RISC-V64 as four 16-bit limbs instead of eight bytes. This usually has performance advantages, but sometimes you need the `u64` written in byte form to enforce constraints. There are two such cases.

First, consider bitwise operations. The current form of SP1 cannot directly do bitwise operations over `u16`s, since the naive lookup table required for that would be too large. Therefore, you need to break the `u16` down into bytes, then apply the byte lookups. In this case, when you break down the `u16`, you do not need to constrain that the two bytes are actually bytes. This is because **the byte lookup that does the bitwise operation will automatically constrain** that the inputs are bytes indirectly.

```rust=
pub struct U16toU8Operation<T> {
    low_bytes: [T; WORD_SIZE],
}

/// Converts four u16 limbs into eight u8 limbs.
/// This function assumes that the u8 limbs will be range checked.
pub fn eval_u16_to_u8_unsafe<AB: SP1AirBuilder>(
    _: &mut AB,
    u16_values: [AB::Expr; WORD_SIZE],
    cols: U16toU8Operation<AB::Var>,
) -> [AB::Expr; WORD_BYTE_SIZE] {
    let mut ret = core::array::from_fn(|_| AB::Expr::zero());
    let divisor = AB::F::from_canonical_u32(1 << 8).inverse();
    for i in 0..WORD_SIZE {
        ret[i * 2] = cols.low_bytes[i].into();
        ret[i * 2 + 1] = (u16_values[i].clone() - ret[i * 2].clone()) * divisor;
    }
    ret
}
```

On the other hand, consider a multiplication opcode on u64s. You cannot multiply the `u64`s in four-`u16`-limb form directly, since the product of two `u16` limbs will overflow the underlying prime. Therefore, we need to break them down into bytes, but in this case there are no byte lookups being done directly around them. So, in this case, we need to range check the broken-down bytes directly, which is done in `eval_u16_to_u8_safe`.

```rust=
/// Converts four u16 limbs into eight u8 limbs.
/// This function range checks the eight u8 limbs.
fn eval_u16_to_u8_safe<AB: SP1AirBuilder>(
    builder: &mut AB,
    u16_values: [AB::Expr; WORD_SIZE],
    cols: U16toU8Operation<AB::Var>,
    is_real: AB::Expr,
) -> [AB::Expr; WORD_BYTE_SIZE] {
    let ret = U16toU8Operation::<AB::F>::eval_u16_to_u8_unsafe(builder, u16_values, cols);
    builder.slice_range_check_u8(&ret, is_real);
    ret
}
```

In many cases, the operation can be done directly over u16s. MSB fetching, range checking, addition, subtraction, and comparison can be done without breaking the u16 down.

We ensure that all memory values are `u64` words with four valid `u16` limbs. This is done in exactly the same way as in SP1 v1, via an inductive argument. [Refer to the V1 docs](https://hackmd.io/wztOd455QKWf-K8LXh_Fqw#Valid-u32-Word-Checking-Propagation).

## Timestamp & Program Counter

First, consider the timestamp. As mentioned, this is used for the Blum memory argument; see the [SP1 v1 document](https://hackmd.io/wztOd455QKWf-K8LXh_Fqw#Offline-Memory-Checking) for a more detailed explanation. Now, the timestamp

- increases by `8` on every RISC-V cycle except syscalls
- increases by `8 + 256 = 264` on all syscalls

Also, because a single cycle of RISC-V does multiple memory accesses, we have a `MemoryAccessPosition`. Let's explain what this does. Consider an `ADD` opcode with `op_a = op_b = op_c = x10`. This means that if we do a memory access on `op_a, op_b, op_c` once each, `x10` will be accessed three times in a single cycle. However, the timestamp must be strictly increasing on each access for the Blum argument to work out. Therefore, we increment the timestamp by the position, so that the timestamp strictly increases.

```rust=
#[derive(Copy, Clone, Debug, PartialEq)]
pub enum MemoryAccessPosition {
    /// Untrusted instruction access position.
    UntrustedInstruction = 0,
    /// Memory access position.
    Memory = 1,
    /// C register access position.
    C = 2,
    /// B register access position.
    B = 3,
    /// A register access position.
    A = 4,
}
```

Now, we have a problem. We want the timestamp used for the Blum argument to be represented as two 24-bit limbs at all times, so we need to make `timestamp + position` be represented as such, and we also need to handle the timestamp increments each cycle. This is doable, but incurs overhead in trace area. Here's how we optimize this.

First, we get around the `position` handling. We make the timestamp start at `1`, and note that the timestamps always increment by a multiple of `8`. This means that the timestamp of the cycle will always be `1 (mod 8)`. Therefore, if you add `0 <= position <= 4` to it, the bottom 24-bit limb cannot overflow. This means that you can add the `position` to the bottom 24-bit limb directly, and not worry about potential overflow concerns.
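As a sanity check on this argument, here is a small plain-Rust illustration (not SP1 code) of why adding an access position to a low limb that is `1 (mod 8)` can neither overflow 24 bits nor collide with another access's timestamp:

```rust=
fn main() {
    // The low 24-bit limb of a cycle timestamp is always 1 (mod 8),
    // so the largest value it can take is 2^24 - 7.
    let max_clk_low: u32 = (1 << 24) - 7;
    assert_eq!(max_clk_low % 8, 1);

    // Adding any access position 0..=4 keeps the limb strictly below 2^24,
    // so the per-access timestamps need no carry into the high limb.
    for position in 0u32..=4 {
        assert!(max_clk_low + position < (1 << 24));
    }

    // Accesses within one cycle get distinct timestamps, and the next cycle starts
    // at least 8 later, so the `clk + position` values never collide across cycles.
    let clk = 1u64 + 8 * 12345; // some cycle timestamp, 1 (mod 8)
    let this_cycle: Vec<u64> = (0..=4).map(|p| clk + p).collect();
    let next_cycle_start = clk + 8;
    assert!(this_cycle.iter().all(|t| *t < next_cycle_start));
}
```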
Now, we need to handle the increments per cycle. Since the overflows happen rarely, the idea is to have a separate chip that "cleans up" the state to be correct. So, the individual RISC-V chips do not handle the carries, and just send over `(clk_high, clk_low + increment)` as the next timestamp, without carry propagation. This carry is then handled in the `StateBump` chip.

Ultimately, this chip first receives the potentially incorrectly formatted `clk` as `(clk_high, clk_low)`. Then, it does the following.

- Assert that `is_clk` is boolean. This will be the carry from the low limb to the high limb.
- If `is_clk` is `1`, `next_clk_low = clk_low - 2^24` and `next_clk_high = clk_high + 1`.
- If `is_clk` is `0`, `next_clk_low = clk_low` and `next_clk_high = clk_high`.
- Range check that `next_clk_low, next_clk_high` are valid 24-bit limbs.

Then it sends the correctly formatted `clk` as `(next_clk_high, next_clk_low)`.

```rust=
// Check that `is_real` is a boolean value.
builder.assert_bool(local.is_real);

// Receive the state with values potentially in non-canonical forms.
builder.receive_state(local.clk_high, local.clk_low, local.pc, local.is_real);

// Send the state with `clk_high, clk_low, next_pc` being in canonical forms.
builder.send_state(
    local.next_clk_24_32 + local.next_clk_32_48 * AB::F::from_canonical_u32(1 << 8),
    local.next_clk_0_16 + local.next_clk_16_24 * AB::F::from_canonical_u32(1 << 16),
    local.next_pc,
    local.is_real,
);

// Check that the sent state's clk is in canonical form.
// The bottom 16 bits of the `clk` is a u16 value that is 1 (mod 8).
builder.send_byte(
    AB::Expr::from_canonical_u32(ByteOpcode::Range as u32),
    (local.next_clk_0_16 - AB::Expr::one()) * AB::F::from_canonical_u8(8).inverse(),
    AB::Expr::from_canonical_u32(13),
    AB::Expr::zero(),
    local.is_real,
);

// The top 16 bits of the `clk` is a u16 value.
builder.send_byte(
    AB::Expr::from_canonical_u32(ByteOpcode::Range as u32),
    local.next_clk_32_48.into(),
    AB::Expr::from_canonical_u32(16),
    AB::Expr::zero(),
    local.is_real,
);

// The two 8 bit limbs in the middle of the clk are valid u8 values.
builder.slice_range_check_u8(&[local.next_clk_16_24, local.next_clk_24_32], local.is_real);

// If `is_clk` is true, a carry happens from the bottom 24 bit limb to the top.
// First, check that `is_clk` is a boolean value. This is possible because the `clk` does
// not increment by more than `2^24` in a single instruction cycle.
builder.assert_bool(local.is_clk);
builder.when(local.is_real).assert_eq(
    local.next_clk_24_32 + local.next_clk_32_48 * AB::F::from_canonical_u32(1 << 8),
    local.clk_high + local.is_clk,
);
builder.when(local.is_real).assert_eq(
    local.next_clk_0_16
        + local.next_clk_16_24 * AB::F::from_canonical_u32(1 << 16)
        + local.is_clk * AB::F::from_canonical_u32(1 << 24),
    local.clk_low,
);
```

For the overall constraint system to work, we constrain the following things.

- The initial and final timestamps of the shard are represented with two valid 24-bit limbs, and both timestamps are `1 (mod 8)`.
- In each RISC-V chip, the low limb is a valid 24-bit limb and is `1 (mod 8)`.
  - Refer to the `CPUState` implementation for the details.
- In each RISC-V chip, the `clk_increment` is correct, either `8` or `8 + 256`.
- In the StateBump chip, the received state and the sent state have the same `timestamp` value, which is `1 (mod 8)`. The sent state has two valid 24-bit limbs.

This is sufficient to constrain that in each RISC-V chip, the timestamp is correct.
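To make the carry handling concrete, here is a minimal plain-Rust model of the normalization that `StateBump` performs, with ordinary integers standing in for field elements (illustrative only, not the chip's actual code):

```rust=
/// Normalizes a timestamp whose low limb may have exceeded 24 bits after an increment.
/// Returns the canonical `(clk_high, clk_low)` pair of 24-bit limbs.
fn state_bump(clk_high: u64, clk_low: u64) -> (u64, u64) {
    // `is_clk` plays the role of the boolean carry column: the low limb can exceed
    // 2^24 by at most the per-cycle increment, so a single carry bit suffices.
    let is_clk = clk_low >= (1 << 24);
    let next_clk_low = if is_clk { clk_low - (1 << 24) } else { clk_low };
    let next_clk_high = if is_clk { clk_high + 1 } else { clk_high };
    assert!(next_clk_low < (1 << 24) && next_clk_high < (1 << 24));
    (next_clk_high, next_clk_low)
}

fn main() {
    // A cycle whose low limb was 2^24 - 7 and was incremented by 264 (a syscall).
    let (high, low) = state_bump(5, (1 << 24) - 7 + 264);
    assert_eq!((high, low), (6, 257));
    // The reconstructed 48-bit timestamp is unchanged and still 1 (mod 8).
    assert_eq!((high << 24) + low, (5u64 << 24) + (1 << 24) - 7 + 264);
    assert_eq!(((high << 24) + low) % 8, 1);
}
```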
Two additional comments. First, note that the high limb doesn't need to be constrained to be 24 bits in the individual RISC-V chips: since the high limb does not change within the RISC-V chips, the high limbs in those chips come from either

- the initial / final timestamp, which is constrained to be valid 24-bit limbs, or
- the sent state of the StateBump chip, which is constrained to be valid 24-bit limbs.

Second, we briefly mention how we constrain a value to be a 24-bit limb that is `1 (mod 8)`. We break it down into two parts, a low 16-bit part and a high 8-bit part. We range check the high part to be a byte using a byte lookup, and the low 16-bit part is checked with

$$ (\text{value} - 1) / 8 \in [0, 1, 2, \cdots, 2^{13} - 1] $$

using the `RangeChip`, which can check `0 <= a < 2^b` for `0 <= b <= 16`. This is sufficient to show that `value` is a `u16` value that is `1 (mod 8)`, even in field arithmetic.

The same goes for the program counter. There are two types of instructions:

- instructions that just increment the `pc` by 4
- instructions that may jump the `pc` by different values

In the latter type of instructions, it's natural to range check the `next_pc` as three u16 limbs. In the former, we just use `[pc[0] + 4, pc[1], pc[2]]` as the `next_pc`. The carry propagation for the `pc` also happens in the `StateBump` chip. We constrain the following things, which we believe to be sufficient for correctness.

- The initial / final `pc` of the shard is three valid u16 limbs.
- In each RISC-V chip, the `pc` is three valid u16 limbs. This is implicitly done, as the lookup to the preprocessed part of the `ProgramChip` shows that the `pc` is valid.
- In the StateBump chip, the received state and the sent state have the same `pc` value with correct carry propagation. The sent state has three valid u16 limbs.

There's one more important part about `pc` handling. The execution must end with a `HALT` syscall, which sets the next `pc` to `HALT_PC`. We check that the final `pc` of the execution is equal to this `HALT_PC` to verify that the execution halted with `HALT`. To make sure that this way of verifying that `HALT` was the last instruction is correct, we ensure the following.

- Every `pc` in a RISC-V execution is a multiple of `4`. Note that for trusted programs, this is implicitly done, as this is part of the preprocessed trace in the `ProgramChip`.
- Every `next_pc` in a RISC-V execution, except for `HALT`, is a multiple of `4`. This is implicitly done for opcodes that do not branch/jump, as the `pc` simply increments by `4`.
- We use `HALT_PC = 1`, which is **not** a multiple of `4`.

Since the final `pc` is `HALT_PC = 1`, the execution must end with a `HALT` syscall, as any other execution would end with a `pc` value that is a multiple of `4`. Also, once `HALT_PC = 1` is reached, the execution must finish, since there's no program corresponding to `HALT_PC`.

## Register Access Optimization

There's one more trick we need to go over regarding timestamps and the Blum memory argument. In Blum, `crates/core/machine/src/air/memory.rs`, we need to enforce that the timestamps are strictly increasing. Recall that the timestamp is represented with two 24-bit limbs.
Therefore, checking that the timestamp increases requires

- checking whether the most significant limb is different, and if so, that the new one is larger, or
- if the most significant limb is equal, checking that the lower limb is larger,

which costs more than just comparing one limb. To avoid this cost in the common register accesses, we add a trick. Basically, the goal is to make all the register accesses have the same high 24-bit limb of the timestamp between the previous and the current access. This means that the first check doesn't need to happen, and the low limb can just be compared directly. The high limb of the previous access also doesn't need to be stored.

Let's see how we do it. Consider a register that was previously accessed at timestamp `2^24 - 7` (which is `1 (mod 8)`) and will now be accessed at timestamp `2^24 + 1`. In this case, the high limb of the timestamp has changed from `0` to `1`. To avoid handling this in the RISC-V instruction, we sneak in a register read in between to bump the timestamp from `2^24 - 7` to `2^24`. This is secure since it still follows the Blum argument, and it is not a write. Now, once we access the memory at timestamp `2^24 + 1`, the previous access was at timestamp `2^24`, so the high limb remains the same.

This is what the `MemoryBump` chip does. Note that this memory access does handle both timestamp limbs, since the high limb definitely can change during this "shadow read". The `MemoryBump` chip basically allows a "shadow read" of a register at a valid timestamp of two 24-bit limbs. The timestamp and register address are range checked to be correct.

```rust=
fn eval(&self, builder: &mut AB) {
    let main = builder.main();
    let local = main.row_slice(0);
    let local: &MemoryBumpCols<AB::Var> = (*local).borrow();

    // Check that `is_real` is a boolean value.
    builder.assert_bool(local.is_real);

    // Check that the timestamp limbs are within range.
    builder.slice_range_check_u16(&[local.clk_0_16, local.clk_32_48], local.is_real);
    builder.slice_range_check_u8(&[local.clk_16_24, local.clk_24_32], local.is_real);

    // Check that the address is a valid register.
    builder.send_byte(
        AB::Expr::from_canonical_u32(ByteOpcode::LTU as u32),
        AB::Expr::one(),
        local.addr,
        AB::Expr::from_canonical_u8(32),
        local.is_real,
    );

    // Bump the memory timestamp by doing an additional read.
    builder.eval_memory_access_read(
        local.clk_24_32 + local.clk_32_48 * AB::Expr::from_canonical_u32(1 << 8),
        local.clk_0_16 + local.clk_16_24 * AB::Expr::from_canonical_u32(1 << 16),
        &[local.addr.into(), AB::Expr::zero(), AB::Expr::zero()],
        local.access,
        local.is_real,
    );
}
```

Also, note that this still allows the memory timestamp to be strictly increasing at all times: since the RISC-V instructions have timestamps that are `1 (mod 8)`, adding a fake memory read with a timestamp that is a multiple of `2^24` will not cause any duplicate reads at the same timestamp. Basically, a multiple of `2^24` cannot equal any `timestamp + position` value. With all these in mind, we invite you to read the `adapter`s and `CPUState` again.

## Handling Public Values

One may have noticed that each RISC-V chip receives a CPU state and sends the next CPU state, but that requires something to send the initial state and receive the final state of that shard. This is where `eval_public_values` and the public value constraints come in.
Notice that the initial / final state of the shard is something that should be exposed in the public values of the shard, as they need to be "connected" across shards: the next shard's initial timestamp should be the current shard's final timestamp. `eval_public_values` contains the constraints that the public values of a shard should satisfy, as well as the interactions that are needed to send the initial state and receive the final one. For starters, the constraints here include

- the initial / final timestamps are range checked to be 24-bit limbs and are `1 (mod 8)`,
- the initial / final program counters are range checked to be 16-bit limbs,
- the initial / final states are sent and received,
- if the timestamp changed over the shard, it's considered an execution shard,
- if the timestamp remained the same over the shard, it's a non-execution shard,
- in a non-execution shard, the initial / final program counters are the same.

Here, an execution shard is a shard that proves a non-zero number of RISC-V cycles. We discuss different parts of `eval_public_values` as we go. One important framework to understand is that the overall zkVM has a state $S$, and a "shard" should be viewed as a state transition $S \rightarrow S'$. In this view, `eval_public_values` and the individual AIRs should constrain that this transition makes sense, and then the "cross-shard" constraints should just be that the next shard's $S$ equals the current shard's $S'$, plus the boundary checks.

While we're at it, let's discuss the boundary constraints for the timestamp and the program counter. The boundary constraint for the timestamp is that it starts at `1`. In the program counter constraints, we enforce the boundary constraint that the last shard's `next_pc` is equal to `HALT_PC`, to ensure that the program execution halted. Also, we add the constraint that the starting `pc_start` is `vk.pc_start`. Refer to `crates/prover/src/verify.rs` for the cross-shard checks and boundary constraints.

## Handling Global Interactions

Recall that we use [elliptic curve based multiset hashing](https://docs.succinct.xyz/assets/files/SP1_Turbo_Memory_Argument-b042ba18b58c4add20a8370f4802f077.pdf) to handle cross-shard permutation checks. As we move to RISC-V64, the size of the messages transferred via these "global interactions" has increased. To handle this, we pack the messages into 8 field elements so that they can be hashed with a single Poseidon2 permutation.

In the case of memory, the message is the `clk_high` (24 bits), `clk_low` (24 bits), `address` (three 16-bit limbs), and `value` (four 16-bit limbs). To pack this, we first break the third value limb into two bytes `lower` and `upper`. The packed 8 field elements are

- first 5 elements: `clk_high, clk_low, addr[0], addr[1], addr[2]`
- last 3 elements: `value[0] + lower * 2^16, value[1] + upper * 2^16, value[3]`

We directly range check that the `value` is composed of u16 limbs, and that both `lower, upper` are bytes satisfying `lower + upper * 2^8 == value[2]`. This ensures that the map from the actual message to the packed one is injective, and no field overflows are possible. Check `crates/core/machine/src/memory/local.rs` for the code implementing the above.
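As an illustration of this packing and of why it stays injective and overflow-free, here is a small plain-Rust sketch, with `u64` arithmetic standing in for field elements (not the actual SP1 code):

```rust=
/// Packs a memory interaction message into 8 "field elements" (modeled as u64s here).
/// Inputs are assumed to be range checked: clk limbs are 24 bits, addr/value limbs 16 bits.
fn pack_memory_message(clk_high: u64, clk_low: u64, addr: [u64; 3], value: [u64; 4]) -> [u64; 8] {
    // Break the third value limb into two bytes; in the circuit both bytes are range
    // checked and constrained to recombine into `value[2]`.
    let lower = value[2] & 0xff;
    let upper = value[2] >> 8;
    [
        clk_high,
        clk_low,
        addr[0],
        addr[1],
        addr[2],
        value[0] + (lower << 16), // 16 bits + 8 bits = 24 bits, far below the field modulus
        value[1] + (upper << 16),
        value[3],
    ]
}

fn main() {
    let packed = pack_memory_message(3, 9, [1, 2, 3], [0x1234, 0x5678, 0xabcd, 0x9999]);
    // Each packed element fits in at most 24 bits, so no field overflow occurs, and the
    // original limbs can be uniquely recovered from the packed form (injectivity).
    assert!(packed.iter().all(|x| *x < (1 << 24)));
    assert_eq!(packed[5] >> 16, 0xcd); // `lower` byte of value[2]
    assert_eq!(packed[6] >> 16, 0xab); // `upper` byte of value[2]
}
```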
In the case of syscalls, the message is `clk_high` (24 bits), `clk_low` (24 bits), `syscall_id` (8 bits), `arg1` (three 16-bit limbs), and `arg2` (three 16-bit limbs). The packed message is

- first 2 elements: `clk_high, clk_low`
- 3rd element: `syscall_id + arg1[0] * 2^8`
- final 5 elements: `arg1[1], arg1[2], arg2[0], arg2[1], arg2[2]`

We directly range check `syscall_id` to be a byte, and `arg1[0]` to be 16 bits. This ensures that the map from the actual message to the packed one is injective, without overflow. We note that injectivity is important, as interpreting a global interaction (the packed message) back into the actual message should be well-defined.

Once the message is packed and sent to the `Global` chip, we handle the following.

- The `InteractionKind`, sent as a column `kind`, should be incorporated.
- The 8-bit "tweak" that makes the `y`-coordinate exist for the EC point should be incorporated.

To do this, we place some additional range checks.

- The first element of the message is range checked to be 24 bits. This is fine, since in both cases the first element of the message is `clk_high`, intended to be 24 bits.
- The `kind` is range checked to be 6 bits.
- The last element of the message is range checked to be 16 bits. Note that for both `kind`s, the last element of the message will be 16 bits.

Then, we incorporate `kind` by making the first element of the hash input `message[0] + 2^24 * kind`, which also doesn't overflow the prime. We incorporate the `offset` (tweak) by making the last element of the hash input `message[7] + 2^16 * offset`. From that point, we apply the usual hashing and elliptic curve lifting logic and add up the digests.

Each row of the `Global` chip adds a digest to the running sum of digests. This means that `eval_public_values` needs to send the initial "zero" digest and receive the final global cumulative sum. This is shown in the `eval_global_sum` function inside it.

## Memory Initialization & Finalization

For the Blum memory argument to work, the initial memory state and the final memory state have to be handled. This is done through the `MemoryGlobal` chip. There are two important parts that the `MemoryGlobal` chip needs to handle:

- The register `x0` must be initialized to zero.
- Each address must be initialized at most once and finalized at most once.

To ensure this, we make sure that the addresses are increasing as we initialize and finalize them, and that address zero is not skipped in this process. For this, consider the following state machine `(index, addr, valid)`, where `index` represents the index of the current initialization/finalization, `addr` represents the address previously handled, and `valid` is a boolean that represents the validity of the state. The state machine starts at `(0, 0, 1)`, and transitions as follows.

- If `index == 0` and `addr == 0`, then you must initialize/finalize address `0` with value `0`, and the next state becomes `(index + 1, 0, 0) = (1, 0, 0)`.
- If `index != 0` or `addr != 0`, then you must initialize/finalize an `addr' > addr`, and the next state becomes `(index + 1, addr', 1)`.

Then, you check that the last state has `valid = 1` and `addr != 0`. Basically, this skips the `addr < addr'` check just for the zero address, and ensures that the zero address is initialized. The check that the final `addr != 0` ensures that `x0` is initialized.

However, initialization and finalization may be split across shards. To handle this, we add `previous_init_addr` and `last_init_addr` to the public values. We ensure the following.
- The state machine for the shard starts at `(0, public_values.previous_init_addr, 1)` and ends at `(global_init_count, public_values.last_init_addr, 1)` for each shard.
- The current shard's `last_init_addr` is the next shard's `previous_init_addr`.
- The first shard's `previous_init_addr == 0`, and the last shard's `last_init_addr != 0`.

The same goes for memory finalization. Notice that the `validity` flag defends against potential attacks where `previous_init_addr == last_init_addr == 0` while address zero **is** initialized in that shard, which could lead to multiple initializations of `x0`. Indeed, if `x0` is not initialized, then `previous_init_addr == last_init_addr == 0` with nothing happening, while if `x0` is initialized, the condition that `validity == 1` enforces that some other address is initialized in that shard, making `last_init_addr != 0`.

The `MemoryGlobal` chip implements this state machine by first receiving the current state and then sending the next state, as shown in the code below. The index always increments by 1.

```rust=
// Receive the previous index, address, and validity state.
builder.receive(
    AirInteraction::new(
        vec![local.index]
            .into_iter()
            .chain(local.prev_addr)
            .chain(once(local.prev_valid))
            .map(Into::into)
            .collect(),
        local.is_real.into(),
        interaction_kind,
    ),
    InteractionScope::Local,
);

// Send the next index, address, and validity state.
builder.send(
    AirInteraction::new(
        vec![local.index + AB::Expr::one()]
            .into_iter()
            .chain(local.addr.map(Into::into))
            .chain(once(local.is_comp.into()))
            .collect(),
        local.is_real.into(),
        interaction_kind,
    ),
    InteractionScope::Local,
);
```

We ensure that `prev_addr` and `addr`, as well as `value`, are composed of valid `u16` limbs.

```rust=
// Constrain that `local.is_real` is boolean.
builder.assert_bool(local.is_real);

// Constrain that the value is a valid `Word`.
builder.slice_range_check_u16(&local.value.0, local.is_real);

// Constrain that the previous address is a valid `Word`.
builder.slice_range_check_u16(&local.prev_addr, local.is_real);

// Constrain that the address is a valid `Word`.
builder.slice_range_check_u16(&local.addr, local.is_real);
```

Then, we check whether `index == 0 && prev_addr == 0` holds.

```rust=
// Assert that `prev_addr < addr` when `prev_addr != 0` or `index != 0`.
// First, check if `prev_addr != 0`, and check if `index != 0`.
// SAFETY: Since `prev_addr` are composed of valid u16 limbs, adding them to check if
// all three limbs are zero is safe, as overflows are impossible.
IsZeroOperation::<AB::F>::eval(
    builder,
    IsZeroOperationInput::new(
        local.prev_addr[0] + local.prev_addr[1] + local.prev_addr[2],
        local.is_prev_addr_zero,
        local.is_real.into(),
    ),
);
IsZeroOperation::<AB::F>::eval(
    builder,
    IsZeroOperationInput::new(
        local.index.into(),
        local.is_index_zero,
        local.is_real.into(),
    ),
);

// Comparison will be done unless both `prev_addr == 0` and `index == 0`.
// If `is_real = 0`, then `is_comp` will be zero.
// If `is_real = 1`, then `is_comp` will be zero when `prev_addr == 0` and `index == 0`.
// If `is_real = 1`, then `is_comp` will be one when `prev_addr != 0` or `index != 0`.
builder.assert_eq(
    local.is_comp,
    local.is_real
        * (AB::Expr::one() - local.is_prev_addr_zero.result * local.is_index_zero.result),
);
builder.assert_bool(local.is_comp);
```

If `is_comp` is true, then `prev_addr < addr` must hold. If `is_comp` is false, then `addr == 0` and `value == 0` should be true. We ensure this with constraints.
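To see why this state machine enforces "strictly increasing, no duplicates, and address zero included", here is a small plain-Rust simulation of the transition rule (illustrative only):

```rust=
/// Runs the `(index, addr, valid)` state machine over a sequence of initialized addresses
/// and returns the final state, panicking if a transition is invalid.
fn run_state_machine(addrs: &[u64]) -> (u64, u64, bool) {
    let (mut index, mut prev_addr, mut valid) = (0u64, 0u64, true);
    for &addr in addrs {
        if index == 0 && prev_addr == 0 {
            // The very first initialization must be address 0 (the `x0` register).
            assert_eq!(addr, 0, "address zero must be initialized first");
            (index, prev_addr, valid) = (index + 1, 0, false);
        } else {
            // All later addresses must be strictly increasing, so no duplicates are possible.
            assert!(addr > prev_addr, "addresses must strictly increase");
            (index, prev_addr, valid) = (index + 1, addr, true);
        }
    }
    (index, prev_addr, valid)
}

fn main() {
    let (_, last_addr, valid) = run_state_machine(&[0, 0x10000, 0x10008, 0x20000]);
    // The final check: the last state is valid and its address is non-zero, which can
    // only happen if address 0 was handled and at least one other address followed.
    assert!(valid && last_addr != 0);
}
```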
The initial/final state for the shard is sent in `eval_public_values`.

```rust=
// crates/core/executor/src/record.rs:eval_global_memory_init
builder.send(
    AirInteraction::new(
        once(AB::Expr::zero())
            .chain(public_values.previous_init_addr.into_iter().map(Into::into))
            .chain(once(AB::Expr::one()))
            .collect(),
        AB::Expr::one(),
        InteractionKind::MemoryGlobalInitControl,
    ),
    InteractionScope::Local,
);
builder.receive(
    AirInteraction::new(
        once(public_values.global_init_count.into())
            .chain(public_values.last_init_addr.into_iter().map(Into::into))
            .chain(once(AB::Expr::one()))
            .collect(),
        AB::Expr::one(),
        InteractionKind::MemoryGlobalInitControl,
    ),
    InteractionScope::Local,
);
```

## Handling `exit_code` and `committed_value_digest`

The ultimate goal of SP1 is to provide the output of the program execution. This is embedded in the `exit_code` of the program and the `committed_value_digest`, which is a SHA2 hash of all the committed values. These values are also propagated throughout the recursion and kept up until the very last proof that is sent to the on-chain verifier.

Consider the `exit_code`. Recall that we view each shard as a transition function of the zkVM state. In that spirit, we keep a `prev_exit_code` and `exit_code` as public values of each core shard. The current shard's `exit_code` will be equal to the next shard's `prev_exit_code`. Now, the `exit_code` is determined when the `HALT` syscall is called, and this will be the last instruction of the program execution. Therefore, we constrain that

- if `HALT` is called, then `public_values.exit_code` is the declared exit code. Here, the declared exit code is a well-formed prime field element represented as a u64 word.
- in a non-execution shard, `prev_exit_code == exit_code`.
- the first shard's `prev_exit_code == 0`.

This enforces that the last shard's `public_values.exit_code` is equal to the execution's exit code. Since `HALT` is the last instruction executed, the last execution shard will have `public_values.exit_code` be the correct exit code. Then, all the shards after that will have `prev_exit_code == exit_code`, so this exit code will remain correct all the way. Once again, note that we enforce that `HALT` is the last instruction.

```rust=
#[allow(clippy::type_complexity)]
fn eval_exit_code<AB: SP1AirBuilder>(
    public_values: &PublicValues<
        [AB::PublicVar; 4],
        [AB::PublicVar; 3],
        [AB::PublicVar; 4],
        AB::PublicVar,
    >,
    builder: &mut AB,
) {
    let is_execution_shard =
        public_values.next_execution_shard.into() - public_values.execution_shard.into();

    // If the `prev_exit_code` is non-zero, then the `exit_code` must be equal to it.
    builder.assert_zero(
        public_values.prev_exit_code.into()
            * (public_values.exit_code.into() - public_values.prev_exit_code.into()),
    );

    // If it's not an execution shard, assert that `exit_code` will not change in that shard.
    builder
        .when_not(is_execution_shard.clone())
        .assert_eq(public_values.prev_exit_code, public_values.exit_code);
}
```

Similarly, we have `prev_committed_value_digest` and `committed_value_digest` inside the public values. The `COMMIT` syscall enforces `public_values.committed_value_digest` to be correct, and this syscall is called 8 times, once per every 32 bits of the digest. To prevent the `committed_value_digest` from changing after the `COMMIT` syscall is called, we add `prev_commit_syscall` and `commit_syscall` to the public values. These represent whether or not a `COMMIT` syscall has been called so far.
The constraints are as follows.

Constraints on `prev_commit_syscall` and `commit_syscall`:

- The current shard's `commit_syscall` is the next shard's `prev_commit_syscall`.
- `prev_commit_syscall` and `commit_syscall` are both boolean.
- If `prev_commit_syscall == 1`, then `commit_syscall == 1`.
- If the `COMMIT` syscall is called, `commit_syscall == 1`.
- In a non-execution shard, `prev_commit_syscall == commit_syscall`.
- The first shard's `prev_commit_syscall == 0`, and the last shard's `commit_syscall == 1`.

Constraints on `prev_committed_value_digest` and `committed_value_digest`:

- The current shard's `committed_value_digest` is the next shard's `prev_committed_value_digest`.
- `prev_committed_value_digest` and `committed_value_digest` are composed of bytes.
- The first shard's `prev_committed_value_digest == [0u8; 32]`.
- If one of the following holds, then `prev_committed_value_digest == committed_value_digest`:
  - the shard is a non-execution shard
  - `prev_committed_value_digest != 0` (there's a non-zero byte)
  - `prev_commit_syscall == 1`

This enforces that when the first `COMMIT` is called, the public values have the correct `committed_value_digest` loaded in, and from that point on the value will remain correct. Note that it's the user program's responsibility to call `syscall_halt` at the end, so that the `COMMIT` syscall is called. Note that SP1 enforces that the execution ends with a `HALT` syscall. Most of these constraints are inside the `eval_public_values` constraints.

```rust=
// Assert that both `prev_commit_syscall` and `commit_syscall` are boolean.
builder.assert_bool(public_values.prev_commit_syscall);
builder.assert_bool(public_values.commit_syscall);

// Assert that `prev_commit_syscall == 1` implies `commit_syscall == 1`.
builder.when(public_values.prev_commit_syscall).assert_one(public_values.commit_syscall);

// Assert that the `commit_syscall` value doesn't change in a non-execution shard.
builder
    .when_not(is_execution_shard.clone())
    .assert_eq(public_values.prev_commit_syscall, public_values.commit_syscall);

// Assert that `committed_value_digest` will not change in a non-execution shard.
for i in 0..PV_DIGEST_NUM_WORDS {
    builder.when_not(is_execution_shard.clone()).assert_all_eq(
        public_values.prev_committed_value_digest[i],
        public_values.committed_value_digest[i],
    );
}

// Assert that `prev_committed_value_digest != [0u8; 32]` implies `committed_value_digest`
// must remain equal to the `prev_committed_value_digest`.
for word in public_values.prev_committed_value_digest {
    for limb in word {
        for i in 0..PV_DIGEST_NUM_WORDS {
            builder.when(limb).assert_all_eq(
                public_values.prev_committed_value_digest[i],
                public_values.committed_value_digest[i],
            );
        }
    }
}

// Assert that if `prev_commit_syscall` is true, `committed_value_digest` doesn't change.
for i in 0..PV_DIGEST_NUM_WORDS {
    builder.when(public_values.prev_commit_syscall).assert_all_eq(
        public_values.prev_committed_value_digest[i],
        public_values.committed_value_digest[i],
    );
}
```

The same goes for the `deferred_proofs_digest` and the `COMMIT_DEFERRED_PROOFS` syscall. Note that the `COMMIT` syscall enforcing `commit_syscall == 1`, and vice versa, is enforced inside the `SyscallInstrsChip`, as shown in the code below.

```rust=
// COMMIT/COMMIT_DEFERRED_PROOFS ecall instruction.
self.eval_commit(
    builder,
    &a,
    local,
    public_values.commit_syscall,
    public_values.commit_deferred_syscall,
    public_values.committed_value_digest,
    public_values.deferred_proofs_digest,
);

// ...
let (is_commit, is_commit_deferred_proofs) = self.get_is_commit_related_syscall(
    builder,
    prev_a_byte,
    commit_syscall,
    commit_deferred_syscall,
    local,
);

// ...

// If the `COMMIT` syscall is called in the shard, then `pv.commit_syscall == 1`.
builder.when(is_commit).assert_one(commit_syscall);

// If the `COMMIT_DEFERRED_PROOFS` syscall was called, `pv.commit_deferred_syscall == 1`.
builder.when(is_commit_deferred_proofs).assert_one(commit_deferred_syscall);
```

## Public Values in Recursion

We briefly go over the public values handling in the core verifier and recursion. Once again, each shard is considered a transition function, and the validity of the transitions is constrained in `eval_public_values`. The remaining work is to "connect" the public values between consecutive shards, and to apply the boundary constraints. The `verify` function in `crates/prover/src/verify.rs`, which verifies the core shard proofs, shows these.

First, we document the desired behaviors in `crates/prover/src/verify.rs`. The first shard should have the following.

- `pc_start == vk.pc_start`
- `initial_timestamp == 1`
- `prev_committed_value_digest == [0u8; 32]`
- `prev_deferred_proofs_digest == [0; 8]`
- `prev_commit_syscall == prev_commit_deferred_syscall == 0`
- `prev_exit_code == 0`
- `previous_init_addr == 0`, `previous_finalize_addr == 0`

The last shard should have the following.

- `next_pc == HALT_PC`
- `commit_syscall == commit_deferred_syscall == 1`
- `last_init_addr != 0, last_finalize_addr != 0`

Between consecutive shards, the following values are connected.

- `last_timestamp` and the next shard's `initial_timestamp`
- `next_pc` and the next shard's `pc_start`
- `exit_code` and the next shard's `prev_exit_code`
- `last_init_addr` and the next shard's `previous_init_addr`
- `last_finalize_addr` and the next shard's `previous_finalize_addr`
- `committed_value_digest` and the next shard's `prev_committed_value_digest`
- `deferred_proofs_digest` and the next shard's `prev_deferred_proofs_digest`
- `commit_syscall` and the next shard's `prev_commit_syscall`
- `commit_deferred_syscall` and the next shard's `prev_commit_deferred_syscall`

Also, across all shards, the sum of the `global_cumulative_sum`s, together with the `vk.initial_global_cumulative_sum`, should be the zero digest, to show that the global permutation argument resolves. The total number of shards is also bounded.

In the Rust verifier, this verification is straightforward. However, in recursive verification, since the recursion verifier does not get all the shard proofs at once, it becomes more complicated. The main issue is that it's more difficult to pinpoint a unique first shard, which is required in order to compare the starting program counter against `vk.pc_start`. To solve this, we add an `is_first_shard` flag to the public values, and enforce the following.

- `is_first_shard` is boolean.
- If `is_first_shard` is false, then `initial_timestamp != 1`.
- If `is_first_shard` is true, then `initial_timestamp == 1`.
- For all shards, `last_timestamp != 1`.
- If `is_first_shard` is true, then the shard is an execution shard.
- If `is_first_shard` is true, then the initial shard constraints are applied.

This enforces that the shard with `is_first_shard == 1` must be the first shard. Since all shards have `last_timestamp != 1`, a shard with `initial_timestamp == 1` cannot come after another shard, since that would violate the `last_timestamp` being connected to the next shard's `initial_timestamp`. This is complete, as all proofs start with an execution shard.
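To summarize the chaining rules above, here is a minimal plain-Rust sketch of the cross-shard connection check; the `ShardPublicValues` struct and the subset of fields are illustrative, not the actual SP1 types:

```rust=
/// Illustrative subset of a shard's public values (not the actual SP1 struct).
struct ShardPublicValues {
    initial_timestamp: u64,
    last_timestamp: u64,
    pc_start: u64,
    next_pc: u64,
    prev_exit_code: u64,
    exit_code: u64,
    previous_init_addr: u64,
    last_init_addr: u64,
}

/// Checks that consecutive shards are connected: the next shard's "previous" values
/// must equal the current shard's "final" values.
fn check_connected(shards: &[ShardPublicValues]) -> bool {
    shards.windows(2).all(|pair| {
        let (cur, next) = (&pair[0], &pair[1]);
        cur.last_timestamp == next.initial_timestamp
            && cur.next_pc == next.pc_start
            && cur.exit_code == next.prev_exit_code
            && cur.last_init_addr == next.previous_init_addr
    })
}

fn main() {
    let shards = vec![
        ShardPublicValues {
            initial_timestamp: 1, last_timestamp: 801,
            pc_start: 0x20_0000, next_pc: 0x20_0400,
            prev_exit_code: 0, exit_code: 0,
            previous_init_addr: 0, last_init_addr: 0x10000,
        },
        ShardPublicValues {
            initial_timestamp: 801, last_timestamp: 2001,
            pc_start: 0x20_0400, next_pc: 1, // HALT_PC
            prev_exit_code: 0, exit_code: 0,
            previous_init_addr: 0x10000, last_init_addr: 0x30000,
        },
    ];
    assert!(check_connected(&shards));
}
```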
We briefly note how the circuit constrains a timestamp to be not equal to 1. Since the timestamp is in two 24-bit limbs, is range checked in `eval_public_values`, and is checked to be `1 (mod 8)`, one can simply check that the sum of the two 24-bit limbs is not equal to `1` using a modular inverse. To see this, note that

- the sum of the two limbs cannot overflow the underlying prime,
- the sum of the two limbs being `1` is only possible when the timestamp is `1` or `2^24`,
- the timestamp being `2^24` is impossible, as the timestamp must be `1 (mod 8)`.

In `crates/recursion/circuit/src/machine/core.rs`, we handle a single core shard by verifying the core proof and handling some of the boundary constraints for the first shard. Then, `crates/recursion/circuit/src/machine/compress.rs` handles connecting multiple shard proofs by checking that the public values are connected to each other.

First, let's look at `core.rs`. We use the `is_first_shard` flag as follows.

```rust=
// If it's the first shard, then the `pc_start` should be vk.pc_start.
for (pc, vk_pc) in public_values.pc_start.iter().zip_eq(vk.pc_start.iter()) {
    builder.assert_felt_eq(public_values.is_first_shard * (*pc - *vk_pc), C::F::zero());
}

// If it's the first shard, we add the vk's `initial_global_cumulative_sum` to the
// digest. If it's not the first shard, we add the zero digest to the digest.
global_cumulative_sums.push(builder.select_global_cumulative_sum(
    public_values.is_first_shard,
    vk.initial_global_cumulative_sum,
));
```

Basically, the `vk.pc_start` check is applied only for the first shard, and the `vk.initial_global_cumulative_sum` is added to the sum only on the first shard, so that it's added exactly once. We also set the `sp1_vk_digest` value to the hash of the vkey.

In `compress.rs`, we connect the public values as described in `crates/prover/src/verify.rs`, and assert that `sp1_vk_digest`, `vk_root`, and `proof_nonce` remain equal throughout. It also accumulates the `global_cumulative_sum` over the leaves. There are a couple more things to do, namely to ensure that

- there is a unique shard with `is_first_shard == 1`,
- the number of shards is bounded accordingly.

To handle these, the values `contains_first_shard` and `num_included_shards` are included in the recursion public values. These values are equal to `is_first_shard` and `1` respectively at the core verification level, and are accumulated through the recursion. Then, these accumulated values are checked to be `1` and a value within the bound at the end. To prevent overflows, the `contains_first_shard` value is checked to be boolean at every layer of recursion, and the bound on `num_included_shards` is also checked at every layer.

Ultimately, the compress logic is to combine a state transition $S \rightarrow S'$ and $S' \rightarrow S''$ into a single $S \rightarrow S''$, and build up the proofs in a tree-like structure.
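A minimal plain-Rust sketch of this combination step, under the simplifying assumption that a transition is described only by its endpoint states (the `State` and `Transition` types are illustrative):

```rust=
/// Illustrative state and transition types; a real shard state has many more fields.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct State {
    timestamp: u64,
    pc: u64,
}

#[derive(Clone, Copy, Debug)]
struct Transition {
    initial: State,
    last: State,
}

/// Combines two adjacent transitions into one, checking that they are connected.
fn compress(left: Transition, right: Transition) -> Transition {
    // The compress layer asserts that the left child's final state equals the
    // right child's initial state, then exposes only the outer endpoints.
    assert_eq!(left.last, right.initial, "children must be connected");
    Transition { initial: left.initial, last: right.last }
}

fn main() {
    let t1 = Transition {
        initial: State { timestamp: 1, pc: 0x20_0000 },
        last: State { timestamp: 801, pc: 0x20_0400 },
    };
    let t2 = Transition {
        initial: State { timestamp: 801, pc: 0x20_0400 },
        last: State { timestamp: 2001, pc: 1 }, // HALT_PC
    };
    let combined = compress(t1, t2);
    assert_eq!(combined.initial.timestamp, 1);
    assert_eq!(combined.last.pc, 1);
}
```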
In `complete.rs`, there's the full boundary condition to check whether the proof is complete. The `is_complete` flag represents whether the `RecursionPublicValues` is a complete proof. These conditions are, for `is_complete = 1` to hold:

- `prev_committed_value_digest == [0u8; 32]`
- `prev_deferred_proofs_digest == [0; 8]`
- `next_pc == HALT_PC`
- `contains_first_shard == 1`
- `initial_timestamp == 1`
- `previous_init_addr == 0`, `last_init_addr != 0`
- `previous_finalize_addr == 0`, `last_finalize_addr != 0`
- `start_reconstruct_deferred_digest == [0; 8]`
- `end_reconstruct_deferred_digest == deferred_proofs_digest`
- `prev_exit_code == 0`
- `prev_commit_syscall == 0`, `commit_syscall == 1`
- `prev_commit_deferred_syscall == 0`, `commit_deferred_syscall == 1`
- `global_cumulative_sum == 0`

Note that the first shard's `pc_start == vk.pc_start` constraint is done in core. Also, `is_complete` is checked to be boolean. In the top level of recursion, this `is_complete` is checked to be `1`, so that the proof is ensured to be of a complete execution.

## Preventing Lookup Multiplicity Overflow in AIRs

**Note**: The following discussion is for the core AIRs. In the recursion AIRs, the multiplicity of interactions is in the preprocessed trace, so multiplicity overflow is out of the question.

There's also one aspect of the AIRs that is subtle but very important. The following must hold: every interaction that is not sending/receiving preprocessed trace (that is, not inside `ByteChip`, `RangeChip`, `ProgramChip`) must have a **boolean multiplicity**. This is to prevent lookup multiplicities from overflowing the underlying prime. In short, SP1's defense against this multiplicity overflow attack is

- all "untrusted" interaction multiplicities are enforced to be boolean,
- ensure that `#[number of rows] >= #[number of interactions of a given kind]`,
- ensure that the trace area is at most `2^30`, which is less than the underlying prime.

The first check is done through the AIR. The second check is in the test `test_chips_main_width_interaction_ratio`. Note that since all the chips are fixed, this can be checked at compile time. The third check is done through the proof system, and is not a responsibility of the AIRs.

In general, our way of doing the first check is

- Define `is_real`, which denotes whether the row is a padding row or a real row. This value is checked to be boolean, and most untrusted interactions have multiplicity `is_real`.
- There are cases where the multiplicity `mult` is not identical to `is_real`. Then,
  - ensure that `is_real = 1` enforces booleanity of `mult`, and
  - importantly, when `is_real = 0`, enforce booleanity of `mult` (usually `mult = 0`).

## Misc Notes about SP1's RISC-V AIRs

With this context, the AIRs of SP1, especially the ones that handle the RISC-V instructions, are relatively straightforward, as they simply aim to prove the RISC-V opcode behavior. There are some tidbits that are additionally constrained on top of the RISC-V specification.

**Memory Opcodes**

The memory opcode chips themselves are fairly straightforward, but what may not be straightforward at first glance is how the opcodes are split into multiple AIR tables.

- `LoadByte`: handles the `LB`, `LBU` opcodes where `rd != x0`.
- `LoadHalf`: handles the `LH`, `LHU` opcodes where `rd != x0`.
- `LoadWord`: handles the `LW`, `LWU` opcodes where `rd != x0`.
- `LoadDouble`: handles the `LD` opcode where `rd != x0`.
- `LoadX0`: handles all load opcodes where `rd == x0`.
- `StoreByte`: handles the `SB` opcode.
- `StoreHalf`: handles the `SH` opcode.
- `StoreWord`: handles the `SW` opcode.
- `StoreDouble`: handles the `SD` opcode.

Notably, this split is based on the fact that load opcodes with `rd == x0` don't actually load the memory values into the register, as `x0` is hard-wired to zero.

Also, another important aspect of SP1 is that registers and memory share the same permutation check: the `InteractionKind` used for both of them is the same. Therefore, we enforce that the memory chips and other functions intended to read memory, not registers, do not interfere with the register space. SP1 reserves `[0x00, 0x20)` for the registers. To enforce that the addresses are valid memory addresses, we assert that the address is at least `2^16`. This can be done by summing the top two `u16` limbs of the address and asserting that the sum is non-zero. This ensures that registers are untouched. Refer to `AddressOperation` for the exact implementation and handling of addresses.

**Branch & Jump Opcodes**

The branch, JAL, and JALR opcodes constrain the `next_pc` according to the RISC-V specification, but we additionally constrain that the `next_pc` value is a multiple of `4`. This is complete, as all valid `pc` values in an actual program are multiples of `4`. As mentioned, this is to ensure the execution arrives at `HALT_PC = 1` via a `HALT` syscall.

> We are also aware that our JALR implementation doesn't clear the least significant bit of the (rs1 + imm) value. Since this value is asserted to be a multiple of 4 in the circuit, there are no soundness issues. The only issue that may arise from this is that there might be a valid execution where the (rs1 + imm) value is not even, leading to a potential completeness issue. However, in usual programs compiled through the standard SP1 toolchain, this behavior will not be observed. Note that this clearing of the LSB usually only happens on incorrect pointer arithmetic or in an unsafe program, so this completeness issue is very minor.

## Misc Notes about SP1's Precompiles

There are two types of precompiles in SP1.

- Precompiles with bigint operations (elliptic curves, for example)
- Precompiles for hash functions (SHA2, Keccak, Poseidon2)

**Precompiles with Bigint Operations**

For the former, the core logic is mostly identical to [SP1 v1](https://hackmd.io/wztOd455QKWf-K8LXh_Fqw?view#Core-VM---Field-Operations)'s logic. The difference is that since the unit of memory is now a `u64`, there are fewer memory addresses to read from than before, when `u32` was the unit of memory. Also, because SP1 now uses `u16` limbs, a conversion (with range checking) from `u16` limbs to `u8` limbs is used.

```rust=
fn generate_limbs(
    &mut self,
    memory_access_cols: &[MemoryAccessColsU8<Self::Var>],
    is_real: Self::Expr,
) -> Vec<Self::Expr> {
    let u16_to_u8_input = |access: &MemoryAccessColsU8<Self::Var>| {
        U16toU8OperationSafeInput::new(
            access.memory_access.prev_value.0.map(|x| x.into()),
            access.prev_value_u8,
            is_real.clone(),
        )
    };

    // Convert the u16 limbs to u8 limbs using the safe API with range checks.
    let limbs = memory_access_cols
        .iter()
        .flat_map(|access| {
            let input = u16_to_u8_input(access);
            U16toU8OperationSafe::eval(self, input)
        })
        .collect::<Vec<_>>();

    limbs
}
```
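The arithmetic behind this conversion mirrors `eval_u16_to_u8_unsafe` from earlier; here is the same decomposition written out with plain integers (illustrative only):

```rust=
/// Splits four u16 limbs into eight u8 limbs, low byte first, mirroring the identity
/// `u16 = low + 256 * high` that the circuit enforces via a multiplication by 256^{-1}.
fn u16_limbs_to_u8_limbs(u16_limbs: [u16; 4]) -> [u8; 8] {
    let mut bytes = [0u8; 8];
    for i in 0..4 {
        let limb = u16_limbs[i];
        bytes[2 * i] = (limb & 0xff) as u8;   // the `low_bytes[i]` witness column
        bytes[2 * i + 1] = (limb >> 8) as u8; // recovered as (limb - low_byte) / 256
    }
    bytes
}

fn main() {
    // A u64 memory word 0x0123_4567_89ab_cdef stored as four little-endian u16 limbs.
    let limbs: [u16; 4] = [0xcdef, 0x89ab, 0x4567, 0x0123];
    let bytes = u16_limbs_to_u8_limbs(limbs);
    assert_eq!(bytes, [0xef, 0xcd, 0xab, 0x89, 0x67, 0x45, 0x23, 0x01]);

    // Reconstructing each limb shows the decomposition is exact.
    for i in 0..4 {
        assert_eq!(limbs[i] as u32, bytes[2 * i] as u32 + 256 * bytes[2 * i + 1] as u32);
    }
}
```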
**Hash Precompiles**

Let's discuss the hash precompiles. `Poseidon2` is a new hash precompile, but `SHA2` and `Keccak` have been inside SP1 since v1. The core logic, just like the bigint precompiles, remains the same. However, since our implementation of the hypercube proof system doesn't support constraints across two consecutive rows in the AIR table, the control flow implementation has changed in SP1 Hypercube.

We take the simple `ShaExtend` as an example. The precompile aims to do the following.

```python=
for i from 16 to 63
    s0 := (w[i-15] rightrotate 7) xor (w[i-15] rightrotate 18) xor (w[i-15] rightshift 3)
    s1 := (w[i-2] rightrotate 17) xor (w[i-2] rightrotate 19) xor (w[i-2] rightshift 10)
    w[i] := w[i-16] + s0 + w[i-7] + s1
```

To do this, the `for` loop must be handled. Previously, when constraints across two consecutive rows were possible, this was doable by having `i` itself as a column and having it increment by `1`. Now, this isn't possible, at least directly. To get around this, we view this as a state machine once again, and let each row of the AIR do the following.

- Receive the state, which is `(timestamp, w_ptr, i)`.
- Handle the computation of `s0, s1` and the writing of `w[i]`.
- Send the state, which is `(timestamp, w_ptr, i + 1)`.
- Range check that `16 <= i < 64` holds.

Here, `w_ptr` is the pointer to the array `w`. Of course, similar to the state machine inside the RISC-V core, this means that there must be something to send and receive the initial and final states. For this, an additional controller chip is added. The `ShaExtendControlChip` aims to do the following.

- Receive the syscall (which was sent through the `SyscallInstrsChip` as an ECALL), which carries the `timestamp` and `w_ptr`, the arguments.
- Send the initial state `(timestamp, w_ptr, 16)`.
- Receive the final state `(timestamp, w_ptr, 64)`.

Note that this enforces that each syscall is handled correctly. This idea is shared by the other `ShaCompress` and `KeccakPermute` syscalls: a controller chip and an index are used.

There's another change to the `SHA2` precompiles. As the unit of memory in SP1 Hypercube is `u64`, but `SHA2` deals with `u32` blocks, we store each `u32` value within a `u64`. While this does waste memory space, it makes the AIR much simpler. We refer to the [SHA2 patch](https://github.com/sp1-patches/RustCrypto-hashes/releases/tag/patch-sha2-0.10.9-sp1-6.0.0) for the memory state handling in the user program before the precompile. On the contrary, the `Poseidon2` precompile expects two KoalaBear field elements packed inside a `u64` word, each taking a `u32` slot. This means that the input is 8 `u64`s. Note that this is not a problem for `Keccak`, which takes `u64` blocks as input.

**Handling Addresses**

As mentioned in this document, the valid address space is `[0, 2^48)`, but in memory opcodes we restrict the address space to `[2^16, 2^48)`. In precompiles, usually a slice of memory is read. For example, to read a 256-bit value, a slice of 4 `u64`s is read from a pointer. To prevent reaching out of bounds, a `SyscallAddrOperation` is used to ensure that the starting pointer is within `[2^16, 2^48 - len)`, where `len` is the byte length. Also, all syscalls require that the input pointer is aligned to a multiple of `8`. At compile time, it's known that `len` is a multiple of `8` far less than `2^16`. Basically, `len` is not a column in the AIR; it's a fixed constant that may differ for each precompile. To ensure this, the following constraints are applied.

- Let `S = addr[1] + addr[2]` be the sum of the top two limbs of the address pointer.
- `S` is asserted to be non-zero using a modular inverse. This guarantees `addr >= 2^16`.
- Constrain `top_two_limb_max = (S == 2^17 - 2)` using an `IsZeroOperation`.
- Constrain that $$(\texttt{addr}[0] + \texttt{top\_two\_limb\_max} \cdot \texttt{len}) / 8 \in \{0, 1, 2, \cdots, 2^{13} - 1\}$$

This does two things.

- If `top_two_limb_max == 0`, then `addr < 2^48 - len` is guaranteed, as `addr <= 2^48 - 2^16`. In this case, the check simply constrains that `addr[0]` is a multiple of `8`, ensuring alignment.
- If `top_two_limb_max == 1`, then `addr[1] = addr[2] = u16::MAX`. Therefore, `addr[0] < 2^16 - len` must hold, so `(addr[0] + len) / 8` is checked to be a 13-bit value. Since `len` is known to be a multiple of `8`, `addr[0]` is ensured to be one as well.

Note that `SyscallAddrOperation` expects `addr` to be three valid u16 limbs, which is ensured as they are derived from the `op_b` and `op_c` words in the `SyscallInstrsChip`.

## Quick Notes on Recursion Circuit

In the recursion circuit folder, there are the following.

- A shard verifier written in recursion
- `recursion/circuit/src/machine`: logic for verifying leaf proofs and combining the public values, checking cross-shard & boundary conditions, and checking the validity of the verifying keys using the merkle tree (with root `vk_root`)
- The recursion AIRs that aid the shard verifier, in `recursion/machine/`. These AIRs have straightforward specifications, so no further explanation is given here.

Regarding the shard verifier, the relevant files can be found in `slop/` and `hypercube/src/verifier/`. The goal should be to check that the recursion circuits perform the same verification checks as the "native" shard verifier itself. Of course, if auditors find a security issue in the native shard verifier or the recursion verifier, they should report it.

Regarding `recursion/circuit/src/machine`, the discussions surrounding cross-shard & boundary conditions have already been covered above. The additional thing to discuss is the `vk_root` and the surrounding logic. Basically, to ensure that the recursion `vk`s are correct, we have a merkle tree of valid recursion verification keys with its root `vk_root` pre-computed. With this merkle tree infrastructure, the following can be constrained.

- **All recursion can be done with a circuit, not a VM (no branching at all)**
- **All recursion circuits can be pre-determined at setup, so they can be trusted**
- `vk_root` can be constrained to be equal throughout the recursion
- The validity of a recursion `vk` can be checked with a merkle proof verification

This `vk_root` logic has been present in previous versions of SP1, including SP1 V5. Auditors may focus on checking that the `vk_root` logic is more or less identical to SP1 V5, but may also attempt to break the `vk_root` logic in SP1 V5 as well.

One difference between SP1 V5 and SP1 Hypercube is how the `vk_root` propagates. In SP1 V5, in the wrap program, the `vk_root` would be asserted to be equal to the precomputed, hardcoded `vk_root`. Since `vk_root` was constrained to be equal throughout the recursion tree, this was sufficient to ensure that every `vk_root` in the tree equals the hardcoded one.

```rust=
// Attest that the merkle tree root is correct.
let root = input.merkle_var.root;
for (val, expected) in root.iter().zip(self.recursion_vk_root.iter()) {
    builder.assert_felt_eq(*val, *expected);
}
```

In SP1 Hypercube, the `vk_root` becomes a part of the PLONK/Groth16 proof, exposed as a part of the public values in the on-chain verifier. To send the `vk_root` to the PLONK/Groth16-level public values, we add it to the `root_public_values` digest.
```rust=
/// Compute the digest of the root public values.
pub(crate) fn root_public_values_digest<C, H>(
    builder: &mut Builder<C>,
    public_values: &RecursionPublicValues<Felt<SP1Field>>,
) -> [Felt<SP1Field>; DIGEST_SIZE]
where
    C: CircuitConfig,
    H: Poseidon2SP1FieldHasherVariable<C>,
{
    let input = public_values
        .sp1_vk_digest
        .into_iter()
        .chain(public_values.committed_value_digest.into_iter().flat_map(|word| word.into_iter()))
        .chain(std::iter::once(public_values.exit_code))
        .chain(public_values.vk_root)
        .chain(public_values.proof_nonce)
        .collect::<Vec<_>>();
    H::poseidon2_hash(builder, &input)
}
```

Note that compared to the version in SP1 V5, the following fields are included.

- `exit_code`: previous versions of SP1 could only prove executions with `exit_code == 0`. However, this version allows proving panics as well. Note that, as discussed before, the last core shard proof's `public_values.exit_code` must equal the correct `exit_code` value corresponding to the execution being proven.
- `vk_root`: we put the merkle tree root into the on-chain public values, which allows easier updates of the circuits. Note that the on-chain verifier is expected to check `vk_root` for validity, so that an arbitrary `vk_root` cannot be used. This on-chain verifier is not fully implemented yet, so it is not part of the scope of the audit.
- `proof_nonce`: this is a nonce for the entire proof; as can be seen from the recursion codebase, it's asserted to be equal throughout all the core shards.

For the PLONK/Groth16 circuit implementation, refer to `recursion/gnark-ffi`.

## Specifications of Individual Operations

### Addition & Subtraction

**AddOperation**
- input: `a: Word`, `b: Word`, `is_real`
- output: `value: Word`
- assumptions
  - `a`, `b` are u64 `Word`s of four valid `u16` limbs.
- constraints
  - `is_real` is boolean. If `is_real` is true, then `cols.value` is a `Word` of four valid `u16` limbs representing `a + b (mod 2^64)`. If `is_real` is false, no constraints.

**SubOperation**
- input: `a: Word`, `b: Word`, `is_real`
- output: `value: Word`
- assumptions
  - `a`, `b` are u64 `Word`s of four valid `u16` limbs.
- constraints
  - `is_real` is boolean. If `is_real` is true, then `cols.value` is a `Word` of four valid `u16` limbs representing `a - b (mod 2^64)`. If `is_real` is false, no constraints.

**AddU32Operation**
- input: `a: [Expr; 2]`, `b: [Expr; 2]`, `is_real`
- output: `value: [Expr; 2]`
- assumptions
  - `a`, `b` are u32 values of two valid `u16` limbs.
- constraints
  - `is_real` is boolean. If `is_real` is true, then `cols.value` is two valid `u16` limbs representing the `u32` value `a + b (mod 2^32)`. If `is_real` is false, no constraints.

**Add4Operation**
- input: `a, b, c, d: [Expr; 2]`, `is_real`
- output: `value: [Expr; 2]`
- assumptions
  - `a`, `b`, `c`, `d` are u32 values of two valid `u16` limbs.
- constraints
  - `is_real` is boolean. If `is_real` is true, `cols.value` is two valid `u16` limbs of the `u32` value `a + b + c + d (mod 2^32)`. If `is_real` is false, no constraints.

**Add5Operation**
- input: `words: [[Expr; 2]; 5]`, `is_real`
- output: `value: [Expr; 2]`
- assumptions
  - each `words[i]` is a u32 value of two valid `u16` limbs.
- constraints
  - `is_real` is boolean. If `is_real` is true, `cols.value` is two valid `u16` limbs of the `u32` value that is the sum of the five `u32` values `(mod 2^32)`. If `is_real` is false, no constraints.

Let's explain the `AddOperation` briefly.
```rust=
builder.assert_bool(is_real.clone());
let base = AB::F::from_canonical_u32(1 << 16);
let mut builder_is_real = builder.when(is_real.clone());
let mut carry = AB::Expr::zero();
for i in 0..WORD_SIZE {
    carry = (a[i].clone() + b[i].clone() - cols.value[i] + carry) * base.inverse();
    builder_is_real.assert_bool(carry.clone());
}
// Range check each limb.
builder.slice_range_check_u16(&cols.value.0, is_real);
```

This constraint shows that `2^16 * carry_next + value[i] = a[i] + b[i] + carry`, with `carry_next` constrained to be boolean and `value[i]` range checked to a u16. Since overflows are impossible, this shows that `carry_next` and `value` must be correct. Also, note that in an honest execution the `carry_next` value is indeed boolean, so the operation is complete as well.

The `SubOperation` uses the same method, with `a - b` thought of as `a + (2^64 - b)`. This means that the initial carry will be `1`, and the added limbs are `2^16 - 1 - b[i]`.

The `Add4Operation` and `Add5Operation` are similar.

```rust=
builder.assert_bool(is_real);
let base = AB::F::from_canonical_u32(1 << 16);
let mut carry_limbs = [AB::Expr::zero(), AB::Expr::zero()];
let mut carry = AB::Expr::zero(); // Initialize carry to zero
for i in 0..2 {
    carry = (a[i].clone() + b[i].clone() + c[i].clone() + d[i].clone() - cols.value[i] + carry)
        * base.inverse();
    carry_limbs[i] = carry.clone();
}
// Range check each limb.
builder.slice_range_check_u16(&cols.value, is_real);
builder.slice_range_check_u8(&carry_limbs, is_real);
```

Here, `2^16 * carry_next + value[i] = a[i] + b[i] + c[i] + d[i] + carry`, with `carry_next` being range checked to `u8` and `value[i]` being range checked to `u16`. Since the left hand side doesn't overflow, and `value[i]` is a u16, the `carry_next` and `value` must be correct. Since the `carry_next` value in an honest execution is one of `0, 1, 2, 3`, it is a `u8` value, so the circuit has completeness as well.

### Basic IsZero Utility

**IsZeroOperation**
- input: `a`, `is_real`
- output: `inverse`, `result`
- assumptions: None
- constraints
  - If `is_real == 0` then no constraints are applied.
  - If `is_real != 0 && a != 0` then `result = 0` and `inverse = a^-1`
  - If `is_real != 0 && a == 0` then `result = 1`

**IsZeroWordOperation**
- input: `a: Word`, `is_real`
- output: `result`
- assumptions: None (notably, this `Word` may not consist of valid u16 limbs)
- constraints
  - `is_real` is boolean
  - If `is_real == 0` then no constraints are applied.
  - If `is_real == 1 && a != 0` then `result = 0`.
  - If `is_real == 1 && a == 0` then `result = 1`.

**IsEqualWordOperation**
- input: `a: Word`, `b: Word`, `is_real`
- output: `is_diff_zero.result`
- assumptions: None (notably, these `Word`s may not consist of valid u16 limbs)
- constraints
  - `is_real` is boolean
  - If `is_real == 0` then no constraints are applied.
  - If `is_real == 1 && a != b` then `is_diff_zero.result = 0`.
  - If `is_real == 1 && a == b` then `is_diff_zero.result = 1`.

### Comparison & MSBs

**U16MSBOperation**
- input: `a`, `is_real`
- output: `msb`
- assumptions: `a` is a valid `u16` value
- constraints
  - `is_real` is boolean. If `is_real` is true, then `msb` is the MSB of `a` as a u16. This is done by constraining `2 * a - 2^16 * msb` to be a u16 value, and `msb` to be a boolean value. If `0 <= a < 2^15`, then this forces `msb` to be `0`, and if `2^15 <= a < 2^16`, then this forces `msb` to be `1`. In both cases `2 * a - 2^16 * msb` is a u16 value (a quick sanity sketch of this argument is given after this list).

Based on this, we can also handle ADDW and SUBW operations over the words.
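Before the word-level operations, here is the promised sanity sketch: a standalone, plain-Rust check (not the SP1 AIR itself) of the `U16MSBOperation` argument above, namely that for every u16 value `a` there is exactly one boolean `msb` making `2 * a - 2^16 * msb` a u16, and that it equals the top bit of `a`.

```rust=
// Sanity sketch (not SP1 code) of the U16MSBOperation argument: over the
// integers, for each u16 value `a` there is a unique boolean `msb` such that
// `2 * a - 2^16 * msb` lies in [0, 2^16), and it is exactly the top bit of `a`.
// (In the field, the out-of-range branch wraps around to a large field element
// and therefore fails the u16 range check, since the prime is far larger than 2^17.)
fn main() {
    for a in 0u32..(1 << 16) {
        let expected_msb = a >> 15; // the actual most significant bit of `a`
        let valid: Vec<u32> = (0u32..2)
            .filter(|&msb| {
                let v = 2 * a as i64 - (1i64 << 16) * msb as i64;
                (0..(1 << 16)).contains(&v)
            })
            .collect();
        // Exactly one choice of `msb` passes, and it is the real MSB:
        // soundness (no other choice works) and completeness (the honest choice works).
        assert_eq!(valid, vec![expected_msb]);
    }
    println!("U16MSBOperation check verified for all u16 values");
}
```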
**AddwOperation**
- input: `a: Word`, `b: Word`, `is_real`
- output: `value: [Expr; 2]`, `msb`
- assumptions: `a, b` are valid `Word`s of four u16 limbs.
- constraints
  - `is_real` is boolean. If `is_real` is true, then `value` is two u16 limbs that represent `a + b (mod 2^32)`, and `msb` is the most significant bit of `value`.

**SubwOperation**
- input: `a: Word`, `b: Word`, `is_real`
- output: `value: [Expr; 2]`, `msb`
- assumptions: `a, b` are valid `Word`s of four u16 limbs.
- constraints
  - `is_real` is boolean. If `is_real` is true, then `value` is two u16 limbs that represent `a - b (mod 2^32)`, and `msb` is the most significant bit of `value`.

**U16CompareOperation**
- input: `a`, `b`, `is_real`
- output: `bit`
- assumptions: `a`, `b` are valid `u16` values
- constraints
  - `is_real` is boolean. If `is_real` is true, then `bit == (a < b)`. This is done by constraining `a - b + 2^16 * bit` to be a u16 value, and `bit` to be a boolean value. If `a < b`, then `bit` must be `1`, whereas `a >= b` forces `bit` to be zero.

**LtOperationUnsigned**
- input: `b: Word`, `c: Word`, `is_real`
- output: `u16_compare_operation.bit`
- assumptions: `b`, `c` are valid u64 `Word`s with four u16 limbs
- constraints
  - `is_real` is boolean, and no constraints when `is_real = 0`.
  - If `is_real = 1`, `u16_flags` is `1` at the most significant limb that differs between `b` and `c`, and `0` at all other limbs. If `b == c`, then `u16_flags == [0; 4]`.
  - If `is_real = 1`, `comparison_limbs` holds the limbs of `b` and `c` at the most significant position where they differ. If `b == c`, then `comparison_limbs == [0; 2]`.
  - If `is_real = 1`, then `u16_compare_operation.bit == (b < c)`

```rust=
pub struct LtOperationUnsigned<T> {
    /// Instance of the U16CompareOperation.
    pub u16_compare_operation: U16CompareOperation<T>,
    /// Boolean flag to indicate which limb pair differs if the operands are not equal.
    pub u16_flags: [T; WORD_SIZE],
    /// An inverse of differing limb if b_comp != c_comp.
    pub not_eq_inv: T,
    /// The comparison limbs to be looked up.
    pub comparison_limbs: [T; 2],
}
```

The code is pretty well documented, but the key points are the following.

- `u16_flags` are all boolean and their sum is boolean.
- If `u16_flags` are not all zero, then the limb with `u16_flags == 1` must be the most significant one where `b` and `c` differ: that limb pair itself must differ since `is_comp_eq == 0`, so `comparison_limbs` are asserted to be different via `not_eq_inv`, and all limbs more significant than it are asserted to be equal.
- If `u16_flags` are all zero, then `b == c` as u64 `Word`s. In this case, `comparison_limbs` are both equal to zero, so the comparison goes through correctly.

Then, the `U16CompareOperation` is done on the comparison limbs.

**LtOperationSigned**
- input: `b: Word`, `c: Word`, `is_signed`, `is_real`
- output: `result.u16_compare_operation.bit`
- assumptions: `b`, `c` are valid u64 `Word`s with four u16 limbs
- constraints
  - `is_signed` is boolean. `is_signed == 1` means it's a signed comparison, viewing the u64s as i64s. `is_signed == 0` means it's an unsigned comparison as u64s.
  - `is_real` is boolean, and if `is_real = 0` then `is_signed = 0`.
  - `b_msb.msb`, `c_msb.msb` are the MSBs of `b`, `c` if `is_signed == 1`.
  - `b_msb.msb`, `c_msb.msb` are equal to zero if `is_signed == 0`.
  - `result.u16_compare_operation.bit == (b < c)`.

```rust=
pub struct LtOperationSigned<T> {
    /// The result of the SLTU operation.
    pub result: LtOperationUnsigned<T>,
    /// The most significant bit of operand b if `is_signed` is true.
    pub b_msb: U16MSBOperation<T>,
    /// The most significant bit of operand c if `is_signed` is true.
    pub c_msb: U16MSBOperation<T>,
}
```

One can show that `b < c` as `i64` if and only if `b ^ (1 << 63) < c ^ (1 << 63)` as `u64`. We use this fact to reduce the comparison to the unsigned case. Once the `msb`s are constrained using the `U16MSBOperation`, the XOR'ed result can be computed by

- `b[3] += is_signed * 2^15 - b_msb.msb * 2^16`
- `c[3] += is_signed * 2^15 - c_msb.msb * 2^16`

If `is_signed = 0`, the `msb`s will also be constrained to zero, so nothing happens. If `is_signed = 1`, then the `msb`s will be the actual MSB values. If `b`'s msb is `1`, then `b[3] -= 2^15`, and if `b`'s msb is `0`, then `b[3] += 2^15`, so the result is the XOR result as desired.

### U16toU8 Operations & Bitwise Operations

As mentioned, the `U16toU8Operation`s all aim to convert a `u16` into two `u8`s, but differ by whether or not they range check the bytes on the spot. If the output is range checked to be bytes, either by direct range checking in `eval_u16_to_u8_safe` or by an implicit range check later on, these guarantee that the byte conversion of a `u16` is correct. All bitwise operations implicitly range check the bytes because they are handled through a byte lookup. There's also a `U32toU8Operation`, which aims to convert two u16 limbs into four u8 limbs and works in exactly the same way. The following functions are provided.

- `eval_u16_to_u8_unsafe`: Assumes that the input is composed of valid u16 limbs, and that the output will be range checked to be valid u8s later. Returns the u8 form of the input.
- `eval_u16_to_u8_safe`: Assumes that the input is composed of valid u16 limbs, and that `is_real` is boolean. If `is_real = 1`, range checks the output as u8s. Returns the u8 form.
- `eval_u32_to_u8_unsafe`: Identical to `eval_u16_to_u8_unsafe`, but the `u32` version of the function takes in two u16 limbs instead of four u16 limbs.

**BitwiseOperation**
- input: `a, b: [Expr; 8]`, `opcode`, `is_real`
- output: `result: [Expr; 8]`
- assumptions: `is_real` is boolean, and `opcode` is a valid `ByteOpcode` (AND, OR, XOR)
- constraints: If `is_real = 0`, no constraints. If `is_real = 1`, then `result` is the valid bitwise operation result, and it is also implicitly constrained that `a, b, result` are all bytes.

**BitwiseU16Operation**
- input: `b, c: Word`, `opcode`, `is_real`
- output: `result: Word`
- assumptions
  - `b, c` are valid u64 `Word`s with four u16 limbs.
  - `opcode` is a valid `ByteOpcode` (AND, OR, XOR)
- constraints: Asserts that `is_real` is boolean. If `is_real = 0`, no constraints. If `is_real = 1`, then `result` is the bitwise operation result. `eval_u16_to_u8_unsafe` is used, and then `BitwiseOperation` is used, which is fine since `is_real` is checked to be boolean.

**AndU32Operation**, **XorU32Operation**
- input: `b, c: [Expr; 2]`, `is_real`
- output: `result: [Expr; 2]`
- assumptions
  - `b, c` are valid u32 values with two u16 limbs.
- constraints: Asserts that `is_real` is boolean. If `is_real = 0`, no constraints. If `is_real = 1`, then `result` is the bitwise operation result. `eval_u32_to_u8_unsafe` is used.

### Range Checks on Underlying Field

**SP1FieldWordRangeChecker**
- input: `value: Word`, `is_real`
- output: None
- assumptions
  - `value` is a valid `Word` with four u16 limbs.
  - $p \equiv 1 \pmod{2^{16}}$ and $0 < p < 2^{32}$ (true for KoalaBear, BabyBear)
- constraints: `value` represents a value less than $p$, the prime of the native field.

This function constrains `0 <= value < p` by first constraining that the top two limbs of `value` are zero. Then, `U16CompareOperation` is used to check whether `value[1] < TOP_LIMB`, where `TOP_LIMB = (p - 1) / 2^16`. If this is true, then `0 <= value < p` is guaranteed. If not, it must be the case that `value[0] = 0` and `value[1] = TOP_LIMB`, which gives `value = p - 1 < p`.

### u64 Multiplication

**MulOperation**
- input: `b_word, c_word: Word`, `is_{real, mul, mulh, mulw, mulhu, mulhsu}`
- output: `a_word: Word`
- assumptions: `b_word, c_word` are valid `Word`s with four u16 limbs.
- constraints
  - `is_real, is_mul, is_mulh, is_mulw, is_mulhu, is_mulhsu` are boolean
  - at most one of `is_mul, is_mulh, is_mulw, is_mulhu, is_mulhsu` is true
  - If `is_real` is true and one of the five opcode flags is true, then `a_word` is the valid multiplication result of `b_word` and `c_word` according to that opcode

First, the safe conversion of `U16toU8Operation` is used to convert the u16 limbs to u8 limbs. Then, the inputs are sign-extended based on the opcode selectors and the input's top bit, which is constrained with a byte lookup. Then, the extended inputs are multiplied in a polynomial-multiplication-like manner, and the carries are handled so that the result is composed of valid bytes. This process is similar to the multiplication AIRs in [SP1 v1](https://hackmd.io/wztOd455QKWf-K8LXh_Fqw?view#mul-MUL-MULH-MULHU-MULHSU). Note that for the `MULW` opcode, we just need the multiplication result `(mod 2^32)`, which is then sign-extended to 64 bits. This is done with a `U16MSBOperation`.

### Handling Addresses

In RISC-V memory operations, the sum of two words (one from a register and one from an immediate) is the memory address relevant to the opcode. To handle this, we have an additional `AddrAddOperation` that does the following.

**AddrAddOperation**
- input: `a: Word`, `b: Word`, `is_real`
- output: `value: [Expr; 3]`
- assumptions
  - `a`, `b` are u64 `Word`s of four valid `u16` limbs.
- constraints
  - `is_real` is boolean. If `is_real` is true, then `cols.value` is three valid `u16` limbs representing `a + b (mod 2^64)`. If `is_real` is false, no constraints.

This is identical to `AddOperation` except that `0` is used in place of `value[3]`.

Meanwhile, our argument for memory has the following properties.

- The Blum offline memory checking argument works for memory addresses that are multiples of `8`, and each memory address holds 8 bytes as a `Word` composed of four valid u16 limbs.
- Addresses `[0, 0x20)` are used for registers, and `[0x20, 2^16)` are considered invalid memory addresses, as they are reserved as a stack guard region.

Therefore, we have an `AddressOperation` which handles all these things.

**AddressOperation**
- input: `b, c: Word`, `offset_bit0, offset_bit1, offset_bit2`, `is_real`
- output: `aligned_addr: [Expr; 3]`
- assumptions
  - `b, c` are u64 `Word`s of four valid `u16` limbs.
- constraints
  - `is_real` is boolean. If `is_real` is true, the following holds.
  - `addr = b + c (mod 2^64)` satisfies `2^16 <= addr < 2^48`
  - `offset_bit0, offset_bit1, offset_bit2` are boolean
  - `addr == offset_bit0 + 2 * offset_bit1 + 4 * offset_bit2 (mod 8)`
  - `aligned_addr = addr - (addr % 8)`, returned as three valid u16 limbs.

Here, `AddrAddOperation` is used to get `addr` while constraining `0 <= addr < 2^48`.
Then, we show `2^16 <= addr` by providing an inverse to `addr[1] + addr[2]`, showing that `addr[1]` and `addr[2]` can't both be zero. Since `addr` consists of three u16 limbs, this sum cannot overflow the prime. Then, the `offset` bits are constrained using the fact that if `x / 8` is constrained to be at most 13 bits, then `x` must be a u16 value that is a multiple of 8. Note that this is true even though the division is a field division.
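This field-division trick is the same one used in `SyscallAddrOperation` above. As a minimal standalone sketch of why it works, the plain-Rust check below (not SP1 code; it assumes the KoalaBear prime $p = 2^{31} - 2^{24} + 1$, though the argument only needs $p > 2^{16}$) verifies that the field elements accepted by the "divide by 8 and range check to 13 bits" condition are exactly the u16 multiples of 8.

```rust=
// Sanity sketch (not SP1 code): multiplication by 8^{-1} is a bijection of the
// field, so the field elements x accepted by the check "x * 8^{-1} mod p < 2^13"
// are exactly { 8 * y mod p : 0 <= y < 2^13 }. We verify that this set is
// precisely the set of u16 values that are multiples of 8.
const P: u64 = (1 << 31) - (1 << 24) + 1; // KoalaBear prime (assumption)

/// Computes b^e mod P by square-and-multiply.
fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1u64;
    while e > 0 {
        if e & 1 == 1 {
            acc = acc * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

fn main() {
    // Fermat inverse of 8 modulo the prime.
    let inv8 = pow_mod(8, P - 2);
    assert_eq!(8 * inv8 % P, 1);

    for y in 0u64..(1 << 13) {
        // The unique field element mapped to y by the division-by-8 check.
        let x = 8 * y % P; // no reduction actually occurs, since 8 * y < P
        assert_eq!(x, 8 * y);
        // Every accepted x is a u16 value that is a multiple of 8 ...
        assert!(x < (1 << 16) && x % 8 == 0);
        // ... and every such x indeed passes the "x / 8 is at most 13 bits" check.
        assert!(x * inv8 % P < (1 << 13));
    }
    println!("accepted field elements are exactly the u16 multiples of 8");
}
```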