Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [WeiCheng14159](https://github.com/WeiCheng14159) > ###### tags: `computer architure 2020` ## Environment Setup ### Install riscv-gnu-toolchain riscv-gnu-toolchain compile program for RISCV ```bash $ git clone https://github.com/riscv/riscv-gnu-toolchain $ cd riscv-gnu-toolchain $ git submodule update --init --recursive ``` Follow the instructions on Github [README](https://github.com/riscv/riscv-gnu-toolchain/blob/master/README.md) page and configure the toolchain with RISCV 32I, which stands for I(Integer Instruction Set) ```bash ./configure --prefix=/opt/riscv --with-arch=rv32i make linux [sudo] make install ``` Remember to add the installed toolchain to PATH ```bash export PATH=$PATH:/opt/riscv/bin ``` ### Install riscv-isa-sim Spike, the RISC-V ISA Simulator ```bash sudo apt-get install device-tree-compiler git clone git@github.com:riscv/riscv-isa-sim.git cd riscv-isa-sim mkdir build cd build ../configure --prefix=/opt/riscv make [sudo] make install ``` ## Count Leading Zero Problem Definition: Given a 32 bit unsigned integer, this function `clz` counts the number of leading zeros in that integer. The C code implementation is as follow. ```cpp unsigned int clz(unsigned int i) { unsigned int one = 0x80000000; unsigned int res = 0; for (int cnt = 0; cnt < 32; cnt++) { if ((i & one) == 0) res++; else return res; one = one >> 1; } return res; } ``` The complete source code that prints out the results can be found [here](https://github.com/supernovaremnant/ncku_ca_hw1/blob/master/clz.c) ## Count Leading Zero Assembly Instead of compiling the C code directly with the compiler, manually written assembly code is shown as follow. However, you can still checkout the compiled assembly by running the following command ```bash riscv32-unknown-elf-gcc -O0 -S clz.c -o clz.asm ``` The assembly code for counting leading zeros is as followed ``` .data input: .word 0x0000000f one: .word 0x80000000 str1: .string "clz value of " str2: .string " is " .text main: lw a0, input # Load input from static data jal ra, clz # Jump-and-link to the 'clz' label # Print the result to console mv a1, a0 lw a0, input jal ra, printResult # Exit program li a7, 10 ecall clz: # t0 = one # t1 = cnt = 32 # t2 = res # a0 = i lw t0, one li t1, 32 li t2, 0 _beg: bne t1, zero, cnt _ret: mv a0, t2 ret cnt: addi t1,t1,-1 and t3, a0, t0 # i & one bne t3, zero, _ret addi t2, t2, 1 srli t0, t0, 1 j _beg # --- printResult --- # a0: input # a1: result printResult: mv t0, a0 mv t1, a1 la a0, str1 li a7, 4 ecall mv a0, t0 li a7, 1 ecall la a0, str2 li a7, 4 ecall mv a0, t1 li a7, 1 ecall ret ``` ### Assembly Explained There're 3 sections this program namely `main`, `clz`, `cnt`, `_ret` * `main` Load the input value to the `a0` argument register * `clz` Load the initial value for variables: * `t0` register is mapped to the `one` variable, and is initialized to `0x80000000`. * `t1` register is mapped to the `cnt` variable, and is initialized to `0x00000020` because we have 32 bits to count. * `t2` register is mapped to the `res` variable, which counts the number of zeros we known so far, and is initialized to `0x00000000`. * `a0` register stores the input value that is mappped to the `i` variable. * `cnt` Actually count the number of leading zeros in register `a0`. * `t3` register stores the bit-wise AND result of `a0` and `t0` * In case `t3` equals to zero, the register counting the number of leading zeros should be incremented by 1, which is done by `addi t2, t2, 1` * In case `t3` doesn't equal to zero, this function should return the number of zeros and jump to the return address, which is done by `bne t3, zero, _ret` * `_ret` * As mentioned in the previous section, `t3` stores the return value of the function; however, as stated in the [RISC-V Assembly Programmer's Manual](https://github.com/riscv/riscv-asm-manual/blob/master/riscv-asm.md), the `a0` and `a1` registers are used to store return values by convention. * Thus, `mv a0, t2` instruction move the return value from `t2` to `a0`, which overwrites the value of `a0` ## RISCV ISA ### 5-stage pipeline #### Instruction Fetch (IF) PC register stores the address of next instruction. Normally, the address of the next instruction will be PC+4 since each instruction is 32 bits (4 bytes) in RV32. Otherwise, the next instruction will be determined by the branch/jump instruciton, and the value of PC will be the computed destination address. The value of PC is loaded into the instruction memory as address and the value (instructions) will the passed down through the pipeline. #### Instruction Decode (ID) Depends on different types of instructions, bits can be interpreted differently. Thus, instructions need to be decoded in this stage. Furthermore, register-file, which holds 32 32 bits general purpose registers, is read in this stage. #### Execution (EX) After ID stage, CPU is fully aware of the instruction type, and thus, do further computation is EX stage. Branch target computation, arithmetic computation, and load/store memory address are computed in this stage. #### Memroy (MEM) RISCV is a load-store architecture, which means only lw/sw instructions will access memory. In this stage, address computed from previous stage will be supplied to the memory and control signal from control unit will determine data should be read/write to the memory. #### Write Back (WB) Write back stage write the value back to register file. The value could come from memory or ALU. ### Types of instructions In order to explain how instructions are executed in pipelined RISCV CPU, common types of instructions are listed here. #### R-type encoding ![](https://i.imgur.com/7dR3TXo.png) #### I-type encoding ![](https://i.imgur.com/kutZJTk.png) ![](https://i.imgur.com/c5hv3pD.png) ![](https://i.imgur.com/vhqnXyt.png) ![](https://i.imgur.com/hLouNj1.png) #### J-type encoding ![](https://i.imgur.com/4D2zK9y.png) #### U-type encoding ![](https://i.imgur.com/v1RW2zq.png) #### B-type encoding ![](https://i.imgur.com/MwyxfzS.png) #### S-type encoding ![](https://i.imgur.com/9V4mtv7.png) ### Single-cycle In order to better understand the behavior of pipeline RISCV processor, we must understand how it behaves if it's single cycle. In the following section, instructions in the assembly code are visualized in simulator to better demonstrate the control logic. A single-cycle MIPS processor (from the Computer Organization and Desgin Textbook) is shown here since `mux` in the RISCV simulator isn't labelled and the control signal is missing. MIPS and RISCV are nearly identical. ![](https://i.imgur.com/z3Ucwy8.png) #### R-type encoding * `and x28 x10 x5` ![](https://i.imgur.com/PqCE492.png) * The next PC will be `PC+4` * `Op1` of ALU is set to `Reg1` * `Op2` of ALU is set to `Reg2` * This isn't a branch instruction `branch taken` set to 0 * Nothing will be written to memory so `WrEn` is set to 0 * The computed result from ALU will be written back to register file so the last mux is set accordingly * `WrEn` of register file is set to `1` #### I-type encoding * `lw x10 0(x10)` ![](https://i.imgur.com/bJUBnSH.png) * The next PC will be `PC+4` * `Op1` of ALU is set to `Reg1` * `Op2` of ALU is set to the `imm` field * This isn't a branch instruction `branch taken` set to 0 * Nothing will be written to memory so `WrEn` is set to 0 * The data read from memory will be written back to register file so the last mux is set to `data out` * `WrEn` of register file is set to `1` * `addi x6 x0 32` ![](https://i.imgur.com/X03zfmz.png) * The next PC will be `PC+4` * `Op1` of ALU is set to `Reg1` * `Op2` of ALU is set to the `imm` field * This isn't a branch instruction `branch taken` set to 0 * Nothing will be written to memory so `WrEn` is set to 0 * The computed result from ALU will be written back to register file so the last mux is set accordingly * `WrEn` of register file is set to `1` #### J-type encoding * `jal x1 0x24 <clz>` ![](https://i.imgur.com/YT9IrUf.png) * The next PC will be `PC+0x1C` = `0x08+0x1C` = `0x24` * `Op1` of ALU is set to `PC` * `Op2` of ALU is set to the `imm` field * This isn't a branch instruction `branch taken` set to 0 * Nothing will be written to memory so `WrEn` is set to 0 * The next address `PC+4` will be written to the `$ra` register. * The last mux is set to `PC+4`, and `WrEn` of register file is set to `1` #### U-type encoding * `auipc x5 0x10000` ![](https://i.imgur.com/UqUIdcm.png) * The next PC will be `PC+4` * `Op1` of ALU is set to `PC` * `Op2` of ALU is set to the `imm` field * This isn't a branch instruction `branch taken` set to 0 * Nothing will be written to memory so `WrEn` is set to 0 * The computed result from ALU will be written back to `x5=PC+0x10000000` so the last mux is set accordingly * `WrEn` of register file is set to `1` #### B-type encoding * `bne x6 x0 12 <cnt>` ![](https://i.imgur.com/flzRhEU.png) * The next PC will be computed by ALU `PC+12` which is `<cnt>` * `Op1` of ALU is set to `PC` * `Op2` of ALU is set to the `imm` field * This is a branch instruction `branch taken` set to `1`, the next PC will be `PC+4` if `x6==x0` * Nothing will be written to memory so `WrEn` is set to `0` * Nothing will be written to register file so `WrEn` is set to `0` #### S-type encoding No `sw` instruction is used in my assembly program but it's similar to `lw` instruction. ### Multicycle Pipeline Hazard * **Structural hazard** * It means that the hardware cannot support the combination of instructions that we want to execute in the same clock cycle. * **Data Hazards** * When a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available. * There're 3 types of data hazard: **RAW, WAW, WAR**. In our five-stage pipeline RISCV CPU, only **RAW** could occur. * The primary solution is based on the observation that we don’t need to wait for the instruction to complete before trying to resolve the data hazard. * Adding extra hardware to retrieve the missing item early from the internal resources is called **forwarding** or **bypassing**. However, forwarding and bypassing **CANNOT** solve ALL RAW hazards such as the **load-use data hazard** * **Control Hazards** * Also called branch hazard. When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected. ### Forwarding & Bypassing Consider the first few lines of assembly code from the count from zero example above. ```c=1 lw a0, input # Load input from static data jal ra, clz # Jump-and-link to the 'clz' label ... ``` The pseudo instruction `lw a0, input` is compiled into the following "real" instructions ```c=1 auipc x10 0x10000 lw x10 0(x10) jal x1 0x28 <clz> ... ``` In this section, the execution process of a 5-stage pipeline RISCV processor with/without forwarding & bypassing units will be shown. * 5-stage pipeline **WITH** forwarding & bypassing units ![](https://i.imgur.com/KRKjMcp.png) * `auipc x10 0x10000` instruction writes to `x10` register, and it's used by the next instruction `lw x10 0(x10)` * The value of `x10` is computed by the end of EX stage * The latest value of `x10` is fed into the `lw` instruction when `auipc` enters MEM stage and `lw` enters EX stage. * The pipeline doesn't have to stall because of the forwarding and bypassing unit. * The value of `x10` register right after the execution of `lw x10 0(x10)` is `0x0000000f` as shown in the table below. * ![](https://i.imgur.com/WoCQlm4.png) * 5-stage pipeline **WITHOUT** forwarding & bypassing units & hazard detection unit * `auipc x10 0x10000` instruction writes to `x10` register, and it's used by the next instruction `lw x10 0(x10)` * The correct value of `x10` will be written to register file at the WB stage * ![](https://i.imgur.com/IjiXkqV.png) * Without any forwarding unit and hazard detection unit, the processor itself have no idea that the value of `x10` read by the `lw x10 0(x10)` instruction is **wrong** (outdated). * ![](https://i.imgur.com/mwoT8I1.png) * Thus, the processor loads the wrong data from the wrong memory address. As shown in the following diagram, the value of `x10` is `0x10000517` after the execution of `lw x10 0(x10)`. Which is different from the correct value `0x0000000f`. * ![](https://i.imgur.com/gjeb2uk.png) * The program later will eventually crash due to the incorrect memory reference. Instead of inserting NOP into the pipeline to preserve the correctness of the program, my hand-written assembly code fail to complete because pseudo instructions are compiled into assembly without considering the underlying architecture. * ==Further discussion==: The built-in RISCV compiler of Ripes fail to compile pseudo instruction into correct real instruction while switching between preocessors with/without forwarding unit. * Instead of trusting the compiler to generate the correct "real" instructions, I manually break down the pseudo instructions as follow. * ```bash # lw a0, input # Load input from static data auipc x10, 0x10000 addi x0, x0, 0 addi x0, x0, 0 lw x10 0(x10) jal ra, clz # Jump-and-link to the 'clz' label ... ``` * Notice the `lw a0, input` pseudo instruction is broken down into `auipc x10, 0x10000` and `lw x10 0(x10)` with two `NOP` in between. * In that case, the assembly code could produce the correct result even without a forwarding unit at the cost of performance deterioration.