VLSI System Design Final Project Report

# VLSI System Design Final Project Report ## System overview ### Instruction Set format #### 1. RV32I Base Integer Instruction Set (32-bit) RV32I Base Integer Instrustions use 32-bits for single instruction. There are 6 types of instruction format in RV32I shown as figure below. ![](https://i.imgur.com/jaTnZu9.png) Also, there are 5 types of immediate encoding format in RV32I shown as figure below. ![](https://i.imgur.com/J0bGvQP.png) #### 2. “F” Standard Extension for Single-Precision Floating-Point There are 4 types of instruction format in “F” Standard Extension shown as figure below. In our CPU architecture, R4-type instructions are not supported. ![](https://i.imgur.com/UReDGVV.png) ### Instruction details: Containing 46 unique instructions. #### RV32I Base Integer Instructions * ##### R-type Instructions ![](https://i.imgur.com/cxr9HCt.png) *Register-Register Operations* R-type operations read the rs1 and rs2 integer registers as source operands and write the result into integer register rd. The funct7 and funct3 fields select the type of operation. | Supported Instructions | Operation | Description | |:----------------------:|:----------------------------------------:|:------------------------:| | ADD rd, rs1, rs2 | rd = rs1 + rs2 | addition | | SUB rd, rs1, rs2 | rd = rs1 - rs2 | substraction | | SLT rd, rs1, rs2 | rd = (signed)(rs1<rs2) ? 32'b1 : 32'b0 | set less than | | SLTU rd, rs1, rs2 | rd = (unsigned)(rs1<rs2) ? 32'b1 : 32'b0 | set less than (unsigned) | | AND rd, rs1, rs2 | rd = rs1 & rs2 | logical and | | OR rd, rs1, rs2 | rd = rs1 \| rs2 | logical or | | XOR rd, rs1, rs2 | rd = rs1 ^ rs2 | logical xor | | SLL rd, rs1, rs2 | rd = rs1 << rs2%32 | shift left logical | | SRL rd, rs1, rs2 | rd = rs1 >> rs2%32 | shift right logical | | SRA rd, rs1, rs2 | rd = rs1 >>> rs2%32 | shift right arithmetic | * ##### I-type Instructions ![](https://i.imgur.com/hH4yeQ5.png) *Register-immediate Operations*. I-type operations read the rs1 integer register and sign-extended 12-bit I-immediate as source operands and write the result into integer register rd. The funct3 fields select the type of operation. | Supported Instructions | Operation | Description | |:----------------------:|:----------------------------------------:|:------------------------:| | ADDI rd, rs1, imm | rd = rs1 + imm | addition | | SLTI rd, rs1, imm | rd = (signed)(rs1<imm) ? 32'b1 : 32'b0 | set less than | | SLTUI rd, rs1, imm | rd = (unsigned)(rs1<imm) ? 32'b1 : 32'b0 | set less than (unsigned) | | ANDI rd, rs1, imm | rd = rs1 & imm | logical and | | ORI rd, rs1, imm | rd = rs1 \| imm | logical or | | XORI rd, rs1, imm | rd = rs1 ^ imm | logical xor | | SLLI rd, rs1, imm | rd = rs1 << imm%32 | shift left logical | | SRLI rd, rs1, imm | rd = rs1 >> imm%32 | shift right logical | | SRAI rd, rs1, imm | rd = rs1 >>> imm%32 | shift right arithmetic | | NOP (ADDI x0, x0, 0) | x0 = x0 + 0 | no operation | *Load Operations* Load instructions copy a value from memory to integer register rd. The effective byte address is obtained by adding integer register rs1 to the sign-extended 12-bit I-immediate offset. The funct3 field selects the width of loaded data. | Supported instructions | Operation | Description | |:----------------------:|:---------------------------------:|:------------------------------:| | LW rd, imm(rs1) | rd = mem[(rs1+imm) : (rs1+imm+3)] | load word | | LH rd, imm(rs1) | rd = mem[(rs1+imm) : (rs1+imm+1)] | load half word (sign-extended) | | LHU rd, imm(rs1) | rd = mem[(rs1+imm) : (rs1+imm+1)] | load half word (unsigned) | | LB rd, imm(rs1) | rd = mem[rs1+imm] | load byte (sign-extended) | | LBU rd, imm(rs1) | rd = mem[rs1+imm] | load byte (unsigned) | *Indirect Jump Instruction JALR (jump and Link Register)* Detail about jump and link instruciton are shown in introduction of J-type instruction. The target address of JALR is obtained by adding the 12-bit signed I-immediate to the integer register rs1, then setting the least-significant bit of the result to zero. | Supported Instructions | Operation | Description | |:----------------------:|:-----------------------:|:----------------------:| | JALR rd, imm(rs1) | pc = rs1+imm, rd = pc+4 | jump and link register | * ##### S-type Instructions ![](https://i.imgur.com/LcBuhdd.png) *Store operation* Store instructions copy the value in integer register rs2 to memory. Similar to Load instructions, the effective byte address is obtained by adding integer register rs1 to the sign-extended 12-bit I-immediate offset. The funct3 field selects the width of stored data. | Supported Instructions | Operation | Description | |:----------------------:|:----------------------------------:|:-------------------------------:| | SW rs1, imm(rs2) | mem[(rs2+imm) : (rs2+imm+3)] = rs1 | store word | | SH rs1, imm(rs2) | mem[(rs2+imm) : (rs2+imm+1)] = rs1 | store half word (sign-extended) | | SB rs1, imm(rs2) | mem[(rs2+imm) : (rs2+imm+1)] = rs1 | store byte (sign-extended) | * ##### B-type Instructions ![](https://i.imgur.com/D17OwQn.png) *Conditional Branches* B-type instructions are similar to S-type instructions but with different way to decode the immediate bits. Branch instructions compare two integer registers to derermine taking branch or not. The 12-bit B-immediate encodes signed offsets in multiples of 2, and is added to the current pc to give the target address. The conditional branch range is ±4 KiB. | Supported Instructions | Operation | Description | |:----------------------:|:-----------------------------------------:|:----------------------------------:| | BEQ rs1, rs2, imm | pc = (rs1==rs2) ? pc+imm : pc+4 | branch equal | | BNE rs1, rs2, imm | pc = (rs1!=rs2) ? pc+imm : pc+4 | branch not equal | | BLT rs1, rs2, imm | pc = (signed)(rs1<rs2) ? pc+imm : pc+4 | branch less than (signed) | | BLTU rs1, rs2, imm | pc = (unsigned)(rs1<rs2) ? pc+imm : pc+4 | branch less than (unsigned) | | BGE rs1, rs2, imm | pc = (signed)(rs1>=rs2) ? pc+imm : pc+4 | branch greater or equal (signed) | | BGEU rs1, rs2, imm | pc = (unsigned)(rs1>=rs2) ? pc+imm : pc+4 | branch greater or equal (unsigned) | * ##### U-type Instructions ![](https://i.imgur.com/Pp2ZxrS.png) *Upper immediate Operations* U-type instructions combine 20-bit U-immediate as the upper 20 bits with 12-bit 0s as the lower 12 bits to build a 32-bit computation operand and write the result into integer register rd. | Supported Instructions | Operation | Description | |:----------------------:|:---------------------:|:--------------------------:| | LUI rd, imm | rd = imm << 12 | load upper immediate | | AUIPC rd, imm | pc = pc + (imm << 12) | add upper immediate to pc | * ##### J-type Instructions ![](https://i.imgur.com/WD1anUy.png) *Jump and Link (JAL) Instruction* JAL instructions is similar to U-type instructions but with different way to decode the immediate bits. The J-immediate encodes asigned offset in multiples of 2 bytes. The 20-bit offset is sign-extended and added to the pc to form the jump target address. Jumps can therefore target a ±1 MiB range. Also JAL stores the address of the instruction following the jump (pc+4) into integer register rd. | Supported Instructions | Operation | Description | |:----------------------:|:----------------------:|:-------------:| | JAL rd, imm | pc = pc+imm, rd = pc+4 | jump and link | *Note : The standard software calling convention uses x1 as the return address register and x5 as an alternate link register. Also, register x0 can be used as the destination if the result is not required.* #### Single-Precision Floationg-Point Instructions * ##### R-type Instructions ![](https://i.imgur.com/Xn3iZCz.png) *Register-Register Operations* R-type operations read the rs1 and rs2 floating-point registers as source operands and write the result into floating-point register rd or integer register depending on operation type. In our FPU, only compare insturcitons and some of the arithmetic operations are supported. The funct5 and fmt field select the type of operation. The funct3 field selects rounding mode. (In our FPU, we don't support alternative rounding mode.) | Supported Instructions | Operation | Description | |:----------------------:|:-------------------------------:|:--------------------------------------------------:| | FADD.s rd, rs1, rs2 | rd = rs1 + rs2 | addition | | FSUB.s rd, rs1, rs2 | rd = rs1 - rs2 | substraction | | FMUL.s rd, rs1, rs2 | rd = rs1 * rs2 | multiplication | | FDIV.s rd, rs1, rs2 | rd = rs1 / rs2 | division | | FEQ.s rd, rs1, rs2 | rd = (rs1==rs2) ? 32'b1 : 32'b0 | set equal to (rd is integer register) | | FLT.s rd, rs1, rs2 | rd = (rs1<rs2) ? 32'b1 : 32'b0 | set less than (rd is integer register) | | FLE.s rd, rs1, rs2 | rd = (rs1<=rs2) ? 32'b1 : 32'b0 | set less than or equal to (rd is integer register) | > We have removed FDIV.S due to long latency, which is not suited in our simple pipeline design, more details in discussion * ##### I-type Instructions ![](https://i.imgur.com/hH4yeQ5.png) *Load Operations* Load instructions copy a value from memory to floating-point register rd. The effective byte address is obtained by adding integer register rs1 to the sign-extended 12-bit I-immediate offset. The funct3 field selects the width of stored data. | Supported Instructions | Operation | Description | |:----------------------:|:---------------------------------:|:---------------------------------------------------------:| | FLW rd, imm(rs1) | rd = mem[(rs1+imm) : (rs1+imm+3)] | load word (rd is floating-point, rs1 is integer register) | * ##### S-type Instructions ![](https://i.imgur.com/nRstxrH.png) *Store operation* Store instructions copy the value in floating-point register rs2 to memory. Similar to Load instructions, the effective byte address is obtained by adding integer register rs1 to the sign-extended 12-bit I-immediate offset. The funct3 field selects the width of stored data. | Supported Instructions | Operation | Description | |:-------------------------:|:---------:|:-----------:| | FSW rs1, imm(rs2) | mem[(rs2+imm) : (rs2+imm+3)] = rs1 | store word (rs1 is floating-point, rs2 is integer register) | ### Memory address for jump and branch As mentioned above, there are 3 type of jump and branch instructions. Memory address for B-type conditional branch are given by adding 12-bit B-immediate encodes signed offsets in multiples of 2 to pc, thus the conditional branch jumping range is ±4 KiB based on pc. Memory address for JAL insturction is given by adding 20-bit J-immediate encodes signed offsets in multiples of 2 to pc, thus the conditional branch jumping range is ±1 MiB based on pc. Memory address for JALR instruction is given by adding 12-bit I-immediate encodes signed offsets in multiples of 2 to rs1. Since rs1 is a 32-bit register, JALR can jump to any place as rs1 setting properly. | Type | Destination | Range | |:------:|:------------------:|:-------------------------------:| | Branch | pc + imme(12-bit) | ±4 KiB based on pc | | JAL | pc + imme(20-bit) | ±1 MiB based on pc | | JAR | rs1 + imme(12-bit) | Any address with appropiate rs1 | ### Hardware Architecture #### Diagram + description Pipelined CPU architecture: 5 Stages ![](https://i.imgur.com/bydi9bC.png) #### Individual component 1. Controller: (1). Control signals a. Fetch: `F_im_w_en[4]` : Enable writing immediate. b. Decode: `D_rs1_data_sel, D_rs2_data_sel` : Choose data from integer register, Floating point register or forwarding. c. Execute: `E_rs1_data_sel, E_rs2_data_sel` : Choose data from **register file** or **two forwarding paths**. `E_jb_op1_sel` : Choose data from **pc** or **MUX of rs1** `E_alu_op1_sel` : Choose data between **rs1** or **PC**. `E_alu_op2_sel` : Choose data between **rs2** or **immediate**. `E_op[5], E_func3[3], E_func7[7]` : Operation for ALU and FPU. `E_res_sel` : Choose data between **ALU** or **FPU**. d. Memory: `M_dm_w_en[5]` : Enable writing data memory(DM) for store instruction SB SH SW. e. Writeback: `W_wb_en` : Enable writing back regfile. `W_rd_index[5]` : The first bit determines RegI or RegF. `W_func3[3]` : Operation for LD filter when there is a load instruction LB LBU LH LHU LW. `W_wb_data_sel` : Choose data from **memory** or **ALU result**. else : `next_pc_sel, stall` : (2). Registers and check if need to stall: a. Check if the register is passed b. compare with the next stage to see if they are overlap. 3. Program counter and PC Adder The register containing the address of the instruction in the program being executed. The address size is the same as the memory size(32 bits). 5. Immediate extension Extend immediate value from instruction to 32 bit wide. 7. Decoder: separate 32-bit instruction into (1). ①opcode(5): Actually the opcode of RISC-V has 7 bits, but we can ignore the constant last two bits `2'b11`. (2). ②`func3[3]` ③`func7[7]` (3). ④`rs1[6]` ⑤`rs2[6]` ⑥`rd[6]`: Because we separated the regfile into regI and regF, the extended one bit determines which regfile to acssess. 9. Jump & Branch unit Calculate target address for jump and branches. 11. Register File A register file can be implemented with a decoder for each read or write port and an array of registers built from D flip-flops. 10. Arithmetic Logic Unit (ALU) Perform integer operations. 12. Data Memory(RAM) External memory file and RAM will not be synthesized. The input `write_en[4]` has 4 modes of writing data: | `write_en[4]` | 1111 | 0011 | 0001 | 0000 | |:-------------:|:----:|:---------:|:----:|:----:| | **Mode** | Word | Half Word | Byte | None | 14. Load filter Because load instructions read different bits from memory, we need to extend them. | `func3[3]` | 000 | 100 | 001 | 101 | 010 | |:----------:|:-------------:|:------------------------:|:------------------:|:----------------------------:|:-------------:| | **Mode** | LB: load byte | LBU: load byte(unsigned) | LH: load half word | LHU:load half word(unsigned) | LW: load word | 16. FPU: Floating Point unit, refer to Special>>Floating Point Arithmetic. ## Supported instructions ### RISC-V Integer Instructions ![](https://i.imgur.com/81t0mv2.png) *R-R* ![](https://i.imgur.com/my2ScBU.png) *R-I* ![](https://i.imgur.com/5NxO7Qm.png) ![](https://i.imgur.com/Anh1T9Q.png) *Upper Immediate* ![](https://i.imgur.com/7QEvWCy.png) *Load* ![](https://i.imgur.com/gB1jO1F.png) *Store* ![](https://i.imgur.com/reVlOZj.png) *Branch* ![](https://i.imgur.com/RZhuPNt.png) *Jump and Link* ![](https://i.imgur.com/WNdcE0O.png) ![](https://i.imgur.com/5Cl4dwD.png) ### F-extension ![](https://i.imgur.com/BC9VcGA.png) ![](https://i.imgur.com/cJ3Iukx.png) ![](https://i.imgur.com/2t2ea4p.png) *F-Load* ![](https://i.imgur.com/ZWbqYc4.png) *F-Store* ![](https://i.imgur.com/RHDxwXl.png) *R-R* ![](https://i.imgur.com/IBzqUAR.png) ![](https://i.imgur.com/EAxw6Sk.png) > Though we have designed floating-point devider for FDIV.S , we removed FDIV.S due to long latency, which is not suited in our simple pipeline design, more details in discussion. ## Verification methods and result analysis ### Method #### 1. Generate Memory Image File for Assembly Programs After complete each assembly code program, we use gnu compiler to compile and generate the memory image file for instruction memory. And then, we use risc-v simulator to test the assembly code and generate memory image of answer data. >refer to compiler [riscv-gnu-toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain) and simulator [Venus RISC-V simulator](https://marketplace.visualstudio.com/items?itemName=hm.riscv-venus) ![](https://i.imgur.com/QhqIcXk.png) #### 2. Testbench For testing, we arrange memory layout for each assembly program as following Table. | memory index | usage | |:--------:| -------- | | 0x0000~ | memory section for assembly porgram | | 0x8000~ | data section for assembly porgram | | 0x9000~ | answer section for computation result | | 0xFFFC | assembly program will write -1 to here to let testbench know program end successfully | ![](https://i.imgur.com/pz3CCrH.png) In testbench, it will load and run assembly code for `testing integer instructions`, `testing signle-precision floationg-point instructions`, `calculating Fibonacci sequence with integer insturctions`, `calculating Fibonacci sequence with floating-point insturctions`, `merge sort program` respectively. The working flow is shown in figure below. ![](https://i.imgur.com/jScohL9.png) ### Result(nWave) #### Instruction correctness ##### 1. Testing Integer instructions In this program, we test the correctness all the risc-v base integer instructions and try some sequence of instructions that will cause hazards to test if our pipeline designs to handle the hazards are working or not. After each trial, we write the computed result back to memory location `h9000` and check with the pre-stored answer in testbench. simulation result screenshot: ![](https://i.imgur.com/BYnLUEL.png) **Waveform :** >We can find the instructions with corresponding memory location in the compile imformation file given by GNU compiler. *add* As the waveform screenshot, instruction at `IM[0x38]` add `t0, t0, t1` successfully write the result of `t0(REG[5]) + t1(REG[6]) = ffff_ffff + ffff_ffff = ffff_fffe` to `t0(REG[5])`. ![](https://i.imgur.com/jMUWiCC.png) *sub* As waveform screenshot, instruction at `IM[0x74]` `sub t0, t0, t1` successfully write the result of `t0(REG[5]) + t1(REG[6]) = 0000_0000 - ffff_ffff = 0000_0001` to `t0(REG[5])`. ![](https://i.imgur.com/Z5QYGF1.png) *sll* As waveform screenshot, instruction at `IM[0xb0]` `sll t0, t0, t1` successfully write the result of `t0(REG[5]) << (t1(REG[6]))%32(dec) = 0000_0001 << 1 = 0000_0002` to `t0(REG[5])`. ![](https://i.imgur.com/chzwRo5.png) *slt* As waveform screenshot, instruction at `IM[0xec]` `slt t0, t0, t1` successfully write the result of `t0(REG[5]) < t1(REG[6]) ? 1'b1 : 1';'b0 = -1 < 1 ? 1'b1 : 1'b0 = 0000_0001` to `t0(REG[5])`. ![](https://i.imgur.com/chzwRo5.png) *sltu* As waveform screenshot, instruction at `IM[0x128]` `sltu t0, t0, t1` successfully write the result of `t0(REG[5]) < t1(REG[6]) ? 1'b1 : 1b0 = ffff_ffff < 1 ? 1'b1 : 1'b0 = 0000_0001` to `t0(REG[5])`. ![](https://i.imgur.com/f1SofoC.png) *xor* As waveform screenshot, instruction at `IM[0x16c]` `xor t0, t0, t1` successfully write the result of `t0(REG[5]) xor t1(REG[6]) = ffff_ffff xor f0f0_f0f0 = 0f0f_0f0f` to `t0(REG[5])`. ![](https://i.imgur.com/aTbaFEW.png) *srl* As waveform screenshot, instruction at `IM[0x1b0]` `srl t0, t0, t1` successfully write the result of `t0(REG[5]) >> (t1(REG[6]))%32(dec) = ffff_ffff >> 4 = 0fff_ffff` to `t0(REG[5])`. ![](https://i.imgur.com/XCiFhQc.png) *sra* As waveform screenshot, instruction at `IM[0x1f4]` `srl t0, t0, t1` successfully write the result of `t0(REG[5]) >>> (t1(REG[6]))%32(dec) = 8765_4321 >>> 4 = f876_5432` to `t0(REG[5])`. ![](https://i.imgur.com/Wip93xY.png) *or* As waveform screenshot, instruction at `IM[0x23c]` `or t0, t0, t1` successfully write the result of `t0(REG[5]) | t1(REG[6]) = 1234_5678 | fedc_ba98 = fefc_fef8` to `t0(REG[5])`. ![](https://i.imgur.com/Wip93xY.png) *and* As waveform screenshot, instruction at `IM[0x280]` `and t0, t0, t1` successfully write the result of `t0(REG[5]) & t1(REG[6]) = 1234_5678 & ffff_ffff = 1234_5678` to `t0(REG[5])`. ![](https://i.imgur.com/Wip93xY.png) *lw* As waveform screenshot, instruction at `IM[0x33c]` `lw t0, 0(t1)` successfully write the data of `DM[t1+0 = 0x8000] = 6666_6666` to `t0(REG[5])`. ![](https://i.imgur.com/GlEbwDl.png) *lb* As waveform screenshot, instruction at `IM[0x338]` `lb t1, -16(s0)` successfully write the signed extended lower 1 byte data of `DM[s0-16 = 0x9028] = cc = ffff_ffcc` to `t1(REG[6])`. ![](https://i.imgur.com/3hcYd0Z.png) *lh* As waveform screenshot, instruction at `IM[0x33c]` `lh t2, -16(s0)` successfully write the signed extended lower half word data of `DM[s0-16 = 0x9028] = cccc = ffff_cccc` to `t2(REG[7])`. ![](https://i.imgur.com/qFpzbLy.png) *lbu* As waveform screenshot, instruction at `IM[0x340]` `lbu t3, -16(s0)` successfully write the unsigned lower 1 byte data of `DM[s0-16 = 0x9028] = cc = 0000_00cc` to `t3(REG[28])`. ![](https://i.imgur.com/hwFGIJf.png) *lhu* As waveform screenshot, instruction at `IM[0x344]` `lhu t2, -16(s0)` successfully write the unsigned lower half word data of `DM[s0-16 = 0x9028] = cccc = 0000_cccc` to `t4(REG[29])`. ![](https://i.imgur.com/FhIcRyN.png) *addi* As waveform screenshot, instruction at `IM[0x360]` `addi t0, t0, -1` successfully add immediate -1 to `t0(REG[18])` ![](https://i.imgur.com/lkWNn4L.png) *slti* As waveform screenshot, instruction at `IM[0x3ac]` `slti, t0, t0, -1` successfully set `r0=-1` less than `immediate -1`, result = 0 ![](https://i.imgur.com/43uAc46.png) *sltiu* As waveform screenshot, instruction at `IM[0x3f8]` `sltiu t0, t0, -1` successfully set `r0=-2` less than `immediate -1`, result = 1 ![](https://i.imgur.com/QDNhC0f.png) *xori* As waveform screenshot, instruction at `IM[0x448]` `sltiu t0, t0, 123` successfully xor `t0=-1` with immediate 123, store result = `ffff_ff84` at `t0`. ![](https://i.imgur.com/zUbWnFh.png) *ori* As waveform screenshot, instruction at `IM[0x494]` `ori t0, 444` successfully or `t0=1`with immediate 444, store result = `1bd` at `t0.` ![](https://i.imgur.com/XWwFQ5I.png) *andi* As waveform screenshot, instruction at `IM[0x4e4]` `andi t0, -333` successfully and `t0=abcdef98` with immediate `fffffeb3`, store result = `abcdee90` ![](https://i.imgur.com/gAkZGnp.png) *slli* As waveform screenshot, instruction at `IM[0x534]` `slli t0, t0, 0x2` successfully shift `t0` left 2 bits. ![](https://i.imgur.com/JhgW6ZZ.png) *srli* As waveform screenshot, instruction at `IM[0x588]` `srli t0, t0, 0x3` successfully shift `t0` right 3 bits. ![](https://i.imgur.com/SmkqPpB.png) *srai* As waveform screenshot, instruction at `IM[0x5dc]` `srai t0, t0, 0x5` arithmetically shift right `t0 = a1b2_c3d4` to `t0 = fd0d_961e` ![](https://i.imgur.com/9UToQYO.png) *jalr* As waveform screenshot, instruction at `IM[0x634]` `jalr t1, t1, 0` jump pc to `t1+0 = 0x64c` and store `0x634+4 = 0x638` to `t1` ![](https://i.imgur.com/AJ2r1e2.png) *sw* As waveform screenshot, instruction at `IM[0x6cc]` `sw t0, -4(s0)` store the data of reg `t0 = 0000_000f` to `DM[s0-4 = 0x9074]`. ![](https://i.imgur.com/BMhoOJe.png) *sb* As waveform screenshot, instruction at`IM[0x6cc]` `sb t5, -8(s0)` store the data of reg `t5 = 1234_5678` to `DM[s0-4 = 0x9084]`. Because `write_en = 0001`, only 1 byte data `0x78` will be stored. ![](https://i.imgur.com/323Wra5.png) *sh* As waveform screenshot, instruction at`IM[0x6f4]` `sh t5, -12(s0)` store the data of reg `t5 = 1234_5678` to `DM[s0-4 = 0x9080]`. Because `write_en = 0011`, only half word data `0x5678` will be stored. ![](https://i.imgur.com/Xinbpkm.png) *beq* `beq t0, t1, 76c` From the screenshot, we can observe that after EX stage resulting in 1, the branch got taken and the previous two instructions fetched got flushed. The latest pc then was updated to 76c. ![](https://i.imgur.com/yJN1vlU.png) *bne* `bne t0, t1, 7cc` From the screenshot, we can observe that after EX stage resulting in 1, the branch got taken and the previous two instructions fetched got flushed. The latest pc then was updated to 7cc. ![](https://i.imgur.com/zEwWioZ.png) *blt* `blt t0, t1, 850` From the screenshot, we can observe that after EX stage resulting in 0, the branch did not get taken, therefore the pipeline remained the same. ![](https://i.imgur.com/qoUzK8g.png) *bge* `bge t0, t1, 8d0` From the screenshot, we can observe that after EX stage resulting in 1, the branch got taken and the previous two instructions fetched got flushed. The latest pc then was updated to 8d0. ![](https://i.imgur.com/csWQww5.png) *bltu* `bltu t0, t1 948` From the screenshot, we can observe that after EX stage resulting in 0, the branch did not get taken, therefore the pipeline remained the same. ![](https://i.imgur.com/AQJosmI.png) *bgeu* `bgeu t0, t1, 9c8` From the screenshot, we can observe that after EX stage resulting in 1, the branch got taken and the previous two instructions fetched got flushed. The latest pc then was updated to 9c8. ![](https://i.imgur.com/BM8sW9Y.png) *auipc* `auipc t1, 0xfffff` From the screenshot, we can observe that the CPU successfully wrote back `0xfffff9c4`, which is 0xfffff000 + current pc back to register t0. ![](https://i.imgur.com/i8onCI9.png) *lui* `lui t1, 0xfffff` From the screenshot, we can observe that the CPU successfully wrote back `0xfffff` back to register t1. ![](https://i.imgur.com/2l72qdo.png) *jal* `jal t1, a14` From the screenshot, we can observe that pc jumped to `a14` and stored the return value back to register t1. ![](https://i.imgur.com/l01N94Q.png) ##### Testing Single-Precision Floating-Point instructions In this program, we test the correctness all the risc-v single-precision floating-point instructions supported in our design and try some sequence of instructions that will cause hazards to test if our pipeline designs to handle the hazards are working or not. After each trial, we write the computed result back to memory location `h9000` and check with the pre-stored answer in testbench. simulation result screenshot: ![](https://i.imgur.com/YV2ILbS.png) ![](https://i.imgur.com/uYA8frs.png) *flw* As waveform screenshot, instruction at `IM[0x38]` successfully load the data of `DM[s0+4 = 0x8004]` = `3fe0_0000(1.75)` to reg `f1`. ![](https://i.imgur.com/dawIxIr.png) *fsw* As waveform screenshot, instruction at`IM[0x6c]` successfully store the data of reg `f1= 3fe0_0000(1.75)` to the `DM[s1+4 = 0x9004]`. ![](https://i.imgur.com/3212Ytl.png) *fadd.s* As waveform screenshot, instruction at `IM[0xa8]` is `fadd.s f12, f1, f1` successfully write the result of`f1 + f1 = 1.75 + 1.75 = 3.5` to reg `f12` ![](https://i.imgur.com/Zu1ECQw.png) *fsub.s* As waveform screenshot, instruction at `IM[0x170]` is `fsub.s f28, f1, f6` successfully write the result of `f1 - f6 = 1.75 - 1.5 = 0.25` to reg`f28` ![](https://i.imgur.com/n1O1ZXD.png) *fmul.s* As waveform screenshot, instruction at `IM[0x17c]` `and f11, f1, f2` successfully write the result of `f1 * f2 = 1.75 * -1.875 = -3.28125` to `f11`. ![](https://i.imgur.com/IDbbDVY.png) *feq.s* contidion true : As waveform screenshot, instruction at `IM[0x200]` `feq t1, f6, f6` successfully write the result of `f6 == f6 ? = 0000_0001` to `t1(REG[6])`. ![](https://i.imgur.com/sV1uCSP.png) contidion false : As waveform screenshot, instruction at `IM[0x218]` `feq t3, f0, t1` successfully write the result of `f0 == f1 ? = 0000_0000` to `t3(REG[28])`. ![](https://i.imgur.com/SNzVUHy.png) *flt.s* contidion true : As waveform screenshot, instruction at `IM[0x230]` `flt t1, f2, f1` successfully write the result of `f2 < f1 ? = -1.87 < 1.75 ? = 0000_0001` to `t1(REG[6])`. ![](https://i.imgur.com/Ihdh1Sq.png) contidion false : As waveform screenshot, instruction at `IM[0x218]` `flt t1, f1, t2` successfully write the result of `f1 < f2 ? = -1.87 < 1.75 ? = 0000_0000` to `t1(REG[6])`. ![](https://i.imgur.com/QgybuHc.png) *fle.s* contidion true : As waveform screenshot, instruction at `IM[0x2a8]` `fle t1, f0, f7` successfully write the result of `f0 <= f7 ? = +0 <= -0 ? = 0000_0001` to `t1(REG[6])`. ![](https://i.imgur.com/jqD1d4J.png) contidion false : As waveform screenshot, instruction at `IM[0x218]` `flt t1, f1, t2` successfully write the result of `f1 < f2 ? = -1.87 < 1.75 ? = 0000_0000` to `t1(REG[6])`. ![](https://i.imgur.com/uDppD57.png) #### Program correctness ##### 1. Fibonacci sequence using Integer arithmetic This program counts up to the twentieth Fibonacci number. We store all the numbers we computed back to memory location `h9000` and check with the pre-stored answer in testbench. simulation result screenshot: ![](https://i.imgur.com/z6UKaYU.png) Integer Fibonacci sequence program screenshot: ![](https://i.imgur.com/0yYEeji.png) ![](https://i.imgur.com/UGoZDUH.png) ##### 2. Fibonacci sequence using floating point arithmetic This program counts up to the fiftieth Fibonacci number, 12,586,269,025 in theory, which exceeds the legal range of a signed integer, and is well suited for demonstrating the power of floating point arithmetics.After computing, it is then stored back to the memory one by one. The final answer we got is a little off the theoretic value because of a different rounding method we implemented in our FPU adder. simulation result screenshot: ![](https://i.imgur.com/q0wazIO.png) Floating-point Fibonacci sequence program screenshot: ![](https://i.imgur.com/IXaXVKD.png) ![](https://i.imgur.com/2nB77Fu.png) ##### 3. Sorting ![](https://i.imgur.com/Is6807X.png) The sorting algorithm we choose is the infamous merge sort. It allows us to show that our machine is capable of doing recursive call functions using a speical register **sp** to record the current stack pointer. The test sorts three different preloaded arrays by calling our hand-coded sorting algorithm and stores the sorted array back to memory location `h9000` and check with the pre-stored answer in testbench. simulation result screenshot: ![](https://i.imgur.com/GHb7tYf.png) ![](https://i.imgur.com/b7nxQhX.png) ![](https://i.imgur.com/9LZWAqd.png) Mergesort program screenshot: ![](https://i.imgur.com/VKEnFs0.png) ![](https://i.imgur.com/oqYcRck.png) ![](https://i.imgur.com/2st7dt5.png) ![](https://i.imgur.com/Ns4sfLE.png) ![](https://i.imgur.com/InKLFmj.png) ![](https://i.imgur.com/m7PnvRk.png) ![](https://i.imgur.com/kVBl6Ye.png) ![](https://i.imgur.com/gUelOU0.png) ![](https://i.imgur.com/3DkgLLU.png) ![](https://i.imgur.com/6sEuOeE.png) ![](https://i.imgur.com/Yx3kSlX.png) ## SuperLint and ICC results ### SuperLint ![](https://i.imgur.com/IqeBNhy.png) ### ICC #### icc branch coverage(without fdiv) : 100% ![](https://i.imgur.com/PuiwsOO.png) #### icc expression coverage(without fdiv) : 100% ![](https://i.imgur.com/kCBEvVS.png) #### icc toggle coverage(without fdiv) > 70% ![](https://i.imgur.com/J1j1Gi6.png) For the sub-modules(FPU_Multipler & FPU_Adder) in FPU_Divider, some branch in the sub-module can never be reached with legal assembly program due to the algorithm we used to compute fdiv. The icc check with fdiv are shown below. branch&expression coverage with fdiv : ![](https://i.imgur.com/pNXvCfC.png) ![](https://i.imgur.com/nmqgUPE.png) toggle coverage > 70% with fdiv : ![](https://i.imgur.com/a6UASCw.png) ## Synthesis result ### Speed(timing max) library setup time = -0.28ns slack = 0.00ns ![](https://i.imgur.com/a81Y3D4.png) ### Speed(timing min) slack = 0.40ns ![](https://i.imgur.com/mc2yxy3.png) ### Area: Total cell area=408967.578605µm^2 ![](https://i.imgur.com/pybcZos.png) ### Power Total Dynamic Power=5.5256mW Cell Leakage Power(Static)=12.7036µW ![](https://i.imgur.com/kO7lbjP.png) ## Layout(screenshot) ### Clock Tree Synthesis and Route #### preCTS: ![](https://i.imgur.com/UyRs75Z.png) #### postCTS: ![](https://i.imgur.com/Hyb4RSo.png) #### postRoute: ![](https://i.imgur.com/8NX42nj.png) ### Layout after routing ![](https://i.imgur.com/ITpkr8D.png) ### Layout versus schematic(LVS) check result ![](https://i.imgur.com/6TPqK7Q.png) ## Pipeline ### Pipeline design Pipelining the execution of instructions allows us to achieve instruction level parallelism. We split the process of execution into multiple stages, allowing us to obtain shorter cycle time for each stage instead of doing the whole process under one cycle. On top of that, we overlap the execution of instructions, leaving the hardware components in each stage working thorugh out the process, or no idle time, which ahcieves 100% utilization. The latency of a single instruction does not change after switching to a pipelined design, but the overall throughput of the machine increases as shown in the following image. ![](https://i.imgur.com/ghQy8dI.png) ### Pipeline Stage distribution We separate the execution of a single instruction into five stages: #### IF: Instruction fetch Update current pc(program counter), and fetch the corresponding instruction from the instruction memory. #### ID: Instruction decode & register file read Decode the instruction into separated fields, including rs1, rs2, rd, immediate, and opcode. After confirming rs1 and rs2, obtain the value from register file at the same stage. #### EX: Execute or address calculation Do arithmetic execution for instruction. #### MEM: Memory access Access memory for store and load instruction. #### WB: Write back Write back results to destination register. | Instruciton Type | EX stage | MEM stage | WB stage | |:----------------:| -------- | --------- | -------- | | R-type I-type U-type | ALU & FPU compute the arithmetic result | X | Write result to REGFILE_I or REGFILE_I | | L-type | ALU calculate the source memory address | Read data from DM[address] | Write result to REGFILE_I or REGFILE_I | | S-type | ALU calculate the destination memory address | Write data to DM[address] | X | | B-type | ALU compare register data | X | Write result to PC | | J-type | ALU calculate destiantion | X | Write destination to PC & write original pc to REGFILE | ### Pipeline hazard #### Control hazard Since the machine keeps feeding the pipeline new instructions every cycle, when a jumping branch is taken wrong instructions will be fetched into the pipeline which requires further modification. Example: ![](https://i.imgur.com/ILhCRr4.png) Assume that branch is taken for `beq`, which implies that `subi` and `addi` will not be executed. However, they have been fed into the pipeline in previous cycles. A simple way to solve control hazard is to flush the wrong instructions being fetched and take two cycles of penalty. A common method to avoid control hazards is to implement a better branch predictor. #### Data hazard Since write back happens at the latest stage, instructions that require the latest value of registers may not be presented in the current register file. For example: ![](https://i.imgur.com/k07Cwti.png) The second `add` instruction using `t1` as a source requires its value at the fourth cycle. However, the latest value of `t1`, which is the result of the first `add` instruction has not yet been written back. A simple way to solve this issue is to stall the pipeline, but there is a more clever solution that burdens no penalty. We forward the result from `MEM` or `WB` stage to the current `EX` stage. #### Hardware hazard Hardware hazard occurs when different instructions uses the same hardware component at the same time, for example: ![](https://i.imgur.com/kXy22cy.png) `IF` of the third `addi` and `lw` both reads data from memory at the forth cycle. Possible solutions including instruction prefetching using an instruciton cache, or a more naive way by stalling the pipeline. ## Specials ### Floating Point Arithmetic Arithmetic operations on floating point numbers consist of addition, subtraction, multiplication and division. #### Addition Example on floating point value given in binary: | Flpt. | S | E | F | | ---- | --- |:--------:|:-----------------------:| | 0.25 | 0 | 01111101 | 00000000000000000000000 | | 100 | 0 | 10000101 | 10010000000000000000000 With the hidden bit: ``` 0 01111101 1.00000000000000000000000 +0 10000101 1.10010000000000000000000 ``` ##### Step 1: Align radix point ``` Shifting the mantissa LEFT by 1 bit DECREASES THE EXPONENT by 1 Shifting the mantissa RIGHT by 1 bit INCREASES THE EXPONENT by 1 ``` Choose to shift the number that has the smaller exponent -> choose to shift 0.25 Shift by: ``` 10000101 - 01111101 = 00001000 (8) ``` With the hidden bit, for clarity: `0 01111101 1.00000000000000000000000` (original value) The result after shifting right 8 places is: `0 10000101 0.00000001000000000000000` ##### Step 2: Add Add two mantissa 1.10010000000000000000000 + 0.00000001000000000000000 = 1.10010001000000000000000 ##### Step 3: Normalize the result It already normalize This is the result stored (not the hidden bit) : `0 10000101 10010001000000000000000` Assume that the result of an addition aligned matissas is: `10.11110000000000000000000` and the exponent go to with this is `10000000` We must put the mantissa back in the normalized form. Shift the mantissa to the right by one place, and increase the exponent by 1. The exponent and mantissa become `10000001 1.01111000000000000000000` #### Subtraction Example: | Flpt. | S | E | F | |:--------:| --- |:--------:|:-----------------------:| | 6.265625 | 0 | 10000001 | 10010001000000000000000 | | 3.75 | 0 | 10000000 | 11100000000000000000000 | With the hidden bit ``` 0 10000001 1.10010001000000000000000 - 0 10000000 1.11100000000000000000000 ``` ##### Step 1: Align Choose to shift the number with the smaller exponent. `0 10000000 1.11100000000000000000000` becomes `0 10000001 0.11110000000000000000000` ##### Step 2: Subtract two mantissa 1.10010001000000000000000 - 0.11110000000000000000000 = 0.10100001000000000000000 ##### Step 3: Normalize the result `0 10000001 0.10100001000000000000000` Shifting left the mantissa by 1 place, and the exponent subtracts for 1: `0 10000000 1.01000010000000000000000` Below is the result stored (not the hidden bit): `0 10000000 01000010000000000000000` #### Multiplication Example in binary: Use a mantissa that is only 4 bits ``` 0 10000100 0100 x 1 00111100 1100 ``` ##### Step 1: Mantissa multiplication (don't forget hidden bit): ``` 1.0100 x 1.1100 --------- 00000 00000 10100 10100 10100 ------------- 10.00110000 ``` ##### Step 2: Add exponent Add two exponents : ``` 10000100 + 00111100 = 11000000 ``` And subtract the result for 127: ``` 11000000 - 01111111(127) = 01000001 ``` ##### Step 3: Add the sign bit `1^0=1` Put the result together ` 1 01000001 10.00110000` ##### Step 4: Normalize the result `1 01000001 10.00110000` becomes `1 01000010 1.000110000` Below is the result stored: `1 01000010 000110000` #### Division Division by reciprocal approximation: Instead of doing a / b we do a * 1/b Example of a result that isn't the same as with true division: True division: `3/3 = 1 (exactly)` Reciprocal approx:` 1/3 = .33333333` `3 x .33333333 = .99999999, not 1` It is not always possible to get a perfectly accurate reciprocal. ##### Division Algorithm Uses Newton Raphson Iterations to find the reciprocal of the Divisor and then Multiplies the Reciprocal with the Dividend. Uses 8 multiplication instances and 3 addition instances: Division Algorithm : A/B Intial Seed : x0 = 48/17 - (32/17)*D where D - Divisor B adjusted to fit 0.5-1 range by replacing the exponent field with 8'd126 Newton Raphson Iterations : x1 = x0*(2-D*x0) x2 = x1*(2-D*x1) x3 = x2*(2-D*x2) x3 - Reciprocal of Adusted value D. Adjust the exponents to produce the final reciprocal of B 1/B : {B[31],x3[30:23]+8'd126-B[30:23],x3[22:0]} Final Value A*1/B #### Operations have undefined results ``` 1/0 = infinity any positive value / 0 = positive infinity any negative value / 0 = negative infinity infinity * x = infinity 1/infinity = 0 0/0 = NaN 0 * infinity = NaN infinity * infinity = NaN infinity - infinity = NaN infinity/infinity = NaN x + NaN = NaN NaN + x = NaN x - NaN = NaN NaN - x = NaN sqrt(negative value) = NaN NaN * x = NaN NaN * 0 = NaN 1/NaN = NaN ``` ## Discussion In this 5-stage pipeline CPU project, we designed the overall structure beforehand without considering too much about the critical path of every stage. We have assumed that memory access has no delay, which means that any write or read completes instantly, therefore `IF` and `MEM` won't be the bottleneck of us increasing the clock frequency. We finished the first version, without floating point arithmetic, of the CPU with around 5 to 10 nanosecond clock period. Soon after, we agreed to add floating point arithmetic, the risc-v F extension, into out CPU on top of the current design. What we did not realize is that floating point operations have longer latency than we expected. Due to limited time, we could not but add the FPU(floating point unit) on top of our current architecture, which leads to imbalanced pipeline stages, the worst nightmare of a pipeline machine. `EX` take so long that we have to increase our clock period to around 30 nanaseconds in order to alleviate timing constraints. A better approach is to overhaul the design of the architecture and make long latency instructions execute in mutliple cycles. Another major hinder in this group work is to synthesize the correct circuit for our CPU. A single misuse of register or wire in our verilog code will lead to an unexpected latch, which sabotaged the ***should be*** short latency operation. If we had put more attention on our coding style, we could have more spare time designing and planning the improved architecture instead of fixing broken code for two straight weeks. ## Review ### 林志懋 VSD的Final project 和其他課程不同的是，要從零開始自行設計CPU架構，如果架構設計錯誤，CPU效能就會非常低落，或是合成出完全不合理的硬體。經過這次合作，更加了解數位電路硬體設計的考量重點，收穫非常多。 ### 蔡奇佑在這次的Project中，我們必須協作進行CPU架構的設計。以往的Project都是以個人為單位進行，很多時候在事前不需要太嚴謹的定義就可以開始設計。然而在團隊協作的過程中，我們往往需要先溝通並達成共識才能開始後續的工作。這次的Project讓我收穫了十分寶貴的合作經驗! ### 曾藌瑢這次的final project是團體作業，事前需要花比較多的時間進行討論，雖然過程中有因為溝通不清楚的造成一些問題，但也因為遇到問題時有更多人一起思考解決方案，所以在debug上也變得更有效率。 ### 范俊義 I was responsible for designing the floating point unit in this project, but my code was not perfect for synthesis. With the cooperation of the team members, it was finally able to be synthesizable. To complete this project, there were days when we had to work continuously without having time to have lunch. I have learned a lot from my team members, thanks. ### 戴仁杰 ## Work | 姓名 | 學號 | 負責項目 | 貢獻度 | |:---:|:---:| --- |:---:| | 林志懋 | E24096043 | cpu architecture rtl, assembly, testbench | 20% | | 蔡奇佑 | E24096213 | assembly programs, icc check, testbench architecture | 20% | | 曾藌瑢 | E24096742 | fpu compare unit, fpu testbench | 20% | | 范俊義 | E24097015 | Floating point unit | 20% | | 戴仁杰 | E24094253 | ... | 20% | ## Reference 1. Risc-V Instruction set Manual : [https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf](https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) 2. riscv-gnu-toolchain compiler : [https://github.com/riscv-collab/riscv-gnu-toolchain](https://github.com/riscv-collab/riscv-gnu-toolchain) 3. Venus RISC-V simulator : [https://marketplace.visualstudio.com/items?itemName=hm.riscv-venus](https://marketplace.visualstudio.com/items?itemName=hm.riscv-venus) 4. Floating Point Representation and Arithmetic: [https://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/flpt.apprec.html](https://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/flpt.apprec.html) 5. Computer Organization and Design RISC-V Edition: The Hardware Software Interface by David A. Patterson and John L. Hennessy