# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by <`liballouo`> ## Problem `C` in [Quiz 1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) We test our code using [Ripes](https://ripes.me/) simulator. ### Version 1 #### C code ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` #### RISC-V assembly code You can see the entire code [here](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Counting_Leading_Zero_v1.s). ```c # Input: uint32_t a0 # Output: uint32_t a0 my_clz: li t0, 0 # count = 0 li t1, 31 # t1 = 31 clz_loop: li t2, 1 sll t2, t2, t1 # (1U << i) and t3, a0, t2 # x & (1U << i) bnez t3, clz_done # if (x & (1U << i)) break loop addi t0, t0, 1 # count = count + 1 addi t1, t1, -1 # i = i - 1 bgez t1, clz_loop # if (i>=0) continue loop clz_done: addi a0, t0, 0 # set result to a0 ret ``` #### execution info | Info | Value | | --------------- | ----- | | Cycle | 443 | | Instrs. retired | 322 | | CPI | 1.38 | | IPC | 0.727 | | Clock rate | 0 Hz | ### Version 2 Counting leading zero utilizes population count. #### C code ```c static inline int my_clz(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); x = (x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } ``` #### RISC-V assembly code You can see the entire code [here](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Counting_Leading_Zero_v2.s). ```c # Input: uint32_t a0 # Output: uint32_t a0 my_clz: srli t0, a0, 1 # x >> 1 or a0, a0, t0 # x |= (x >> 1) srli t0, a0, 2 # x >> 2 or a0, a0, t0 # x |= (x >> 2) srli t0, a0, 4 # x >> 4 or a0, a0, t0 # x |= (x >> 4) srli t0, a0, 8 # x >> 8 or a0, a0, t0 # x |= (x >> 8) srli t0, a0, 16 # x >> 16 or a0, a0, t0 # x |= (x >> 16) popcount: lui t1, 0x55555 ori t1, t1, 0x555 # 0x55555555 (t1) srli t0, a0, 1 # x >> 1 and t2, t0, t1 # (x >> 1) & 0x55555555 sub a0, a0, t2 # x -= ((x >> 1) & 0x55555555) lui t1, 0x33333 ori t1, t1, 0x333 # 0x33333333 (t1) srli t0, a0, 2 # x >> 2 and t2, t0, t1 # (x >> 2) & 0x33333333 and t3, a0, t1 # x & 0x33333333 add a0, t2, t3 # ((x >> 2) & 0x33333333) + (x & 0x33333333) lui t1, 0x0f0f0 ori t1, t1, 0x70f addi t1, t1, 0x800 # 0x0f0f0f0f (t1) srli t0, a0, 4 # x >> 4 add t2, a0, t0 # (x >> 4) + x and a0, t2, t1 # ((x >> 4) + x) & 0x0f0f0f0f srli t0, a0, 8 # x >> 8 add a0, a0, t0 # x += (x >> 8) srli t0, a0, 16 # x >> 16 add a0, a0, t0 # x += (x >> 16) andi t0, a0, 0x3f # x & 0x3f li t1, 32 sub a0, t1, t0 # 32 - (x & 0x3f) ret ``` #### execution info | Info | Value | | --------------- | ----- | | Cycle | 194 | | Instrs. retired | 151 | | CPI | 1.28 | | IPC | 0.778 | | Clock rate | 0 Hz | ### test cases #### test case 1 > Input: 0x3 > Output: 30 #### test case 2 > Input: 0xFFFFFFFF > Output: 0 #### test case 3 > Input: 0x3321675 > Output: 6 ## Biweekly Contest 141: Q1. Construct the Minimum Bitwise Array II [Q1. Construct the Minimum Bitwise Array II](https://leetcode.com/contest/biweekly-contest-141/problems/construct-the-minimum-bitwise-array-ii/description/) ### Description You are given an array `nums` consisting of `n` prime integers. You need to construct an array `ans` of length `n`, such that, for each index `i`, the bitwise `OR` of `ans[i]` and `ans[i] + 1` is equal to `nums[i]`, i.e. `ans[i] OR (ans[i] + 1) == nums[i]`. Additionally, you must minimize each value of `ans[i]` in the resulting array. If it is not possible to find such a value for `ans[i]` that satisfies the condition, then set `ans[i] = -1`. A prime number is a natural number greater than 1 with only two factors, 1 and itself. ### Constraints - `1 <= nums.length <= 100` - `2 <= nums[i] <= 10^9` - `nums[i]` is a prime number. ### Solution There are three cases to consider: 1. **Case1**: If `nums[i] = 2`, then `ans[i] = -1`. 2. **Case 2**: If `nums[i]` consists of all 1s in binary representation, then `ans[i] = nums[i] >> 1`. 3. **Case 3**: For other numbers, find the rightmost 0 bit and change the 1 bit immediately to its right to 0. Here's an example for **Case 3**: ``` input num = 19 (in decimal, 10011 in binary) expect output = 17 1. find the rightmost 0 : 10011 ^ index i = 2 2. set i-1 to 0 : 10011 -> 10001 ^ ^ answer = 10001 (binary) = 17 (decimal) 3. check the answer : 10001 ^ 10010 = 10011 = 19 (decimal) -> checked! ``` We implement it in C. You can see the entire code [here](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Construct_the_Minimum_Bitewise_Array_I.c). To handle different cases, the solution involves multiple functions: #### `minBitwiseNum` Function This function determines the appropriate value for `ans` based on the input `num`. ```c uint32_t minBitwiseNum(uint32_t num) { // Input: num // Output: ans uint32_t ans; // case: 2 if(num == 2){ ans = -1; } // case: all 1s (2^n-1) else if(ilog2(num) + 1 == popcount(num)){ ans = num >> 1; } // case: others else{ ans = helper(num); } return ans; } ``` - **Case 1**: This handles the special case where `num` equals 2, for which there is no valid solution for `ans`, so the function returns -1. - **Case 2**: This checks if `num` has all 1s in its binary representation (e.g., 3, 7, 15). The condition uses the `ilog2 `function to get the number of bits required and `popcount` to check if all bits are set. - **Case 3**: When the number doesn't fall under the previous two cases, the `helper` function finds the correct value. #### `ilog2` Function This function calculates the integer logarithm base 2 of a number using the `my_clz` function to count leading zeros. ```c int ilog2(uint32_t num){ return 31 - my_clz(num); } ``` #### `my_clz` Function The `my_clz` function counts the number of leading zeros in a 32-bit unsigned integer. It uses bitwise operations to set all bits to 1 to the right of the highest set bit and then uses `popcount` to calculate the number of 1s. Last, it returns the value of leading zero. ```c int my_clz(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x = popcount(x); return (32 - (x & 0x3f)); } ``` #### `popcount` Function The `popcount` function counts the number of 1s in the binary representation of a number. It uses bit manipulation techniques to sum the bits in groups, eventually reducing to a count of all 1s. ```c int popcount(uint32_t x){ /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return x; } ``` #### `helper` Function The `helper` function is used to find the minimum possible value for `ans` when the number does not fit into the special cases. It looks for the rightmost 0 bit and clears the bit immediately to its left. ```c uint32_t helper(uint32_t num) { for(int i=1; i<32; i++){ // Find the rightmost 0 if(!((1 << i) & num)){ // Change the 1 on the right of the rightmost 0 to 0 num ^= (1 << (i - 1)); break; } } return num; } ``` After finishing the C code, we translate it from C code to a RISC-V assembly program. You can see the entire code [here](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Construct_the_Minimum_Bitwise_Array_I.s). #### RISC-V assembly code ```c # Input: uint32_t a0 # Output: uint32_t a0 # t6: flag, 0 for ilog ; 1 for popcount min_bitwise_num: addi sp, sp, -4 sw ra, 0(sp) li t0, 2 beq a0, t0, case_2 # case: num == 2 addi a1, a0, 0 li t6, 0 # flag for ilog / popcount jal ra, ilog2 # do ilog2(num) and save the result in a2 addi a0, a1, 0 addi a2, a2, 1 # ilog2(num) + 1 li t6, 1 # flag for ilog / popcount jal ra, popcount # do popcount(num) and save the result in a3 addi a0, a1, 0 beq a2, a3, case_all_1s # case: all 1s jal ra, helper # case: others case_2: li a0, -1 # no possible value lw ra, 0(sp) addi sp, sp, 4 ret my_clz: srli t0, a0, 1 # x >> 1 or a0, a0, t0 # x |= (x >> 1) srli t0, a0, 2 # x >> 2 or a0, a0, t0 # x |= (x >> 2) srli t0, a0, 4 # x >> 4 or a0, a0, t0 # x |= (x >> 4) srli t0, a0, 8 # x >> 8 or a0, a0, t0 # x |= (x >> 8) srli t0, a0, 16 # x >> 16 or a0, a0, t0 # x |= (x >> 16) popcount: lui t1, 0x55555 ori t1, t1, 0x555 # 0x55555555 (t1) srli t0, a0, 1 # x >> 1 and t2, t0, t1 # (x >> 1) & 0x55555555 sub a0, a0, t2 # x -= ((x >> 1) & 0x55555555) lui t1, 0x33333 ori t1, t1, 0x333 # 0x33333333 (t1) srli t0, a0, 2 # x >> 2 and t2, t0, t1 # (x >> 2) & 0x33333333 and t3, a0, t1 # x & 0x33333333 add a0, t2, t3 # ((x >> 2) & 0x33333333) + (x & 0x33333333) lui t1, 0x0f0f0 ori t1, t1, 0x70f addi t1, t1, 0x800 # 0x0f0f0f0f (t1) srli t0, a0, 4 # x >> 4 add t2, a0, t0 # (x >> 4) + x and a0, t2, t1 # ((x >> 4) + x) & 0x0f0f0f0f srli t0, a0, 8 # x >> 8 add a0, a0, t0 # x += (x >> 8) srli t0, a0, 16 # x >> 16 add a0, a0, t0 # x += (x >> 16) beqz t6, clz_done addi a3, a0, 0 # save the result in a3 ret clz_done: andi t0, a0, 0x3f # x & 0x3f li t1, 32 sub a0, t1, t0 # 32 - (x & 0x3f) ret ilog2: addi sp, sp, -4 sw ra, 0(sp) jal ra my_clz # my_clz(num) and save the result in a0 li t2, 31 sub a2, t2, a0 # 31 - my_clz(num) lw ra, 0(sp) addi sp, sp, 4 ret case_all_1s: srli a0, a0, 1 # ans = num >> 1 lw ra, 0(sp) addi sp, sp, 4 ret helper: li t0, 32 # upper bound t0 = 32 li t1, 1 # loop start from 1, i(t1) helper_loop: li t2, 1 # t2 = 1 sll t2, t2, t1 # 1 << i and t3, a0, t2 # (1 << i) & num beqz t3, helper_done # correctly find the rightmost 0 addi t1, t1, 1 # i += 1 bne t0, t1, helper_loop # do not find the rightmost 0 yet, continue the loop helper_done: srli t2, t2, 1 # 1 << (i - 1) xor a0, a0, t2 # num ^= (1 << (i - 1)) lw ra, 0(sp) addi sp, sp, 4 ret ``` ### Result #### test cases ##### test case 1 > Input: 2 > Output: -1 ##### test case 2 > Input: 3 > Output: 1 ##### test case 3 > Input: 5 > Output: 4 ##### test case 4 > Input: 7 > Output: 3 ##### test case 5 > Input: 31 > Output: 15 ##### test case 6 > Input: 307 > Output: 305 ##### test case 7 > Input: 383 > Output: 319 ##### test case 8 > Input: 5039 > Output: 5031 #### execution info ##### C code | Info | Value | | --------------- | ----- | | Cycle | 6522 | | Instrs. retired | 5066 | | CPI | 1.26 | | IPC | 0.777 | | Clock rate | 0 Hz | ##### RISC-V assembly code | Info | Value | | --------------- | ----- | | Cycle | 1056 | | Instrs. retired | 806 | | CPI | 1.31 | | IPC | 0.763 | | Clock rate | 0 Hz | ### Analysis #### pseudo instruction The pseudo instruction below is `popcount` part of the [code](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Construct_the_Minimum_Bitwise_Array_I.s) run on [Ripes](https://ripes.me/). ```c 000000cc <popcount>: cc: 55555337 lui x6 0x55555 d0: 55536313 ori x6 x6 1365 d4: 00155293 srli x5 x10 1 d8: 0062f3b3 and x7 x5 x6 dc: 40750533 sub x10 x10 x7 e0: 33333337 lui x6 0x33333 e4: 33336313 ori x6 x6 819 e8: 00255293 srli x5 x10 2 ec: 0062f3b3 and x7 x5 x6 f0: 00657e33 and x28 x10 x6 f4: 01c38533 add x10 x7 x28 f8: 0f0f0337 lui x6 0xf0f0 fc: 70f36313 ori x6 x6 1807 100: 80030313 addi x6 x6 -2048 104: 00455293 srli x5 x10 4 108: 005503b3 add x7 x10 x5 10c: 0063f533 and x10 x7 x6 110: 00855293 srli x5 x10 8 114: 00550533 add x10 x10 x5 118: 01055293 srli x5 x10 16 11c: 00550533 add x10 x10 x5 120: 000f8663 beq x31 x0 12 <clz_done> 124: 00050693 addi x13 x10 0 128: 00008067 jalr x0 x1 0 ``` #### 5-stage pipelined processor We choose **5-stage pipelined processor** as the processor we use in [Ripes](https://ripes.me/) to run the [code](https://github.com/liballouo/Computer-Architecture-2024-Fall-HW1/blob/main/Construct_the_Minimum_Bitwise_Array_I.s). Its block diagram look like this: ![{19B27C82-BEEA-48B8-971F-975848B1E2B8}](https://hackmd.io/_uploads/HJqZJn9JJx.png) The "5-stage" means this processor using five-stage pipeline to parallelize instructions. The stages are: 1. `IF`: Instruction fetch 2. `ID`: Instruction decode and register fetch 3. `EX`: Execute 4. `MEM`: Memory access 5. `WB`: Register write back To illustrate, we take the instruction `lui t1, 0x55555` as an example. According to [The RISC-V Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) > `LUI` (load upper immediate) is used to build 32-bit constants and uses the **U-type format**. `LUI` places the U-immediate value in the top 20 bits of the destination register `rd`, filling in the lowest 12 bits with zeros. #### U-type format instruction The layout of the U-format instructions is: | imm [31:12] | rd [11:7] | opcode [6:0] | | --------------------- | --------- | ------------ | The pseudo instruction is `lui x6 0x55555` which decoded into `55555337`. | imm [31:12] | rd [11:7] | opcode [6:0] | representation | | --------------------- | --------- | ------------ | -------------- | | 01010101010101010101 | 00110 | 0110111 | binary | | 0x55555 | 0x6 | 0x37 | heximal | #### 1. `IF`: Instruction fetch ![{DDF6555A-3CA9-4180-B899-0E9DD8A8F7E7}](https://hackmd.io/_uploads/B1WUfpcyyg.png) - The `PC` starts at `0x000000cc`. `PC+4` will pass to the following stages for further usages. - We fetch the instruction at `0x000000cc` from instruction memory, resulting in the instruction `0x55555337`. #### 2. `ID`: Instruction decode and register fetch ![{470862F3-ECD1-4D9C-971B-35B808B3FD54}](https://hackmd.io/_uploads/BJB6kCqkJe.png) - The instruction `0x55555337` is decoded, revealing `lui x6, 0x55555`. - `opcode = LUI` - `Imm. = 0x55555000` - `Wr idx = 0x06` - Although U-type instructions do not require source registers, the processor still fetches the source register indices from the instruction, resulting in `rs1 = 0x0a` and `rs2 = 0x15`. These fields are irrelevant for this instruction. ``` 31 25 24 20 19 15 14 12 ... 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 ... | rs2 = 0x15| | rs1 = 0x0a| ------------- ------------- ``` - The values of registers `rs1` and `rs2` are fetched, resulting in `Reg 1 = 0x00000003` and `Reg 2 = 0x00000000`, but they are not used in this instruction. - The `PC` and `PC+4` are passed unchanged to the next stage. #### `EX`: Execute ![{178DB876-7955-48AB-B299-1BF518A088BF}](https://hackmd.io/_uploads/Hy415Cckkx.png) - We do not use the `PC+4`, so we just let it pass. - We do not use `Reg 1` and `Reg 2` in U-type format instructions, so after their value passes the multiplexers, they are filtered. The result of ALU is just `0x55555000` from `Imm.`. - There is no branch taken in this stage. - `rd` is not used in this stage, so it just passes to the next stage. #### `MEM`: Memory access ![{810F0247-C207-4ADC-BFF9-706850186FE2}](https://hackmd.io/_uploads/BJ_H3Aq11l.png) - The `PC+4` is still passed to the next stage. - Memory access is not required for U-type instructions, so there is no memory read or write. Consequently, the "read out" of data memory will be `0x00000000`, indicating that no memory access occurred. #### `WB`: Register write back ![{B3B0AC7B-504A-4F8B-86DD-C02378FE7460}](https://hackmd.io/_uploads/Hka-eysJyx.png) - We pass the `rd = 0x06` and `imm = 0x55555000` to `Wr idx` and `Wr data` respectively. ![{C9F14B02-8287-4F7E-81B9-143DCAEB338C}](https://hackmd.io/_uploads/HJl5xJiJJx.png) ![{F4D25E68-C6DB-4E17-A640-B9088B44F757}](https://hackmd.io/_uploads/Bka5eyo1Jx.png) After completing these five stages, the register `x6(t1)` is set to `0x55555000`. ![{0AD3DF49-0B60-49E9-9EC8-6DABFA12C7EC}](https://hackmd.io/_uploads/ry7A-1jJye.png) ## Reference [The RISC-V Instruction Set Manual](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) [Quiz1 of Computer Architecture (2023 Fall)](https://hackmd.io/@sysprog/arch2023-quiz1-sol) [Quiz1 of Computer Architecture (2024 Fall) ](https://hackmd.io/@sysprog/arch2024-quiz1-sol)