# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [`hungyuhang`](https://github.com/hungyuhang) >
## Multiplication Overflow Prediction
This method relies on the inequality below:
$$ \lceil \log_2(xy) \rceil \leq \lceil \log_2(x) \rceil + \lceil \log_2(y) \rceil $$
So if we compute $\lceil \log_2(x) \rceil$ and $\lceil \log_2(y) \rceil$, we can predict whether the product of $x$ and $y$ will overflow without performing the multiplication.
By counting the leading zeros of $x$ and $y$ in binary representation, we can obtain the base-2 logarithm of $x$ and $y$: for a non-zero 64-bit value, $\lfloor \log_2(x) \rfloor = 63 - \mathrm{clz}(x)$.
Although the concept is straightforward, this method can mispredict.
For example, for $x \approx 2^{61.4}$ and $y \approx 2^{1.6}$, the result of $\lceil \log_2(x) \rceil + \lceil \log_2(y) \rceil$ is 64, so the algorithm predicts that the product of $x$ and $y$ will overflow. However, the product is about $2^{63}$; since that is smaller than $2^{64}$, the value does not actually overflow.
## Implementation
### C Code
In the implementation below, if $\log_2(x)$ is an integer $u$, I compute $\lceil \log_2(x) \rceil$ as $u+1$ (i.e. the ceiling is always taken as $\lfloor \log_2(x) \rfloor + 1$). This only affects the result when:
1. $\lfloor \log_2(x) + \log_2(y) \rfloor = 62$, where $\log_2(x)$ or $\log_2(y)$ is an integer.
2. $\log_2(x) + \log_2(y) = 63$, where $\log_2(x)$ and $\log_2(y)$ are both integers.
```clike=
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

// test case a: no overflow, predict result is false
uint64_t a_x0 = 0x0000000000000000;
uint64_t a_x1 = 0x0000000000000000;
// test case b: no overflow, predict result is false
uint64_t b_x0 = 0x0000000000000001;
uint64_t b_x1 = 0x0000000000000010;
// test case c: no overflow, but predict result is true
uint64_t c_x0 = 0x0000000000000002;
uint64_t c_x1 = 0x4000000000000000;
// test case d: overflow, and predict result is true
uint64_t d_x0 = 0x0000000000000003;
uint64_t d_x1 = 0x7FFFFFFFFFFFFFFF;

uint16_t count_leading_zeros(uint64_t x)
{
    /* smear the most significant set bit into all lower positions */
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);
    /* count ones (population count) */
    x -= ((x >> 1) & 0x5555555555555555);
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    x += (x >> 32);
    return (64 - (x & 0x7f));
}

bool predict_if_mul_overflow(uint64_t *x0, uint64_t *x1)
{
    int32_t exp_x0 = 63 - (int32_t)count_leading_zeros(*x0);
    int32_t exp_x1 = 63 - (int32_t)count_leading_zeros(*x1);
    return (exp_x0 + 1) + (exp_x1 + 1) >= 64;
}

int main(void)
{
    printf("%d\n", predict_if_mul_overflow(&a_x0, &a_x1));
    printf("%d\n", predict_if_mul_overflow(&b_x0, &b_x1));
    printf("%d\n", predict_if_mul_overflow(&c_x0, &c_x1));
    printf("%d\n", predict_if_mul_overflow(&d_x0, &d_x1));
    return 0;
}
```
### Assembly Code
Please check the assembly code on [GitHub](https://github.com/hungyuhang/computer_architecture_2023/blob/main/hw1/hw1.s).
## Analysis
The code was tested using the [Ripes](https://github.com/mortbopet/Ripes) simulator.
Here are the first few lines of the disassembled executable code:
```
00000000 <main>:
0: ff010113 addi x2 x2 -16
4: 10000297 auipc x5 0x10000
8: ffc28293 addi x5 x5 -4
c: 00512023 sw x5 0 x2
10: 10000297 auipc x5 0x10000
14: 00028293 addi x5 x5 0
18: 00512223 sw x5 4 x2
1c: 10000297 auipc x5 0x10000
20: 00428293 addi x5 x5 4
24: 00512423 sw x5 8 x2
28: 10000297 auipc x5 0x10000
2c: 00828293 addi x5 x5 8
30: 00512623 sw x5 12 x2
34: 00400413 addi x8 x0 4
38: 00000493 addi x9 x0 0
3c: 00010913 addi x18 x2 0
```
### Instruction Pipelining
Let's take a look at how the `auipc x5 0x10000` instruction at address `0x4` works during execution.
#### IF Stage
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ip_if.png" width="40%" alt="cpu execution info"><br>
Here we can see that the output of `PC` is `0x00000004`, the address of the `auipc` instruction, and the corresponding machine code `0x10000297` is output from the compressed decoder.
#### ID Stage
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ip_id.png" width="40%" alt="cpu execution info"><br>
During this stage, the decoder decodes the instruction and sets the corresponding control bits. Here we can see that the Imm output is `0x10000000`.
#### EX Stage
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ip_ex.png" width="60%" alt="cpu execution info"><br>
In the EX stage, the ALU adds the immediate and the instruction's address together, so the ALU output is `0x10000004`.
#### MEM Stage
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ip_mem.png" width="70%" alt="cpu execution info"><br>
Since `auipc` neither loads from nor stores to memory, the pipeline simply passes the data through to the next stage.
In this example, however, we can see that the pipeline forwards the new value of `x5` back to the EX stage (the yellow line), because the very next instruction (`addi x5 x5 -4`) uses `x5` immediately.
#### WB Stage
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ip_wb.png" width="100%" alt="cpu execution info"><br>
Finally, when the instruction reaches the write-back stage, the value `0x10000004` is written into the register file. Here we can see that the data is transferred via the yellow line, and the `Write Enable` bit is set, which allows `x5` to be updated.
### Memory Update
Here we will use `sw x5 0 x2` at address `0xC` as an example. This instruction stores the value of `x5` onto the stack.
#### View from pipeline
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/mu_1.png" width="30%" alt="cpu execution info"><br>
When the instruction reaches the MEM stage, we can see that the inputs of the Data Memory are set as below:
| Addr. | Data in | Write Enable |
|------------|------------|--------------|
| 0x7fffffe0 | 0x10000000 | 1 |
This means the value `0x10000000` will be stored at address `0x7fffffe0`.
#### View from memory
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/mu_2.png" width="100%" alt="cpu execution info"><br>
After the instruction leaves the MEM stage, we can see that the memory has been updated.
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/mu_3.png" width="100%" alt="cpu execution info"><br>
### Cache Performance
#### Data Cache
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/cp_dc.png" width="50%" alt="cpu execution info"><br>
No matter how I change the cache associativity, the hit rate remains the same. I assume the 7 misses are all compulsory misses, since the program barely loads/stores data at different memory locations. (The image below shows a fully associative cache.)
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/cp_dc_2.png" width="50%" alt="cpu execution info"><br>
#### Instruction Cache
##### Different Set Associativity
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/cp_ic_1.png" width="50%" alt="cpu execution info"><br>
First I tried a direct-mapped cache, which yields a hit rate of 0.9091.
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/cp_ic_2.png" width="50%" alt="cpu execution info"><br>
Then I changed the cache type to fully associative, and the hit rate dropped to 0.8508.
The result was surprising, since a fully associative cache avoids conflict misses. I assume there is some access pattern in the code that makes the direct-mapped cache perform better.
##### Different Number of Words per Line
The images below show the results for different words-per-line configurations:
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/cp_ic_3.png" width="50%" alt="cpu execution info"><img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/cp_ic_4.png" width="50%" alt="cpu execution info"><br>
The results show that more words per line yields a higher hit rate. I think the reason is that my code has few branches, so most of it runs sequentially.
This means that when a long instruction block is loaded into the cache, the CPU uses all of the instructions in the cache line before branching elsewhere.
## Assembly Optimization
### Instruction Count Reduction
Code section before optimization:
```clike=
# x |= (x >> 1);
srli t0, a0, 1
slli t1, a1, 31
or t0, t0, t1
srli t1, a1, 1 # t0 t1 = x >> 1
or a0, a0, t0
or a1, a1, t1 # a0 a1 = x | (x >> 1)
# x |= (x >> 2);
srli t0, a0, 2
slli t1, a1, 30
or t0, t0, t1
srli t1, a1, 2 # t0 t1 = x >> 2
or a0, a0, t0
or a1, a1, t1 # a0 a1 = x | (x >> 2)
# x |= (x >> 4);
srli t0, a0, 4
slli t1, a1, 28
or t0, t0, t1
srli t1, a1, 4 # t0 t1 = x >> 4
or a0, a0, t0
or a1, a1, t1 # a0 a1 = x | (x >> 4)
# x |= (x >> 8);
srli t0, a0, 8
slli t1, a1, 24
or t0, t0, t1
srli t1, a1, 8 # t0 t1 = x >> 8
or a0, a0, t0
or a1, a1, t1 # a0 a1 = x | (x >> 8)
# x |= (x >> 16);
srli t0, a0, 16
slli t1, a1, 16
or t0, t0, t1
srli t1, a1, 16 # t0 t1 = x >> 16
or a0, a0, t0
or a1, a1, t1 # a0 a1 = x | (x >> 16)
# x |= (x >> 32);
mv t0, a1
mv t1, zero # t0 t1 = x >> 32
or a0, a0, t0
or a1, a1, t1 # a0 a1 = x | (x >> 32)
```
Code section after optimization:
```clike=
bne a1, zero, clz_fill_ones_upper
clz_fill_ones_lower:
srli t0, a0, 1
or a0, a0, t0
srli t0, a0, 2
or a0, a0, t0
srli t0, a0, 4
or a0, a0, t0
srli t0, a0, 8
or a0, a0, t0
srli t0, a0, 16
or a0, a0, t0
j clz_fill_ones_end
clz_fill_ones_upper:
srli t1, a1, 1
or a1, a1, t1
srli t1, a1, 2
or a1, a1, t1
srli t1, a1, 4
or a1, a1, t1
srli t1, a1, 8
or a1, a1, t1
srli t1, a1, 16
or a1, a1, t1
li a0, 0xffffffff
clz_fill_ones_end:
```
After optimization, the instruction count drops from 1132 to 973, as shown below:
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ao_icr_1.png" width="50%" alt="cpu execution info"><img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ao_icr_2.png" width="50%" alt="cpu execution info"><br>
However, the CPI increases slightly after optimization. I think this is because the added `bne` instruction introduces more branch hazards.
### Inline Expansion
In this optimization, I inlined the print function at its call site:
Before:
```clike=
main_loop:
lw a0, 0(s2) # a0 stores the pointer to first data in cmp_data_x
addi a1, a0, 8 # a1 stores the pointer to second data in cmp_data_x
jal ra, pimo
jal ra, print_dec # print the result (which is in a0)
addi s2, s2, 4 # s2 points to next cmp_data_x
addi s1, s1, 1 # counter++
bne s1, s0, main_loop
```
After:
```clike=
main_loop:
lw a0, 0(s2) # a0 stores the pointer to first data in cmp_data_x
addi a1, a0, 8 # a1 stores the pointer to second data in cmp_data_x
jal ra, pimo
li a7, 1 # tell ecall to print decimal
ecall # print result of pimo (which is in a0)
li a0, 32 # 32 is " " in ASCII
li a7, 11 # tell ecall to print char
ecall # print space
addi s2, s2, 4 # s2 points to next cmp_data_x
addi s1, s1, 1 # counter++
bne s1, s0, main_loop
```
A number of metrics improved after I inlined the print function, including instruction count, CPI, and data cache hit rate.
Before:
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ie_b_1.png" width="40%" alt="cpu execution info"><img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ie_b_2.png" width="60%" alt="cpu execution info"><br>
After:
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ie_a_1.png" width="40%" alt="cpu execution info"><img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/ie_a_2.png" width="60%" alt="cpu execution info"><br>
## Analysis - Compared With the Multiplication Method
To show that the CLZ method described above is more efficient than actually multiplying the two numbers, I implemented the multiplication version of overflow detection and analyzed its performance. The assembly code is on [GitHub](https://github.com/hungyuhang/computer_architecture_2023/blob/main/hw1/hw1_mul_ver.s), and the code below is the pseudocode of the assembly.
```clike=
/* C pseudocode */
bool check_if_multiplication_overflow(uint64_t *x0, uint64_t *x1)
{
    uint128_t dx0 = *x0;
    uint64_t  dx1 = *x1;
    uint128_t sum = 0;
    uint32_t  cnt = 0;
    while (true)
    {
        if (((word 2 of sum) != 0) or ((word 3 of sum) != 0))
            return true;
        if (cnt == 64)
            return false;
        if ((dx1 & 1) == 1)
            sum += dx0;
        dx0 = dx0 << 1;
        dx1 = dx1 >> 1;
        cnt++;
    }
}
```
Compared to the CLZ implementation, this method always returns the correct answer, since it *really* computes the product.
On the other hand, it takes more clock cycles to run. I used the [Ripes](https://github.com/mortbopet/Ripes) simulator to run the assembly code, and here is the result:
<img src="https://github.com/hungyuhang/computer_architecture_2023/raw/main/hw1/img/acwmm.png" width="50%" alt="cpu execution info"><br>
The multiplication method takes 7167 cycles to complete the task, while the CLZ method only takes 933 cycles to run.
Although the CLZ method may mispredict in some cases, it performs far better than the multiplication method.