Assignment1: RISC-V Assembly and Instruction Pipeline

Assignment1: RISC-V Assembly and Instruction Pipeline === contributed by <[`YHKung`](https://github.com/YHK00103)> ## 1. Problem C ### 1.1 Abstract In problem C, the goal is to find the absolute value of a 32-bit floating-point number and convert a 16-bit floating-point number in IEEE half-precision format (`fp16`) to a 32-bit floating-point number in IEEE single-precision format (`fp32`). The problem defines three functions, including `fabsf`, `my_clz`, and `fp16_to_fp32`. The most significant function, `fp16_to_fp32`, implements the conversion of 16-bit half-precision floating-point numbers and 32-bit single-precision floating-point numbers. In order to ensure the accurate conversion, it also considers the special cases, such as `NaN`, `infinity`, and `zero`. I will first introduce the floating-point format and then explain each function along with its corresponding RISC-V assembly. ### 1.2 Half-precision and Single-precision floating-point format In IEEE 754 floating-point format, a floating-point number is separated into three main parts, including `sign`, `exponent`, and `Mantissa`. The following forms represent the structure of 16-bit half-precision floating-point number and 32-bit single-precision floating-point number. 16-bit half-precision floating-point number: | Sign (1 bit) | Exponent (5 bits) | Mantissa (10 bits) | |:------------:|:-----------------:|:------------------:| | x | xxxxx | xxxx xxxx xx | 32-bit single-precision floating-point number: | Sign (1 bit) | Exponent (8 bits) | Mantissa (23 bits) | |:------------:|:-----------------:|:----------------------------:| | x | xxxx xxxx | xxxx xxxx xxxx xxxx xxxx xxx | **1. Sign** Sign bit determines whether a floating-point number is positive or negative. * If the sign bit is `0`, the floating-point number is positive. * If the sign bit is `1`, the floating-point number is negative. **2. Exponent** The function of exponent is to control the range of floating-point number. It stores exponent bits in a biased format, which simpifies the comparison of floating-point numbers and allows exponent to be expressed as both positive and negative value. * For a 16-bit half-precision foating-point number, the bias is `15`. * For a 32-bit single-precision floating-point numberm the bias is `127`. **3. Mantissa** Mantissa decides the valid digits and the precision of a floating-point number. In IEEE 754 standard, the most significant bit (MSB) of mantissa is assumed to be 1, known as the implicit bit. Hence, the actual representation is `1.mantissa`. ### 1.3 fabsf The `fabsf` function computes the absolute value of a 32-bit floating-point number by leveraging the characteristics of IEEE 754 floating-point format. The function reads the bits of the floating-point `x` into an unsigned integer and uses the mask `0x7FFFFFFF` to clear the sign bit, thus obtaining the absolute value. Finally, it writes the modified bits back into the float and returns the absolute value. ```=C static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; } ``` The corresponding RISC-V assembly is showed below. ```=C # fabsf function in RISC-V assembly # Input: float x (passed via a0) # Output: absolute value of x (returned in a0) fabsf: li t1, 0x7FFFFFFF # t1 = 0x7FFFFFFF (mask) and a0, a0, t1 # x &= i ret ``` ### 1.4 my_clz The `my_clz` function calculates the number of leading zeros in a 32-bit unsigned integer. First, the function reads the 32-bit unsigned interger and initializes the variable `count` to 0. Then, a for loop starts from the most significant bit (MSB) and checks whether each bits is zero. If a non-zero bit is found, the for loop breaks, and the function returns the value of `count` , which represents the number of leading zeros. ```=C static inline int my_clz(uint32_t x) { int count = 0; //Initialize count to be 0 for (int i = 31; i >= 0; --i) { if (x & (1U << i)) //Check whether each bit is non-zero break; count++; } return count; } ``` The corresponding RISC-V assembly is showed below. ```=C # my_clz function in RISC-V assembly # Input: 32-bit unsigned integer x (passed via a0) # Output: count of leading zeros (returned in a0) my_clz: li t0, 0 # Initialize count = 0 (t0) li t1, 31 # Initialize i = 31 (t1) li s0, 1 # S0 = 1U my_clz_loop: sll t2, s0, t1 # t2 = (IU << i) and t2, a0, t2 # t2 = (x & (IU << i)) beqz t2, skip_end # if ((x & (IU << i)) != 0), go to skip_end j my_clz_end skip_end: addi t0, t0, 1 # count++ addi t1, t1, -1 # t1 = i-- bgez t1, my_clz_loop # if (i >= 0), go to for loop my_clz_end: mv a0, t0 # move result to a0 and return ret ``` ### 1.5 fp16_to_fp32 The `fp16_to_fp32` function converts a 16-bit half-precision floating-point number to a 32-bit floating-point single-precision number. The details of this function are as follows: **1. Extend half-precision to single-precision** To extend a 16-bit half-precision floating-point number to 32-bit single-precision floating-point number, shift it left by 16 bits, and put the mantissa and exponent into the upper 16 bits of an integer. **2. Extract sign, exponent, and mantissa** By using masks `0x80000000` and `0x7FFFFFFF` respectively, the sign bit is extracted into the variable `sign`, while the exponent and mantissa bits are stored in the variable `nonsign`. **3. Calcualte renormalized shift** The purpose of this operation is to determine if a half-precision floating-point number needs to be normalized. First, The number of leading zeros in exponent and mantissa is calculated by `my_clz` function. Then, `renorm_shift` adjusts the result according to whether the leading zeros exceed 5 bits, which decides how many bits the mantissa needs to be shifted for normalization. **4. Create inf_nan_mask and zero_mask** To identify `NaN` and `infinity` cases, the operation creates a mask to examine whether the exponent bits are `11111`. Then, for `zero` case, it creates an all one mask (`0xFFFFFFFF`). **5. Reformatting and Reorganization** Eventually, the sign bit, normalized mantissa, and adjusted exponent are combined to form a 32-bit single-precision floating-point number, which is returned as the result. The C code is as follows. ```=C static inline uint32_t fp16_to_fp32(uint16_t h) { // 1. Extend half-precision to dingle-precision const uint32_t w = (uint32_t) h << 16; // 2. Extract sign, exponent, and mantissa const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); // 3. Calcualte renormalized shift uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; // 4. Create inf_nan_mask and zero_mask const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; // 5. Reformatting and Reorganization return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` The corresponding RISC-V assembly is showed below. ```=C # fp16_to_fp32 function in RISC-V assembly # Input: uint16_t h (passed via a0) # Output: uint32_t h_32 (returned in a0) fp16_to_fp32: # w slli t0, a0, 16 # t0 = w = (uint32_t) h << 16 # sign li t1, 0x80000000 # t1 = 0x80000000 and t2, t0, t1 # t2 = sign = w & UINT32_C(0x80000000) # nonsign li t1, 0x7FFFFFFF # t1 = 0x7FFFFFFF and t3, t0, t1 # t3 = nonsign = w & UINT32_C(0x7FFFFFFF); # renorm_shift addi sp, sp, -4 # store ra to stack sw ra, 4(sp) mv a0, t3 jal ra, my_clz lw ra, 4(sp) # reload ra from stack addi sp, sp, 4 mv t4, a0 # t4 = renorm_shift = my_clz(nonsign) addi t4, t4, -5 # t4 = renorm_shift - 5 bgt t4, zero, continue # if (renorm_shift > 5), goto continue li t4 0 # if (renorm_shift <= 5), t4 = renorm_shift = 0 continue: # inf_nan_mask li t1, 0x04000000 # t1 = 0x04000000 add t5, t3, t1 # t5 = nonsign + 0x04000000 srai t5, t5, 8 # t5 = (nonsign + 0x04000000) >> 8 li t1, 0x7F800000 # t1 = 0x7F800000 and t5, t5, t1 # t5 = inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000) # zero_mask addi t6, t3, -1 # t6 = nonsign - 1 srai t6, t6, 31 # t6 = zero_mask = (int32_t)(nonsign - 1) >> 31 # return value sll t3, t3, t4 # t3 = nonsign << renorm_shift srai t3, t3, 3 # t3 = (nonsign << renorm_shift) >> 3 li t1, 0x70 # t1 = 0x70 sub t1, t1, t4 # t1 = 0x70 - renorm_shift slli t1, t1, 23 # t1 = (0x70 - renorm_shift) << 23 add t3, t3, t1 # t3 = ((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) or t3, t3, t5 # t3 = (((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) li t1, -1 xor t6, t6, t1 # t6 = ~zero_mask and t3, t3, t6 # t3 = ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask) or t2, t2, t3 # t2 = sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); #return mv a0, t2 ret ``` ## 2. Use case from Leetcode ### 2.1 Motivation In problem C, The `my_clz` function calculates how many leading zeros there are in the binary representation of a 32-bit unsigned integer. This function can be utilized in programs that count the maximum consecutive 1s. Counting the maximum consecutive 1s in a sequence consisting only of 0s and 1s can be applied in various fields. For example, in image compression, the pixels of a black-and-white image can be representeed as 0s and 1s. Calculating the maximum consecutive 1s in a specific area can be used to compress the data, thereby reducing storage space. Hence, finding the number of maximum consecutive 1s is an important issue. ### 2.2 Max Consecutive Ones Based on the motivation, [Leetcode 485. Max Consecutive Ones](https://leetcode.com/problems/max-consecutive-ones/description/) can be solved by using the `my_clz` function. The description of the Leetcode problem is as follows: >Description: >Given a binary array nums, return the maximum number of consecutive 1's in the array > >Example 1: >Input: nums = [1,1,0,1,1,1] Output: 3 Explanation: The first two digits or the last three digits are consecutive 1s. The maximum number of consecutive 1s is 3. > >Example 2: Input: nums = [1,0,1,1,0,1] Output: 2 > >Constraints: 1 <= nums.length <= 10^5 nums[i] is either 0 or 1. ## 3. Solution ### 3.1 C code The C code program can be divided into `my_clz` function and `findMaxConsecutiveOnes` function. The details of these functions are as follows: **1. my_clz function** This function counts the number of leading zeros in the binary representation of a 32-bit integer `x`. It iterates from the most significant bit (bit 31) down to 0, counting the number of zeros until it finds the first 1. **2. findMaxConsecutiveOnes function** This function calculates the maximum number of consecutive 1's in the array `nums[]`. It first groups every 32 elements in the array `nums[]` into a pack called `packed`. Then, if `packed` is `0xFFFFFFFF`, it adds `wordSize` to `currentCount`. Otherwise, it uses the `my_clz` function to skip over the leading zeros and count the consecutive 1s. Eventually, it returns the value of `maxCount`, which represents the maximum number of consecutive 1s. ```c= #include <stdio.h> #include <stdint.h> static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } int findMaxConsecutiveOnes(int* nums, int numsSize) { int maxCount = 0; int currentCount = 0; int wordSize = 32; int set = (numsSize >> 5) << 5; if (numsSize == 0) return 0; for (int i = set; i >= 0; i-= wordSize) { uint32_t packed = 0; //Pack 32 bits into an unsigned interger for (int j = 0; j < wordSize && i + j < numsSize; j++) { packed |= ((uint32_t)nums[i + j] << j); } if (packed == 0xFFFFFFFF) { currentCount += wordSize; } else { int counter = 32; while(counter > 0){ int leadingZeros = my_clz(packed); packed <<= leadingZeros; counter -= leadingZeros; if(leadingZeros != 0){ if(currentCount > maxCount) maxCount = currentCount; currentCount = 0; } else{ currentCount++; packed <<= 1; counter--; } } } } if (currentCount > maxCount) maxCount = currentCount; return maxCount; } int main() { int nums1[] = {1, 1, 0, 1, 1, 1}; int numsSize1 = sizeof(nums1) / sizeof(nums1[0]); int result1 = findMaxConsecutiveOnes(nums1, numsSize1); printf("Maximum number of consecutive 1's: %d\n", result1); int nums2[] = {1, 0, 1, 1, 0, 1}; int numsSize2 = sizeof(nums2) / sizeof(nums2[0]); int result2 = findMaxConsecutiveOnes(nums2, numsSize2); printf("Maximum number of consecutive 1's: %d\n", result2); int nums3[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1}; int numsSize3 = sizeof(nums3) / sizeof(nums3[0]); int result3 = findMaxConsecutiveOnes(nums3, numsSize3); printf("Maximum number of consecutive 1's: %d\n", result3); return 0; } ``` ### 3.2 RISC-V assembly code The corresponding RISC-V assembly is as follows: ```=C .data num1: .word 1, 1, 0, 1, 1, 1 numSize1: .word 6 num2: .word 1, 0, 1, 1, 0, 1 numSize2: .word 6 num3: .word 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 numSize3: .word 35 str1: .string "Maximum number of consecutive 1's :" str2: .string "\n" .text # main function in RISC-V assembly main: # num1 case la a0, num1 lw a1, numSize1 jal ra, findMaxConsecutiveOnes mv t0, a0 la a0, str1 li a7, 4 ecall mv a0, t0 li a7, 1 ecall la a0, str2 li a7, 4 ecall # num2 case la a0, num2 lw a1, numSize2 jal ra, findMaxConsecutiveOnes mv t0, a0 la a0, str1 li a7, 4 ecall mv a0, t0 li a7, 1 ecall la a0, str2 li a7, 4 ecall # num3 case la a0, num3 lw a1, numSize3 jal ra, findMaxConsecutiveOnes mv t6, a0 la a0, str1 li a7, 4 ecall mv a0, t6 li a7, 1 ecall la a0, str2 li a7, 4 ecall # Exit the program li a7, 10 # System call code for exiting ecall # my_clz function in RISC-V assembly # Input: 32-bit unsigned integer x (passed via a0) # Output: count of leading zeros (returned in a0) my_clz: li t2, 1 # t2 = 1U li t1, 31 # t1 = i = 31 li t0, 0 # t0 = count my_clz_loop: sll t3, t2, t1 # t3 = 1U << i addi t1, t1, -1 # t1 = i-- and t3, a0, t3 # t3 = x & (1U << i) bnez t3, my_clz_end # if ((x & (1U << i)) == 1), goto my_clz_end addi t0, t0, 1 # t0 = count++ bgez t1, my_clz_loop # if (i >= 0), goto my_clz_loop my_clz_end: mv a0, t0 ret # findMaxConsecutiveOnes function in RISC-V assembly # Input: pointer to array *num, interger numSize (passed via a0 and a1) # Output: max consecutive ones (returned in a0) findMaxConsecutiveOnes: li s0, 0 # s0 = maxCount = 0 srli s3, a1, 5 # s3 = set = (numSize >> 5) beqz a1, for_loop_end # if (numsSize == 0), goto for_loop_end li s1, 0 # s1 = currentCount = 0 slli s3, s3, 5 # s3 = set = (numSize >> 5) << 5 li s2, 32 # s2 = wordSize = 32 li s4, 0 # s4 = packed = 0 addi s5, s3, 0 # s5 = i = set li s6, 0 # s6 = j = 0 j for_loop for_loop: bgez s5, pack_bits # if (i >= 0), goto pack_bits sub t5, s1, s0 # t5 = currentcount - maxCount bgez t5, update_maxCount2 # if (currentCount >= maxCount), goto update_maxCount2 j for_loop_end # if-else structure 2 sub t6, s1, s0 # t6 = currentCount - maxCount bgez t6, update_maxCount2 # if (currentCount - maxCount >= 0), goto update_maxCount2 j for_loop_end for_loop_update: sub s5, s5, s2 # s1 = i -= wordSize j for_loop pack_bits: add t0, s5, s6 # t0 = i + j slli, t4, t0, 2 # t4 = (i+j) * 4 sub t1, s6, s2 # t1 = j - wordSize add t3, a0, t4 # t4 = nums[i+j] address sub t2, t0, a1 # t2 = (i+j) - numsSize bgez t1, check_packed # if (j - wordSize >= 0), goto check_packed lw t3, 0(t3) # t3 = num[i+j] value bgez t2, check_packed # if (((i+j) - numsSize) >= 0), goto check_packed sll t3, t3, s6 # t3 = num[i+j] << j addi s6, s6, 1 # s6 = j++ li s8, 0xFFFFFFFF # s8 = 0xFFFFFFFF or s4, s4, t3 # s4 = packed |= (nums[i + j] << j); j pack_bits check_packed: sub t4, s4, s8 # t4 = packed - 0xFFFFFFFF li s7, 32 # s7 = counter = 32 li s6, 0 # s6 = j = 0 beqz t4, update_currentCount # if (packed == 0xFFFFFFFF), goto update_currentCount bgez s7, while_loop # if (counter >= 0), goto, while_loop update_currentCount: add s1, s1, s2 # s1 = currentCount += wordSize sub s5, s5, s2 # s1 = i -= wordSize j for_loop_update while_loop: # check while_loop condition beqz s7, for_loop_update # if (counter == 0), goto for_loop_update # call my_clz addi sp, sp, -24 #store ra and a0 to stack before calling my_clz sw ra, 24(sp) sw a0, 20(sp) sw t0, 16(sp) sw t1, 12(sp) sw t2, 8(sp) sw t3, 4(sp) mv a0, s4 # a0 = packed jal ra, my_clz mv s9, a0 # s9 = leadingZeros = my_clz(packed) lw ra, 24(sp) lw a0, 20(sp) lw t0, 16(sp) lw t1, 12(sp) lw t2, 8(sp) lw t3, 4(sp) addi sp, sp, 24 sll s4, s4, s9 # s4 = packed <<= leadingZeros sub s7, s7, s9 # s7 = counter -= leadingZeros # if-else structure 1 beqz s9, update_currentCount2 # if (leadingZeros == 0), goto update_currentCount2 sub t5, s1, s0 # t5 = currentcount - maxCount bgez t5, update_maxCount1 # if (currentCount - maxCount >= 0), goto update_maxCount1 update_currentCount2: addi s7, s7, -1 # s7 = counter -= 1 addi s1, s1, 1 # s1 = currentCount++ slli s4, s4, 1 # s4 = packed <<= 1 j while_loop update_maxCount1: mv s0, s1 # s0 = maxCount = currentCount li s1, 0 # s1 = currentCount = 0 j while_loop update_maxCount2: mv s0, s1 # s0 = maxCount = currentCount j for_loop_end for_loop_end: mv a0, s0 # return maxCount ret ``` ## 4. Result ### 4.1 Test cases The test cases are shown below: | | numsSize | nums[] | |:-----------:|:--------:|:---------------------------------------------------------------------------------------------------------:| | test case 1 | 6 | {1, 1, 0, 1, 1, 1} | | test case 2 | 6 | {1, 0, 1, 1, 0, 1} | | test case 3 | 35 | {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1} | ### 4.2 C progran output The output of C program is as follows: ```=C Maximum number of consecutive 1's :3 Maximum number of consecutive 1's :2 Maximum number of consecutive 1's :35 -------------------------------- Process exited after 0.0407 seconds with return value 0 ``` ### 4.3 RISC-V assembly output The outputs of RISC-V assembly output are below, including the results of maximum number of consecutive 1's and the execution information. **Results:** ```=C Maximum number of consecutive 1's :3 Maximum number of consecutive 1's :2 Maximum number of consecutive 1's :35 Program exited with code: 0 ``` **Execution information:** ```=C Cycles: 2434 Instrs. retired: 1902 CPI: 1.28 IPC: 0.781 Clock rate: 0 Hz ``` ![image](https://hackmd.io/_uploads/rkwlAWOkJg.png) ![image](https://hackmd.io/_uploads/rJuBkktyye.png) ## 5. Analysis I test the code using [Ripes](https://github.com/mortbopet/Ripes) simulator. ### 5.1 Pseodo instruction :::spoiler Pseodo instruction ``` 00000000 <main>: 0: 10000517 auipc x10 0x10000 4: 00050513 addi x10 x10 0 8: 10000597 auipc x11 0x10000 c: 0105a583 lw x11 16 x11 10: 0f0000ef jal x1 240 <findMaxConsecutiveOnes> 14: 00050293 addi x5 x10 0 18: 10000517 auipc x10 0x10000 1c: 0b050513 addi x10 x10 176 20: 00400893 addi x17 x0 4 24: 00000073 ecall 28: 00028513 addi x10 x5 0 2c: 00100893 addi x17 x0 1 30: 00000073 ecall 34: 10000517 auipc x10 0x10000 38: 0b850513 addi x10 x10 184 3c: 00400893 addi x17 x0 4 40: 00000073 ecall 44: 10000517 auipc x10 0x10000 48: fd850513 addi x10 x10 -40 4c: 10000597 auipc x11 0x10000 50: fe85a583 lw x11 -24 x11 54: 0ac000ef jal x1 172 <findMaxConsecutiveOnes> 58: 00050293 addi x5 x10 0 5c: 10000517 auipc x10 0x10000 60: 06c50513 addi x10 x10 108 64: 00400893 addi x17 x0 4 68: 00000073 ecall 6c: 00028513 addi x10 x5 0 70: 00100893 addi x17 x0 1 74: 00000073 ecall 78: 10000517 auipc x10 0x10000 7c: 07450513 addi x10 x10 116 80: 00400893 addi x17 x0 4 84: 00000073 ecall 88: 10000517 auipc x10 0x10000 8c: fb050513 addi x10 x10 -80 90: 10000597 auipc x11 0x10000 94: 0345a583 lw x11 52 x11 98: 068000ef jal x1 104 <findMaxConsecutiveOnes> 9c: 00050f93 addi x31 x10 0 a0: 10000517 auipc x10 0x10000 a4: 02850513 addi x10 x10 40 a8: 00400893 addi x17 x0 4 ac: 00000073 ecall b0: 000f8513 addi x10 x31 0 b4: 00100893 addi x17 x0 1 b8: 00000073 ecall bc: 10000517 auipc x10 0x10000 c0: 03050513 addi x10 x10 48 c4: 00400893 addi x17 x0 4 c8: 00000073 ecall cc: 00a00893 addi x17 x0 10 d0: 00000073 ecall 000000d4 <my_clz>: d4: 00100393 addi x7 x0 1 d8: 01f00313 addi x6 x0 31 dc: 00000293 addi x5 x0 0 000000e0 <my_clz_loop>: e0: 00639e33 sll x28 x7 x6 e4: fff30313 addi x6 x6 -1 e8: 01c57e33 and x28 x10 x28 ec: 000e1663 bne x28 x0 12 <my_clz_end> f0: 00128293 addi x5 x5 1 f4: fe0356e3 bge x6 x0 -20 <my_clz_loop> 000000f8 <my_clz_end>: f8: 00028513 addi x10 x5 0 fc: 00008067 jalr x0 x1 0 00000100 <findMaxConsecutiveOnes>: 100: 00000413 addi x8 x0 0 104: 0055d993 srli x19 x11 5 108: 10058c63 beq x11 x0 280 <for_loop_end> 10c: 00000493 addi x9 x0 0 110: 00599993 slli x19 x19 5 114: 02000913 addi x18 x0 32 118: 00000a13 addi x20 x0 0 11c: 00098a93 addi x21 x19 0 120: 00000b13 addi x22 x0 0 124: 0040006f jal x0 4 <for_loop> 00000128 <for_loop>: 128: 020ad263 bge x21 x0 36 <pack_bits> 12c: 40848f33 sub x30 x9 x8 130: 0e0f5463 bge x30 x0 232 <update_maxCount2> 134: 0ec0006f jal x0 236 <for_loop_end> 138: 40848fb3 sub x31 x9 x8 13c: 0c0fde63 bge x31 x0 220 <update_maxCount2> 140: 0e00006f jal x0 224 <for_loop_end> 00000144 <for_loop_update>: 144: 412a8ab3 sub x21 x21 x18 148: fe1ff06f jal x0 -32 <for_loop> 0000014c <pack_bits>: 14c: 016a82b3 add x5 x21 x22 150: 00229e93 slli x29 x5 2 154: 412b0333 sub x6 x22 x18 158: 01d50e33 add x28 x10 x29 15c: 40b283b3 sub x7 x5 x11 160: 02035063 bge x6 x0 32 <check_packed> 164: 000e2e03 lw x28 0 x28 168: 0003dc63 bge x7 x0 24 <check_packed> 16c: 016e1e33 sll x28 x28 x22 170: 001b0b13 addi x22 x22 1 174: fff00c13 addi x24 x0 -1 178: 01ca6a33 or x20 x20 x28 17c: fd1ff06f jal x0 -48 <pack_bits> 00000180 <check_packed>: 180: 418a0eb3 sub x29 x20 x24 184: 02000b93 addi x23 x0 32 188: 00000b13 addi x22 x0 0 18c: 000e8463 beq x29 x0 8 <update_currentCount> 190: 000bd863 bge x23 x0 16 <while_loop> 00000194 <update_currentCount>: 194: 012484b3 add x9 x9 x18 198: 412a8ab3 sub x21 x21 x18 19c: fa9ff06f jal x0 -88 <for_loop_update> 000001a0 <while_loop>: 1a0: fa0b82e3 beq x23 x0 -92 <for_loop_update> 1a4: fe810113 addi x2 x2 -24 1a8: 00112c23 sw x1 24 x2 1ac: 00a12a23 sw x10 20 x2 1b0: 00512823 sw x5 16 x2 1b4: 00612623 sw x6 12 x2 1b8: 00712423 sw x7 8 x2 1bc: 01c12223 sw x28 4 x2 1c0: 000a0513 addi x10 x20 0 1c4: f11ff0ef jal x1 -240 <my_clz> 1c8: 00050c93 addi x25 x10 0 1cc: 01812083 lw x1 24 x2 1d0: 01412503 lw x10 20 x2 1d4: 01012283 lw x5 16 x2 1d8: 00c12303 lw x6 12 x2 1dc: 00812383 lw x7 8 x2 1e0: 00412e03 lw x28 4 x2 1e4: 01810113 addi x2 x2 24 1e8: 019a1a33 sll x20 x20 x25 1ec: 419b8bb3 sub x23 x23 x25 1f0: 000c8663 beq x25 x0 12 <update_currentCount2> 1f4: 40848f33 sub x30 x9 x8 1f8: 000f5a63 bge x30 x0 20 <update_maxCount1> 000001fc <update_currentCount2>: 1fc: fffb8b93 addi x23 x23 -1 200: 00148493 addi x9 x9 1 204: 001a1a13 slli x20 x20 1 208: f99ff06f jal x0 -104 <while_loop> 0000020c <update_maxCount1>: 20c: 00048413 addi x8 x9 0 210: 00000493 addi x9 x0 0 214: f8dff06f jal x0 -116 <while_loop> 00000218 <update_maxCount2>: 218: 00048413 addi x8 x9 0 21c: 0040006f jal x0 4 <for_loop_end> 00000220 <for_loop_end>: 220: 00040513 addi x10 x8 0 224: 00008067 jalr x0 x1 0 ``` ::: ### 5.2 5-stage pipelined processor [Ripes](https://github.com/mortbopet/Ripes) provides several processers simulator, and I use 5-stage pipelined processor to test the code. ![image](https://hackmd.io/_uploads/Hkxd3Adyyl.png) Compared to a single-cycle processor, 5-stage processor is divided into five stages: Instruction Fetch (`IF`), Instruction Decode (`ID`), Execute (`EX`), Memory Access (`MEM`), and Write Back (`WB`). The details of each stage are as follows: **1. Instruction Fetch (IF)** * The instruction is fetched from the memory address pointed by the Program Counter (PC). * The Program Counter (PC) is updated by being added an immediate value of 4 using an adder, allowing the PC to point to the memory address of the next instruction. **2. Instruction Decode (ID)** * The instruction is decoded using a decoder and read the value from register sourse. * For I-format instructions, the immediate value undergoes sign extension before being inputted into the pipelined register. **3. Execute (EX)** * The ALU carries out one of the following instructions: * **R-format** : The ALU performs the operation based on the opcode and the values in registers. * **I-format** : The ALU performs the operation based on the opcode, the values in registers, and the sign-extensioned immediate value. * **S-format** : The ALU performs an addition operation to add the base address and offset, with the resulting value representing the memory address for storing or loading data. * **B-format** : The ALU checks conditions and determines whether the branch occurs. **4. Memory Access (MEM)** * For store instruction, the data will be stored in memory at the address computed in Execute (EX) stage. * For load instruction, the data will be loaded from memory according to the address computed in Execute (EX) stage and then input into pipelined register. **5. Write Back (WB)** * The data value from memory or the result of ALU operation is written back into register. ### 5.3 Example To clarify how the processor operates instructions, I will use the pseudoinstruction `addi x22, x22, 1` to illustrate the operation stage by stage. The memory address of pseudoinstruction `addi x22, x22, 1` is `0x170`. **1. Instruction Fetch (IF)** * First, the memory address pointed by program counter (PC) is `0x170`, which represents the next insturction the processor will execute in next cycle. * According to RISC-V instruction format, the I-format instruction in binay representation can be separated into `immediate`, `rs1`, `funct3`, `rd`, and `opcode`. | immediate [11:0] | rs1 [12:16] | funct3 [17:19] | rd [20:24] | opcode [25:31] | |:----------------:|:-----------:|:--------------:|:----------:|:--------------:| | 0000 0000 0001 | 10110 | 000 | 10110 | 0010011 | * Meanwhile, the program counter (PC) is added by immediate value 4 and obtain the memory address of the next instruction `0x174`. ![image](https://hackmd.io/_uploads/HJxBzBxYJyg.png) **2. Instruction Decode (ID)** According to the instruction fetched in the previous stage, the decoder decodes the instruction into `rs1`, `rd`, and `opcode`. The `immediate` is directly input into the sign-extension device. * `rs1` is `0x16` (x22). The register file reads the value `0x0` from the corresponding register and inputs it into pipelined register. * `opcode` is `0010011`. (ADDI) * `immediate` is `0x1`, which is extended to a 32-bit number for calculation in next stage. ![image](https://hackmd.io/_uploads/By8VSeF1Jx.png) **3. Execute (EX)** * In the EX stage, the input signals `op1` and `op2` are controlled by two-level multiplexers. * op1 : the value of `rs1` passes through two-level multiplexers as an input of the ALU. * op2 : the value of `immediate` passes through two-level multiplexers as an input of the ALU. * Hence, the inputs and output of ALU are as follows: * Input : `op1` = `0x0` , `op2` = `0x1` * Output : `ALU_result` = `0x1` * The result of the ALU operation, the destination register `0x16` (x22) , and the program counter are sent to the pipelined register for the next stage. ![image](https://hackmd.io/_uploads/ByarSgKyyl.png) **4. Memory Access (MEM)** * Because I-format instructions do not need to load or store value from memory, the control signal `wr_en` is set to 0. * Program counter, the result of ALU operation, and destination register `0x16` are sent to pipelined register directly. ![image](https://hackmd.io/_uploads/rJZwHlKkyg.png) **5. Write Back (WB)** * In the WB stage, the multiplexer controls which signal will pass through using the control signal. In this case, the result of ALU operation will be sent directly from pipelined register and written back into the destination register `0x16` (x22) through the multiplexer. ![image](https://hackmd.io/_uploads/SJU9Bxtykg.png) **6. Memory Update** The first image below shows the memory status when the pseudoinstruction `addi x22, x22, 1` is in MEM stage, while second image shows the same pseudoinstruction in the WB stage. According to two images, it is evident that the value of register `x22` is updated in the WB stage. ![image](https://hackmd.io/_uploads/HysIVVY1ke.png =50%x)![image](https://hackmd.io/_uploads/ryY_VVFkJe.png =50%x) ![image](https://hackmd.io/_uploads/HyzSrNty1l.png =50%x)