# Assignment 1: RISC-V Assembly and Instruction Pipeline

contributed by < [`michael1215`](https://github.com/lu1215/computer_architecture) >

## Problem C in [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol)

### Problem analysis

In this problem, three functions are used: `fabsf`, `my_clz`, and `fp16_to_fp32`. The goal is to convert a 16-bit floating-point number into a 32-bit floating-point number.

#### fabsf

In the IEEE 754 standard, the most significant bit (MSB) of a 32-bit floating-point number is the sign bit, and the number is non-negative when the sign bit is 0. Therefore, ANDing the bit pattern of the input 32-bit floating-point number with 0x7FFFFFFF clears the sign bit while leaving the rest of the number unchanged.

Original C program
```c
static inline float fabsf(float x)
{
    uint32_t i = *(uint32_t *)&x;  // Read the bits of the float into an integer
    i &= 0x7FFFFFFF;               // Clear the sign bit to get the absolute value
    x = *(float *)&i;              // Write the modified bits back into the float
    return x;
}
```

RISC-V
```s
fabsf:
    # t0 is needed as a temporary, so save its original value on the stack
    addi sp, sp, -4
    sw t0, 0(sp)
    li t0, 0x7FFFFFFF
    and s0, t0, s0      # i &= 0x7FFFFFFF
    lw t0, 0(sp)
    addi sp, sp, 4
    jr ra               # return; the result is left in s0
```

#### my_clz

This program uses a for loop to check each bit starting from the most significant bit (MSB). If a bit is 1, the loop breaks, stopping further checks; if the bit is 0, the count is incremented. Finally, it returns the number of leading zeros in the input value.

Original C program
```c
static inline int my_clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}
```

RISC-V
```s
my_clz_ini:
    # s0 is x, s1 is the loop index, s2 is 1U << i, s3 is count in the C code
    # the result of this function is returned in s0
    # s3 is assumed to be 0 on entry (count = 0)
    addi sp, sp, -20    # reserve space for 5 words on the stack (s0-s3 and ra)
    sw ra, 0(sp)        # store registers on the stack
    sw s0, 4(sp)
    sw s1, 8(sp)
    sw s2, 12(sp)
    sw s3, 16(sp)
    addi s1, x0, 31     # index = 31
    addi s2, x0, 1      # initialize s2 to 1
    slli s2, s2, 31     # shift the one to the leftmost bit
myclz_loop:
    and t0, s0, s2      # doing the 'and' operation
    bnez t0, myclz_end  # if t0 is not 0, a 1-bit was found, jump to myclz_end
    addi s3, s3, 1      # count+1 when the tested bit is 0
    srli s2, s2, 1      # shift right 1 bit to form the next 1U << i
    addi s1, s1, -1     # index = index - 1
    bnez s1, myclz_loop # if s1 equals 0, fall through to myclz_end, else go to myclz_loop again
myclz_end:
    add s0, s3, x0      # store the final answer in s0
    lw ra, 0(sp)        # get return address
    lw s1, 8(sp)
    lw s2, 12(sp)
    lw s3, 16(sp)
    addi sp, sp, 20
    jr ra
```

#### branchless CLZ

In some cases a branchless CLZ is faster than the `my_clz` function above ([branchless CLZ](https://hackmd.io/@sysprog/arch2023-quiz1-sol#Problem-A)). It first sets every bit to the right of the leading 1-bit (scanning from left to right) to 1. Afterward, it applies shifts, subtraction, and masks to count the number of set bits (1s) in the integer. The number of leading zeros is then obtained by subtracting the number of 1-bits from 32; a hand-worked trace of these steps on a sample input is shown below, before the original C program.
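For example, tracing the branchless steps by hand on x = 0x2710 (10000, the largest input value in the LeetCode problem used later):

```
x = 0x2710                          (18 leading zeros expected)
x |= (x >> 1);   ->  x = 0x3798
x |= (x >> 2);   ->  x = 0x3FFE
x |= (x >> 4);   ->  x = 0x3FFF
x |= (x >> 8);   ->  x = 0x3FFF
x |= (x >> 16);  ->  x = 0x3FFF     (every bit below the leading 1 is now set: 14 ones)
bit-count steps: x = 0x2AAA -> 0x2444 -> 0x0608 -> 0x060E -> 0x060E
return 32 - (0x060E & 0x3f) = 32 - 14 = 18
```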
Original C program
```c
uint16_t count_leading_zeros(uint32_t x) {
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x -= ((x >> 1) & 0x55555555);
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    return (32 - (x & 0x3f));
}
```

RISC-V
```s
bl_clz_ini:
    addi sp, sp, -8     # put values onto the stack
    sw ra, 0(sp)
    sw s0, 4(sp)
bl_clz_calculating:
    srli t0, s0, 1      # x |= (x >> 1);
    or s0, t0, s0
    srli t0, s0, 2      # x |= (x >> 2);
    or s0, t0, s0
    srli t0, s0, 4      # x |= (x >> 4);
    or s0, t0, s0
    srli t0, s0, 8      # x |= (x >> 8);
    or s0, t0, s0
    srli t0, s0, 16     # x |= (x >> 16);
    or s0, t0, s0
    srli t0, s0, 1      # x -= ((x >> 1) & 0x55555555);
    li t1, 0x55555555
    and t0, t0, t1
    sub s0, s0, t0
    srli t0, s0, 2      # x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
    li t1, 0x33333333
    and t0, t0, t1
    and s0, s0, t1
    add s0, s0, t0
    srli t0, s0, 4      # x = ((x >> 4) + x) & 0x0f0f0f0f;
    add s0, s0, t0
    li t1, 0x0f0f0f0f
    and s0, s0, t1
    srli t0, s0, 8      # x += (x >> 8);
    add s0, t0, s0
    srli t0, s0, 16     # x += (x >> 16);
    add s0, t0, s0
    # (32 - (x & 0x3f));
    andi s0, s0, 0x3f
    li t1, 32
    sub s0, t1, s0
    lw ra, 0(sp)        # get value from stack
    addi sp, sp, 8
    jr ra
```

#### comparison between my_clz and branchless clz

Since the maximum input value in the selected LeetCode problem is 10,000, the loop is modified to start counting leading zeros from the 14th bit for comparison purposes.

test data:

**1. 10000 (0x2710)**

branchless clz
![image](https://hackmd.io/_uploads/ryagWfOyyg.png)
my_clz
![image](https://hackmd.io/_uploads/SyvxrfOJ1l.png)

**2. 8191 (0x1FFF)**

branchless clz
![image](https://hackmd.io/_uploads/ryZD4GOJJe.png)
my_clz
![image](https://hackmd.io/_uploads/SkljVfOykx.png)

**3. 0 (0x0000)**

branchless clz
![image](https://hackmd.io/_uploads/rkJrNMOJJg.png)
my_clz
![image](https://hackmd.io/_uploads/BJiMNf_1Je.png)

**4. 1 (0x0001)**

branchless clz
![image](https://hackmd.io/_uploads/Bk90SX_1ye.png)
my_clz
![image](https://hackmd.io/_uploads/BJ7nBmOyye.png)

**5. 64 (= 0x0040 = 0b1000000)**

branchless clz
![image](https://hackmd.io/_uploads/SymLIQuJkg.png)
my_clz
![image](https://hackmd.io/_uploads/r1Xt8mOyyg.png)

#### conclusion

Average number of instructions:

- branchless clz: `((38 + 37) / 2) * 10000 / 10000 = 37.5`
- my_clz: `(2*98 (0x0, 0x1) + 2*94 (0b1x) + 4*88 (0b1xx) + 8*82 (0b1xxx) + 16*76 (0b1xxxx) + 32*70 + 64*64 + 128*58 + 256*52 + 512*46 + 1024*40 + 2048*35 + 4096*28 + 1809*22) / 10000 = 20.567`

These numbers show that when the input data is evenly distributed over the range, the my_clz method is faster on average. For small input values, however, branchless CLZ performs much better, and because the values used in the subsequent test data are small, branchless CLZ is chosen for the rest of this problem.

#### fp16_to_fp32

This function converts a 16-bit half-precision floating-point number to the 32-bit single-precision format by adjusting the mantissa and exponent, handling special cases such as zero, NaN, infinity, and denormalized numbers.

Original C program
```c
static inline uint32_t fp16_to_fp32(uint16_t h)
{
    /*
     * Extends the 16-bit half-precision floating-point number to 32 bits
     * by shifting it to the upper half of a 32-bit word:
     *      +---+-----+------------+-------------------+
     *      | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000|
     *      +---+-----+------------+-------------------+
     * Bits  31  26-30    16-25            0-15
     *
     * S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits.
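     *
     * For example, h = 0x3C00 (1.0 in half precision) becomes w = 0x3C000000
     * after this shift.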
*/ const uint32_t w = (uint32_t) h << 16; /* * Isolates the sign bit from the input number, placing it in the most * significant bit of a 32-bit word: * * +---+----------------------------------+ * | S |0000000 00000000 00000000 00000000| * +---+----------------------------------+ * Bits 31 0-31 */ const uint32_t sign = w & UINT32_C(0x80000000); /* * Extracts the mantissa and exponent from the input number, placing * them in bits 0-30 of the 32-bit word: * * +---+-----+------------+-------------------+ * | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 30 27-31 17-26 0-16 */ const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); /* * The renorm_shift variable indicates how many bits the mantissa * needs to be shifted to normalize the half-precision number. * For normalized numbers, renorm_shift will be 0. For denormalized * numbers, renorm_shift will be greater than 0. Shifting a * denormalized number will move the mantissa into the exponent, * normalizing it. */ uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; /* * If the half-precision number has an exponent of 15, adding a * specific value will cause overflow into bit 31, which converts * the upper 9 bits into ones. Thus: * inf_nan_mask == * 0x7F800000 if the half-precision number is * NaN or infinity (exponent of 15) * 0x00000000 otherwise */ const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); /* * If nonsign equals 0, subtracting 1 will cause overflow, setting * bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result * propagates bit 31 across all bits in zero_mask. Thus: * zero_mask == * 0xFFFFFFFF if the half-precision number is * zero (+0.0h or -0.0h) * 0x00000000 otherwise */ const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; /* * 1. Shifts nonsign left by renorm_shift to normalize it (for denormal * inputs). * 2. Shifts nonsign right by 3, adjusting the exponent to fit in the * 8-bit exponent field and moving the mantissa into the correct * position within the 23-bit mantissa field of the single-precision * format. * 3. Adds 0x70 to the exponent to account for the difference in bias * between half-precision and single-precision. * 4. Subtracts renorm_shift from the exponent to account for any * renormalization that occurred. * 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input * was NaN or infinity. * 6. ANDs with the inverted zero_mask to set the mantissa and exponent * to zero if the input was zero. * 7. Combines everything with the sign bit of the input number. */ return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` RISC-V ```s fp16_to_fp32: addi sp, sp, -32 sw t0, 0(sp) sw t1, 4(sp) sw t2, 8(sp) sw t3, 12(sp) sw s0, 16(sp) sw ra, 20(sp) sw t4, 24(sp) sw t5, 28(sp) slli t0, s0, 16 # const uint32_t w = (uint32_t) h << 16; (w is t0) li t1, 0x80000000 and t2, t0, t1 # const uint32_t sign = w & UINT32_C(0x80000000); (sign is t2) addi t1, t1, -1 # get 0x7FFFFFFF and t3, t0, t1 # const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); (nonsign is t3) add s0, x0, t3 jal ra, my_clz_ini # my_clz(nonsign); addi s0, s0, -5 blt x0, s0, second_part # renorm_shift = renorm_shift > 5 ? 
    #   renorm_shift - 5 : 0;
    addi s0, x0, 0
second_part:
    # const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); (t4 is inf_nan_mask)
    li t4, 0x04000000
    add t4, t4, t3
    li t1, 0x7F800000
    srai t4, t4, 8
    and t4, t4, t1
    # const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; (t5 is zero_mask)
    addi t5, t3, -1
    srai t5, t5, 31
    # return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
    sll t0, t3, s0      # t0 = nonsign << renorm_shift
    srai t0, t0, 3      # t0 = (nonsign << renorm_shift) >> 3
    li t1, 0x70         # t1 = 0x70
    sub t1, t1, s0      # t1 = 0x70 - renorm_shift
    slli t1, t1, 23
    add t0, t0, t1      # t0 = (nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)
    or t0, t0, t4       # t0 = t0 | inf_nan_mask
    not t5, t5          # t5 = ~zero_mask
    and t0, t0, t5      # t0 = (t0 | inf_nan_mask) & ~zero_mask
    or t0, t0, t2       # t0 = sign | t0
    addi s0, t0, 0      # the converted value is left in s0
    lw t0, 0(sp)
    lw t1, 4(sp)
    lw t2, 8(sp)
    lw t3, 12(sp)
    lw ra, 20(sp)
    lw t4, 24(sp)
    lw t5, 28(sp)
    addi sp, sp, 32
    li a7, 10           # exit the program
    ecall

my_clz_ini:
    # s0 is x, s1 is the loop index, s2 is 1U << i, s3 is count in the C code
    # the result of this function is returned in s0
    # s3 is assumed to be 0 on entry (count = 0)
    addi sp, sp, -20    ## adding space of 5 words in stack (s0-s3 and ra)
    sw ra, 0(sp)        ## store register to stack
    sw s0, 4(sp)        ## store register to stack
    sw s1, 8(sp)        ## store register to stack
    sw s2, 12(sp)       ## store register to stack
    sw s3, 16(sp)       ## store register to stack
    addi s1, x0, 31     ## set index
    addi s2, x0, 1      ## initialize s2 to 1
    slli s2, s2, 31     ## shift one to the leftmost bit
myclz_loop:
    and t0, s0, s2      ## doing 'and' operation
    bnez t0, myclz_end  ## if t0 is not 0, a 1-bit was found
    addi s3, s3, 1      ## count+1 when the tested bit is 0
    srli s2, s2, 1      ## shift right 1 bit to form the next 1U << i
    addi s1, s1, -1     ## index = index - 1
    bnez s1, myclz_loop ## if s1 equals 0, fall through to myclz_end, else go to myclz_loop again
myclz_end:
    add s0, s3, x0      ## store final answer to s0 register
    lw ra, 0(sp)        ## get return address
    lw s1, 8(sp)        ## restore saved registers
    lw s2, 12(sp)
    lw s3, 16(sp)
    addi sp, sp, 20
    jr ra
```

## Sort Integers by The Number of 1 Bits ([LeetCode 1356](https://leetcode.com/problems/sort-integers-by-the-number-of-1-bits/))

> Description: You are given an integer array arr. Sort the integers in the array in ascending order by the number of 1's in their binary representation and in case of two or more integers have the same number of 1's you have to sort them in ascending order. Return the array after sorting it.
>
> Constraints:
> 1 <= arr.length <= 500
> 0 <= arr[i] <= 10^4

## Solution

### Idea for problem solving

First, I sort the elements of the array in ascending order. Then I take each element, count the number of 1s in its binary representation, and store the counts in a new array. Next, I need a stable sort on the counts so that the relative order produced by the first sort is not disturbed. Finally, I return the array with the sorted elements.

### test input
```
test input1: [0,3,4]             (each element has a different number of 1s in binary form)
test input2: [16,8,4,2,1]        (each element has the same number of 1s in binary form)
test input3: [0,1,2,3,4,5,6,7,8] (a mix of case 1 and case 2)
```
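For reference, the expected outputs for these three inputs (they also appear later as `ans_data1`-`ans_data3` in the assembly test code) are:
```
expected output1: [0,4,3]
expected output2: [1,2,4,8,16]
expected output3: [0,1,2,4,8,3,5,6,7]
```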
### C program

#### First version
```c
/**
 * Note: The returned array must be malloced, assume caller calls free().
 */
int counting_one(int val){
    int num = 0;
    int len = 32;
    for(int i = 0; i <= len; i++){
        num += val&1;
        val>>=1;
    }
    return num;
}

void swap(int* v1, int* v2){
    int tmp = *v1;
    *v1 = *v2;
    *v2 = tmp;
}

//// bubble sort ////
void bubble_sort(int* arr, int arrSize){
    for(int i = 0; i<arrSize; i++){
        for(int j = 0; j<arrSize; j++){
            if(arr[i] < arr[j])
                swap(&arr[i], &arr[j]);
        }
    }
}

void double_bubble_sort(int* arr, int* arr2, int arrSize){
    for(int i = 0; i<arrSize; i++){
        for(int j = 0; j<arrSize-i-1; j++){
            if(arr[j] > arr[j+1]) {
                swap(&arr[j], &arr[j+1]);
                swap(&arr2[j], &arr2[j+1]);
            }
        }
    }
}
//// bubble sort ////

int* sortByBits(int* arr, int arrSize, int* returnSize) {
    *returnSize = arrSize;
    bubble_sort(arr, arrSize);
    int* one_num_arr = malloc(arrSize*sizeof(int));
    for(int i = 0; i < arrSize; i++){
        one_num_arr[i] = counting_one(arr[i]);
    }
    double_bubble_sort(one_num_arr, arr, arrSize);
    return arr;
}
```

#### Second version
```c
/**
 * Note: The returned array must be malloced, assume caller calls free().
 */
// ref: https://hackmd.io/@sysprog/arch2023-quiz1-sol#Problem-A
uint16_t count_leading_zeros(uint32_t x) {
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x -= ((x >> 1) & 0x55555555);
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    return (32 - (x & 0x3f));
}

int counting_one(int val){
    int num = 0;
    int len = 32-count_leading_zeros(val);
    for(int i = 0; i <= len; i++){
        num += val&1;
        val>>=1;
    }
    return num;
}

void swap(int* v1, int* v2){
    int tmp = *v1;
    *v1 = *v2;
    *v2 = tmp;
}

//// quick sort ////
int partition(int* arr, int start, int end){
    int pivot = arr[end];
    int idx = start;
    for(int i = start; i < end; i++){
        if(arr[i] <= pivot){
            // swap(arr, idx, i);
            swap(&arr[idx], &arr[i]);
            idx++;
        }
    }
    // swap(arr, idx, end);
    swap(&arr[idx], &arr[end]);
    return idx;
}

void quick_sort(int* arr, int start, int end){
    if(start < end){
        int pivot = partition(arr, start, end);
        quick_sort(arr, start, pivot-1);
        quick_sort(arr, pivot+1, end);
    }
}
//// quick sort ////

//// bubble sort ////
void double_bubble_sort(int* arr, int* arr2, int arrSize){
    for(int i = 0; i<arrSize; i++){
        for(int j = 0; j<arrSize-i-1; j++){
            if(arr[j] > arr[j+1]) {
                swap(&arr[j], &arr[j+1]);
                swap(&arr2[j], &arr2[j+1]);
            }
        }
    }
}
//// bubble sort ////

int* sortByBits(int* arr, int arrSize, int* returnSize) {
    *returnSize = arrSize;
    quick_sort(arr, 0, arrSize-1);
    int* one_num_arr = malloc(arrSize*sizeof(int));
    for(int i = 0; i < arrSize; i++){
        one_num_arr[i] = counting_one(arr[i]);
    }
    double_bubble_sort(one_num_arr, arr, arrSize);
    return arr;
}
```

In the initial version of the program, I used bubble sort twice. However, stability only needs to be preserved in the second sort, so I replaced the first sort with quick sort and modified the original `counting_one` function. Since the input numbers are limited to the range 0-10000, at most the rightmost 14 bits will be used, so I can use the branchless CLZ function to reduce the number of loop iterations.
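For example, for val = 10000 the branchless CLZ returns 18, so len = 32 - 18 = 14 and the loop in `counting_one` runs only 15 times (i from 0 to len) instead of 33. As a side note, the loop over the value's bits could also be avoided with Brian Kernighan's trick, which clears the lowest set bit on every iteration and therefore loops only once per 1-bit. This is only a sketch of an alternative; it is not used in any of the versions in this assignment:

```c
// Alternative bit-count (sketch only, not used in this assignment):
// Brian Kernighan's method clears the lowest set bit each iteration,
// so the loop body runs exactly once per 1-bit in val.
int counting_one_kernighan(int val){
    int num = 0;
    while (val) {
        val &= (val - 1);   // drop the lowest set bit
        num++;
    }
    return num;
}
```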
#### Third version
```c
/**
 * Note: The returned array must be malloced, assume caller calls free().
 */
//// merge sort ////
// forward declaration: double_merge is defined after double_merge_sort
void double_merge(int* arr, int* arr2, int left, int mid, int right);

void double_merge_sort(int* arr, int* arr2, int left, int right){
    if (left < right){
        int mid = (right+left)/2;
        double_merge_sort(arr, arr2, left, mid);
        double_merge_sort(arr, arr2, mid+1, right);
        double_merge(arr, arr2, left, mid, right);
    }
}

void double_merge(int* arr, int* arr2, int left, int mid, int right){
    int tmp_arr[right - left + 1];
    int tmp_arr2[right - left + 1];
    int tmp_idx = 0;
    int l_idx = left;
    int r_idx = mid+1;
    while(l_idx <= mid && r_idx <= right){
        if( arr[l_idx] <= arr[r_idx]){
            tmp_arr[tmp_idx] = arr[l_idx];
            tmp_arr2[tmp_idx] = arr2[l_idx];
            tmp_idx++;
            l_idx++;
        }
        else {
            tmp_arr[tmp_idx] = arr[r_idx];
            tmp_arr2[tmp_idx] = arr2[r_idx];
            tmp_idx++;
            r_idx++;
        }
    }
    while(l_idx <= mid){
        tmp_arr[tmp_idx] = arr[l_idx];
        tmp_arr2[tmp_idx] = arr2[l_idx];
        tmp_idx++;
        l_idx++;
    }
    while(r_idx <= right){
        tmp_arr[tmp_idx] = arr[r_idx];
        tmp_arr2[tmp_idx] = arr2[r_idx];
        tmp_idx++;
        r_idx++;
    }
    for(int i = left, j = 0; i <= right; i++, j++){
        arr[i] = tmp_arr[j];
        arr2[i] = tmp_arr2[j];
    }
}
//// merge sort ////

// ref: https://hackmd.io/@sysprog/arch2023-quiz1-sol#Problem-A
uint16_t count_leading_zeros(uint32_t x) {
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x -= ((x >> 1) & 0x55555555);
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    return (32 - (x & 0x3f));
}

int counting_one(int val){
    int num = 0;
    int len = 32-count_leading_zeros(val);
    for(int i = 0; i <= len; i++){
        num += val&1;
        val>>=1;
    }
    return num;
}

void swap(int* v1, int* v2){
    int tmp = *v1;
    *v1 = *v2;
    *v2 = tmp;
}

//// quick sort ////
int partition(int* arr, int start, int end){
    int pivot = arr[end];
    int idx = start;
    for(int i = start; i < end; i++){
        if(arr[i] <= pivot){
            // swap(arr, idx, i);
            swap(&arr[idx], &arr[i]);
            idx++;
        }
    }
    // swap(arr, idx, end);
    swap(&arr[idx], &arr[end]);
    return idx;
}

void quick_sort(int* arr, int start, int end){
    if(start < end){
        int pivot = partition(arr, start, end);
        quick_sort(arr, start, pivot-1);
        quick_sort(arr, pivot+1, end);
    }
}
//// quick sort ////

int* sortByBits(int* arr, int arrSize, int* returnSize) {
    *returnSize = arrSize;
    quick_sort(arr, 0, arrSize-1);
    int* one_num_arr = malloc(arrSize*sizeof(int));
    for(int i = 0; i < arrSize; i++){
        one_num_arr[i] = counting_one(arr[i]);
    }
    double_merge_sort(one_num_arr, arr, 0, arrSize-1);
    return arr;
}
```

Compared with the second version, the second sorting step, which previously used bubble sort, is changed. Bubble sort is a stable sort with an average time complexity of O(n^2), so it is replaced with merge sort, a stable sort with an average time complexity of O(n log n), which improves the overall speed.
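Before moving to the fourth version, the same idea can be seen with the C library's `qsort`: both comparison criteria are folded into one comparator, so a single sorting pass suffices. This is only a sketch for comparison (it assumes the `counting_one` function defined in the versions above); the fourth version below applies the same comparator idea to the hand-written quick sort instead.

```c
#include <stdlib.h>

/* Comparator expressing both criteria: fewer 1-bits first, then smaller value.
 * Assumes counting_one() from the versions above is visible here. */
static int cmp_bits(const void *pa, const void *pb)
{
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    int diff = counting_one(a) - counting_one(b);
    return diff != 0 ? diff : (a - b);
}

int* sortByBits(int* arr, int arrSize, int* returnSize)
{
    *returnSize = arrSize;
    qsort(arr, arrSize, sizeof(int), cmp_bits);
    return arr;
}
```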
#### Fourth version
```c
/**
 * Note: The returned array must be malloced, assume caller calls free().
 */
// ref: https://hackmd.io/@sysprog/arch2023-quiz1-sol#Problem-A
uint16_t count_leading_zeros(uint32_t x) {
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x -= ((x >> 1) & 0x55555555);
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
    x = ((x >> 4) + x) & 0x0f0f0f0f;
    x += (x >> 8);
    x += (x >> 16);
    return (32 - (x & 0x3f));
}

int counting_one(int val){
    int num = 0;
    int len = 32-count_leading_zeros(val);
    for(int i = 0; i <= len; i++){
        num += val&1;
        val>>=1;
    }
    return num;
}

void swap(int* v1, int* v2){
    int tmp = *v1;
    *v1 = *v2;
    *v2 = tmp;
}

//// advanced quick sort ////
int comparing(const int a, int b){
    int a_one = counting_one(a);
    int b_one = counting_one(b);
    return (a_one - b_one == 0) ? (a - b) : (a_one - b_one);
}

int partition_adv(int* arr, int start, int end){
    int pivot = arr[end];
    int idx = start;
    for(int i = start; i < end; i++){
        if(comparing(arr[i], pivot) < 0){
            // swap(arr, idx, i);
            swap(&arr[idx], &arr[i]);
            idx++;
        }
    }
    // swap(arr, idx, end);
    swap(&arr[idx], &arr[end]);
    return idx;
}

void quick_sort_adv(int* arr, int start, int end){
    if(start < end){
        int pivot = partition_adv(arr, start, end);
        quick_sort_adv(arr, start, pivot-1);
        quick_sort_adv(arr, pivot+1, end);
    }
}
//// advanced quick sort ////

int* sortByBits(int* arr, int arrSize, int* returnSize) {
    *returnSize = arrSize;
    quick_sort_adv(arr, 0, arrSize-1);
    return arr;
}
```

In the third version of the program, we still needed to sort twice to achieve the goal. Referring to how a comparator is passed to the `qsort` function (see the sketch above), we combined the two comparison criteria into the `comparing` function, which lets us finish the task with a single sorting pass and makes the program more concise.

### assembly
```s
## predefined test input
.data
test_data1:     .word 0, 3, 4
test_data1_len: .word 3
ans_data1:      .word 0, 4, 3
test_data2:     .word 16, 8, 4, 2, 1
test_data2_len: .word 5
ans_data2:      .word 1, 2, 4, 8, 16
test_data3:     .word 0, 1, 2, 3, 4, 5, 6, 7, 8
test_data3_len: .word 9
ans_data3:      .word 0, 1, 2, 4, 8, 3, 5, 6, 7
f_str:          .string " False\n"
t_str:          .string " True\n"
align_str:      .string " "

.text
main:
    ## first test data
    la a2, test_data2       # get input test_data2
    la a3, test_data2_len   # get length of input test_data2
    lw a3, 0(a3)
    addi sp, sp, -12
    sw ra, 0(sp)            # put return address onto the stack
    addi s0, a2, 0          # s0 = first address of test_data2
    addi s1, x0, 0          # s1 = parameter start of quick_sort = 0
    addi s2, a3, -1         # s2 = parameter end of quick_sort = length of test_data2 - 1
    jal ra, quick_sort      # start executing quick sort
    addi t2, x0, 0          # set t2 to 0 (index used when checking the answer)
    addi t1, a3, 0          # t1 = number of elements to check
    la t3, ans_data2        # load first address of the correct answer into t3
    jal ra, print_loop      # checking answer
    ## second test data
    la a2, test_data1
    la a3, test_data1_len
    lw a3, 0(a3)
    addi sp, sp, -12
    sw ra, 0(sp)
    addi s0, a2, 0
    addi s1, x0, 0
    addi s2, a3, -1
    jal ra, quick_sort
    addi t2, x0, 0
    addi t1, a3, 0
    la t3, ans_data1
    jal ra, print_loop
    ## third test data
    la a2, test_data3
    la a3, test_data3_len
    lw a3, 0(a3)
    addi sp, sp, -12
    sw ra, 0(sp)
    addi s0, a2, 0
    addi s1, x0, 0
    addi s2, a3, -1
    jal ra, quick_sort
    addi t2, x0, 0
    addi t1, a3, 0
    la t3, ans_data3
    jal ra, print_loop
    li a7, 10               # exiting program
    ecall

quick_sort:
    # 3 parameters passed in s0, s1, s2
    # s0: arr pointer, s1: start, s2: end
    # local variable: s3 = pivot index
    blt s2, s1, quick_sort_ret  # if end < start, return
    # calculating the pivot
    addi sp, sp, -8             # store some variables on the stack and jump to the partition function
    sw ra, 0(sp)
    sw s0, 4(sp)
    jal ra, partition           # pivot index is returned in s0
    addi s3, s0, 0              # s3 is the pivot index
    lw ra, 0(sp)
    lw s0, 4(sp)
    addi sp, sp, 8
    addi sp, sp, -16            # push data (return addr, arr pointer, start, end) onto the stack
    sw ra, 0(sp)
    sw s0, 4(sp)
    sw s1, 8(sp)
    sw s2, 12(sp)
    addi s2, s3, -1             # set s2 to pivot-1
    jal ra, quick_sort          # quick_sort_adv(arr, start, pivot-1);
    lw s2, 12(sp)               # get value of end
    addi s1, s3, 1              # pivot + 1
    # note: s3 is not preserved across the recursive call above, so s1 may start before
    # pivot+1; those extra elements are already sorted, so the result stays correct,
    # only some work is redundant
    jal ra, quick_sort          # quick_sort_adv(arr, pivot+1, end);
    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw s2, 12(sp)
    addi sp, sp, 16
quick_sort_ret:
    jr ra

partition:
    # 3 parameters passed in s0, s1, s2
    # s0: arr pointer, s1: start, s2: end
    # returns in s0: the pivot index
    addi sp, sp, -32
    sw s0, 0(sp)
    sw s1, 4(sp)
    sw s2, 8(sp)
    sw ra, 12(sp)
    sw t0, 16(sp)
    sw t1, 20(sp)
    sw t2, 24(sp)
    sw t3, 28(sp)
    addi t1, s1, 0              # t1 holds the idx variable (int idx = start;)
    slli t2, s1, 2              # t2 is the relative address of arr[i] (arr[i] = s0 + t2)
    slli s2, s2, 2              # end * 4 (so it can be compared with t2)
    add t0, s0, s2              # t0 = address of arr[end]
    lw t0, 0(t0)                # t0 = pivot = arr[end];
partition_loop:
    bge t2, s2, partition_end   # i >= end, exit the loop
    # address of arr[i]
    lw s0, 0(sp)                # s0 = address of the first element
    add t3, s0, t2              # get the address of arr[i]
    lw s0, 0(t3)                # s0 = arr[i]
    # addi s0, s1, 0
    addi s1, t0, 0              # s1 = pivot
    jal ra, comparing           # compare arr[i] with the pivot
    bltz s0, ready_to_swap      # if the result of comparing < 0, swap
    addi t2, t2, 4
    j partition_loop
ready_to_swap:
    lw s0, 0(sp)                # s0 = address of the first element
    slli s1, t1, 2              # s1 = idx*4
    addi s2, t2, 0              # s2 = i*4
    add s2, s0, s2              # s2 = s0 + i*4
    add s0, s0, s1              # s0 = s0 + idx*4
    addi s1, s2, 0              # s1 = s2 = s0 + i*4
    jal ra, swap                # swap the elements pointed to by s0 and s1
    addi t1, t1, 1              # idx++
    addi t2, t2, 4              # addr + 4
    lw s2, 8(sp)
    slli s2, s2, 2              # end * 4 (so it can be compared with t2)
    j partition_loop
partition_end:
    # swap(&arr[idx], &arr[end]);
    lw s0, 0(sp)
    slli s1, t1, 2              # idx*4
    addi s2, t2, 0              # i*4
    add s2, s0, s2              # s2 = s0 + i*4
    add s0, s0, s1              # s0 = s0 + idx*4
    addi s1, s2, 0              # s1 = s2 = s0 + i*4
    jal ra, swap
    addi s0, t1, 0              # return idx
    lw s1, 4(sp)
    lw s2, 8(sp)
    lw ra, 12(sp)
    lw t0, 16(sp)
    lw t1, 20(sp)
    lw t2, 24(sp)
    lw t3, 28(sp)
    addi sp, sp, 32
    jr ra

comparing:
    # inputs in s0 and s1, the result is returned in s0
    addi sp, sp, -20
    sw ra, 0(sp)
    sw t0, 4(sp)
    sw t1, 8(sp)
    sw t2, 12(sp)
    sw t3, 16(sp)
    addi t0, s0, 0              # storing a
    addi t1, s1, 0              # storing b
    jal ra, counting_one        # counting_one(a)
    addi t2, s0, 0              # t2 = counting_one(a)
    addi s0, s1, 0              # s0 = b
    jal ra, counting_one        # counting_one(b)
    sub t3, t2, s0              # t3 = counting_one(a) - counting_one(b)
    beqz t3, ret_comp_same_one  # if counting_one(a) == counting_one(b), compare the values themselves
    addi s0, t3, 0              # return counting_one(a) - counting_one(b)
    j comparing_end
ret_comp_same_one:
    sub s0, t0, t1              # directly return a-b
comparing_end:
    lw ra, 0(sp)                # restore data from the stack
    lw t0, 4(sp)
    lw t1, 8(sp)
    lw t2, 12(sp)
    lw t3, 16(sp)
    addi sp, sp, 20
    jr ra

swap:
    # 2 input parameters: the addresses of a and b (s0, s1)
    # no return value
    addi sp, sp, -8
    sw t0, 0(sp)
    sw t1, 4(sp)
    lw t0, 0(s0)                # get value of a
    lw t1, 0(s1)                # get value of b
    sw t0, 0(s1)                # store a at the address in s1
    sw t1, 0(s0)                # store b at the address in s0
    lw t0, 0(sp)
    lw t1, 4(sp)
    addi sp, sp, 8
    jr ra

counting_one:
    # input in s0, the result is returned in s0
    addi sp, sp, -20
    sw ra, 0(sp)
    sw t0, 4(sp)
    sw s0, 8(sp)
    sw t1, 12(sp)
    sw t2, 16(sp)
    addi t0, x0, 0              # int num = 0;
    jal ra, bl_clz_ini          # count leading zeros (result returned in s0)
    li t1, 32
    sub t1, t1, s0              # t1 is len (int len = 32-count_leading_zeros(val);)
    lw s0, 8(sp)                # restore the original input val into s0
counting_one_loop:
    bltz t1, counting_one_end   # i < 0, exit the loop
    andi t2, s0, 1              # t2 = val&1;
    add t0, t0, t2              # num += t2;
    srli s0, s0, 1              # val>>=1;
    addi t1, t1, -1             # i--
    j counting_one_loop
counting_one_end:
    addi s0, t0, 0              # put the return value in s0
    lw ra, 0(sp)
    lw t0, 4(sp)
    lw t1, 12(sp)
    lw t2, 16(sp)
    addi sp, sp, 20
    jr ra

# branchless CLZ part
bl_clz_ini:
    # used registers: s0, t0, t1
    addi sp, sp, -16
    sw ra, 0(sp)
    sw s0, 4(sp)
    sw t0, 8(sp)
    sw t1, 12(sp)
bl_clz_calculating:
    srli t0, s0, 1
    or s0, t0, s0
    srli t0, s0, 2
    or s0, t0, s0
    srli t0, s0, 4
    or s0, t0, s0
    srli t0, s0, 8
    or s0, t0, s0
    srli t0, s0, 16
    or s0, t0, s0
    srli t0, s0, 1
    li t1, 0x55555555
    and t0, t0, t1
    sub s0, s0, t0
    srli t0, s0, 2
    li t1, 0x33333333
    and t0, t0, t1
    and s0, s0, t1
    add s0, s0, t0
    srli t0, s0, 4
    add s0, s0, t0
    li t1, 0x0f0f0f0f
    and s0, s0, t1
    srli t0, s0, 8
    add s0, t0, s0
    srli t0, s0, 16
    add s0, t0, s0
    andi s0, s0, 0x3f
    li t1, 32
    sub s0, t1, s0
    lw t0, 8(sp)
    lw t1, 12(sp)
    addi sp, sp, 16
    jr ra

print_loop:
    bge t2, t1, end_loop        # checking whether index >= number of elements in the array
    lw a0, 0(a2)                # print element of the array
    li a7, 1
    ecall
    li a0, 32                   # print a space
    li a7, 11
    ecall
    lw a0, 0(t3)                # print the correct answer
    li a7, 1
    ecall
    lw t4, 0(a2)                # check the answer (t4 is a temporary for the comparison)
    bne t4, a0, fail            # if element != correct element, jump to the fail part
    la a0, t_str                # if element == correct element, print " True\n"
    li a7, 4
    ecall
    addi a2, a2, 4              # addr = addr + 4 (address of the calculated result)
    addi t3, t3, 4              # addr = addr + 4 (address of the answer)
    addi t2, t2, 1              # index = index + 1
    j print_loop                # keep printing elements
end_loop:
    la a0, t_str                # print " True\n"
    li a7, 4
    ecall
    jr ra
fail:
    la a0, f_str                # load the address of " False\n" into a0
    li a7, 4                    # print string
    ecall
    li a7, 10                   # exit the program
    ecall
```

Automated test over all of the test data:

![image](https://hackmd.io/_uploads/HJ2MiEYJkx.png)

## Reference

[quick sort](https://alrightchiu.github.io/SecondRound/comparison-sort-quick-sortkuai-su-pai-xu-fa.html)
[another quick sort](https://medium.com/@amber.fragments/%E6%BC%94%E7%AE%97%E6%B3%95-%E5%AD%B8%E7%BF%92%E7%AD%86%E8%A8%98-12-%E5%BF%AB%E9%80%9F%E6%8E%92%E5%BA%8F%E6%B3%95-quick-sort-841420575b24)
[sorting algorithm](https://medium.com/%E6%8A%80%E8%A1%93%E7%AD%86%E8%A8%98/%E5%9F%BA%E7%A4%8E%E6%BC%94%E7%AE%97%E6%B3%95%E7%B3%BB%E5%88%97-%E4%BD%A0%E5%8F%AA%E6%9C%83%E7%94%A8-built-in-function-%E4%BE%86%E6%8E%92%E5%BA%8F%E5%97%8E-4a196d37f9f5)
[branchless clz](https://hackmd.io/@sysprog/arch2023-quiz1-sol#Problem-A)
[merge sort](https://alrightchiu.github.io/SecondRound/comparison-sort-merge-sorthe-bing-pai-xu-fa.html)
[another merge sort](https://samchien.blogspot.com/2014/03/merge-sort.html)