Assignment 2 : RISC-V Toolchain

--- tags: RISCV, jserv,Computer Architecture 2022 --- ###### tags: `111-1` # Assignment 2 : RISC-V Toolchain > avoid heavy memory access duplicated code > coner cases ## Question selection * This question is done by [吳宇晨](https://hackmd.io/@bcQ6fpRFSRW8UlMXdeUoiQ/ByXg5nLZs) (latter denoted as Wu)in Assignment 1. * Motivation * I saw the screenshot of the execution info. and I see the CPI is quite low so I was curious about how he manage to achieve this performance. ## Question description > * You are given a 0-indexed integer array nums and a target element target. > * A target index is an index i such that nums[i] == target. > * Return a list of the target indices of nums after sorting nums in non-decreasing order. If there are no target indices, return an empty list. The returned list must be sorted in increasing order * constraits * 1 <= `nums.length` <= 100 * 1 <= `nums[i]`, `target` <= 100 ## I try to re-implement the Wu's idea in C >* First I sorted the input with Bubble sort, after that, I got a sorted vector. >* Then I compare the target number and the number in the vector, if they matched then I append the index of the number into the vector ‘out’. > * The bool ‘flag’ is detecting whether I find the number, after I find the target first time, I set the flag to 1. > * If the flag is equal to 1 and I find a number in vector doesn’t match the target, then I know that there is no need to search down anymore. > from [吳宇晨's AS1](https://hackmd.io/@bcQ6fpRFSRW8UlMXdeUoiQ/ByXg5nLZs) ### My first implementation in C : $O(n^2)$ * Bubble sort ```c= void swap(int* a, int* b){ int temp = *b; *b = *a; *a = temp; } // implement bubble sort in nondescending order void bubbleSort(int* arr, int arrSize) { int i, j; for (i = 0; i < arrSize - 1; i++) // Last i elements are already in place for (j = 0; j < arrSize - i - 1; j++) if (arr[j] > arr[j + 1]) swap(&arr[j], &arr[j + 1]); } ``` * Print array ```c= void printArr(int* arr, int arrSize){ for(int i = 0; i < arrSize; i++) printf("%d ", arr[i]); printf("\n"); } ``` * Main logic ```c= int* targetIndices(int* nums, int numsSize, int target, int* returnSize){ // sort given array bubbleSort(nums, numsSize); int* temp_ret = (int*) malloc(sizeof(int) * numsSize); int ret_index = 0; // traverse array to find target for(int i = 0; i < numsSize;i++){ if(nums[i] == target){ temp_ret[ret_index] = i; ret_index ++; } } *returnSize = ret_index; printf("%d",*returnSize); int * ret; ret = realloc(temp_ret, (ret_index) * sizeof(int)); return ret; } ``` ### My another implementation of C * I find out there is no need to really sort the given array. * just count how many values are smaller(`n_sma`) than target, and how many targets(`n_eq`) are inside the array. * return them with proper shift amount of the index which is the `n_sum` would be sufficient. * Thus the time complexity can be reduced to $O(n)$ ```c= int* targetIndices(int* nums, int numsSize, int target, int* returnSize){ int n_sma = 0, n_eq = 0; for(int i = 0; i < numsSize; i++) { if (nums[i] < target) n_sma++; if (nums[i] == target) n_eq++; } *returnSize = n_eq; int* ret = (int*) malloc(sizeof(int) * n_eq); for(int i = 0;i < n_eq; i++) { ret[i]= n_sma + i; } return ret; } ``` ### testcase :::danger Extend test case to make the metric distinguishable for finding bottleneck ::: ```c int test1[] = {55, 44, 33, 22, 11}; int test2[] = {55, 55, 55, 55, 55, 55, 55, 55, 55, 55, /* ignored for simpification */ 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, }; // test_size = 100 ``` Result under above two testcase |Optimization level|re-implement of Wu's idea| My implementation| |------|------|------| |-O0|155824|61273| |-O1|72972|57756| |-O2|72972|57758| |-O3|72972|57758| |-Ofast|72972|57758| ## Assembly code > `.data` : is a read-write section containing global or static variables > `.text` : is a read-only section containing executable code > `.rodata` : is a read-only section containing const variables > `.bss` is a read-write section containing uninitialized data > `.LANCHOR0` : is how GCC groups things so it can reference multiple static locations from one reference point. > * Constructing a 32-bit address in a register takes multiple instructions, or a PC-relative load of a pointer from a nearby literal pool. The compiler wants to avoid having the address of each individual static (or global) variable in literal pools near code; that would bloat things. > `.LANCHOR0`, `.LANCHOR1`, etc. are the names gcc uses for such pointers ### -Ofast assembly code from gcc ```clike targetIndices: ble a1,zero,.L2 addi sp,sp,-16 slli a1,a1,2 # shift target by 2, target mutiply by 4 sw s0,8(sp) sw s1,4(sp) sw ra,12(sp) # save return address add a1,a0,a1 # a0 : given array address, compute the end address of array (a1) li s1,0 # s1(n_eq): clean to zero li s0,0 # s0(n_sma) : clean to zero j .L5 .L20: # count n_sma addi a0,a0,4 # next element's address addi s0,s0,1 # n_sma ++ beq a1,a0,.L19 # check if the array ends .L5: lw a4,0(a0) # load test[i] (current element) sub a5,a2,a4 # a5 = current element - target seqz a5,a5 # if a5 = zero means (a4 is target), and set a5 = 1 bgt a2,a4,.L20 # target greater than current element, than branch addi a0,a0,4 # next element's address add s1,s1,a5 # n_eq ++ (since a5 is set to 1 if equal) bne a1,a0,.L5 # check if the array ends .L19: sw s1,0(a3) # a3 : slli a0,s1,2 call malloc beq s1,zero,.L1 mv a5,s0 # move n_sma to a5 mv a4,a0 # copy return address add s1,s1,s0 # compute ending index .L8: sw a5,0(a4) # a4 : address of return array addi a5,a5,1 # index ++ addi a4,a4,4 # next address to put index bne a5,s1,.L8 .L1: lw ra,12(sp) # load back return address lw s0,8(sp) lw s1,4(sp) addi sp,sp,16 # add back sp jr ra # return to caller .L2: sw zero,0(a3) li a0,0 tail malloc .size targetIndices, .-targetIndices .section .rodata.str1.4,"aMS",@progbits,1 .align 2 # align to power of 2 (align with 4 bytes) ``` ### hand_written Assembly code version 1 ```clike .data arr1: .word 55,44,33,22,11 arrSize1: .word 5 arr2: .byte 55,55,55,55,55,, # ........ ,55,55,55,55,55,55 // extended testcase size: 100 arrSize2: .byte 100 target: .word 55 str: .string "Indices of target: " space: .string " " ``` ```clike .text main: # initailization la t0 target # a0 : target lw a0, 0(t0) la t1,arrSize lw a1, 0(t1) # a1 : arrSize la a2, arr1 # a2 : base address of arr addi s0, s0, 0 # s0 : n_sma addi s1, s1, 0 # s0 : n_equ jal ra loop jal ra PrintIndices loop: lw t2, 0(a2) # arr[i] beq t2, a0, n_eqpp blt t2, a0, n_smapp cont: addi a2, a2, 4 # i+1 addi a1, a1, -1 #numSize-- bne a1, x0, loop jalr x0, ra, 0 n_eqpp: addi s1, s1, 1 j cont n_smapp: addi s0, s0, 1 j cont PrintIndices: la a0, str li a7,4 ecall add t0, s0, x0 # return value pcont: add a0, t0, x0 li a7, 1 # print integer ecall la a0, comma li a7,4 ecall addi t0, t0, 1 addi s1, s1, -1 bne s1, x0, pcont Exit: ``` ## Assembly optimization 【this part is still revising】 goal : * create less assembly code ## Reference * [gnu:gcc - Optimize-Options](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) * [你所不知道的 C 語言：編譯器和最佳化原理篇](https://hackmd.io/@sysprog/c-compiler-optimization?type=view#%E4%BD%A0%E6%89%80%E4%B8%8D%E7%9F%A5%E9%81%93%E7%9A%84-C-%E8%AA%9E%E8%A8%80%EF%BC%9A%E7%B7%A8%E8%AD%AF%E5%99%A8%E5%92%8C%E6%9C%80%E4%BD%B3%E5%8C%96%E5%8E%9F%E7%90%86%E7%AF%87)

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.