---
tags: RISCV, jserv,Computer Architecture 2022
---
###### tags: `111-1`
# Assignment 2 : RISC-V Toolchain
> avoid heavy memory access duplicated code
> coner cases
## Question selection
* This question is done by [吳宇晨](https://hackmd.io/@bcQ6fpRFSRW8UlMXdeUoiQ/ByXg5nLZs) (latter denoted as Wu)in Assignment 1.
* Motivation
* I saw the screenshot of the execution info. and I see the CPI is quite low so I was curious about how he manage to achieve this performance.
## Question description
> * You are given a 0-indexed integer array nums and a target element target.
> * A target index is an index i such that nums[i] == target.
> * Return a list of the target indices of nums after sorting nums in non-decreasing order. If there are no target indices, return an empty list. The returned list must be sorted in increasing order
* constraits
* 1 <= `nums.length` <= 100
* 1 <= `nums[i]`, `target` <= 100
## I try to re-implement the Wu's idea in C
>* First I sorted the input with Bubble sort, after that, I got a sorted vector.
>* Then I compare the target number and the number in the vector, if they matched then I append the index of the number into the vector ‘out’.
> * The bool ‘flag’ is detecting whether I find the number, after I find the target first time, I set the flag to 1.
> * If the flag is equal to 1 and I find a number in vector doesn’t match the target, then I know that there is no need to search down anymore.
> from [吳宇晨's AS1](https://hackmd.io/@bcQ6fpRFSRW8UlMXdeUoiQ/ByXg5nLZs)
### My first implementation in C : $O(n^2)$
* Bubble sort
```c=
void swap(int* a, int* b){
int temp = *b;
*b = *a;
*a = temp;
}
// implement bubble sort in nondescending order
void bubbleSort(int* arr, int arrSize)
{
int i, j;
for (i = 0; i < arrSize - 1; i++)
// Last i elements are already in place
for (j = 0; j < arrSize - i - 1; j++)
if (arr[j] > arr[j + 1])
swap(&arr[j], &arr[j + 1]);
}
```
* Print array
```c=
void printArr(int* arr, int arrSize){
for(int i = 0; i < arrSize; i++)
printf("%d ", arr[i]);
printf("\n");
}
```
* Main logic
```c=
int* targetIndices(int* nums, int numsSize, int target, int* returnSize){
// sort given array
bubbleSort(nums, numsSize);
int* temp_ret = (int*) malloc(sizeof(int) * numsSize);
int ret_index = 0;
// traverse array to find target
for(int i = 0; i < numsSize;i++){
if(nums[i] == target){
temp_ret[ret_index] = i;
ret_index ++;
}
}
*returnSize = ret_index;
printf("%d",*returnSize);
int * ret;
ret = realloc(temp_ret, (ret_index) * sizeof(int));
return ret;
}
```
### My another implementation of C
* I find out there is no need to really sort the given array.
* just count how many values are smaller(`n_sma`) than target, and how many targets(`n_eq`) are inside the array.
* return them with proper shift amount of the index which is the `n_sum` would be sufficient.
* Thus the time complexity can be reduced to $O(n)$
```c=
int* targetIndices(int* nums, int numsSize, int target, int* returnSize){
int n_sma = 0, n_eq = 0;
for(int i = 0; i < numsSize; i++) {
if (nums[i] < target) n_sma++;
if (nums[i] == target) n_eq++;
}
*returnSize = n_eq;
int* ret = (int*) malloc(sizeof(int) * n_eq);
for(int i = 0;i < n_eq; i++) {
ret[i]= n_sma + i;
}
return ret;
}
```
### testcase
:::danger
Extend test case to make the metric distinguishable for finding bottleneck
:::
```c
int test1[] = {55, 44, 33, 22, 11};
int test2[] = {55, 55, 55, 55, 55, 55, 55, 55, 55, 55,
/* ignored for simpification */
55, 55, 55, 55, 55, 55, 55, 55, 55, 55,
}; // test_size = 100
```
Result under above two testcase
|Optimization level|re-implement of Wu's idea| My implementation|
|------|------|------|
|-O0|155824|61273|
|-O1|72972|57756|
|-O2|72972|57758|
|-O3|72972|57758|
|-Ofast|72972|57758|
## Assembly code
> `.data` : is a read-write section containing global or static variables
> `.text` : is a read-only section containing executable code
> `.rodata` : is a read-only section containing const variables
> `.bss` is a read-write section containing uninitialized data
> `.LANCHOR0` : is how GCC groups things so it can reference multiple static locations from one reference point.
> * Constructing a 32-bit address in a register takes multiple instructions, or a PC-relative load of a pointer from a nearby literal pool. The compiler wants to avoid having the address of each individual static (or global) variable in literal pools near code; that would bloat things.
> `.LANCHOR0`, `.LANCHOR1`, etc. are the names gcc uses for such pointers
### -Ofast assembly code from gcc
```clike
targetIndices:
ble a1,zero,.L2
addi sp,sp,-16
slli a1,a1,2 # shift target by 2, target mutiply by 4
sw s0,8(sp)
sw s1,4(sp)
sw ra,12(sp) # save return address
add a1,a0,a1 # a0 : given array address, compute the end address of array (a1)
li s1,0 # s1(n_eq): clean to zero
li s0,0 # s0(n_sma) : clean to zero
j .L5
.L20: # count n_sma
addi a0,a0,4 # next element's address
addi s0,s0,1 # n_sma ++
beq a1,a0,.L19 # check if the array ends
.L5:
lw a4,0(a0) # load test[i] (current element)
sub a5,a2,a4 # a5 = current element - target
seqz a5,a5 # if a5 = zero means (a4 is target), and set a5 = 1
bgt a2,a4,.L20 # target greater than current element, than branch
addi a0,a0,4 # next element's address
add s1,s1,a5 # n_eq ++ (since a5 is set to 1 if equal)
bne a1,a0,.L5 # check if the array ends
.L19:
sw s1,0(a3) # a3 :
slli a0,s1,2
call malloc
beq s1,zero,.L1
mv a5,s0 # move n_sma to a5
mv a4,a0 # copy return address
add s1,s1,s0 # compute ending index
.L8:
sw a5,0(a4) # a4 : address of return array
addi a5,a5,1 # index ++
addi a4,a4,4 # next address to put index
bne a5,s1,.L8
.L1:
lw ra,12(sp) # load back return address
lw s0,8(sp)
lw s1,4(sp)
addi sp,sp,16 # add back sp
jr ra # return to caller
.L2:
sw zero,0(a3)
li a0,0
tail malloc
.size targetIndices, .-targetIndices
.section .rodata.str1.4,"aMS",@progbits,1
.align 2 # align to power of 2 (align with 4 bytes)
```
### hand_written Assembly code version 1
```clike
.data
arr1: .word 55,44,33,22,11
arrSize1: .word 5
arr2: .byte 55,55,55,55,55,,
# ........
,55,55,55,55,55,55 // extended testcase size: 100
arrSize2: .byte 100
target: .word 55
str: .string "Indices of target: "
space: .string " "
```
```clike
.text
main:
# initailization
la t0 target # a0 : target
lw a0, 0(t0)
la t1,arrSize
lw a1, 0(t1) # a1 : arrSize
la a2, arr1 # a2 : base address of arr
addi s0, s0, 0 # s0 : n_sma
addi s1, s1, 0 # s0 : n_equ
jal ra loop
jal ra PrintIndices
loop:
lw t2, 0(a2) # arr[i]
beq t2, a0, n_eqpp
blt t2, a0, n_smapp
cont:
addi a2, a2, 4 # i+1
addi a1, a1, -1 #numSize--
bne a1, x0, loop
jalr x0, ra, 0
n_eqpp:
addi s1, s1, 1
j cont
n_smapp:
addi s0, s0, 1
j cont
PrintIndices:
la a0, str
li a7,4
ecall
add t0, s0, x0 # return value
pcont:
add a0, t0, x0
li a7, 1 # print integer
ecall
la a0, comma
li a7,4
ecall
addi t0, t0, 1
addi s1, s1, -1
bne s1, x0, pcont
Exit:
```
## Assembly optimization 【this part is still revising】
goal :
* create less assembly code
## Reference
* [gnu:gcc - Optimize-Options](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html)
* [你所不知道的 C 語言:編譯器和最佳化原理篇](https://hackmd.io/@sysprog/c-compiler-optimization?type=view#%E4%BD%A0%E6%89%80%E4%B8%8D%E7%9F%A5%E9%81%93%E7%9A%84-C-%E8%AA%9E%E8%A8%80%EF%BC%9A%E7%B7%A8%E8%AD%AF%E5%99%A8%E5%92%8C%E6%9C%80%E4%BD%B3%E5%8C%96%E5%8E%9F%E7%90%86%E7%AF%87)