# Assignment2: GNU Toolchain
contributed by < [`RayChen`](https://github.com/padaray) >
## Selected Question
I chose the `Find Leftmost 0-byte using CLZ` as my assignment from [陳川曜](https://hackmd.io/@cychen/computer_architecture_hw1).
The motivation is because I'm interested in his project, especially since my HW1 also involved the application of CLZ. I didn't realize there could be this kind of application, and I'd like to gain a deeper understanding of his code. I want to explore if there are ways to enhance its efficiency and contribute to his project.
## Analysis Origin Code
### Checking cycle counts from C code
#### 1. C code
```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
// count cycle
typedef uint64_t ticks;
static inline ticks getticks(void)
{ uint64_t result;
uint32_t l, h, h2;
asm volatile(
"rdcycleh %0\n"
"rdcycle %1\n"
"rdcycleh %2\n"
"sub %0, %0, %2\n"
"seqz %0, %0\n"
"sub %0, zero, %0\n"
"and %1, %1, %0\n"
: "=r"(h), "=r"(l), "=r"(h2));
result = (((uint64_t) h) << 32) | ((uint64_t) l);
return result;
}
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
uint16_t zbytel(uint64_t x) {
uint64_t y;
uint16_t n;
y = (x & 0x7F7F7F7F7F7F7F7F)+ 0x7F7F7F7F7F7F7F7F; // convert each 0-byte to 0x80
y = ~(y | x |0x7F7F7F7F7F7F7F7F); //and each nonzero byte to 0x00
n = count_leading_zeros(y) >> 3 ; // use number of leading zeros
return n; // n = 0 ... 8 , 8 if x has no 0-byte.
}
int main(int argc, char *argv[ ] ){
ticks t0 = getticks();
uint64_t a = 0x1122334455007700; //In this example,
uint16_t zbla = zbytel(a); //the value is 5.
uint64_t b = 0x1122334455667788; //Another example,
uint16_t zblb = zbytel(b); //the value is 8
ticks t1 = getticks();
printf("elapsed cycle: %d\n" , t1 - t0); //cycle number
printf("%llx\n",a);
printf("%d\n",zbla);
printf("%llx\n",b);
printf("%d\n",zblb);
return 0;
}
```
#### 2. Makefile for C code
Using the gcc command converting .c file -> .S file -> .o file -> .elf file.
```c
.PHONY: clean
include ../../../mk/toolchain.mk
CFLAGS = -march=rv32i -mabi=ilp32
ASFLAGS = -march=rv32i -mabi=ilp32
LDFLAGS = --oformat=elf32-littleriscv
%.S: %.c
$(CROSS_COMPILE)gcc $(CFLAGS) -o $@ -S $<
%.o: %.S
$(CROSS_COMPILE)as $(ASFLAGS) -o $@ $<
all: hugohw.elf
hugohw.S: hugohw.c
$(CROSS_COMPILE)gcc $(CFLAGS) -o $@ -S $<
hugohw.elf: hugohw.o
$(CROSS_COMPILE)gcc -o $@ $<
clean:
$(RM) hugohw.elf hugohw.o hugohw.S
```
#### 3. Output of cycle
The red underlined part represents the printed cycle count.
<s>

</s>
:::warning
:warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text.
:notes: jserv
:::
#### 4. Problem encountered during the process
<s>

</s>
:::warning
:warning: Don't put the screenshots which contain plain text only. Instead, utilize HackMD syntax to annotate the text.
:notes: jserv
:::
Solution: change the uppercase 'S' to lowercase in the makefile.

### Using optimization to optimize cycle count
#### 1. Add -On command(n represent 0~3 or fast)

#### 2. Print out cycle count
| Optimization level | Cycles count |
| ----------------- | ------------ |
| -O0 | 741 |
| -O1 | 204 |
| -O2 | 204 |
| -O3 | 0 |
| -Ofast | 0 |
Analysis: I think the cycle count is 0 because the optimizer considers the part where I calculate the cycle count as non-essential code, so it optimizes it out.
### Checking cycle counts from Assembly code
#### 1. Using get_cycles function
```c
.text
.globl get_cycles
.align 2
get_cycles:
csrr a1, cycleh
csrr a0, cycle
csrr a2, cycleh
bne a1, a2, get_cycles
ret
.size get_cycles,.-get_cycles
```
#### 2. Makefile for Assembly code
Using the assembler command converting .S file -> .o file
Using the linker command converting .o file -> .elf file.
```c
.PHONY: clean
include ../../../mk/toolchain.mk
CFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32
ASFLAGS = -march=rv32i_zicsr_zifencei -mabi=ilp32
LDFLAGS = --oformat=elf32-littleriscv
%.o: %.S
$(CROSS_COMPILE)as $(ASFLAGS) -o $@ $<
all: hugohw.elf
hugohw.elf: hugohw.o
$(CROSS_COMPILE)ld -o $@ -T hugohw.ld $(LDFLAGS) $<
clean:
$(RM) hugohw.elf hugohw.o hugohw.S
```
#### 3. rv32emu print out decimal
Although the code mentioned above can print the cycle count, but the rv32emu doesn't support the printf function. Therefore, we need to first store the cycle count's individual digits and convert them into ASCII codes to display it.
The code is display bellow:
```c
# a0: integer of cycle
# a1: number of bytes in buffer
print_ascii:
mv t0, a0 # load integer
li t1, 0 # t1 = quotient
li t2, 0 # t2 = reminder
li t3, 10 # t3 = divisor
mv t4, a1 # t4 = count round
check_less_then_ten:
bge t0, t3, divide
mv t2, t0
mv t0, t1 # t0 = quotient
j to_ascii
divide:
sub t0, t0, t3
addi t1, t1, 1
j check_less_then_ten
to_ascii:
addi t2, t2, 48 # reminder to ascii
la t5, buffer # t5 = buffer addr
addi t4, t4, -1
add t5, t5, t4
sb t2, 0(t5)
# counter = 0 exit
beqz t4, convert_loop_done
li t1, 0 # refresh quotient
j check_less_then_ten
convert_loop_done:
ret
```
#### 4. Output of cycle

#### 5. Problem encountered during the process
  **Problem 1:**
   The error said that I redefined **`_start`** argument
   Reason : No linking file specified for the linker.

  **Solution 1:**
   Add -T hugohw.ld in command.

</br>
  **Problem 2:**
   When executing the Makefile, the "csrr" instruction cannot be executed.
   Reason : Incorrect compiler architecture

  **Solution 2:**

</br>
### Rewrite and Optimize Assembly code