# Assignment2: GNU Toolchain > contribute by [tinhanho](https://github.com/tinhanho) ## Motivation I choose Assginment1 "Find the position MSB by CLZ" done by [SUE3k](https://github.com/SUE3K/computer_architecture_hw1). The motivation why I choose this topic is that I want to select the topic within the algorithm counting leading zeros. I regard that I had done couting leading zeros in assignment 1. Therefore, I want to see how other student implement this algorithm and how they use this algorithm to help the work which they want to do. When I browse assignment 1, I find out there is an assignment existing a small mistake in C code, so I slightly revise it into right one. It becomes the motivation for me as well. ## Rewrite [Find the position of MSB by CLZ](https://hackmd.io/8aelofIcRh6TtJi86jWYow?view) ### Correct the C Program First of all, there is an obvious mistake. When the algorithm "couting leading zeros" comes to 32bits, we need to beware the hexadecimal number in the algorithm. That is, the original code following is false. ```c13 // Following code needs to be modified. x -= ((x >> 1) & 0x5555555555555555); x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; ``` Therefore, we modify it into correct algorithm. ```c14 x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; ``` The test_data declartion and printf function parts need to be modified as well. ```c uint32_t test_data[] = {0x00000011, 0x00001101, 0x00010011}; .. .. .. printf("Input: 0x%08x\n", test_data[i]); ``` After modifying the code, we run the code and get the output. <s> ![](https://hackmd.io/_uploads/H1AjygHza.png) </s> :::warning Don't put the screenshots which contain plain text only. You shall provide the informative descriptions. Improve your English writing. :notes: jserv ::: This output is what we expect for 32-bits input. ## rv32emu ### Setup PATH ``` Configure $PATH $ cd $HOME/riscv-none-elf-gcc $ echo "export PATH=`pwd`/bin:$PATH" > setenv Once step (1) and (2) are complete, you can simply update $PATH environment variable via: $ cd $HOME $ source riscv-none-elf-gcc/setenv ``` ### Performance Measurement #### 2 Ways to Measure the Performance 1.[ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c#L6) In order to analyze the elapsed cycle in the program, we take [ticks.c](https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c#L6) as reference. Add some code from the beginning and the end of the code. ```c /****************************************************************** Modified C code in order to get the cycles Use ticks.c (https://github.com/sysprog21/rv32emu/blob/master/tests/ticks.c#L6) ******************************************************************/ #include <stdio.h> #include <stdint.h> #include <inttypes.h> typedef uint64_t ticks; static inline ticks getticks(void) { uint64_t result; uint32_t l, h, h2; asm volatile( "rdcycleh %0\n" "rdcycle %1\n" "rdcycleh %2\n" "sub %0, %0, %2\n" "seqz %0, %0\n" "sub %0, zero, %0\n" "and %1, %1, %0\n" : "=r"(h), "=r"(l), "=r"(h2)); result = (((uint64_t) h) << 32) | ((uint64_t) l); return result; } uint16_t count_leading_zeros(uint64_t x){ x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (32 - (x & 0x7f)); } int main(){ ticks t0 = getticks(); uint32_t test_data[] = {0x00000011, 0x00001101, 0x00010011}; for (int i = 0; i < sizeof(test_data) / sizeof(test_data[0]); i++){ uint32_t clz = count_leading_zeros(test_data[i]); if (clz < 32){ uint32_t msb = (uint32_t)(31 - clz); printf("Test Data %d:\n", i + 1); printf("Input: 0x%08x\n", test_data[i]); printf("Leading Zeros: %u\n", clz); printf("MSB: %u\n", msb); } else{ printf("Test Data %d: Invalid input, cannot calculate MSB.\n", i + 1); } printf("\n"); } ticks t1 = getticks(); printf("elapsed cycle: %" PRIu64 "\n", t1 - t0); return 0; } ``` Run following instructions, ```c $ riscv-none-elf-gcc -S find_msb.c -Ofast -o0 find_msb_o0.s $ riscv-none-elf-gcc find_msb_o0.s -o0 find_msb_o0 build/rv32emu find_msb_o0 //run it ``` <s> ![](https://hackmd.io/_uploads/rkQrlfBfT.png) </s> :::warning Don't put the screenshots which contain plain text only. :notes: jserv ::: 2.[getcycles.s](https://github.com/sysprog21/rv32emu/blob/master/tests/perfcounter/getcycles.S) Another way is to use extern function and link the 2 object code generated by find_msb.s and getcycles.s. Here is the code, ```c #include <stdio.h> #include <stdint.h> #include <inttypes.h> extern uint64_t get_cycles(); uint16_t count_leading_zeros(uint64_t x){ x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); x += (x >> 32); return (32 - (x & 0x7f)); } int main(){ uint64_t oldcount = get_cycles(); uint32_t test_data[] = {0x00000011, 0x00001101, 0x00010011}; for (int i = 0; i < sizeof(test_data) / sizeof(test_data[0]); i++){ uint32_t clz = count_leading_zeros(test_data[i]); if (clz < 32){ uint32_t msb = (uint32_t)(31 - clz); printf("Test Data %d:\n", i + 1); printf("Input: 0x%08lx\n", test_data[i]); printf("Leading Zeros: %lu\n", clz); printf("MSB: %lu\n", msb); } else{ printf("Test Data %d: Invalid input, cannot calculate MSB.\n", i + 1); } printf("\n"); } uint64_t cyclecount = get_cycles() - oldcount; printf("cycle count: %u\n", (unsigned int) cyclecount); return 0; } ``` Use the instructions below, ```c $riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -O0 -Wall -S -o find_msb.s find_msb.c //make c code into assembly $riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -O0 -Wall -c -o getcycles.o getcycles.s $riscv-none-elf-gcc -march=rv32i_zicsr_zifencei -mabi=ilp32 -O0 -Wall -c -o find_msb.o find_msb.s //make assembly into object $riscv-none-elf-gcc -o find_msb_o0 find_msb.o getcycles.o //link 2 object code $build/rv32emu find_msb_o0 //run it ``` ![](https://hackmd.io/_uploads/HJ96WGIM6.png) The second way will measure more cycle. Not sure why by now and still thinking. However, what we can ensure is that the total cycles is about 18K. Furthermore, using second way is more efficient to compare handwritten code to original one, so the following part is all taking the second way. ## Different Optimization Level Measurement | Optimization Level | Cycles Number | | -- | -- | |-O0|17915| |-O1|17184| |-O2|17180| |-O3|17096| |-Ofast|17096| |-Os|17183| |handwritten with -O1|**17087**| We choose the Level 1 optimization level to do the handwritten work and use several tricks. ### 1. Loop unrolling ```mips count_leading_zeros: ... ... .L4: li a0,10 call putchar addi s1,s1,1 addi s2,s2,4 beq s1,s6,.L8 # leave loop .L5: lw s3,0(s2) # load test_data mv a0,s3 li a1,0 call count_leading_zeros ... ... j .L4 ``` In the original code, we can see that L5 load test_data and call count_leading_zeros function. Therefore, it is easy to find out that L4 is the Loop counter and indicate test_data array index. ```mips .L5: # loop 1 li s1,1 lw s3,0(s2) mv a0,s3 li a1,0 call count_leading_zeros mv s0,a0 bgtu a0,s4,.L3 mv a1,s1 addi a0,s10,%lo(.LC0) call printf mv a1,s3 addi a0,s9,%lo(.LC1) call printf mv a1,s0 addi a0,s8,%lo(.LC2) call printf sub a1,s4,s0 addi a0,s7,%lo(.LC3) call printf li a0,10 call putchar addi s1,s1,1 # loop 2 lw s3,4(s2) mv a0,s3 li a1,0 call count_leading_zeros mv s0,a0 bgtu a0,s4,.L3 mv a1,s1 addi a0,s10,%lo(.LC0) call printf mv a1,s3 addi a0,s9,%lo(.LC1) call printf mv a1,s0 addi a0,s8,%lo(.LC2) call printf sub a1,s4,s0 addi a0,s7,%lo(.LC3) call printf li a0,10 call putchar addi s1,s1,1 # loop 3 lw s3,8(s2) mv a0,s3 li a1,0 call count_leading_zeros mv s0,a0 bgtu a0,s4,.L3 mv a1,s1 addi a0,s10,%lo(.LC0) call printf mv a1,s3 addi a0,s9,%lo(.LC1) call printf mv a1,s0 addi a0,s8,%lo(.LC2) call printf sub a1,s4,s0 addi a0,s7,%lo(.LC3) call printf li a0,10 call putchar addi s1,s1,1 L8. ... ... ``` Therefore, we clear the L4, unrolling the loop and put the L8 behind. ### 2. Speed up the function count_leading_zeros In the original code the function count_leading_zeros is not efficient. ```mips count_leading_zeros: slli a4,a1,31 srli a5,a0,1 or a5,a4,a5 srli a4,a1,1 or a0,a5,a0 or a1,a4,a1 slli a4,a1,30 srli a5,a0,2 or a5,a4,a5 srli a2,a1,2 or a0,a5,a0 or a2,a2,a1 slli a4,a2,28 srli a5,a0,4 or a5,a4,a5 srli a3,a2,4 or a4,a5,a0 or a3,a3,a2 slli a2,a3,24 srli a5,a4,8 or a5,a2,a5 srli a2,a3,8 or a5,a5,a4 or a2,a2,a3 slli a3,a2,16 srli a4,a5,16 or a4,a3,a4 srli a3,a2,16 or a4,a4,a5 or a3,a3,a2 or a4,a3,a4 srli a5,a4,1 li a2,1431654400 addi a2,a2,1365 and a5,a5,a2 sub a5,a4,a5 sgtu a4,a5,a4 sub a3,a3,a4 slli a3,a3,30 srli a4,a5,2 or a4,a3,a4 li a3,858992640 addi a3,a3,819 and a4,a4,a3 and a5,a5,a3 add a5,a4,a5 sltu a4,a5,a4 slli a4,a4,28 srli a3,a5,4 or a3,a4,a3 add a3,a3,a5 li a5,252645376 addi a5,a5,-241 and a3,a3,a5 srli a4,a3,8 add a4,a4,a3 srli a5,a4,16 add a5,a5,a4 andi a5,a5,127 li a0,32 sub a0,a0,a5 slli a0,a0,16 srli a0,a0,16 ret ``` It takes too many lines so I decide to use the function I write in [Assignment 1](https://github.com/tinhanho/Computer-Architecture-Assignment/blob/main/Assignment%201/OptimizingStoragewithCountLeadingZero.s). We can observe that a0 is the input and the output register at the same time. As a result, what we need is that simply modify the s0 register to a0 and it can be used now. ```mips count_leading_zeros: srli t0, a0, 1 # t0 = x>>1 or a0, a0, t0 # x |= (x >> 1); srli t0, a0, 2 # x>>2 or a0, a0, t0 srli t0, a0, 4 # x>>4 or a0, a0, t0 srli t0, a0, 8 # x>>8 or a0, a0, t0 srli t0, a0, 16 # x>>16 or a0, a0, t0 li t5, 0x55555555 srli t0, a0, 1 # x>>1 and t0, t0, t5 # (x >> 1) & 0x55555555 sub a0, a0, t0 li t5, 0x33333333 srli t0, a0, 2 # x>>2 and t0, t0, t5 # (x >> 2) & 0x33333333 and t1, a0, t5 # x & 0x33333333 add a0, t0, t1 li t5, 0x0f0f0f0f srli t0, a0, 4 # x>>4 add t0, a0, t0 and a0, t0, t5 srli t0, a0, 8 # x>>8 add a0, a0, t0 srli t0, a0, 16 # x>>16 add a0, a0, t0 li t5, 0x0000007f and a0, a0, t5 sub a0, x0, a0 addi a0, a0, 32 ret ``` It works and reduces some cycles. After two steps, we can measure the handwritten code cycles and it performs better. ![](https://hackmd.io/_uploads/Hk2l9Fwf6.png) ## Full code See [Assignment 2](https://github.com/tinhanho/Computer-Architecture-Assignment/tree/main/Assignment%202). [find_msb_ticks.c](https://github.com/tinhanho/Computer-Architecture-Assignment/blob/main/Assignment%202/find_msb_ticks.c) is the code that uses ticks.c to measure cycles. [find_msb.c](https://github.com/tinhanho/Computer-Architecture-Assignment/blob/main/Assignment%202/find_msb.c) is the code that uses getcycles.s and we mainly use to measure the performance. [find_msb.s](https://github.com/tinhanho/Computer-Architecture-Assignment/blob/main/Assignment%202/find_msb.s) is generated by rv32emu -O1. [find_msb_opt.s](https://github.com/tinhanho/Computer-Architecture-Assignment/blob/main/Assignment%202/find_msb_opt.s) is the handwritten code that reduces some cycles. [getcycles.s](https://github.com/tinhanho/Computer-Architecture-Assignment/blob/main/Assignment%202/getcycles.s) is the assembly code to generate object code and link together.