--- tags: HW --- # Computer Architecture HW02 contributed by <[`Paintakotako`](https://github.com/Paintako)> ## Rewrite [Generate bitmask by CLZ](https://github.com/yuchen0620/Computer-Architecture/tree/main/HW1) The reason I chose to [generate a bitmask using CLZ](https://github.com/yuchen0620/Computer-Architecture/tree/main/HW1) from [周育晨](https://hackmd.io/@8G9q08Y6Tnq9OJMgzCV1TA/By_CTTYgT) is because bitmasks are one of the most cryptic elements that appear in Linux source code and other projects. They are not intuitive. Therefore, I believe that studying how CLZ is used to create bitmasks can provide me with a deeper understanding of code related to bitmasks. ## Rewrite code The following is the origin C code ```c #include <stdio.h> #include <stdint.h> uint16_t count_leading_zeros(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555 ); x = ((x >> 2) & 0x33333333) + (x & 0x33333333 ); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } uint32_t generate_bitmask(uint32_t x) { uint16_t leading_zeros = count_leading_zeros(x); if (leading_zeros == 0) return 0xffffffff; return (1 << (32 - leading_zeros)) - 1 ; } int main() { uint32_t test_data[] = {0, 4, 0x80000000}; for (int i = 0; i < 3; i++){ printf("The reslut of %x is %x\n", test_data[i],generate_bitmask(test_data[i])); } } ``` Performance: ![](https://hackmd.io/_uploads/HJCx5u5f6.png) * Execution info * Cycles: 306 * Instrs. retired: 230 * CPI: 1.33 * IPC: 0.752 * Clock rate: 0 Hz The following part ```c return (1 << (32 - leading_zeros)) - 1); ``` Can be rewritten as: ```c return 0xffffffff >> leading_zeros; ``` The following segment is the modified assembly code portion ```diff - # return (1 << (32 - leading_zeros)) - 1; - li t0, 32 - sub t0, t0, a0 - li t1, 1 - sll a0, t1, t0 - addi a0, a0, -1 + # return 0xffffffff >> leading_zeros; + li t0, 0xffffffff + mv t1, a0 # save result of leading zeros into t1 + srl a0, t0, t1 ``` This can save a few clock cycles, performance after rewriting: * Execution info * Cycles: 300 * Instrs. retired: 224 * CPI: 1.34 * IPC: 0.747 * Clock rate: 0 Hz ## rv32emu Terminology: * [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) * [Object files](https://en.wikipedia.org/wiki/Object_file) ### Compiling using RISC-V gcc ```bash $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -s -o bitmask.s ./bitmask.c $ ./rv32emu bitmask.s ``` ### Using GNU Toolchain [objdump](https://man7.org/linux/man-pages/man1/objdump.1.html) ```bash $ riscv-none-elf-objdump -d bitmask.s ``` ```bash bitmask.s: file format elf32-littleriscv Disassembly of section .text: 00010094 <.text>: 10094: ff010113 add sp,sp,-16 10098: 00000593 li a1,0 1009c: 00812423 sw s0,8(sp) 100a0: 00112623 sw ra,12(sp) 100a4: 00050413 mv s0,a0 100a8: 761000ef jal 0x11008 100ac: f4c1a783 lw a5,-180(gp) 100b0: 00078463 beqz a5,0x100b8 100b4: 000780e7 jalr a5 100b8: 00040513 mv a0,s0 ... ``` [readelf](https://man7.org/linux/man-pages/man1/readelf.1.html) ```bash $ riscv-none-elf-readelf -h ./bitmask.s ``` ```bash ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100d8 Start of program headers: 52 (bytes into file) Start of section headers: 80344 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 13 Section header string table index: 12 ``` ### Running given ELF file with rv32emu ```bash $ ./rv32emu bitmask.s The reslut of 4 is 7 The reslut of 80000000 is ffffffff inferior exit code 0 ``` ## Contrast the handwritten and compiler-optimized assembly listings Before we start analyzing our code, I've decided to remove any redundant test data so that the disassembled code won't have any extra parts. ```c #include <stdio.h> #include <stdint.h> uint16_t count_leading_zeros(uint32_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555 ); x = ((x >> 2) & 0x33333333) + (x & 0x33333333 ); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x3f)); } uint32_t generate_bitmask(uint32_t x) { uint16_t leading_zeros = count_leading_zeros(x); if (leading_zeros == 0) return 0xffffffff; return (1 << (32 - leading_zeros)) - 1 ; } int main() { printf("%x\n", generate_bitmask(4); } ``` From [gcc man page](https://linux.die.net/man/1/gcc), there's 8 different valid -O options you can give to gcc * -O (Same as -O1) * -O0 (do no optimization, the default if no optimization level is specified) * -O1 (optimize minimally) * -O2 (optimize more) * -O3 (optimize even more) * -Ofast (optimize very aggressively to the point of breaking standard compliance) * -Os (Optimize for size) * -Og (Optimize debugging experience) ### Check differenct optimization strategy result * Check all 8 optimization stats by `riscv-none-elf-size` * It's challenging to deduce from this result what optimizations the compiler has implemented for different strategies. | | -O0 | -O1 | -O2 | -O3 | -Os | -Ofast | | ---- | ----- | ----- | ----- | ----- | ----- | ------ | | text | 75776 | 75464 | 75564 | 75564 | 75448 | 75564 | | data | 2320 | 2320 | 2320 | 2320 | 2320 | 2320 | | bss | 1528 | 1528 | 1528 | 1528 | 1528 | 1528 | :::warning You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution. :notes: jserv ::: * From the table above, it can be observed that the '-Os' option indeed optimized the file size. ### RDCYCLE/RDCYCLEH instruction - Reading the [RISC-V Assembly Programmer’s Manual](https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md#pseudoinstructions-for-accessing-control-and-status-registers), you can access the clock cycles using "rdcycle[h] rd" or you can customize a register, incrementing it by 1 for each executed instruction. However, the trade-off is that every instruction will require addition or subtraction operations on that register, which will impact the total number of clock cycles. - So, through the use of Pseudoinstructions, it's possible to access CSR (Control and Status Register) and reduce the burden on the programmer, such as the need to add "addi" instructions after each command to calculate clock cycles. We can define a function that uses the RISC-V provided Pseudoinstruction "rdcycle[h] rd" to read the clock cycle count. We just need to include this function before and after the execution and calculate the difference to obtain the number of clock cycles executed. Because RISC-V provides assembly code (Pseudoinstructions), if you want to write assembly code in a C program, you can refer to the [GCC manual](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Remarks-1) and use inline assembly. ```c uint64_t read_cycles() { unsigned long cycles; asm volatile ("rdcycle %0" : "=r" (cycles)); return cycles; } ... int main() { uint32_t test_data[] = {4}; int before_cycle = read_cycles(); for (int i = 0; i < 1; i++){ printf("The reslut of %x is %x\n", test_data[i],generate_bitmask(test_data[i])); } int after_cycle = read_cycles(); printf("elapsed cycle: %d\n", after_cycle - before_cycle); } ``` We can now add clock cycle measurements to compare the level of optimization in the results of different optimization strategies. | | -O0 | -O1 | -O2 | -O3 | -Os | -Ofast | | ---- | ----- | ----- | ----- | ----- | ----- | ------ | | text | 75776 | 75464 | 75564 | 75564 | 75448 | 75564 | | data | 2320 | 2320 | 2320 | 2320 | 2320 | 2320 | | bss | 1528 | 1528 | 1528 | 1528 | 1528 | 1528 | | clock cycles| 2464 | 2375 | 2335 | 2335 | 2335 | 2335 | By observing the table above, it was unexpectedly found that the -Os option not only optimizes for space but also achieves the best number of cycles.