# Assignment2: RISC-V Toolchain > Selected subject: **Palindrome detection with counting leading zeros**. Contributed by [Ray Huang (黃柏叡)](https://github.com/coding-ray). Acknowledgment (unordered): * [jserv (黃敬群)](https://github.com/jserv): [rv32emu](https://github.com/sysprog21/rv32emu) project. * [padaray (陳浩文)](https://hackmd.io/@DpuBXDwLSDCIE6cNJSjUlg): [Report](https://hackmd.io/@DpuBXDwLSDCIE6cNJSjUlg/HyQ2i4nep) and [code](https://github.com/padaray/computer_architecture_hw1) for the topic "Implement palindrome detection and using CLZ." * [P76095018](https://hackmd.io/@wIVnCcUaTouAktrkMVLEMA): [Sample report](https://hackmd.io/@wIVnCcUaTouAktrkMVLEMA/SJEP_amvK) for this homework. * [kdnvt](https://github.com/kdnvt): [Report](https://hackmd.io/@kdnvt/2022-arch-hw1) related to program analysis and future optimizations. ## 1. Introduction ### 1.1. Palindrome number A palindrome number is a number that reads the same backward for forward. For example, $1356531$ and $9889$. With the similar idea, padaray and I defines a palindrome bit sequence (abbreviated as palindrome) as follows. Given a bit sequence without leading zero, saying $B$. If the reversed bit sequence of $B$, saying $B^*$, is the same as $B$, we call it a palindrome. For example, given $X = 00001011_2 = 1011_2$ (the subscript $2$ means binary representation,) the reversed version, $X^* = 1101_2$. Since $X \neq X^*$, we conclude that $X$ is not a palindrome. In contrast, number $00110011_2$ is a palindrome, for $110011_2 = 110011_2$. ### 1.2. Motivation The following pseudo-code can solve the problem in the intuitive way. 1. Given a number $X$ in binary. 1. Count the leading zeros of $X$ in binary by "bit-wise AND `&` from the most significant bit to the lease significant bit until some bit is set (1), and the number of bit-wise AND minus one is the number of leading zeros." A special case is that if $X = 0$, then the count is $32$ in 32-bit run-time environment. 1. For the rest of bits (from the leftmost set bit to the rightmost bit), reverse it by bit extraction for each bit, and storing the bit in the opposite direction in a new variable $X^*$. 1. If $X = X^*$, then $X$ is palindrome. Otherwise, it isn't. The 2nd step takes $O(n)$ to run, where $n$ is the number of bits in a given run-time environment. However, jserv (2023) optimized the 2nd step with [population count](https://www.chessprogramming.org/Population_Count) in the [1st quiz](https://hackmd.io/@sysprog/arch2023-quiz1-sol) of computer architecture course. So, it takes $O(\log n)$ after optimization. In a 64-bit environment, the 3rd step manipulates on 64 bits. But looking closely for the definition of palindrome given fixed number of maximal bits (64 in this case), padaray (2023) optimize this step by only reversing one half of the given number. That is, he optimized this step to have $n/2 = 32$ bits to manipulate. However, in this way, the 3rd step still takes $O(n)$ to run. As a result, I searched for reversing bits on the Internet, and found that [Douglas W. Jones](https://homepage.cs.uiowa.edu/~dwjones/compiler/notes/38.shtml) (2018), [sth and Ry-](https://stackoverflow.com/a/2602885) (2010, 2020) accomplish the reversing in $O(\log n). As a result, I picked this topic for optimization in this homework. ## 2. Set up environment Reference: [Lab 2 (rv32emu)](https://hackmd.io/@sysprog/SJAR5XMmi) provided by System Software Programming (2023). ### 2.1. Prepare GNU toolchain for RISC-V 1. Operating system (OS): Ubuntu 22.04.3 server installed in a VirtualBox (7.0.10) VM. :::warning I've tried to compile the sample programs of [rv32emu](https://github.com/sysprog21/rv32emu) in Debian 12.1.0, but all of the three tests result in segmentation fault. I will try to support [rv32emu](https://github.com/sysprog21/rv32emu) in Debian in the future. But I'm forced to stick on Ubuntu now. :( ::: 1. Check the latest version of on [its GitHub](https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/latest) 1. Copy the link to the tarball of architecture Linux and x64, and then download it with `wget` command. Do the same thing for the SHA256 checksum. ```bash wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v13.2.0-2/xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v13.2.0-2/xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz.sha ``` 1. Check if the checksum of the downloaded tarball matches the downloaded checksum. ```bash cat xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz.sha # output: 52545afb900200fbf65fe05f7ce7090b8a42c64091f4f5d43cae6bb68ea2434a xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz sha256sum xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz # output: 52545afb900200fbf65fe05f7ce7090b8a42c64091f4f5d43cae6bb68ea2434a xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz ``` 1. If the checksum matches, extract the tarball. ``` tar zxf xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz ``` 1. Create a version memo. ``` echo 13.2.0-2 > xpack-riscv-none-elf-gcc-13.2.0-2/version.txt ``` 1. Create an app directory under user's home, and move the toolchain to there. ``` mkdir -p ~/app mv xpack-riscv-none-elf-gcc-13.2.0-2 ~/app/riscv-none-elf-gcc ``` 1. Configure the `PATH` environment variable. ``` echo "export PATH=\"$HOME/app/riscv-none-elf-gcc/bin:\$PATH\"" >> ~/.bashrc source ~/.bashrc ``` 1. Check if the toolchain version. This should work if you set `PATH` properly. ```bash riscv-none-elf-gcc -v # expected output: # ... (omitted) # gcc version 13.2.0 (xPack GNU RISC-V Embedded GCC x86_64) ``` ### 2.2. Get and build [rv32emu](https://github.com/sysprog21/rv32emu) 1. Install the dependencies. ``` sudo apt update && sudo apt install -y libsdl2-dev libsdl2-mixer-dev make ``` 1. Get and build [rv32emu](https://github.com/sysprog21/rv32emu) from source. ```bash mkdir ~/ca git clone https://github.com/sysprog21/rv32emu ~/ca/rv32emu cd ~/ca/rv32emu make ``` 1. Validate [rv32emu](https://github.com/sysprog21/rv32emu). ```bash make check # expected output: # Running hello.elf ... [OK] # Running puzzle.elf ... [OK] # Running pi.elf ... [OK] ``` 1. Run `hello.elf` by rv32emu. ```bash build/rv32emu build/hello.elf # expected output: # Hello World! # Hello World! # Hello World! # Hello World! # Hello World! # inferior exit code 0 ``` ### 2.3. Demo of functionalities of GNU toolchain 1. [objdump](http://man7.org/linux/man-pages/man1/objdump.1.html) (`-d`: Display the assembler mnemonics for the machine instructions.) ```bash riscv-none-elf-objdump -d build/hello.elf ``` Expected output: ``` build/hello.elf: file format elf32-littleriscv Disassembly of section .text: 00000000 <_start>: 0: 00000293 li t0,0 4: 00500313 li t1,5 8: 0040006f j c <_start+0xc> c: 00000013 nop 00000010 <loop>: 10: 02628063 beq t0,t1,30 <end> 14: 04000893 li a7,64 18: 00100513 li a0,1 1c: 03c00593 li a1,60 20: 00d00613 li a2,13 24: 00000073 ecall 28: 00128293 add t0,t0,1 2c: fe5ff06f j 10 <loop> 00000030 <end>: 30: 05d00893 li a7,93 34: 00000513 li a0,0 38: 00000073 ecall ``` 1. [readelf](http://man7.org/linux/man-pages/man1/readelf.1.html) (`-h`: Display the ELF file header) ```bash riscv-none-elf-readelf -h build/hello.elf ``` Expected output: ``` ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x0 Start of program headers: 52 (bytes into file) Start of section headers: 4532 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 2 Size of section headers: 40 (bytes) Number of section headers: 7 Section header string table index: 6 ``` 1. [size](http://man7.org/linux/man-pages/man1/size.1.html) (List the section sizes—and the total size—for each of the object or archive files objfile in its argument list.) ``` riscv-none-elf-size build/hello.elf ``` Expected output: ``` text data bss dec hex filename 73 0 0 73 49 build/hello.elf ``` 1. Check what preprocessor defines does RISC-V extension `zicsr` add. The result is `__riscv_zicsr`. ```sh echo | riscv-none-elf-gcc -march=rv32i -E -dM - > tmp-oritinal.txt echo | riscv-none-elf-gcc -march=rv32i_zicsr -E -dM - > tmp-csr.txt diff tmp-oritinal.txt tmp-csr.txt # result: # 232a233 # > #define __riscv_zicsr 2000000 ``` Since if I want to get the results of "how many instruction is retrieved" and "how many Control and Status Register (CSR) cycles are now," I have to enable the extension `_zicsr`. And the final compiling target is required to be without `_zicsr`. To share the code in the two cases above, I have to detect if `__riscv_zicsr` is defined in C with the following illustration. (Note: `.ifdef` doesn't work in assembly.) ```c #ifdef __riscv_zicsr extern uint64_t get_instret(); #endif // __riscv_zicsr ``` 1. Do the same trick as the above for the extension `m`. The result comes with multiple defines, and I will consider only `__riscv_m` in my code. ```bash echo | riscv-none-elf-gcc -march=rv32i -E -dM - > tmp-oritinal.txt echo | riscv-none-elf-gcc -march=rv32im -E -dM - > tmp-muldiv.txt diff tmp-oritinal.txt tmp-muldiv.txt # result: # 126a127 # > #define __riscv_mul 1 # 152a154 # > #define __riscv_muldiv 1 # 212a215 # > #define __riscv_m 2000000 # 303a307 # > #define __riscv_div 1 ``` If I compile function calls of `printf` (in standard library `<stdio.h>`) to assembly, I will loose the ability to link the object file in which implements `printf`. As a result, I utilize the ecall to `write` (with similar definition of the [write](https://man7.org/linux/man-pages/man2/write.2.html) command in Linux.) To convert integers to characters arrays, I need the support of modulo and division from the `m` extension for convenience, although modulo and division can be implemented by integer operations. ## 3. Analysis and Improvement You can get the complete code from the repository [coding-ray/2023-ca-hw-2](https://github.com/coding-ray/2023-ca-hw-2/) on GitHub. ### 3.1. V1: baseline The code of this version is in the directory [archived/v1-base](https://github.com/coding-ray/2023-ca-hw-2/tree/main/archived/v1-base). This version is the baseline version, without any optimization from me. In this version, I refactored the code from [padaray's work](https://github.com/padaray/computer_architecture_hw1). The primary principle is that after my refactoring, the algorithm should be the same as the original work, including the instructions. In addition, I did the following refactoring: 1. Convert code in C++ to C. (padaray didn't use any C++ features in his code, though.) ([c908804](https://github.com/coding-ray/2023-ca-hw-2/commit/c908804)) 1. Eliminate errors and warnings from `gcc` with flags `-Wall -Wextra`. ([c908804](https://github.com/coding-ray/2023-ca-hw-2/commit/c908804)) 1. Format code with VS Code [C/C++ extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode.cpptools). ([10ef438](https://github.com/coding-ray/2023-ca-hw-2/commit/10ef438)) 1. Rename variables in snake case. ([d53f444](https://github.com/coding-ray/2023-ca-hw-2/commit/d53f444)) 1. Rename function `palindrome_detected` with `is_palindrome`. The latter one is in the format "verb_noun." ([d53f444](https://github.com/coding-ray/2023-ca-hw-2/commit/d53f444)) 1. Simplify the calling of function `is_palindrome`. Specifically, move the function call of `count_leading_zeros` to the first line of `is_palindrome`. ([d53f444](https://github.com/coding-ray/2023-ca-hw-2/commit/d53f444)) 1. Clean redundant brackets (), braces {} and one-line /* */ comments. ([d53f444](https://github.com/coding-ray/2023-ca-hw-2/commit/d53f444)) 1. Compress code with `x <op>= y` (from `x = x <op> y`). ([d53f444](https://github.com/coding-ray/2023-ca-hw-2/commit/d53f444)) 1. Move source code to `src/`. ([4d17e98](https://github.com/coding-ray/2023-ca-hw-2/commit/4d17e98)) 1. Customize C and assembly for rv32emu. ([e9c2284](https://github.com/coding-ray/2023-ca-hw-2/commit/e9c2284)) 1. Count CSR cycles and instructions retrieved. ([156ce2d](https://github.com/coding-ray/2023-ca-hw-2/commit/156ce2d)) ### 3.2. V2: O(log n) algorithm of reversing 64 bits The code of this version is in the directory [archived/v2-optimized-reversing](https://github.com/coding-ray/2023-ca-hw-2/tree/main/archived/v2-optimized-reversing). The code to reverse 64 bits is as follows before optimization. If the number of bits is $n$, this algorithm has time complicity $\Theta(n)$. ```c= uint64_t tmp_a = a; uint64_t reverse_a = 0; for (int i = 0; i < 64; i++) { reverse_a <<= 1; reverse_a |= tmp_a & 1; tmp_a >>= 1; } reverse_a >>= n_left_shift; ``` In this version, I apply the algorithm of reversing bits in place from [Douglas W. Jones](https://homepage.cs.uiowa.edu/~dwjones/compiler/notes/38.shtml) (2018), [sth and Ry-](https://stackoverflow.com/a/2602885) (2010, 2020). To reverse 16 bits in place, [Douglas W. Jones](https://homepage.cs.uiowa.edu/~dwjones/compiler/notes/38.shtml) (2018) wrote the following code in C. ```c= unsigned int reverse( unsigned int x ){ x = ((x >> 1) & 0x55555555) + ((x & 0x55555555) << 1); x = ((x >> 2) & 0x33333333) + ((x & 0x33333333) << 2); x = ((x >> 4) & 0x0F0F0F0F) + ((x & 0x0F0F0F0F) << 4); x = ((x >> 8) & 0x00FF00FF) + ((x & 0x00FF00FF) << 8); x = ((x >> 16) ) + ((x ) << 16); return x; } ``` The basic idea is as follows. 1. In line 2: 1. The first right shifting `>>` and AND `&` operations pick up bits with interleaving of one bit, from the most-significant bit of `x`. And then shift the picked bits right by one bit. That is, saying that `x` has bits 31, 30, 29, $\ldots$, 1, 0, from the most significant bit, the operations before the adding `+` operator takes bits 31, 29, 27, $\ldots$, 3, 1 of `x`, and put them to positions of bits 30, 28, 26, $\ldots$, 2, 0. 1. The operations after the adding `+` operator pick up the rest of bits, and shift them left by one bit. That is, the bits 30, 28, $\ldots$, 2, 0 of `x` are moved to the positions of bits 31, 29, $\ldots$, 3, 1. 1. The adding `+` operation is actually OR `|` operation for the two operands. So after this operation, the bits in the pair of two bits are swapped. Bits `abcd` are `badc` afterwards. 1. In line 3, two adjacent bits are grouped together, and two adjacent groups are swapped. So, bits `badc` become `dcba` later. 1. Similarly, in line 3, four adjacent bits are grouped together. 1. In line 6, it is trivial to group (mask) 16 bits together, for with and without masking `0x0000FFFF`, the result is the same. With the same idea, I wrote the following code. ```c= // file header #define INTERLEAVE_BY_01 0x5555555555555555 #define INTERLEAVE_BY_02 0x3333333333333333 #define INTERLEAVE_BY_04 0x0F0F0F0F0F0F0F0F #define INTERLEAVE_BY_08 0x00FF00FF00FF00FF #define INTERLEAVE_BY_16 0x0000FFFF0000FFFF // function body with `r` defined previously r = ((r >> 1) & INTERLEAVE_BY_01) + ((r & INTERLEAVE_BY_01) << 1); r = ((r >> 2) & INTERLEAVE_BY_02) + ((r & INTERLEAVE_BY_02) << 2); r = ((r >> 4) & INTERLEAVE_BY_04) + ((r & INTERLEAVE_BY_04) << 4); r = ((r >> 8) & INTERLEAVE_BY_08) + ((r & INTERLEAVE_BY_08) << 8); r = ((r >> 16) & INTERLEAVE_BY_16) + ((r & INTERLEAVE_BY_16) << 16); r = (r >> 32) + (r << 32); // finish reversing ``` With this optimization, the algorithm improves time complicity form $\Theta(n)$ to $\Theta(\log n)$, where $n$ is the number of bits. With different optimization levels of `gcc`, the comparison between versions 1 and 2 is as follows. Performance evaluation (number of CSR cycles) of the four test cases. These four tests are the same across all versions in the "archived" directory. These four tests are inputting 0, 1, 0x00000C0000000003 and 0x0F000000000000F0 to the `is_palindrome` function, respectively. And across the versions, all the results are verified before any evaluation. In the following table, for the cells that contain two numbers, the first number is the result from v1, and the latter is from v2. Since optimization level of `-Ofast` and `-O3` yield the same result for all tests in all versions, the former one is not included in the tables in this report. | Opt. | # of inst. ret. | Case 1 | Case 2 | Case 3 | Case 4 | | ----- | --------------- | -------- | -------- | ------ | -------- | | `-O0` | 720/727 | 3674/756 | 3653/735 | 0/0 | 3769/851 | | `-O1` | 734/734 | 923/235 | 0/0 | 0/0 | 0/0 | | `-O2` | 734/734 | 920/235 | 0/0 | 0/0 | 0/0 | | `-O3` | 730/734 | 894/223 | 0/0 | 0/0 | 0/0 | | `-Os` | 784/734 | 131/260 | 0/129 | 0/0 | 0/0 | Program size evaluation is as follows. The numbers are all the number of bytes of the entire program in decimal. | Opt. | V1 size | V2 size | V2/V1 Ratio | | ----- | ------- | ------- | ----------- | | `-O0` | 10428 | 11384 | 109% | | `-O1` | 8796 | 9068 | 103% | | `-O2` | 8792 | 9064 | 103% | | `-O3` | 9124 | 9396 | 103% | | `-Os` | 8776 | 9032 | 103% | ### 3.3. V3: Replace function printf with write The code of this version is in the directory [archived/v3-write](https://github.com/coding-ray/2023-ca-hw-2/tree/main/archived/v3-write). In this version, I replace function calls of `printf` with `write`, for the sake of linking error for the object file that contains `printf` when I want to link my object files. In detail, I didn't successfully link `/usr/lib/x86_64-linux-gnu/libc.a` or `/usr/lib/x86_64-linux-gnu/libc.so`. (I am glad of it.) Moreover, in this version, I made the following features: 1. Get rid of all C standard library. 1. C code is compile to assembly files instead of compiling to object files directly. To make the `main` function work with rv32emu, replace `jr ra` or `ret` in the end of the main function with the following assembly ``` li a7, 93 # "exit" syscall add a0, x0, 0 # Use 0 return code ecall # invoke syscall to terminate the program ``` 1. To properly call `write` function in rv32emu, in assembly files, replace `call write` with `li a7, 64` and `ecall`. 1. Keep built assembly files, and add them to index. 1. Programmers can start from the saved assembly files, and compile them to object files. 1. In the performance test, in assembly, replace `call __udivdi3` with `divu a0,a0,a2`, and `call __umoddi3` with `remu a0,a0,a2`. The results of this version show that given almost the similar printed messages to the stdout, the size of the ELF in this version is much smaller than that in v2. In the following table, for the cells that have two numbers, the first one the the result of v2, and the latter is of v3. With the optimization level `-Os`, the compiler emits errors for the undefined reference to `memcpy`, `__ashldi3`, `__lshrdi3`. However, I didn't even call these functions in C, so I just get rid of the result for this optimization level. | Opt. | # of inst. ret. | Case 1 | Case 2 | Case 3 | Case 4 | | ----- | --------------- | -------- | -------- | ------ | -------- | | `-O0` | 727/34 | 756/756 | 735/735 | 0/0 | 851/752 | | `-O1` | 734/38 | 235/235 | 0/0 | 0/0 | 0/0 | | `-O2` | 734/37 | 235/235 | 0/0 | 0/0 | 0/0 | | `-O3` | 734/42 | 223/223 | 0/0 | 0/0 | 0/0 | Program size evaluation is as follows. | Opt. | V2 size | V3 size | V3/V2 ratio | | ----- | ------- | ------- | ----------- | | `-O0` | 11384 | 3324 | 29% | | `-O1` | 9068 | 1012 | 11% | | `-O2` | 9064 | 1008 | 11% | | `-O3` | 9396 | 1340 | 14% | ## References 1. System Software Programming([@sysprog](https://hackmd.io/@sysprog)). [Assignment 2 requirements](https://hackmd.io/@sysprog/2023-arch-homework2). [Online]. [Accessed 2023-10-24]. 1. [philant](https://stackoverflow.com/users/18804/philant) & [proski](https://stackoverflow.com/users/2962407/proski). (2010-02-08, 2022-01-27). [GCC dump preprocessor defines](https://stackoverflow.com/a/2224357). [Online]. [Accessed 2023-10-29]. 1. [Akshay](https://stackoverflow.com/users/8198351/akshay). (2019-09-18). [Use of ifdef in gas assembly language](https://stackoverflow.com/q/46272497). [Online]. [Accessed 2023-10-29].