Assignment2: RISC-V Toolchain

Selected subject: Palindrome detection with counting leading zeros.

Acknowledgment (unordered):

jserv (黃敬群): rv32emu project.
padaray (陳浩文): Report and code for the topic "Implement palindrome detection and using CLZ."
P76095018: Sample report for this homework.
kdnvt: Report related to program analysis and future optimizations.

1. Introduction

1.1. Palindrome number

A palindrome number is a number that reads the same backward for forward. For example,

1356531

and

9889

With the similar idea, padaray and I defines a palindrome bit sequence (abbreviated as palindrome) as follows. Given a bit sequence without leading zero, saying

B

. If the reversed bit sequence of

B

, saying

B^{*}

, is the same as

B

, we call it a palindrome.

For example, given

X = 00001011_{2} = 1011_{2}

(the subscript

2

means binary representation,) the reversed version,

X^{*} = 1101_{2}

. Since

X \neq X^{*}

, we conclude that

X

is not a palindrome.

In contrast, number

00110011_{2}

is a palindrome, for

110011_{2} = 110011_{2}

1.2. Motivation

The following pseudo-code can solve the problem in the intuitive way.

Given a number
$X$ in binary.
Count the leading zeros of
$X$ in binary by "bit-wise AND & from the most significant bit to the lease significant bit until some bit is set (1), and the number of bit-wise AND minus one is the number of leading zeros." A special case is that if
$X = 0$ , then the count is
$32$ in 32-bit run-time environment.
For the rest of bits (from the leftmost set bit to the rightmost bit), reverse it by bit extraction for each bit, and storing the bit in the opposite direction in a new variable
$X^{*}$ .
If
$X = X^{*}$ , then
$X$ is palindrome. Otherwise, it isn't.

The 2nd step takes

O (n)

to run, where

n

is the number of bits in a given run-time environment. However, jserv (2023) optimized the 2nd step with population count in the 1st quiz of computer architecture course. So, it takes

O (\log n)

after optimization.

In a 64-bit environment, the 3rd step manipulates on 64 bits. But looking closely for the definition of palindrome given fixed number of maximal bits (64 in this case), padaray (2023) optimize this step by only reversing one half of the given number. That is, he optimized this step to have

n / 2 = 32

bits to manipulate.

However, in this way, the 3rd step still takes

O (n)

to run. As a result, I searched for reversing bits on the Internet, and found that Douglas W. Jones (2018), sth and Ry- (2010, 2020) accomplish the reversing in $O(\log n). As a result, I picked this topic for optimization in this homework.

2. Set up environment

Reference: Lab 2 (rv32emu) provided by System Software Programming (2023).

2.1. Prepare GNU toolchain for RISC-V

Operating system (OS): Ubuntu 22.04.3 server installed in a VirtualBox (7.0.10) VM.

I've tried to compile the sample programs of rv32emu in Debian 12.1.0, but all of the three tests result in segmentation fault.
I will try to support rv32emu in Debian in the future. But I'm forced to stick on Ubuntu now. :(
Check the latest version of on its GitHub

Copy the link to the tarball of architecture Linux and x64, and then download it with wget command. Do the same thing for the SHA256 checksum.

wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v13.2.0-2/xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz
wget https://github.com/xpack-dev-tools/riscv-none-elf-gcc-xpack/releases/download/v13.2.0-2/xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz.sha

Check if the checksum of the downloaded tarball matches the downloaded checksum.

cat xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz.sha
# output: 52545afb900200fbf65fe05f7ce7090b8a42c64091f4f5d43cae6bb68ea2434a  xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz
sha256sum xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz
# output: 52545afb900200fbf65fe05f7ce7090b8a42c64091f4f5d43cae6bb68ea2434a  xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz

If the checksum matches, extract the tarball.

tar zxf xpack-riscv-none-elf-gcc-13.2.0-2-linux-x64.tar.gz

Create a version memo.

echo 13.2.0-2 > xpack-riscv-none-elf-gcc-13.2.0-2/version.txt

Create an app directory under user's home, and move the toolchain to there.

mkdir -p ~/app
mv xpack-riscv-none-elf-gcc-13.2.0-2 ~/app/riscv-none-elf-gcc

Configure the PATH environment variable.

echo "export PATH=\"$HOME/app/riscv-none-elf-gcc/bin:\$PATH\"" >> ~/.bashrc
source ~/.bashrc

Check if the toolchain version. This should work if you set PATH properly.

riscv-none-elf-gcc -v
# expected output:
# ... (omitted)
# gcc version 13.2.0 (xPack GNU RISC-V Embedded GCC x86_64)

2.2. Get and build rv32emu

Install the dependencies.

sudo apt update && sudo apt install -y libsdl2-dev libsdl2-mixer-dev make

Get and build rv32emu from source.

mkdir ~/ca
git clone https://github.com/sysprog21/rv32emu ~/ca/rv32emu
cd ~/ca/rv32emu
make

Validate rv32emu.

make check
# expected output:
# Running hello.elf ... [OK]
# Running puzzle.elf ... [OK]
# Running pi.elf ... [OK]

Run hello.elf by rv32emu.

build/rv32emu build/hello.elf
# expected output:
# Hello World!
# Hello World!
# Hello World!
# Hello World!
# Hello World!
# inferior exit code 0

2.3. Demo of functionalities of GNU toolchain

objdump (-d: Display the assembler mnemonics for the machine instructions.)

riscv-none-elf-objdump -d build/hello.elf

Expected output:


build/hello.elf:     file format elf32-littleriscv


Disassembly of section .text:

00000000 <_start>:
   0:   00000293                li      t0,0
   4:   00500313                li      t1,5
   8:   0040006f                j       c <_start+0xc>
   c:   00000013                nop

00000010 <loop>:
  10:   02628063                beq     t0,t1,30 <end>
  14:   04000893                li      a7,64
  18:   00100513                li      a0,1
  1c:   03c00593                li      a1,60
  20:   00d00613                li      a2,13
  24:   00000073                ecall
  28:   00128293                add     t0,t0,1
  2c:   fe5ff06f                j       10 <loop>

00000030 <end>:
  30:   05d00893                li      a7,93
  34:   00000513                li      a0,0
  38:   00000073                ecall

readelf (-h: Display the ELF file header)

riscv-none-elf-readelf -h build/hello.elf

Expected output:

ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           RISC-V
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          52 (bytes into file)
  Start of section headers:          4532 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         2
  Size of section headers:           40 (bytes)
  Number of section headers:         7
  Section header string table index: 6

size (List the section sizes—and the total size—for each of the object or archive files objfile in its argument list.)

riscv-none-elf-size build/hello.elf

Expected output:

   text    data     bss     dec     hex filename
     73       0       0      73      49 build/hello.elf

Check what preprocessor defines does RISC-V extension zicsr add. The result is __riscv_zicsr.
```
echo | riscv-none-elf-gcc -march=rv32i -E -dM - > tmp-oritinal.txt
echo | riscv-none-elf-gcc -march=rv32i_zicsr -E -dM - > tmp-csr.txt
diff tmp-oritinal.txt tmp-csr.txt
# result:
#   232a233
#   > #define __riscv_zicsr 2000000
```
Since if I want to get the results of "how many instruction is retrieved" and "how many Control and Status Register (CSR) cycles are now," I have to enable the extension _zicsr. And the final compiling target is required to be without _zicsr. To share the code in the two cases above, I have to detect if __riscv_zicsr is defined in C with the following illustration. (Note: .ifdef doesn't work in assembly.)
```
#ifdef __riscv_zicsr
extern uint64_t get_instret();
#endif // __riscv_zicsr
```
Do the same trick as the above for the extension m. The result comes with multiple defines, and I will consider only __riscv_m in my code.
```
echo | riscv-none-elf-gcc -march=rv32i -E -dM - > tmp-oritinal.txt
echo | riscv-none-elf-gcc -march=rv32im -E -dM - > tmp-muldiv.txt
diff tmp-oritinal.txt tmp-muldiv.txt
# result:
#   126a127
#   > #define __riscv_mul 1
#   152a154
#   > #define __riscv_muldiv 1
#   212a215
#   > #define __riscv_m 2000000
#   303a307
#   > #define __riscv_div 1
```
If I compile function calls of printf (in standard library <stdio.h>) to assembly, I will loose the ability to link the object file in which implements printf. As a result, I utilize the ecall to write (with similar definition of the write command in Linux.) To convert integers to characters arrays, I need the support of modulo and division from the m extension for convenience, although modulo and division can be implemented by integer operations.

3. Analysis and Improvement

You can get the complete code from the repository coding-ray/2023-ca-hw-2 on GitHub.

3.1. V1: baseline

The code of this version is in the directory archived/v1-base.

This version is the baseline version, without any optimization from me. In this version, I refactored the code from padaray's work. The primary principle is that after my refactoring, the algorithm should be the same as the original work, including the instructions.

In addition, I did the following refactoring:

Convert code in C++ to C. (padaray didn't use any C++ features in his code, though.) (c908804)
Eliminate errors and warnings from gcc with flags -Wall -Wextra. (c908804)
Format code with VS Code C/C++ extension. (10ef438)
Rename variables in snake case. (d53f444)
Rename function palindrome_detected with is_palindrome. The latter one is in the format "verb_noun." (d53f444)
Simplify the calling of function is_palindrome. Specifically, move the function call of count_leading_zeros to the first line of is_palindrome. (d53f444)
Clean redundant brackets (), braces {} and one-line /* */ comments. (d53f444)
Compress code with x <op>= y (from x = x <op> y). (d53f444)
Move source code to src/. (4d17e98)
Customize C and assembly for rv32emu. (e9c2284)
Count CSR cycles and instructions retrieved. (156ce2d)

3.2. V2: O(log n) algorithm of reversing 64 bits

The code of this version is in the directory archived/v2-optimized-reversing.

The code to reverse 64 bits is as follows before optimization. If the number of bits is

n

, this algorithm has time complicity

Θ (n)








uint64_t tmp_a = a;
uint64_t reverse_a = 0;
for (int i = 0; i < 64; i++) {
  reverse_a <<= 1;
  reverse_a |= tmp_a & 1;
  tmp_a >>= 1;
}
reverse_a >>= n_left_shift;

In this version, I apply the algorithm of reversing bits in place from Douglas W. Jones (2018), sth and Ry- (2010, 2020).

To reverse 16 bits in place, Douglas W. Jones (2018) wrote the following code in C.








unsigned int reverse( unsigned int x ){
  x = ((x >>  1) & 0x55555555) + ((x & 0x55555555) <<  1);
  x = ((x >>  2) & 0x33333333) + ((x & 0x33333333) <<  2);
  x = ((x >>  4) & 0x0F0F0F0F) + ((x & 0x0F0F0F0F) <<  4);
  x = ((x >>  8) & 0x00FF00FF) + ((x & 0x00FF00FF) <<  8);
  x = ((x >> 16)             ) + ((x             ) << 16);
  return x;
}

The basic idea is as follows.

In line 2:
1. The first right shifting >> and AND & operations pick up bits with interleaving of one bit, from the most-significant bit of x. And then shift the picked bits right by one bit. That is, saying that x has bits 31, 30, 29,
  $\dots$ , 1, 0, from the most significant bit, the operations before the adding + operator takes bits 31, 29, 27,
  $\dots$ , 3, 1 of x, and put them to positions of bits 30, 28, 26,
  $\dots$ , 2, 0.
2. The operations after the adding + operator pick up the rest of bits, and shift them left by one bit. That is, the bits 30, 28,
  $\dots$ , 2, 0 of x are moved to the positions of bits 31, 29,
  $\dots$ , 3, 1.
3. The adding + operation is actually OR | operation for the two operands. So after this operation, the bits in the pair of two bits are swapped. Bits abcd are badc afterwards.
In line 3, two adjacent bits are grouped together, and two adjacent groups are swapped. So, bits badc become dcba later.
Similarly, in line 3, four adjacent bits are grouped together.
In line 6, it is trivial to group (mask) 16 bits together, for with and without masking 0x0000FFFF, the result is the same.

With the same idea, I wrote the following code.















// file header
#define INTERLEAVE_BY_01 0x5555555555555555
#define INTERLEAVE_BY_02 0x3333333333333333
#define INTERLEAVE_BY_04 0x0F0F0F0F0F0F0F0F
#define INTERLEAVE_BY_08 0x00FF00FF00FF00FF
#define INTERLEAVE_BY_16 0x0000FFFF0000FFFF

// function body with `r` defined previously
r = ((r >> 1) & INTERLEAVE_BY_01) + ((r & INTERLEAVE_BY_01) << 1);
r = ((r >> 2) & INTERLEAVE_BY_02) + ((r & INTERLEAVE_BY_02) << 2);
r = ((r >> 4) & INTERLEAVE_BY_04) + ((r & INTERLEAVE_BY_04) << 4);
r = ((r >> 8) & INTERLEAVE_BY_08) + ((r & INTERLEAVE_BY_08) << 8);
r = ((r >> 16) & INTERLEAVE_BY_16) + ((r & INTERLEAVE_BY_16) << 16);
r = (r >> 32) + (r << 32);
// finish reversing

With this optimization, the algorithm improves time complicity form

Θ (n)

Θ (\log n)

, where

n

is the number of bits.

With different optimization levels of gcc, the comparison between versions 1 and 2 is as follows.

Performance evaluation (number of CSR cycles) of the four test cases. These four tests are the same across all versions in the "archived" directory. These four tests are inputting 0, 1, 0x00000C0000000003 and 0x0F000000000000F0 to the is_palindrome function, respectively. And across the versions, all the results are verified before any evaluation.

In the following table, for the cells that contain two numbers, the first number is the result from v1, and the latter is from v2. Since optimization level of -Ofast and -O3 yield the same result for all tests in all versions, the former one is not included in the tables in this report.

Opt.	# of inst. ret.	Case 1	Case 2	Case 3	Case 4
`-O0`	720/727	3674/756	3653/735	0/0	3769/851
`-O1`	734/734	923/235	0/0	0/0	0/0
`-O2`	734/734	920/235	0/0	0/0	0/0
`-O3`	730/734	894/223	0/0	0/0	0/0
`-Os`	784/734	131/260	0/129	0/0	0/0

Program size evaluation is as follows. The numbers are all the number of bytes of the entire program in decimal.

Opt.	V1 size	V2 size	V2/V1 Ratio
`-O0`	10428	11384	109%
`-O1`	8796	9068	103%
`-O2`	8792	9064	103%
`-O3`	9124	9396	103%
`-Os`	8776	9032	103%

3.3. V3: Replace function printf with write

The code of this version is in the directory archived/v3-write.

In this version, I replace function calls of printf with write, for the sake of linking error for the object file that contains printf when I want to link my object files. In detail, I didn't successfully link /usr/lib/x86_64-linux-gnu/libc.a or /usr/lib/x86_64-linux-gnu/libc.so. (I am glad of it.)

Moreover, in this version, I made the following features:

Get rid of all C standard library.
C code is compile to assembly files instead of compiling to object files directly. To make the main function work with rv32emu, replace jr ra or ret in the end of the main function with the following assembly
```
li a7, 93           # "exit" syscall
add a0, x0, 0       # Use 0 return code
ecall               # invoke syscall to terminate the program
```
To properly call write function in rv32emu, in assembly files, replace call write with li a7, 64 and ecall.
Keep built assembly files, and add them to index.
Programmers can start from the saved assembly files, and compile them to object files.
In the performance test, in assembly, replace call __udivdi3 with divu a0,a0,a2, and call __umoddi3 with remu a0,a0,a2.

The results of this version show that given almost the similar printed messages to the stdout, the size of the ELF in this version is much smaller than that in v2.

In the following table, for the cells that have two numbers, the first one the the result of v2, and the latter is of v3.

With the optimization level -Os, the compiler emits errors for the undefined reference to memcpy, __ashldi3, __lshrdi3. However, I didn't even call these functions in C, so I just get rid of the result for this optimization level.

Opt.	# of inst. ret.	Case 1	Case 2	Case 3	Case 4
`-O0`	727/34	756/756	735/735	0/0	851/752
`-O1`	734/38	235/235	0/0	0/0	0/0
`-O2`	734/37	235/235	0/0	0/0	0/0
`-O3`	734/42	223/223	0/0	0/0	0/0

Program size evaluation is as follows.

Opt.	V2 size	V3 size	V3/V2 ratio
`-O0`	11384	3324	29%
`-O1`	9068	1012	11%
`-O2`	9064	1008	11%
`-O3`	9396	1340	14%

References

System Software Programming(@sysprog). Assignment 2 requirements. [Online]. [Accessed 2023-10-24].
philant & proski. (2010-02-08, 2022-01-27). GCC dump preprocessor defines. [Online]. [Accessed 2023-10-29].
Akshay. (2019-09-18). Use of ifdef in gas assembly language. [Online]. [Accessed 2023-10-29].

Assignment2: RISC-V Toolchain

1. Introduction

1.1. Palindrome number

1.2. Motivation

2. Set up environment

2.1. Prepare GNU toolchain for RISC-V

2.2. Get and build rv32emu

2.3. Demo of functionalities of GNU toolchain

3. Analysis and Improvement

3.1. V1: baseline

3.2. V2: O(log n) algorithm of reversing 64 bits

3.3. V3: Replace function printf with write

References

Read more

HPDS Group Meeting Presentation Schedule

Computer Architecture 2023: Term Project

112-1CA Test Memory Leak of rv32emu

112-1 CA Assignment 3: Single-Cycle RISC-V CPU