Selected subject: Palindrome detection with counting leading zeros.
Contributed by Ray Huang (黃柏叡).
Acknowledgment (unordered):
A palindrome number is a number that reads the same backward for forward. For example, and .
With the similar idea, padaray and I defines a palindrome bit sequence (abbreviated as palindrome) as follows. Given a bit sequence without leading zero, saying . If the reversed bit sequence of , saying , is the same as , we call it a palindrome.
For example, given (the subscript means binary representation,) the reversed version, . Since , we conclude that is not a palindrome.
In contrast, number is a palindrome, for .
The following pseudo-code can solve the problem in the intuitive way.
&
from the most significant bit to the lease significant bit until some bit is set (1), and the number of bit-wise AND minus one is the number of leading zeros." A special case is that if , then the count is in 32-bit run-time environment.The 2nd step takes to run, where is the number of bits in a given run-time environment. However, jserv (2023) optimized the 2nd step with population count in the 1st quiz of computer architecture course. So, it takes after optimization.
In a 64-bit environment, the 3rd step manipulates on 64 bits. But looking closely for the definition of palindrome given fixed number of maximal bits (64 in this case), padaray (2023) optimize this step by only reversing one half of the given number. That is, he optimized this step to have bits to manipulate.
However, in this way, the 3rd step still takes to run. As a result, I searched for reversing bits on the Internet, and found that Douglas W. Jones (2018), sth and Ry- (2010, 2020) accomplish the reversing in $O(\log n). As a result, I picked this topic for optimization in this homework.
Reference: Lab 2 (rv32emu) provided by System Software Programming (2023).
wget
command. Do the same thing for the SHA256 checksum.
PATH
environment variable.
PATH
properly.
hello.elf
by rv32emu.
-d
: Display the assembler mnemonics for the machine instructions.)
Expected output:
-h
: Display the ELF file header)
Expected output:
zicsr
add. The result is __riscv_zicsr
.
Since if I want to get the results of "how many instruction is retrieved" and "how many Control and Status Register (CSR) cycles are now," I have to enable the extension _zicsr
. And the final compiling target is required to be without _zicsr
. To share the code in the two cases above, I have to detect if __riscv_zicsr
is defined in C with the following illustration. (Note: .ifdef
doesn't work in assembly.)
m
. The result comes with multiple defines, and I will consider only __riscv_m
in my code.
If I compile function calls of printf
(in standard library <stdio.h>
) to assembly, I will loose the ability to link the object file in which implements printf
. As a result, I utilize the ecall to write
(with similar definition of the write command in Linux.) To convert integers to characters arrays, I need the support of modulo and division from the m
extension for convenience, although modulo and division can be implemented by integer operations.You can get the complete code from the repository coding-ray/2023-ca-hw-2 on GitHub.
The code of this version is in the directory archived/v1-base.
This version is the baseline version, without any optimization from me. In this version, I refactored the code from padaray's work. The primary principle is that after my refactoring, the algorithm should be the same as the original work, including the instructions.
In addition, I did the following refactoring:
gcc
with flags -Wall -Wextra
. (c908804)palindrome_detected
with is_palindrome
. The latter one is in the format "verb_noun." (d53f444)is_palindrome
. Specifically, move the function call of count_leading_zeros
to the first line of is_palindrome
. (d53f444)x <op>= y
(from x = x <op> y
). (d53f444)src/
. (4d17e98)The code of this version is in the directory archived/v2-optimized-reversing.
The code to reverse 64 bits is as follows before optimization. If the number of bits is , this algorithm has time complicity .
In this version, I apply the algorithm of reversing bits in place from Douglas W. Jones (2018), sth and Ry- (2010, 2020).
To reverse 16 bits in place, Douglas W. Jones (2018) wrote the following code in C.
The basic idea is as follows.
>>
and AND &
operations pick up bits with interleaving of one bit, from the most-significant bit of x
. And then shift the picked bits right by one bit. That is, saying that x
has bits 31, 30, 29, , 1, 0, from the most significant bit, the operations before the adding +
operator takes bits 31, 29, 27, , 3, 1 of x
, and put them to positions of bits 30, 28, 26, , 2, 0.+
operator pick up the rest of bits, and shift them left by one bit. That is, the bits 30, 28, , 2, 0 of x
are moved to the positions of bits 31, 29, , 3, 1.+
operation is actually OR |
operation for the two operands. So after this operation, the bits in the pair of two bits are swapped. Bits abcd
are badc
afterwards.badc
become dcba
later.0x0000FFFF
, the result is the same.With the same idea, I wrote the following code.
With this optimization, the algorithm improves time complicity form to , where is the number of bits.
With different optimization levels of gcc
, the comparison between versions 1 and 2 is as follows.
Performance evaluation (number of CSR cycles) of the four test cases. These four tests are the same across all versions in the "archived" directory. These four tests are inputting 0, 1, 0x00000C0000000003 and 0x0F000000000000F0 to the is_palindrome
function, respectively. And across the versions, all the results are verified before any evaluation.
In the following table, for the cells that contain two numbers, the first number is the result from v1, and the latter is from v2. Since optimization level of -Ofast
and -O3
yield the same result for all tests in all versions, the former one is not included in the tables in this report.
Opt. | # of inst. ret. | Case 1 | Case 2 | Case 3 | Case 4 |
---|---|---|---|---|---|
-O0 |
720/727 | 3674/756 | 3653/735 | 0/0 | 3769/851 |
-O1 |
734/734 | 923/235 | 0/0 | 0/0 | 0/0 |
-O2 |
734/734 | 920/235 | 0/0 | 0/0 | 0/0 |
-O3 |
730/734 | 894/223 | 0/0 | 0/0 | 0/0 |
-Os |
784/734 | 131/260 | 0/129 | 0/0 | 0/0 |
Program size evaluation is as follows. The numbers are all the number of bytes of the entire program in decimal.
Opt. | V1 size | V2 size | V2/V1 Ratio |
---|---|---|---|
-O0 |
10428 | 11384 | 109% |
-O1 |
8796 | 9068 | 103% |
-O2 |
8792 | 9064 | 103% |
-O3 |
9124 | 9396 | 103% |
-Os |
8776 | 9032 | 103% |
The code of this version is in the directory archived/v3-write.
In this version, I replace function calls of printf
with write
, for the sake of linking error for the object file that contains printf
when I want to link my object files. In detail, I didn't successfully link /usr/lib/x86_64-linux-gnu/libc.a
or /usr/lib/x86_64-linux-gnu/libc.so
. (I am glad of it.)
Moreover, in this version, I made the following features:
main
function work with rv32emu, replace jr ra
or ret
in the end of the main function with the following assembly
write
function in rv32emu, in assembly files, replace call write
with li a7, 64
and ecall
.call __udivdi3
with divu a0,a0,a2
, and call __umoddi3
with remu a0,a0,a2
.The results of this version show that given almost the similar printed messages to the stdout, the size of the ELF in this version is much smaller than that in v2.
In the following table, for the cells that have two numbers, the first one the the result of v2, and the latter is of v3.
With the optimization level -Os
, the compiler emits errors for the undefined reference to memcpy
, __ashldi3
, __lshrdi3
. However, I didn't even call these functions in C, so I just get rid of the result for this optimization level.
Opt. | # of inst. ret. | Case 1 | Case 2 | Case 3 | Case 4 |
---|---|---|---|---|---|
-O0 |
727/34 | 756/756 | 735/735 | 0/0 | 851/752 |
-O1 |
734/38 | 235/235 | 0/0 | 0/0 | 0/0 |
-O2 |
734/37 | 235/235 | 0/0 | 0/0 | 0/0 |
-O3 |
734/42 | 223/223 | 0/0 | 0/0 | 0/0 |
Program size evaluation is as follows.
Opt. | V2 size | V3 size | V3/V2 ratio |
---|---|---|---|
-O0 |
11384 | 3324 | 29% |
-O1 |
9068 | 1012 | 11% |
-O2 |
9064 | 1008 | 11% |
-O3 |
9396 | 1340 | 14% |