contributed by kk908676
I refer to the Find First String of 1-bits of a Given Length by CLZ from 林柏全
The following is the source code:
#include <stdint.h>
#include <stdio.h>
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
int find_string(uint64_t x, int n){
int clz; // leading zeros of x
int pos = 0; // position of fist '1' bit from significant bit
while(x != 0){
clz = count_leading_zeros(x);
x = x << clz;
pos = pos + clz;
clz = count_leading_zeros(~x);
if (clz >= n)
return pos;
x = x << clz;
pos = pos + clz;
}
return -1;
}
int main() {
uint64_t test_data[] = {0x0f00000000000000,
0x0000000000000000,
0x0123456789abcdef};
for (int i = 0; i < 3; i++) {
uint64_t x = test_data[i];
int n = 4;
int result = find_string(x, n);
printf("Test Case %d: Input: 0x%016lx, Result: %d\n", i+1, x, result);
}
return 0;
}
The importance of "Find First String of 1-bits of a Given Length by CLZ" is primarily manifested in its efficiency and relevance in bitwise operations, particularly in programming contexts where performance is emphasized. This includes optimizing algorithms, performance improvement, and applications in bit-level security and cryptography.
First, we need to write a Makefile that suits our needs.
.PHONY: clean
include /home/an/rv32emu/mk/toolchain.mk
ASFLAGS = -march=rv32i_zicsr -mabi=ilp32
SOURCE = main.o initial.o getcycles.o
VAR = -O0
CC = riscv-none-elf-gcc
%.s: %.c
$(CC) $(ASFLAGS) $(VAR) -s -o $@ $<
%.o: %.s
$(CC) $(ASFLAGS) -c -o $@ $<
all: main.elf
main.elf: $(SOURCE)
$(CC) -o $@ $(SOURCE)
clean:
$(RM) main.elf main.o initial.o getcycles.o
Next, we execute the following instructions
$ ~/rv32emu/build/rv32emu main.elf
Display the main.elf header
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100c2
Start of program headers: 52 (bytes into file)
Start of section headers: 69400 (bytes into file)
Flags: 0x1, RVC, soft-float ABI
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
Program size
text data bss dec hex filename
52590 1876 1528 55994 daba main.elf
cycle count
Display the main.elf header
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100c2
Start of program headers: 52 (bytes into file)
Start of section headers: 69408 (bytes into file)
Flags: 0x1, RVC, soft-float ABI
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
Program size
text data bss dec hex filename
51812 1884 1528 55224 d7b8 main.elf
cycle count
main.elf header
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x1016e
Start of program headers: 52 (bytes into file)
Start of section headers: 69408 (bytes into file)
Flags: 0x1, RVC, soft-float ABI
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
Program size
text data bss dec hex filename
51834 1884 1528 55246 d7ce main.elf
cycle count
main.elf header
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x1016e
Start of program headers: 52 (bytes into file)
Start of section headers: 69408 (bytes into file)
Flags: 0x1, RVC, soft-float ABI
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
Program size
text data bss dec hex filename
52380 1884 1528 55792 d9f0 main.elf
First,I need to rewrite the code to fit the format of rv32emu.Because integer output is not supported, I attempted to modify the system call to output the cycle count.
In this code, I have added…
.set SYSPRINTFCYCLE, 32
And added it to syscall.c
static void syscall_printfcycle(riscv_t *rv)
{
uint32_t buffer = rv_get_reg(rv, rv_reg_a3);
fprintf(stdout, "Cycle count: %d\n", buffer);
rv_set_reg(rv, rv_reg_a3, 0);
}
an@an-Standard-PC-i440FX-PIIX-1996:~/Desktop/Hw2_asm$ ~/rv32emu/build/rv32emu main.elf
Cycle count: 348
inferior exit code 0
Later, I attempted to make changes to the overflow handling to reduce the cycle count. Eventually, I found that the overflow check here was unnecessary for this program and proceeded to remove it.
I made reductions to these aspects:
blt t1, t3, 16 # if underflow then jump
beq zero, zero, 12
addi t0, t0, -1 # underflow at lower bits
sub t0, t0, t2 #[t0, t1] => x -= ((x>>1) & 0x5555555555555555)
or t4, t1, zero
xor t4, s4, t4 # nor t5, t1, zero (t4 = ~(s4 | zero))
bgeu t4, t3, 8 # if no overflow then jump
addi t0, t0, 1 # if overflow upper bits plus 1
or t4, t1, zero
xor t4, s4, t4 # nor t5, t1, zero (t4 = ~(s4 | zero))
bgeu t4, t3, 8 # if no overflow then jump
addi t0, t0, 1 # if overflow upper bits plus 1
or t4, t1, zero
xor t4, s4, t4
bgeu t4, t3, 8
addi t0, t0, 1
or t4, t1, zero
xor t4, s4, t4
bgeu t4, t3, 8
addi t0, t0, 1
or t4, t1, zero
xor t4, s4, t4
bgeu t4, t3, 8
addi t0, t0, 1
This is the cycle count obtained after the reductions
$ ~/rv32emu/build/rv32emu main.elf
Cycle count: 314
inferior exit code 0
Next, I want to try reducing the text size.
# x |= (x>>1);
srli t3, t1, 1 # shift lower bits of test data right with 1 bit
slli t2, t0, 31 # shift upper bits of test data left with 31 bits
or t3, t2, t3 # combine to get new lower bits of test data
srli t2, t0, 1 # shift upper bound of test data right with 1 bit
or t0, t0, t2 # [0~31]x | [0~31](x >> 1)
or t1, t1, t3 # [32~63]x | [32~63](x >> 1)
# x |= (x>>2);
srli t3, t1, 2
slli t2, t0, 30
or t3, t2, t3
srli t2, t0, 2
or t0, t0, t2
or t1, t1, t3
# x |= (x>>4);
srli t3, t1, 4
slli t2, t0, 28
or t3, t2, t3
srli t2, t0, 4
or t0, t0, t2
or t1, t1, t3
# x |= (x>>8);
srli t3, t1, 8
slli t2, t0, 24
or t3, t2, t3
srli t2, t0, 8
or t0, t0, t2
or t1, t1, t3
# x |= (x>>16);
srli t3, t1, 16
slli t2, t0, 16
or t3, t2, t3
srli t2, t0, 16
or t0, t0, t2
or t1, t1, t3
li a5, 1
li a6, 32
loop:
sub a7, a6, a5
srl t3, t1, a5 # shift lower bits of test data right with n bit
sll t2, t0, a7 # shift upper bits of test data left with 31-n bits
or t3, t2, t3 # combine to get new lower bits of test data
srl t2, t0, a5 # shift upper bound of test data right with n bit
or t0, t0, t2 # [0~31]x | [0~31](x >> n)
or t1, t1, t3 # [32~63]x | [32~63](x >> n)
slli a5, a5, 1
beq a5, a6, CE
j loop
Similarly, we execute it in the terminal to obtain the result.
text data bss dec hex filename
640 0 0 640 280 main.elf
target | cycle | size |
---|---|---|
O0 | 8918 | 52590 |
O1 | 6796 | 51812 |
O2 | 6788 | 51834 |
O3 | 6883 | 52380 |
assembly | 314 | 640 |
You shall explain why C implementation is significantly slower.