# Lab2: RISC-V RV32I[MA] emulator with ELF support: Count Leading Zeros (CLZ)
###### tags: `Computer Architecture`
[TOC]
### RISC-V GNU Compiler toolchain
#### Introduction
This is the RISC-V C and C++ cross-compiler. It supports two build modes: a generic ELF/Newlib toolchain and a more sophisticated Linux-ELF/glibc toolchain.
### Environment Setup
#### Getting the [source](https://github.com/riscv/riscv-gnu-toolchain)
```
$ git clone --recursive https://github.com/riscv/riscv-gnu-toolchain
```
#### Prerequisites
Several standard packages are needed to build the toolchain.
```
$ sudo apt-get install autoconf automake autotools-dev curl python3 libmpc-dev libmpfr-dev libgmp-dev gawk build-essential bison flex texinfo gperf libtool patchutils bc zlib1g-dev libexpat-dev
```
#### Installation
To build the 32-bit RV32I toolchain (matching the `--with-arch=rv32i` and `--with-abi=ilp32` options passed to `configure`), use:
```
$ cd riscv-gnu-toolchain
$ export PATH=$PATH:/opt/riscv/bin
$ source ~/.bashrc
$ ./configure --prefix=/opt/riscv --with-arch=rv32i --with-abi=ilp32
$ sudo make
```
### Rewritten code in C
#### clz.c
```c=
unsigned int clz(unsigned int num)
{
    int res = 0;
    unsigned mask = 0x80000000;
    for (int i = 0; i < 32; i++) {
        if (num & (mask >> i)) break;
        res++;
    }
    return res;
}
int _start(){
    unsigned int countNum = 0x0000000f;
    unsigned int ans = clz(countNum);
    volatile char *tx = (volatile char*)0x40002000;
    *tx = ans + '0';
    return 0;
}
```
Note that the printing part is not yet correct: `ans + '0'` produces a single character, so only results below 10 are printed correctly. Larger results, such as the expected 28 for the input `0x0000000f`, would have to be converted into several decimal digits, which has not been fixed yet.

Result (compiled without optimization, `-O0`):
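One possible way to print two-digit results is sketched below. It is only an illustration, not part of the measured program: the helper names `u32_to_dec2` and `put_dec` are made up for this note, and the value is assumed to be in the range 0..99, which covers every possible CLZ result (0..32). Repeated subtraction is used instead of `/` and `%`, because plain RV32I has no divide instruction and `-nostdlib` does not link the libgcc division helpers.

```c
/* Convert v (assumed 0..99) into decimal characters in buf.
 * Returns the number of characters written (1 or 2). */
int u32_to_dec2(unsigned int v, char *buf)
{
    unsigned int tens = 0;
    int len = 0;

    /* Count how many times 10 fits into v, without using division. */
    while (v >= 10) {
        v -= 10;
        tens++;
    }
    if (tens > 0)
        buf[len++] = (char)('0' + tens);
    buf[len++] = (char)('0' + v);
    return len;
}

/* Send the digits one at a time to a memory-mapped transmit register. */
void put_dec(volatile char *tx, unsigned int v)
{
    char buf[2];
    int len = u32_to_dec2(v, buf);

    for (int i = 0; i < len; i++)
        *tx = buf[i];
}
```

With these helpers, the statement `*tx = ans + '0';` in `_start` could be replaced by `put_dec(tx, ans);`. Doing so changes the instruction counts, so the statistics in this note would no longer apply.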
```
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -O0 -nostdlib -o clz clz.c
./emu-rv32i clz
```
```
>>> Execution time: 80585 ns
>>> Instruction count: 466 (IPS=5782713)
>>> Jumps: 34 (7.30%) - 3 forwards, 31 backwards
>>> Branching T=30 (51.72%) F=28 (48.28%)
```
#### Disassemble the ELF file
```
riscv-none-embed-objdump -h clz
```
```
clz: file format elf32-littleriscv
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 000000d0 00010054 00010054 00000054 2**2
CONTENTS, ALLOC, LOAD, READONLY, CODE
1 .comment 00000033 00000000 00000000 00000124 2**0
CONTENTS, READONLY
```
#### Header of ELF file (output of `riscv-none-embed-readelf -h clz`)
```
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10088
Start of program headers: 52 (bytes into file)
Start of section headers: 544 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 1
Size of section headers: 40 (bytes)
Number of section headers: 6
Section header string table index: 5
```
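To show how the first fields of this header map onto raw bytes, here is a minimal sketch that checks the `Magic` bytes and reads the entry point. The function names are hypothetical and chosen for this note only. The offsets follow the ELF32 header layout, where `e_ident` occupies the first 16 bytes and the entry point is stored at byte offset 24.

```c
#include <stdint.h>

/* Return 1 if the identification bytes start with the ELF magic
 * (0x7f 'E' 'L' 'F') and describe a 32-bit little-endian file. */
int is_elf32_le(const uint8_t ident[16])
{
    if (ident[0] != 0x7f || ident[1] != 'E' ||
        ident[2] != 'L'  || ident[3] != 'F')
        return 0;
    /* ident[4] == 1 means ELFCLASS32, ident[5] == 1 means ELFDATA2LSB. */
    return ident[4] == 1 && ident[5] == 1;
}

/* Read a little-endian 32-bit value from a byte buffer. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Entry point address (e_entry) of an ELF32 header held in memory. */
uint32_t elf32_entry(const uint8_t *hdr)
{
    return read_le32(hdr + 24);
}
```

Applied to the `Magic` line above (`7f 45 4c 46 01 01 01 ...`), `is_elf32_le` accepts the file, which agrees with the `Class: ELF32` and `little endian` fields printed in the header dump.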
#### RISC-V assembly result
```
riscv-none-embed-objdump -d clz
```
```
clz: file format elf32-littleriscv
Disassembly of section .text:
00010054 <clz>:
10054: fd010113 addi sp,sp,-48
10058: 02812623 sw s0,44(sp)
1005c: 03010413 addi s0,sp,48
10060: fca42e23 sw a0,-36(s0)
10064: fe042623 sw zero,-20(s0)
10068: 800007b7 lui a5,0x80000
1006c: fef42223 sw a5,-28(s0)
10070: fe042423 sw zero,-24(s0)
10074: 0340006f j 100a8 <clz+0x54>
10078: fe842783 lw a5,-24(s0)
1007c: fe442703 lw a4,-28(s0)
10080: 00f75733 srl a4,a4,a5
10084: fdc42783 lw a5,-36(s0)
10088: 00f777b3 and a5,a4,a5
1008c: 02079663 bnez a5,100b8 <clz+0x64>
10090: fec42783 lw a5,-20(s0)
10094: 00178793 addi a5,a5,1 # 80000001 <__global_pointer$+0x7ffee6dd>
10098: fef42623 sw a5,-20(s0)
1009c: fe842783 lw a5,-24(s0)
100a0: 00178793 addi a5,a5,1
100a4: fef42423 sw a5,-24(s0)
100a8: fe842703 lw a4,-24(s0)
100ac: 01f00793 li a5,31
100b0: fce7d4e3 bge a5,a4,10078 <clz+0x24>
100b4: 0080006f j 100bc <clz+0x68>
100b8: 00000013 nop
100bc: fec42783 lw a5,-20(s0)
100c0: 00078513 mv a0,a5
100c4: 02c12403 lw s0,44(sp)
100c8: 03010113 addi sp,sp,48
100cc: 00008067 ret
000100d0 <_start>:
100d0: fe010113 addi sp,sp,-32
100d4: 00112e23 sw ra,28(sp)
100d8: 00812c23 sw s0,24(sp)
100dc: 02010413 addi s0,sp,32
100e0: 00f00793 li a5,15
100e4: fef42623 sw a5,-20(s0)
100e8: fec42503 lw a0,-20(s0)
100ec: f69ff0ef jal ra,10054 <clz>
100f0: fea42423 sw a0,-24(s0)
100f4: 400027b7 lui a5,0x40002
100f8: fef42223 sw a5,-28(s0)
100fc: fe842783 lw a5,-24(s0)
10100: 0ff7f713 andi a4,a5,255
10104: fe442783 lw a5,-28(s0)
10108: 00e78023 sb a4,0(a5) # 40002000 <__global_pointer$+0x3fff06dc>
1010c: 00000793 li a5,0
10110: 00078513 mv a0,a5
10114: 01c12083 lw ra,28(sp)
10118: 01812403 lw s0,24(sp)
1011c: 02010113 addi sp,sp,32
10120: 00008067 ret
```
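The second column of the listing is the raw 32-bit instruction word. As a small worked example of how such a word relates to the mnemonic next to it, the sketch below decodes the fields of an I-type instruction such as `fd010113` (`addi sp,sp,-48`). The struct and function names are made up for this note; the bit positions are those of the RV32I I-type format.

```c
#include <stdint.h>

/* Fields of an RV32I I-type instruction. */
typedef struct {
    uint32_t opcode;  /* bits 6..0   */
    uint32_t rd;      /* bits 11..7  */
    uint32_t funct3;  /* bits 14..12 */
    uint32_t rs1;     /* bits 19..15 */
    int32_t  imm;     /* bits 31..20, sign-extended */
} itype_t;

itype_t decode_itype(uint32_t insn)
{
    itype_t d;

    d.opcode = insn & 0x7f;
    d.rd     = (insn >> 7) & 0x1f;
    d.funct3 = (insn >> 12) & 0x7;
    d.rs1    = (insn >> 15) & 0x1f;

    /* Take the 12-bit immediate and sign-extend it by hand. */
    d.imm = (int32_t)(insn >> 20);
    if (d.imm & 0x800)
        d.imm -= 0x1000;
    return d;
}
```

For `0xfd010113` this gives opcode `0x13`, `rd = 2`, `rs1 = 2` (register `x2` is `sp`) and an immediate of -48, which matches the first instruction of `clz` in the listing.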
#### Comparison with hand-written assembly code (credit: [鄭惟](https://hackmd.io/@WeiCheng14159/rkUifs2Hw))
```
.data
input:  .word 0x0000000f
one:    .word 0x80000000
str1:   .string "clz value of "
str2:   .string " is "

.text
main:
        lw   a0, input          # Load input from static data
        jal  ra, clz            # Jump-and-link to the 'clz' label

        # Print the result to console
        mv   a1, a0
        lw   a0, input
        jal  ra, printResult

        # Exit program
        li   a7, 10
        ecall

clz:
        # t0 = one
        # t1 = cnt = 32
        # t2 = res
        # a0 = i
        lw   t0, one
        li   t1, 32
        li   t2, 0
_beg:   bne  t1, zero, cnt
_ret:   mv   a0, t2
        ret
cnt:    addi t1, t1, -1
        and  t3, a0, t0         # i & one
        bne  t3, zero, _ret
        addi t2, t2, 1
        srli t0, t0, 1
        j    _beg

# --- printResult ---
# a0: input
# a1: result
printResult:
        mv   t0, a0
        mv   t1, a1
        la   a0, str1
        li   a7, 4
        ecall
        mv   a0, t0
        li   a7, 1
        ecall
        la   a0, str2
        li   a7, 4
        ecall
        mv   a0, t1
        li   a7, 1
        ecall
        ret
```
### Comparing results across optimization levels
```
riscv-none-embed-gcc -march=rv32i -mabi=ilp32 -O# -nostdlib -o clz clz.c
# = 1 ~ 3
```
#### Optimize O1
```
>>> Execution time: 40708 ns
>>> Instruction count: 156 (IPS=3832170)
>>> Jumps: 31 (19.87%) - 2 forwards, 29 backwards
>>> Branching T=28 (50.00%) F=28 (50.00%)
```
#### Optimize O2
```
>>> Execution time: 6854 ns
>>> Instruction count: 149 (IPS=21739130)
>>> Jumps: 28 (18.79%) - 0 forwards, 28 backwards
>>> Branching T=27 (48.21%) F=29 (51.79%)
```
#### Optimize O3
```
>>> Execution time: 50465 ns
>>> Instruction count: 149 (IPS=2952541)
>>> Jumps: 28 (18.79%) - 0 forwards, 28 backwards
>>> Branching T=27 (48.21%) F=29 (51.79%)
```
#### Remove the printing part (the two `tx` lines in `_start` of the C code), compile with O3
```
>>> Execution time: 1764 ns
>>> Instruction count: 3 (IPS=1700680)
>>> Jumps: 1 (33.33%) - 0 forwards, 1 backwards
>>> Branching T=0 (-nan%) F=0 (-nan%)
```
* Among the three levels, `-O2` gives the shortest execution time. Its instruction count, jumps, and branches are identical to those of `-O3` (149 instructions, 28 jumps), so the two differ only in measured execution time (6854 ns vs. 50465 ns). Because the dynamic instruction counts are the same, part of that gap may be run-to-run timing variation on the host rather than a difference in the generated code. The [GCC documentation](http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) describes each optimization level in detail.
* [GCC optimization](https://wiki.gentoo.org/wiki/GCC_optimization#-O) mentions that
>Compiling with -O3 is not a guaranteed way to improve performance, and in fact, in many cases, can slow down a system due to larger binaries and increased memory usage.
* Another point is that without the printing code, execution time, instruction count, jumps, and branches all drop drastically (only 3 instructions at `-O3`), presumably because the result of `clz` is then unused and the compiler can discard the call.
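The `IPS` figures in the statistics above are consistent with dividing the instruction count by the execution time. The sketch below reproduces that arithmetic; the function name `ips` is made up for this note, and the formula is inferred from the printed numbers rather than taken from the emulator source.

```c
#include <stdint.h>

/* Instructions per second, given an instruction count and the elapsed
 * time in nanoseconds. The caller must pass a non-zero elapsed_ns. */
uint64_t ips(uint64_t instructions, uint64_t elapsed_ns)
{
    return instructions * 1000000000ULL / elapsed_ns;
}
```

For example, `ips(149, 6854)` yields 21739130 (the `-O2` run) and `ips(149, 50465)` yields 2952541 (the `-O3` run). The instruction count is the same in both cases, so the large IPS difference comes entirely from the measured time.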