# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [TingSheng](https://github.com/TingSheng10) >
## Quiz1 - Problem C
### Promblem Describe ###
We implement `fp16_to_fp32` function (with `clz`), which converts a 16-bit floating-point number in IEEE half-precision format (bit representation) to a 32-bit floating-point number in IEEE single-precision format (bit representation).
`clz` function counts the number of leading zero bits in the binary representation of an unsigned integer, starting from the most significant bit (MSB). It returns an integer representing how many zero bits precede the first 1 in the binary representation of the number. If the input is zero, the output is **"undefined"**.
### C code of `fp16_to_fp32` ###
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
```c
static inline uint32_t fp16_to_fp32(uint16_t h) {
const uint32_t w = (uint32_t) h << 16;
const uint32_t sign = w & UINT32_C(0x80000000);
const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
uint32_t renorm_shift = my_clz(nonsign);
renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
INT32_C(0x7F800000);
const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
return sign | ((((nonsign << renorm_shift >> 3) +
((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}
```
### Test case with `fp16_to_fp32` ###
|number | number in fp16 (input) | output |
| -------- | ---------------------- | -------- |
|+0 | 0x0000 |0x00000000|
|-0 | 0x8000 |0x80000000|
|NaN | 0xffff |0xFFFFE000|
|+inf | 0x7c00 |0x7F800000|
|-inf | 0xf800 |0xFF800000|
|normalized number| 0x007f |0x36FE0000|
|denormalized number| 0x0300 |0x33800000|
### Assamble language of `fp16_to_fp32` ###
```c
.data
argument: .word 0x0000
endl: .string "\n"
.text
la t0, argument
lw t0, 0(t0) # store x in t0
slli t0, t0, 16
addi s1, zero, 0x80000000
and s1, s1, t0
addi s2, zero, 0x7FFFFFFF
and s2, s2, t0
addi t1, t1, 31 # initialized t1 to 31
addi t2, t2, 1 # t2 = 1
clz_loop:
sll t4, t2, t1 # t4 record 1U shift i
addi t1, t1, -1 # i = i - 1
and t4, t4, s1 # x & (1U << i)
bnez t4, out_clz # break
addi t5, t5, 1 # did not enter the if statement and incremented count
bge t1, zero, clz_loop # return to the clz_loop
out_clz:
addi s3 ,s3 , 5
addi t5, t5, -5
bge t5, s3, out
add t5, x0, x0 # t5 = renorm_shift
out:
addi t3, zero, 0x4000000
add t3, s2, t3
srli t3, t3, 8
addi s3, zero, 0x7F800000
and t3, t3, s3 # s3 = inf_nan_mask
addi s4, s2, -1
srli s4, s4, 31 # s4 = zero mask
sll s2, s2, t5
srli s2, s2, 3
sub t5, zero, t5
addi t5, t5, 0x70
slli t5, t5, 23 # 0x70 - renorm << 23
or t5, t5, s3
xori s4, s4, -1
and t5, t5, s4
add s2, t5, s2
or s1, s1, s2
addi a0, zero, s1
addi a7, zero, 34
ecall
```
### Assamble language of clz(with loop) ###
x is stored in `t0`
clz(x) is store in `t4`
```c
main:
addi t1, zero, 31 # i in loop
addi t2, zero, 1 # 1U
addi t3, zero, 0 # clz(x)
clz_loop:
sll t3, t2, t1 # set mask t3
addi t1, t1, -1 # i = i-1
and t3, t0, t3 # using mask
bnez t3, Done
addi t4, t4, 1
bge t1, zero, clz_loop
```
### Assamble language of clz(without loop) ###
By leveraging the fact that every natural number can be represented in binary, we can implement the solution efficiently.
x is stored in `t0`
clz(x) is stored in `t1`
```c
clz:
addi t1, zero, 32
srli t2, t0, 16
beqz t2, shift_8
addi t1, t1, -16
addi t0, zero, t2
shift_8:
srli t2, t0, 8
beqz t2, shift_4
addi t1, t1, -8
addi t0, zero, t2
shift_4:
srli t2, t0, 4
beqz t2, shift_2
addi t1, t1, -4
addi t0, zero, t2
shift_2:
srli t2, t0, 2
beqz t2, shift_1
addi t1, t1, -2
addi t0, zero, t2
shift_1:
srli t2, t0, 1
beqz t2, finish
addi t1, t1, -1
```
## Exponential-Golomb coding ##
### Introduction ###
>[!Note]
> **The example and code in this article are for Nature number**
Golomb code is based on assumption that larger value of an integer will have less probability of occurrences. It has the advantage that they can be simply encoded and decoded algorithmically without the need for look-up tables. They also possess the property that, for any geometric distribution, it is possible to find an encoding that is an optimal prefix code. Heence, adaptive methods have thus been developed and exploited in standards such as the lossless form of JPEG – JPEG-LS.
The characteristic of this method of representing integers over many other methods is that it can be **quite efficient at representing small numbers** without imposing a limit on the maximum number that can be represented.
### Approach to Golomb coding ###
To encode any nonnegative integer x using the exp-Golomb code:
:::info
1. Write down x+1 in binary
2. Count the bits written, subtract one, and write that number of starting zero bits preceding the previous bit string.
:::
For example,
* we encode 3
* x+1 = 3+1 = 4
* 4 in binary is '100'
* '100' has 3 bits, and 3-1 = 2
* Hence add 2 zeros before '100', which is '00100'
### Why does this coding method enable compression? ###
Consider the following sequence:
```0, 1, 4, 3```
After encoding, the results are as follows:
```1, 010, 00101, 000100```
When we concatenate these bit strings, we obtain:
```101000101000100```
Assuming our bit buffer operates on 8-bit units, we start with 32 bits before encoding. After encoding, the first 8-bit buffer becomes:
```10100010```
The remaining bits, 1000100, are placed in the next buffer:
(filled with 1 additional zero to complete 8 bits)
```10001000```
In the end, the total number of bits used is only 16 bits, in contrast to the original 32 bits.
Next, we will explore how to accelerate the decoding of Golomb codes using the clz.
### C code of `Golomb coding` ###
:::danger
Provide more discussions about the rationale.
:::
This program prints the results after transformation.(include the test number)
```c
#include <stdio.h>
#include <stdint.h>
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
void exp_golomb_encode(int value) {
uint32_t code_num = value + 1;
int leading_zeros = my_clz(code_num);
int bit_length = 32 - leading_zeros;
for (int i = 0; i < bit_length - 1; ++i) {
printf("0");
}
for (int i = bit_length - 1; i >= 0; --i) {
printf("%d", (code_num >> i) & 1);
}
printf("\n");
}
int main() {
int values[] = {0, 1, 2, 3, 4, 5, 7, 15};
int n = sizeof(values) / sizeof(values[0]);
printf("Exponential-Golomb(0) Encodings (Using CLZ):\n");
for (int i = 0; i < n; ++i) {
printf("Value: %d -> Encoding: ", values[i]);
exp_golomb_encode(values[i]);
}
return 0;
}
```
### Assemble language of `Golomb coding` (with loop) ###
```c
.data
argument1: .word 1
argument2: .word 5
argument3: .word 8
text1: .string "The Golomb code of "
text2: .string " is "
endl: .string "\n"
.text
main:
la s0, argument1 # store data address to s0
li s3, 3 # amount of test data
golomb_1:
lw s1, 0(s0) # store test data to s1
addi s2, s1, 1
addi t1, zero, 31 # i for clz
addi t2, zero, 1 # 1U for clz
addi t4, zero, 0 # counter for clz
clz_loop:
sll t3, t2, t1 # mask
addi t1, t1, -1 # i -= 1
and t3, s2, t3 # using mask
bnez t3, golomb_2
addi t4, t4, 1 # counter += 1
bge t1, zero, clz_loop
golomb_2:
sub t4, zero, t4
addi t4, t4, 31 # (32 - clz(x)) - 1
mv t5, t4
print:
la a0, text1 # print "The Golomb code of "
li a7, 4
ecall
mv a0, s1 # print test data
li a7, 1
ecall
la a0, text2 # print " is "
li a7, 4
ecall
li a7, 1 # for print_prefix
mv a0, zero # for print "0"
print_prefix:
ecall
addi t4, t4, -1
bgt t4, zero, print_prefix
print_suffix:
srl t3, s2, t5 # mask
andi a0, t3, 1 # using mask
ecall
addi t5, t5, -1
bge t5, zero, print_suffix
la a0, endl
li a7, 4
ecall
addi s0, s0, 4 # select next test data
addi s3, s3, -1 # num of remainding test data
bne s3, zero, golomb_1
li a7, 10 # termination
ecall
```
### Assemble language of `Golomb coding` (without loop) ###
>[!Tip]
This version includes only the main program. Please add the `.data` and `print:` sections for testing as needed.
```c
main:
la s0, argument1 # store data address to s0
li s3, 3 # amount of test data
golomb_1:
lw s1, 0(s0) # store test data to s1
addi s2, s1, 1
mv t0, s2 # t0 = test data for clz
addi t4, zero, 31 # counter for clz
clz:
srli t2, t0, 16
beq t2, zero, shift_8
addi t4, t4, -16
mv t0, t2
shift_8:
srli t2, t0, 8
beqz t2, shift_4
addi t4, t4, -8
mv t0, t2
shift_4:
srli t2, t0, 4
beqz t2, shift_2
addi t4, t4, -4
mv t0, t2
shift_2:
srli t2, t0, 2
beqz t2, shift_1
addi t4, t4, -2
mv t0, t2
shift_1:
srli t2, t0, 1
beqz t2, golomb_2
addi t4, t4, -1
golomb_2:
sub t4, zero, t4
addi t4, t4, 31 # (32 - clz(x)) - 1
mv t5, t4
```
## Result ##
Both versions of the Golomb coding (in RISC-V ) have the same test data 1, 5, and 8, and the have the same output:
```
The Golomb code of 1 is 010
The Golomb code of 5 is 00110
The Golomb code of 8 is 0001001
Program exited with code: 0
```
The following data is their performance in [Ripes](https://github.com/mortbopet/Ripes), using 5-stage pipelined processor.
| Execution info | With loop | Without loop |
| -------------- | --------- | ------------ |
| Cycles | 945 | 286 |
| Instrs. retired| 686 | 185 |
| CPI | 1.38 | 1.55 |
| IPC | 0.726 | 0.647 |
| Clock Rate | 0 | 0 |
The version without loop has significantly fewer total cycles than the version with loop.
Both the instruction count and total cycles indicate a clear efficiency advantage, however, this comes at the cost of reduced code readability.
## Reference ##
Exponential-Golomb coding - https://en.wikipedia.org/wiki/Exponential-Golomb_coding
Golomb Code -https://www.sciencedirect.com/topics/engineering/golomb-code
Ripes - https://github.com/mortbopet/Ripes