# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <[`st10740`](https://github.com/st10740/Computer-Architecture-HW/tree/main/HW1)>
## Linear convolution using bfloat16
### Idea
#### bfloat16
The bfloat16 (BF16) format was developed by Google Brain for artificial-intelligence workloads to accelerate computation and save storage. It uses only half the number of bits of FP32 to represent a floating-point number, and the smaller bit width speeds up arithmetic operations such as addition and multiplication.
Below is a visual comparison of FP32 and BF16.
- FP32
| sign (1 bit) | exponent (8 bits) | mantissa (23 bits) |
| ---- | --------- | -------------------------------- |
| 31 | 30 ----------- 23 | 22 --------------------- 0 |
- BF16
| sign (1 bit) | exponent (8 bits) | mantissa (7 bits) |
| ---- | --------- | --------------------- |
| 15 | 14 ----------- 7 | 6 ----------- 0 |
After conversion from FP32 to BF16, the exponent bits are preserved, so BF16 can represent the same range of values as FP32. The number of mantissa bits is reduced from 23 to 7, which reduces precision but saves storage.
#### Linear convolution
Linear convolution is a fundamental concept in Digital Signal Processing (DSP) with a wide range of applications. It plays a pivotal role in analyzing, processing, and transforming signals in various domains, such as audio, image, and communication systems. Given two functions f(n) and h(n), their linear convolution is expressed as follows:$$f(n) * h(n) = \sum_{k=-\infty}^{\infty} f(k) \, h(n-k)$$
We can notice that this formula involves a large number of floating-point additions and multiplications. To speed up the computation, I want to try using bfloat16 for it.
## Implementation
### C program
`fp32_to_bf16` is forked from [Quiz1 - Problem B](https://hackmd.io/@sysprog/arch2023-quiz1-sol)
#### Test cases
| signal | impulse | Case |
| -------- | -------- | -------- |
| [1.2, 1.3, 1.4] | [2.1, 3.2, 2.1] | normal |
| [1.1, 1.2, 1.3] | [0.0, 0.0, 0.0] | contains zero |
| [1.4, 1.5, 1.6] | [1.7, 1.8, 1.9] | multiplication may produce a carry |
#### Code
```c!=
#include <stdio.h>
#include <stdlib.h>

float fp32_to_bf16(float x) {
    float y = x;
    int *p = (int *) &y;
    unsigned int exp = *p & 0x7F800000;
    unsigned int man = *p & 0x007FFFFF;
    if (exp == 0 && man == 0) /* zero */
        return x;
    if (exp == 0x7F800000) /* infinity or NaN */
        return x;
    /* Normalized number: round to nearest */
    float r = x;
    int *pr = (int *) &r;
    *pr &= 0xFF800000; /* r has the same sign and exponent as x */
    r /= 0x100;        /* r = 2^(N-8), the rounding increment */
    y = x + r;
    *p &= 0xFFFF0000;  /* truncate to the upper 16 bits */
    return y;
}

void fp32_to_bf16_in_arr(float *arr, int len) {
    while (len) {
        *arr = fp32_to_bf16(*arr);
        arr += 1;
        len--;
    }
}

void printResult(float *signal, int len) {
    for (int i = 0; i < len; i++)
        printf("%f ", *(signal + i));
    printf("\n");
}

float *convolution_with_bf16(float *signal, float *impulse, int sLen, int impulseLen) {
    int l = sLen + impulseLen - 1;
    float *output = (float *) malloc(sizeof(float) * l);
    /* convert signal and impulse from FP32 format to BF16 format */
    fp32_to_bf16_in_arr(signal, sLen);
    fp32_to_bf16_in_arr(impulse, impulseLen);
    /* do linear convolution */
    for (int n = 0; n < l; n++) {
        output[n] = 0.0;
        for (int k = 0; k < sLen; k++) {
            if ((n - k) >= 0 && (n - k) < impulseLen)
                output[n] += signal[k] * impulse[n - k];
        }
    }
    return output;
}

int main() {
    float signal1[3] = {1.2, 1.3, 1.4};
    float impulse1[3] = {2.1, 3.2, 2.1};
    float *result1 = convolution_with_bf16(signal1, impulse1, 3, 3);
    printResult(result1, 3 + 3 - 1);
    free(result1);
    return 0;
}
```
### Assembly code
Below is the assembly implementation of `fp32_to_bf16`, with test cases to ensure the program works as expected.
#### Test Cases
| HEX | binary | Case |
| -------- | -------- | -------- |
| 0x0 | 00000000000000000000000000000000 | exponent=0x0 & mantissa=0x0|
| 0x7F800000 | 01111111100000000000000000000000 | exponent=0xFF |
| 0x700AB000 | 01110000000010101011000000000000 | normalized positive number |
| 0xF00AB000 | 11110000000010101011000000000000 | normalized negative number |
| 0x707F8FFF | 01110000011111111000111111111111 | normalized number with overflow after rounding to nearest |
#### Code
```asm!=
.data
test1: .word 0x0        # exponent=0x0 & mantissa=0x0
test2: .word 0x7F800000 # exponent=0xFF (infinity or NaN)
test3: .word 0x700AB000 # normalized positive number
test4: .word 0xF00AB000 # normalized negative number
test5: .word 0x707F8FFF # normalized number with overflow after rounding to nearest
exp_mask: .word 0x7F800000
man_mask: .word 0x7FFFFF
round: .word 0x8000
bfloat16_man_plus_one_mask: .word 0x7F8000
bfloat16_mask: .word 0xFFFF0000
msg_before_str: .string "before convert to bfloat16: "
msg_after_str: .string "after convert to bfloat16: "
end_str: .string "\n"
.text
main:
    lw a0, test1
    jal do_fp32_to_bf16_and_print
    lw a0, test2
    jal do_fp32_to_bf16_and_print
    lw a0, test3
    jal do_fp32_to_bf16_and_print
    lw a0, test4
    jal do_fp32_to_bf16_and_print
    lw a0, test5
    jal do_fp32_to_bf16_and_print
    li a7, 10                        # exit
    ecall
do_fp32_to_bf16_and_print:
    addi sp, sp, -8
    sw ra, 0(sp)
    sw a0, 4(sp)
    la a0, msg_before_str
    li a7, 4
    ecall                            # print "before convert to bfloat16: "
    lw a0, 4(sp)
    li a7, 2
    ecall                            # print FP32 value
    la a0, end_str
    li a7, 4
    ecall                            # print "\n"
    la a0, msg_after_str
    li a7, 4
    ecall                            # print "after convert to bfloat16: "
    lw a0, 4(sp)
    jal fp32_to_bf16
    li a7, 2
    ecall                            # print BF16 value
    la a0, end_str
    li a7, 4
    ecall                            # print "\n"
    lw ra, 0(sp)
    lw a0, 4(sp)
    addi sp, sp, 8
    jr ra
fp32_to_bf16:
    addi sp, sp, -4
    sw ra, 0(sp)
    add t0, a0, zero                 # copy a0 to t0
    lw t1, exp_mask
    and t2, t0, t1                   # t2: exponent bits of t0
    beq t1, t2, return_fp32_to_bf16  # t0 is infinity or NaN, just return it
    beq t2, zero, exp_is_zero
normalized_number:
    lw t1, bfloat16_man_plus_one_mask
    lw t2, round
    and t3, t0, t1                   # t3: mantissa bits 15-22 of t0
    beq t1, t3, is_overflow          # rounding to nearest would overflow into the exponent
    add t0, t0, t2                   # do round to nearest
    lw t1, bfloat16_mask
    and a0, t0, t1                   # a0: keep the upper 16 bits of t0
return_fp32_to_bf16:
    lw ra, 0(sp)
    addi sp, sp, 4
    jr ra
exp_is_zero:
    lw t1, man_mask
    and t1, t1, t0                   # t1: mantissa bits of t0
    bne t1, zero, normalized_number  # if mantissa != 0, treat t0 like a normalized number
    j return_fp32_to_bf16            # else t0 is zero, just return it
is_overflow:
    lw t1, bfloat16_mask
    and a0, t0, t1                   # round down instead
    j return_fp32_to_bf16
```
The trickiest part is converting the lines below into assembly code:
```c!
*pr &= 0xFF800000; /* r has the same exp as x */
r /= 0x100;
y = x + r;
```
These lines require floating-point division and addition; however, only RV32I instructions are allowed in this homework. To overcome this, I took a deep dive into what these three lines do and found that their only effect is to add 1 at the 8th binary place after the point, for rounding to nearest.
Suppose $x = (1.xxx)_2 \times 2^N$; since `r` is initialized from `x`, $r = (1.xxx)_2 \times 2^N$ as well.
After `*pr &= 0xFF800000`, $r = (1.0)_2 \times 2^N$.
After `r /= 0x100`, $r = (1.0)_2 \times 2^{N-8}$.
So $y = x + r = (1.xxx)_2 \times 2^N + (1.0)_2 \times 2^{N-8} = (1.xxx)_2 \times 2^N + (0.00000001)_2 \times 2^N$
$= (1.xxx + 0.00000001)_2 \times 2^N$
Therefore, I replace the division and addition operations with the strategy below (`t0` holds the floating-point value):
```asm!
.data
round: .word 0x8000
.text
    lw t2, round
    add t0, t0, t2
```
---
Actually, I haven't finished converting the other parts of the C code to assembly yet, because I'm stuck on implementing bfloat16 addition and multiplication with integer instructions. I will try to finish these parts as soon as possible.
## Analysis
I use the [Ripes](https://github.com/mortbopet/Ripes) simulator for the analysis.
### Five stage RISC pipeline
1. **Instruction fetch (IF)**
    - The instruction is fetched from memory
2. **Instruction decode and register fetch (ID)**
    - Decode the instruction and read the source registers
3. **Execute (EX)**
    - Perform the operation specified by the instruction
4. **Memory access (MEM)**
    - Load data from memory or store data into memory
5. **Register write back (WB)**
    - Write the result to the destination register

I take `add t0, t0, t2` at line 82 as an example to explain each stage.
#### Instruction fetch (IF)
- The instruction word `0x007282b3` is fetched from instruction memory into the IF/ID pipeline register.
- PC is updated to PC+4 to fetch the next instruction; its value changes from `0x000000f4` to `0x000000f8`.

#### Instruction decode and register fetch (ID)
- The instruction is decoded for the following stages.
- The decoded opcode indicates that it is an `ADD` instruction.
- The register-1 field is `0x05`, i.e. `x5` (`t0`); the register-2 field is `0x07`, i.e. `x7` (`t2`); the destination-register field is `0x05`, i.e. `x5` (`t0`).
- We can notice that source register 1 and the destination register are the same register.

#### Execute (EX)
- `ADD` is executed by the ALU in this stage.
- The value `0x700ab000` in register 1 (`x5`) and the value `0x00001f40` in register 2 (`x7`) are added by the ALU, whose output is the result `0x700acf40`.

#### Memory access (MEM)
- Since this instruction is R-type, it does nothing in this stage.
- The data memory has two inputs, the address of the data to be loaded or stored and the value to be written, and one output, the value loaded from the given address.
- Although an R-type instruction does nothing here, we can see from the illustration that the ALU result and the register-2 value are still presented at the data memory inputs as they pass through this stage.

#### Register write back (WB)
- The result `0x700acf40` calculated by ALU is written back to destination register `x5`.

### Memory update
As mentioned above, the source and destination registers of `add x5, x5, x7` are the same register. Before `add x5, x5, x7` completes, `x5` holds `0x700ab000`; after it completes, `x5` is updated to `0x700acf40`.
#### `add x5, x5, x7` in the WB stage

#### `add x5, x5, x7` out of the WB stage

## Assembly optimization
### Original execution info

### Duplicate code
- **Observation**: `main` contains a lot of code doing the same thing; only the test data differ.
- **Implementation**
    - Put all the test data into one array and iterate over them with a loop.
**Before**
```asm!=
.data
test1: .word 0x0
test2: .word 0x7F800000
test3: .word 0x700AB000
test4: .word 0xF00AB000
test5: .word 0x707F8FFF
.text
main:
    lw a0, test1
    jal do_fp32_to_bf16_and_print
    lw a0, test2
    jal do_fp32_to_bf16_and_print
    lw a0, test3
    jal do_fp32_to_bf16_and_print
    lw a0, test4
    jal do_fp32_to_bf16_and_print
    lw a0, test5
    jal do_fp32_to_bf16_and_print
    li a7, 10               # exit
    ecall
```
**After**
```asm!=
.data
test: .word 0x00000000, 0x7F800000, 0x700AB000, 0xF00AB000, 0x707F8FFF
.text
main:
    la s0, test
    li s1, 0                # i
    li s2, 5                # n
test_loop:
    beq s1, s2, exit        # if i == n, break
    lw a0, 0(s0)            # a0: test[i]
    jal do_fp32_to_bf16_and_print
    addi s0, s0, 4          # s0: address of test[i+1]
    addi s1, s1, 1          # i++
    j test_loop
exit:
    li a7, 10               # exit
    ecall
```
We can see the execution info below. Although the number of cycles increases after the modification, the code becomes cleaner and easier to extend.

## Reference
- https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
- https://en.wikipedia.org/wiki/Convolution
- https://www.geeksforgeeks.org/linear-convolution-using-c-and-matlab/