# Implement Variable Byte Compression By Counting Zeros
###### Computer Architecture
## Variable Byte
Variable Byte is a algorithm of data compression. The algorithm main objective is to store data in a smaller space. For example, 0x00000088 can be stored in 1 byte, but the data will be stored in 4 bytes. The rule of Variable Byte is been described as follow:
* The Most Significant Bit (MSB) of a byte represents if there is any byte concatenate with this byte.
* The rest 7 bit of a byte is to store the target data.
| target data | binary | variable byte encode |
| --------- |:-------------------------------------:|:-------------------------------------:|
| 1 | 0b00000000 00000000 00000000 00000001 | 0b00000001 |
| 1023 | 0b00000000 00000000 00000011 11111111 | 0b1000111 01111111 |
| 16384 | 0b00000000 00000000 01000000 00000000 | 0b10000001 10000000 00000000 |
| 268435455 | 0b00001111 11111111 11111111 11111111 | 0b11111111 11111111 11111111 01111111 |
## C code
### Counting Leading Zeros (CLZ)
```c
#include <stdint.h>
uint16_t count_leading_zeros(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & A01 /* Fill this! */ );
x = ((x >> 2) & 0x3333333333333333) + (x & A02 /* Fill this! */);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (64 - (x & 0x7f));
}
```
### VariableByte
#### Version 1
To implement the variable byte, an `uint8_t` pointer and the length of array are loaded as the function parameter which is to store the value every 7 bits.
```c
void encodeVariableByte(uint64_t value, uint8_t *encodebytes, int len) {
for (int i = 0; i < len; i++)
encodebytes[i] = i == 0 ? ((value >> (i * 7)) & 0x7F) & 0x7f : ((value >> (i * 7)) & 0x7F) | 0x80;
}
```
The `len` argument is to show how many `uint_8` should the value use to store.
```c
uint64_t testdata[3] = { 128, 1023, 0xffffffffffffff};
int len1 = (63 - count_leading_zeros(testdata[0])) / 7 + 1;
uint8_t encodedData1[len1];
int len2 = (63 - count_leading_zeros(testdata[1])) / 7 + 1;
uint8_t encodedData2[len2];
int len3 = (63 - count_leading_zeros(testdata[2])) / 7 + 1;
uint8_t encodedData3[len3];
encodeVariableByte(testdata[0], encodedData1, len1);
encodeVariableByte(testdata[1], encodedData2, len2);
encodeVariableByte(testdata[2], encodedData3, len3);
```
#### Version 2
What is the problem of version 1 ? Obviously, the `uint8_t` data type doesn't fit the register size in RISC-V, so the `uint8_t` array must store in memory. Therefore, the instruction of version 1 code will be too long to implement.
```c
void encodeVariableByte(uint64_t value, uint32_t *encodebytes, int len) {
uint32_t bitmask1 = 0x7f;
uint32_t bitmask2 = 0x80;
for (int i = 0; i < len; i++) {
if (i % 4 == 0) {
bitmask1 = 0x7f;
bitmask2 = 0x80;
}
encodebytes[i/4] |= i/4 == 0 ? (value & bitmask1) << i : (((value >> 32) & bitmask1) << (i % 4));
encodebytes[i/4] = i == 0 ? encodebytes[i/4] : (encodebytes[i/4] | bitmask2);
bitmask1 <<= 7;
bitmask2 <<= 8;
}
}
```
So, after modifying the code, what has been improved ? The most important thing is that the array type is been modified to `uint32_t`, and the data type fits the register size of the RISC-V. However, the side effect is that prompts to write more code.

```c
uint32_t bitmask1 = 0x7f;
uint32_t bitmask2 = 0x80;
...
bitmask1 <<= 7;
bitmask2 <<= 8;
```
The code includes two bitmasks, the first bitmask `bitmask1` is to get the data from origin value every 7 bits, and the second bitmask `bitmask2` is to generate the most significant bit of every byte like the figure above. Finally, the loop left shift the bitmask to get the higher bit of the value.
As the test data is given as below, the result shows the code is functioning well.
```c
uint64_t testdata[3] = { 128, 0xfffffff, 0xfffffffffffffff};
int len1 = (63 - count_leading_zeros(testdata[0])) / 7 + 1;
printf("%d\n", len1);
uint32_t encodedData1[1] = {0};
int len2 = (63 - count_leading_zeros(testdata[1])) / 7 + 1;
uint32_t encodedData2[1] = {0};
int len3 = (63 - count_leading_zeros(testdata[2])) / 7 + 1;
uint32_t encodedData3[2] = {0, 0};
encodeVariableByte(testdata[0], encodedData1, len1);
encodeVariableByte(testdata[1], encodedData2, len2);
encodeVariableByte(testdata[2], encodedData3, len3);
```
The test result:
```
Encoded Bytes: 8100
Encoded Bytes: ffffff7f
Encoded Bytes: ffffff7f ffffffff
```
#### Version 3
In version 2, a basic variable byte algorithm has been provided, but there is an error in it. Take a look at the sheet below, it shows that there are only 28 bits can store in a 32 bit space because of the MSB in every byte.
| target data | binary | variable byte encode |
| --------- |:-------------------------------------:|:-------------------------------------:|
| 268435455 | 0b00001111 11111111 11111111 11111111 | 0b11111111 11111111 11111111 01111111 |
So the code is modified to left shift 28 bit value in higher bit.
```diff
void encodeVariableByte(uint64_t value, uint32_t *encodebytes, int len) {
uint32_t bitmask1 = 0x7f;
uint32_t bitmask2 = 0x80;
for (int i = 0; i < len; i++) {
if (i % 4 == 0) {
bitmask1 = 0x7f;
bitmask2 = 0x80;
}
- encodebytes[i/4] |= i/4 == 0 ? (value & bitmask1) << i : (((value >> 32) & bitmask1) << (i % 4));
+ encodebytes[i/4] |= i/4 == 0 ? (value & bitmask1) << i : (((value >> 28) & bitmask1) << (i % 4));
encodebytes[i/4] = i == 0 ? encodebytes[i/4] : (encodebytes[i/4] | bitmask2);
bitmask1 <<= 7;
bitmask2 <<= 8;
}
}
```
Therefore the test case is modified to `0xffffffffffffff`, because `0xfffffffffffffff` can not be store in two registers.
```diff
-uint64_t testdata[3] = { 128, 0xfffffff, 0xfffffffffffffff};
+uint64_t testdata[3] = { 128, 0xfffffff, 0xffffffffffffff};
```
The test result:
```
Encoded Bytes: 8100
Encoded Bytes: ffffff7f
Encoded Bytes: ffffff7f ffffffff
```
## Assembly code
### Counting Leading Zeros
First work is to implement the function `Counting_Leading_Zeros`,and the tricky part of it is the argument `uint64_t`. In RISC-V, 64 bit should be stored in two register. So, the higher bit is stored in `a0`, and the lower bit is stored in `a1` in the CLZ function.
```
CLZ:
# a0 presents higher bit, a1 presents lower bit
# a0 store in s2, a1 store s3
addi s2, a0, 0
addi s3, a1, 0
```
When the `>>` operator is used, the lower bit register should do the right shift first and get the least significant bit from the higher bit. Then the higher bit right shift the bit at last.
```
# x |= x >> 1;
srli a1, s3, 1
slli s4, s2, 31
or a1, a1, s4
srli a0, s2, 1 # a0,a1 = x >> 1, s2,s3 = x
or s3, a1, s3
or s2, a0, s2 # x |= (x >> 1)
```
`x |= x >> 1` is the sample of the how to right shift the 64 bit in two 32 register, and the rest of function can be done as the same way.
**full Assembly code of CLZ**
```
CLZ:
# a0 presents higher bit, a1 presents lower bit
# a0 store in s2, a1 store s3
addi s2, a0, 0
addi s3, a1, 0
# x |= x >> 1;
srli a1, s3, 1
slli s4, s2, 31
or a1, a1, s4
srli a0, s2, 1 # a0,a1 = x >> 1, s2,s3 = x
or s3, a1, s3
or s2, a0, s2 # x |= (x >> 1)
# x |= x >> 2;
srli a1, s3, 2
slli s4, s2, 30
or a1, a1, s4
srli a0, s2, 2 # a0,a1 = x >> 2, s2,s3 = x
or s3, a1, s3
or s2, a0, s2 # x |= (x >> 2)
# x |= x >> 4;
srli a1, s3, 4
slli s4, s2, 26
or a1, a1, s4
srli a0, s2, 4 # a0,a1 = x >> 4, s2,s3 = x
or s3, a1, s3
or s2, a0, s2 # x |= (x >> 4)
# x |= x >> 8;
srli a1, s3, 8
slli s4, s2, 24
or a1, a1, s4
srli a0, s2, 8 # a0,a1 = x >> 8, s2,s3 = x
or s3, a1, s3
or s2, a0, s2 # x |= (x >> 8)
# x |= x >> 16;
srli a1, s3, 16
slli s4, s2, 16
or a1, a1, s4
srli a0, s2, 16 # a0,a1 = x >> 16, s2,s3 = x
or s3, a1, s3
or s2, a0, s2 # x |= (x >> 16)
# x |= x >> 32;
li a1, 0
or a1, a1, s2
li a0, 0 # a0,a1 = x >> 16, s2,s3 = x
or s3, a1, s3
or s2, a0, s2 # x |= (x >> 32)
# load 0x5555555555555555
la t0, and3
lw t1, 0(t0)
lw t2, 4(t0)
srli a1, s3, 1
slli s4, s2, 31
or a1, a1, s4
srli a0, s2, 1 # a0,a1 = x >> 1, s2,s3 = x
and a1, a1, t2
and a0, a0, t1
sub s3, s3, a1
sub s2, s2, a0
# load 0x3333333333333333
la t0, and1
lw t1, 0(t0)
lw t2, 4(t0)
srli a1, s3, 2
slli s4, s2, 30
or a1, a1, s4
srli a0, s2, 2 # a0,a1 = x >> 2, s2,s3 = x
and a1, a1, t2 # lower bit & 0x33333333
and a0, a0, t1 # higher bit & 0x33333333
and s3, s3, t2 # lower bit & 0x33333333
and s2, s2, t1 # higher bit & 0x33333333
add s3, s3, a1 # lower bit add
add s2, s2, a0 # higher bit add
# load 0x0f0f0f0f0f0f0f0f
la t0, and2
lw t1, 0(t0)
lw t2, 4(t0)
srli a1, s3, 4
slli s4, s2, 30
or a1, a1, s4
srli a0, s2, 4 # a0,a1 = x >> 4, s2,s3 = x
add s3, s3, a1 # lower bit add
add s2, s2, a0 # higher bit add
and s3, s3, t2 # lower bit & 0x0f0f0f0f
and s2, s2, t1 # higher bit & 0x0f0f0f0f
srli a1, s3, 8
slli s4, s2, 24
or a1, a1, s4
srli a0, s2, 8 # a0,a1 = x >> 8, s2,s3 = x
add s3, s3, a1
add s2, s2, a0
srli a1, s3, 16
slli s4, s2, 16
or a1, a1, s4
srli a0, s2, 16 # a0,a1 = x >> 16, s2,s3 = x
add s3, s3, a1
add s2, s2, a0
li a1, 0
or a1, a1, s2
li a0, 0 # a0,a1 = x >> 32, s2,s3 = x
add s3, s3, a1
li s2, 0
andi s3, s3, 0x7f
li a0, 64
sub a0, a0, s3
ret
```
### Variable Bytes
Second part of the assembly code is variable byte. It will be seperated by different label. The label `encodeVariableByte` presents when the encoding variable bytes is started. In the label, some registers are initialized to start as the bitmask or the counter of the loop.
```
encodeVariableByte:
li t0, 0x7f # bitmask1
li t1, 0x80 # bitmask2
li t2, 0 # i = 0
li t3, 4
li t5, 0
j first_register
```
The `first_register` is to transform the value to variable byte. The branch `not_firstbyte` is to present that the first byte in variable byte and the rest of bytes aren't the same because of the most significant bit of each byte.
```
first_register:
div a5, t2, t3
bne a5, t5, reset_bitmask
and t4, a1, t0 # value & bitmask1
sll t4, t4, t2 # (value & bitmask1) << i
or a3, a3, t4
bne t2, t5, not_firstbyte
slli t0, t0, 7
slli t1, t1, 8
addi t2, t2, 1
j first_register
not_firstbyte:
or a3, a3, t1
slli t0, t0, 7
slli t1, t1, 8
addi t2, t2, 1
beq t2, a0, exit
j first_register
```
After the `first_register` is finished, the bitmask is reset. Then, the `second_register` is activated to store the upper 32 byte in the second register.
```
exit:
jr ra
second_register:
and t4, a2, t0
sll t4, t4, t6
or a4, a4, t4
or a4, a4, t1
slli t0, t0, 7
slli t1, t1, 8
li t5, 7
beq t2, t5, exit
addi t2, t2, 1
beq t2, a0, exit
j second_register
reset_bitmask:
li t0, 0x7f # bitmask1
li t1, 0x80 # bitmask2
slli a2, a2, 4
srli s1, a1, 28
or a2, a2, s1
j second_register
```
### Test case
The 64 bit test case is also stored in two registers, so in order to load the word the program has to get access into the memory to load the data from the memory.
```
.data
testdata:
.word 0, 128
.word 0, 0xfffffff
.word 0xffffff, 0xffffffff
```
The 3,4 and 5 line are to load the testdata from the memory to two registers to do the rest operation.
```=
.text
main:
la t6, testdata
lw a0, 0(t6)
lw a1, 4(t6)
jal ra, CLZ
li t1, 63
sub a0, t1, a0
li t1, 7
div a0, a0, t1
addi a0, a0, 1 #a0 = len1
lw a1, 4(t6) #a1,a2 = value
lw a2, 0(t6)
la t5, encodedData1 # uint32_t encodedData1[1] = {0}
lw a3, 0(t5)
jal ra, encodeVariableByte
mv t6, a3
jal ra, printHEX
la t6, testdata
lw a0, 8(t6)
lw a1, 12(t6)
jal ra, CLZ
li t1, 63
sub a0, t1, a0
li t1, 7
div a0, a0, t1
addi a0, a0, 1 #a0 = len2
lw a1, 12(t6) #a1,a2 = value
lw a2, 8(t6)
la t5, encodedData2 # uint32_t encodedData2[1] = {0}
lw a3, 0(t5)
jal ra, encodeVariableByte
mv t6, a3
jal ra, printHEX
la t6, testdata
lw a0, 16(t6)
lw a1, 20(t6)
jal ra, CLZ
li t1, 63
sub a0, t1, a0
li t1, 7
div a0, a0, t1
addi a0, a0, 1 #a0 = len3
lw a1, 20(t6) #a1,a2 = value
lw a2, 16(t6)
la t5, encodedData3 # uint32_t encodedData3[2] = {0,0}
lw a3, 0(t5)
lw a4, 4(t5)
jal ra, encodeVariableByte
mv t6, a3
jal ra, printHEX
mv t6, a4
jal ra, printHEX
li a7, 10
ecall
```
`printHEX` uses the `ecall` to print the number in hexadecimal and print the value stored in register `t6`.
```=
printHEX:
la a0, EncodedBytes
li a7, 4
ecall
mv a0, t6 # return in register a0
li a7, 34
ecall
la a0, \n
li a7, 4
ecall
ret
```

## Analysis
The program is simulated on the [Ripes](https://github.com/mortbopet/Ripes). Ripes provides different kinds of processor for users to choose such as single cycle processor, 5-stage processor etc.
### 5-stage pipelined processor
The program is run on the 5-stage processor which includes IF,ID,EX,MEM and WB stages.
* Instruction fetch (IF)
* Instruction decode and register fetch (ID)
* Execute (EX)
* Memorty access (MEM)
* Register write back (WB)
All the assembly code will go through the stages like the example below.


#### IF

* PC = PC + 4
* Program Counter (PC) gives next instruction address to instruction memory.
* The instruction `addi` is in the memory address `0x14`

* Instruction `addi` can be translated to `03f00313` from the figure above.
#### ID

* Instruction `0x1a8000ef` is decoded into three parts:
* opcode `JAL`
* R1 idx & R2 idx
* Imm. `0x000001a8`
* Wr idx `0x01`
* Read the value from R1 idx and R2 idx which are both 0
#### EX

* ALU adds two operand together which one is r1_out and another is imm_out.
* PC and next PC value just send through the stage without any operation.
#### MEM

* Read the word from Data memory, and the result is 0.
#### WB

### Improvement
The assembly code below is the origin code of CLZ. The code shows that register `a1` use OR operation to get the value from `s2` and then adds with `s3`, but the code can be simplified to `s3` adds `s2`.
The reason is because that if the number digits which right shift `>>` is more 32, it doesn't matter what is the value in the higher bit since it will become 0.
```
...
li a1, 0
or a1, a1, s2
li a0, 0 # a0,a1 = x >> 32, s2,s3 = x
add s3, s3, a1
li s2, 0
andi s3, s3, 0x7f
li a0, 64
sub a0, a0, s3 # return in register t0
jr ra
```
The clock cycles of program:

So the new version assembly code is showed below.
```diff
...
- li a1, 0
- or a1, a1, s2
- li a0, 0 # a0,a1 = x >> 32, s2,s3 = x
- add s3, s3, a1
+ add s3, s3, s2
- li s2, 0
andi s3, s3, 0x7f
li a0, 64
sub a0, a0, s3 # return in register t0
jr ra
```
The clock cycles of improved program:
