# Implement Variable Byte Compression By Counting Zeros ###### Computer Architecture ## Variable Byte Variable Byte is a algorithm of data compression. The algorithm main objective is to store data in a smaller space. For example, 0x00000088 can be stored in 1 byte, but the data will be stored in 4 bytes. The rule of Variable Byte is been described as follow: * The Most Significant Bit (MSB) of a byte represents if there is any byte concatenate with this byte. * The rest 7 bit of a byte is to store the target data. | target data | binary | variable byte encode | | --------- |:-------------------------------------:|:-------------------------------------:| | 1 | 0b00000000 00000000 00000000 00000001 | 0b00000001 | | 1023 | 0b00000000 00000000 00000011 11111111 | 0b1000111 01111111 | | 16384 | 0b00000000 00000000 01000000 00000000 | 0b10000001 10000000 00000000 | | 268435455 | 0b00001111 11111111 11111111 11111111 | 0b11111111 11111111 11111111 01111111 | ## C code ### Counting Leading Zeros (CLZ) ```c #include <stdint.h> uint16_t count_leading_zeros(uint64_t x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); x |= (x >> 32); /* count ones (population count) */ x -= ((x >> 1) & A01 /* Fill this! */ ); x = ((x >> 2) & 0x3333333333333333) + (x & A02 /* Fill this! */); x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f; x += (x >> 8); x += (x >> 16); return (64 - (x & 0x7f)); } ``` ### VariableByte #### Version 1 To implement the variable byte, an `uint8_t` pointer and the length of array are loaded as the function parameter which is to store the value every 7 bits. ```c void encodeVariableByte(uint64_t value, uint8_t *encodebytes, int len) { for (int i = 0; i < len; i++) encodebytes[i] = i == 0 ? ((value >> (i * 7)) & 0x7F) & 0x7f : ((value >> (i * 7)) & 0x7F) | 0x80; } ``` The `len` argument is to show how many `uint_8` should the value use to store. ```c uint64_t testdata[3] = { 128, 1023, 0xffffffffffffff}; int len1 = (63 - count_leading_zeros(testdata[0])) / 7 + 1; uint8_t encodedData1[len1]; int len2 = (63 - count_leading_zeros(testdata[1])) / 7 + 1; uint8_t encodedData2[len2]; int len3 = (63 - count_leading_zeros(testdata[2])) / 7 + 1; uint8_t encodedData3[len3]; encodeVariableByte(testdata[0], encodedData1, len1); encodeVariableByte(testdata[1], encodedData2, len2); encodeVariableByte(testdata[2], encodedData3, len3); ``` #### Version 2 What is the problem of version 1 ? Obviously, the `uint8_t` data type doesn't fit the register size in RISC-V, so the `uint8_t` array must store in memory. Therefore, the instruction of version 1 code will be too long to implement. ```c void encodeVariableByte(uint64_t value, uint32_t *encodebytes, int len) { uint32_t bitmask1 = 0x7f; uint32_t bitmask2 = 0x80; for (int i = 0; i < len; i++) { if (i % 4 == 0) { bitmask1 = 0x7f; bitmask2 = 0x80; } encodebytes[i/4] |= i/4 == 0 ? (value & bitmask1) << i : (((value >> 32) & bitmask1) << (i % 4)); encodebytes[i/4] = i == 0 ? encodebytes[i/4] : (encodebytes[i/4] | bitmask2); bitmask1 <<= 7; bitmask2 <<= 8; } } ``` So, after modifying the code, what has been improved ? The most important thing is that the array type is been modified to `uint32_t`, and the data type fits the register size of the RISC-V. However, the side effect is that prompts to write more code. ![](https://hackmd.io/_uploads/Hy1RrEl-p.png) ```c uint32_t bitmask1 = 0x7f; uint32_t bitmask2 = 0x80; ... bitmask1 <<= 7; bitmask2 <<= 8; ``` The code includes two bitmasks, the first bitmask `bitmask1` is to get the data from origin value every 7 bits, and the second bitmask `bitmask2` is to generate the most significant bit of every byte like the figure above. Finally, the loop left shift the bitmask to get the higher bit of the value. As the test data is given as below, the result shows the code is functioning well. ```c uint64_t testdata[3] = { 128, 0xfffffff, 0xfffffffffffffff}; int len1 = (63 - count_leading_zeros(testdata[0])) / 7 + 1; printf("%d\n", len1); uint32_t encodedData1[1] = {0}; int len2 = (63 - count_leading_zeros(testdata[1])) / 7 + 1; uint32_t encodedData2[1] = {0}; int len3 = (63 - count_leading_zeros(testdata[2])) / 7 + 1; uint32_t encodedData3[2] = {0, 0}; encodeVariableByte(testdata[0], encodedData1, len1); encodeVariableByte(testdata[1], encodedData2, len2); encodeVariableByte(testdata[2], encodedData3, len3); ``` The test result: ``` Encoded Bytes: 8100 Encoded Bytes: ffffff7f Encoded Bytes: ffffff7f ffffffff ``` #### Version 3 In version 2, a basic variable byte algorithm has been provided, but there is an error in it. Take a look at the sheet below, it shows that there are only 28 bits can store in a 32 bit space because of the MSB in every byte. | target data | binary | variable byte encode | | --------- |:-------------------------------------:|:-------------------------------------:| | 268435455 | 0b00001111 11111111 11111111 11111111 | 0b11111111 11111111 11111111 01111111 | So the code is modified to left shift 28 bit value in higher bit. ```diff void encodeVariableByte(uint64_t value, uint32_t *encodebytes, int len) { uint32_t bitmask1 = 0x7f; uint32_t bitmask2 = 0x80; for (int i = 0; i < len; i++) { if (i % 4 == 0) { bitmask1 = 0x7f; bitmask2 = 0x80; } - encodebytes[i/4] |= i/4 == 0 ? (value & bitmask1) << i : (((value >> 32) & bitmask1) << (i % 4)); + encodebytes[i/4] |= i/4 == 0 ? (value & bitmask1) << i : (((value >> 28) & bitmask1) << (i % 4)); encodebytes[i/4] = i == 0 ? encodebytes[i/4] : (encodebytes[i/4] | bitmask2); bitmask1 <<= 7; bitmask2 <<= 8; } } ``` Therefore the test case is modified to `0xffffffffffffff`, because `0xfffffffffffffff` can not be store in two registers. ```diff -uint64_t testdata[3] = { 128, 0xfffffff, 0xfffffffffffffff}; +uint64_t testdata[3] = { 128, 0xfffffff, 0xffffffffffffff}; ``` The test result: ``` Encoded Bytes: 8100 Encoded Bytes: ffffff7f Encoded Bytes: ffffff7f ffffffff ``` ## Assembly code ### Counting Leading Zeros First work is to implement the function `Counting_Leading_Zeros`,and the tricky part of it is the argument `uint64_t`. In RISC-V, 64 bit should be stored in two register. So, the higher bit is stored in `a0`, and the lower bit is stored in `a1` in the CLZ function. ``` CLZ: # a0 presents higher bit, a1 presents lower bit # a0 store in s2, a1 store s3 addi s2, a0, 0 addi s3, a1, 0 ``` When the `>>` operator is used, the lower bit register should do the right shift first and get the least significant bit from the higher bit. Then the higher bit right shift the bit at last. ``` # x |= x >> 1; srli a1, s3, 1 slli s4, s2, 31 or a1, a1, s4 srli a0, s2, 1 # a0,a1 = x >> 1, s2,s3 = x or s3, a1, s3 or s2, a0, s2 # x |= (x >> 1) ``` `x |= x >> 1` is the sample of the how to right shift the 64 bit in two 32 register, and the rest of function can be done as the same way. **full Assembly code of CLZ** ``` CLZ: # a0 presents higher bit, a1 presents lower bit # a0 store in s2, a1 store s3 addi s2, a0, 0 addi s3, a1, 0 # x |= x >> 1; srli a1, s3, 1 slli s4, s2, 31 or a1, a1, s4 srli a0, s2, 1 # a0,a1 = x >> 1, s2,s3 = x or s3, a1, s3 or s2, a0, s2 # x |= (x >> 1) # x |= x >> 2; srli a1, s3, 2 slli s4, s2, 30 or a1, a1, s4 srli a0, s2, 2 # a0,a1 = x >> 2, s2,s3 = x or s3, a1, s3 or s2, a0, s2 # x |= (x >> 2) # x |= x >> 4; srli a1, s3, 4 slli s4, s2, 26 or a1, a1, s4 srli a0, s2, 4 # a0,a1 = x >> 4, s2,s3 = x or s3, a1, s3 or s2, a0, s2 # x |= (x >> 4) # x |= x >> 8; srli a1, s3, 8 slli s4, s2, 24 or a1, a1, s4 srli a0, s2, 8 # a0,a1 = x >> 8, s2,s3 = x or s3, a1, s3 or s2, a0, s2 # x |= (x >> 8) # x |= x >> 16; srli a1, s3, 16 slli s4, s2, 16 or a1, a1, s4 srli a0, s2, 16 # a0,a1 = x >> 16, s2,s3 = x or s3, a1, s3 or s2, a0, s2 # x |= (x >> 16) # x |= x >> 32; li a1, 0 or a1, a1, s2 li a0, 0 # a0,a1 = x >> 16, s2,s3 = x or s3, a1, s3 or s2, a0, s2 # x |= (x >> 32) # load 0x5555555555555555 la t0, and3 lw t1, 0(t0) lw t2, 4(t0) srli a1, s3, 1 slli s4, s2, 31 or a1, a1, s4 srli a0, s2, 1 # a0,a1 = x >> 1, s2,s3 = x and a1, a1, t2 and a0, a0, t1 sub s3, s3, a1 sub s2, s2, a0 # load 0x3333333333333333 la t0, and1 lw t1, 0(t0) lw t2, 4(t0) srli a1, s3, 2 slli s4, s2, 30 or a1, a1, s4 srli a0, s2, 2 # a0,a1 = x >> 2, s2,s3 = x and a1, a1, t2 # lower bit & 0x33333333 and a0, a0, t1 # higher bit & 0x33333333 and s3, s3, t2 # lower bit & 0x33333333 and s2, s2, t1 # higher bit & 0x33333333 add s3, s3, a1 # lower bit add add s2, s2, a0 # higher bit add # load 0x0f0f0f0f0f0f0f0f la t0, and2 lw t1, 0(t0) lw t2, 4(t0) srli a1, s3, 4 slli s4, s2, 30 or a1, a1, s4 srli a0, s2, 4 # a0,a1 = x >> 4, s2,s3 = x add s3, s3, a1 # lower bit add add s2, s2, a0 # higher bit add and s3, s3, t2 # lower bit & 0x0f0f0f0f and s2, s2, t1 # higher bit & 0x0f0f0f0f srli a1, s3, 8 slli s4, s2, 24 or a1, a1, s4 srli a0, s2, 8 # a0,a1 = x >> 8, s2,s3 = x add s3, s3, a1 add s2, s2, a0 srli a1, s3, 16 slli s4, s2, 16 or a1, a1, s4 srli a0, s2, 16 # a0,a1 = x >> 16, s2,s3 = x add s3, s3, a1 add s2, s2, a0 li a1, 0 or a1, a1, s2 li a0, 0 # a0,a1 = x >> 32, s2,s3 = x add s3, s3, a1 li s2, 0 andi s3, s3, 0x7f li a0, 64 sub a0, a0, s3 ret ``` ### Variable Bytes Second part of the assembly code is variable byte. It will be seperated by different label. The label `encodeVariableByte` presents when the encoding variable bytes is started. In the label, some registers are initialized to start as the bitmask or the counter of the loop. ``` encodeVariableByte: li t0, 0x7f # bitmask1 li t1, 0x80 # bitmask2 li t2, 0 # i = 0 li t3, 4 li t5, 0 j first_register ``` The `first_register` is to transform the value to variable byte. The branch `not_firstbyte` is to present that the first byte in variable byte and the rest of bytes aren't the same because of the most significant bit of each byte. ``` first_register: div a5, t2, t3 bne a5, t5, reset_bitmask and t4, a1, t0 # value & bitmask1 sll t4, t4, t2 # (value & bitmask1) << i or a3, a3, t4 bne t2, t5, not_firstbyte slli t0, t0, 7 slli t1, t1, 8 addi t2, t2, 1 j first_register not_firstbyte: or a3, a3, t1 slli t0, t0, 7 slli t1, t1, 8 addi t2, t2, 1 beq t2, a0, exit j first_register ``` After the `first_register` is finished, the bitmask is reset. Then, the `second_register` is activated to store the upper 32 byte in the second register. ``` exit: jr ra second_register: and t4, a2, t0 sll t4, t4, t6 or a4, a4, t4 or a4, a4, t1 slli t0, t0, 7 slli t1, t1, 8 li t5, 7 beq t2, t5, exit addi t2, t2, 1 beq t2, a0, exit j second_register reset_bitmask: li t0, 0x7f # bitmask1 li t1, 0x80 # bitmask2 slli a2, a2, 4 srli s1, a1, 28 or a2, a2, s1 j second_register ``` ### Test case The 64 bit test case is also stored in two registers, so in order to load the word the program has to get access into the memory to load the data from the memory. ``` .data testdata: .word 0, 128 .word 0, 0xfffffff .word 0xffffff, 0xffffffff ``` The 3,4 and 5 line are to load the testdata from the memory to two registers to do the rest operation. ```= .text main: la t6, testdata lw a0, 0(t6) lw a1, 4(t6) jal ra, CLZ li t1, 63 sub a0, t1, a0 li t1, 7 div a0, a0, t1 addi a0, a0, 1 #a0 = len1 lw a1, 4(t6) #a1,a2 = value lw a2, 0(t6) la t5, encodedData1 # uint32_t encodedData1[1] = {0} lw a3, 0(t5) jal ra, encodeVariableByte mv t6, a3 jal ra, printHEX la t6, testdata lw a0, 8(t6) lw a1, 12(t6) jal ra, CLZ li t1, 63 sub a0, t1, a0 li t1, 7 div a0, a0, t1 addi a0, a0, 1 #a0 = len2 lw a1, 12(t6) #a1,a2 = value lw a2, 8(t6) la t5, encodedData2 # uint32_t encodedData2[1] = {0} lw a3, 0(t5) jal ra, encodeVariableByte mv t6, a3 jal ra, printHEX la t6, testdata lw a0, 16(t6) lw a1, 20(t6) jal ra, CLZ li t1, 63 sub a0, t1, a0 li t1, 7 div a0, a0, t1 addi a0, a0, 1 #a0 = len3 lw a1, 20(t6) #a1,a2 = value lw a2, 16(t6) la t5, encodedData3 # uint32_t encodedData3[2] = {0,0} lw a3, 0(t5) lw a4, 4(t5) jal ra, encodeVariableByte mv t6, a3 jal ra, printHEX mv t6, a4 jal ra, printHEX li a7, 10 ecall ``` `printHEX` uses the `ecall` to print the number in hexadecimal and print the value stored in register `t6`. ```= printHEX: la a0, EncodedBytes li a7, 4 ecall mv a0, t6 # return in register a0 li a7, 34 ecall la a0, \n li a7, 4 ecall ret ``` ![](https://hackmd.io/_uploads/ryU1PcWbT.png) ## Analysis The program is simulated on the [Ripes](https://github.com/mortbopet/Ripes). Ripes provides different kinds of processor for users to choose such as single cycle processor, 5-stage processor etc. ### 5-stage pipelined processor The program is run on the 5-stage processor which includes IF,ID,EX,MEM and WB stages. * Instruction fetch (IF) * Instruction decode and register fetch (ID) * Execute (EX) * Memorty access (MEM) * Register write back (WB) All the assembly code will go through the stages like the example below. ![](https://hackmd.io/_uploads/BJpLvoGb6.png) ![](https://hackmd.io/_uploads/H1qjcjMbT.png) #### IF ![](https://hackmd.io/_uploads/B1XYooGbT.png) * PC = PC + 4 * Program Counter (PC) gives next instruction address to instruction memory. * The instruction `addi` is in the memory address `0x14` ![](https://hackmd.io/_uploads/Sk2NEJ7-6.png) * Instruction `addi` can be translated to `03f00313` from the figure above. #### ID ![](https://hackmd.io/_uploads/HyJkwy7-a.png) * Instruction `0x1a8000ef` is decoded into three parts: * opcode `JAL` * R1 idx & R2 idx * Imm. `0x000001a8` * Wr idx `0x01` * Read the value from R1 idx and R2 idx which are both 0 #### EX ![](https://hackmd.io/_uploads/rJ9pglQbp.png) * ALU adds two operand together which one is r1_out and another is imm_out. * PC and next PC value just send through the stage without any operation. #### MEM ![](https://hackmd.io/_uploads/HJMmToGZa.png) * Read the word from Data memory, and the result is 0. #### WB ![](https://hackmd.io/_uploads/SJ1S6ofZa.png) ### Improvement The assembly code below is the origin code of CLZ. The code shows that register `a1` use OR operation to get the value from `s2` and then adds with `s3`, but the code can be simplified to `s3` adds `s2`. The reason is because that if the number digits which right shift `>>` is more 32, it doesn't matter what is the value in the higher bit since it will become 0. ``` ... li a1, 0 or a1, a1, s2 li a0, 0 # a0,a1 = x >> 32, s2,s3 = x add s3, s3, a1 li s2, 0 andi s3, s3, 0x7f li a0, 64 sub a0, a0, s3 # return in register t0 jr ra ``` The clock cycles of program: ![](https://hackmd.io/_uploads/Sy3a_e7Za.png) So the new version assembly code is showed below. ```diff ... - li a1, 0 - or a1, a1, s2 - li a0, 0 # a0,a1 = x >> 32, s2,s3 = x - add s3, s3, a1 + add s3, s3, s2 - li s2, 0 andi s3, s3, 0x7f li a0, 64 sub a0, a0, s3 # return in register t0 jr ra ``` The clock cycles of improved program: ![](https://hackmd.io/_uploads/H182OlQWa.png)