Assignment1: - HackMD

# Assingment1: Approximating a bfloat number using binary search ###### contributed by < [Timothy Liu](https://github.com/timothyliu0912) > ###### tags: `RISC-V` `Binary Search` `bfloat16` `float32` ## 1. Backgrond ### 1.1 Binary Search introduction Binay seach is a efficient search algorithm that locate the position of a target value within a sorted array (or interval). Binary search process involves comparing the target value within the middle element of the array. If they do not match, it elimates the half if the element of the array where the target cannot where the target cannot possibly be found and continues the search in the remaining half. This process repeats, with each iteration narrowing down the search range by comparing the middle element to the target value until the desired target value is located. ### 1.2 Binary Search's Procedure Binary search is operated on a sorted array (or interval). The process of binary search starts by comparing an element located at the midpoint of the array to the target value. If the target value matches the element, the algorithm returns the position of that element in the array. If the target value is less than the element, the search narrows down to the lower half of the array. On the other hand, if the target value is greater than the element, the search continues on the upper half of the array. This iterative approach effectively eliminates the half in which the target value cannot be located during each step of the search, ultimately leading to the discovery of the target value's position in the array. Given an interval between $L$ and $R$ (i.e., $\left[L, R \right]$) and the target value $T$, the following subroutine uses binary search to find the index of $T$ in $\left[ L, R \right]$ 1. If $L > R$, the search terminates as unsuccessful. 3. Set $m$ (the positioon if the middle element) to $\frac{L+R}{2}$. 4. If $m < T$, set $R$ to $m$ and got to step 2. 5. If $m > T$, set $L$ to $m-1$ and got to step 2. 6. If $m = T$, the search is done; return m. ### 1.3 Bfloat16 floating-point format The `bfloat16` is a custom 16-bit floating point format for machine learning that is composed of * Sign bit: 1 bit * Exponent width: 8 bits * Significand precision: 8 bits (7 explicitly stored, with an implicit leading bit), as opposed to 24 bits in a classical single-precision floating-point format #### Composed of bfloat16, float32 and float16 |floating point format|Sign bit | Exponent bits | Mantissa bits | |:--------:|:--------:|:--------:|:--------:| |`bfloat16` |1-bit | 8-bits | 7-bits | |`float32` |1-bit | 8-bits | 23-bits | |`float16` |1-bit | 5-bits | 10-bits | Compared with `float32`, `bfloat16` has the same exponent size as `float32`, which allows for fast conversion to and from `float32`. The exponent bits are preserved in conversion to the `bfloat16` , but handles denormals are flushed to zero. (wiki, google:The bfloat16 numerical format ) ## 2. Implemtation The source code of C Program and Assembly code can be found [here]() ### 2.1 C Program ```c= #include <stdio.h> float fp32_to_bf16(float x) { float y = x; int *p = (int *) &y; unsigned int exp = *p & 0x7F800000; unsigned int man = *p & 0x007FFFFF; if (exp == 0 && man == 0) return x; if (exp == 0x7F800000) return x; float r = x; int *pr = (int *) &r; *pr &= 0xFF800000; r /= 0x100; y = x + r; *p &= 0xFFFF0000; return y; } float binary_search(float low, float high, float target) { low = fp32_to_bf16(low); high = fp32_to_bf16(high); target = fp32_to_bf16(target); while (low <= high) { float mid = low + (high - low) / 2; if (mid == target) { return mid; } if (mid < target) { low = mid; } else { high = mid; } } return -1; } int main(){ float test_case1_x= 0.1; float test_case1_ub = 10.0; float test_case1_lb = 0.001; float test_case2_x= 100.0; float test_case2_ub = 256.0; float test_case2_lb = 10.0; float test_case3_x= 0.563; float test_case3_ub = 1.0; float test_case3_lb = 0.01; printf("%f\n",binary_search(test_case1_lb, test_case1_ub, test_case1_x)); printf("%f\n",binary_search(test_case2_lb, test_case2_ub, test_case2_x)); printf("%f\n",binary_search(test_case3_lb, test_case3_ub, test_case3_x)); return 0; } ``` ### 2.2 Assembly Code The complete source code can be found [here](). The following code only shows the part of assembly code functions. ```s= float32tobfloat16: #convert float32 to bfloat16 li s0, 0x7F800000 li s1, 0x007FFFFF li s2, 0x00800000 and a0, t1, s0 #a0 input exp and a1, t1, s1 #a1 input man # or a1, a1, s2 beq a0, x0, zerocheck beq a0, s0 print_float jal s10 float32_div jal s10 float32_add li s0, 0xFFFF0000 and t1, t2, s0 ret zerocheck: beq a1, x0, print_float float32_div: li s0, 0xFF800000 li s1, 0x04000000 #256 # and s2, t1, s0 #pr=s2 input exp sub a2, a0, s1 #a2 = a0 / 256 jr s10 float32_add: li s0, 0x7F800000 li s1, 0x007FFFFF li s2, 0x00800000 and a3, a2, s0 #r exp and a4, a2, s1 #r man or a4, a4, s2 sub a5, a0, a3 #a5 = shift srli a5, a5, 23 srl a4, a4, a5 # shift r man add a6, a4, a1 #a6 res_man or t2, a0, a6 jr s10 float_add: li s0, 0x7F800000 li s1, 0x007FFFFF li s2, 0x00800000 and a0, t4, s0 #a0 input exp and a1, t4, s1 #a1 input man srli a0, a0, 23 # shift exp or a1, a1, s2 and a3, t5, s0 #r exp and a4, t5, s1 #r man srli a3, a3, 23 or a4, a4, s2 sub a5, a0, a3 #a5 = shift srl a4, a4, a5 # shift r man add a6, a4, a1 #a6 res_man li s7, 0x01000000 and a7, a6, s7 bnez a7, overflow #check man overflow #combine exp and man and a6, a6,s1 andi a0, a0, 0xff slli a0, a0, 23 or t2, a0, a6 jr s10 overflow: srli a6, a6 1 addi a0, a0, 1 and a6, a6,s1 andi a0, a0, 0xff slli a0, a0, 23 or t2, a0, a6 jr s10 binary_search: jal s10 float_add #add ub and lb # div 2 li s0, 0xFF800000 li s1, 0x007FFFFF li s2 0x01000000 #2 and a0, t2, s0 #a0 t2 exp and a1, t2, s1 #a0 t2 man sub a2, a0, s1 #a2 = div 2 or a2, a2, a1 li s0, 0xFFFF0000 and a2, a2, s0 add a0, t4, x0 li a7, 2 ecall la a0, spacestr li a7, 4 ecall add a0, a2, x0 li a7, 2 ecall la a0, spacestr li a7, 4 ecall add a0, t5, x0 li a7, 2 ecall la, a0, linestr li a7, 4 ecall beq a2, t3 print_result blt t3, a2, mid_to_ub blt a2, t3, mid_to_lb mid_to_lb: mv t5, a2 j binary_search mid_to_ub: mv t4, a2 j binary_search ``` ### 2.3 Result #### test data 1 ``` 0x40A00000 0x41200000 0x40400000 ``` #### test data 2 ``` 0x42C80000 0x43800000 0x3F800000 ``` #### test data 3 ``` 0x3DCCCCCD 0x3F800000 0x3C23D70A ``` #### C program output ```txt 0.100098 100.000000 0.562500 ``` #### Assembly output ```txt= The target, upper bound and lower bound numbers are : 0.100098 10 0.000999451 ----------------- 10 5 0.000999451 5 2.5 0.000999451 2.5 1.25 0.000999451 1.25 0.625 0.000999451 0.625 0.3125 0.000999451 0.3125 0.15625 0.000999451 0.15625 0.0786133 0.000999451 0.15625 0.117188 0.0786133 0.117188 0.0976563 0.0786133 0.117188 0.107422 0.0976563 0.107422 0.102539 0.0976563 0.102539 0.100098 0.0976563 ----------------- The binary search result number is 0.100098 The target, upper bound and lower bound numbers are : 100 256 10 ----------------- 256 133 10 133 71.5 10 133 102 71.5 102 86.5 71.5 102 94 86.5 102 98 94 102 100 98 ----------------- The binary search result number is 100 The target, upper bound and lower bound numbers are : 0.5625 1 0.0100098 ----------------- 1 0.503906 0.0100098 1 0.75 0.503906 0.75 0.625 0.503906 0.625 0.5625 0.503906 ----------------- The binary search result number is 0.5625 ``` ## 3. Analysis In the assignment, I use [Ripes](https://github.com/mortbopet/Ripes) to test and analysis the assembly code. :::info The Ripes is a visual computer architecture simulator and assembly code editor built for the RISC-V instruction set architecture. ::: ### 3.1 Disassembled Executable Code Disassembled executable code here only shows a part of the code. The entire code can be found [here](). ```s = 000000a4 <float32tobfloat16>: a4: 7f800437 lui x8 0x7f800 a8: 008004b7 lui x9 0x800 ac: fff48493 addi x9 x9 -1 b0: 00800937 lui x18 0x800 b4: 00837533 and x10 x6 x8 b8: 009375b3 and x11 x6 x9 bc: 00050e63 beq x10 x0 28 <zerocheck> c0: 16850a63 beq x10 x8 372 <print_float> c4: 01800d6f jal x26 24 <float32_div> c8: 02400d6f jal x26 36 <float32_add> cc: ffff0437 lui x8 0xffff0 d0: 0083f333 and x6 x7 x8 d4: 00008067 jalr x0 x1 0 000000d8 <zerocheck>: d8: 14058e63 beq x11 x0 348 <print_float> 000000dc <float32_div>: dc: ff800437 lui x8 0xff800 e0: 040004b7 lui x9 0x4000 e4: 40950633 sub x12 x10 x9 e8: 000d0067 jalr x0 x26 0 000000ec <float32_add>: ec: 7f800437 lui x8 0x7f800 f0: 008004b7 lui x9 0x800 f4: fff48493 addi x9 x9 -1 f8: 00800937 lui x18 0x800 fc: 008676b3 and x13 x12 x8 100: 00967733 and x14 x12 x9 104: 01276733 or x14 x14 x18 108: 40d507b3 sub x15 x10 x13 10c: 0177d793 srli x15 x15 23 110: 00f75733 srl x14 x14 x15 114: 00b70833 add x16 x14 x11 118: 010563b3 or x7 x10 x16 11c: 000d0067 jalr x0 x26 0 00000120 <float_add>: 120: 7f800437 lui x8 0x7f800 124: 008004b7 lui x9 0x800 128: fff48493 addi x9 x9 -1 12c: 00800937 lui x18 0x800 130: 008ef533 and x10 x29 x8 134: 009ef5b3 and x11 x29 x9 138: 01755513 srli x10 x10 23 13c: 0125e5b3 or x11 x11 x18 140: 008f76b3 and x13 x30 x8 144: 009f7733 and x14 x30 x9 148: 0176d693 srli x13 x13 23 14c: 01276733 or x14 x14 x18 150: 40d507b3 sub x15 x10 x13 154: 00f75733 srl x14 x14 x15 158: 00b70833 add x16 x14 x11 15c: 01000bb7 lui x23 0x1000 160: 017878b3 and x17 x16 x23 164: 00089c63 bne x17 x0 24 <overflow> 168: 00987833 and x16 x16 x9 16c: 0ff57513 andi x10 x10 255 170: 01751513 slli x10 x10 23 174: 010563b3 or x7 x10 x16 178: 000d0067 jalr x0 x26 0 0000017c <overflow>: 17c: 00185813 srli x16 x16 1 180: 00150513 addi x10 x10 1 184: 00987833 and x16 x16 x9 188: 0ff57513 andi x10 x10 255 18c: 01751513 slli x10 x10 23 190: 010563b3 or x7 x10 x16 194: 000d0067 jalr x0 x26 0 00000198 <binary_search>: 198: f89ffd6f jal x26 -120 <float_add> 19c: ff800437 lui x8 0xff800 1a0: 008004b7 lui x9 0x800 1a4: fff48493 addi x9 x9 -1 1a8: 01000937 lui x18 0x1000 1ac: 0083f533 and x10 x7 x8 1b0: 0093f5b3 and x11 x7 x9 1b4: 40950633 sub x12 x10 x9 1b8: 00b66633 or x12 x12 x11 1bc: ffff0437 lui x8 0xffff0 1c0: 00867633 and x12 x12 x8 1c4: 000e8533 add x10 x29 x0 1c8: 00200893 addi x17 x0 2 1cc: 00000073 ecall 1d0: 10000517 auipc x10 0x10000 1d4: eb050513 addi x10 x10 -336 1d8: 00400893 addi x17 x0 4 1dc: 00000073 ecall 1e0: 00060533 add x10 x12 x0 1e4: 00200893 addi x17 x0 2 1e8: 00000073 ecall 1ec: 10000517 auipc x10 0x10000 1f0: e9450513 addi x10 x10 -364 1f4: 00400893 addi x17 x0 4 1f8: 00000073 ecall 1fc: 000f0533 add x10 x30 x0 200: 00200893 addi x17 x0 2 204: 00000073 ecall 208: 10000517 auipc x10 0x10000 20c: e1c50513 addi x10 x10 -484 210: 00400893 addi x17 x0 4 214: 00000073 ecall 218: 0bc60263 beq x12 x28 164 <print_result> 21c: 00ce4863 blt x28 x12 16 <mid_to_ub> 220: 01c64263 blt x12 x28 4 <mid_to_lb> 00000224 <mid_to_lb>: 224: 00060f13 addi x30 x12 0 228: f71ff06f jal x0 -144 <binary_search> 0000022c <mid_to_ub>: 22c: 00060e93 addi x29 x12 0 230: f69ff06f jal x0 -152 <binary_search> ``` ### 3.2 Five-stage pipelined processor In this assignment, 5-stage pipelined processor is chosen to execute the program. The block diagram is shown below: ![5-stage_processor](https://hackmd.io/_uploads/S1Z9PtAea.png) 1. **Instruction Fetch (IF)** * instruction is being fetched from the memory. 2. **Instruction Decode (ID)** * decode the instruction and fetch the source operands 3. **Execute (EX)** * the computer performs the operation specified by the instruction 4. **Memory (MEM)** * If there is any data that needs to be accessed, it is done in the memory stage 5. **Write (WB)** * If we need to store the result in the destination location, it is done during the writeback stage. ### 3.3 Instruction Analysis Take the instruction ` beq x10, x8 372<print_float>`, for example, and analyze how the instruction is operated in the processor at different stages. The sub instruction is The following image is the format of B-type instruction format. ![](https://hackmd.io/_uploads/Sk3Q5aAep.png) #### IF Stage ![](https://hackmd.io/_uploads/ByPNaaCeT.png) At this stage, the instruction address of` beq x10, x8 372<print_float> ` is `0x000000c0`, and the PC will translate the address to 32-bit instruction. The opcode of `beq` instruction is `1100011` and can be formatted as the machine code below. | imm[12] |imm[10:5] | rs2 | rs1 | funct3 | imm[4:1] | imm[11]| opcode| |:-------:|:--------:|:---:|:---:|:------:|:-------:|:-------:|:-------:| | 0 | 001011 | 01000 | 01010 | 000 | 1010 | 0 |1100011| #### ID Stage ![](https://hackmd.io/_uploads/BJ41x1kZa.png) The instruction will be decoded in to several parts: * `opcode` : `0x05` `beq` * `R1 idx` : `0x0a` `x10` * `R2 idx` : `0x08` `x8` * `Wr reg idx` : `0x14` `x14` `R1 idx` and `R2 idx` will be sent into the registers and get the corresponding value `0x7f800000`. With the opcode `beq` and instruction `0x16850a63`, The `imm` field output `0x00000174` also is `372` in decimal. #### EX Stage ![](https://hackmd.io/_uploads/ByNR4yJba.png) * `Op1` of `ALU` is set to `PC` : `0x000000c0` * `Op2` of `ALU` is set to the `imm` field : `0x0000174` * In `beq` instruction, inputs of the branch taken will be the same as `x10 == x8`; otherwise, inputs of the branch taken will be different. * The example of `x10 != x8` ![](https://hackmd.io/_uploads/H1rRsJ1Wa.png) the next `PC` will be `PC+4` `0x00000c8` ![](https://hackmd.io/_uploads/HkyZkxJZT.png) * The example of `x10 == x8` ![](https://hackmd.io/_uploads/B1C0o11bp.png) the next `PC` will be the specified address `0x0000234` ![](https://hackmd.io/_uploads/HJYByxy-T.png) #### MEM Stage ![](https://hackmd.io/_uploads/SJ_Ahk1b6.png) There is nothing will be written to memory so `WrEn` is set to 0. #### WB Stage ![](https://hackmd.io/_uploads/Bkiimg1Wp.png) The `beq` instruction does not need to write back to the register so `WrEn` is set to 0.