# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [vicLin8712](https://github.com/vicLin8712/ca2025-quizzes#) > ## Problem B Refer to [Quiz1 of Computer Architecture (2025 Fall) Problem `B`](https://hackmd.io/@sysprog/arch2025-quiz1-sol) ### Assembly Implementation #### First approach clz To calculate leading zero of given unsigned integer `x`, I left shift `x` until the first non-zero bit position at the MSB. The implementation code as below showed: ```c main: lw a0, test clz: li t0, 0 # counts of leading zero li t1, 0x80000000 # MSB set to 1 clz_loop: and t2, t1, a0 # Check MSB bnez t2, clz_end # branch addi t0,t0,1 slli a0, a0,1 j clz_loop clz_end: mv a0, t0 li a7,1 ecall ``` The time complexity obvious is $O(n)$ which determined by the position of the first non-zero bit. The corresponding performance tests as below showed with best, worst, and general case: ![image](https://hackmd.io/_uploads/HyaoPRynel.png ) Fig. Best case when `x=0x80000000` ![image](https://hackmd.io/_uploads/SkWxYCJ2xg.png) Fig. Worst case when `x=0x0000001` ![image](https://hackmd.io/_uploads/ryX4FAknxe.png) Fig. General case when `x=0x00010000` #### Second approach clz Refer to the algorithm how problem `B` approaches its `clz`, the implementation assembly showed below: ```c main: lw a0, test clz: li t0, 32 # n li t1, 16 # c clz_loop: srl t2, a0, t1 # y = x >> c beqz t2, skip_sub # if (y==0) sub t0, t0, t1 # n -= c mv a0, t2 # x = y skip_sub: srli t1, t1, 1 # c >>= 1 bnez t1, clz_loop end_clz: sub a0, t0, a0 # return n - x li a7, 1 ecall ``` The above assembly executed in $O(log_2n)$ time complexity, only depended on the bit length of given data. ![image](https://hackmd.io/_uploads/HklJw-Hneg.png) Fig. `x=0x80000000` ![image](https://hackmd.io/_uploads/Sk2Zv-Bnxl.png) Fig. `x=0x00000001` As above execution information showed, total cycles are fixed nomatter where the first non-zero bit at. #### branchless CLZ #### uf8_decode I compiled `uf8_decode` with `RISC-V (32-bits) gcc -o2` on [Compiler Explorer](https://godbolt.org/). The corresponding assembly code and performance is: ```c uf8_decode: srli a3,a0,4 li a4,15 li a5,32768 sub a4,a4,a3 addi a5,a5,-1 sra a5,a5,a4 andi a0,a0,15 sll a0,a0,a3 slli a5,a5,4 add a0,a5,a0 ret ``` ![image](https://hackmd.io/_uploads/Hkra2GBnle.png) In my version, I tried to modified origin mapping function as: $$ \begin{aligned} D(b) &= m \cdot 2^e + (2^e - 1) \cdot 16 \\&= (m+16)\cdot2^e - 16 \end{aligned} $$ And the following assembly code and performance are: ```c uf8_decode: srli t0, a0, 4 # e = b >> 4 andi t1, a0, 0x0F # m = b & 0xF addi t1, t1, 16 # t1 = m + 16 sll t2, t1, t0 # t2 = (m+16) << e = (m+16)*2^e addi a0, t2, -16 # a0 = (m+16)*2^e - 16 ret ``` ![image](https://hackmd.io/_uploads/SJO5CGBhee.png) #### uf8_encode Consider the encoding function as problem `B` showed: $$ \begin{aligned} E(v) = \begin{cases} v, & \text{if } v < 16 \\ 16e + \lfloor(v - \text{offset}(e))/2^e\rfloor, & \text{otherwise} \end{cases} \end{aligned} $$ When given a 32-bit integer $v$, we wish to get an 8-bit integer $x$ to represent its origin value. The structure of 8-bit utf8 code is: <div align="right"> <img src=https://hackmd.io/_uploads/rJI4oH8hle.png width="400"> </div> There is only 4 bit available for exponent, which means that the maximum exponent value is $2^4-1=15$. The value $v$ will located at a range of corresponding exponent value. Based on the `utf_decode` function, we can construct corresponding range table for different exponent as problem `B` showed: ![image](https://hackmd.io/_uploads/S1SR0HU2el.png) For example, if exponenet is $1$, the minimum and maximum value are: $$ \begin{aligned} &D(00010000) = 0 + (2^1-1)\cdot 16 = 16, m=0000 \, e=0001\\ &D(00011111) = 15\cdot 2^1 + (2^1-1)\cdot 16 = 46,m=1111 \, e=0001 \end{aligned} $$ The first step is to decided the exponent value where $v$ located at. Assume that we have correct exponent value $e$ now, we are going to figure out mantissa value. Refer to decode function, we can obtain what mantissa value should be: $$ \begin{aligned} &D(b) = m \cdot 2^e + (2^e - 1) \cdot 16\\ &\Rightarrow \frac {D(b)-(2^e-1)\cdot16}{2^e} = m \end{aligned} $$ The below pictures illustrate how a utf-8 data decoded and restore. The test program should verify wether uf8 remaining the same after decoded and encoded. :::info utf8 -> int32 -> utf8: No infomation loss int32 -> utf8 -> int32: Information loss due to data compaction ::: ![未命名绘图.drawio (14)](https://hackmd.io/_uploads/H12UNN9nel.png) #### approach 1 The fisrt task is to figure out the exponent of given value. We can assume that the exponenet can be calculated by `clz` method because the first non-zero value decide the minimum value of given 32-bit integer. ```asm # a0: Passed in value, the value to be encoded. uf8_encode: # a1: exponent # a2: offset # Retrun if a0 < 16 li t0, 16 blt a0, t0, r # clz, caller save process addi sp, sp, -8 sw a0, 0(sp) sw ra, 4(sp) jal clz mv a1, a0 # Store leading zero into a1 lw a0, 0(sp) lw ra, 4(sp) addi sp, sp, 8 # Initial exponent gauss li t0, 27 # exp max = msb-4 = 31-a1-4 = 27 -4 li t1, 15 # Max exp limit sub a1, t0, a1 blt a1, t1, skip_max_exp mv a1, t1 skip_max_exp: # Caller save process for offset calculation addi sp, sp, -12 sw a0, 0(sp) sw a1, 4(sp) sw ra,8(sp) mv a0, a1 jal cal_offset # Put exp a1 into a0, call cal offset mv a2, a0 # Store offset value based on gaussed exponent into a2 lw a0, 0(sp) lw a1, 4(sp) lw ra, 8(sp) addi sp,sp,12 reset_over: # Check offset over over value blt a2, a0, skip_overflow addi a2, a2, -16 srli a2, a2,1 addi a1,a1,-1 j reset_over skip_overflow: # Check offset less than next offset value li t1, 15 bge a1,t1, end_uf8_encode slli t0,a2,1 addi t0,t0,16 blt a0, t0, end_uf8_encode mv a2,t0 addi a1,a1,1 j skip_overflow end_uf8_encode: sub t0, a0, a2 # Mantissa = value(a0)-a2(overflow) srl t0, t0, a1 # Mantissa >> exponent(a1) slli a1, a1, 4 # Exp(a1) << 4 or a0, a1,t0 # Store exp << 4 | Mantissa ret # Calculate offset for given a0 cal_offset: # a0: exponenet value for offset calculating li t0, 0 # exp index li t1, 0 # offset value offset_loop: bge t0, a0, end_cal_offset slli t1,t1,1 addi t1, t1,16 addi t0,t0,1 j offset_loop end_cal_offset: mv a0, t1 jr ra ``` #### approach 2 For a given value `x` to be encoded into utf-8, I tried another approach, searching pre-build table to figure out exponenet directly. The code size comparing to **approach 1** can be redueced almost third of it: ```asm table: .word 16,48,112,240,496,1008,2032,4048,8176 ,16368,32752,65520,131056,262128,524272 # a0: The value to be encoded uf8_encode: # a1: exponent # a2: offset la a2, table # Load offset table li a1, 0 # Initial exponent value li t1, 15 # Maximum exp value lw t2, 0(a2) # Load data of offset table next_offset: lw t2, 0(a2) # Load data of offset table blt a0, t2, end_uf8_encode # If value less than next offset # Go to next offset value addi a1, a1, 1 # Exponent++ addi a2, a2, 4 # Next offset address mv t3, t2 # Store previous data bge a1, t1, end_uf8_encode j next_offset end_uf8_encode: mv a2, t3 # Store offset value into a2 sub t0, a0, a2 # Mantissa = value(a0)-a2(overflow) srl t0, t0, a1 # Mantissa >> exponent(a1) slli a1, a1, 4 # Exp(a1) << 4 or a0, a1,t0 # Store exp << 4 | Mantissa ret ``` ### Validation Now we have two different uf8 encode and decode approach. First one is using gcc compiler -02 optimization and the other is using pre-build table without `clz`. The below two results all executed on the `Rips` simulator. With case from `uf8 = 0x00` to `uf8 = 0xFF`. #### gcc I used [Compiler Explorer](https://godbolt.org/) to compile source code and run on the Ripes. Of course the number of whole cycles includs string output and relative operations. ![image](https://hackmd.io/_uploads/rJkaTXWpgl.png) #### table > commit [2f2dbf9](https://github.com/vicLin8712/ca2025-quizzes/commit/2f2dbf906f35261c3becc57ce526cc0da910db7f) The cycles almost reduced $28$ % using table approach! Although the fixed size memory required for this table. ![image](https://hackmd.io/_uploads/ryIgR7-Txe.png) > commit [4ec3276](https://github.com/vicLin8712/ca2025-quizzes/commit/4ec32764d419968e5d4a42f993bb5cdef8a0e517) support [rv32emu](https://github.com/sysprog21/rv32emu) system calls #### Analysis ## Problem C In this problem, four operations are implemented for `bf16` data type, `add`, `minus`, `mul`, and `div`. The bit field of `bf16` is: ``` ┌─────────┬──────────────┬──────────────┐ │Sign (1) │ Exponent (8) │ Mantissa (7) │ └─────────┴──────────────┴──────────────┘ 15 14 7 6 0 S: Sign bit (0 = positive, 1 = negative) E: Exponent bits (8 bits, bias = 127) M: Mantissa/fraction bits (7 bits) ``` The relation between real value $v$ and `bf16` representation can be calculated as: \begin{aligned} v=(-1)^S\times 2^{E-127} \times \bigg (1+\frac M{128} \bigg ) \end{aligned} where \begin{aligned} &S:sign\\ &E:Exponent\\ &M:Mantissa \end{aligned} Some special case defined as follow: * Zero: $E=0$ and $M=0$ * Infinity: $E=255$ and $M=0$ * NaN: $E=255$ and $M \ne 0$ * Denormals: Not supported (flush to zero) ### bf16_add There are several edge conditions that must be considered first. Let us denote the parameters `a` and `b` as the two values to be added. We need to follow the rules of the addition operation described in **Section 6 of the IEEE 754-2019 Standard**. | Operand a | Operand b | Result | Explanation | | --------- |:---------:| ------ | --------------------------------------------------------------------------------- | | +∞ | finite | +∞ | Adding any finite number to positive infinity still results in positive infinity. | | -∞ | finite | -∞ | Adding any finite number to negative infinity still results in negative infinity. | | finite | +∞ | +∞ | Symmetric case; finite plus positive infinity is positive infinity. | | finite | -∞ | -∞ | Symmetric case; finite plus negative infinity is negative infinity. | | +∞ | +∞ | +∞ | Same sign infinities remain the same (positive infinity). | | -∞ | -∞ | -∞ | Same sign infinities remain the same (negative infinity). | | +∞ | -∞ | NaN | Opposite sign infinities cause invalid operation → Not a Number (NaN). | | -∞ | +∞ | NaN | Opposite sign infinities cause invalid operation → Not a Number (NaN). | Based on the table, we first check whether a or b is `NaN`; if so, return `NaN`. Otherwise, check whether either operand is `infinity`. If both are `infinity`, return `NaN` when their signs differ, otherwise return that `infinity`. If exactly one is infinity, return that infinity (preserving its sign). If neither is `NaN` or `infinity`, proceed with the normal finite addition. We can use flag to handle such problem ```c /* value a: bit 0 1 2, mantissa > 0, exp full, sign bit * value b: bit 3 4 5, mantissa > 0, exp full, sign bit */ uint8_t flag; ```