arch2025-homework1

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [RayYang421](https://github.com/RayYang421/ca2025-quizzes) > ## Probelm B in Quiz 1 ### uf8 ``` |7 6 5 4|3 2 1 0| +-------+-------+ | Exp | Mant | ``` ### CLZ **Problem Description** The task is to implement a function clz(x) that counts the number of consecutive leading zeros in a 32-bit unsigned integer — that is, the number of zeros from the most significant bit (bit 31) down to the first 1. Examples: x = 0x00000001 → leading zeros = 31 x = 0x80000000 → leading zeros = 0 x = 0x00F00000 → leading zeros = 12 x = 0x00000000 → leading zeros = 32 This function is commonly used in bit manipulation, floating-point normalization, and hardware optimizations. Many instruction sets, such as ARM and RISC-V, include a hardware instruction (CLZ) to perform this operation directly. **binary search approach** 1. Initialize n = 32 (the maximum possible number of leading zeros) and c = 16. 1. Repeatedly check whether x >> c equals zero: - If the result is zero, the upper c bits of x are all zeros, so we do not modify n or x. - If the result is nonzero, it means the highest 1 bit lies within that upper range, so we update: n -= c and x = x >> c 1. Halve c each iteration (c >>= 1) and continue until c becomes zero. 1. Finally, return n - x, which yields the count of leading zeros. **Branchless method** Because normalization of subnormal IEEE-754 operands requires a count-leading-zeros operation, we must emulate CLZ in software on a 32-bit RISC-V platform without the Zbb extension. For side-channel resistance, the implementation must avoid data-dependent branches, as branch mispredictions would introduce timing variation. Since integer multiply instructions (MUL, MULHU) execute in fixed latency on this platform, they can safely be used in a constant-time, branchless CLZ emulation. ref: https://stackoverflow.com/questions/78589600/branchless-count-leading-zeros-on-32-bit-risc-v-without-zbb-extension **Branchless C code** ```cpp= int myclz(uint32_t x) { int r = 0, c; c = (x < 0x00010000) << 4; r += c; x <<= c; // off 16 c = (x < 0x01000000) << 3; r += c; x <<= c; // off 8 c = (x < 0x10000000) << 2; r += c; x <<= c; // off 4 c = (x < 0x40000000) << 1; r += c; x <<= c; // off 2 c = x < 0x80000000; r += c; x <<= c; // off 1 r += x == 0; return r; } ``` **Comparison** | branch | branchless | |:---------------------------------------------------:|:---------------------------------------------------:| | ![image](https://hackmd.io/_uploads/SyizZQcpeg.png) | ![image](https://hackmd.io/_uploads/S1Mjgm5Txx.png) | * Cycles: 194 → 146 (↓ ~25%) * Instructions retired: 130 → 126 (almost the same) * CPI: 1.49 → 1.16 (↓, indicating fewer pipeline stalls) * IPC: 0.67 → 0.863 (↑, more instructions completed per cycle) Both have $O(1)$ time complexity. The performance gap comes from branch misprediction: the binary-search version relies on conditional jumps, and with random inputs those mispredictions flush the pipeline, driving up CPI and total cycles. ### uf8 decode * 4 bits exponent ⇒ exponential scaling. * 4 bits mantissa ⇒ fractional resolution within each exponent. * The offset ensures monotonic growth across exponent levels. ### uf8 encode **Performance** ![image](https://hackmd.io/_uploads/Ski-XDFaxl.png) ## Probelm C in Quiz 1 ### bf16 Format | Sign (1) | Exponent (8) | Mantissa (7) | |-----------|---------------|---------------| | S | EEEEEEEE | MMMMMMM | --- ### bf16 zero inf nan | **Exponent (E)** | **Mantissa (M)** | **Sign (S)** | **Result** | **Description** | | ---------------- | ---------------- | ------------ | --------------------- | -------------------------------- | | `E = 0` | `M = 0` | 0 | `+0` | Positive zero | | `E = 0` | `M = 0` | 1 | `-0` | Negative zero | | `E = 0` | `M ≠ 0` | Any | **Subnormal number** | Denormalized (very small) number | | `1 ≤ E ≤ 254` | Any | Any | **Normalized number** | Regular representable value | | `E = 255` | `M = 0` | 0 | `+Inf` | Positive infinity | | `E = 255` | `M = 0` | 1 | `-Inf` | Negative infinity | | `E = 255` | `M ≠ 0` | Any | **NaN** | Not a Number | ### bf16 zero When the exponent is zero and the mantissa is also zero, the value represents zero. * If the sign bit equals 1, it is negative zero (-0) * if the sign bit equals 0, it is positive zero (+0) ### bf16 infinite When the exponent field is all ones (E = 255) and the mantissa field is zero (M = 0), the value represents infinity. * If the sign bit equals 0, the value is positive infinity (+Inf) * if the sign bit equals 1, the value is negative infinity (-Inf) ### bf16 not a number When the exponent field is all ones (E = 255) and the mantissa field is nonzero (M ≠ 0), the value represents NaN (Not a Number). --- ### float 32 ``` | 31 | 30 ........ 23 | 22 .................... 0 | | S | Exponent | Fraction | ``` normalized: $V=(−1)^S×(1.F)×2^{E−127}$ --- ### bf16_mul **Booth Algorithm** ref: [Booth Algorithm](https://www.geeksforgeeks.org/computer-organization-architecture/computer-organization-booths-algorithm/) * Initialization:Initialize the Accumulator (AC) register to 0. * Initialize the Accumulator (AC) register to 0. * Load the Multiplicand into the BR register. * Load the Multiplier into the QR register. * Initialize the auxiliary bit ($Q_{n+1}$) to 0. * Set the Sequence Counter (SC) to the number of bits in the multiplier (N). * Inspection Loop (N times): * Inspect the pair of bits $(Q_n, Q_{n+1})$. * Perform the corresponding operation (Add, Subtract, or No Op) as per the table above. * Perform an Arithmetic Right Shift on the concatenated register $AC \leftarrow QR \leftarrow Q_{n+1}$ (the sign bit of AC is preserved). * Decrement the Sequence Counter (SC). * Result: When SC reaches 0, the final product is found in the concatenated AC and QR registers. **Booth Algorithm in BF16 Mantissa Multiplication** In the BF16 floating-point multiplication unit, the sign bit is computed using an **XOR operation**, while the exponent is obtained through integer **addition** followed by bias adjustment. The mantissa, however, requires an 8-bit × 8-bit multiplication to generate a 16-bit intermediate product. To reduce hardware cost and improve computational efficiency, the design **employs the Booth algorithm for partial-product generation**. The resulting product is then normalized and rounded before being re-encoded into the BF16 format. --- ### bf16_div ref: [LeetCode : 29. Divide Two Integers](https://leetcode.com/problems/divide-two-integers/description/) **Problem Descript** Given two integers dividend and divisor, divide two integers without using multiplication, division, and mod operator. The integer division should truncate toward zero, which means losing its fractional part. For example, 8.345 would be truncated to 8, and -2.7335 would be truncated to -2. Return the quotient after dividing dividend by divisor. Note: Assume we are dealing with an environment that could only store integers within the 32-bit signed integer range: [$−2^{31}$, $2^{31}$]. For this problem, if the quotient is strictly greater than $2^{31}$ - 1, then return 231 - 1, and if the quotient is strictly less than -$2^{31}$, then return -$2^{31}$. **Restoring Division Algorithm** The Restoring Division Algorithm is a method for dividing two unsigned integers in binary form, producing a quotient and remainder through iterative shifting and subtraction. * It uses registers for the quotient (Q), remainder (A), and divisor (M). * If subtraction gives a negative result, the remainder is restored to its previous value, and the quotient bit is set to zero, hence the term "restoring." **comparison** | w/o Restoring Division Algorithm | with Restoring Division Algorithm | |:----------------------------------------------------:|:----------------------------------------------------:| | ![div v1](https://hackmd.io/_uploads/Hkrm-eKpll.png) | ![div v2](https://hackmd.io/_uploads/ryOEZxtaxx.png) | The restoring division algorithm version completes the task in fewer cycles (732 vs. 841), meaning it is more efficient in total execution time. However, its CPI and IPC show that each instruction is slightly less efficient individually. Overall, the performance improvement mainly comes from reduced total operations rather than faster per-instruction execution. ### bf16_sqrt **Newton Method** Let N be any number then the square root of N can be given by the formula: $$x^{n+1}=(x^n + a / x^n)/2$$ **Steps** 1. Assign X to the N itself. 1. Now, start a loop and keep calculating the root which will surely move towards the correct square root of N. 1. Check for the difference between the assumed X and calculated root, if not yet inside tolerance then update root and continue. 1. If the calculated root comes inside the tolerance allowed then break out of the loop. 1. Print the root. **C code** ```c= double findSqrt(double N, double guess, double tolerance) { double next_guess = (guess + N/guess) / 2; if (abs(guess - next_guess) <= tolerance) { return next_guess; } else { return findSqrt(N, next_guess, tolerance); } } ``` ## Analysis We evaluated our implementation using the [Ripes](https://ripes.me/) simulator on a **five-stage pipelined CPU architecture**. ![image](https://hackmd.io/_uploads/BJNn3Ltagl.png) The processor consists of five stages: 1. Instruction fetch (IF) 1. Instruction decode and register fetch (ID) 1. Execute (EX) 1. Memory access (MEM) 1. Register write back (WB) ### Instruction fetch (IF) stage ![image](https://hackmd.io/_uploads/BJEfR8YTeg.png) * The processor fetches the instruction from memory (using the Program Counter, PC). * The PC is incremented to point to the next instruction. * Output: the fetched instruction and updated PC. ### Instruction Decode and Register Fetch (ID) ![image](https://hackmd.io/_uploads/H1drgPFTxg.png) * The instruction is decoded to determine its type (e.g., arithmetic, memory access, branch). * The processor reads the required registers specified by the instruction. * For branch instructions, the target address may also be calculated here. ### Execute (EX) ![image](https://hackmd.io/_uploads/S1IUlDK6ll.png) * The ALU performs the operation (e.g., addition, subtraction, address calculation). * For branch instructions, the condition is evaluated here. * The effective address for memory access instructions (like lw, sw) is computed. ### Memory Access (MEM) ![image](https://hackmd.io/_uploads/ryXDxvKTxg.png) * For load instructions, data is read from memory. * For store instructions, data is written to memory. * For ALU-type instructions, this stage may simply pass the result through. ### Write Back (WB) ![image](https://hackmd.io/_uploads/S1zuxwYTeg.png) * The result is written back to the destination register (either from the ALU or memory). * Completes the execution of the instruction. ## Reference 1. https://leetcode.com/problems/divide-two-integers/description/ 1. https://www.geeksforgeeks.org/computer-organization-architecture/restoring-division-algorithm-unsigned-integer/ 2. https://stackoverflow.com/questions/78589600/branchless-count-leading-zeros-on-32-bit-risc-v-without-zbb-extension 3. https://www.geeksforgeeks.org/dsa/find-root-of-a-number-using-newtons-method/