# Assignment 1: RISC-V Assembly and Instruction Pipeline

contributed by <`ryanycs`>

[TOC]

## Convert int32 to bfloat16

Bfloat16 is a 16-bit floating-point format designed for machine learning. It consists of 1 sign bit, 8 exponent bits, and 7 mantissa bits, as shown in the figure below:

```text
  ┌ sign (1)
  │
  │   ┌ exponent (8)
  │   │
  │   │      ┌ mantissa (7)
  │   │      │
  │┌──┴───┐┌─┴───┐
0b0000000000000000 bfloat16
```

In quiz1, we already completed the function that converts fp32 into bf16, along with the basic arithmetic operations of bfloat16. Bfloat16 reduces the computational cost compared to float32. However, some applications need to convert an integer into a floating-point number. Therefore, I decided to write a function that converts int32 into bfloat16.

The idea is to first express the binary representation of the int32 value in scientific notation, and then extract the sign, exponent, and mantissa from it. Since the mantissa of bf16 is only 7 bits, the lower bits must be discarded when the value needs more than 7 mantissa bits, and the result must be rounded to nearest even.

We take the number `-1023` as an example to demonstrate the conversion process. The binary representation of int32 `-1023` is:

```text
1111 1111 1111 1111 1111 1100 0000 0001
```

Since the number is negative, the sign bit is `1`. We can use the bit manipulation `val >> 31 & 1` to extract the sign bit.

```text
sign = val >> 31 & 1
```

Next, we take the absolute value if the number is negative:

```c
if (val < 0)
    val = -val;
```

The binary representation of int32 `1023` is:

```text
0000 0000 0000 0000 0000 0011 1111 1111
```

To convert the binary representation into scientific notation, we need to find the position of the highest set bit. This is known as the "**Most Significant Bit Set**", or "**[Find Last bit Set](https://en.wikipedia.org/wiki/Find_first_set)**" (`fls`).
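To make the term concrete, here is a minimal sketch that finds the highest set bit with a plain loop. The name `fls_naive` is hypothetical and does not appear in the original source; it is only meant to illustrate what "position of the highest set bit" means for the running example.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the original source):
 * index of the highest set bit, found by shifting right until
 * nothing is left. Returns -1 for x == 0, which has no set bit. */
static int fls_naive(uint32_t x)
{
    int pos = -1;
    while (x != 0) {
        x >>= 1;   /* drop the lowest bit */
        pos++;     /* one more bit position seen */
    }
    return pos;
}
```

For `1023` the loop runs ten times, so the result is `9`.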
The position can be computed with a `clz` (count leading zeros) function: the highest set bit is at position `31 - clz(val)` (0-indexed).

```text
                           ┌─── fls = 31 - 22 = 9
0000 0000 0000 0000 0000 0011 1111 1111
└─────────────────────────┘
         clz = 22
```

After finding the position of the last set bit, we can write the number in scientific notation:

```text
1.1 1111 1111 * 2^9
```

The exponent in the scientific notation is `9`, which is the same as the position of the last set bit. Therefore, the exponent in bf16 would be `9 + 127 = 136`.

However, if we look at the final bf16 result for `-1023` (`0xC480`), we can see that the exponent is `137`, not `136`. This is because the mantissa in bf16 is only 7 bits. If the value needs more mantissa bits, we need to **discard the lower bits** and **round to the nearest even**, and in this example the rounding carries into the next power of two.

To achieve this, there is a trick we can use: add (almost) **half of the weight of the lowest kept bit** to the original number. If the discarded bits exceed the halfway point, the carry propagates into the kept bits and the value is rounded up. (This idea is inspired by 2024-quiz1 problem B.)

```text
  1.1 1111 1111   (* 2^9)
+ 0.0 0000 0001
─────────────────
 10.0 0000 0000
```

This is accomplished by adding `(1 << (fls - 8)) - 1` to the original number (which is `0.0 0000 0001` in the above example). Note that with this constant alone an exact tie is rounded down; full round-to-nearest-even would additionally add the lowest kept mantissa bit.

After rounding, a carry may have occurred, which means the position of the last set bit may have changed. Therefore, we need to recalculate it.

```text
                          ┌─── fls = 31 - 21 = 10
0000 0000 0000 0000 0000 0100 0000 0000
└────────────────────────┘
         clz = 21
```

Now the exponent can be determined as `fls + BF16_EXP_BIAS`, which is `10 + 127 = 137`.

Next, we need to determine the mantissa. The implicit leading `1` in the scientific notation is removed, and only the next 7 bits are kept.
This can be done by shifting:

```c
mantissa = (val << (32 - fls)) >> 25;
```

```text
 << shift left: 32 - fls = 22
┌─────────────────────────┐
0000 0000 0000 0000 0000 0100 0000 0000

 >> shift right: 25
┌─────────────────────────────┐
0000 0000 0000 0000 0000 0000 0000 0000

                       mantissa
                               ┌──────┐
0000 0000 0000 0000 0000 0000 0000 0000
```

Finally, we can concatenate the sign, exponent, and mantissa to get the final result:

```text
sign     = 1
exponent = 137 = 1000 1001
mantissa = 0   = 000 0000

result = 1100 0100 1000 0000 = 0xC480
         │└───┬────┘└──┬───┘
         │    │        │
         │    │        └ mantissa (7)
         │    │
         │    └ exponent (8)
         │
         └ sign (1)
```

If we convert the result back to an fp32 value, the bf16 representation `0xC480` is `-1024`. The difference from the original value `-1023` is the rounding error: bf16 keeps only 8 significant bits, so the lower bits cannot be represented.

### C code

You can find the [source](https://github.com/ryanycs/ca2025-quizzes/blob/main/int32_to_bf16.c) code here, including the test function. Feel free to fork and modify it.

```c
bf16_t int32_to_bf16(int32_t val)
{
    uint16_t sign = val >> 31 & 1;
    uint16_t exponent;
    uint32_t mantissa;

    if (val == 0)
        return BF16_ZERO();

    if (val < 0)
        val = -val;

    uint16_t fls = 31 - clz(val);

    if (fls > 7) {
        // One less than half of the discarded range
        uint32_t round_bit = (1 << (fls - 8)) - 1;

        // Round to nearest (exact ties are rounded down)
        val += round_bit;

        // Recalculate if overflow (carry) after rounding
        fls = 31 - clz(val);
    }

    exponent = fls + BF16_EXP_BIAS;

    // Remove implicit 1 and shift to 7 bits.
    // Same as (val << (32 - fls)) >> 25, but done on an unsigned value and
    // in two steps so that fls == 0 never shifts by 32 (undefined in C).
    mantissa = ((uint32_t)val << (31 - fls) << 1) >> 25;

    return (bf16_t){.bits = sign << 15 | (exponent & 0xff) << 7 | (mantissa & 0x7f)};
}
```

### RISC-V Assembly

You can find the [source](https://github.com/ryanycs/ca2025-quizzes/blob/main/int32_to_bf16.s) code here, including automated testing. Feel free to fork and modify it.
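One detail worth checking before porting the routine to assembly is the rounding constant. As far as I can tell, adding `(1 << (fls - 8)) - 1` rounds to nearest but sends exact ties downward. The sketch below shows a tie-aware variant under my own assumptions: `clz32` and `int32_to_bf16_rne` are hypothetical names, the function returns the raw 16 bits instead of a `bf16_t`, and the magnitude is computed in unsigned arithmetic so that `INT32_MIN` is handled. It is meant as a reference for comparison, not as the original implementation.

```c
#include <stdint.h>

/* Portable count-leading-zeros for x != 0 (stand-in for the clz routine). */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical variant: int32 -> raw bf16 bits, rounding ties to even. */
static uint16_t int32_to_bf16_rne(int32_t val)
{
    if (val == 0)
        return 0;

    uint16_t sign = (uint16_t)((uint32_t)val >> 31);
    /* Negate in unsigned arithmetic so INT32_MIN does not overflow. */
    uint32_t mag = val < 0 ? 0u - (uint32_t)val : (uint32_t)val;

    unsigned fls = 31 - clz32(mag);
    if (fls > 7) {
        unsigned drop = fls - 7;            /* number of discarded bits */
        uint32_t lsb = (mag >> drop) & 1u;  /* lowest kept mantissa bit */
        /* (half - 1) + lsb: a tie rounds up only when the kept bit is odd. */
        mag += ((1u << (drop - 1)) - 1u) + lsb;
        fls = 31 - clz32(mag);              /* a carry may raise the MSB */
    }

    /* Take the 7 bits just below the leading 1. */
    uint32_t mantissa = fls >= 7 ? (mag >> (fls - 7)) : (mag << (7 - fls));
    uint16_t exponent = (uint16_t)(fls + 127);

    return (uint16_t)(sign << 15 | (exponent & 0xff) << 7 | (mantissa & 0x7f));
}
```

Both versions agree on `-1023` (`0xC480`). They differ on exact ties such as `259`, which lies halfway between the representable values `258` and `260`: the tie-aware variant yields `0x4382` (`260`).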
```asm
#-------------------------------------------------------------------------------
# int32_to_bf16
#
# Arguments:
#   a0: int32 value
#
# Returns:
#   a0: bf16 value
#
# Registers Usage:
#   s0: val (int32)
#   s1: sign
#   s2: fls
#   s3: round_bit
#   s4: exponent
#   s5: mantissa
#-------------------------------------------------------------------------------
int32_to_bf16:
    # Callee save
    addi sp, sp, -28
    sw ra, 24(sp)
    sw s0, 20(sp)
    sw s1, 16(sp)
    sw s2, 12(sp)
    sw s3, 8(sp)
    sw s4, 4(sp)
    sw s5, 0(sp)

    mv s0, a0                   # s0 = val

    # sign
    srli s1, a0, 31             # sign = val >> 31
    andi s1, s1, 1              # sign &= 1

    li t0, BF16_ZERO
    beq s0, t0, return_zero     # if (val == 0) return 0

    bgez s0, 1f
    sub s0, x0, s0              # val = -val
1:
    # highest set bit
    mv a0, s0                   # a0 = val
    jal ra, clz
    li t0, 31
    sub s2, t0, a0              # fls = 31 - clz(val)

    # Round to nearest even
    li t0, 7
    ble s2, t0, 1f
    addi t0, s2, -8             # t0 = fls - 8
    li s3, 1
    sll s3, s3, t0              # round_bit = 1 << (fls - 8)
    addi s3, s3, -1             # round_bit = (1 << (fls - 8)) - 1
    add s0, s0, s3              # val += round_bit

    mv a0, s0                   # a0 = val
    jal ra, clz                 # clz(val)
    li t0, 31
    sub s2, t0, a0              # fls = 31 - clz(val)
1:  # no rounding needed
    # exponent
    addi s4, s2, BF16_EXP_BIAS  # exponent = fls + BF16_EXP_BIAS

    # mantissa
    li t0, 32
    sub t0, t0, s2              # t0 = 32 - fls
    sll s5, s0, t0              # mantissa = val << (32 - fls)
    srli s5, s5, 25             # mantissa >>= 25

    slli a0, s1, 15             # a0 = sign << 15
    andi t0, s4, 0xFF           # t0 = exponent & 0xFF
    slli t0, t0, 7              # t0 = (exponent & 0xFF) << 7
    or a0, a0, t0               # a0 |= (exponent & 0xFF) << 7
    andi t0, s5, 0x7F           # t0 = mantissa & 0x7F
    or a0, a0, t0               # a0 |= mantissa & 0x7F
    j on_return

return_zero:
    li a0, BF16_ZERO

on_return:
    # Callee restore
    lw s5, 0(sp)
    lw s4, 4(sp)
    lw s3, 8(sp)
    lw s2, 12(sp)
    lw s1, 16(sp)
    lw s0, 20(sp)
    lw ra, 24(sp)
    addi sp, sp, 28
    ret

#-------------------------------------------------------------------------------
# clz
# Count leading zeros
#
# Arguments:
#   a0: x
#
# Returns:
#   a0: number of leading zeros
#-------------------------------------------------------------------------------
clz:
    li t0, 32                   # n
    li t1, 16                   # c
1:  # do while
    srl t3, a0, t1              # y = x >> c
    beq x0, t3, 2f              # if (!y) go to 2
    sub t0, t0, t1              # n = n - c
    mv a0, t3                   # x = y
2:  # join
    srai t1, t1, 1              # c >>= 1
    bne x0, t1, 1b              # while (c)
    sub a0, t0, a0              # return value: n - x
    ret
```

## Implementation

### clz

```asm=
#-------------------------------------------------------------------------------
# clz
# Count leading zeros
#
# Arguments:
#   a0: x
#
# Returns:
#   a0: number of leading zeros
#-------------------------------------------------------------------------------
clz:
    li t0, 32                   # n
    li t1, 16                   # c
1:  # do while
    srl t3, a0, t1              # y = x >> c
    beq x0, t3, 2f              # if (!y) go to 2
    sub t0, t0, t1              # n = n - c
    mv a0, t3                   # x = y
2:  # join
    srai t1, t1, 1              # c >>= 1
    bne x0, t1, 1b              # while (c)
    sub a0, t0, a0              # return value: n - x
    ret
```

### uf8

You can find the [source](https://github.com/ryanycs/ca2025-quizzes/blob/main/q1-uf8.s) code here, including automated testing. Feel free to fork and modify it.
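As a reading aid for the two routines in this section, the decode arithmetic can be written in C as follows. This is a sketch reconstructed from the comments in the `uf8_decode` assembly, not code taken from the linked source. The names `uf8_decode_ref` and `uf8_encode_ref` are hypothetical, and the brute-force encoder is only a test oracle that assumes the encoder is meant to pick the largest code whose decoded value does not exceed the input.

```c
#include <stdint.h>

/* Decode: low 4 bits are the mantissa, high 4 bits the exponent. */
static uint32_t uf8_decode_ref(uint8_t fl)
{
    uint32_t mantissa = fl & 0x0f;
    uint32_t exponent = fl >> 4;
    /* offset = (2^exponent - 1) * 16 */
    uint32_t offset = (0x7fffu >> (15 - exponent)) << 4;
    return (mantissa << exponent) + offset;
}

/* Oracle encoder (assumption: round down to the nearest representable
 * value). Decoding is strictly increasing in fl, so the last code that
 * still fits is the largest one. */
static uint8_t uf8_encode_ref(uint32_t value)
{
    uint8_t best = 0;
    for (unsigned fl = 0; fl < 256; fl++) {
        if (uf8_decode_ref((uint8_t)fl) <= value)
            best = (uint8_t)fl;
    }
    return best;
}
```

Such an oracle is slow, but it can be handy for generating golden data to compare against the assembly for inputs up to `uf8_decode_ref(0xff)`.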
#### uf8_decode

```asm=
#-------------------------------------------------------------------------------
# uf8_decode
# Decode uf8 to uint32_t
#
# Arguments:
#   a0: fl
#
# Returns:
#   a0: value
#
# Registers Usage:
#   t0: mantissa
#   t1: exponent
#   t3: offset
#-------------------------------------------------------------------------------
uf8_decode:
    addi sp, sp, -4
    sw ra, 0(sp)                # store return addr

    andi t0, a0, 0x0f           # mantissa = fl & 0x0f
    srli t1, a0, 4              # exponent = fl >> 4
    li t2, 15
    sub t2, t2, t1              # 15 - exponent
    li t3, 0x7fff
    srl t3, t3, t2              # offset = 0x7fff >> (15 - exponent)
    slli t3, t3, 4              # offset <<= 4
    sll t4, t0, t1              # mantissa << exponent
    add a0, t4, t3              # return mantissa + offset

    lw ra, 0(sp)                # restore return addr
    addi sp, sp, 4
    ret
```

#### uf8_encode

```asm=
#-------------------------------------------------------------------------------
# uf8_encode
# Encode uint32_t to uf8
#
# Arguments:
#   a0: value
#
# Returns:
#   a0: fl
#
# Registers Usage:
#   s0: value
#   s1: lz
#   s2: msb
#   s3: exponent
#   s4: overflow
#   s5: e
#   s6: next_overflow
#   s7: mantissa
#-------------------------------------------------------------------------------
uf8_encode:
    addi sp, sp, -36
    sw ra, 32(sp)               # store return addr
    sw s0, 28(sp)
    sw s1, 24(sp)
    sw s2, 20(sp)
    sw s3, 16(sp)
    sw s4, 12(sp)
    sw s5, 8(sp)
    sw s6, 4(sp)
    sw s7, 0(sp)

    mv s0, a0                   # value
    li t0, 16
    blt s0, t0, 8f              # if (value < 16) return value

    jal ra, clz                 # clz(value)
    mv s1, a0                   # lz = clz(value)
    li s2, 31
    sub s2, s2, s1              # msb = 31 - lz

    li s3, 0                    # exponent
    li s4, 0                    # overflow

    # if (msb >= 5)
    li t0, 5
    blt s2, t0, 5f              # if (msb < 5) goto 5

    addi s3, s2, -4             # exponent = msb - 4

    # if (exponent > 15)
    li t0, 15
    ble s3, t0, 2f              # if (exponent <= 15) goto 2
    li s3, 15                   # exponent = 15
2:
    # for (e = 0; e < exponent; e++)
    li s5, 0                    # e = 0
3:
    bge s5, s3, 4f              # if (e >= exponent) goto 4
    slli s4, s4, 1              # overflow <<= 1
    addi s4, s4, 16             # overflow += 16
    addi s5, s5, 1              # e++
    j 3b                        # repeat
    # end for
4:
    # while (exponent > 0 && value < overflow)
    slt t0, x0, s3              # t0 = (exponent > 0)
    slt t1, s0, s4              # t1 = (value < overflow)
    and t0, t0, t1              # t0 = (exponent > 0 && value < overflow)
    beq x0, t0, 5f              # if (!t0) goto 5
    addi s4, s4, -16            # overflow -= 16
    srli s4, s4, 1              # overflow >>= 1
    addi s3, s3, -1             # exponent--
    j 4b                        # repeat
    # end while
5:
    # while (exponent < 15)
    li t0, 15
6:
    bge s3, t0, 7f              # if (exponent >= 15) goto 7
    slli s6, s4, 1              # next_overflow = overflow << 1
    addi s6, s6, 16             # next_overflow += 16
    blt s0, s6, 7f              # if (value < next_overflow) break to 7
    mv s4, s6                   # overflow = next_overflow
    addi s3, s3, 1              # exponent++
    j 6b                        # repeat
    # end while
7:
    sub s7, s0, s4              # mantissa = value - overflow
    srl s7, s7, s3              # mantissa >>= exponent
    slli a0, s3, 4              # a0 = exponent << 4
    or a0, a0, s7               # a0 |= mantissa
8:  # on return
    lw s7, 0(sp)
    lw s6, 4(sp)
    lw s5, 8(sp)
    lw s4, 12(sp)
    lw s3, 16(sp)
    lw s2, 20(sp)
    lw s1, 24(sp)
    lw s0, 28(sp)
    lw ra, 32(sp)               # restore return addr
    addi sp, sp, 36
    ret
```

### bfloat16 Arithmetic

Since the source code is too long, I have not pasted it here. You can find the [source](https://github.com/ryanycs/ca2025-quizzes/blob/main/q1-bfloat16.s) code here, including automated testing. Feel free to fork and modify it.

For automated testing, I designed a subroutine called `testfixture`, which verifies the functionality of a given function against the provided input and golden data.
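In C terms, the intended behavior can be sketched roughly as below. This is only a reading aid under a few assumptions: every input and golden value is stored as a 32-bit word, the one-argument and two-argument cases are split into two functions for type safety, and the names `check_unary`, `check_binary`, `demo_inc`, and `demo_add` are hypothetical stand-ins (the demo callbacks take the place of functions such as `bf16_sqrt` or `bf16_add`).

```c
#include <stdint.h>

typedef uint32_t (*unary_fn)(uint32_t);
typedef uint32_t (*binary_fn)(uint32_t, uint32_t);

/* Returns 0 if every result matches the golden data, 1 otherwise. */
static int check_unary(unary_fn func, const uint32_t *input,
                       const uint32_t *golden, int num_test_data)
{
    for (int i = 0; i < num_test_data; i++) {
        if (func(input[i]) != golden[i])
            return 1;  /* first mismatch fails the whole run */
    }
    return 0;
}

/* Two-argument case: inputs are stored as consecutive pairs. */
static int check_binary(binary_fn func, const uint32_t *input,
                        const uint32_t *golden, int num_test_data)
{
    for (int i = 0; i < num_test_data; i++) {
        if (func(input[2 * i], input[2 * i + 1]) != golden[i])
            return 1;
    }
    return 0;
}

/* Hypothetical callbacks used only to exercise the checkers. */
static uint32_t demo_inc(uint32_t a) { return a + 1; }
static uint32_t demo_add(uint32_t a, uint32_t b) { return a + b; }
```

With inputs `{1, 2, 3, 4}` and golden data `{3, 7}`, `check_binary(demo_add, ...)` reports a pass; changing any golden value makes it report a failure. An empty test set counts as a pass.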
The `testfixture` subroutine assembly is as follows:

```asm
#-------------------------------------------------------------------------------
# testfixture
# Test the given function with the provided input and golden data
#
# Arguments:
#   a0: address of the function to test
#   a1: address of input data
#   a2: address of golden data
#   a3: number of arguments of the test function (1 or 2)
#   a4: number of test data
#
# Returns:
#   a0: 0 if all tests passed, 1 if any test failed
#
# Register Usage:
#   s0: i
#   s1: func_addr
#   s2: input_data_addr
#   s3: golden_data_addr
#   s4: num_args
#   s5: num_test_data
#   s6: func(a0) or func(a0, a1)
#   s7: golden_data[i]
#
#-------------------------------------------------------------------------------
testfixture:
    # Callee save
    addi sp, sp, -36
    sw ra, 32(sp)
    sw s0, 28(sp)
    sw s1, 24(sp)
    sw s2, 20(sp)
    sw s3, 16(sp)
    sw s4, 12(sp)
    sw s5, 8(sp)
    sw s6, 4(sp)
    sw s7, 0(sp)

    li s0, 0                    # i = 0
    mv s1, a0                   # func_addr
    mv s2, a1                   # input_data_addr
    mv s3, a2                   # golden_data_addr
    mv s4, a3                   # num_args
    mv s5, a4                   # num_test_data

    bge s0, s5, 3f              # if (i >= num_test_data) go to pass

    # Determine the number of arguments to load
    li t0, 1
    sub t1, s4, t0              # t1 = num args - 1
    beqz t1, 2f                 # if (num args - 1 == 0) go to one_arg

1:  # two_args
    lw a0, 0(s2)                # a0 = input_data[i*2]
    lw a1, 4(s2)                # a1 = input_data[i*2 + 1]
    jalr ra, s1, 0              # func(a0, a1)
    mv s6, a0                   # s6 = result
    lw s7, 0(s3)                # s7 = golden_data[i]

    # Print for debugging
    # mv a0, s6
    # mv a1, s7
    # jal ra, print

    bne s6, s7, 4f              # compare s6, s7

    addi s0, s0, 1              # i++
    addi s2, s2, 8              # input_data += 8
    addi s3, s3, 4              # golden_data += 4
    blt s0, s5, 1b              # if (i < num_test_data) go to two_args
    j 3f

2:  # one_arg
    lw a0, 0(s2)                # a0 = input_data[i]
    jalr ra, s1, 0              # test_function(a0)
    mv s6, a0                   # s6 = result
    lw s7, 0(s3)                # s7 = golden_data[i]

    # Print for debugging
    # mv a0, s6
    # mv a1, s7
    # jal ra, print

    bne s6, s7, 4f              # compare s6, s7

    addi s0, s0, 1              # i++
    addi s2, s2, 4              # input_data += 4
    addi s3, s3, 4              # golden_data += 4
    blt s0, s5, 2b              # if (i < num_test_data) go to one_arg

3:  # pass
    li a0, 0                    # return 0
    j 5f

4:  # fail
    li a0, 1                    # return 1

5:  # on return
    # Callee restore
    lw s7, 0(sp)
    lw s6, 4(sp)
    lw s5, 8(sp)
    lw s4, 12(sp)
    lw s3, 16(sp)
    lw s2, 20(sp)
    lw s1, 24(sp)
    lw s0, 28(sp)
    lw ra, 32(sp)
    addi sp, sp, 36
    ret
```

This subroutine corresponds to the following C function prototype:

```c
int testfixture(
    bf16_t (*func)(bf16_t, ...),
    bf16_t *input,
    bf16_t *output,
    int num_args,
    int num_test_data
);
```

This function receives the address of a callback function, which may be `bf16_add`, `bf16_sub`, and so on. It then feeds the input data into the callback function and compares each returned value with the golden output. If all outputs of the callback function match the golden output, `testfixture` returns 0; otherwise, it returns 1.

Since some functions, such as `bf16_sqrt`, take only one argument, the `testfixture` subroutine receives an extra argument `num_args` that tells it how many arguments the callback function takes.

:::info
If you want to show the details of each comparison, you can uncomment the debug sections (use `<C-f>` to search for the keyword **debug** to find them quickly).
:::

## Analysis

### 5-stage Pipeline Processor

Let me use the following **shift-and-add multiplication** function as an example to demonstrate how the 5-stage pipeline processor works:

```asm
.text

#-------------------------------------------------------------------------------
# mul
# Multiplies two words using the shift-and-add algorithm.
#
# Arguments:
#   a0 = multiplicand
#   a1 = multiplier
#
# Returns:
#   a0 = a0 * a1
#-------------------------------------------------------------------------------
main:
    li a0, 1023
    li a1, 255
    jal ra, mul

    li a7, 1
    ecall

    li a7, 10
    ecall

mul:
    addi sp, sp, -4
    sw ra, 0(sp)

    li t0, 0
1:
    andi t1, a1, 1              # Check LSB of multiplier
    beq t1, x0, 8               # If LSB is 0, skip add (next instruction)
    add t0, t0, a0
    slli a0, a0, 1              # multiplicand <<= 1
    srli a1, a1, 1              # multiplier >>= 1
    bnez a1, 1b                 # Repeat if multiplier != 0
    mv a0, t0

    lw ra, 0(sp)
    addi sp, sp, 4
    ret
```

The above code is translated into the following machine code:

```text
00000000 <main>:
    0:        3ff00513        addi x10 x0 1023
    4:        0ff00593        addi x11 x0 255
    8:        014000ef        jal x1 20 <mul>
    c:        00100893        addi x17 x0 1
    10:       00000073        ecall
    14:       00a00893        addi x17 x0 10
    18:       00000073        ecall

0000001c <mul>:
    1c:       ffc10113        addi x2 x2 -4
    20:       00112023        sw x1 0 x2
    24:       00000293        addi x5 x0 0
    28:       0015f313        andi x6 x11 1
    2c:       00030463        beq x6 x0 8
    30:       00a282b3        add x5 x5 x10
    34:       00151513        slli x10 x10 1
    38:       0015d593        srli x11 x11 1
    3c:       fe0596e3        bne x11 x0 -20
    40:       00028513        addi x10 x5 0
    44:       00012083        lw x1 0 x2
    48:       00410113        addi x2 x2 4
    4c:       00008067        jalr x0 x1 0
```

We focus on the `beq` instruction at address `0x2c`:

:::info
Conditional Branches

All branch instructions use the **B-type** instruction format.

```text
┌──────────────┬──────────┬──────────┬──────┬───────────┬──────────────┐
│31          25│24      20│19      15│14  12│11        7│6            0│
├──────────────┼──────────┼──────────┼──────┼───────────┼──────────────┤
│ imm[12|10:5] │   rs2    │   rs1    │funct3│imm[4:1|11]│    opcode    │
└──────────────┴──────────┴──────────┴──────┴───────────┴──────────────┘
        7            5          5        3        5             7
```

The 12-bit B-immediate encodes signed offsets in multiples of 2 bytes.
:::

#### IF

![image](https://hackmd.io/_uploads/SkSwAQFpel.png)

- At the Instruction Fetch (IF) stage, the `PC` has the value `0x0000002c`. The processor fetches the instruction from memory at address `0x0000002c`. This value is also sent to the `IF/ID` pipeline register.
- The instruction fetched is `0x00030463`, which is the machine code for the `beq x6, x0, 8` instruction. This instruction is passed to the `IF/ID` pipeline register.
- The PC is incremented by 4 to point to the next instruction, at `0x00000030`. This value is also sent to the `IF/ID` pipeline register.

#### ID

![image](https://hackmd.io/_uploads/rycnAQKTgx.png)

- At the Instruction Decode (ID) stage, the `Decode` unit decodes the instruction (refer to [RV32/64G Instruction Set Listings](https://docs.riscv.org/reference/isa/unpriv/rv-32-64g.html)):

  ```text
  0x00030463

  0000 0000 0000 0011 0000 0100 0110 0011
  └──────┘└────┘ └────┘└─┘ └────┘└──────┘
    imm    rs2    rs1  f3   imm   opcode

  imm    = 0000 0000 1000 = 8
  rs1    = 00110          = x6
  rs2    = 00000          = x0
  funct3 = 000            = BEQ
  opcode = 11 000 11      = BRANCH
  ```

- `Reg 1` reads the `x6` register, which contains the value `0x00000000`. `Reg 2` reads the `x0` register, which contains the value `0x00000000`. These values are passed to the `ID/EX` pipeline register.
- The `Imm.` unit takes the opcode field and the instruction to determine how the immediate value is formed. In this case, the immediate value is `0x00000008`. This value is passed to the `ID/EX` pipeline register.
- The `PC` and `next PC` values are passed through from the `IF/ID` pipeline register to the `ID/EX` pipeline register.

#### EX

The Execute (EX) stage is the most important stage for a branch instruction.

![image](https://hackmd.io/_uploads/rJLrJEtplg.png)

- Since the previous instruction `andi x6, x11, 1` writes register `x6`, and the current instruction `beq x6, x0, 8` reads register `x6`, there is a **data hazard**. The value of register `x6` had not yet been written back to the register file when the `beq` instruction was in the ID stage. Therefore, the first-level multiplexer chooses the value from the `MEM/WB` pipeline register, which is the previous `ALU` result `0x00000000`.
  This value is sent two ways:
    - to the second-level multiplexer
    - to the `Branch` unit
- There is no data hazard for register `x0`, so the value `0x00000000` is passed directly from the `ID/EX` pipeline register. This value is sent two ways:
    - to the second-level multiplexer
    - to the `Branch` unit
- The second-level multiplexers select `PC` and `imm` and pass the two values to the `ALU` unit. The `ALU` adds them to get the target address `0x00000034`. The result is sent two ways:
    - to the `EX/MEM` pipeline register
    - to the multiplexer input before the `PC` register
- The `Branch` unit takes the two register values (`x6`, `x0`) and checks whether they are equal. Since both values are `0x00000000`, the branch is **taken**. The taken signal `res` is sent back to the multiplexer before the `PC` register, which makes the `PC` take the result of the `ALU` unit.

![image](https://hackmd.io/_uploads/Hy8OJVKage.png)

After this stage, the `PC` is updated to `0x00000034`, which skips the next instruction and goes to the `slli x10 x10 1` instruction. The `add x5 x5 x10` instruction, which is in the ID stage, is flushed.

#### MEM

![image](https://hackmd.io/_uploads/ryzs1Vtpgl.png)

- At the Memory Access (MEM) stage, the result of the `ALU` is sent three ways:
    - to the `MEM/WB` pipeline register
    - to the `Data Memory`
    - to the EX stage multiplexer
- The `PC` is passed through from the `EX/MEM` pipeline register to the `MEM/WB` pipeline register.
- `Reg 2` is sent to `Data in`, but there is no write enable signal, so the data memory is not written. The data read from data memory is not used by this instruction.
- `Wr idx` is also passed through from the `EX/MEM` pipeline register to the `MEM/WB` pipeline register.

#### WB

![image](https://hackmd.io/_uploads/BJtp1Etaxl.png)

- At the Write Back (WB) stage, the multiplexer chooses the `ALU` result, which is `0x00000034`.
  This value is sent two ways:
    - to the EX stage multiplexer
    - to the `Registers` file `Wr data`
- `Wr idx` is sent to the `Registers` file `Wr idx`, but there is no write enable signal, so the register file is not written.

## Acknowledgement

This assignment was completed with assistance from [GitHub Copilot](https://github.com/features/copilot) auto-completions for writing code and comments (Agent mode was not used). [ChatGPT](https://chat.openai.com/) was used to get ideas for an informative use case.

## Reference

- [RISC-V Unprivileged Architecture](https://riscv.org/specifications/ratified/)
- [Examples of Code Patterns and Structure in RISC-V](https://cmput229.github.io/229-labs-RISCV/RISC-V-Examples_Public/example.html)
- [BFloat16: The secret to high performance on Cloud TPUs](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus)
- [arch2024-quiz1-sol](https://hackmd.io/@sysprog/arch2024-quiz1-sol)
- [arch2025-quiz1-sol](https://hackmd.io/@sysprog/arch2025-quiz1-sol)