# Annotate and explain Quiz7 + Problem G (cache coherence simulation)
contributed by < [`fan1071221`](https://github.com/fan1071221/ca2023-lab3) >
## **Problem A**
Consider that we are using the typical 5-stage pipelined datapath. Also, assume that the CPU always predicts branches as not taken.
Review the following RISC-V code snippet:
```c=
beq x0, x0, L1
addi x0, x0, 3
addi x0, x0, 1
addi x0, x0, 4
addi x0, x0, 1
addi x0, x0, 5
L1: addi t0, x0, 9
addi t1, t0, 2
xori t1, x0, 6
```
Imagine the Instruction Fetch (IF) stage for the beq instruction at line 1 occurs in cycle 1.
Assuming that all pipeline hazards are managed by inserting stalls (without using techniques like double pumping or forwarding paths), identify the cycle in which the Write Back (WB) stage for the xori instruction on the last line executes. For each pipeline hazard identified, specify the nature of the hazard, the instructions it involves, and the number of stall cycles required. This section will only be reviewed if a regrade is requested. __ A01 __
>* A01 = ?
==A control hazard occurs between lines 1 and 7 due to the branch always being taken. Additionally, a data hazard arises from lines 7 to 8, involving a write to and subsequent read from register t0.== (Credit is given for identifying the sources of the control hazard and the data hazard.)
This leads to the below timing diagram:

### ๐ Annotate
:::info
### Instructions
The instructions are as follows:
```c=
beq x0, x0, L1
addi x0, x0, 3 (not executed due to branch taken)
addi x0, x0, 1 (not executed due to branch taken)
addi x0, x0, 4
addi x0, x0, 1
addi x0, x0, 5
L1: addi t0, x0, 9
addi t1, t0, 2
xori t1, x0, 6
```
### Hazards and Stalls
- The `beq` instruction at line 1 is always true, so the branch to L1 is always taken. However, the pipeline will have fetched the next two instructions assuming the branch was not taken. This misprediction results in a **control hazard**. Once the branch is resolved in the EX stage, the pipeline flushes the next two instructions, resulting in 2 stall cycles.
- The instruction `addi t1, t0, 2` depends on the result of `addi t0, x0, 9`. This situation is a **data hazard** because `t1` cannot proceed until `t0` is written back. It requires 1 stall cycle.
:::
If double pumping and all forwarding paths are utilized, in which cycle will the Write Back (WB) stage for the xori instruction on the last line occur? Please identify each hazard, the relevant instructions, and the number of stall cycles needed. Note: This section will be reviewed only if a regrade is requested. __ A02 __
>* A02 = ?
==The control hazard persists from line 1 to line 7 due to the always-taken branch. The data hazard between lines 7 and 8 for register t0 is resolved through forwarding.== (Credit is given for identifying the source of the control hazard.)
This leads to the below timing diagram:

### ๐ Annotate
:::info
### Instructions
The instructions given are:
```c=
beq x0, x0, L1
addi x0, x0, 3 -> nop (flushed, branch taken)
addi x0, x0, 1 -> nop (flushed, branch taken)
(the remaining instructions before L1 are never fetched)
L1: addi t0, x0, 9
addi t1, t0, 2
xori t1, x0, 6
```
- **Control Hazard:** The `beq` instruction compares `x0` to `x0`, which are always equal, so the branch is always taken. The CPU predicts not-taken, so when the misprediction is detected in the EX stage, the two instructions fetched after the branch must be flushed (2 bubble cycles): this is the control hazard.
- **Data Hazard:** Normally, the `addi t1, t0, 2` instruction would wait for the `addi t0, x0, 9` to complete due to a data hazard. However, with forwarding paths, the result of `t0` from the EX stage of instruction 7 can be directly used in the EX stage of instruction 8, thus eliminating this data hazard.
:::
## **Problem B**
In a RISC-V system employing segmentation-based virtual memory, process A is operating in user space. The segment base register is set at 0x20000, and the bound register is also 0x20000.
Evaluate the virtual address 0x200. If it does not lead to an out-of-bounds violation and is within bounds, convert it to its corresponding physical address. __ B01 __
> * B01 = ?
==0x20200==
### ๐ Annotate
:::info
1. **Check if the Virtual Address is Within Bounds**
- Bound Register Value: `0x20000` (Segment Size = 128KB)
- Virtual Address: `0x200`
- Condition: Virtual address must be less than the segment size.
- Evaluation: Since `0x200` < `0x20000`, the virtual address is within bounds.
2. **Calculate the Physical Address**
- Segment Base Register: `0x20000`
- Virtual Address: `0x200`
- Physical Address Calculation: Sum of Segment Base Register and Virtual Address.
- Result: `0x20000` (Segment Base) + `0x200` (Virtual Address) = `0x20200`
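A minimal C sketch of this base-and-bound check (the register values come from the problem; `translate` is an illustrative name):
```c
#include <stdint.h>
#include <stdio.h>

#define SEG_BASE  0x20000u /* segment base register */
#define SEG_BOUND 0x20000u /* bound register (segment size in bytes) */

/* Returns the physical address, or -1 on an out-of-bounds violation. */
static int64_t translate(uint32_t vaddr)
{
    if (vaddr >= SEG_BOUND) /* bounds check comes first */
        return -1;
    return (int64_t) SEG_BASE + vaddr; /* physical = base + offset */
}

int main(void)
{
    printf("0x%llx\n", (unsigned long long) translate(0x200)); /* prints 0x20200 */
    return 0;
}
```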
:::
Analyze the code for process A and the minimal kernel-space handler provided. Identify which instructions in process Aโs code result in an exception. __ B02 __
- [ ] Process A
```c
li a0, 0x40000
li a1, 0x20
lw a2, 0x0(a0)
li a3, 0x200
loop:
add a2, a2, a1
addi a1, a1, -1
addi a3, a3, 4
sw a2, 0(a3)
bnez a1, loop
addi a3, a3, -0x200
li a7, 0x13 # syscall number
mv a0, a3
ecall
```
- [ ] minimal kernel-space handler
```
minimal_handler:
addi a4, a4, 1
csrr a5, mepc
mret
addi a5, a5, 4
csrw mepc, a5
```
Note that ret is a pseudoinstruction that effectively functions as a jalr instruction, whereas mret is an actual instruction. These instructions serve different purposes: ret is typically employed for returning from a standard function, while mret is utilized for returning from a trap, such as an exception or interrupt, and it involves several additional effects. From the RISC-V privileged specification:
> The MRET, SRET, or URET instructions are used to return from traps in M-mode, S-mode, or U-mode respectively. When executing an xRET instruction, supposing xPP holds the value y, xIE is set to xPIE; the privilege mode is changed to y; xPIE is set to 1; and xPP is set to U (or M if user-mode is not supported).
>* B02 = ?
==lw a2, 0x0(a0) AND ecall== (Credit is given for writing either one.)
### ๐ Annotate
:::info
In the given Process A code, the instructions that result in an exception are `lw a2, 0x0(a0)` and `ecall`.
1. `lw a2, 0x0(a0)`: This instruction attempts to load a word from memory address `0x40000` into register `a2`. With the segment bound of `0x20000`, process A's valid virtual addresses range from `0x0` to `0x1FFFF`, so `0x40000` is out of bounds and the load raises an exception when it executes. (In general, a misaligned address or a missing access permission would also cause `lw` to trap.)
2. `ecall`: This is an environment call used to switch from user mode (U-mode) to a higher privilege mode (like M-mode) to perform operations such as system calls. The `ecall` instruction triggers an environment call exception, transferring control to the operating system so that it can provide the appropriate service based on the system call number in register `a7`.
Regarding the minimal kernel-space handler:
- `minimal_handler` is designed to handle the exception raised by `ecall`. When `ecall` triggers an exception, control is transferred to this handler.
- `addi a4, a4, 1` is the first instruction in the handler, which might be used to update a counter for tracking the number of times an exception has occurred.
- `csrr a5, mepc` reads the Machine Exception Program Counter (MEPC), which contains the address of the instruction that caused the exception.
- `mret` is used to return from the exception. When executing `mret`, it restores some of the bits in the Machine Status Register to their state before the exception occurred and returns control to the instruction following the one that caused the exception.
- `addi a5, a5, 4` and `csrw mepc, a5` are placed after `mret` and would never execute because `mret` has already returned control to the user program. These instructions, if executed before `mret`, might be used to update the MEPC to point to the next instruction, but in this context, they are ineffective.
:::
This program operates on a RISC-V processor equipped with complete bypassing and annulment capabilities, and exceptions are active. Assume that exceptions are processed instantly. Furthermore, it is assumed that the minimal_handler consistently returns control to the instruction immediately following the one that triggered the exception. The specific code within the minimal_handler need not be considered.
Complete the pipeline diagram starting with the initial li instruction (disregard cells marked with "-" and "?"). It is unnecessary to display any bypasses used, if applicable. Consider mret similar to a branch instruction, with the next PC established in the EXE stage and branches presumed to be not taken.
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| IF| li | li | lw | li | add | addi| addi| csrr| mret| addi|
| ID| - | li | li | lw | li | add | ? | B03 | B04 | B05 |
| EXE| - | - | li | li | lw | li | ? | ? | B06 | B07 |
| MEM| - | - | - | li | li | lw | ? | ? | ? | B08 |
| WB | - | - | - | - | li | li | ? | ? | ? | ? |
>* B03 = ?
==addi==
>* B04 = ?
==csrr==
>* B05 = ?
==mret==
>* B06 = ?
==addi==
>* B07 = ?
==csrr==
>* B08 = ?
==addi==
### ๐ Annotate
:::info
For cycles 1 to 5, instructions move through the pipeline in a straightforward manner, with no stalls or exceptions.
The `lw` reaches MEM in cycle 6 and raises its exception there; the instructions fetched behind it are annulled, and since exceptions are processed instantly, fetch is redirected to the handler in cycle 7. The `addi` (cycle 7), `csrr` (cycle 8), and `mret` (cycle 9) in the IF row are the handler's instructions; the `addi` fetched in cycle 10 is the one that sequentially follows `mret` and will itself be annulled once `mret` resolves in EXE.
- **B03 (ID stage, cycle 8)**: `addi`, the handler instruction fetched in cycle 7 enters ID one cycle later.
- **B04 (ID stage, cycle 9)**: `csrr`, fetched in cycle 8.
- **B05 (ID stage, cycle 10)**: `mret`, fetched in cycle 9.
- **B06 (EXE stage, cycle 9)**: `addi`, advancing from ID in cycle 8.
- **B07 (EXE stage, cycle 10)**: `csrr`, advancing from ID in cycle 9.
- **B08 (MEM stage, cycle 10)**: `addi`, advancing from EXE in cycle 9.
:::
## **Problem C**
Assuming a 16-byte, fully associative cache with 4-byte blocks, express hit rates as simplified fractions.
Consider the following code:
```c
#include <stdint.h>

int main()
{
    int32_t arr[6];
    for (int i = 0; i < 6; i++) {
        arr[i] += arr[0];
        arr[i] += arr[1];
        arr[i] += arr[2];
        arr[i] += arr[3];
    }
}
```
Determine the hit rate during the first iteration of the for loop when using a Least Recently Used (LRU) replacement policy. __ C01 __
>* C01 = ?
==$\frac{2}{3}$==
In each iteration, there are 12 accesses after breaking down the line arr[i] += arr[0] into arr[i] = arr[i] + arr[0] (a read of arr[i], a read of arr[0], and a write of arr[i]), and similarly for the other lines. The initial accesses to arr[0], arr[1], arr[2], and arr[3] are compulsory misses; all other accesses hit. Therefore, there are 8 hits in 12 accesses, yielding a hit rate of $\frac{8}{12} = \frac{2}{3}$.
### ๐ Annotate
:::info
### Cache Initialization
The cache starts empty, so the first access to `arr[0]`, `arr[1]`, `arr[2]`, and `arr[3]` will result in compulsory cache misses. These misses "warm up" the cache by loading these blocks.
### Subsequent Accesses in the Loop
Each statement `arr[i] += arr[k]` expands into a read of `arr[i]`, a read of `arr[k]`, and a write of `arr[i]`: 3 accesses per statement, or 12 accesses per iteration of the loop.
### First Iteration Access Pattern
In the first iteration (`i = 0`), the accessed elements are `arr[0]` (repeatedly, as both destination and first source) and `arr[1]` to `arr[3]` (once each as sources).
### Hits and Misses
- The first read of each of `arr[0]`, `arr[1]`, `arr[2]`, and `arr[3]` is a compulsory miss: 4 misses.
- The remaining 8 accesses (the repeated reads and the writes of `arr[0]`) all hit, since the four blocks fit exactly in the 4-block cache and nothing is evicted during this iteration.
### Hit Rate Calculation
Hit rate = Hits / Total accesses = $\frac{8}{12}$ = $\frac{2}{3}$.
:::
Determine the hit rate during the final iteration of the `for` loop when utilizing an LRU replacement policy. __ C02 __
>* C02 = ?
==$\frac{7}{12}$==
The crucial observation is that once we access `arr[4]` and beyond, evictions from the cache begin. To simplify, let's examine the 5th iteration of the loop (`i = 4`); the final iteration follows the same hit/miss pattern. Detailing the 12 accesses, here's the cache state at the beginning of this iteration:
| Data | LRU |
|-------|-----|
| arr[0]| 3 |
| arr[1]| 2 |
| arr[2]| 1 |
| arr[3]| 0 |
### Steps
1. Read arr[4]. This evicts arr[0], resulting in a miss.
2. Read arr[0]. This evicts arr[1], resulting in a miss.
3. Write to arr[4]. This results in a hit.
4. Read arr[4]. This results in a hit.
5. Read arr[1]. This evicts arr[2], resulting in a miss.
6. Write to arr[4]. This results in a hit.
7. Read arr[4]. This results in a hit.
8. Read arr[2]. This evicts arr[3], resulting in a miss.
9. Write to arr[4]. This results in a hit.
10. Read arr[4]. This results in a hit.
11. Read arr[3]. This evicts arr[0], resulting in a miss.
12. Write to arr[4]. This results in a hit.
Counting our hits, we have 7 hits and 5 misses, leading to our above hit rate.
### ๐ Annotate
:::info
### Hit and Miss Count
- Hits: Steps 3, 4, 6, 7, 9, 10, and 12 result in cache hits.
- Misses: Steps 1, 2, 5, 8, and 11 result in cache misses.
Thus, there are 7 hits and 5 misses.
### Hit Rate
The hit rate is the number of hits divided by the total number of memory accesses:
Hit Rate = (Number of Hits) / (Total Accesses) = $\frac{7}{12}$
:::
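As a cross-check, the following sketch replays the loop's access stream (read `arr[i]`, read `arr[k]`, write `arr[i]` per statement) against a 4-block fully associative LRU cache; since blocks are 4 bytes, each `int32_t` element occupies its own block. The bookkeeping scheme here is an illustrative assumption, not part of the quiz:
```c
#include <stdio.h>
#include <string.h>

#define WAYS 4

static int cache[WAYS]; /* element index cached in each way; -1 = empty */
static int stamp[WAYS]; /* last-use timestamp for LRU replacement */
static int now;

/* Access one block; returns 1 on hit, 0 on miss (LRU replacement). */
static int access_block(int tag)
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (cache[w] == tag) {
            stamp[w] = ++now;
            return 1;
        }
        if (stamp[w] < stamp[victim])
            victim = w;
    }
    cache[victim] = tag; /* empty or least recently used way */
    stamp[victim] = ++now;
    return 0;
}

int main(void)
{
    memset(cache, -1, sizeof(cache));
    for (int i = 0; i < 6; i++) {
        int hits = 0;
        for (int k = 0; k < 4; k++) { /* arr[i] += arr[k] */
            hits += access_block(i);  /* read arr[i]  */
            hits += access_block(k);  /* read arr[k]  */
            hits += access_block(i);  /* write arr[i] */
        }
        printf("iteration %d: %d/12 hits\n", i, hits);
    }
    return 0;
}
```
Running it prints 8/12 hits for the first iteration and 7/12 for the final one, matching C01 and C02.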
## **Problem D**
Consider that "way prediction" is a method to enhance efficiency in set-associative caches. It involves forecasting the specific cache way that will likely be used for a given memory request. When the prediction is accurate, it eliminates the need to search through other cache ways. However, if the prediction is wrong, the process continues as it would in a standard set-associative cache access.

Examine a 16 KiB two-way set associative cache that does not employ way prediction. With a hit time of 5 cycles, an 80% hit rate, and a Level 2 (L2) cache access time of 30 cycles, calculate the Average Memory Access Time (AMAT) for this cache. __ D01 __ cycles
>* D01 = ?
==11==
### ๐ Annotate
:::info
The **Average Memory Access Time (AMAT)** is crucial for understanding the performance of a cache memory system. It combines the time taken for cache hits and the additional time for cache misses. Here's how you calculate it:
### Formula
AMAT = Hit Time + (Miss Rate × Miss Penalty)
- **Hit Time**: 5 cycles
- **Hit Rate**: 80% or 0.80
- **Miss Rate**: 1 - Hit Rate (since every access is either a hit or a miss)
- **Level 2 (L2) Cache Access Time**: 30 cycles (the time to retrieve data during a miss)
### Calculation
1. **Calculate the Miss Rate**:
- Since Hit Rate is 80%, the Miss Rate is 20%.
- `Miss Rate = 1 - Hit Rate = 1 - 0.80 = 0.20`
2. **Determine the Miss Penalty**:
- The Miss Penalty is the time to access the L2 cache.
- `Miss Penalty = L2 Cache Access Time = 30 cycles`
3. **Compute AMAT**:
- Using the given formula:
- `AMAT = Hit Time + (Miss Rate × Miss Penalty)`
- `AMAT = 5 cycles + (0.20 × 30 cycles)`
- `AMAT = 5 cycles + 6 cycles`
- `AMAT = 11 cycles`
:::
Evaluate an 8 KiB direct mapped cache with a hit time of 2 cycles, a 60% hit rate, and a Level 2 (L2) cache access time of 30 cycles. Calculate the AMAT for this cache. __ D02 __ cycles
>* D02 = ?
==14==
### ๐ Annotate
:::info
The **Average Memory Access Time (AMAT)** is a key performance metric in cache design. It represents the average number of cycles per memory access, considering both hits and misses. The AMAT for a direct-mapped cache can be calculated as follows:
### Formula
AMAT = Hit Time + (Miss Rate × Miss Penalty)
- **Hit Time**: 2 cycles (time it takes for a successful cache hit)
- **Hit Rate**: 60% (probability that a memory access finds the data in the cache)
- **Miss Rate**: 40% (probability that a memory access does not find the data in the cache)
- **L2 Cache Access Time**: 30 cycles (time to access the next level of cache or memory on a miss)
### Calculation Steps
1. **Calculate the Miss Rate**:
- The Miss Rate is the complement of the Hit Rate.
- `Miss Rate = 1 - Hit Rate = 1 - 0.60 = 0.40`
2. **Determine the Miss Penalty**:
- The Miss Penalty is the time to access the L2 cache.
- `Miss Penalty = L2 Cache Access Time = 30 cycles`
3. **Compute AMAT**:
- Plugging the values into the formula:
- `AMAT = Hit Time + (Miss Rate × Miss Penalty)`
- `AMAT = 2 cycles + (0.40 × 30 cycles)`
- `AMAT = 2 cycles + 12 cycles`
- `AMAT = 14 cycles`
:::
Now, evaluate a 16 KiB two-way set associative cache that incorporates way prediction. Calculate the AMAT for this cache configuration. __ D03 __ cycles
>* D03 = ?
==9.2==
### ๐ Annotate
:::info
The **Average Memory Access Time (AMAT)** for a two-way set associative cache with way prediction takes into account the times when the prediction is correct as well as when it's incorrect.
### Formula
AMAT = (Way Prediction Hit Rate × Hit Time) + (Way Misprediction Rate × Way Misprediction Penalty) + (Miss Rate × Miss Penalty)
- **Way Prediction Hit Rate**: 60% or 0.60 (accesses served directly from the predicted way)
- **Hit Time**: 2 cycles (for a correct way prediction)
- **Way Misprediction Rate**: 40% or 0.40
- **Way Misprediction Penalty**: 5 cycles (the additional time for an incorrect prediction)
- **Miss Rate**: 20% or 0.20
- **Miss Penalty**: 30 cycles (time to access the next level of memory on a miss)
### Calculation Steps
1. **Calculate the effective Hit Time**:
- Considering correct predictions: `0.60 × 2 = 1.2 cycles`
2. **Calculate the Way Misprediction Penalty**:
- Factoring in incorrect predictions: `0.40 × 5 = 2 cycles`
3. **Calculate the Miss Penalty**:
- For overall misses: `0.20 × 30 = 6 cycles`
4. **Compute AMAT**:
- Combining the values: `AMAT = 1.2 + 2 + 6`
- Resulting in: `AMAT = 9.2 cycles`
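These three results can be reproduced with a small helper (a sketch; the way-prediction line simply spells out the weighted sum used above):
```c
#include <stdio.h>

/* Classic two-level AMAT: hit time plus miss-rate-weighted penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* D01: two-way SA, 5-cycle hit, 80% hit rate, 30-cycle L2 */
    printf("D01 = %4.1f cycles\n", amat(5, 0.20, 30));
    /* D02: direct-mapped, 2-cycle hit, 60% hit rate, 30-cycle L2 */
    printf("D02 = %4.1f cycles\n", amat(2, 0.40, 30));
    /* D03: way prediction: 60% of accesses hit the predicted way in
     * 2 cycles, 40% pay a 5-cycle misprediction, 20% miss to L2 */
    printf("D03 = %4.1f cycles\n", 0.60 * 2 + 0.40 * 5 + 0.20 * 30);
    return 0;
}
```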
:::
## **Problem E**
We want to design a Finite State Machine (FSM) where the output at time step $t$ is 0 for the initial two time steps and then matches the input from time step $t-2$ for subsequent steps.
For instance, if the input sequence to this FSM is 0b01 1011 1011 1000 1001, the corresponding output should be 0b00 0110 1110 1110 0010.
To extend beyond a 2-cycle delay, consider an FSM that delays the input by 12 cycles, outputting 0 for the initial 12 cycles.
For example, if the input to this updated FSM is 0b01 1011 1011 1000 1001, the output should be 0b00 0000 0000 0001 1011.
Determine the minimum number of states required for an FSM to achieve this function. Provide your answer as a precise whole number. __ E01 __
>* E01 = ?
==4096==
To get this answer, notice that to reference an input from 12 cycles ago, the FSM must remember the last 12 input bits at all times. In other words, we must encode 12 bits of memory into our FSM states, requiring $2^{12}$ = 4096 states.
### ๐ Annotate
:::info
### Concept
- **FSM Function**: Transitions between states based on input history.
- **Purpose**: To remember inputs from previous cycles.
### Implementation for 12-cycle Memory
- **State Representation**: Each state represents a specific history of past 12 inputs.
- **Input Options**: Each input bit can be 0 or 1 (two options).
### Calculation of States
- **Total States**: $2^{12}$.
- **Reasoning**: For 12 input bits, with each bit having 2 options, the total combinations are $2^{12}$.
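Equivalently, the FSM state is just a 12-bit shift register holding the most recent inputs; a minimal sketch of the delayed-output behavior (the input string is the example from the problem):
```c
#include <stdio.h>

#define DELAY 12

int main(void)
{
    unsigned state = 0; /* 12 bits of input history = 2^12 possible states */
    const char *input = "011011101110001001"; /* 0b01 1011 1011 1000 1001 */

    for (const char *p = input; *p; p++) {
        int in = *p - '0';
        int out = (state >> (DELAY - 1)) & 1; /* bit shifted in 12 steps ago */
        state = ((state << 1) | in) & ((1u << DELAY) - 1);
        putchar('0' + out); /* prints 0b00 0000 0000 0001 1011 */
    }
    putchar('\n');
    return 0;
}
```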
:::
Determine the shortest possible clock period for a circuit that addresses this issue, assuming the register is widened to sufficiently many bits without affecting clk-to-q and setup times. Our circuit should be optimized for the minimum clock period, considering the given component delays. __ E02 __ ps

- `tAND` = 12ps
- `tOR` = 15ps
- `tNOT` = 4ps
- `tXOR` = 31ps
- `tRegister-clk-to-q` = 10ps
- `tRegister-setup` = 15ps
- `tBit-splitter` = 0ps
- `tWire` = 0ps
>* E02 = ?
==25==
As the circuit's design for extending to 12 bits mirrors the above solution (with a chain Input → Reg[0], Reg[0] → Reg[1], ..., Reg[11] → Output), the minimum combinational path delay is 0 ps (either from Input to D through a splitter, or from Q to Output through a splitter). Therefore, the minimum clock period equals the sum of t_clk-to-q and t_setup, which is 25 ps.
### ๐ Annotate
:::info
### Factors Affecting Clock Period
- **Slowest Path Delay**: Time for signal stabilization between sequential elements.
- **Inclusions**:
- Signal propagation delay through combinational logic.
- Clock-to-output delay (`t_clk-to-q`) of a register.
- Setup time (`t_setup`) of the next register.
### Calculation
- **Given**: Splitters and wires delay = 0 ps, XOR gate not in critical path.
- **Shortest Path**: Input → Splitter → Register (`D` input) → Register (`Q` output) → Splitter → Output.
- **Minimum Clock Period (`E02`)**:
- `E02 = tRegister-clk-to-q + tRegister-setup`
- `E02 = 10 ps + 15 ps`
- `E02 = 25 ps`
:::
## **Problem F**
Consider a memory address space where addresses are chunked in 64-byte segments. For instance, address 1024 is 64 bytes apart from address 1025, but only 1 byte away from the "address" $1024\frac{1}{64}$. In this system, standard integer binary addressing is ineffective, so consider using floating-point representation based on the IEEE-754 standard, which includes 1 sign bit. You will need to determine the appropriate number of exponent and mantissa bits.
Given a memory size of 4KiB and chunk addresses labeled as 0, 1, 2, etc., identify the minimum number of exponent bits needed in the floating-point memory address to access every byte, assuming the use of a standard bias. __ F01 __
>* F01 = ?
==4==
With a memory size of 4KiB segmented into 64B chunks, there are a total of 64 chunks. This necessitates the use of 4 exponent bits to cover chunk addresses from 0 to 63. With these 4 exponent bits, the highest possible exponent field (excluding the all-ones pattern reserved for infinity/NaN values) is 0b1110 = 14. Once the standard bias of -7 is applied, this value becomes 7, enabling coverage of values up to 63.
Conversely, using 3 exponent bits, the largest achievable exponent value, after applying a standard bias of -3, is only 3. This would be inadequate for representing the value 63.
### ๐ Annotate
:::info
- Memory size: 4KiB (4096 bytes)
- Chunk size: 64 bytes
- Number of chunks: 4096 / 64 = 64 chunks
### Calculation
1. **Determining the Range of Exponents:**
- The total number of chunks is 64, so chunk addresses run from 0 to 63, and the largest address to reach is just under $2^6$.
- In normalized binary, the largest address $63\frac{63}{64}$ is `0b1.11111111111` × $2^5$, so the exponent field must be able to encode an exponent as large as 5.
2. **Applying the Standard Bias:**
- The IEEE-754 floating-point standard uses a bias of $2^{k-1}$ - 1, where k is the number of exponent bits.
- For 4 exponent bits, the bias is $2^{4-1}$ - 1 = 7.
- Excluding the all-ones pattern (reserved for infinity/NaN), the largest stored exponent is 0b1110 = 14, so the maximum unbiased exponent is 14 - 7 = 7, which comfortably covers the required exponent of 5.
3. **Why 3 Bits Are Insufficient:**
- With 3 exponent bits the bias is 3, and the largest usable stored exponent is 6, giving a maximum unbiased exponent of 6 - 3 = 3; $2^3 = 8$ cannot reach chunk address 63.
:::
Determine the least number of mantissa bits needed in our floating point memory address system to precisely address each byte. __ F02 __
>* F02 = ?
==11==
The largest memory address we need to represent is 0b11 1111.111111 (that is, $63\frac{63}{64}$: chunk 63, byte 63).
To represent this value, we need 11 bits of mantissa (0b1.11111111111 × $2^5$).
### ๐ Annotate
:::info
### Mantissa Bits Calculation for Floating-Point Memory Address
- The largest memory address we need to represent is chunk 63 plus byte 63 within that chunk, i.e. $63\frac{63}{64}$ = `0b11 1111.111111`.
- The fractional part `63/64` is `0b0.111111` (6 bits after the binary point, one per byte position within a 64-byte chunk).
- Normalizing moves the binary point to just after the leading 1: `0b11 1111.111111` = `0b1.11111111111` × $2^5$.
- The implicit leading 1 is dropped, leaving 11 bits after the binary point (the 5 remaining chunk bits plus the 6 byte bits).
- Therefore, 11 mantissa bits are the minimum needed to address every byte exactly.
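A sketch of encoding a (chunk, byte) pair into this 1 + 4 + 11-bit format; the field layout and the `encode` helper are illustrative assumptions:
```c
#include <stdio.h>

#define EXP_BITS 4
#define MAN_BITS 11
#define BIAS ((1 << (EXP_BITS - 1)) - 1) /* standard bias = 7 */

/* Encode the address "chunk + byte/64" as sign | exponent | mantissa. */
static unsigned encode(unsigned chunk, unsigned byte)
{
    unsigned v = (chunk << 6) | byte; /* fixed-point value in 1/64 units */
    if (v == 0)
        return 0;
    int e = 0; /* bit position of the leading 1 */
    while ((v >> (e + 1)) != 0)
        e++;
    unsigned man = (v << (MAN_BITS - e)) & ((1u << MAN_BITS) - 1); /* drop implicit 1 */
    unsigned exp = (unsigned) (e - 6 + BIAS); /* true exponent is e - 6 */
    return (exp << MAN_BITS) | man; /* sign bit is 0 */
}

int main(void)
{
    /* Last byte of the last chunk: 63 + 63/64 = 0b1.11111111111 x 2^5 */
    printf("0x%04x\n", encode(63, 63)); /* exponent field 12, mantissa 0x7FF */
    return 0;
}
```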
:::
## **Problem G**
### ๐ Annotate
:::info
### Cache Coherence Protocols Analyzer
([418CacheSim final report](https://kshitizdange.github.io/418CacheSim/final-report))
### Overview
- Implements a cache simulator for analyzing how different snooping-based cache coherence protocols (MSI, MESI, MOSI, MOESI, Dragon, and Competitive Snooping) perform under various workloads.
- Given any program, the simulator can compare the performance of the protocols based on the number of Bus Transactions, Memory Requests, Memory Write-Backs, and Cache-to-Cache Transfers.
:::
Suppose a complex task is divided into four segments: PartA, PartB, PartC, and PartD. To efficiently tally their populations, each segment utilizes one core of a quad-core processor system, where each core has its own cache. These caches maintain coherence through a snoopy-based, write-invalidate MSI (Modified, Shared, Invalid) protocol, as illustrated. The goal is to examine the performance of this existing system to determine if improvements can be achieved.

Each core starts by accessing and modifying its local copy of the population data, stored under a suitably named label (like pop_A for the population of PartA). This process is demonstrated in the pseudocode below:
```
call pop_count # This function counts the additional population in the area.
lw a1, <pop_label> # This loads the earlier population count.
add a1, a1, a0 # This performs a calculation to sum up the total.
sw a1, <pop_label> # This saves the updated population count.
```
The sequence of these memory accesses is as follows:

We aim to closely examine the activities on Core A specifically. Understanding that these memory accesses are independent (meaning pop_A, pop_B, pop_C, and pop_D are distinct memory locations), it suffices to focus solely on the memory interactions on Core A to grasp its behavior. Please complete the table below by entering information in columns G01 and G02. For each entry, detail the bus transaction(s) resulting from each access, and the state of the cache line for pop_A following each access.
| Access | Shared bus transactions | Cache A |
|---------------|-------------------------|------------|
| Initial state | - | pop_A : I |
| lw a1, pop_A | BusRd | pop_A : G01|
| sw a1, pop_A | BusRdX | pop_A : G02|
>* G01 = ?
==S==
>* G02 = ?
==M==
### ๐ Annotate
:::info
### Cache Coherence Protocol
The MSI protocol has three states:
- **M (Modified):** The block has been modified in the cache. The data in the cache is then inconsistent with the backing store (e.g. memory). A cache with a block in the "M" state has the responsibility to write the block to the backing store when it is evicted.
- **S (Shared):** This block is unmodified and exists in read-only state in at least one cache. The cache can evict the data without writing it to the backing store.
- **I (Invalid):** This block is either not present in the current cache or has been invalidated by a bus request, and must be fetched from memory or another cache if the block is to be stored in this cache.
### Actions and State Transitions
When core A reads the data (`lw a1, pop_A`), it sends a BusRd request to load the data from the main memory, ==transitioning the cache line from I to S state.==
When core A writes to `pop_A` (`sw a1, pop_A`), it sends a BusRdX request to get exclusive access, ==transitioning the cache line to M state.==
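A compact sketch of the MSI transitions exercised here, seen from one cache line (the state and event names follow the description above; the function itself is illustrative):
```c
typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } msi_event_t;

/* Next state of a cache line for a processor-side or snooped bus event. */
static msi_state_t msi_next(msi_state_t cur, msi_event_t ev)
{
    switch (cur) {
    case MSI_I:
        if (ev == PR_RD) return MSI_S;   /* issue BusRd:  I -> S (G01) */
        if (ev == PR_WR) return MSI_M;   /* issue BusRdX: I -> M */
        return MSI_I;
    case MSI_S:
        if (ev == PR_WR)   return MSI_M; /* issue BusRdX: S -> M (G02) */
        if (ev == BUS_RDX) return MSI_I; /* another core wants to write */
        return MSI_S;
    case MSI_M:
        if (ev == BUS_RD)  return MSI_S; /* write back, then share */
        if (ev == BUS_RDX) return MSI_I; /* write back, then invalidate */
        return MSI_M;
    }
    return cur;
}
```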
:::
### ๐ Cache Coherence Simulation
:::info
```c
./run_cacheSim -c 4 -p MSI -t G1.trace
New request for 0
Thread 0 got request
New request for 1
Thread 1 got request
New request for 2
Thread 2 got request
New request for 3
Thread 3 got request
New request for 0
Thread Thread 3 Done with request
1 Done with request
Thread 2 Done with request
Thread 0 Done with request
Thread 0 got request
Thread 0 Done with request
New request for 1
Thread 1 got request
New request for 2
Thread 2 got request
Thread 1 Done with request
New request for 3
Thread 3 got request
Thread 2 Done with request
Thread 3 Done with request
Total Acceses = 8
Bus Transactions = 8
Memory Requests = 1
Memory Write Backs = 3
Cache Transfers = 3
```
Statistics:
- Total Accesses (8): The total number of memory accesses attempted during the simulation.
- Bus Transactions (8): The number of times the system bus was used for communication among caches.
- Memory Requests (1): The count of requests that had to be served directly from memory (not from cache).
- Memory Write Backs (3): The number of times data was written back to memory from a cache.
- Cache Transfers (3): Instances where data was transferred between caches without involving main memory.
:::
We are curious to know whether adopting an MESI protocol, as illustrated, could enhance the performance of this processor. Suppose these four cores were to use a snoopy-based, write-invalidate MESI protocol, what would be the sequence of bus transactions and the cache line states for pop_A from Core A? Please complete the table below by entering information in columns G03 and G04.

| Access | Shared bus transactions | Cache A |
|---------------|-------------------------|------------|
| Initial state | - | pop_A : I |
| lw a1, pop_A | BusRd | pop_A : G03|
| sw a1, pop_A | - | pop_A : G04|
>* G03 = ?
==E==
>* G04 = ?
==M==
### ๐ Annotate
:::info
### MESI Protocol Explanation
The MESI protocol enhances the performance of multi-core processors by managing the states of cache lines across different processor cores. Below is the explanation of how Core A interacts with its cache line `pop_A` under the MESI protocol.
### Initial State
- `pop_A` is initially in the **Invalid (I)** state. This indicates that Core A's cache does not have the data, or the data is stale.
### After Load Operation (`lw a1, pop_A`)
- A **BusRd** transaction is initiated to read the data from the main memory because `pop_A` is neither in the Modified (M) nor Exclusive (E) state.
- The state for `pop_A` in Core A's cache becomes **Exclusive (E)** after the operation. Core A is now reading the data without the intention of modifying it, assuming no other core has cached this data.
### After Store Operation (`sw a1, pop_A`)
- No bus transaction is required after the `sw` operation since `pop_A` is already in the Exclusive (E) state in Core A's cache.
- The state for `pop_A` becomes **Modified (M)**. Core A has updated the data, which now contains changes that are not yet written back to the main memory.
### Annotations
- **G03 = E**: The cache line `pop_A` is in the Exclusive state for Core A after the load operation. There's no need for further bus transactions to invalidate or update other caches as no other core is sharing this data.
- **G04 = M**: The cache line `pop_A` transitions to the Modified state after Core A performs the store operation. With `pop_A` already in the Exclusive state, Core A can modify it without additional bus transactions.
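The key difference from MSI, expressed as a sketch (names are illustrative): a read that no other cache can answer installs the line as Exclusive, and a later write upgrades E to M silently, with no bus transaction:
```c
typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_state_t;

/* Processor read on an Invalid line: BusRd; the snoop result decides E vs S. */
static mesi_state_t mesi_pr_rd(mesi_state_t cur, int other_sharers)
{
    if (cur == MESI_I)
        return other_sharers ? MESI_S : MESI_E; /* G03: E, since no sharers */
    return cur; /* read hits leave the state unchanged */
}

/* Processor write: from E (or M) no bus traffic is needed. */
static mesi_state_t mesi_pr_wr(mesi_state_t cur)
{
    /* From I or S, a BusRdX would be required first; from E, the
     * transition to M is silent (G04). */
    return MESI_M;
}
```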
:::
### ๐ Cache Coherence Simulation
:::info
```c
./run_cacheSim -c 4 -p MESI -t G1.trace
New request for 0
Thread 0 got request
Before wait
Thread 1 to respond
Thread 1 done responding
Thread 3 to respond
Thread 3 done responding
New request for 1
Thread 2 to respond
Thread 2 done responding
Thread 1 got request
After wait
New request for 2
Before wait
Thread 3 to respond
Thread 3 done responding
Thread 2 got request
Thread 0 to respond
Thread 0 done responding
New request for 3
Thread 2 to respond
Thread 2 done responding
After wait
Thread 3 got request
Before wait
New request for 0
Thread 3 to respond
Thread 3 done responding
Thread 0 to respond
Thread 0 done responding
Thread 1 to respond
Thread 1 done responding
After wait
Before wait
Thread 0 to respond
Thread 1 to respond
Thread 1 done responding
Thread 0 done responding
Thread 2 to respond
Thread 2 done responding
After wait
Thread 1 Done with request
Thread 2 Done with request
Thread 0 Done with request
Thread 0 got request
Before wait
Thread 3 Done with request
Thread 1 to respond
Thread 1 done responding
New request for Thread 2 to respond
Thread 2 done responding
Thread 3 to respond
Thread 3 done responding
1
After wait
Thread 0 Done with request
Thread 1 got request
New request for 2
Thread 2 to respond
Thread 2 done responding
Thread Thread 02 to respond
Thread 0 done responding
got request
Thread 3 to respond
New request for 3
Thread 3 got request
Thread 3 done responding
Thread 1 Done with request
Thread 1 to respond
Thread 0 to respond
Thread 0 done responding
Thread 3 to respond
Thread 3 done responding
Thread 1 done responding
Thread 2 Done with request
Thread 2 to respond
Thread 2 done responding
Thread 0 to respond
Thread 0 done responding
Thread 1 to respond
Thread 1 done responding
Thread 3 Done with request
Total Acceses = 8
Bus Transactions = 8
Memory Requests = 1
Memory Write Backs = 3
Cache Transfers = 3
```
The output from the MESI protocol simulation in the cache simulator indicates how different threads handle cache requests. In the MESI protocol:
- New Requests: Threads receive new memory access requests.
- Response Process: "Thread [x] to respond" and "done responding" indicate threads responding to cache misses, following MESI protocol rules.
- Waiting States: "Before wait" and "After wait" suggest synchronization points, possibly for maintaining cache consistency.
:::
Each core proceeds to modify the overall census count, stored at a shared label named sum, as demonstrated in the following pseudocode:
```
lw a2, sum # retrieves the current global census count.
add a2, a2, a0 # adds to the population count.
sw a2, sum # saves the new population count.
```
The sequence of these memory accesses is as follows:

Complete the table provided by detailing the bus transactions resulting from each memory access and the corresponding states of the cache line for "sum" after each access. An additional copy of this table is available at the end of the exam. Remember, this processor utilizes an MSI protocol for cache coherence.
| Access | Shared bus transactions | Cache A | Cache B | Cache C | Cache D |
|------------------|-------------------------|----------|----------|----------|----------|
| Initial state | - | sum: I | sum: I | sum: I | sum: I |
| Core A: lw a2, sum | BusRd | sum: G05 | sum: I | sum: I | sum: I |
| Core A: sw a2, sum | BusRdX | sum: G06 | sum: I | sum: I | sum: I |
| Core B: lw a2, sum | BusRd, BusWB | sum: S | sum: S | sum: I | sum: I |
| Core C: lw a2, sum | BusRd | sum: G07 | sum: G08 | sum: G09 | sum: I |
| Core C: sw a2, sum | BusRdX | sum: I | sum: I | sum: G10 | sum: I |
| Core D: lw a2, sum | BusRd, BusWB | sum: I | sum: I | sum: S | sum: G11 |
| Core B: sw a2, sum | BusRdX | sum: I | sum: G12 | sum: I | sum: I |
| Core D: sw a2, sum | BusRdX, BusWB | sum: I | sum: I | sum: I | sum: G13 |
>* G05 = ?
==S==
>* G06 = ?
==M==
>* G07 = ?
==S==
>* G08 = ?
==S==
>* G09 = ?
==S==
>* G10 = ?
==M==
>* G11 = ?
==S==
>* G12 = ?
==M==
>* G13 = ?
==M==
### ๐ Annotate
:::info
### Cache Coherence with MSI Protocol
In a system with multiple processor cores, each with its own cache, maintaining data consistency is essential. The MSI (Modified, Shared, Invalid) cache coherence protocol is a mechanism used to ensure that all caches in the system reflect the same state of shared data. Below is an explanation of how this protocol operates during various memory access scenarios involving the shared variable `sum`.
### Initial State
- All cores start with `sum` in the **Invalid (I)** state. This means that the data for `sum` is not present in any cache or it is outdated.
### Core A Operations
- **Load (`lw a2, sum`)**: Core A needs to read `sum`, so it issues a **BusRd** transaction. Since `sum` is not present in its cache, it must be loaded from the main memory, transitioning `sum` to the **Shared (S)** state (G05). This state indicates that while the data is now in Core A's cache, it might also be present and up-to-date in other caches.
- **Store (`sw a2, sum`)**: To write to `sum`, Core A requires exclusive access. It broadcasts a **BusRdX** transaction to invalidate `sum` in other caches. Once this is done, `sum` moves to the **Modified (M)** state (G06), indicating that Core A has the only valid and up-to-date copy of `sum`.
### Core B Operations
- **Load (`lw a2, sum`)**: When Core B reads `sum` while it is in the Modified state in Core A's cache, Core A must write the updated value back to main memory (**BusWB**). Afterward, `sum` is **Shared (S)** in both Core A and Core B.
### Core C Operations
- **Load (`lw a2, sum`)**: Core C's load results in a **BusRd** transaction. As `sum` is already Shared in the caches of Core A and Core B, those copies simply remain Shared (G07 and G08), and Core C's copy becomes Shared (S) as well (G09), allowing read access without further traffic.
- **Store (`sw a2, sum`)**: For Core C to update `sum`, it issues a **BusRdX**, which invalidates `sum` in all other caches. Core C's cache line for `sum` now holds the only valid copy, and it's in the Modified (M) state (G10), ready for write-back when necessary.
### Core D Operations
- **Load (`lw a2, sum`)**: Similar to previous loads, Core D issues a **BusRd**. Since `sum` is in the Modified state in Core C's cache, Core C must perform a write-back (**BusWB**). This action brings `sum` to the Shared (S) state in Core D's cache (G11).
### Core B and D Store Operations
- When Core B wants to write to `sum` again, it must ensure exclusive access by issuing a **BusRdX**, which invalidates the Shared copies in the caches of Core C and Core D. Now, `sum` is in the Modified (M) state in Core B's cache (G12).
- Core D's store operation also requires a **BusRdX**. Because `sum` is currently Modified in Core B's cache, Core B must write it back (**BusWB**) before invalidating its copy. Afterward, `sum` is in the Modified (M) state in Core D's cache (G13), indicating that it holds the latest version of `sum`.
:::
### ๐ Cache Coherence Simulation
:::info
```c
./run_cacheSim -c 4 -p MSI -t G3.trace
New request for 0
Thread 0 got request
New request for 1
Thread 1 got request
New request for 2
Thread 2 got request
New request for 3
Thread 3 got request
New request for 0
Thread 1 Done with request
Thread 2 Done with request
Thread 3 Done with request
Thread 0 Done with request
Thread 0 got request
New request for 1
Thread 1 got request
New request for 2
Thread 2 got request
New request for 3
Thread 1 Done with request
Thread Thread 0 Done with request
3 got request
Thread 2 Done with request
Thread 3 Done with request
Total Acceses = 8
Bus Transactions = 8
Memory Requests = 1
Memory Write Backs = 3
Cache Transfers = 3
```
:::
We identify an issue with the current census counting implementation: two processors might simultaneously read sum, each add their population count to it without seeing the other's addition, and then write back to sum, leading to inaccuracies.
To guarantee a correct final value of sum, one that accurately reflects the aggregate of the a0 values from each processor, where should we integrate semaphores? These semaphores should minimize blocking of instructions. Define and initialize any necessary semaphores, then add WAIT and SIGNAL commands in the pseudocode below to guarantee accurate calculation of sum. It is not required to include ecall instructions in this modification.
Semaphore Initialization:
```
sum_sem = 1
```
Pseudocode:
```
call pop_count # Z1: counts the new population of the area
lw a1, <pop_label> # Z2: loads the previous population
add a1, a1, a0 # Z3: calculates a sum
sw a1, <pop_label> # Z4: stores the updated population
lw a2, sum # Z5: gets the current global count
add a2, a2, a0 # Z6: updates the population
sw a2, sum # Z7: stores the updated population
```
Specify the precise instruction, labeled Z1 to Z7, after which the wait(sum_sem) should be inserted. This is to guarantee exclusive access to "sum" at crucial update moments, thereby ensuring data accuracy. __ G14 __
>* G14 = ?
==Z4== (The full instruction, i.e. `sw a1, <pop_label>`, is also an accepted answer.)
Specify the precise instruction, labeled Z1 to Z7, after which the signal(sum_sem) should be inserted. This is to guarantee exclusive access to "sum" at crucial update moments, thereby ensuring data accuracy. __ G15 __
>* G15 = ?
==Z7== (The full instruction, i.e. `sw a2, sum`, is also an accepted answer.)
### ๐ Annotate
:::info
### Semaphore Integration in Pseudocode
When dealing with shared resources in concurrent programming, it's crucial to avoid race conditions. Semaphores provide a mechanism to ensure that only one processor can access a shared variable at a time. Below is the pseudocode for updating a census count, with semaphores integrated to ensure data accuracy.
### Semaphore Initialization
```c
int sum_sem = 1; // A binary semaphore initialized to 1, indicating availability.

void update_population() {
    call pop_count;     // Z1: Count the new population of the area.
    lw a1, <pop_label>  // Z2: Load the previous population count.
    add a1, a1, a0      // Z3: Add the new population to the total.
    sw a1, <pop_label>  // Z4: Store the updated population count.
    WAIT(sum_sem);      // Inserted after Z4: acquire the semaphore before touching the shared sum.
    lw a2, sum          // Z5: Load the current global census count.
    add a2, a2, a0      // Z6: Add to the population count.
    sw a2, sum          // Z7: Store the new population count.
    SIGNAL(sum_sem);    // Inserted after Z7: release the semaphore.
}
```
==The WAIT operation is inserted after Z4== to ensure that the processor acquires exclusive access to the sum before reading and updating it. ==The SIGNAL operation is placed after Z7== to indicate the release of the semaphore, allowing other processors to access the sum.
### Ensuring Exclusive Access
To maintain the integrity of the sum during concurrent updates, the semaphore is utilized as follows:
* G14 = Z4: The WAIT(sum_sem) command is issued after storing the updated local population count, ensuring exclusive access for the upcoming global sum update.
* G15 = Z7: The SIGNAL(sum_sem) command is issued after storing the new global sum, signaling that the next processor may now access sum.
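For reference, the same structure in runnable C with POSIX semaphores (a sketch; `pop_count` and the per-core population slot are stand-ins for the quiz's pseudocode):
```c
#include <pthread.h>
#include <semaphore.h>

static sem_t sum_sem; /* binary semaphore guarding the shared sum */
static int sum;       /* shared global census count */

static int pop_count(void) { return 1; } /* stand-in for the real routine */

static void *census_worker(void *arg)
{
    int *pop_label = arg; /* per-core population slot (pop_A, pop_B, ...) */
    int a0 = pop_count(); /* Z1 */
    *pop_label += a0;     /* Z2-Z4: private data, no locking needed */

    sem_wait(&sum_sem);   /* after Z4 (G14): enter the critical section */
    sum += a0;            /* Z5-Z7: read-modify-write of the shared sum */
    sem_post(&sum_sem);   /* after Z7 (G15): leave the critical section */
    return NULL;
}

/* Before starting the threads: sem_init(&sum_sem, 0, 1); matches sum_sem = 1. */
```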
:::
## **Problem H**
Assuming a processor with 38-bit virtual addresses, 26-bit physical addresses, and a 256-byte page size, the Page Table employs an LRU strategy for managing pages and addresses missing pages through a page fault handler. This processor has a 4-entry, direct-mapped TLB (Translation Lookaside Buffer) and utilizes VPN (Virtual Page Number) bits 1:0 for indexing into the TLB.
Refer to the provided diagram which shows the TLB contents before executing two instructions. The diagram includes both tag and index bits of the VPN for clarity, although only tag bits are typically stored in the TLB. After executing each of the two instructions, update the TLB contents accordingly. Note that the TLB's dirty bits are synchronized with those in the page table.
Remember: The TLB stores page table entries, which may be altered by the page fault handler. For accurate operation, most processors use a special instruction to invalidate a TLB entry. The page fault handler is assumed to use this instruction. Also, assume all page fault handler accesses are through physical memory and bypass the TLB.
- [ ] Initial TLB contents
| Index | VPN (Tag+Index) | V | D | PPN |
|-------|-----------------|---|---|-------|
| 0 | 0x200 | 1 | 1 | 0xBC |
| 1 | 0x1 | 1 | 0 | 0x123 |
| 2 | 0x36 | 1 | 0 | 0x412 |
| 3 | 0x23 | 1 | 1 | 0x99 |
Update the contents of the TLB following the execution of the instruction lw x12, 0x180(x0) with the PC set at 0x2FC.
| Index | VPN (Tag+Index) | V | D | PPN |
|-------|-----------------|---|---|-------|
| 0 | 0x200 | 1 | 1 | 0xBC |
| 1 | 0x1 | 1 | 0 | 0x123 |
| 2 | H01 | H02 | H03 | H04 |
| 3 | 0x23 | 1 | 1 | 0x99 |
>* H01 = ?
==0x36==
>* H02 = ?
==1==
>* H03 = ?
==0==
>* H04 = ?
==0x412==
### ๐ Annotate
:::info
### TLB Details
- **4-entry, Direct-mapped TLB**: The TLB has four slots for storing page table entries.
- **VPN Indexing**: VPN bits 1:0 are used for indexing into the TLB, indicating that the TLB uses the lowest two bits of the VPN for this purpose.
### Instruction Analysis
- **Load Instruction**: `lw x12, 0x180(x0)` loads a word from virtual address `0x180` into register `x12`.
- **VPN Calculation**: With a 256-byte page size, the low 8 bits of the address are the page offset, so the VPN of `0x180` is `0x180 >> 8` = `0x1`.
### TLB Indexing
- **Index Calculation**: The VPN `0x1` (binary `01`) uses its lowest two bits to index into the TLB, selecting index `1`.
### Execution and TLB Update
- **TLB Hit**: The entry at index `1` already holds VPN `0x1` with PPN `0x123`, so the load hits in the TLB and no entry is replaced.
- **Dirty Bit**: A load does not modify the page, so the dirty bit of that entry stays `0`.
### Post-Execution TLB State
- The entry at index `2` is untouched by this access, so it keeps its original contents: VPN = `0x36`, Valid = `1`, Dirty = `0`, PPN = `0x412` (H01-H04).
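A small sketch of the address breakdown used above (38-bit VAs, 256-byte pages, a 4-entry direct-mapped TLB indexed by VPN bits 1:0):
```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS    8 /* 256-byte pages: low 8 bits are the offset */
#define TLB_IDX_BITS 2 /* 4 entries, indexed by VPN bits 1:0 */

int main(void)
{
    uint64_t va[] = { 0x180, 0x60C, 0x300 }; /* accesses from this problem */
    for (int i = 0; i < 3; i++) {
        uint64_t vpn = va[i] >> PAGE_BITS; /* strip the page offset */
        unsigned idx = vpn & ((1u << TLB_IDX_BITS) - 1);
        uint64_t tag = vpn >> TLB_IDX_BITS;
        printf("VA 0x%03llx -> VPN 0x%llx, TLB index %u, tag 0x%llx\n",
               (unsigned long long) va[i], (unsigned long long) vpn, idx,
               (unsigned long long) tag);
    }
    return 0;
}
```
This prints VPN `0x1`/index `1` for `0x180`, VPN `0x6`/index `2` for `0x60C`, and VPN `0x3`/index `3` for `0x300`, matching the entries discussed above.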
:::
- [ ] Update the contents of the TLB following the execution of the instruction sw x12, 0x60C(x0) with the PC set at 0x300.
| Index | VPN (Tag+Index) | V | D | PPN |
|-------|-----------------|---|---|-------|
| 0 | 0x200 | 1 | 1 | 0xBC |
| 1 | 0x1 | 1 | 0 | 0x123 |
| 2 | H05 | 1 | 1 | H06 |
| 3 | H07 | 1 | 0 | H08 |
>* H05 = ?
==0x6==
>* H06 = ?
==0x651==
>* H07 = ?
==0x3==
>* H08 = ?
==0xABC==
### ๐ Annotate
:::info
### Address Translation for `0x60C`
- **Virtual Address (VA)**: `0x60C` (from `0x60C(x0)`; `x0` is hardwired to zero)
- **VPN Calculation**:
- Since page size is 256 bytes, lower 8 bits are for offset.
- VPN for `0x60C` = `0x6` (upper bits of the VA).
### TLB Lookup and Update
- **Index for VPN `0x6`**: `2` (direct-mapped, lowest 2 bits of the VPN)
- **Action**:
- The entry at index `2` does not match VPN `0x6`, so the access misses; the page table entry for VPN `0x6` is brought into the TLB (via the page fault handler if the page is not resident).
- Because the access is a store, the dirty bit of the new entry is set to `1`.
- **Updated TLB Entry at Index `2`**:
- `H05 = VPN = 0x6`: the required VPN.
- `H06 = PPN = 0x651`: the physical page assigned by the page table.
### Address Translation for PC `0x300`
- **VPN Calculation for PC**: the instruction fetch at PC `0x300` translates VPN `0x300 >> 8` = `0x3`.
- **Index for VPN `0x3`**: `3` (direct-mapped, lowest 2 bits of the VPN). The old entry (VPN `0x23`) does not match, so it is replaced.
- **Updated TLB Entry at Index `3`**:
- `H07 = VPN = 0x3`: the VPN of the PC's page; the dirty bit is `0`, since an instruction fetch does not write.
- `H08 = PPN = 0xABC`: the physical page assigned by the page table.
- The TLB is updated to reflect the new page table entries after handling the `sw` instruction and PC update.
- New entries for VPNs `0x6` and `0x3` are loaded into the TLB with corresponding PPNs `0x651` and `0xABC`.
:::
## **Problem I**
RISC-V includes a floating-point extension (aka "F") supporting IEEE-754 single-precision floats, featuring 1 sign bit, 8 exponent bits, and 23 mantissa bits. This extension introduces new instructions and 32 floating-point registers (f0 to f31), each holding 32 bits.
Key Instructions:
* fadd.s rd, rs1, rs2 : Performs floating-point addition, setting rd = rs1 + rs2.
* fsub.s rd, rs1, rs2 : Conducts floating-point subtraction, with rd = rs1 - rs2.
* fmul.s rd, rs1, rs2 : Executes floating-point multiplication, resulting in rd = rs1 * rs2.
* fdiv.s rd, rs1, rs2 : Carries out floating-point division, assigning rd = rs1 / rs2.
* fmv.w.x rd, rs1 : Moves the raw 32-bit pattern in integer register rs1 unchanged into floating-point register rd (no numeric conversion).
Note: The four arithmetic floating-point instructions solely operate on floating-point registers. Floating-point registers are incompatible with base instructions. For example, fadd.s f0, t1, t2 and addi f0, t1, 5 are invalid.
The instruction fmv.w.x transfers a bitstring from an integer register rs1 to a floating-point register rd. For instance, if t0 holds the hexadecimal value 0x40800000 and fmv.w.x f0, t0 is executed, then f0 will represent the floating point value 4.0.
Convert the value 3 into its hexadecimal representation using IEEE-754 single-precision floating point format. __ I01 __
>* I01 = ?
==0x40400000==
First write 3 in binary scientific notation: 3 = 0b11 = 0b1.1 × $2^1$.
Sign bit is 0b0 since the number is positive.
Exponent is 1. To encode in biased notation, we subtract the bias (for an 8-bit exponent, use the default bias of $-(2^{8-1} - 1) = -127$) to get $1 - (-127) = 128$. Then encode in 8-bit unsigned binary to get 0b1000 0000.
Mantissa is 0b1 (dropping the implicit 1). Adding trailing zeros until itโs 23 bits long gives
0b100 0000 0000 0000 0000 0000.
Putting sign, exponent, mantissa together:
0b0 1000 0000 100 0000 0000 0000 0000 0000
Regrouping bits:
0b0100 0000 0100 0000 0000 0000 0000 0000
Converting to hex: 0x4040 0000
### ๐ Annotate
:::info
To convert the decimal number 3 into its IEEE-754 single-precision floating-point hexadecimal representation, we proceed as follows:
1. **Binary Scientific Notation**:
- Decimal 3 in binary is `11`.
- Normalized for IEEE-754, it is `1.1 ร 2^1`.
2. **Sign Bit**:
- Since 3 is positive, the sign bit is `0`.
3. **Exponent**:
- The exponent is `1 + 127` (bias for single-precision) which equals `128`.
- In binary, 128 is `1000 0000`.
4. **Mantissa (Significand)**:
- After normalization, we drop the leading `1` and pad with zeros to get 23 bits: `100 0000 0000 0000 0000 0000`.
5. **Complete IEEE-754 Binary Format**:
- Sign: `0`
- Exponent: `1000 0000`
- Mantissa: `100 0000 0000 0000 0000 0000`
6. **Full 32-bit Binary**:
- Combine all parts: `0 1000 0000 100 0000 0000 0000 0000 0000`
7. **Hexadecimal Representation**:
- Convert the full binary to hex: `0x40400000`
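The conversion can be verified mechanically by reinterpreting a float's bits in C (a sketch; `float_bits` mirrors what the inverse move `fmv.x.w` does in hardware):
```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return the raw IEEE-754 bit pattern of a single-precision float. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u)); /* reinterpret the bits, no conversion */
    return u;
}

int main(void)
{
    printf("3.0f     -> 0x%08X\n", float_bits(3.0f));        /* 0x40400000 */
    printf("4.0f     -> 0x%08X\n", float_bits(4.0f));        /* 0x40800000 */
    printf("4.0f/3.0 -> 0x%08X\n", float_bits(4.0f / 3.0f)); /* 0x3FAAAAAB */
    return 0;
}
```
The last line anticipates the next part of this problem, where 4/3 is computed in floating point.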
:::
Let's create a sequence of instructions to load a floating-point value as close as possible to 1.33333... into register f1. Restrict your instructions to the five floating-point operations previously mentioned. Each line should contain only one RISC-V instruction or pseudoinstruction.
```
li t4, 0x40800000
li t3, 0x40400000
__ I02 __
__ I03 __
__ I04 __
```
>* I02 = ?
==fmv.w.x f4, t4==
>* I03 = ?
==fmv.w.x f3, t3==
>* I04 = ?
==fdiv.s f1, f4, f3==
Put the bit patterns for 4 (in floating-point) and 3 (in floating-point) into two integer
registers. For clarity, we used t4 to hold 4 and t3 to hold 3 in our solution.
Then, move these bit patterns into floating-point registers. For clarity, we used f4 to hold 4
and f3 to hold 3 in our solution.
Finally, use floating-point division to compute 4/3 and put the result in f1.
Note that something like li t4, 4 and li t3, 3 would not work; these would put the bit patterns 0x0000 0004 and 0x0000 0003 into the registers, and once moved into floating-point registers, those bit patterns do not represent the numbers 4 and 3 when interpreted as floats.
Note that we have to use li t4, 0x40800000 or lui t4, 0x40800 to set the upper 20 bits of the register. Using an instruction like addi alone won't work because I-type immediates are only 12 bits long.
Some alternate solutions exist, such as computing 1 + 1/3, or somehow determining that 0x3FAA AAAB is the floating-point number closest to 4/3 (how you'd compute this is out of scope) and hard-coding this number into f1.
### ๐ Annotate
:::info
### Setting up the Integer Registers with Floating-Point Bit Patterns
The `li` (load immediate) instruction is employed to place a 32-bit immediate value into an integer register:
- `li t4, 0x40800000`: Loads the binary representation of the floating-point number 4.0 into register `t4`.
- `li t3, 0x40400000`: Loads the binary representation of the floating-point number 3.0 into register `t3`.
These values, `0x40800000` for 4.0 and `0x40400000` for 3.0, correspond to the IEEE-754 encoding for these floating-point numbers.
### Moving Bit Patterns from Integer to Floating-Point Registers
The `fmv.w.x` instruction moves a word from an integer register to a floating-point register without altering the bit pattern:
- `fmv.w.x f4, t4`: Transfers the bit pattern from `t4` to `f4`, setting `f4` to hold the floating-point value 4.0.
- `fmv.w.x f3, t3`: Transfers the bit pattern from `t3` to `f3`, setting `f3` to hold the floating-point value 3.0.
### Performing Floating-Point Division
The `fdiv.s` instruction executes single-precision floating-point division:
- `fdiv.s f1, f4, f3`: Divides the value in `f4` by `f3` and places the result into `f1`, which results in a value approximating 1.33333...
:::
## **Problem J**
Examine the circuit provided. The AND and OR gates each have a delay of 8 ps, while the NOT gates have a delay of 4 ps. Additionally, all registers in the circuit are subject to a setup time constraint of 6 ps and a clock-to-Q delay of 3 ps. For the purpose of this analysis, assume that all connecting wires are ideal, meaning they have no delay.

What is the smallest combinational delay of all paths in this circuit? __ J01 __ ps
>* J01 = ?
==12==
The shortest path between two registers passes through a NOT gate followed by an AND gate, cumulatively resulting in a delay of 4 ps + 8 ps, which equals 12 ps.
What is the maximum possible hold time constraint for registers to function properly
in this circuit? __ J02 __ ps
>* J02 = ?
==15==
Hold time = smallest combinatorial delay + clock-to-Q delay = 12+3 = 15 ps
What is the maximum allowable clock frequency for this circuit to function properly? __ J03 __ GHz
>* J03 = ?
==40==
Shortest clock period = clock-to-Q delay + largest combinational delay + setup time = 3 + 16 + 6 = 25 ps.
1 / (25 ps) = 40 GHz
### ๐ Annotate
:::info
### J01: Smallest Combinational Delay
The smallest combinational delay in a digital circuit is the time it takes a signal to propagate along the shortest path from the output of one register to the input of another through the combinational logic. In this circuit, the shortest such path goes through a NOT gate followed by an AND gate, so the smallest combinational delay is the sum of their delays:
```plaintext
J01 = NOT gate delay + AND gate delay
J01 = 4 ps + 8 ps
J01 = 12 ps
```
### J02: Maximum Possible Hold Time Constraint
The hold time constraint is assessed by considering the smallest combinational delay in the circuit and the inherent delay from the clock signal to the output of the register (clock-to-Q delay).
The equation for calculating the maximum possible hold time (`J02`) is:
```plaintext
J02 = Smallest combinational delay + Clock-to-Q delay
J02 = 12 ps + 3 ps
= 15 ps
```
This gives a bound of 15 ps: after a clock edge, a register's input is guaranteed to remain stable for the clock-to-Q delay plus the shortest combinational delay, so any register hold-time requirement up to 15 ps is satisfied.
### J03: Maximum Allowable Clock Frequency
The maximum clock frequency for a digital circuit is constrained by the total delay of the longest path within the circuit, known as the critical path delay. This delay dictates the shortest possible clock period, ensuring that signals have enough time to propagate through the circuit before the next clock cycle begins.
The critical path delay is calculated as follows:
```plaintext
Critical Path Delay = clock-to-Q delay + AND gate delay + OR gate delay + setup time
Critical Path Delay = 3 ps + 8 ps + 8 ps + 6 ps
= 25 ps
```
The maximum frequency is determined by the inverse of the critical path delay:
```plaintext
f_max = 1 / Critical Path Delay
f_max = 1 / 25x10^-12 s
f_max = 40 x 10^9 Hz
f_max = 40 GHz
```
:::
## **Problem K**
In various applications, it is often necessary not only to identify the maximum value in an array but also to determine the index of this maximum element, known as the argmax. To achieve this efficiently, we can employ Data Level Parallelism. The function arg_max below accepts an array arr and its length n, returning the index of the highest value. In cases where multiple indices hold the same maximum value, the function will return the earliest of these indices.
Utilize the provided pseudo SIMD (Single Instruction, Multiple Data) intrinsics to complete the argmax function, ensuring it operates as intended. These SIMD intrinsics work with vec structs, which are similar to Intel's __m128i structs, containing 4 packed integers. Complete the sections labeled K01, K02, and K03 in the function to ensure it performs as intended.
SIMD Instructions:
* vec sum_epi32(vec a, vec b) : Returns a + b.
* vec and_epi32(vec a, vec b) : Returns a & b.
* vec set_epi32(int a) : Returns a SIMD vector with all entries set to a.
* vec load_epi32(int *a) : Returns a SIMD vector with entries a[0], a[1], a[2], and a[3].
* int reducemax_epi32(vec a) : Returns the maximum int value in vector a.
* vec maskeq_epi32(vec a, int b) : Returns a mask vector with 0xFFFFFFFF at indices where a equals b, and 0 otherwise.
* int firstv_epi32(vec a) : Returns the index of the first entry with its lowest bit set to 1.
```c
int arg_max(int *arr, int n)
{
    int curr, index = 0, running_max = -2147483648; /* -2^31 */
    for (int i = 0; i < n / 4 * 4; i += 4) {
        vec tmp = load_epi32(arr + i);
        curr = reducemax_epi32(tmp);
        if (curr > running_max) {
            tmp = K01 /* Fill in your code here */;
            index = i + K02 /* Fill in your code here */;
            running_max = curr;
        }
    }
    for (int i = n / 4 * 4; i < n; i += 1) {
        if (K03 /* Fill in your code here */) {
            running_max = arr[i];
            index = i;
        }
    }
    return index;
}
```
>* K01 = ?
==maskeq_epi32(tmp, curr)==
### ๐ Annotate
:::info
### K01: Creating a Mask Vector
```c=
tmp = maskeq_epi32(tmp, curr); // K01
```
In K01, we need to find the positions within the SIMD vector tmp where the elements are equal to the current maximum value (curr). The maskeq_epi32 function creates a mask vector, setting bits to 0xFFFFFFFF (all ones) at indices where the elements in tmp are equal to curr. This mask is crucial for identifying the first occurrence of the maximum value within the current SIMD vector.
:::
>* K02 = ?
==firstv_epi32(tmp)==
### ๐ Annotate
:::info
### K02: Finding the Index of the First Maximum
```c=
index = i + firstv_epi32(tmp); // K02
```
After creating the mask vector in K01, we need to find the index of the first maximum element within the SIMD vector. The firstv_epi32 function returns the index of the first element in tmp with its lowest bit set, which corresponds to the first 0xFFFFFFFF in our mask. Adding this index to the base index i (the starting position of the SIMD block within the array) gives us the absolute position of the first occurrence of the maximum value in the original array.
:::
>* K03 = ?
==arr[i] > running_max== (Any equivalent form is also accepted.)
### ๐ Annotate
:::info
### K03: Handling Remaining Elements
```c=
if (arr[i] > running_max) { // K03
running_max = arr[i];
index = i;
}
```
The SIMD part of the function only processes complete blocks of 4 elements. K03 handles the remaining elements in the array that do not fit into a SIMD block. We need to check each of these elements to see if any of them is greater than the current running_max. If we find a greater value, we update running_max and the corresponding index. This ensures that the function correctly identifies the maximum value and its index across the entire array.
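To sanity-check the completed function, the pseudo intrinsics can be modeled in plain C (a sketch; these definitions are stand-ins for illustration, not a real SIMD library, and should precede `arg_max` in the same file):
```c
#include <stdio.h>

typedef struct { int v[4]; } vec; /* 4 packed ints, like __m128i */

static vec load_epi32(int *a)
{
    vec r;
    for (int j = 0; j < 4; j++)
        r.v[j] = a[j];
    return r;
}

static int reducemax_epi32(vec a)
{
    int m = a.v[0];
    for (int j = 1; j < 4; j++)
        if (a.v[j] > m)
            m = a.v[j];
    return m;
}

static vec maskeq_epi32(vec a, int b)
{
    vec r;
    for (int j = 0; j < 4; j++)
        r.v[j] = (a.v[j] == b) ? -1 : 0; /* -1 = 0xFFFFFFFF, all bits set */
    return r;
}

static int firstv_epi32(vec a)
{
    for (int j = 0; j < 4; j++)
        if (a.v[j] & 1) /* lowest bit set */
            return j;
    return -1;
}

int arg_max(int *arr, int n); /* the completed function above */

int main(void)
{
    int arr[] = { 3, 9, 9, 1, 9, 2, 9 }; /* max 9 first appears at index 1 */
    printf("%d\n", arg_max(arr, 7));     /* prints 1: earliest index of max */
    return 0;
}
```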
:::