# Annotate and explain Quiz 6/7 with additional challenge problems: Quiz 7

## Problem A

Consider that we are using the typical 5-stage pipelined datapath. Also, assume that the CPU always predicts branches as not taken. Review the following RISC-V code snippet:

```cpp=
beq x0, x0, L1
addi x0, x0, 3
addi x0, x0, 1
addi x0, x0, 4
addi x0, x0, 1
addi x0, x0, 5
L1: addi t0, x0, 9
addi t1, t0, 2
xori t1, x0, 6
```

Imagine the Instruction Fetch (IF) stage for the beq instruction at line 1 occurs in cycle 1. Assuming that all pipeline hazards are managed by inserting stalls (without using techniques like double pumping or forwarding paths), identify in which cycle the Write Back (WB) stage for the xori instruction on the last line will occur. For each pipeline hazard identified, please specify the nature of the hazard, the instructions it involves, and the number of stall cycles required. This section will only be reviewed if a regrade is requested. __ A01 __

>- A01 = ?

A control hazard occurs between lines 1 and 7 because the branch is always taken while the CPU predicts it as not taken. Additionally, a data hazard arises between lines 7 and 8, involving a write to and a subsequent read from register t0. (Credit is given for identifying the sources of the control hazard and the data hazard.)
This leads to the timing diagram below:
![image](https://hackmd.io/_uploads/B1PtY1bY6.png)

If double pumping and all forwarding paths are utilized, in which cycle will the Write Back (WB) stage for the xori instruction on the last line occur? Please identify each hazard, the relevant instructions, and the number of stall cycles needed. Note: This section will be reviewed only if a regrade is requested. __ A02 __

>- A02 = ?

The control hazard persists from line 1 to line 7 due to the always-taken branch. The data hazard between lines 7 and 8 for register t0 is resolved through forwarding. (Credit is given for identifying the source of the control hazard.)
This leads to the timing diagram below:
![image](https://hackmd.io/_uploads/BkW_FyWKp.png)

### 📝 Annotate
:::info
A01:
The beq x0, x0, L1 instruction is effectively an unconditional branch: the x0 register is always 0, so comparing x0 to itself is always equal.
After the beq instruction resolves as taken, the results of the following addi instructions (lines 2–6) are discarded, because they were fetched based on an incorrect not-taken prediction.
A02:
Forwarding: when an instruction needs the result of a previous instruction, forwarding routes the data directly from a later pipeline stage back to an earlier one, without waiting for the value to be fully written back to the register file.
Double pumping: the register file is written in the first half of a clock cycle and read in the second half, so a value being written back can be read by another instruction in the same cycle.
The control hazard still exists because the outcome of the beq instruction is only determined at the end of the Execute stage; until then, the instructions fetched down the wrong path must be annulled.
Since forwarding is used, the data hazard on t0 between lines 7 and 8 is resolved without stalls.
:::

## Problem B

In a RISC-V system employing segmentation-based virtual memory, process A is operating in user space. The segment base register is set at 0x20000, and the bound register is also 0x20000. Evaluate the virtual address 0x200. If it does not lead to an out-of-bounds violation and is within bounds, convert it to its corresponding physical address. __ B01 __

>- B01 = ?

0x20200
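As a quick sanity check on B01, base-and-bound translation fits in a few lines of C. This is only a sketch: the constants come from the problem statement, and the `translate` helper is a made-up name.

```c
#include <stdint.h>
#include <stdio.h>

// Constants taken from the problem statement.
#define SEG_BASE  0x20000u
#define SEG_BOUND 0x20000u

// Base-and-bound segmentation: fault if va >= bound,
// otherwise physical address = base + va.
static int translate(uint32_t va, uint32_t *pa) {
    if (va >= SEG_BOUND)
        return -1;              // out-of-bounds violation
    *pa = SEG_BASE + va;
    return 0;
}

int main(void) {
    uint32_t pa;
    if (translate(0x200, &pa) == 0)
        printf("0x%X\n", pa);   // prints 0x20200
    return 0;
}
```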
Analyze the code for process A and the minimal kernel-space handler provided. Identify which instructions in process A's code result in an exception. __ B02 __

Process A
```cpp
li a0, 0x40000
li a1, 0x20
lw a2, 0x0(a0)
li a3, 0x200
loop:
add a2, a2, a1
addi a1, a1, -1
addi a3, a3, 4
sw a2, 0(a3)
bnez a1, loop
addi a3, a3, -0x200
li a7, 0x13 # syscall number
mv a0, a3
ecall
```

minimal kernel-space handler
```cpp
minimal_handler:
addi a4, a4, 1
csrr a5, mepc
mret
addi a5, a5, 4
csrw mepc, a5
```

Note that ret is a pseudoinstruction that effectively functions as a jalr instruction, whereas mret is an actual instruction. These instructions serve different purposes: ret is typically employed for returning from a standard function, while mret is utilized for returning from a trap, such as an exception or interrupt, and it involves several additional effects. From the RISC-V privileged specification:

>The MRET, SRET, or URET instructions are used to return from traps in M-mode, S-mode, or U-mode respectively. When executing an xRET instruction, supposing xPP holds the value y, xIE is set to xPIE; the privilege mode is changed to y; xPIE is set to 1; and xPP is set to U (or M if user-mode is not supported).

>- B02 = ?

lw a2, 0x0(a0) AND ecall (Credit is given for writing either one.)

This program operates on a RISC-V processor equipped with complete bypassing and annulment capabilities, and exceptions are active. Assume that exceptions are processed instantly. Furthermore, it is assumed that the minimal_handler consistently returns control to the instruction immediately following the one that triggered the exception. The specific code within the minimal_handler need not be considered. Complete the pipeline diagram starting with the initial li instruction (disregard cells marked with '-' and '?'). It is unnecessary to display any bypasses used, if applicable. Consider mret similar to a branch instruction, with the next PC established in the EXE stage and branches presumed to be not taken.
![image](https://hackmd.io/_uploads/rJw8Y1ZY6.png)

>- B03 = ? addi
>- B04 = ? csrr
>- B05 = ? mret
>- B06 = ? addi
>- B07 = ? csrr
>- B08 = ? addi

### 📝 Annotate
:::info
B01: Because the virtual address 0x200 does not exceed the bound register's limit of 0x20000, it is valid. To convert the virtual address to a physical address, add it to the segment base address: 0x20000 + 0x200 = 0x20200.
B02: The instruction lw a2, 0x0(a0) attempts to load data from the address 0x40000, which is beyond the user-space bound of 0x20000, so it causes an exception. Furthermore, the ecall instruction triggers a system call: it executes in user mode and traps into kernel mode to handle the call, which also raises an exception.
B03:
![image](https://hackmd.io/_uploads/r1xJzC4WKp.png)
:::

## Problem C

Assuming a 16-byte, fully associative cache with 4-byte blocks, express hit rates as simplified fractions. Consider the following code:

```cpp
#include <stdint.h>
int main() {
    int32_t arr[6];
    for (int i = 0; i < 6; i++) {
        arr[i] += arr[0];
        arr[i] += arr[1];
        arr[i] += arr[2];
        arr[i] += arr[3];
    }
}
```

Determine the hit rate during the first iteration of the for loop when using a Least Recently Used (LRU) replacement policy. __ C01 __

>- C01 = ?

$\tfrac{2}{3}$

In each iteration, there are 12 accesses after breaking down the line arr[i] += arr[0] into arr[i] = arr[i] + arr[0], and similarly for the other lines. The initial accesses to arr[0], arr[1], arr[2], and arr[3] are compulsory misses. However, subsequent accesses result in hits. Therefore, there are 8 hits in 12 accesses, yielding a hit rate of $\tfrac{2}{3}$.
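The per-iteration hit counts in C01 and C02 can be reproduced with a tiny cache simulator. The sketch below (all names invented) models the 4-line fully associative LRU cache, one int32_t element per block, and the read-read-write pattern of each `arr[i] += arr[j]`:

```c
#include <stdio.h>

#define LINES 4                 /* 16 B cache / 4 B blocks = 4 lines */

static int tag[LINES];          /* cached element index */
static int age[LINES];          /* LRU age: 0 = most recently used */
static int hits, accesses;

static void touch(int way) {
    for (int i = 0; i < LINES; i++)
        age[i]++;
    age[way] = 0;
}

static void cache_access(int elem) {
    int victim = 0;
    accesses++;
    for (int i = 0; i < LINES; i++) {
        if (tag[i] == elem) { hits++; touch(i); return; }
        if (age[i] > age[victim]) victim = i;
    }
    tag[victim] = elem;         /* miss: fill, evicting the LRU line */
    touch(victim);
}

int main(void) {
    for (int i = 0; i < LINES; i++) { tag[i] = -1; age[i] = i; }
    for (int i = 0; i < 6; i++) {
        int a0 = accesses, h0 = hits;
        for (int j = 0; j < 4; j++) {   /* arr[i] += arr[j] */
            cache_access(i);            /* read arr[i]  */
            cache_access(j);            /* read arr[j]  */
            cache_access(i);            /* write arr[i] */
        }
        printf("iteration %d: %d/%d hits\n", i + 1, hits - h0, accesses - a0);
    }
    return 0;   /* prints 8/12 for the first iteration, 7/12 for the last */
}
```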
Determine the hit rate during the final iteration of the for loop when utilizing a LRU replacement policy. __ C02 __

>- C02 = ?

$\tfrac{7}{12}$

The crucial observation is that evictions from the cache begin once we access arr[4] and beyond. To simplify, let's examine the 5th iteration of the loop (i = 4). Here is the cache state at the beginning of this iteration (a larger LRU age means less recently used):

| Data | LRU age |
| - | - |
| arr[0] | 3 |
| arr[1] | 2 |
| arr[2] | 1 |
| arr[3] | 0 |

>Steps:
1. Read arr[4]. This evicts arr[0], resulting in a miss.
2. Read arr[0]. This evicts arr[1], resulting in a miss.
3. Write to arr[4]. This results in a hit.
4. Read arr[4]. This results in a hit.
5. Read arr[1]. This evicts arr[2], resulting in a miss.
6. Write to arr[4]. This results in a hit.
7. Read arr[4]. This results in a hit.
8. Read arr[2]. This evicts arr[3], resulting in a miss.
9. Write to arr[4]. This results in a hit.
10. Read arr[4]. This results in a hit.
11. Read arr[3]. This evicts arr[0], resulting in a miss.
12. Write to arr[4]. This results in a hit.

Counting our hits, we have 7 hits and 5 misses, leading to the hit rate above.

### 📝 Annotate
:::info
C01:
The access to arr[0] results in a compulsory miss (the cache is initially empty).
The access to arr[1] is a compulsory miss (it is not in the cache).
The access to arr[2] is a compulsory miss (it is not in the cache).
The access to arr[3] is a compulsory miss (it is not in the cache).
In total, there are 12 accesses, and the initial accesses to arr[0], arr[1], arr[2], and arr[3] result in 4 misses, so there are 8 hits.
8 / 12 = $\tfrac{2}{3}$
C02:
1. Read arr[4]. This evicts arr[0], resulting in a miss.
2. Read arr[0]. This evicts arr[1], resulting in a miss.
3. Write to arr[4]. This results in a hit.
4. Read arr[4]. This results in a hit.
5. Read arr[1]. This evicts arr[2], resulting in a miss.
6. Write to arr[4]. This results in a hit.
7. Read arr[4]. This results in a hit.
8. Read arr[2]. This evicts arr[3], resulting in a miss.
9. Write to arr[4]. This results in a hit.
10. Read arr[4]. This results in a hit.
11. Read arr[3]. This evicts arr[0], resulting in a miss.
12. Write to arr[4]. This results in a hit.
We have 7 hits and 5 misses.
7 / 12 = $\tfrac{7}{12}$
:::

## Problem D

Consider that "way prediction" is a method to enhance efficiency in set-associative caches. It involves forecasting the specific cache way that will likely be used for a given memory request. When the prediction is accurate, it eliminates the need to search through other cache ways. However, if the prediction is wrong, the process continues as it would in a standard set-associative cache access.
![image](https://hackmd.io/_uploads/SJn4KybFp.png)

Examine a 16 KiB two-way set associative cache that does not employ way prediction. With a hit time of 5 cycles, an 80% hit rate, and a Level 2 (L2) cache access time of 30 cycles, calculate the Average Memory Access Time (AMAT) for this cache. __ D01 __ cycles

>- D01 = ? 11

5 + 0.2 x 30 = 11

Evaluate an 8 KiB direct mapped cache with a hit time of 2 cycles, a 60% hit rate, and a Level 2 (L2) cache access time of 30 cycles. Calculate the AMAT for this cache. __ D02 __ cycles

>- D02 = ? 14

2 + 0.4 x 30 = 14

Now, evaluate a 16 KiB two-way set associative cache that incorporates way prediction. Calculate the AMAT for this cache configuration. __ D03 __ cycles

>- D03 = ? 9.2

(0.6 x 2 + 0.4 x 5) + 0.2 x 30 = 9.2
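Before the annotation, here is a small C check of the three AMAT computations; the `amat` helper is a made-up name, and the numbers are the ones given above:

```c
#include <stdio.h>

// AMAT = hit time + miss rate x miss penalty.  For the way-predicted
// cache we first average the two possible hit latencies.
static double amat(double hit_time, double miss_rate, double penalty) {
    return hit_time + miss_rate * penalty;
}

int main(void) {
    // D01: 2-way set associative, no way prediction.
    printf("D01: %.1f cycles\n", amat(5.0, 0.2, 30.0));   // 11.0
    // D02: direct mapped.
    printf("D02: %.1f cycles\n", amat(2.0, 0.4, 30.0));   // 14.0
    // D03: prediction correct 60% of the time (2-cycle hit),
    // wrong 40% of the time (5-cycle hit).
    double hit = 0.6 * 2.0 + 0.4 * 5.0;                   // 3.2 cycles
    printf("D03: %.1f cycles\n", amat(hit, 0.2, 30.0));   // 9.2
    return 0;
}
```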
### 📝 Annotate
:::info
D01: Hit Time is 5 cycles. The Hit Rate is 80%, so the Miss Rate is 1 − 0.8 = 0.2. The Level 2 (L2) cache access time, i.e., the Miss Penalty, is 30 cycles.
5 + 0.2 x 30 = 11
D02: Hit Time is 2 cycles. The Hit Rate is 60%, so the Miss Rate is 1 − 0.6 = 0.4. The Level 2 (L2) cache access time, i.e., the Miss Penalty, is 30 cycles.
2 + 0.4 x 30 = 14
D03: For a 16 KiB two-way set-associative cache with way prediction, we need the prediction parameters for an accurate calculation. When the prediction is correct, which happens 60% of the time, we use the shorter Hit Time of 2 cycles. When the prediction is incorrect, which occurs 40% of the time, we use the longer Hit Time of 5 cycles. The Miss Rate remains 20%, with a Miss Penalty of 30 cycles.
The average hit time is the weighted sum of the hit times for correct and incorrect predictions: (0.6 × 2) + (0.4 × 5) = 3.2
The Average Memory Access Time is the average hit time plus the product of the Miss Rate and the Miss Penalty: 3.2 + (0.2 × 30) = 9.2
:::

## Problem E

We want to design a Finite State Machine (FSM) where the output at time step N is 0 for the initial two time steps and then matches the input from time step N - 2 for subsequent steps. For instance, if the input sequence to this FSM is 0b01 1011 1011 1000 1001, the corresponding output should be 0b00 0110 1110 1110 0010. To extend beyond a 2-cycle delay, consider an FSM that delays the input by 12 cycles, outputting 0 for the initial 12 cycles. For example, if the input to this updated FSM is 0b01 1011 1011 1000 1001, the output should be 0b00 0000 0000 0001 1011.

Determine the minimum number of states required for an FSM to achieve this function. Provide your answer as a precise whole number. __ E01 __

>- E01 = ? 4096

To get this answer, notice that to reproduce something from 12 inputs ago, the FSM must keep a record of the last 12 inputs. In other words, we must "encode" 12 bits of memory into our FSM states, requiring $2^{12}$ = 4096 states.

Determine the shortest possible clock period for a circuit that implements this function, assuming the register is expanded to sufficiently many bits without affecting clk-to-q and setup times. Our circuit should be optimized for the minimum clock period, considering the given component delays. __ E02 __ ps
![image](https://hackmd.io/_uploads/HkabK1bta.png)
- tAND = 12ps
- tOR = 15ps
- tNOT = 4ps
- tXOR = 31ps
- tRegister-clk-to-q = 10ps
- tRegister-setup = 15ps
- tBit-splitter = 0ps
- tWire = 0ps

>- E02 = ? 25

As the circuit's design for extending to 12 bits mirrors the above solution (with a chain from Input → Reg[0], Reg[0] → Reg[1], …, Reg[11] → Output), the minimum combinational path delay is 0 ps (either from Input to D through a splitter, or from Q to Output through a splitter). Therefore, the minimum clock cycle time equals the sum of tclk-to-q and tsetup, which is 25 ps.
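The 4096-state answer is just a 12-bit shift register in disguise. A minimal C sketch of that FSM (the type and function names are invented) reproduces the example sequence:

```c
#include <stdint.h>
#include <stdio.h>

// 12-cycle delay FSM: the state is the last 12 input bits held in a
// shift register; the output each cycle is the bit shifted out.
// 12 bits of state <=> 2^12 = 4096 distinct states.
typedef struct { uint16_t shift; } fsm_t;          // low 12 bits used

static int fsm_step(fsm_t *s, int input) {
    int output = (s->shift >> 11) & 1;             // bit from 12 cycles ago
    s->shift = (uint16_t)(((s->shift << 1) | (input & 1)) & 0xFFF);
    return output;
}

int main(void) {
    // Input sequence from the problem, MSB (earliest bit) first.
    const char *in = "011011101110001001";
    fsm_t s = { 0 };                               // initial state: all zeros
    for (const char *p = in; *p; p++)
        putchar('0' + fsm_step(&s, *p - '0'));
    putchar('\n');                                 // prints 000000000000011011
    return 0;
}
```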
### 📝 Annotate
:::info
E01: When designing an FSM that introduces a delay of 12 cycles, we need to maintain a history of the past 12 inputs. Since each input can be either 0 or 1, the total number of possible combinations of 12 inputs is $2^{12}$. This means at least 4096 distinct states are required to store all possible input histories, ensuring that the output at any given moment correctly reflects the input from 12 cycles earlier.
E02: Extend the delay of the input signal to 12 bits, with the signal passed sequentially from one register to the next, labeled Reg[0] through Reg[11]. In this setup, the signal is transferred from the input to the D (data) input of the first register (Reg[0]). Then the signal goes from the Q (output) of the first register to the D input of the second register (Reg[1]), and so on, until it reaches Reg[11]. Finally, the signal moves from the Q output of Reg[11] to the output.
The minimum combinational path delay is 0 ps.
tRegister-clk-to-q = 10ps
tRegister-setup = 15ps
10 + 15 = 25 ps
:::

## Problem F

Consider a memory address space where addresses are chunked in 64-byte segments. For instance, address 1024 is 64 bytes apart from address 1025, but only 1 byte away from the "address" $1024 + \tfrac{1}{64}$. In this system, standard integer binary addressing is ineffective, so consider using floating-point representation based on the IEEE-754 standard, which includes 1 sign bit. You will need to determine the appropriate number of exponent and mantissa bits.

Given a memory size of 4KiB and chunk addresses labeled as 0, 1, 2, etc., identify the minimum number of exponent bits needed in the floating-point memory address to access every byte, assuming the use of a standard bias. __ F01 __

>- F01 = ? 4

With a memory size of 4KiB segmented into 64B chunks, there are a total of 64 chunks, so the chunk addresses 0 through 63 must be representable. With 4 exponent bits, the highest possible exponent field value (excluding the all-ones pattern reserved for infinity/NaN) is $2^4 - 2 = 14$. Once the standard bias of −7 is applied, this gives a maximum exponent of 7, which is enough to cover values up to 63 (representing 63 requires an exponent of 5). Conversely, with 3 exponent bits, the largest achievable exponent after applying the standard bias of −3 is only 3, which is inadequate for representing the value 63.

Determine the least number of mantissa bits needed in our floating point memory address system to precisely address each byte. __ F02 __

>- F02 = ? 11

The largest memory address we need to represent is 0b11 1111.111111 ($63 + \tfrac{63}{64}$). To represent this value, we need 11 bits of mantissa ($0b1.11111111111 \times 2^5$).

### 📝 Annotate
:::info
F01: The IEEE-754 floating-point standard uses a bias in the exponent. The bias is calculated as $2^{(k-1)} - 1$, where k is the number of exponent bits.
Comparing with 3 exponent bits: the largest exponent value achievable after applying the standard bias (−3) is only 3, so the largest representable value is below $2^4 = 16$. This is insufficient to represent the value 63.
For 4 exponent bits, the bias is $2^{(4-1)} - 1 = 7$. The raw exponent field spans 0 to 15; after subtracting the bias, this corresponds to exponents from −7 to +8, but the all-ones field is reserved for infinity/NaN, so the largest usable exponent is $2^4 - 2 - 7 = 7$. Since representing 63 only requires an exponent of 5, four bits suffice.
F02: The largest memory address we need to represent is $63 + \frac{63}{64}$, which in binary is 0b11 1111.111111. To address each byte precisely, we must represent the full range of byte offsets within each chunk, so all the bits after the binary point must be kept. Normalizing gives $0b1.11111111111 \times 2^5$: the integer part contributes 5 mantissa bits after the implicit leading 1, and the byte offset within a chunk (in units of $\tfrac{1}{64}$) contributes 6 fractional bits, for a total of 11 mantissa bits.
:::
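F01 and F02 together give a 1 + 4 + 11 = 16-bit address format. The following C sketch packs a chunk/byte pair into that format; the helper name and encoding details are assumptions (bias 7, GCC/Clang `__builtin_clz`), not part of the original problem:

```c
#include <stdint.h>
#include <stdio.h>

// 16-bit format from F01/F02: sign (always 0 here), 4-bit exponent
// (bias 7), 11-bit mantissa.  Addresses are measured in chunks, with
// the byte offset expressed in 64ths of a chunk.
static uint16_t encode_addr(unsigned chunk, unsigned offset) {
    uint32_t fx = chunk * 64u + offset;        // fixed point, 1/64 units
    if (fx == 0) return 0;                     // all-zero encoding for 0
    int msb = 31 - __builtin_clz(fx);          // position of the leading 1
    int exp = msb - 6;                         // binary point after bit 6
    uint32_t frac = fx & ((1u << msb) - 1);    // drop the implicit 1
    uint16_t mant = (uint16_t)(frac << (11 - msb)); // left-align to 11 bits
    return (uint16_t)(((unsigned)(exp + 7) << 11) | mant);
}

int main(void) {
    // Largest address: chunk 63, byte 63 -> 1.11111111111 x 2^5.
    printf("0x%04X\n", encode_addr(63, 63));   // prints 0x67FF
    return 0;
}
```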
## Problem G

Suppose a complex task is divided into four segments: PartA, PartB, PartC, and PartD. To efficiently tally their populations, each segment utilizes one core of a quad-core processor system, where each core has its own cache. These caches maintain coherence through a snoopy-based, write-invalidate MSI (Modified, Shared, Invalid) protocol, as illustrated. The goal is to examine the performance of this existing system to determine if improvements can be achieved.
![image](https://hackmd.io/_uploads/rJm8wJZKp.png)

Each core starts by accessing and modifying its local copy of the population data, stored under a suitably named label (like pop_A for the population of PartA). This process is demonstrated in the pseudocode below:

```
call pop_count     # This function counts the additional population in the area.
lw a1, <pop_label> # This loads the earlier population count.
add a1, a1, a0     # This performs a calculation to sum up the total.
sw a1, <pop_label> # This saves the updated population count.
```

The sequence of these memory accesses is as follows:
![image](https://hackmd.io/_uploads/By5KD1WYp.png)

We aim to closely examine the activities on Core A specifically. Because these memory accesses are independent (pop_A, pop_B, pop_C, and pop_D are distinct memory locations), it suffices to focus solely on the memory interactions on Core A to understand its behavior.

Please complete the table below by entering information in columns G01 and G02. For each entry, detail the bus transaction(s) resulting from each access, and the state of the cache line for pop_A following each access.
![image](https://hackmd.io/_uploads/rkzsvkbYa.png)

>- G01 = ? S
>- G02 = ? M

We are curious whether adopting an MESI protocol, as illustrated, could enhance the performance of this processor. Suppose these four cores were to use a snoopy-based, write-invalidate MESI protocol; what would be the sequence of bus transactions and the cache line states for pop_A from Core A? Please complete the table below by entering information in columns G03 and G04.
![image](https://hackmd.io/_uploads/S1WZ_J-t6.png)

>- G03 = ? E
>- G04 = ? M

Each core proceeds to modify the overall census count, stored at a shared label named sum, as demonstrated in the following pseudocode:

```
lw a2, sum     # retrieves the current global census count.
add a2, a2, a0 # adds to the population count.
sw a2, sum     # saves the new population count.
```

The sequence of these memory accesses is as follows:
![image](https://hackmd.io/_uploads/BJj4dy-YT.png)

Complete the table provided by detailing the bus transactions resulting from each memory access and the corresponding states of the cache line for sum after each access. An additional copy of this table is available at the end of the exam. Remember, this processor utilizes an MSI protocol for cache coherence.
![image](https://hackmd.io/_uploads/rkHvuy-Ya.png)

>- G05 = ? S
>- G06 = ? M
>- G07 = ? S
>- G08 = ? S
>- G09 = ? S
>- G10 = ? M
>- G11 = ? S
>- G12 = ? M
>- G13 = ? M
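The S/M answers above follow mechanically from the MSI transition rules. Below is a toy C state machine for one cache's copy of a line; the event encoding and function names are invented for illustration:

```c
#include <stdio.h>

typedef enum { I, S, M } msi_t;
static const char *name[] = { "I", "S", "M" };

/* This core loads ('r') or stores ('w') the line. */
static msi_t cpu_access(msi_t st, char op) {
    if (op == 'w') return M;     /* store: BusRdX, invalidating other copies */
    return st == I ? S : st;     /* load: BusRd on a miss, otherwise quiet */
}

/* Another core's bus transaction is snooped: BusRd ('r') or BusRdX ('w'). */
static msi_t snoop(msi_t st, char op) {
    if (op == 'w') return I;     /* a remote writer invalidates our copy */
    return st == M ? S : st;     /* a remote reader downgrades M to S */
}

int main(void) {
    msi_t line = I;
    line = cpu_access(line, 'r'); printf("local lw  -> %s\n", name[line]); /* S */
    line = cpu_access(line, 'w'); printf("local sw  -> %s\n", name[line]); /* M */
    line = snoop(line, 'r');      printf("remote lw -> %s\n", name[line]); /* S */
    line = snoop(line, 'w');      printf("remote sw -> %s\n", name[line]); /* I */
    return 0;
}
```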
We identify an issue with the current census-counting implementation: two processors might simultaneously read sum, each add their own population count to it without seeing the other's addition, and then write back to sum, leading to an inaccurate total. To ensure that the final value of sum accurately reflects the aggregate of the a0 values from each processor, where should we integrate semaphores? These semaphores should minimize blocking of instructions. Define and initialize any necessary semaphores, then add WAIT and SIGNAL commands in the pseudocode below to guarantee accurate calculation of sum. It is not required to include ecall instructions in this modification.

Semaphore Initialization:

```
sum_sem = 1
```

Pseudocode:

```
call pop_count     # Z1: counts the new population of the area
lw a1, <pop_label> # Z2: loads the previous population
add a1, a1, a0     # Z3: calculates a sum
sw a1, <pop_label> # Z4: stores the updated population
lw a2, sum         # Z5: gets the current global count
add a2, a2, a0     # Z6: updates the population
sw a2, sum         # Z7: stores the updated population
```

Specify the precise instruction, labeled Z1 to Z7, after which the wait(sum_sem) should be inserted. This is to guarantee exclusive access to sum at the crucial update moments, thereby ensuring data accuracy.

>- G14 = ? Z4 (The full instruction, sw a1, <pop_label>, is also an acceptable answer.)

Specify the precise instruction, labeled Z1 to Z7, after which the signal(sum_sem) should be inserted. This is to guarantee exclusive access to sum at the crucial update moments, thereby ensuring data accuracy.

>- G15 = ? Z7 (The full instruction, sw a2, sum, is also an acceptable answer.)

### 📝 Annotate
:::info
G01: The initial load (lw) of pop_A brings the cache line into the Shared (S) state because it is a read operation.
G02: The subsequent store (sw) of pop_A transitions the cache line to the Modified (M) state, as it is a write operation.
G03: The initial load (lw) of pop_A under the MESI protocol brings the cache line into the Exclusive (E) state because Core A is the only core accessing that line at the time.
G04: The subsequent store (sw) of pop_A transitions the cache line to the Modified (M) state.
G05: The cache line for sum enters the Shared (S) state due to a read operation.
G06: The cache line transitions to the Modified (M) state due to a write.
G07 ~ G09: Each core reads sum, and the cache line remains in the Shared (S) state.
G10: Core C writes to sum, moving the cache line to Modified (M).
G11: Core D reads sum, bringing it back to the Shared (S) state.
G12: Core B writes to sum, setting the cache line to Modified (M).
G13: When Core D attempts to write sum, it must first perform a bus transaction to invalidate the other caches, and can then write to sum, putting the cache line back into Modified (M).
G14: The semaphore wait operation should be inserted after Z4, the instruction sw a1, <pop_label>, because the critical section begins immediately afterwards, when the core starts updating the global sum. We want to acquire the semaphore before reading the current value of sum so that the addition to sum is atomic.
G15: The semaphore signal operation should be inserted after Z7, the instruction sw a2, sum. This is the end of the critical section, and releasing the semaphore at this point allows other cores to enter their critical sections and update sum.
:::
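For comparison, here is the same fix written in C with POSIX semaphores: only the read-modify-write of the shared counter sits between wait and signal, mirroring wait after Z4 and signal after Z7. pop_count and the thread setup are stand-ins for the per-core work:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t sum_sem;          // initialized to 1, like sum_sem = 1
static int sum = 0;            // shared global census count

static int pop_count(int id) { return 100 + id; }  // stand-in work

static void *worker(void *arg) {
    int a0 = pop_count((int)(long)arg);  // private work: no lock needed
    sem_wait(&sum_sem);                  // wait(sum_sem): enter critical section
    sum += a0;                           // the lw/add/sw of sum, now atomic
    sem_post(&sum_sem);                  // signal(sum_sem): leave it
    return NULL;
}

int main(void) {
    pthread_t t[4];
    sem_init(&sum_sem, 0, 1);
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("sum = %d\n", sum);           // always 406, regardless of timing
    sem_destroy(&sum_sem);
    return 0;
}
```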
## Problem H

Assuming a processor with 38-bit virtual addresses, 26-bit physical addresses, and a 256-byte page size, the page table employs an LRU strategy for managing pages and handles missing pages through a page fault handler. This processor has a 4-entry, direct-mapped TLB (Translation Lookaside Buffer) and uses VPN (Virtual Page Number) bits 1:0 for indexing into the TLB. Refer to the provided diagram, which shows the TLB contents before executing two instructions. The diagram includes both tag and index bits of the VPN for clarity, although only tag bits are typically stored in the TLB. After executing each of the two instructions, update the TLB contents accordingly. Note that the TLB's dirty bits are synchronized with those in the page table. Remember: the TLB stores page table entries, which may be altered by the page fault handler. For correct operation, most processors use a special instruction to invalidate a TLB entry, and the page fault handler is assumed to use this instruction. Also, assume all page fault handler accesses go through physical memory and bypass the TLB.

Initial TLB contents

| Index | VPN (Tag+Index) | V | D | PPN |
| ----- | --------------- | - | - | --- |
| 0 | 0x200 | 1 | 1 | 0xBC |
| 1 | 0x1 | 1 | 0 | 0x123 |
| 2 | 0x36 | 1 | 0 | 0x412 |
| 3 | 0x23 | 1 | 1 | 0x99 |

Update the contents of the TLB following the execution of the instruction lw x12, 0x180(x0) with the PC set at 0x2FC.

| Index | VPN (Tag+Index) | V | D | PPN |
| ----- | --------------- | - | - | --- |
| 0 | 0x200 | 1 | 1 | 0xBC |
| 1 | 0x1 | 1 | 0 | 0x123 |
| 2 | H01 | H02 | H03 | H04 |
| 3 | 0x23 | 1 | 1 | 0x99 |

>- H01 = ? 0x36
>- H02 = ? 1
>- H03 = ? 0
>- H04 = ? 0x412

Update the contents of the TLB following the execution of the instruction sw x12, 0x60C(x0) with the PC set at 0x300.

| Index | VPN (Tag+Index) | V | D | PPN |
| ----- | --------------- | - | - | --- |
| 0 | 0x200 | 1 | 1 | 0xBC |
| 1 | 0x1 | 1 | 0 | 0x123 |
| 2 | H05 | 1 | 1 | H06 |
| 3 | H07 | 1 | 0 | H08 |

>- H05 = ? 0x6
>- H06 = ? 0x651
>- H07 = ? 0x3
>- H08 = ? 0xABC

### 📝 Annotate
:::info
Updating the TLB after lw x12, 0x180(x0) (PC = 0x2FC):
The effective address of lw x12, 0x180(x0) is x0 + 0x180 = 0x180; the base register is x0, so the PC plays no part in the address calculation. With 256-byte pages, the VPN is 0x180 >> 8 = 0x1, and VPN bits 1:0 give TLB index 1. The entry at index 1 already holds VPN 0x1, so the access is a TLB hit (PPN 0x123), and a load leaves the dirty bit at 0. The entry at index 2 is therefore untouched (the fetch of the instruction itself is assumed here not to disturb the TLB):
H01 = 0x36, H02 = 1, H03 = 0, H04 = 0x412: the entry simply keeps its previous contents.
Updating the TLB after sw x12, 0x60C(x0) (PC = 0x300):
The store's effective address is x0 + 0x60C = 0x60C, so the VPN is 0x6 and VPN bits 1:0 give TLB index 2. The entry there holds VPN 0x36, so this is a TLB miss; the entry is refilled from the page table (via the page fault handler if the page is missing).
H05 = 0x6: the new VPN; its low two bits select index 2. Because the access is a store, the dirty bit is set to 1 in both the TLB and the page table.
H06 = 0x651: the physical page number supplied by the page table.
The instruction fetch at PC 0x300 translates VPN 0x3, which maps to index 3 and also misses (that entry held VPN 0x23), so it is refilled as well. A fetch is a read, so its dirty bit stays 0.
H07 = 0x3: the fetched page's VPN.
H08 = 0xABC: the physical page number supplied by the page table.
:::
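A small C model of this direct-mapped TLB (struct layout and names invented) shows why the lw hits at index 1 while the sw's data access misses at index 2:

```c
#include <stdint.h>
#include <stdio.h>

// 4-entry direct-mapped TLB, 256-byte pages: index = VPN bits 1:0.
typedef struct { uint64_t vpn; int v, d; uint32_t ppn; } tlb_entry_t;

static tlb_entry_t tlb[4] = {      // mirrors the initial table
    { 0x200, 1, 1, 0xBC  },        // index 0
    { 0x1,   1, 0, 0x123 },        // index 1
    { 0x36,  1, 0, 0x412 },        // index 2
    { 0x23,  1, 1, 0x99  },        // index 3
};

// Returns 1 on a hit and fills *pa; on a miss the handler would refill.
static int tlb_lookup(uint64_t va, int is_store, uint32_t *pa) {
    uint64_t vpn = va >> 8;        // 256 B pages -> 8 offset bits
    unsigned idx = vpn & 0x3;      // VPN bits 1:0 index the TLB
    tlb_entry_t *e = &tlb[idx];
    if (!e->v || e->vpn != vpn)
        return 0;                  // miss: walk the page table / fault
    if (is_store) e->d = 1;        // stores set the dirty bit
    *pa = (e->ppn << 8) | (uint32_t)(va & 0xFF);
    return 1;
}

int main(void) {
    uint32_t pa;
    if (tlb_lookup(0x180, 0, &pa))  // lw: VPN 0x1 -> index 1 -> hit
        printf("lw VA 0x180 hits, PA = 0x%05X\n", pa);   // PA 0x12380
    if (!tlb_lookup(0x60C, 1, &pa)) // sw: VPN 0x6 -> index 2 -> miss
        printf("sw VA 0x60C misses (index 2 holds VPN 0x36)\n");
    return 0;
}
```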
## Problem I

RISC-V includes a floating-point extension (aka "F") supporting IEEE-754 single-precision floats, featuring 1 sign bit, 8 exponent bits, and 23 mantissa bits. This extension introduces new instructions and 32 floating-point registers (f0 to f31), each holding 32 bits.

Key Instructions:
- fadd.s rd, rs1, rs2 : Performs floating-point addition, setting rd = rs1 + rs2.
- fsub.s rd, rs1, rs2 : Conducts floating-point subtraction, with rd = rs1 - rs2.
- fmul.s rd, rs1, rs2 : Executes floating-point multiplication, resulting in rd = rs1 * rs2.
- fdiv.s rd, rs1, rs2 : Carries out floating-point division, assigning rd = rs1 / rs2.
- fmv.w.x rd, rs1 : Moves the bit pattern in integer register rs1 into floating-point register rd, unchanged; no numeric conversion is performed.

Note: The four arithmetic floating-point instructions operate solely on floating-point registers. Floating-point registers are incompatible with base instructions. For example, fadd.s f0, t1, t2 and addi f0, t1, 5 are invalid.

The instruction fmv.w.x transfers a bitstring from an integer register rs1 to a floating-point register rd. For instance, if t0 holds the hexadecimal value 0x40800000 and fmv.w.x f0, t0 is executed, then f0 will represent the floating-point value 4.0.

Convert the value 3 into its hexadecimal representation using IEEE-754 single-precision floating point format. __ I01 __

>- I01 = ? 0x40400000

First write 3 in binary scientific notation: 3 = 0b11 = $0b1.1 \times 2^1$. The sign bit is 0b0 since the number is positive. The exponent is 1. To encode it in biased notation, we subtract the bias (for an 8-bit exponent, use the default bias of $-(2^{8-1} - 1) = -127$) to get 1 − (−127) = 128. Then encode this in 8-bit unsigned binary to get 0b1000 0000. The mantissa is 0b1 (dropping the implicit 1). Adding trailing zeros until it is 23 bits long gives 0b100 0000 0000 0000 0000 0000.
Putting sign, exponent, and mantissa together: 0b0 1000 0000 100 0000 0000 0000 0000 0000
Regrouping bits: 0b0100 0000 0100 0000 0000 0000 0000 0000
Converting to hex: 0x4040 0000

Let's create a sequence of instructions to load a floating-point value as close as possible to 1.33333… into register f1. Restrict your instructions to the five floating-point operations previously mentioned. Each line should contain only one RISC-V instruction or pseudoinstruction.

```cpp
li t4, 0x40800000
li t3, 0x40400000
__ I02 __
__ I03 __
__ I04 __
```

>- I02 = ? fmv.w.x f4, t4
>- I03 = ? fmv.w.x f3, t3
>- I04 = ? fdiv.s f1, f4, f3

Put the bit patterns for 4 (in floating-point) and 3 (in floating-point) into two integer registers. For clarity, we used t4 to hold 4 and t3 to hold 3 in our solution. Then, move these bit patterns into floating-point registers. For clarity, we used f4 to hold 4 and f3 to hold 3 in our solution. Finally, use floating-point division to compute 4/3 and put the result in f1.

Note that something like li t4, 4 and li t3, 3 would not work; these would put the bit patterns 0x0000 0004 and 0x0000 0003 into the integer registers, and when those bit patterns are moved into the floating-point registers and interpreted as floats, they do not represent the numbers 4 and 3.

Note that we have to use li t4, 0x40800000 or lui t4, 0x40800 to change the upper 20 bits of the register. Using an instruction like addi won't work because I-type immediates are only 12 bits long.

Some alternate solutions do exist, such as computing 1 + 1/3, or somehow computing that 0x3FAA AAAB is the floating-point number closest to 4/3 (how you'd compute this is out of scope) and hard-coding this number into f1.
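The C analogue of this li + fmv.w.x idiom is reinterpreting an integer bit pattern as a float; memcpy is the portable way to move raw bits between types. A sketch:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Like fmv.w.x: copy raw bits from an integer into a float, unchanged.
static float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    float four  = bits_to_float(0x40800000);   // li t4, ... + fmv.w.x f4, t4
    float three = bits_to_float(0x40400000);   // li t3, ... + fmv.w.x f3, t3
    float result = four / three;               // fdiv.s f1, f4, f3
    uint32_t out;
    memcpy(&out, &result, sizeof out);
    printf("%f = 0x%08X\n", result, out);      // 1.333333 = 0x3FAAAAAB
    return 0;
}
```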
### 📝 Annotate
:::info
I01: Binary scientific notation: first, we write the number 3 in binary scientific notation, which is 3 = 1.1 × 2^1.
Sign bit: since the number is positive, the sign bit is 0.
Exponent: the exponent is 1. To encode it in biased notation, we use the default bias of −127 for an 8-bit exponent, calculating 1 − (−127) = 128. We then encode this as an 8-bit unsigned binary number, resulting in 1000 0000.
Mantissa: we pad the mantissa to make it 23 bits long, resulting in 100 0000 0000 0000 0000 0000.
Combining the sign bit, exponent, and mantissa: we get 0 1000 0000 100 0000 0000 0000 0000 0000.
Converting to hexadecimal: results in 0x40400000.
I02: Use the fmv.w.x instruction to copy the bit pattern from the integer register t4 to the floating-point register f4, unchanged. Here, t4 holds the bit pattern 0x40800000, which f4 interprets as the floating-point number 4.0.
I03: Similarly, move the bit pattern from the integer register t3 to the floating-point register f3. Here, t3 holds the bit pattern 0x40400000, representing the floating-point number 3.0.
I04: Use the fdiv.s instruction to perform floating-point division, dividing f4 by f3 and storing the result in f1, thereby obtaining a floating-point value as close as possible to 1.33333.
:::

## Problem J

Examine the circuit provided. The AND and OR gates each have a delay of 8 ps, while the NOT gates have a delay of 4 ps. Additionally, all registers in the circuit are subject to a setup time constraint of 6 ps and a clock-to-Q delay of 3 ps. For the purpose of this analysis, assume that all connecting wires are ideal, meaning they have no delay.
![image](https://hackmd.io/_uploads/By1ZyW-Y6.png)

What is the smallest combinational delay of all paths in this circuit? __ J01 __ ps

>- J01 = ? 12

The shortest path between two registers passes through a NOT gate followed by an AND gate, for a total delay of 4 ps + 8 ps = 12 ps.

What is the maximum possible hold time constraint for registers to function properly in this circuit? __ J02 __ ps

>- J02 = ? 15

Maximum hold time = smallest combinational delay + clock-to-Q delay = 12 + 3 = 15 ps

What is the maximum allowable clock frequency for this circuit to function properly? __ J03 __ GHz

>- J03 = ? 40

Shortest clock period = clock-to-Q delay + largest combinational delay + setup time = 3 + 16 + 6 = 25 ps. 1 / (25 ps) = 40 GHz

### 📝 Annotate
:::info
J01:
![image](https://hackmd.io/_uploads/SkB3vbbtT.png)
4 + 8 = 12
J02: clock-to-Q delay: 3 ps
12 + 3 = 15
J03:
![image](https://hackmd.io/_uploads/ryKE_ZZFp.png)
clock-to-Q delay: 3 ps
setup time: 6 ps
8 + 8 + 3 + 6 = 25
1 / (25 ps) = 40 GHz
:::
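The timing arithmetic above can be double-checked mechanically. In the sketch below, the path gate lists are read off the figure (shortest path NOT → AND; the longest path, per the annotation, is two 8 ps gates):

```c
#include <stdio.h>

int main(void) {
    const double t_not = 4, t_and = 8, t_or = 8;   // gate delays, ps
    const double t_clk_to_q = 3, t_setup = 6;      // register parameters, ps

    double shortest = t_not + t_and;               // J01: 12 ps
    double max_hold = shortest + t_clk_to_q;       // J02: 15 ps
    double longest  = t_and + t_or;                // 16 ps, from the figure
    double period   = t_clk_to_q + longest + t_setup;   // J03: 25 ps

    printf("J01 = %g ps\nJ02 = %g ps\n", shortest, max_hold);
    printf("J03: period %g ps -> %g GHz\n", period, 1000.0 / period);  // 40 GHz
    return 0;
}
```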
## Problem K

In various applications, it is often necessary not only to identify the maximum value in an array but also to determine the index of this maximum element, known as the argmax. To do this efficiently, we can employ data-level parallelism. The function arg_max below accepts an array arr and its length n, returning the index of the highest value. In cases where multiple indices hold the same maximum value, the function returns the earliest of these indices. Utilize the provided pseudo SIMD (Single Instruction, Multiple Data) intrinsics to complete the sections labeled K01, K02, and K03 so that the function performs as intended. These SIMD intrinsics work with vec structs, which are similar to Intel's __m128i structs, containing 4 packed integers.

SIMD Instructions:
- vec sum_epi32(vec a, vec b) : Returns a + b.
- vec and_epi32(vec a, vec b) : Returns a & b.
- vec set_epi32(int a) : Returns a SIMD vector with all entries set to a.
- vec load_epi32(int *a) : Returns a SIMD vector with entries a[0], a[1], a[2], and a[3].
- int reducemax_epi32(vec a) : Returns the maximum int value in vector a.
- vec maskeq_epi32(vec a, int b) : Returns a mask vector with 0xFFFFFFFF at indices where a equals b, and 0 otherwise.
- int firstv_epi32(vec a) : Returns the index of the first entry with its lowest bit set to 1.

```
int arg_max(int *arr, int n) {
    int curr, index = 0, running_max = -2147483648; /* -2^31 */
    for (int i = 0; i < n / 4 * 4; i += 4) {
        vec tmp = load_epi32(arr + i);
        curr = reducemax_epi32(tmp);
        if (curr > running_max) {
            tmp = K01 /* Fill in your code here */;
            index = i + K02 /* Fill in your code here */;
            running_max = curr;
        }
    }
    for (int i = n / 4 * 4; i < n; i += 1) {
        if (K03 /* Fill in your code here */) {
            running_max = arr[i];
            index = i;
        }
    }
    return index;
}
```

>- K01 = ? maskeq_epi32(tmp, curr)
>- K02 = ? firstv_epi32(tmp)
>- K03 = ? arr[i] > running_max (or an equivalent form)

### 📝 Annotate
:::info
K01: Use maskeq_epi32(tmp, curr) to create a mask vector, which sets the positions in tmp equal to the current maximum value curr to 0xFFFFFFFF and the rest to 0. This mask is then used to find the specific index of this maximum value.
K02: firstv_epi32(tmp) finds the index of the first occurrence of the maximum value curr in tmp. Since firstv_epi32 returns the index of the first entry with its least significant bit set to 1, it returns the index of the first 0xFFFFFFFF entry in the mask vector, which corresponds to the index of the maximum value in tmp.
K03: In the final loop, we check whether arr[i] > running_max. This ensures that the maximum value and its corresponding index are found and updated while processing the remaining tail elements of the array that cannot be loaded four at a time by the SIMD instructions.
:::
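For testing the filled-in function, a scalar reference implementation is handy. Note how the strict > comparison preserves the earliest index of the maximum, matching the SIMD version's tie-breaking:

```c
#include <stdio.h>

// Scalar reference for arg_max: linear scan, earliest max index wins.
static int arg_max_scalar(const int *arr, int n) {
    int index = 0;
    for (int i = 1; i < n; i++)
        if (arr[i] > arr[index])   // '>' (not '>=') keeps the first maximum
            index = i;
    return index;
}

int main(void) {
    int a[] = { 3, 9, 2, 9, 5, 1, 9 };
    printf("%d\n", arg_max_scalar(a, 7));   // prints 1, the earliest 9
    return 0;
}
```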