--- tags: computer-arch --- # Quiz7 of Computer Architecture (2023 Fall) :::info :information_source: General Information * You are allowed to read **[lecture materials](http://wiki.csie.ncku.edu.tw/arch/schedule)**. * That is, an open book exam. * We are using the honor system during this quiz, and would like you to accept the following: 1. You will not share the quiz with anyone. 2. You will not discuss the material on the quiz with anyone until after solutions are released. * Each answer has $3.5$ points. * You must answer in valid numeric representation and/or English alphabet except your formal name. * Message **[the instructor via Facebook](https://www.facebook.com/JservFans)** if you found something wrong. * :timer_clock: 09:10 ~ 10:15AM on Jan 2, 2024 ::: ## Problem `A` Consider that we are using the typical 5-stage pipelined datapath. Also, assume that the CPU always predicts branches as not taken. Review the following RISC-V code snippet: ```c= beq x0, x0, L1 addi x0, x0, 3 addi x0, x0, 1 addi x0, x0, 4 addi x0, x0, 1 addi x0, x0, 5 L1: addi t0, x0, 9 addi t1, t0, 2 xori t1, x0, 6 ``` Assume the Instruction Fetch (IF) stage for the `beq` instruction at line 1 occurs in cycle 1. Assuming that all pipeline hazards are managed by inserting stalls (without using techniques such as double pumping or forwarding paths), identify in which cycle the Write Back (WB) stage for the `xori` instruction at the last will be executed. For each pipeline hazard identified, please specify the nature of the hazard, the instructions it involves, and the number of stall cycles required. This section will only be reviewed if a regrade is requested. __ A01 __ > * A01 = ? If double pumping and all forwarding paths are utilized, in which cycle will the Write Back (WB) stage for the `xori` instruction on the last line occur? Please identify each hazard, the relevant instructions, and the number of stall cycles needed. Note: This section will be reviewed only if a regrade is requested. __ A02 __ > * A02 = ? --- ## Problem `B` In a RISC-V system employing segmentation-based virtual memory, process A is operating in user space. The segment base register is set at `0x20000`, and the bound register is also `0x20000`. Evaluate the virtual address `0x200`. If it does not lead to an out-of-bounds violation and is within bounds, convert it to its corresponding physical address. __ B01 __ > * B01 = ? Analyze the code for process A and the minimal kernel-space handler provided. Identify which instructions in process A's code result in an exception. __ B02 __ - [ ] Process A ```c li a0, 0x40000 li a1, 0x20 lw a2, 0x0(a0) li a3, 0x200 loop: add a2, a2, a1 addi a1, a1, -1 addi a3, a3, 4 sw a2, 0(a3) bnez a1, loop addi a3, a3, -0x200 li a7, 0x13 # syscall number mv a0, a3 ecall ``` - [ ] minimal kernel-space handler ```c minimal_handler: addi a4, a4, 1 csrr a5, mepc mret addi a5, a5, 4 csrw mepc, a5 ``` Note that `ret` is a pseudoinstruction that effectively functions as a `jalr` instruction, whereas `mret` is an actual instruction. These instructions serve different purposes: `ret` is typically employed for returning from a standard function, while `mret` is utilized for returning from a trap, such as an exception or interruption, and it involves several additional effects. From RISC-V privileged document, > The `MRET`, `SRET`, or `URET` instructions are used to return from traps in M-mode, S-mode, or U-mode respectively. When executing an xRET instruction, supposing xPP holds the value y, x IE is set to x PIE; the privilege mode is changed to y; x PIE is set to 1; and xPP is set to U (or M if user-mode is not supported). > * B02 = ? This program operates on a RISC-V processor equipped with complete bypassing and annulment capabilities, and exceptions are active. Assume that exceptions are processed instantly. Furthermore, it is assumed that the `minimal_handler` consistently returns control to the instruction immediately following the one that triggered the exception. The specific code within the `minimal_handler` need not be considered. Complete the pipeline diagram starting with the initial `li` instruction (disregard cells marked with '-' and '?'). It is unnecessary to display any bypasses used, if applicable. Consider `mret` similar to a branch instruction, with the next PC established in the EXE stage and branches presumed to be not taken. | - | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |---|---|---|---|---|---|---|---|---|---|----| | IF | `li` | `li` | `lw` | `li` | `add` | `addi` | `addi` | `csrr` | `mret` | `addi` | | ID | - | `li` | `li` | `lw` | `li` | `add` | ? | B03 | B04 | B05 | | EXE | - | - | `li` | `li` | `lw` | `li` | ? | ? | B06 | B07 | | MEM | - | - | - | `li` | `li` | `lw` | ? | ? | ? | B08 | | WB | - | - | - | - | `li` | `li` | ? | ? | ? | ? | > * B03 = ? > * B04 = ? > * B05 = ? > * B06 = ? > * B07 = ? > * B08 = ? --- ## Problem `C` Assuming a 16-byte, fully associative cache with 4-byte blocks, express hit rates as simplified fractions. Consider the following code: ```c #include <stdint.h> int main() { int32_t arr[6]; for (int i = 0; i < 6; i++) { arr[i] += arr[0]; arr[i] += arr[1]; arr[i] += arr[2]; arr[i] += arr[3]; } } ``` Determine the hit rate during the first iteration of the `for` loop when using a Least Recently Used (LRU) replacement policy. __ C01 __ > * C01 = ? Determine the hit rate during the final iteration of the `for` loop when utilizing a LRU replacement policy. __ C02 __ > * C02 = ? --- ## Problem `D` Consider that "way prediction" is a method to enhance efficiency in set-associative caches. It involves forecasting the specific cache way that will likely be used for a given memory request. When the prediction is accurate, it eliminates the need to search through other cache ways. However, if the prediction is wrong, the process continues as it would in a standard set-associative cache access. ![set-associative caches](https://hackmd.io/_uploads/BkoFBKg_6.png) Examine a 16 KiB two-way set associative cache that does not employ way prediction. With a hit time of 5 cycles, an 80% hit rate, and a Level 2 (L2) cache access time of 30 cycles, calculate the Average Memory Access Time (AMAT) for this cache. __ D01 __ cycles > * D01 = ? Evaluate an 8 KiB direct mapped cache with a hit time of 2 cycles, a 60% hit rate, and a L2 cache access time of 30 cycles. Calculate the AMAT for this cache. __ D02 __ cycles > * D02 = ? Now, evaluate a 16 KiB two-way set associative cache that incorporates "way prediction." Calculate the AMAT for this cache configuration. __ D03 __ cycles > * D03 = ? --- ## Problem `E` We want design a Finite State Machine (FSM) where the output at time step $N$ is 0 for the initial two time steps and then matches the input from time step $N - 2$ for subsequent steps. For instance, if the input sequence to this FSM is `0b01 1011 1011 1000 1001`, the corresponding output should be `0b00 0110 1110 1110 0010`. To extend beyond a 2-cycle delay, consider an FSM that delays the input by 12 cycles, outputting 0 for the initial 12 cycles. For example, if the input to this updated FSM is `0b01 1011 1011 1000 1001`, the output should be `0b00 0000 0000 0001 1011`. Determine the minimum number of states required for an FSM to achieve this function. Provide your answer as a precise whole number. __ E01 __ > * E01 = ? Determine the shortest possible clock period for a circuit that addresses this issue, assuming the register is expanded to adequately many bits without affecting clk-to-q and setup times. Our circuit should be optimized for the minimum clock period, considering the given component delays. __ E02 __ ps ![circuit](https://hackmd.io/_uploads/Hkdi3Fed6.png) * t~AND~ = 12ps * t~OR~ = 15ps * t~NOT~ = 4ps * t~XOR~ = 31ps * t~Register-clk-to-q~ = 10ps * t~Register-setup~ = 15ps * t~Bit-splitter~ = 0ps * t~Wire~ = 0ps > * E02 = ? --- ## Problem `F` Consider a memory address space where addresses are chunked in 64-byte segments. For instance, address 1024 is 64 bytes apart from address 1025, but only 1 byte away from the "address" $1024 + \frac{1}{64}$. In this system, standard integer binary addressing is ineffective, so consider using floating-point representation based on the IEEE-754 standard, which includes 1 sign bit. You will need to determine the appropriate number of exponent and mantissa bits. Given a memory size of 4KiB and chunk addresses labeled as 0, 1, 2, etc., identify the minimum number of exponent bits needed in the floating-point memory address to access every byte, assuming the use of a standard bias. __ F01 __ > * F01 = ? Determine the least number of mantissa bits needed in our floating point memory address system to precisely address each byte. __ F02 __ > * F02 = ? --- ## Problem `G` Suppose a complex task is divided into four segments: PartA, PartB, PartC, and PartD. To efficiently tally their populations, each segment utilizes one core of a quad-core processor system, where each core has its own cache. These caches maintain coherence through a snoopy-based, write-invalidate MSI (Modified, Shared, Invalid) protocol, as illustrated. The goal is to examine the performance of this existing system to determine if improvements can be achieved. ![A system](https://hackmd.io/_uploads/HyWbxsldT.png) Each core starts by accessing and modifying its local copy of the population data, stored under a suitably named label (like `pop_A` for the population of PartA). This process is demonstrated in the pseudocode below: ```c call pop_count # This function counts the additional population in the area. lw a1, <pop_label> # This loads the earlier population count. add a1, a1, a0 # This performs a calculation to sum up the total. sw a1, <pop_label> # This saves the updated population count. ``` The sequence of these memory accesses is as follows: ![sequence of memory accesses](https://hackmd.io/_uploads/SJ2DxixuT.png) We aim to closely examine the activities on Core A specifically. Understanding that these memory accesses are independent (meaning `pop_A`, `pop_B`, `pop_C`, and `pop_D` are distinct memory locations), it suffices to focus solely on the memory interactions on Core A to grasp its behavior. Please complete the table below by entering information in columns G01 and G02. For each entry, detail the bus transaction(s) resulting from each access, and the state of the cache line for `pop_A` following each access. | Access | Shared bus transactions | Cache A | |:-------|:------------------------|:--------| | Initial state | - | `pop_A`: I | | `lw a1, pop_A` | BusRd | `pop_A`: G01 | | `sw a1, pop_A` | BusRdX | `pop_A`: G02 | > * G01 = ? > * G02 = ? We are curious to know whether adopting an MESI protocol, as illustrated, could enhance the performance of this processor. Suppose these four cores were to use a snoopy-based, write-invalidate MESI protocol, what would be the sequence of bus transactions and the cache line states for `pop_A` from Core A? Please complete the table below by entering information in columns G03 and G04. ![MESI protocol](https://hackmd.io/_uploads/rkzTWsx_6.png) | Access | Shared bus transactions | Cache A | |:-------|:------------------------|:--------| | Initial state | - | `pop_A`: I | | `lw a1, pop_A` | BusRd | `pop_A`: G03 | | `sw a1, pop_A` | - | `pop_A`: G04 | > * G03 = ? > * G04 = ? Each core proceeds to modify the overall census count, stored at a shared label named `sum`, as demonstrated in the following pseudocode: ```c lw a2, sum # retrieves the current global census count. add a2, a2, a0 # adds to the population count. sw a2, sum # saves the new population count. ``` The sequence of these memory accesses is as follows: ![sequence of memory accesses](https://hackmd.io/_uploads/B1Ql7slOT.png) Complete the table provided by detailing the bus transactions resulting from each memory access and the corresponding states of the cache line for `sum` after each access. An additional copy of this table is available at the end of the exam. Remember, this processor utilizes an MSI protocol for cache coherence. | Access | Shared bus transactions | Cache A | Cache B | Cache C | Cache D | |:-------|:------------------------|:--------|:--------|:--------|:--------| | Initial state | - | sum: I | sum: I | sum: I | sum: I | | Core A: `lw a2, sum` | BusRd | sum: G05 | sum: I | sum: I | sum: I | | Core A: `sw a2, sum` | BusRdX | sum: G06 | sum: I | sum: I | sum: I | | Core B: `lw a2, sum` | BusRd, BusWB | sum: S | sum: S | sum: I | sum: I | | Core C: `lw a2, sum` | BusRd | sum: G07 | sum: G08 | sum: G09 | sum: I | | Core C: `sw a2, sum` | BusRdX | sum: I | sum: I | sum: G10 | sum: I | | Core D: `lw a2, sum` | BusRd, BusWB | sum: I | sum: I | sum: S | sum: G11 | | Core B: `sw a2, sum` | BusRdX | sum: I | sum: G12 | sum: I | sum: I | | Core D: `sw a2, sum` | BusRdX, BusWB | sum: I | sum: I | sum: I | sum: G013 | > * G05 = ? > * G06 = ? > * G07 = ? > * G08 = ? > * G09 = ? > * G10 = ? > * G11 = ? > * G12 = ? > * G13 = ? We identify an issue with the current census counting implementation: there is a chance that two processors might simultaneously read `sum`, each add their population count to it without considering the other's addition, and then write back to `sum`, leading to inaccuracies. To ensure the correct final value of `sum` — that accurately reflects the aggregate of the `a0` values from each processor — where should we integrate [semaphores](https://en.wikipedia.org/wiki/Semaphore_(programming))? These semaphores should minimize blocking of instructions. Define and initialize any necessary semaphores, then add `wait` (`P`) and `signal` (`V`) operations in the pseudocode below to guarantee accurate calculation of `sum`. It is not required to include ecall instructions in this modification. Semaphore Initialization: ```c sum_sem = 1 ``` Pseudocode: ```c call pop_count # Z1: counts the new population of the area lw a1, <pop_label> # Z2: loads the previous population add a1, a1, a0 # Z3: calculates a sum sw a1, <pop_label> # Z4: stores the updated population lw a2, sum # Z5: gets the current global count add a2, a2, a0 # Z6: updates the population sw a2, sum # Z7: stores the updated population ``` Specify the precise instruction, labeled as mnemonic `z1` to `z7`, after which the `wait(sum_sem)` should be inserted. This is to guarantee exclusive access to `sum` at crucial update moments, thereby ensuring data accuracy. > * G14 = ? Specify the precise instruction, labeled as mnemonic `z1` to `z7`, after which the `signal(sum_sem)` should be inserted. This is to guarantee exclusive access to `sum` at crucial update moments, thereby ensuring data accuracy. > * G15 = ? --- ## Problem `H` Assuming a RISC-V processor with 38-bit virtual addresses, 26-bit physical addresses, and a 256-byte page size, the Page Table employs an LRU strategy for managing pages and addresses missing pages through a page fault handler. This processor has a 4-entry, direct-mapped TLB (Translation Lookaside Buffer) and utilizes VPN (Virtual Page Number) bits 1:0 for indexing into the TLB. Refer to the below table which shows the TLB contents before executing two instructions. The table includes both tag and index bits of the VPN for clarity, although only tag bits are typically stored in the TLB. After executing each of the two instructions, update the TLB contents accordingly. Note that the TLB's dirty bits are synchronized with those in the page table. Remember: The TLB stores page table entries, which may be altered by the page fault handler. For accurate operation, most processors use a special instruction to invalidate a TLB entry. The page fault handler is assumed to use this instruction. Also, assume all page fault handler accesses are through physical memory and bypass the TLB. - [ ] Initial TLB contents | Index | VPN <br> (Tag+Index) | V | D | PPN | |-------|-----------------|---|---|-----| | 0 | 0x200 | 1 | 1 | 0xBC | | 1 | 0x1 | 1 | 0 | 0x123 | | 2 | 0x36 | 1 | 0 | 0x412 | | 3 | 0x23 | 1 | 1 | 0x99 | - [ ] Update the contents of the TLB following the execution of the instruction `lw x12, 0x180(x0)` with the PC set at `0x2FC`. | Index | VPN <br> (Tag+Index) | V | D | PPN | |-------|-----------------|---|---|-----| | 0 | 0x200 | 1 | 1 | 0xBC | | 1 | 0x1 | 1 | 0 | 0x123 | | 2 | H01 | H02 | H03 | H04 | | 3 | 0x23 | 1 | 1 | 0x99 | > * H01 = ? > * H02 = ? > * H03 = ? > * H04 = ? - [ ] Update the contents of the TLB following the execution of the instruction `sw x12, 0x60C(x0)` with the PC set at `0x300`. | Index | VPN <br> (Tag+Index) | V | D | PPN | |-------|-----------------|---|---|-----| | 0 | 0x200 | 1 | 1 | 0xBC | | 1 | 0x1 | 1 | 0 | 0x123 | | 2 | H05 | 1 | 1 | H06 | | 3 | H07 | 1 | 0 | H08 | > * H05 = ? > * H06 = ? > * H07 = ? > * H08 = ? --- ## Problem `I` RISC-V includes a floating-point extension (aka "F") supporting IEEE-754 single-precision floats, featuring 1 sign bit, 8 exponent bits, and 23 mantissa bits. This extension introduces new instructions and 32 floating-point registers (`f0` to `f31`), each holding 32 bits. Key Instructions: * `fadd.s rd, rs1, rs2` : Performs floating-point addition, setting `rd = rs1 + rs2`. * `fsub.s rd, rs1, rs2` : Conducts floating-point subtraction, with `rd = rs1 - rs2`. * `fmul.s rd, rs1, rs2` : Executes floating-point multiplication, resulting in `rd = rs1 * rs2`. * `fdiv.s rd, rs1, rs2` : Carries out floating-point division, assigning `rd = rs1 / rs2`. * `fmv.w.x rd, rs1` : Moves an integer to a floating-point register, converting rs1 to float and storing in rd. Note: The four arithmetic floating-point instructions solely operate on floating-point registers. Floating-point registers are incompatible with base instructions. For example, `fadd.s f0, t1, t2` and `addi f0, t1, 5` are invalid. The instruction `fmv.w.x` transfers a bitstring from an integer register `rs1` to a floating-point register `rd`. For instance, if `t0` holds the hexadecimal value `0x40800000` and `fmv.w.x f0, t0` is executed, then f0 will represent the floating point value `4.0`. Convert the value 3 into its hexadecimal representation using IEEE-754 single-precision floating point format. __ I01 __ > * I01 = ? Let's create a sequence of instructions to load a floating-point value as close as possible to 4/3 = 1.33333... into register `f1`. Restrict your instructions to the 5 floating-point operations previously mentioned. Each line should contain only one RISC-V instruction or pseudoinstruction. ```c li t4, 0x40800000 li t3, 0x40400000 __ I02 __ __ I03 __ __ I04 __ ``` > * I02 = ? > * I03 = ? > * I04 = ? --- ## Problem `J` Examine the circuit below. The AND and OR gates each have a delay of 8 ps, while the NOT gates have a delay of 4 ps. Additionally, all registers in the circuit are subject to a setup time constraint of 6 ps and a clock-to-Q delay of 3 ps. For the purpose of this analysis, assume that all connecting wires are ideal, meaning they have no delay. ![circuit](https://hackmd.io/_uploads/ryliyhxda.png) What is the smallest combinational delay of all paths in this circuit? __ J01 __ ps > * J01 = ? What is the maximum possible hold time constraint for registers to function properly in this circuit? __ J02 __ ps > * J02 = ? What is the maximum allowable clock frequency for this circuit to function properly? __ J03 __ GHz > * J03 = ? --- ## Problem `K` In various applications, it is often necessary not only to identify the maximum value in an array but also to determine the index of this maximum element, known as the argmax. To achieve this efficiently, we can employ data level parallelism. The function `arg_max` below accepts an array `arr` and its length $n$, returning the index of the highest value. In cases where multiple indices hold the same maximum value, the function will return the earliest of these indices. Utilize the provided pseudo SIMD ([Single Instruction, Multiple Data](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data)) intrinsics to complete the argmax function, ensuring it operates as intended. These SIMD intrinsics work with vec structs, which are similar to Intel's `__m128i` structs, containing 4 packed integers. Complete the sections labeled K01, K02, and K03 in the function to ensure it performs as intended. SIMD Instructions: * `vec sum_epi32(vec a, vec b)` : Returns `a + b`. * `vec and_epi32(vec a, vec b)` : Returns `a & b`. * `vec set_epi32(int a)` : Returns a SIMD vector with all entries set to a. * `vec load_epi32(int *a)` : Returns a SIMD vector with entries `a[0]`, `a[1]`, `a[2]`, and `a[3]`. * `int reducemax_epi32(vec a`) : Returns the maximum int value in vector `a`. * `vec maskeq_epi32(vec a, int b)` : Returns a mask vector with `0xFFFFFFFF` at indices where a equals `b`, and `0` otherwise. * `int firstv_epi32(vec a)` : Returns the index of the first entry with its lowest bit set to `1`. ```c int arg_max(int *arr, int n) { int curr, index = 0, running_max = -2147483648; /* -2^31 */ for (int i = 0; i < n / 4 * 4; i += 4) { vec tmp = load_epi32(arr + i); curr = reducemax_epi32(tmp); if (curr > running_max) { tmp = K01 /* Fill in your code here */; index = i + K02 /* Fill in your code here */; running_max = curr; } } for (int i = n / 4 * 4; i < n; i += 1) { if (K03 /* Fill in your code here */) { running_max = arr[i]; index = i; } } return index; } ``` > * K01 = ? > * K02 = ? > * K03 = ? ---