---
tags: computer-arch
---

# Quiz5 of Computer Architecture (2023 Fall)

> Solutions

## Problem `A`

NASA's team bid a poignant farewell to the [Opportunity rover](https://en.wikipedia.org/wiki/Opportunity_(rover)), marking the end of its 15-year Martian exploration journey. Opportunity had been a treasure trove of data for Earth's scientists but began transmitting concerning signals months earlier, indicating low battery levels and diminishing light as a massive dust storm engulfed Mars. NASA persistently tried to reestablish contact, hoping to revive the rover. When these efforts proved futile, the agency officially concluded the mission and announced Opportunity's end. Amidst these events, expressions of homage emerged, including one from a Twitter user, dutchess-becky, who paid tribute in a language befitting the rover: binary code.

![postcard](https://hackmd.io/_uploads/Skeojdcra.jpg)

The message, "01000111 01101111 01100100 01110011 01110000 01100101 01100101 01100100 00100000 01001111 01110000 01110000 01111001," translates to "Godspeed Oppy" in English. This reflects a sentiment echoed by many in the space community, where "Oppy" is the nickname of the Opportunity rover.

> Reference: ['Godspeed, Oppy' - emotional tributes as Mars Opportunity rover pronounced dead](https://www.walesonline.co.uk/news/uk-news/godspeed-oppy-emotional-tributes-mars-15833299)

Let's create a C program designed to translate messages from binary to plain, human-readable text. The program is stored in a file named `conv.c`, and its contents are as follows.
```c
#include <stdio.h>
#define BUF_SIZE 64
int main()
{
    int stack = 0, buf_idx = 0;
    char c, buf[BUF_SIZE] = {0};
    while ((c = getchar()) != EOF) {
        if ((c & ~1) == '0') {
            A01 /* Fill your code here */;
            if (stack & 0x100) {
                buf[buf_idx++] = stack >> 1;
                A02 /* Fill your code here */;
                buf[buf_idx] = 0;
            }
        } else {
            if (stack) {
                buf[buf_idx++] = stack;
                stack = 0;
            }
            buf[buf_idx++] = c;
            buf[buf_idx] = 0;
        }
        if (buf_idx >= BUF_SIZE - 1) {
            printf("%s", buf);
            buf_idx = 0;
            buf[buf_idx] = 0;
        }
    }
    printf("%s", buf);
    return 0;
}
```

Assuming `conv` is the built executable from `conv.c`, the message sent from Earth is:

```
Dear Opportunity,

01000111 01101111 01100100 01110011 01110000 01100101 01100101 01100100 00100000 01001111 01110000 01110000 01111001

From,
A Martian Fan
```

The decoded message appears as follows.

```
Dear Opportunity,

G o d s p e e d O p p y

From,
A Martian Fan
```

To complete the code `conv.c` as a functional message decoder as described earlier, please insert the necessary statements. Your solutions should **exclusively** utilize the operations `=`, `&`, `|`, `^`, `>>`, and `<<`, along with parentheses `(` and `)`. **Other operators are not permitted**. Additionally, ensure that both A01 and A02 consist of a single statement each, without including a semicolon `;`. Your code should be written in the most concise manner possible.

> * A01 = ?
> ==`stack = (stack << 1) | (c & 1)`==
> * A02 = ?
> ==`stack = c & 1`==

---

## Problem `B`

Evaluate the cache performance (2-way set associative, 8 sets, block size of 2 words) using the given RISC-V assembly code. This program cycles through arrays `A` (starting at `0x130`) and `B` (starting at `0x200`), performing the operation `A[i] - B[i]` and storing the result back in `A[i]`.

```c!
# Initial register settings:
#   x1 = 0 (loop index i)
#   x2 = 4 (size of arrays A and B)
. = 0x0                  # Code begins at address 0x0
loop:
    slli x3, x1, 2       # Convert index to byte offset
    lw   x4, 0x130(x3)   # Load A[i]
    lw   x5, 0x200(x3)   # Load B[i]
    sub  x4, x4, x5      # Calculate A[i] - B[i]
    sw   x4, 0x130(x3)   # Store result in A[i]
    addi x1, x1, 1       # Increment index
    blt  x1, x2, loop    # Repeat 4 times
    unimp                # End of this program
```

1. Begin with an empty cache (all valid bits set to 0) before running the code. This cache employs a least-recently used (LRU) replacement strategy, with all cache lines in Way 0 being the least recently used. After one iteration of the loop (immediately after the first execution of the `blt` instruction), update the LRU bit, the dirty bit (D), the valid bit (V), the tag, and the data words. Use the opcode for instruction words (like `blt`) and the array element for data words (like `B[0]`). It is assumed that if V = 0, then D = 0 as well.

| Index | LRU after one loop<br>iteration |
|------:|:--------------------------------|
| 0 | __ B01 __ |
| 1 | __ B02 __ |
| 2 | __ B03 __ |
| 3 | __ B04 __ |
| 4 | __ B05 __ |
| 5 | __ B06 __ |
| 6 | __ B07 __ |
| 7 | __ B08 __ |

> * B01 = ?
> ==0==
> * B02 = ?
> ==1==
> * B03 = ?
> ==1==
> * B04 = ?
> ==1==
> * B05 = ?
> ==0==
> * B06 = ?
> ==0==
> * B07 = ?
> ==1==
> * B08 = ?
> ==0==

2. Complete the following table by noting the counts of instruction hits, instruction misses, data hits, and data misses across each of the four iterations of the loop. I$ = instruction cache; D$ = data cache.

| # Loop iteration | I$ Hits | I$ Misses | D$ Hits | D$ Misses |
|------------------|---------|-----------|---------|-----------|
| First  | __ B09 __ | __ B10 __ | __ B11 __ | __ B12 __ |
| Second | __ B13 __ | __ B14 __ | __ B15 __ | __ B16 __ |
| Third  | __ B17 __ | __ B18 __ | __ B19 __ | __ B20 __ |
| Fourth | __ B21 __ | __ B22 __ | __ B23 __ | __ B24 __ |

> * B09 = ?
> ==3==
> * B10 = ?
> ==4==
> * B11 = ?
> ==1==
> * B12 = ?
> ==2==
> * B13 = ?
> ==7==
> * B14 = ?
> ==0==
> * B15 = ?
> ==3==
> * B16 = ?
> ==0==
> * B17 = ?
> ==7==
> * B18 = ?
> ==0==
> * B19 = ?
> ==1==
> * B20 = ?
> ==2==
> * B21 = ?
> ==7==
> * B22 = ?
> ==0==
> * B23 = ?
> ==3==
> * B24 = ?
> ==0==

3. Calculate the hit ratio for the four iterations of the loop, excluding the execution of the `unimp` instruction. Your answer can be presented in fractional form.

> * B25 = ?
> ==$\frac{32}{40}$==

---

## Problem `C`

In the given problems, you will work with a simplified [vector instruction set architecture](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc). Key terms include `VL` (vector length) and `VE` (vector element size). Registers in this ISA are denoted by prefixes: `v` for vector, `r` for scalar, and `f` for mask registers (examples: `vd`, `rs1`, `fs1`). Here are the operations:

* `vld vd, (rs1)`: Load `VL` consecutive values of size `VE` from the memory address in register `R[rs1]` into vector register `vd`. Essentially, `vd` gets values from `Mem[R[rs1]]`.
* `vld vd, vs1`: Load `VL` values from the memory addresses specified in each element of vector register `vs1` into vector register `vd`. So, `vd = Mem[vs1]`.
* `vst vd, (rs1)`: Store `VL` consecutive values of size `VE` from vector register `vd` into the memory address in `R[rs1]`. Thus, `Mem[R[rs1]]` receives values from `vd`.
* `vst vd, vs1`: Store values from vector register `vd` into the memory addresses specified in each element of vector register `vs1`. Therefore, `Mem[vs1]` gets values from `vd`.
* `vst vd, (rs1), imm`: Store the first `imm` consecutive values of size `VE` from vector register `vd` into the memory address in `R[rs1]`, which means `Mem[R[rs1]]` equals `vd`.
* `vadd vd, vs2, vs1`: Perform element-wise addition of `vs1` to `vs2`, storing the result in `vd`. So, `vd = vs1 + vs2`.
* `vsub vd, vs2, vs1`: Subtract `vs2` from `vs1` on an element-wise basis and store in `vd`, hence `vd = vs1 - vs2`.
* `vmul vd, vs2, vs1`: Multiply `vs1` by `vs2` element-wise, with the result in `vd`, so `vd = vs1 * vs2`.
* `vdiv vd, vs2, vs1`: Divide `vs1` by `vs2` element-wise, placing the result in `vd`. Therefore, `vd = vs1 / vs2`.
* `vredsum vd, vs1`: Reduce the sum of the elements in `vs1` into the first element of `vd`. That is, $vd[0] = \sum_{i=0}^{VL-1}vs1[i]$
* `vgt vs1, fd`: This operation examines each element in vector register `vs1`. If an element is greater than 0, the corresponding bit in mask register `fd` is set to 1; otherwise, it is set to 0. In other words, for each index `i` from `0` to `VL - 1`, `fd[i]` becomes `1` if `vs1[i] > 0`, else it is `0`. After executing this instruction, the vector register `vd` will be subject to this mask for subsequent instructions.

You have a vector code snippet for calculating the dot product of a vector of length `VL`. The code is as follows:

```c
vld     v1, (x1)     # Load vector v1
vld     v2, (x2)     # Load vector v2
vmul    v3, v2, v1   # Multiply vectors v2 and v1
vredsum v4, v3       # Reduce sum of vector v3
vst     v4, (x3), 0  # Store vector v4
```

When computing the number of cycles, consider these assumptions:

* Vector length (VL) is 32 elements.
* There are 8 vector lanes.
* Each vector register holds 32 elements.
* Each lane has one ALU with a 1-cycle latency.
* Each lane has one load/store unit with a 4-cycle latency, and it is fully pipelined.
* There is no dead time.
* The memory system is ideal, with no bank conflicts or cache misses.
* Reduction operations (like `vredsum`) behave similarly to arithmetic vector instructions (such as `vmul`).

Each vector instruction goes through fetch (F) and decode (D) phases, then stalls (-) until the necessary vector functional unit becomes available. Without chaining, a dependent vector instruction will stall until the preceding instruction has completed writing back all its elements. Vector instructions are executed in a pipelined manner across all lanes simultaneously.
For each element, operands are read `(R)` from the vector register file, processed in the load/store unit (M) or the ALU (X), and then the result is written back (W) to the vector register file.

1. Determine the total number of cycles required to execute the sequence without vector chaining, excluding the cycle for the final Writeback (W) phase of the store instruction.

> * C01 = ?
> ==35==
> ![cycles](https://hackmd.io/_uploads/SkbJThoBT.png)

2. Calculate the total number of cycles required with vector chaining. In this setup, vector loads and stores use the same load-store unit, while vector arithmetic operations have distinct execution units. Vector chaining operates via the register file, allowing an element to be read `(R)` in the same cycle as its writeback (W), or in any subsequent cycle, offering flexible chaining. Considering there is only one set of vector lanes (8 lanes) that can perform writeback in a single cycle, provide the cycle count excluding the final Writeback (W) phase of the store instruction.

> * C02 = ?
> ==27==
> ![cycles](https://hackmd.io/_uploads/HyBjanora.png)

---

## Problem `D`

Complete the table with the hexadecimal tags for each cache line, given these conditions:

* You are working with a 2-way set-associative L1 cache that is initially empty and uses a FIFO (First-In, First-Out) replacement policy.
* Way 0 is filled before Way 1.
* The address structure for the L1 cache is divided as follows: 3-bit tag, 1-bit index, and 4-bit offset.
Cache Access Table

| Address | Set 0, Way 0 | Set 0, Way 1 | Set 1, Way 0 | Set 1, Way 1 | Hit (Y/N) |
|---------|--------------|--------------|--------------|--------------|-----------|
| 0b1000 1000 | __ D01 __ | - | - | - | __ D02 __ |
| 0b0000 0000 | - | __ D03 __ | - | - | __ D04 __ |
| 0b1000 0010 | - | - | - | - | __ D05 __ |
| 0b0101 0010 | - | - | __ D06 __ | - | __ D07 __ |
| 0b0001 0000 | - | __ D08 __ | - | - | __ D09 __ |
| 0b0100 0000 | __ D10 __ | - | - | __ D11 __ | __ D12 __ |
| 0b1000 1101 | - | __ D13 __ | - | - | __ D14 __ |
| 0b0010 0000 | __ D15 __ | - | - | - | __ D16 __ |
| 0b0011 0000 | - | - | __ D17 __ | - | __ D18 __ |

1. For each cell from D01 to D18 in the Cache Access Table, provide the tag in hexadecimal, or specify whether the access is a hit (Y for Yes) or a miss (N for No).

> * D01 = ?
> ==0x4==
> * D02 = ?
> ==N==
> * D03 = ?
> ==0x0==
> * D04 = ?
> ==N==
> * D05 = ?
> ==Y==
> * D06 = ?
> ==0x2==
> * D07 = ?
> ==N==
> * D08 = ?
> ==0x0==
> * D09 = ?
> ==N==
> * D10 = ?
> ==0x2==
> * D11 = ?
> ==0x0==
> * D12 = ?
> ==N==
> * D13 = ?
> ==0x4==
> * D14 = ?
> ==N==
> * D15 = ?
> ==0x1==
> * D16 = ?
> ==N==
> * D17 = ?
> ==0x1==
> * D18 = ?
> ==N==

2. Calculate the average memory access time (AMAT) for a program with the following parameters: The L1 cache hit time is 4 ns, the L2 cache access time is 20 ns, and out of 10 memory accesses, 2 are hits in the L1 cache. All misses in L1 are resolved in the L2 cache.

__ D19 __ ns

> * D19 = ?
> ==20==
> AMAT = HitTime + MissRate $\times$ MissPenalty
> AMAT = $4\ ns + \frac{8}{10} \times 20\ ns = 4 + 16 = 20\ ns$

---

## Problem `E`

Take into account a computer system that uses 256-byte page sizes, 16-bit virtual addresses, and 16-bit Page Table Entries (PTEs), with 4 bits of these PTEs dedicated to validity and protection flags. This computer employs a two-level hierarchical page table system and is byte-addressable.

1. Calculate the total number of virtual pages that this computer can address.

> * E01 = ?
> ==256==
> Since the virtual addresses are 16 bits wide and the machine is byte-addressable, the total amount of virtual address space is 2^16^ bytes.
> Page size is 256 bytes, so a total of 2^16^ / 2^8^ = 2^8^ = 256 virtual pages can be addressed.

2. Determine the largest possible size of physical memory that this computer, with its given specifications, can support.

__ E02 __ KiB

> * E02 = ?
> ==1024==
> The size of the physical page number (PPN) is 16 − 4 = 12 bits.
> Each PPN can address a single page in memory, which is 256 bytes.
> Thus, the maximum size of physical memory that this machine can address is $2^{12} \times 256$ bytes = 1 MiB = 1024 KiB.

3. Estimate the minimum number of valid Page Table Entries (PTEs) and Page Table Pointers (PTPs) required in the page table(s) for a program that is currently utilizing 300 bytes of memory.

__ E03 __ PTP(s)

> * E03 = ?
> ==1==
> Since the page size is 256 bytes and 300 > 256, there must be at least two valid pages in memory.
> This means at least 2 PTEs must be valid, and since we have a 2-level page table, at least 1 PTP must be valid.

---

## Problem `F`

Consider the scenario where we are performing vector addition using RISC-V assembly language. The code we have developed is as follows:

```c
    mv   x3, x4
loop:
    lw   x6, 0(x4)      # Load an element from vector 1
    lw   x7, 0x300(x4)  # Load an element from vector 2
    add  x7, x7, x6     # Perform addition of the two elements
    sw   x7, 0(x4)      # Store the result back into vector 1
    addi x4, x4, 4      # Increment the vector offset
    blt  x4, x8, loop   # Continue loop for each vector element
    mv   x10, x3
    addi sp, sp, 8      # Here, sp is equivalent to x2
    ret
```

This code is executed on a conventional 5-stage RISC-V processor equipped with full bypassing and branch annulment capabilities.
It operates under the assumption that branch predictions are consistently made as not taken, and branch decisions are finalized during the EXE stage. The loop is designed to run many times and is currently in the midst of its execution.

1. Suppose the instruction `lw x6, 0(x4)` is fetched at cycle `100`. Determine the number of cycles each iteration of the loop requires. Additionally, calculate the number of cycles lost per iteration due to stalls and the number of cycles wasted due to annulments.

* Number of cycles per loop iteration: __ F01 __
* Number of cycles per loop iteration wasted due to stalls: __ F02 __
* Number of cycles per loop iteration wasted due to annulments: __ F03 __

> * F01 = ?
> ==10==
> * F02 = ?
> ==2==
> * F03 = ?
> ==2==
> ![pipeline](https://hackmd.io/_uploads/HyTi9psHa.png)

We are aiming to enhance the performance of the processor, noting in particular that the vectors involved are lengthy, so the loop's branch is taken almost always. To address this, we introduce a branch predictor into the processor. This predictor learns the loop's starting location during the first few iterations. Once it reaches a steady state, it consistently predicts the branch as taken, looping back to the start.

During the EXE stage, the processor checks the accuracy of the branch prediction. If the prediction is incorrect, it flushes from the pipeline all instructions fetched since the loop's beginning and resumes fetching from the point immediately following the branch instruction. This approach reduces unnecessary instruction execution and improves the efficiency of loop processing.

2. With the implementation of the improved branch prediction, suppose the instruction `blt x4, x8, loop` is fetched at cycle `200`.
Calculate the number of cycles saved in each loop iteration as a result of this enhanced branch prediction mechanism.

* Number of cycles saved per loop iteration in steady state: __ F04 __

> * F04 = ?
> ==2==
> ![pipeline](https://hackmd.io/_uploads/HJhAj6oBa.png =50%x)

We acknowledge that there are several ways to speed up a computation. Previously, we focused on reducing the cycle count; now, we shift our focus to making each cycle faster. Recognizing that excessive bypassing can actually hinder performance by increasing the propagation delay in the circuit, we modify the processor described in part 1 (which lacked branch prediction). Specifically, we eliminate the bypassing from the Writeback (WB) stage to the Decode (DEC) stage.

3. Determine how many cycles each loop iteration takes after the removal of the bypass from the Writeback (WB) stage to the Decode (DEC) stage.

> * F05 = ?
> ==11==

---

## Problem `G`

Vectorize the given C function using the RISC-V Vector extension, adhering to these guidelines:

* Provide the appropriate parameters for the `vsetvli` instruction.
* The vector unit is set for SEW=32, accommodating 32-bit elements.
* With LMUL=1, any of the 32 standard vector registers are available for use. Temporary registers can be utilized as needed.
* Assume that the arrays `a`, `b`, and `idx` do not have any overlapping memory locations (no aliasing).

In the `activation` function, the parameters are organized as follows: `a0`, `a1`, `a2`, `a3` hold the first four arguments, and `a4`, `a5` hold the remaining two. Here is the function's definition:

```c
void activation(unsigned int n, int32_t *a, int32_t *b, uint32_t *idx,
                int32_t scale, uint32_t dilation)
{
    for (unsigned int i = 0; i < n; i++) {
        if (a[dilation * i] < 0)
            b[idx[i]] = scale * a[dilation * i];
    }
}
```

This assembly code (incomplete; fill in G01 to G07) performs vectorized operations for the given function, efficiently utilizing RISC-V vector instructions.
The comments clarify the purpose and action of each instruction in the context of vector processing.

```c
# Prologue (if required)
    slli a5, a5, G01                   # Calculate stride in bytes
loop:
    vsetvli G02, G03, e32, m1, ta, mu  # Set vector length and type
    # Perform vector operations in this section
    vlse32.v v1, (a1), a5              # Load vector elements from a[] with stride dilation*i
    vmslt.vx v0, v1, x0                # Create a mask where a[dilation*i] is less than 0
    vmul.vx  v1, v1, a4, v0.t          # Multiply a[dilation*i] by scale where mask is true
    G04      v2, (a3), v0.t            # Load idx[i] under the mask
    vsll.vi  v2, v2, 2, v0.t           # Convert idx values to byte offsets
    G05      v1, (a2), v2, v0.t        # Store results to b[] at byte-offset positions
    # Increment pointers for the next loop iteration
    sub  a0, a0, t0                    # Decrease loop counter n by the number of processed elements
    mul  t1, t0, a5                    # Calculate byte offset for the next a[] elements
    slli t0, t0, G06                   # Convert vector length to byte offset for idx[]
    add  a3, a3, t0                    # Move idx pointer forward by processed elements
    add  a1, a1, t1                    # Move a pointer forward accounting for dilation
    # Loop back if there are more elements to process
    G07
end:
    ret                                # End of function
```

> * G01 = ?
> ==2==
> * G02 = ?
> ==t0==
> * G03 = ?
> ==a0==
> * G04 = ?
> ==`vle32.v`==
> * G05 = ?
> ==`vsoxei32.v`==
> * G06 = ?
> ==2==
> * G07 = ?
> ==`bnez a0, loop`==

---

## Problem `H`

Consider the design of a basic 3-stage pipeline processor, which includes the following stages:

* Instruction fetch and decode;
* Execution;
* Memory access and write-back.

![pipeline](https://hackmd.io/_uploads/SJLc9Corp.png)

This pipeline lacks bypassing capabilities. Branches are uniformly predicted as not taken, with the option for branch annulment present. Branch decision-making occurs during the execution stage. Additionally, the processor is characterized by specific propagation delays for each segment of its combinational logic.
| stage | propagation delay |
|-------|-------------------|
| IF    | 3 ns |
| DEC   | 4 ns |
| EXE   | 6 ns |
| MEM   | 5 ns |
| WB    | 3 ns |

Our recent interest has been the Fibonacci sequence. To indulge it, we have crafted a loop in RISC-V assembly language that computes the first 1000 numbers of the Fibonacci sequence (our patience does not extend beyond 1000 numbers) and stores them in memory.

```c
    addi t0, zero, 1000  # Set counter t0 to 1000
    addi a1, zero, 1     # Initialize fib(1) with value 1
    sw   a1, 0(t1)       # Store fib(1) at the base memory address in t1
    addi a2, zero, 1     # Initialize fib(2) with value 1
    addi t1, t1, 4       # Increment memory address pointer t1
    sw   a2, 0(t1)       # Store fib(2) at the new memory address
fib:
    add  a3, a2, a1      # Calculate fib(n) as sum of fib(n-1) and fib(n-2)
    addi t1, t1, 4       # Increment memory address pointer t1
    sw   a3, 0(t1)       # Store fib(n) at the new memory address
    mv   a1, a2          # Update fib(n-2) with value of fib(n-1)
    mv   a2, a3          # Update fib(n-1) with value of fib(n)
    addi t0, t0, -1      # Decrement the counter t0
    bnez t0, fib         # If t0 is not zero, loop back to 'fib'
    xor  t2, t3, t4
    not  t5, t3
    and  t6, t3, t4
```

1. Consider that during cycle 100, the instruction `add a3, a2, a1` is fetched, with registers `a2` and `a1` already containing current data. Calculate the total number of cycles required for each iteration of the loop, the number of cycles lost per iteration due to stalls, and the number of cycles lost due to annulments. Also determine the shortest possible clock cycle duration for this processor and the time taken to complete one iteration of the loop.

* Number of cycles per loop iteration: __ H01 __
* Number of cycles per loop iteration wasted due to stalls: __ H02 __
* Number of cycles per loop iteration wasted due to annulments: __ H03 __
* Processor minimum clock period: __ H04 __ ns
* Time to execute 1 iteration of the loop: __ H05 __ ns

> * H01 = ?
> ==12==
> * H02 = ?
> ==4==
> * H03 = ?
> ==1==
> * H04 = ?
> ==8==
> MEM + WB
> * H05 = ?
> ==96==
> $12 \times 8$
> ![cycles](https://hackmd.io/_uploads/SJjQARiHT.png)

We then opt to enhance our pipeline by incorporating bypass and branch prediction features. The pipeline now includes complete bypassing to IF/DEC, meaning data can bypass directly from the EXE stage to IF/DEC and from the MEM/WB stage to IF/DEC. Furthermore, branch instructions are now predicted as taken, branch annulment remains available, and branch decision-making still occurs in the EXE stage. The added bypass and prediction logic introduces the following propagation delays:

| stage | propagation delay |
|-------|-------------------|
| IF    | 3 ns |
| DEC   | 4 ns |
| EXE   | 6 ns |
| MEM   | 5 ns |
| WB    | 3 ns |
| PRD   | 1 ns |
| BYP   | 1 ns |

2. Consider that during the 100th cycle, the instruction `add a3, a2, a1` is fetched, with registers `a2` and `a1` already containing the latest values. Determine the number of cycles required for each loop iteration. Taking into account the specified delays for the bypass and branch prediction mechanisms, calculate the shortest possible clock cycle duration for this processor and the duration of one loop iteration.

* Number of cycles per loop iteration: __ H06 __
* Processor minimum clock period: __ H07 __ ns
* Time to execute 1 iteration of the loop: __ H08 __ ns

> * H06 = ?
> ==7==
> * H07 = ?
> ==15==
> EXE + PRD + IF + DEC + BYP
> * H08 = ?
> ==105==
> $7 \times 15$

---

## Problem `I`

Examine the given program:

```c
int A[8][8];
int B[8][8];
int C[8][8][8];

for (int i = 0; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
        for (int k = 0; k < 8; k++) /* Inner "k" loop */
            C[i][j][k] = A[i][j] + B[j][k];
    }
}
```

Now, consider a data cache setup that is three-way set associative with 4 sets, where each block contains 4 words. This cache exclusively stores data, not instructions.
The cache is divided such that accesses to array `A` are mapped to way 0, accesses to array `B` to way 1, and accesses to array `C` to way 2. Remember that arrays are stored in memory in row-major order. For array `C`, this implies that elements with varying `k` values for each `C[i][j]` are stored consecutively, followed by subsequent `j` values and their respective `k` elements.

1. During the execution of the innermost "k" loop, determine the number of distinct elements from each array that are accessed.

* Elements of A: __ I01 __
* Elements of B: __ I02 __
* Elements of C: __ I03 __

> * I01 = ?
> ==1==
> * I02 = ?
> ==8==
> * I03 = ?
> ==8==

2. Calculate the proportion of memory accesses to elements in array C that result in cache misses throughout the execution of the entire program. Express this as a fraction.

* Miss Rate: __ I04 __

> * I04 = ?
> ==$\frac{1}{4}$==
> Each access to C touches a new element not seen before. C is traversed in row-by-row order,
> so the next element of C accessed is directly adjacent in memory.
> This means that when a block is brought in from memory on a miss to C, it contains the next 3 elements to be accessed.
> Thus, only one in 4 accesses is a cache miss, as each miss is followed by 3 hits.
> Order of the C[i,j,k] array in its cache way for i = 0, j = 0 and j = 1:
> C[0,0,0] … C[0,0,3]
> C[0,0,4] … C[0,0,7]
> C[0,1,0] … C[0,1,3]
> C[0,1,4] … C[0,1,7]

3. Determine the ratio of cache misses to total memory accesses for elements in array A over the course of the entire program, and express this as a fraction.

* Miss Rate: __ I05 __

> * I05 = ?
> ==$\frac{1}{32}$==
> A is traversed in row-by-row order.
> The same element of A is accessed 8 times in the inner k loop. It will not be evicted from the cache, since it is accessed in every calculation and will not be the LRU.
> Thus, once a block is brought in for A, the next three elements will stick around for subsequent inner k loops.
> We only need one initial miss of an element of A to cover hits for four inner k loops, or 32 total accesses to A.
> Order of the A[i][j] array in its cache way for i = 0 and i = 1:
> A[0,0] … A[0,3]
> A[0,4] … A[0,7]
> A[1,0] … A[1,3]
> A[1,4] … A[1,7]

Think about rearranging the loop order by swapping the positions of the 'j' and 'k' loops, as shown below:

```diff
 int B[8][8];
 int C[8][8][8];

 for (int i = 0; i < 8; i++) {
-    for (int j = 0; j < 8; j++) {
-        for (int k = 0; k < 8; k++) /* Inner "k" loop */
+    for (int k = 0; k < 8; k++) {
+        for (int j = 0; j < 8; j++) /* Inner "j" loop */
             C[i][j][k] = A[i][j] + B[j][k];
     }
 }
```

Proceed to address the subsequent questions using the same data cache configuration as previously mentioned.

4. Calculate the proportion of memory accesses to elements in array C that result in cache misses throughout the entire program, expressed as a fraction.

* Miss Rate: __ I06 __

> * I06 = ?
> ==1==
> C is now traversed in column-by-column order. When C[i,0,0] is brought into the cache, so are C[i,0,1–3].
> However, the inner loop is now j, so the next access is C[i,1,0], which is not in the cache.
> It will be brought into set 2. Then C[i,2,0] will be brought into set 0 and evict C[i,0,0],
> so by the time you get to k = 1, you have evicted C[i,0,1] from the cache. This means you cannot take advantage of the spatial locality, and you miss every single time.

5. What fraction of the accesses to elements of A are cache misses for the entire program?

* Miss Rate: __ I07 __

> * I07 = ?
> ==$\frac{1}{32}$==
> A continues to be accessed in row-by-row order. You first bring A[0,0–3] into set 0 and A[0,4–7] into set 1.
> So out of the 8 values of the inner loop j, you miss twice and hit 6 times. For the next 7 values of k, you get all hits. So you have a miss rate of 2/64, or 1/32. Once you move to the next value of i, the process repeats itself.
---

## Problem `J`

Imagine a Chisel module designed to calculate the Greatest Common Divisor (GCD) using the subtraction method. This approach involves repeatedly subtracting the smaller number from the larger one until the value in register `y` becomes zero. At that point, the value remaining in register `x` represents the GCD.

```scala!
package gcd

import chisel3._

class GCD extends Module {
  val io = IO(new Bundle {
    val value1        = Input(UInt(16.W))
    val value2        = Input(UInt(16.W))
    val loadingValues = Input(Bool())
    val outputGCD     = Output(UInt(16.W))
    val outputValid   = Output(Bool())
  })

  val x = Reg(UInt())
  val y = Reg(UInt())

  when(x > y) {
    J01 /* Fill in your answer here */
  } .otherwise {
    J02 /* Fill in your answer here */
  }

  when(io.loadingValues) {
    x := io.value1
    y := io.value2
  }

  io.outputGCD := x
  io.outputValid := y === 0.U
}
```

Finalize the provided code to ensure it performs as intended.

> * J01 = ?
> ==`x := x - y`==
> * J02 = ?
> ==`y := y - x`==