---
tags: computer-arch
---

# Quiz6 of Computer Architecture (2021 Fall)

:::info
:information_source: General Information
* You are allowed to read **[lecture materials](http://wiki.csie.ncku.edu.tw/arch/schedule)**.
    * That is, this is an open-book exam.
* We are using the honor system during this quiz, and would like you to accept the following:
    1. You will not share the quiz with anyone.
    2. You will not discuss the material on the quiz with anyone until after solutions are released.
* Each answer is worth `3` points.
* Answers must be written using valid numeric representations and/or the English alphabet, except for your formal name.
* :timer_clock: 09:10 ~ 10:10 AM on Dec 21, 2021
* Fill in the ==[Google Form](https://docs.google.com/forms/d/e/1FAIpQLSei90JuzaLF6pnecoCmw2Aw_oZ5iao3ES_g4xFoXLoAa8sgaA/viewform)== to answer
:::

## Problem `A`

Assume that you have to design a mystery module "M" to work with the newest pipeline, but you do not know the maximum latency this module can have. You decide to consider a few scenarios to calculate the maximum latency for module "M".

1. For your first iteration, you are told that you must have a 2-stage pipeline whose latency is no more than 16ns. Using a minimal number of registers, please show the two-stage pipeline that maximizes the allowed latency of the "M" module. Then, calculate the maximum latency of module "M" and the throughput of the resulting circuit (using your value of "M").
![](https://hackmd.io/_uploads/HJQdxPRcF.png)
    * Max latency of "M" module (ns): __ A01 __
    * Throughput with max latency "M" (ns^-1^): __ A02 __

> * A01 = ?
> * A02 = ?

2. For your next iteration, you are told that you must have a 4-stage pipeline whose latency is no more than 20ns. Using a minimal number of registers, please show the four-stage pipeline that maximizes the allowed latency of the "M" module. Then, calculate the maximum latency of module "M" and the throughput of the resulting circuit (using your value of "M").
![](https://hackmd.io/_uploads/Bk1d-wR5K.png)
    * Max latency of "M" module (ns): __ A03 __
    * Throughput with max latency "M" (ns^-1^): __ A04 __

> * A03 = ?
> * A04 = ?

---

## Problem `B`

Assume that we are working on building a 32-bit RISC-V processor. As part of this project, we are considering several cache designs. The hardware has a limited amount of memory available, so any cache design will hold a total of 32 32-bit words (plus any associated metadata such as tag, valid, and dirty bits when needed).

We first consider a **direct-mapped** cache using a block size of 4 words.

1. If the cache holds a total of 32 words, how many lines will be in this cache?
    * Number of lines in the direct-mapped cache: __ B01 __

> * B01 = ?

2. To properly take advantage of locality, which address bits should be used for the block offset, the cache index, and the tag field? Express your answer using [minispec](https://minispec-hdl.github.io/minispec_reference.pdf)-style indexing into a 32-bit address. For example, the address bits for the byte offset would be expressed as `addr[ 1 : 0 ]`.
    * Address bits for block offset: addr[ __ B02 __ : __ B03 __ ]
    * Address bits for cache index: addr[ __ B04 __ : __ B05 __ ]
    * Address bits for tag field: addr[ __ B06 __ : __ B07 __ ]

> * B02 = ?
> * B03 = ?
> * B04 = ?
> * B05 = ?
> * B06 = ?
> * B07 = ?

We now consider switching to a 2-way **set-associative** cache, again using a block size of 4 words and storing 32 data words total (along with necessary metadata). The 2-way cache uses an LRU replacement policy.

3. How does this change affect the number of sets and the size of the tag compared to the direct-mapped cache?
    * Change in the number of sets (select one): __ B08 __
        - `(a)` None / can't tell
        - `(b)` 0.5x
        - `(c)` 2x
        - `(d)` -1
        - `(e)` +1
        - `(f)` Unchanged
    * Change in the size of the tag in bits (select one): __ B09 __
        - `(a)` None / can't tell
        - `(b)` 0.5x
        - `(c)` 2x
        - `(d)` -1
        - `(e)` +1
        - `(f)` Unchanged

> * B08 = ?
> * B09 = ?
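The relationship between capacity, block size, associativity, and the address fields asked about above can be sketched in C. This is only an illustrative sketch, assuming power-of-two parameters and byte-addressed 32-bit words; the names `cache_fields`, `log2_pow2`, and `cache_split` are hypothetical helpers and not part of the quiz.

```c
#include <assert.h>

/* Hypothetical helper type: bit widths of the address fields. */
struct cache_fields {
    unsigned num_sets;          /* number of sets (equals lines when ways == 1) */
    unsigned byte_offset_bits;  /* selects a byte within a 32-bit word */
    unsigned block_offset_bits; /* selects a word within a block */
    unsigned index_bits;        /* selects a set */
    unsigned tag_bits;          /* remaining high-order address bits */
};

/* Base-2 logarithm of a power of two. */
static unsigned log2_pow2(unsigned x)
{
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Split an address of addr_bits bits for a cache holding total_words
 * 32-bit words, organized as blocks of words_per_block words and
 * `ways` lines per set. */
static struct cache_fields cache_split(unsigned total_words,
                                       unsigned words_per_block,
                                       unsigned ways,
                                       unsigned addr_bits)
{
    struct cache_fields f;
    unsigned lines = total_words / words_per_block;

    f.num_sets = lines / ways;
    f.byte_offset_bits = 2; /* assumption: 4-byte words */
    f.block_offset_bits = log2_pow2(words_per_block);
    f.index_bits = log2_pow2(f.num_sets);
    f.tag_bits = addr_bits - f.index_bits - f.block_offset_bits
                 - f.byte_offset_bits;
    return f;
}
```

As a usage example with parameters different from the quiz, `cache_split(128, 8, 1, 32)` describes a direct-mapped cache with 16 sets, 4 index bits, and 23 tag bits, while `cache_split(128, 8, 4, 32)` describes a 4-way cache with 4 sets, 2 index bits, and 25 tag bits.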
We decide to use a 2-way **set-associative** cache with 4 cache sets and a block size of 4 words for our processor and would like to test out our caching system. We write the following code to simulate checking if the array from a quicksort function is properly sorted. The code iterates over a 200-element array and checks for correct ordering. You may treat `unimp` as a 32-bit instruction that terminates the program.

```cpp
. = 0x100 // The following code starts at address 0x100
check_sorted:
    // Initialize some registers
    li t0, 0            // t0 = loop index
    li t1, 199          // t1 = array size - 1
    lui t2, 0x3         // t2 = starting address of array = 0x3000
loop:
    lw t3, 0(t2)        // Load current element
    lw t4, 4(t2)        // Load next element
    ble t3, t4, endif   // Assume branch is always TAKEN
if_unsorted:
    unimp               // If two elements are out of order, terminate
endif:
    addi t0, t0, 1
    addi t2, t2, 4
    blt t0, t1, loop    // Continue checking; assume branch is TAKEN
    ret
```

For the rest of this problem, assume that the code is running at steady state (i.e., the code is in the middle of the array) and that the array is sorted correctly. In other words, assume that both the `ble` and `blt` branches are always taken. Also, you may assume that when execution of the code started, all cache lines were set to invalid and Way 0 was the LRU way for each cache set.

4. For one iteration of the loop (i.e., from the `loop` label to the `blt` instruction), how many instruction fetches and data accesses are performed?
    * Number of Instruction Fetches: __ B10 __
    * Number of Data Accesses: __ B11 __

> * B10 = ?
> * B11 = ?

5. In the steady state (i.e., ignoring any cold-start effects), what is the instruction fetch hit ratio and the data access hit ratio? Note: please provide the hit ratio and not the miss ratio. You may use the cache diagram provided below to help you, but nothing written in the diagram will be graded.
![](https://hackmd.io/_uploads/HJ15PPC9Y.png)
    * Instruction Fetch HIT Ratio: __ B12 __
    * Data Access HIT Ratio: __ B13 __

> * B12 = ?
> * B13 = ?

6. Assume that it takes 2 cycles to access the cache. We have benchmarked our code to have an average hit ratio of 0.9. If we need to achieve an average memory access time (AMAT) of at most 4 cycles, what is the upper bound on our miss penalty? Miss penalty here is defined as the amount of additional time required beyond the 2-cycle cache access to handle a cache miss.
    * Maximum possible miss penalty: __ B14 __ cycles

> * B14 = ?

---

## Problem `C`

Consider that we are analyzing grade statistics and are performing some hefty calculations, so we suspect that a cache could improve our system's performance.

1. We are considering using a 2-way set-associative cache with a block size of 4 (i.e., 4 words per line). The cache can store a total of 64 words. Assume that addresses and data words are 32 bits wide. To properly make use of locality, which address bits should be used for the block offset, the cache index, and the tag field?
    * Address bits used for byte offset: A[ 1 : 0 ]
    * Address bits used for tag field: A[ __ C01 __ : __ C02 __ ]
    * Address bits used for block offset: A[ __ C03 __ : __ C04 __ ]
    * Address bits used for cache index: A[ __ C05 __ : __ C06 __ ]

> * C01 = ?
> * C02 = ?
> * C03 = ?
> * C04 = ?
> * C05 = ?
> * C06 = ?

2. If we instead used a direct-mapped cache with the same total capacity (64 words) and same block size (4 words), how would the following parameters in our system change?
    * Change in the number of cache lines (select all of the choices below that apply): __ C07 __
        - `(a)` None / can't tell
        - `(b)` 0.5x
        - `(c)` 2x
        - `(d)` -1
        - `(e)` +1
        - `(f)` Unchanged
    * Change in the number of bits in tag field (select one of the choices below): __ C08 __
        - `(a)` None / can't tell
        - `(b)` 0.5x
        - `(c)` 2x
        - `(d)` -1
        - `(e)` +1
        - `(f)` Unchanged

> * C07 = ?
> * C08 = ?
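Once the field widths are known, splitting a concrete byte address into tag, index, and offsets is a matter of shifting and masking. The following is a minimal sketch, assuming 4-byte words (so the byte offset is always `addr[1:0]`); the type `addr_fields` and the function `addr_decode` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the decoded parts of a byte address. */
struct addr_fields {
    uint32_t tag;
    uint32_t index;
    uint32_t block_offset;
    uint32_t byte_offset;
};

/* Decode a 32-bit byte address, given the number of block-offset bits
 * and index bits. Assumes 4-byte words, i.e. 2 byte-offset bits. */
static struct addr_fields addr_decode(uint32_t addr,
                                      unsigned block_offset_bits,
                                      unsigned index_bits)
{
    assert(2 + block_offset_bits + index_bits < 32);

    struct addr_fields f;
    f.byte_offset = addr & 0x3u;
    f.block_offset = (addr >> 2) & ((1u << block_offset_bits) - 1u);
    f.index = (addr >> (2 + block_offset_bits)) & ((1u << index_bits) - 1u);
    f.tag = addr >> (2 + block_offset_bits + index_bits);
    return f;
}
```

For instance, with 3 block-offset bits and 4 index bits (values chosen for illustration, not taken from the quiz), `addr_decode(0x1234, 3, 4)` yields tag `0x9`, index `1`, block offset `5`, and byte offset `0`.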
Ultimately, we decide that the 2-way set-associative cache would probably have better performance for our application, so the remainder of the problem will consider a 2-way set-associative cache. Below is a snapshot of this cache during the execution of some unknown code. V is the valid bit and D is the dirty bit of each cache line.

![](https://hackmd.io/_uploads/rkJFAwC5Y.png)

3. Would the following memory accesses result in a hit or a miss? If it results in a hit, specify what value is returned; if it is a miss, explain why in a few words or by showing your work.
    - [ ] 32-Bit Byte Address: `0x4AB4`
        * Line index: __ C09 __
        * Tag: __ C10 __ (in HEX)
        * Block offset: __ C11 __
        * Returned value if hit (in HEX) / Explanation if miss: __ C12 __
    - [ ] 32-Bit Byte Address: `0x21E0`
        * Line index: __ C13 __
        * Tag: __ C14 __
        * Block offset: __ C15 __
        * Returned value if hit (in HEX) / Explanation if miss: __ C16 __

> * C09 = ?
> * C10 = ?
> * C11 = ?
> * C12 = ?
> * C13 = ?
> * C14 = ?
> * C15 = ?
> * C16 = ?

---

## Problem `D`

Consider the case where we lost the final iteration among two other prototypes while building the pipelined RISC-V processor. Show all bypasses used in each cycle where the processor is not stalled. For the processor, also determine the values of `x3` and `x4` after executing these 12 cycles. Below are the code and the initial state of the relevant registers in the register file. Note that the values in registers `x1`–`x5` are given in decimal while `x6` is in hexadecimal. A copy of the code and initial register values is provided for the processor.

```cpp
start:
    lw x1, 0(x6)
    addi x2, x0, 5
    blt x1, x2, end
    addi x3, x2, 11
    sub x4, x3, x1
    xori x5, x6, 0x1
end:
    sub x4, x3, x2
    addi x3, x4, 7
    addi x3, x1, 3

. = 0x400
    .word 0x1
```

| Register | Value    |
| -------- | -------- |
| x1       | 5        |
| x2       | 11       |
| x3       | 30       |
| x4       | 19       |
| x5       | 20       |
| x6       | 0x400    |

Assume the processor is built as a 5-stage pipelined RISC-V processor, which is fully bypassed and has branch annulment. Branch decisions are made in the `EXE` stage and branches are always predicted not taken.

What are the values of registers `x3` and `x4` at the start of cycle 12 (in decimal)?
* x3 : __ D01 __
* x4 : __ D02 __

> * D01 = ?
> * D02 = ?

---

## Problem `E`

This problem evaluates cache performance for different loop orderings. You are asked to consider the following two loops, written in C, which calculate the sum of the entries in a 128 by 32 matrix of 32-bit integers:

- [ ] Loop A
```cpp
sum = 0;
for (i = 0; i < 128; i++)
    for (j = 0; j < 32; j++)
        sum += A[i][j];
```

- [ ] Loop B
```cpp
sum = 0;
for (j = 0; j < 32; j++)
    for (i = 0; i < 128; i++)
        sum += A[i][j];
```

The matrix A is stored contiguously in memory in row-major order. Row-major order means that elements in the same row of the matrix are adjacent in memory, as shown in the following memory layout:
`A[i][j]` resides in memory location `[4*(32*i + j)]`

Memory Location:
![](https://hackmd.io/_uploads/SJTUWKR5Y.png)

For Questions 1 to 3, assume that the caches are initially empty. Also, assume that only accesses to matrix A cause memory references and all other necessary variables are stored in registers. Instructions are in a separate instruction cache.

1. Consider a 4KB direct-mapped data cache with 8-word (32-byte) cache lines.
    * Calculate the number of cache misses that will occur when running Loop `A`. __ E01 __
    * Calculate the number of cache misses that will occur when running Loop `B`. __ E02 __

> * E01 = ?
> * E02 = ?

2. Consider a direct-mapped data cache with 8-word (32-byte) cache lines. Calculate the minimum number of cache lines required for the data cache if Loop A is to run without any cache misses other than compulsory misses.
Calculate the minimum number of cache lines required for the data cache if Loop B is to run without any cache misses other than compulsory misses.
    * Data-cache size required for Loop `A`: __ E03 __ cache line(s)
    * Data-cache size required for Loop `B`: __ E04 __ cache line(s)

> * E03 = ?
> * E04 = ?

3. Consider a 4KB set-associative data cache with 4 ways and 8-word (32-byte) cache lines. This data cache uses a first-in/first-out (FIFO) replacement policy.
    * The number of cache misses for Loop `A`: __ E05 __
    * The number of cache misses for Loop `B`: __ E06 __

> * E05 = ?
> * E06 = ?

---

## Problem `F`

The following diagram shows a classic fully-bypassed 5-stage pipeline that has been augmented with an unpipelined divider in parallel with the ALU. Bypass paths are not shown in the diagram. This iterative divider produces 2 bits per cycle until it outputs a full 32-bit result.

![](https://hackmd.io/_uploads/BkXq8t05K.png)

1. What is the latency of a divide operation in cycles? __ F01 __

> * F01 = ?

2. What is the occupancy of a divide operation in cycles? __ F02 __

> * F02 = ?

---

## Problem `G`

Given the following chunk of code, analyze the hit rate given that we have a byte-addressed computer with a total memory of 1 MiB. It also features a 16 KiB direct-mapped cache with 1 KiB blocks. Assume that your cache begins cold.

```cpp
#define NUM_INTS 8192 // 2^13
int A[NUM_INTS];      // A lives at 0x10000
int i, total = 0;
for (i = 0; i < NUM_INTS; i += 128)
    A[i] = i; // Code 1
for (i = 0; i < NUM_INTS; i += 128)
    total += A[i]; // Code 2
```

1. How many bits make up a memory address on this computer? __ G01 __

> * G01 = ?

2. What is the T:I:O breakdown?
    * Offset: __ G02 __
    * Index: __ G03 __
    * Tag: __ G04 __

> * G02 = ?
> * G03 = ?
> * G04 = ?

3. Calculate the cache hit rate for the line marked as `Code 1`: __ G05 __

> * G05 = ?

4. Calculate the cache hit rate for the line marked as `Code 2`: __ G06 __

> * G06 = ?

---
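Hit-rate questions such as those in Problem `G` can be cross-checked with a small model of a direct-mapped cache that tracks only valid bits and tags. The code below is a rough sketch under stated assumptions: writes allocate a line just like reads, every access touches one address, and the names `dm_cache`, `dm_init`, `dm_access`, and `dm_count_hits` (as well as the `MAX_LINES` bound) are hypothetical and not part of the quiz.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LINES 64 /* assumption: upper bound for this sketch */

/* Hypothetical direct-mapped cache model: tags and valid bits only. */
struct dm_cache {
    unsigned block_bytes;
    unsigned num_lines;
    bool valid[MAX_LINES];
    uint32_t tag[MAX_LINES];
};

/* Start cold: every line is invalid. */
static void dm_init(struct dm_cache *c, unsigned block_bytes,
                    unsigned num_lines)
{
    assert(block_bytes > 0 && num_lines > 0 && num_lines <= MAX_LINES);
    c->block_bytes = block_bytes;
    c->num_lines = num_lines;
    for (unsigned i = 0; i < MAX_LINES; i++)
        c->valid[i] = false;
}

/* Access one byte address; returns true on a hit. On a miss the line
 * is filled (assumption: reads and writes both allocate). */
static bool dm_access(struct dm_cache *c, uint32_t addr)
{
    uint32_t block = addr / c->block_bytes;
    uint32_t index = block % c->num_lines;
    uint32_t tag = block / c->num_lines;

    if (c->valid[index] && c->tag[index] == tag)
        return true;
    c->valid[index] = true;
    c->tag[index] = tag;
    return false;
}

/* Count hits for n accesses starting at base, stride bytes apart. */
static unsigned dm_count_hits(struct dm_cache *c, uint32_t base,
                              uint32_t stride, unsigned n)
{
    unsigned hits = 0;
    for (unsigned i = 0; i < n; i++)
        if (dm_access(c, base + i * stride))
            hits++;
    return hits;
}
```

Calling `dm_count_hits` twice on the same `dm_cache` without re-initializing it models two consecutive loops over the same array, since the second pass sees whatever the first pass left in the cache. The hit rate of each pass is the returned count divided by `n`.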