# Annotate and explain Quiz6 with Ripes simulation

## Problem A

1. For your first iteration, you are told that you must have a 2-stage pipeline whose latency is no more than 16 ns. Using a minimal number of registers, please show the two-stage pipeline that maximizes the allowed latency of the “M” module. Then, calculate the maximum latency of module M and the throughput of the resulting circuit (using your value of M).
    * Max latency of “M” module (ns): __ A01 __
    * Throughput with max latency “M” (ns-1): __ A02 __

![](https://i.imgur.com/mw7MBRx.png =70%x)

> A01 = 3
> A02 = $\frac{1}{8}$
> ![](https://i.imgur.com/0udYijR.png =70%x)
> Since the latency is no more than 16 ns and this is a 2-stage pipeline, the maximum latency of each stage is 8 ns. Splitting the circuit as in the figure above, the latency of the right stage is $5+2=7$ ns, so the left stage may use the full 8 ns. The left stage's latency is $M+2+3$, so $M+2+3=8$, i.e. $M=3$.
> Throughput with max latency "M" $=\dfrac{1}{8}$

2. For your next iteration, you are told that you must have a 4-stage pipeline whose latency is no more than 20 ns. Using a minimal number of registers, please show the four-stage pipeline that maximizes the allowed latency of the “M” module. Then, calculate the maximum latency of module M and the throughput of the resulting circuit (using your value of M).
    * Max latency of “M” module (ns): __ A03 __
    * Throughput with max latency “M” (ns-1): __ A04 __

> A03 = 5
> A04 = $\frac{1}{5}$
> ![](https://i.imgur.com/AYM5X9f.png =70%x)
> Since the latency is no more than 20 ns and this is a 4-stage pipeline, the maximum latency of each stage is 5 ns. Therefore, we split the circuit as in the figure above. The maximum value of $M$ is $5$ because no stage may exceed 5 ns.
> Throughput with max latency "M" $=\dfrac{1}{5}$

---

## Problem B

Assume that we are working on building a 32-bit RISC-V processor. As part of this project, we are considering several cache designs.
The hardware has a limited amount of memory available, so any cache design will hold a total of 32 32-bit words (plus any associated metadata such as tag, valid, and dirty bits when needed).

We first consider a **direct-mapped** cache using a block size of 4 words.

1. If the cache holds a total of 32 words, how many lines will be in this cache?
    * Number of lines in the direct-mapped cache: __ B01 __

> B01 = 8
> Number of lines in cache $= 32 \div 4 = 8$

2. To properly take advantage of locality, which address bits should be used for the block offset, the cache index, and the tag field? Express your answer using minispec-style indexing into a 32-bit address. For example, the address bits for the byte offset would be expressed as `addr[ 1 : 0 ]`.
    * Address bits for block offset: addr[ __ B02 __ : __ B03 __ ]
    * Address bits for cache index: addr[ __ B04 __ : __ B05 __ ]
    * Address bits for tag field: addr[ __ B06 __ : __ B07 __ ]

> B02 = 3
> B03 = 2
> B04 = 6
> B05 = 4
> B06 = 31
> B07 = 7
> Bits 1:0 are the byte offset.
> Since each block holds 4 words, $\log_2(4) = 2$ bits are needed for the block offset (bits 3:2).
> There are 8 lines, so $\log_2(8) = 3$ bits are needed for the cache index (bits 6:4).
> The remaining bits are the tag (bits 31:7).
> ![](https://i.imgur.com/NW0naDN.png =70%x)

We now consider switching to a 2-way set-associative cache, again using a block size of 4 words and storing 32 data words total (along with necessary metadata). The 2-way cache uses an LRU replacement policy.

3. How does this change affect the number of sets and the size of the tag compared to the direct-mapped cache?
    * Change in the number of sets (select one): __ B08 __
        * `(a)` None / can’t tell
        * `(b)` 0.5x
        * `(c)` 2x
        * `(d)` -1
        * `(e)` +1
        * `(f)` Unchanged
    * Change in the size of the tag in bits (select one): __ B09 __
        * `(a)` None / can’t tell
        * `(b)` 0.5x
        * `(c)` 2x
        * `(d)` -1
        * `(e)` +1
        * `(f)` Unchanged

> B08 = b
> Each set now contains two lines (one in each way), but the total capacity of the cache does not change. As a result, the number of sets is halved.
> B09 = e
> Since the number of sets is halved, the number of set-index bits decreases by one and the number of tag bits increases by one.
> * Original:
> ![](https://i.imgur.com/NW0naDN.png =70%x)
> * 2-way set-associative:
> ![](https://i.imgur.com/XyGhT69.png =73%x)

We decide to use a 2-way set-associative cache with 4 cache sets and a block size of 4 words for our processor and would like to test out our caching system. We write the following code to simulate checking if the array from a quicksort function is properly sorted. The code iterates over a 200-element array and checks for correct ordering. You may treat `unimp` as a 32-bit instruction that terminates the program.

```c
. = 0x100                   // The following code starts at address 0x100
check_sorted:
    // Initialize some registers
    li   t0, 0              // t0 = loop index
    li   t1, 199            // t1 = array size - 1
    lui  t2, 0x3            // t2 = starting address of array = 0x3000
loop:
    lw   t3, 0(t2)          // Load current element
    lw   t4, 4(t2)          // Load next element
    ble  t3, t4, endif      // Assume branch is always TAKEN
if_unsorted:
    unimp                   // If two elements are out of order, terminate
endif:
    addi t0, t0, 1
    addi t2, t2, 4
    blt  t0, t1, loop       // Continue checking; assume branch is TAKEN
    ret
```

For the rest of this problem, assume that the code is running at steady state (i.e., the code is in the middle of the array) and that the array is sorted correctly. In other words, assume that both the `ble` and `blt` branches are always taken.
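For reference, the access pattern of the assembly above corresponds to roughly the following C logic. This is a sketch, not the graded code; the early-terminate behavior mirrors the `unimp` path described in the comments.

```c
#include <stdint.h>

// C sketch of check_sorted: walk an n-element array and stop at the
// first out-of-order pair (this mirrors the lw/lw/ble loop above).
int check_sorted(const int32_t *a, int n) {
    for (int i = 0; i < n - 1; i++) {   // t0 counts from 0 to n-2
        if (a[i] > a[i + 1])            // ble t3, t4, endif NOT taken
            return 0;                   // reaching unimp: out of order
    }
    return 1;                           // ret: array is sorted
}
```

When the array is sorted, every comparison takes the `endif` path, which matches the steady-state assumption that both branches are always taken.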
Also, you may assume that when execution of the code started, all cache lines were set to invalid and Way 0 was the LRU way for each cache set.

4. For one iteration of the loop (i.e. from the `loop` label to the `blt` instruction), how many instruction fetches and data accesses are performed?
    * Number of Instruction Fetches: __ B10 __
    * Number of Data Accesses: __ B11 __

> B10 = 6
> There are 6 instruction fetches: 2 `lw`, 1 `ble`, 2 `addi`, and 1 `blt`. (The `unimp` is skipped because `ble` is always taken.)
> B11 = 2
> There are 2 data accesses: `lw t3, 0(t2)` and `lw t4, 4(t2)`.

5. In the steady state (i.e. ignoring any cold-start effects), what is the instruction fetch hit ratio and the data access hit ratio? Note: please provide the hit ratio and not the miss ratio. You may use the cache diagram provided below to help you, but nothing written in the diagram will be graded.

![](https://i.imgur.com/g5muvsM.png)

* Instruction Fetch HIT Ratio: __ B12 __
* Data Access HIT Ratio: __ B13 __

> B12 = 1
> All instructions reside in sets 0, 1, and 2 of Way 0 and are never evicted, so the instruction hit ratio is 1.
> The first instruction is at `0x100` = `0b100|00|00|00` (tag | set index | block offset | byte offset), which maps to set `0`.
> The whole instruction cache looks like the table below:
> ![](https://i.imgur.com/bi0ZxUL.png =80%x)
> B13 = 0.875
> Data accesses are split between Way 0 and Way 1, but ultimately they fit around the instructions in the cache (which are always accessed more recently and so are never evicted). Each data miss causes the cache to load the next four consecutive words (four words per block). Each iteration performs 2 data accesses, and every four iterations one of those accesses misses. That is 1 miss per 8 accesses, for a hit ratio of 0.875.
> ![](https://i.imgur.com/lZPwxD9.gif)

6. Assume that it takes 2 cycles to access the cache. We have benchmarked our code to have an average hit ratio of 0.9.
If we need to achieve an average memory access time (AMAT) of at most 4 cycles, what is the upper bound on our miss penalty? Miss penalty here is defined as the amount of additional time required beyond the 2-cycle cache access to handle a cache miss.
    * Maximum possible miss penalty: __ B14 __ cycles

> B14 = 20
> $\text{AMAT} = \text{cache access time} + \text{miss penalty} \times (1 - \text{hit ratio})$
> $4 \ge 2 + \text{miss penalty} \times (1 - 0.9)$
> $\text{miss penalty} \le 20$

---

## Problem C

Consider that we are analyzing grade statistics and are performing some hefty calculations, so we suspect that a cache could improve our system’s performance.

1. We are considering using a 2-way set-associative cache with a block size of 4 (i.e. 4 words per line). The cache can store a total of 64 words. Assume that addresses and data words are 32 bits wide. To properly make use of locality, which address bits should be used for the block offset, the cache index, and the tag field?
    * Address bits used for byte offset: A[ 1 : 0 ]
    * Address bits used for tag field: A[ __ C01 __ : __ C02 __ ]
    * Address bits used for block offset: A[ __ C03 __ : __ C04 __ ]
    * Address bits used for cache index: A[ __ C05 __ : __ C06 __ ]

> C01 = 31
> C02 = 7
> C03 = 3
> C04 = 2
> C05 = 6
> C06 = 4
> ![](https://i.imgur.com/06StqXR.png)

2. If we instead used a direct-mapped cache with the same total capacity (64 words) and same block size (4 words), how would the following parameters in our system change?
    * Change in the number of cache lines (select all of the choices below that apply): __ C07 __
        * `(a)` None / can’t tell
        * `(b)` 0.5x
        * `(c)` 2x
        * `(d)` -1
        * `(e)` +1
        * `(f)` Unchanged
    * Change in the number of bits in tag field (select one of the choices below): __ C08 __
        * `(a)` None / can’t tell
        * `(b)` 0.5x
        * `(c)` 2x
        * `(d)` -1
        * `(e)` +1
        * `(f)` Unchanged

> C07 = c, f
> Both "Unchanged" and "2x" were accepted as solutions: the 2-way cache has $\frac{1}{2}$ the number of *sets* of the direct-mapped cache, but the number of cache *lines* is actually the same.
> * Original:
> ![](https://i.imgur.com/AAzCLFf.png)
> * Direct-mapped:
> ![](https://i.imgur.com/KdYOvd0.png)
> C08 = d
> A block size of 4 words means 2 bits of block offset. The direct-mapped cache has $64 \div 4 = 16$ lines, so it needs 4 bits of cache-line index, leaving $32 - 4 - 2 - 2 = 24$ tag bits. The 2-way cache needs only 3 index bits (8 sets), giving $32 - 3 - 2 - 2 = 25$ tag bits. The tag therefore shrinks by 1 bit.
> * Original:
> ![](https://i.imgur.com/06StqXR.png)
> * Direct-mapped:
> ![](https://i.imgur.com/TNKHuDF.png)

Ultimately, we decide that the 2-way set-associative cache would probably have better performance for our application, so the remainder of the problem will consider a 2-way set-associative cache. Below is a snapshot of this cache during the execution of some unknown code. V is the valid bit and D is the dirty bit of each line.

![](https://i.imgur.com/tFjrsSI.png)

3. Would the following memory accesses result in a hit or a miss? If it results in a hit, specify what value is returned; if it is a miss, explain why in a few words or by showing your work.
    * 32-Bit Byte Address: `0x4AB4`
        * Line index: __ C09 __
        * Tag: __ C10 __ (in HEX)
        * Block offset: __ C11 __
        * Returned value if hit (in HEX) / Explanation if miss: __ C12 __
    * 32-Bit Byte Address: `0x21E0`
        * Line index: __ C13 __
        * Tag: __ C14 __
        * Block offset: __ C15 __
        * Returned value if hit (in HEX) / Explanation if miss: __ C16 __

> C09 = 3
> C10 = 0x95
> C11 = 1
> C12 = 0xD4
> ![](https://i.imgur.com/06StqXR.png)
> `0x4AB4` = `0b 0100 1010 1|011 |01|00`
> Line index = `0b011` = `3`
> Tag = `0b0100 1010 1` = `0x95`
> Block offset = `0b01`
> The word we want is at line 3, block offset 1, with tag `0x95`. Looking at the table, the tag of line 3 in `Way 1` equals `0x95` and the line is valid, so this is a cache hit.
> ![](https://i.imgur.com/B2E6uPr.png)
> C13 = 6
> C14 = 0x43
> C15 = 0
> C16 = miss, valid bit is 0
> `0x21E0` = `0b 10 0001 1|110 |00|00`
> The word we want is at line 6, block offset 0, with tag `0x43`. Looking at the table, the tag of line 6 in `Way 0` equals `0x43`.
However, the `valid` bit is `0`. As a result, it is a cache miss.

---

## Problem D

Consider the case that we lost the final iteration among two other prototypes while building the pipelined RISC-V processor. Show all bypasses used in each cycle where the processor is not stalled. For the processor, also determine the values of x3 and x4 after executing these 12 cycles.

Below is the code and the initial state of the relevant registers in the register file for each of the three processors. Note that the values in registers x1–x5 are given in decimal while x6 is in hexadecimal. A copy of the code and initial register values is provided for the processor.

```c
start:
    lw   x1, 0(x6)
    addi x2, x0, 5
    blt  x1, x2, end
    addi x3, x2, 11
    sub  x4, x3, x1
    xori x5, x6, 0x1
end:
    sub  x4, x3, x2
    addi x3, x4, 7
    addi x3, x1, 3

. = 0x400
.word 0x1
```

| Register | Value |
|:--------:|:-----:|
| x1 | 5 |
| x2 | 11 |
| x3 | 30 |
| x4 | 19 |
| x5 | 20 |
| x6 | 0x400 |

Assume the processor is built as a 5-stage pipelined RISC-V processor, which is fully bypassed and has branch annulment. Branch decisions are made in the EXE stage and branches are always predicted not taken. What are the values of registers x3 and x4 at the start of cycle 12 (in decimal)?

* x3 : __ D01 __
* x4 : __ D02 __

> D01 = 32
> D02 = 25
> ![](https://i.imgur.com/uXiKd5m.png)
>
> ![](https://i.imgur.com/84ysJKB.png)
> The figure above shows the three bypasses and the execution over 12 cycles.
>
> ![](https://i.imgur.com/WtYzRAi.png)
> The figure above shows the first bypass. The processor loads the value into x1 and forwards it to the EXE stage, since the `blt` needs it.
> ![](https://i.imgur.com/cEfe0Bg.png)
> The figure above shows the second bypass. The `addi` puts 5 in `x2` and forwards it to the `blt`.
> ![](https://i.imgur.com/ictatwP.png)
> The figure above shows the third bypass. The `sub` computes the result destined for `x4` and forwards it to the EXE stage, because the following `addi` instruction needs it.
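As a cross-check of the architectural values, the executed path can be replayed in C. This is a sketch: it assumes, per the solution above, that `mem[0x400] = 0x1` (so the `blt` is taken) and that the final `addi x3, x1, 3` has not written back by the start of cycle 12; the helper name `problem_d_regs` is ours, not part of the quiz.

```c
#include <stdint.h>

// Replays the executed path of the Problem D code with the initial
// register values from the table (x3 = 30, x4 = 19, x6 = 0x400).
static void problem_d_regs(int32_t *x3_out, int32_t *x4_out) {
    int32_t x0 = 0, x1, x2, x3 = 30, x4 = 19;
    x1 = 0x1;          // lw x1, 0(x6): loads the .word placed at 0x400
    x2 = x0 + 5;       // addi x2, x0, 5
    if (x1 < x2) {     // blt x1, x2, end: 1 < 5, so the branch is taken
        x4 = x3 - x2;  // end: sub x4, x3, x2 = 30 - 5 = 25
        x3 = x4 + 7;   // addi x3, x4, 7 = 25 + 7 = 32
    }
    // addi x3, x1, 3 would eventually set x3 = 4, but its writeback
    // has not completed by the start of cycle 12, so we stop here.
    *x3_out = x3;
    *x4_out = x4;
}
```

Calling `problem_d_regs` yields x3 = 32 and x4 = 25, matching D01 and D02.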
---

## Problem E

This problem evaluates the cache performance of different loop orderings. You are asked to consider the following two loops, written in C, which calculate the sum of the entries in a 128 by 32 matrix of 32-bit integers:

* Loop A

```c
sum = 0;
for (i = 0; i < 128; i++)
    for (j = 0; j < 32; j++)
        sum += A[i][j];
```

* Loop B

```c
sum = 0;
for (j = 0; j < 32; j++)
    for (i = 0; i < 128; i++)
        sum += A[i][j];
```

The matrix A is stored contiguously in memory in row-major order. Row-major order means that elements in the same row of the matrix are adjacent in memory, as shown in the following memory layout: `A[i][j]` resides in memory location `[4*(32*i + j)]`.

Memory Location:
![](https://i.imgur.com/UZJjbPl.png)

For Problems 1 to 3, assume that the caches are initially empty. Also, assume that only accesses to matrix A cause memory references; all other necessary variables are stored in registers. Instructions are in a separate instruction cache.

1. Consider a 4KB direct-mapped data cache with 8-word (32-byte) cache lines.
    * Calculate the number of cache misses that will occur when running Loop A. __ E01 __
    * Calculate the number of cache misses that will occur when running Loop B. __ E02 __

> E01 = 512
> E02 = 4096
> Each element of the 128x32 matrix A can only be mapped to one particular cache location in this direct-mapped data cache. Since each row has 32 32-bit integers, and since each cache line can hold 8 32-bit ints, a row of the matrix occupies the lines of 4 consecutive sets of the cache.
> Loop A—where each iteration of the inner loop sums a row of A—accesses memory addresses in a linear sequence. Given this access pattern, the access to the first word in each cache line will miss, but the next seven accesses will hit. After sequentially moving through a line, the loop never accesses it again, so its later eviction causes no future misses.
> Therefore, Loop A has only the 512 compulsory misses for the lines (128 rows x 4 lines per row) that matrix A spans.
> The consecutive accesses in Loop B move with a stride of 32 words. Therefore, the inner loop touches the first element of 128 different cache lines before the next iteration of the outer loop. While intuition might suggest that the 128 lines could all fit in a cache with 128 sets, there is a complicating factor: each row is four cache lines past the previous row, so the lines accessed when traversing the first column map to set indices 0, 4, 8, 12, and so on. Since the lines containing the column compete for only one quarter of the sets, the lines loaded at the start of a column are evicted by the time the column is complete, preventing any reuse. Therefore, all 4096 (128 x 32) accesses miss.
> ![](https://i.imgur.com/rNujTpV.gif)
> The iteration of `Loop A` looks like the gif shown above.
>
> ![](https://i.imgur.com/xFOmSkb.gif)
> Part of the iteration of `Loop B` looks like the gif shown above.

2. Consider a direct-mapped data cache with 8-word (32-byte) cache lines. Calculate the minimum number of cache lines required for the data cache if Loop A is to run without any cache misses other than compulsory misses. Calculate the minimum number of cache lines required for the data cache if Loop B is to run without any cache misses other than compulsory misses.
    * Data-cache size required for Loop A: __ E03 __ cache line(s)
    * Data-cache size required for Loop B: __ E04 __ cache line(s)

> E03 = 1
> E04 = 512
> Since Loop A accesses memory sequentially, we can sum all the elements in a cache line and then never touch it again. Therefore, we only need to hold 1 active line at any given time to avoid all but compulsory misses.
> For Loop B to run without any cache misses other than compulsory misses, the data cache needs to be able to hold one column's worth of matrix A's lines in the cache.
> Since the consecutive accesses in the inner loop of Loop B use only one out of every four cache lines, and since there are 128 rows, Loop B requires 512 (128 × 4) lines to avoid all but compulsory misses.
>
> ![](https://i.imgur.com/vE7GMao.gif)
> The gif above shows the iteration of `Loop A` in a 1-line cache.

3. Consider a 4KB set-associative data cache with 4 ways and 8-word (32-byte) cache lines. This data cache uses a first-in/first-out (FIFO) replacement policy.
    * The number of cache misses for Loop A: __ E05 __
    * The number of cache misses for Loop B: __ E06 __

> E05 = 512
> E06 = 4096
> Loop A still has only the 512 compulsory misses (128 rows x 4 lines per row). Loop B still cannot fully utilize the cache. The first 8 accesses will allocate into way 1 of sets 0, 4, 8, ..., 28; the next 8 accesses will allocate into way 2 of those same sets; and so on. After 32 accesses, all four ways are filled, and the next 8 accesses along the column evict the oldest lines in way 1 (FIFO), preventing any reuse. Therefore, all 4096 (128 x 32) accesses miss.
>
> ![](https://i.imgur.com/vKNYrE7.gif =80%x)
> The gif above shows the iteration of `Loop A` in the 4-way cache with 8-word (32-byte) lines.
>
> ![](https://i.imgur.com/9bo1tZD.gif =70%x)
> The gif above shows the iteration of `Loop B` in the 4-way cache with 8-word (32-byte) lines.

## Problem F

The following diagram shows a classic fully-bypassed 5-stage pipeline that has been augmented with an unpipelined divider in parallel with the ALU. Bypass paths are not shown in the diagram. This iterative divider produces 2 bits per cycle until it outputs a full 32-bit result.

![](https://i.imgur.com/n5qHB6e.png)

* What is the latency of a divide operation in cycles? __ F01 __
* What is the occupancy of a divide operation in cycles?
__ F02 __

> F01 = 16
> The divider produces 2 bits per cycle, so a 32-bit result takes $32 / 2 = 16$ cycles.
> F02 = 16
> Since the divider is unpipelined, it cannot accept a new divide until the current one finishes, so the occupancy equals the latency: 16 cycles.

---

## Problem G

Given the following chunk of code, analyze the hit rate given that we have a byte-addressed computer with a total memory of 1 MiB. It also features a 16 KiB direct-mapped cache with 1 KiB blocks. Assume that your cache begins cold.

```cpp
#define NUM_INTS 8192                 // 2^13
int A[NUM_INTS];                      // A lives at 0x10000
int i, total = 0;
for (i = 0; i < NUM_INTS; i += 128)
    A[i] = i;                         // Code 1
for (i = 0; i < NUM_INTS; i += 128)
    total += A[i];                    // Code 2
```

1. How many bits make up a memory address on this computer? __ G01 __

> G01 = 20
> $\log_2(1\ \text{MiB}) = \log_2(2^{20}) = 20$ bits

2. What is the T:I:O breakdown?
    * Offset: __ G02 __
    * Index: __ G03 __
    * Tag: __ G04 __

> G02 = 10
> G03 = 4
> G04 = 6
> Offset $= \log_2(1\ \text{KiB}) = \log_2(2^{10}) = 10$
> Index $= \log_2\left(\dfrac{16\ \text{KiB}}{1\ \text{KiB}}\right) = \log_2(16) = 4$
> Tag $= 20 - 4 - 10 = 6$

3. Calculate the cache hit rate for the line marked as `Code 1`: __ G05 __

> G05 = 50%
> The access address breaks down as shown below: 2 bits of byte offset and 8 bits of word-in-block offset (a 1 KiB block holds $1024 \times 8 / 32 = 256$ words), then 4 bits of index and the remaining bits of tag. As a result, we have a cache with 16 cache lines, each holding 256 words.
> ![](https://i.imgur.com/Ino78Ew.png =70%x)
>
> The integer accesses are $4 \times 128 = 512$ bytes apart, which means there are 2 accesses per block. The first access in each block is a compulsory cache miss, but the second is a hit because `A[i]` and `A[i+128]` are in the same cache block. Thus, we end up with a hit rate of 50%.
>
> The gif below shows how the cache works.
> ![](https://i.imgur.com/SQ54yDn.gif)

4. Calculate the cache hit rate for the line marked as `Code 2`: __ G06 __

> G06 = 50%
> The size of A is $8192 \times 4 = 2^{15}$ bytes. This is exactly twice the size of our cache. At the end of `Code 1`, the second half of A is in the cache, but `Code 2` starts with the first half of A.
> Thus, we cannot reuse any of the cached data brought in by `Code 1` and must start from the beginning. Our hit rate is therefore the same as for `Code 1`, since we access memory in exactly the same way. We do not have to consider cache hits for `total`, as the compiler will most likely keep it in a register. Thus, we end up with a hit rate of 50%.
>
> The gif below shows how `Code 2` works in the cache.
> ![](https://i.imgur.com/HwLP2p5.gif)