---
tags: computer-arch
---
# Quiz5 of Computer Architecture (2022 Fall)
> Solutions
## Problem `A`
An algorithm called the [Sieve of Eratosthenes](https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) can be used to generate a list of prime integers. The function below implements this sieve and is defined as follows:
* Input: An integer $n$. You may assume that $n > 0$.
* Output: An array of $n$ 1-byte integers. The $i$-th element of this array is 0 if $i$ is prime, and 1 otherwise.
```c
char *prime_list(uint32_t n)
{
    char *primes = A01 /* Fill your answer here */;
    if (A02 /* Fill your answer here */)
        return NULL;
    primes[0] = 1;
    primes[1] = 1;
    for (int i = 2; i * i < n; i++) {
        if (!primes[i]) { /* If i is a prime number */
            for (int j = 2 * i; j < n; j += i) {
                primes[j] = 1; /* Cross out all multiples of i */
            }
        }
    }
    return primes;
}
```
1. What should be written in the given blanks (A01 and A02) in the C code?
> * A01 = ?
> ==`calloc(1, n)`== or equivalent. `calloc` zero-initializes the allocated memory, so every element starts out marked as prime.
> * A02 = ?
> ==`!primes`== or equivalent. The allocation can fail, in which case `calloc` returns a null pointer and the function should return `NULL`.
We run this program on a 32-bit system with a direct-mapped, 4 KiB cache with 128 B blocks.
2. What is the T:I:O breakdown of this cache? T: A03, I: A04, O: A05
> * A03 = ?
> ==20==
> * A04 = ?
> ==5==
> * A05 = ?
> ==7==
> Each block is 128 = 2^7^ bytes, so we need 7 offset bits to uniquely address each byte in a block.
> The cache is direct-mapped, so each block of the cache has a unique index. The cache contains 4 KiB = 2^12^ bytes, and each block is 2^7^ bytes, so there are $\dfrac{2^{12}}{2^{7}} = 2^{5}$ blocks in the cache. We need 5 bits to uniquely address each block in the cache.
> The system uses 32-bit addresses, so we have $32 − 7 − 5 = 20$ bits left over for the tag.
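> As a sanity check, here is a minimal C sketch (not part of the quiz) that splits an arbitrary 32-bit address using this 20:5:7 T:I:O breakdown:
> ```c
> #include <stdio.h>
> #include <stdint.h>
>
> int main(void)
> {
>     uint32_t addr   = 0x12345678;         /* arbitrary example address */
>     uint32_t offset = addr & 0x7F;        /* low 7 bits: byte within a 128 B block */
>     uint32_t index  = (addr >> 7) & 0x1F; /* next 5 bits: one of 32 cache blocks */
>     uint32_t tag    = addr >> 12;         /* remaining 20 bits */
>     printf("tag=0x%05X index=%u offset=%u\n", tag, index, offset);
>     return 0;
> }
> ```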
3. How many misses of each type occur if we call `prime_list(4096)`?
* number of compulsory misses
> * A06 = ?
> ==32==
* number of capacity misses
> * A07 = ?
> ==0==
* number of conflict misses
> * A08 = ?
> ==0==
> Let's start by considering the first iteration of the outer for loop (i=2). At this point, `primes[2]` is 0, so the inner for loop will execute. Its bounds are `j=4; j<4096; j+=2`, and on each iteration, we are accessing `primes[j]`.
> `primes[4]` is a miss, because our cache starts cold. It is a compulsory miss because we have never accessed this part of memory before. This miss brings in a 128-byte block containing `primes[0]` to `primes[127]`.
> `primes[6]` is a hit because of the block we just brought in. `primes[8]`, `primes[10]`, ..., `primes[126]` are also hits.
> `primes[128]` is a compulsory miss, since we have never accessed this part of memory before.
> This miss brings in another 128-byte block containing `primes[128]` to `primes[255]`. Since the cache is direct-mapped, and this block is directly after the previous block in memory, these two blocks occupy different indices in the cache. (You can also verify this by checking that the index bit increments by 1 between these two blocks.)
> `primes[130]` is a hit because of the block we just brought in. `primes[132]`, `primes[134]`, ..., `primes[254]` are also hits.
4. How many misses of each type occur if we call `prime_list(16384)`? You may use the function $\pi(n)$ to denote the number of primes less than or equal to $n$. (So $\pi(2) = 1$, $\pi(10) = 4$, etc.)
* number of compulsory misses
> * A09 = ?
> ==128==
* number of capacity misses
> * A10 = ?
> ==$(\pi(128) - 1) \times 128$==
* number of conflict misses
> * A11 = ?
> ==0==
> Let's start by considering the first iteration of the outer for loop (i=2). At this point, `primes[2]` is 0, so the inner for loop will execute. Its bounds are `j=4; j<16384; j+=2`, and on each iteration, we are accessing `primes[j]`.
> On this first iteration of the outer for loop, we have the same pattern as the previous part. Each block of 128 bytes forces one compulsory miss, and we iterate through every block of the array. The array is 16384 = 2^14^ bytes, and each block is 128 = 2^7^ bytes, so we bring in $\dfrac{2^{14}}{2^{7}} = 2^{7} = 128$ blocks. This means we have 128 compulsory misses, 0 capacity misses, and 0 conflict misses from the first iteration of the outer for loop.
> Unlike the previous part, where the entire array fits in the cache, the 128-block array is now four times larger than the 32-block cache. Because the cache is direct-mapped, the first quarter of the array fills up the cache, then gets completely replaced by the second quarter, then the third quarter, then the last quarter. After the first iteration of the outer for loop, only the last quarter of the array is in the cache.
> On the second iteration of the outer for loop, i=3 and `primes[3]` is 0, so the inner for loop will execute. Its bounds are `j=6; j<16384; j+=3`, and as before, we are accessing `primes[j]` on each iteration.
> There are two important things to note about this loop. First, it starts back at the beginning of the array. Second, it accesses every 128-byte block of the array, just like the previous `j+=2` loop.
> This means that the cache, which currently contains the last quarter of the array, will be entirely replaced with the first quarter of the array as this loop executes. In other words, nothing in the cache at the end of the previous loop helps us in this loop; the cache effectively starts cold again. Also, because we access every block of the array again, we have another 128 misses, just like in the previous for loop. This time, the 128 misses are capacity misses because the cache is full on every miss. (There are no compulsory or conflict misses during this loop.)
> On the third iteration of the outer for loop, i=4 and `primes[4]` is 1, so we skip the inner for loop. This still causes one miss due to accessing the 0th block of our array.
> On the fourth iteration of the outer for loop, i=5 and `primes[5]` is 0, so the inner for loop executes with `j=10; j<16384; j+=5`. As before, we end up having to access every block of the array, which results in another 127 capacity misses. Note that because we missed on the first block in iteration i=4, we don't miss again on the first block.
> In general, we miss exactly 128 times for each prime number; once for the first composite number after the previous prime, and 127 times within the inner for loop.
> This pattern continues all the way to i=127, the last point where $i \times i < n$. In all of these iterations of the outer for loop, the inner for loop executes only when i is prime (per the comment at `if (!primes[i])`), so in total, we have $\pi(128)$ (equivalently, $\pi(127)$) executions of the inner for loop. The first execution of the inner for loop causes compulsory misses, and the other $\pi(128) − 1$ executions cause capacity misses.
> Each iteration of the inner for loop causes 128 misses; note that because we only go up to i=127, the inner for loop always has to access every block of the array. In other words, for a block to be skipped, the inner for loop would need to be skipping by at least 128 bytes each time, which never happens here.
> Finally, we note that 127 is prime, so our last iteration of the loop is a prime number. As such, we do not have an additional miss to check a composite number after our last prime.
> In total, we have $(\pi(128) − 1) \times 128$ capacity misses, 128 compulsory misses, and 0 conflict misses.
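> The counts above can be cross-checked with a small direct-mapped cache model (a sketch, not part of the quiz) that replays the loop's data accesses and reports total misses. It prints `total misses: 3968`, matching the $128 + (\pi(128) - 1) \times 128 = \pi(128) \times 128$ total misses derived above. (It skips the initial writes to `primes[0]` and `primes[1]`, which only shift the block-0 compulsory miss slightly earlier.)
> ```c
> #include <stdio.h>
> #include <stdint.h>
> #include <string.h>
>
> #define N          16384
> #define BLOCK_SIZE 128   /* bytes per cache block */
> #define NUM_BLOCKS 32    /* 4 KiB cache / 128 B blocks */
>
> static int32_t tags[NUM_BLOCKS];
> static long misses;
>
> static void access_byte(int32_t addr)
> {
>     int32_t blk = addr / BLOCK_SIZE;       /* memory block number */
>     if (tags[blk % NUM_BLOCKS] != blk) {   /* direct-mapped lookup */
>         misses++;
>         tags[blk % NUM_BLOCKS] = blk;      /* replace on miss */
>     }
> }
>
> int main(void)
> {
>     static char primes[N];
>     memset(tags, -1, sizeof(tags));        /* cold cache */
>     for (int i = 2; i * i < N; i++) {
>         access_byte(i);                    /* the primes[i] check */
>         if (!primes[i]) {
>             for (int j = 2 * i; j < N; j += i) {
>                 primes[j] = 1;
>                 access_byte(j);            /* the primes[j] write */
>             }
>         }
>     }
>     printf("total misses: %ld\n", misses); /* prints: total misses: 3968 */
>     return 0;
> }
> ```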
---
## Problem `B`
Consider a change to the typical 5-stage, fully bypassed RISC-V processor pipeline. The data cache in our new CPU has a two-cycle latency. To accommodate this cache, the memory stage is pipelined into two stages, `M1` and `M2`, as illustrated in Figure B-1. Additional bypasses are added to keep the pipeline fully bypassed.
Suppose we are implementing this 6-stage pipeline in a technology in which register file ports are inexpensive but bypasses are costly. We wish to reduce cost by removing some of the bypass paths, but without increasing CPI. The proposal is for all integer arithmetic instructions to write their results to the register file at the end of the Execute stage, rather than waiting until the Writeback stage. A second register file write port is added for this purpose. Remember that register file writes occur on each rising clock edge, and values can be read in the next clock cycle. The proposed change is shown in Figure B-2.
Assume that the only exceptions that can occur in this pipeline are illegal opcodes (detected in the Decode stage) and invalid memory addresses (detected at the start of the `M2` stage). Additionally, assume that the control logic is optimized to stall only when necessary.
You may ignore branch and jump instructions in this problem.
![](https://hackmd.io/_uploads/ryrDN9MDs.png)
> $\uparrow$ Figure B-1. 6-stage pipeline. For clarity, bypass paths are not shown.
![](https://hackmd.io/_uploads/SkRcV9Mvo.png)
> $\uparrow$ Figure B-2. 6-stage pipeline with proposed additional write port.
The second write port allows some bypass paths to be removed without adding stalls in the decode stage. Explain how the second write port improves performance by eliminating such stalls. You MUST explicitly state which kind(s) of hazards are resolved, along with the reasons.
> * B01 = ?
> ==The second write port improves performance by resolving some RAW hazards earlier than they would be if ALU operations had to wait until writeback to provide their results to subsequent dependent instructions.== or equivalent. (An explanation is required for credit; it can be brief.)
Will the second write port help the following instruction sequence? (Answer in **Yes** or **No**)
```c
add x1, x2, x3
add x4, x5, x6
add x7, x1, x9
```
> * B02 = ?
> ==Yes==
Explain your previous answer.
> * B03 = ?
> ==The important insight is that the second write port cannot resolve data hazards between immediately back-to-back instructions: an arithmetic instruction in the `EX` stage writes the register file only as it leaves `EX`, so a bypass is still necessary if the very next instruction has a RAW dependency and is allowed to leave the `ID` stage. Here, however, one independent instruction separates the producer of `x1` from its consumer, so `x1` is already written back by the time `add x7, x1, x9` reads the register file.== or equivalent. (An explanation is required for credit; it can be brief.)
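> For contrast, a minimal back-to-back sequence (not from the quiz) where the early write alone does not help and the `EX`-to-`EX` bypass (or a stall) is still required:
> ```c
> add x1, x2, x3   # writes x1 to the register file at the end of EX
> add x4, x1, x9   # needs x1 while the producer is still in EX
> ```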
After the second write port is added, which bypass paths can be removed in this new pipeline without introducing additional stalls? List each removed bypass individually.
> * B04 = ?
> ==The bypass path from the end of M1 to the end of ID== (equivalent: the bypass path from the beginning of `M2` to the beginning of `EX`)
> Additionally, ALU results no longer must be bypassed from the end of M2 or the end of WB, but these bypass paths are still used to forward load results to earlier stages.
Are any new hazards added to the pipeline due to the earlier writeback of arithmetic instructions? Answer `Yes` or `No` and justify your answer.
> * B05 = ?
> ==Yes== AND ==There are multiple potential WAW hazards that must be appropriately addressed by the control logic. Two instructions writing the same register at the same time must be appropriately prioritized. Also, if an arithmetic instruction is in M1 and a load with the same destination register is in M2, the older load's later write can clobber the result that the younger arithmetic instruction has already written, leaving incorrect architectural state. The control logic needs to be modified to handle these situations by suppressing the writes of older instructions when they conflict with the writes of newer instructions.== or equivalent. (An explanation is required for credit; it can be brief.)
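> A minimal WAW example (not from the quiz) illustrating the load case described above:
> ```c
> lw  x1, 0(x2)    # older: writes x1 late, at the end of WB
> add x1, x3, x4   # younger: writes x1 early, at the end of EX
> # Without suppression, the lw's later write clobbers the add's result,
> # even though the add comes later in program order.
> ```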
Without further modifications, this pipeline may not support precise exceptions. Will the code sequence below result in an imprecise exception? Answer in `Yes` or `No`
```c
lw x1, -1(x0)
add x2, x3, x4
```
> * B06 = ?
> ==Yes==
Explain your previous answer.
> * B07 = ?
> ==Illegal address exceptions are not detected until the start of the M2 stage. Since writebacks can occur at the end of the EX stage, it is possible for an arithmetic instruction following a memory access to an illegal address to have written its value back before the exception is detected, resulting in an imprecise exception.== (An explanation is required for credit; it can be brief. It may also be illustrated with code, e.g.:)
> ```c
> lw x1, -1(x0) # address -1 is misaligned
> add x2, x3, x4 # x2 will be overwritten,
> # but last instruction has faulted
> ```
---
## Problem `C`
Assume that we are dealing with a byte-addressed system that has a 2 KiB page size, 64 MiB of physical memory, and 32 GiB of virtual memory. Each page table entry is specified to be 4 bytes long, including metadata.
Assume we have a single level page table with no TLBs.
1. How many bits long is our page offset?
> * C01 = ?
> ==11==
> Each page is 2 KiB = 2^11^ B large, so we need 11 bits to uniquely identify one of the bytes in this page.
2. How many bits are there in the PPN?
> * C02 = ?
> ==15==
> PPN = physical page number. Remember that the physical address is split into two parts: the PPN identifies a page of memory, and the page offset identifies a byte within that page.
> Physical memory is 64 MiB = 2^26^ B large, so we need 26 bits to address every byte of physical memory. Of the 26 bits, 11 bits are used as the page offset (from the previous part). That leaves 26 − 11 = 15 bits to be used as the PPN.
3. How many bits are there in the VPN?
> * C03 = ?
> ==24==
> VPN = virtual page number. Remember that the virtual address is split into two parts: the VPN identifies a page of memory, and the page offset identifies a byte within that page. Virtual memory is 32 GiB = 2^35^ B large, so we need 35 bits to address every byte of virtual memory.
> Of the 35 bits, 11 are used as the page offset. That leaves 35 − 11 = 24 bits to be used as the VPN.
4. How many PTEs will we have in our page table?
> * C04 = ?
> ==2^24^== or equivalent.
> PTE = page table entry. A PTE maps a virtual page number to a physical page number, so we need one PTE per virtual page number. The VPN is 24 bits long, so there are 2^24^ different VPNs, which means there are 2^24^ PTEs.
Noticing that this is inefficient, we decide to use a two-level hierarchical page table with no TLBs. Our VPN bits are split evenly among the two levels.
5. How many PTEs are in the L1 page table?
> * C05 = ?
> ==2^12^== or equivalent.
> There are 24 bits in the VPN. If we split these bits evenly among the two levels, we have 12 bits for uniquely identifying a L1 PTE. This gives us 2^12^ possible PTEs in the L1 page table.
6. How many pages worth of memory does a single L2 page table take?
> * C06 = ?
> ==8==
> As in the previous part, splitting the bits evenly gives us 12 bits for uniquely identifying a PTE within a single L2 page table. This means that a single L2 page table contains 2^12^ PTEs.
> From the top of the question, each PTE is 4 bytes long, so a single L2 page table takes up $2^{12} \times 4 = 2^{14}$ bytes.
> One page of memory is 2 KiB = 2^11^ bytes, so a single L2 page table takes up $\dfrac{2^{14}}{2^{11}} = 2^{3} = 8$ pages of memory.
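> The arithmetic for this whole problem can be summarized in a short C sketch (not part of the quiz):
> ```c
> #include <stdio.h>
>
> int main(void)
> {
>     unsigned page_bits = 11; /* 2 KiB pages -> 11 offset bits */
>     unsigned phys_bits = 26; /* 64 MiB -> 26-bit physical addresses */
>     unsigned virt_bits = 35; /* 32 GiB -> 35-bit virtual addresses */
>     unsigned pte_size  = 4;  /* bytes per PTE */
>
>     unsigned ppn_bits = phys_bits - page_bits;      /* 15 */
>     unsigned vpn_bits = virt_bits - page_bits;      /* 24 */
>     unsigned long long num_ptes = 1ULL << vpn_bits; /* 2^24 single-level PTEs */
>
>     /* Two-level table, VPN split evenly: 12 bits per level. */
>     unsigned long long l2_bytes = (1ULL << (vpn_bits / 2)) * pte_size; /* 2^14 B */
>     unsigned long long l2_pages = l2_bytes >> page_bits;               /* 8 pages */
>
>     printf("PPN=%u bits, VPN=%u bits, PTEs=%llu, L2 table=%llu pages\n",
>            ppn_bits, vpn_bits, num_ptes, l2_pages);
>     return 0;
> }
> ```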
---
## Problem `D`
Consider the requirement to implement two new instructions for RISC-V:
* `push rs2` : Put the word stored in `rs2` on the stack, decrementing the stack pointer as necessary.
* `pop rd` : Set `rd` to the bottom-most word of the stack, and remove that element from the stack.
It is undefined behavior to call `push sp` or `pop sp`.
We first decide to implement `push` and `pop` as pseudo-instructions.
1. Write a two-instruction sequence that implements `push rs2`. Your instructions can use `rs2`.
> * D01 = ?
> ==addi sp, sp, -4== AND ==sw rs2, 0(sp)== (Both instructions are required for credit.)
> The first instruction decrements the stack pointer by 4 bytes (one word). The second instruction stores the word in `rs2` in the newly-allocated space on the stack.
> Note that accessing memory below the stack pointer `sp` is undefined, so we have to decrement the stack pointer first before accessing that memory.
2. Write a two-instruction sequence that implements `pop rd`. Your instructions can use `rd`.
> * D02 = ?
> ==lw rd, 0(sp)== AND ==addi sp, sp, 4== (Both instructions are required for credit.)
> The first instruction loads the bottom-most word on the stack into `rd`. The second instruction removes that word from the stack by incrementing the stack pointer by 4 bytes (one word).
> As in the previous part, accessing memory below the stack pointer `sp` is undefined, so we have to load the word from memory first before incrementing the stack pointer (which effectively deletes that word from the stack).
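> A hypothetical round-trip (not required by the quiz) combining both expansions: push a value and pop it back, leaving `sp` unchanged:
> ```c
> addi sp, sp, -4   # push t0
> sw   t0, 0(sp)
> lw   t1, 0(sp)    # pop t1; t1 now holds the value of t0
> addi sp, sp, 4
> ```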
Instead of implementing them as pseudo-instructions, we decide to modify our datapath to make them proper instructions that run in one cycle of a single-cycle datapath. For this question, we will make modifications to the RISC-V datapath included in lecture materials. In order to implement `push` and `pop`, we decide to modify our immediate generator so that setting the `ImmSel` control signal to 7 always outputs an immediate with value 4.
3. What further changes would we need to make to our datapath in order for us to implement the `push` instruction with as few changes as possible? Answer `N/A` if there are no further changes needed.
> * D03 = ?
> ==N/A==
> We can reuse one of the existing instruction format types, as the push instruction only needs rs2 encoded in the instruction. For example, we could make push an S-type instruction, give it a new opcode (not in use by any other instruction), and leave the funct3, immediate, and rs1 fields unused (e.g., fill them with all zeros, or, for rs1, hard-code it to a value consistent with how we are using the rest of the datapath).
> ImmGen will be used to output 4 here (so we can add 4 to sp), so a second new immediate type is not needed.
> The existing Regfile has two outputs for reading two registers. The push instruction only requires reading the values in sp and rs2, so a third Regfile output is not needed.
> The existing Regfile has one writeback input for writing to one register. The push instruction only requires writing to sp, so a second writeback input is not needed.
> The ALU will be used to subtract 4 from sp. We can use the first Regfile read output (rs1) to read the value in sp, so AMux selects the register output, not the PC. BMux selects the immediate (which always outputs 4), not the rs2 Regfile read output. Neither AMux nor BMux needs a new input, and the ALU does not need a third register input or a new operation (sub already exists).
> In the existing datapath, the DMEM address input is the ALU output. In the push instruction, the ALU output is `sp - 4`, which coincidentally is also the address we want to store to, so an additional DMEM address input is not needed.
> In the existing datapath, WBMux supports writing back the ALU output to Regfile. In the push instruction, the ALU output is `sp - 4`, which is what we want to write back to the sp register, so no new input to WBMux is needed.
4. What further changes would we need to make to our datapath in order for us to implement the `pop` instruction with as few changes as possible? Select all that apply:
**a**. Add a new instruction format
**b**. Add a mux before the DMEM address input, including relevant inputs and selectors
**c**. Add a second new immediate type for the ImmGen
**d**. Add a new output to Regfile for a third register value
**e**. Add a new input to AMux, with the relevant selectors
**f**. Add a new input to BMux, with the relevant selectors
**g**. Add a new ALU input for a third register input
**h**. Add a new ALU operation, with a corresponding ALUSel
**i**. Add a new input to WBMux, with the relevant selectors
**j**. Add a second writeback input to Regfile, along with a second RegWEn
> * D04 = ?
> ==b,j== (Both must be selected for credit.)
> We can reuse one of the existing instruction format types, as the pop instruction only needs rd encoded in the instruction. For example, we could make pop an I-type instruction, give it a new opcode (not in use by any other instruction), and leave the funct3, immediate, and rs1 fields unused (e.g. fill them with all zeros, or hard-code them to a value consistent with how we are using the rest of the datapath).
> ImmGen will be used to output 4 here (so we can add 4 to sp), so a second new immediate type is not needed.
> The existing Regfile has two outputs for reading two registers. The pop instruction only requires reading the value in sp, so a third Regfile output is not needed.
> The existing Regfile has one writeback input for writing to one register. The pop instruction requires writing to two registers, sp and rs2, so a second writeback input is needed.
> The ALU will be used to add 4 to sp. We can use the first Regfile read output (rs1) to read the value in sp, so AMux selects the register output, not the PC. BMux selects the immediate (which is always outputting 4), not the rs2 Regfile read output. Neither AMux nor BMux needs a new input, and the ALU does not need a third register input or a new operation (add already exists).
> In the existing datapath, the DMEM address input is the ALU output. In the pop instruction, the ALU output is sp+4, but the address we want to load from is sp, so an additional DMEM address input is needed.
> In the existing datapath, WBMux supports writing back the ALU output to Regfile. In the pop instruction, the ALU output is `sp + 4`, which is what we want to write back to the sp register, so no new input to WBMux is needed.
5. Assuming the above changes were implemented, what hazards can be caused by `push` if we use the typical 5-stage pipeline? Select all that apply:
**a**. Data hazard
**b**. Control hazard
**c**. Neither data nor control hazard
> * D05 = ?
> ==a==
> `push` can cause a data hazard if an instruction immediately following the push reads sp (a RAW dependency on the updated stack pointer).
> `push` does not cause the PC to jump/branch to a different place (it just increments PC by 4), so it cannot cause control hazards.
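> A minimal example (not from the quiz) of such a data hazard on `sp`:
> ```c
> push t0          # writes sp (sp <- sp - 4)
> lw   t1, 0(sp)   # reads the just-decremented sp: RAW hazard
> ```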
---
## Problem `E`
Consider a cache optimization known as "[sub-blocking](https://courses.cs.duke.edu/fall06/cps220/lectures/PPT/lect12.pdf)" (also called "sectored caches"):
* The number of sets and ways is unchanged.
* Each cache block is broken up into smaller "sub-blocks".
  * Each sub-block has its own valid bit.
  * On a cache miss, only the cache sub-blocks accessed by the user's program are loaded in.
    * Other sub-blocks remain in the "invalid" state until they are also loaded in.
* Make sure you understand that "sets" are not "sub-blocks".
Suppose that we have an 8 KB, two-way set-associative cache with 4-word (16-byte) cache lines. The replacement policy is LRU. Each cache block is broken up into four smaller sub-blocks.
We will evaluate the following two loops:
- [ ] Loop A
```c
sum = 0;
for (int i = 0; i < 128; i++)
for (int j = 0; j < 32; j++)
sum += buf[i*32 + j];
```
- [ ] Loop B
```c
sum = 0;
for (int j = 0; j < 32; j++)
for (int i = 0; i < 128; i++)
sum += buf[i * 32 + j];
```
1. What is the number of misses for Loop A and for Loop B with the sectored cache?
* number of misses for Loop A
> * E01 = ?
> ==4096==
* number of misses for Loop B
> * E02 = ?
> ==4096==
> For both Loops A and B, all 4096 (32 × 128) memory accesses will miss.
> In a sectored cache, only a single word (one sub-block) is loaded at a time, and no loaded sub-block is accessed again in either loop.
2. What is the number of misses for Loop A and for Loop B if the cache is not sectored (i.e. no sub-blocks)?
* number of misses for Loop A
> * E03 = ?
> ==1024==
> There are only compulsory misses, one upon the access to the first word in each cache line.
> This happens once every 4 words, so the total number of misses for Loop A is 4096 / 4 = 1024.
* number of misses for Loop B
> * E04 = ?
> ==4096==
> Since memory is accessed with a stride of 32 words, the loop cannot utilize the full cache (only sets 0, 8, 16, ... are used).
> The LRU policy evicts cache lines (in both ways) of the selected sets before the next access to those lines can happen.
> Thus, we have all 4096 accesses resulting in a miss.
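> The set indices can be checked with a quick sketch (not from the quiz), assuming 256 sets of 16 B lines (8 KB / 2 ways / 16 B) and a line-aligned `buf` placed at a hypothetical base address of 0:
> ```c
> #include <stdio.h>
>
> int main(void)
> {
>     /* Loop B strides by 32 ints = 128 bytes = 8 cache lines. */
>     for (int i = 0; i < 8; i++) {
>         unsigned addr = i * 32 * 4;             /* byte address of buf[i*32] */
>         printf("i=%d -> set %u\n", i, (addr / 16) % 256);
>     }
>     return 0;
> }
> ```
> This prints sets 0, 8, 16, ..., 56, confirming that only every 8th set is used.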
Suppose that we removed an outer-level cache to free up area on our chip. With this new area, we doubled the size of our L1 cache.
Suppose that this optimization worsened the L1 hit time from 1 cycle to 2 cycles, and increased the miss penalty from 50 cycles to 100 cycles. Before this optimization, the L1 miss rate was $10\%$.
3. What does the new miss rate have to be for our new optimization to improve the average L1 cache access time?
> * E05 = ?
> ==`<= 40%` of the original miss rate== (i.e., an absolute miss rate of at most 4%)
> Let the new miss rate be $0.1x$, a fraction $x$ of the original 10% miss rate.
> $1 + 0.1 \times 50 \ge 2 + (0.1x) \times 100 \Rightarrow x \le 0.4$
> Thus, the new miss rate has to be less than or equal to 40% of the original miss rate to improve the average L1 access time.
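> The same comparison as a quick C sketch (not from the quiz):
> ```c
> #include <stdio.h>
>
> int main(void)
> {
>     double old_amat = 1 + 0.10 * 50;            /* 6 cycles */
>     for (int i = 3; i <= 5; i++) {
>         double x = i / 10.0;                    /* new miss rate = 0.1 * x */
>         double new_amat = 2 + (0.10 * x) * 100; /* 2 + 10x cycles */
>         printf("x=%.1f: new AMAT=%.1f vs old %.1f\n", x, new_amat, old_amat);
>     }
>     return 0;
> }
> ```
> At $x = 0.4$ the two designs break even at 6 cycles; any larger $x$ makes the new design slower.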
---
## Problem `F`
Consider the implementation of the single-cycle processor we saw in lecture:
![](https://hackmd.io/_uploads/SJDaW3zvi.png)
The timing characteristics of all components are listed below:
![](https://hackmd.io/_uploads/ry2yM3zvi.png)
Assume that any components not listed have a delay of 0 ns.
1. What is the minimum clock cycle time of this processor? Minimum t~CLK~ = ? ns
> * F01 = ?
> ==22==
> Critical path is PC -> IMem -> Decoder -> RF(read) -> BSEL mux -> ALU -> DMem (read) -> WDSEL mux -> RF(write) => t~CLK~ >= 1 + 4 + 2 + 3 + 1 + 4 + 4 + 1 + 2 (RF setup) = 22 ns
Then, let's implement this alternative datapath (the control logic stays the same), where the data memory's address input comes from a different place:
![](https://hackmd.io/_uploads/BJqsz2GPi.png)
2. What is the minimum clock cycle time of the new processor? Minimum t~CLK~ = ? ns
> * F02 = ?
> ==18==
> Critical path is PC -> IMem -> Decoder -> RF(read) -> BSEL mux -> ALU -> WDSEL mux -> RF(write) => t~CLK~ >= 1 + 4 + 2 + 3 + 1 + 4 + 1 + 2 (RF setup) = 18 ns
The program below takes 90 instructions to execute in the original processor. However, it produces incorrect results in the new processor: the data memory's address input now comes directly from the register file rather than from the ALU output, so memory accesses with non-zero offsets, such as `lw a2, 4(a0)` and `sw a3, 4(a0)`, no longer address the intended locations. Modify the program so that it runs correctly on the new processor.
- [ ] C code
```c
int x[16];
for (int i = 0; i < 15; i++)
x[i+1] = x[i] + x[i+1];
```
- [ ] RISC-V Assembly
```c
# Initial values:
# a0: address of x[0]
# a7: address of x[15]
loop:
lw a1, 0(a0)
lw a2, 4(a0)
add a3, a2, a1
sw a3, 4(a0)
addi a0, a0, 4
blt a0, a7, loop
```
3. Number of executed instructions of the new program?
> * F03 = ?
> ==77== (Check whether an even shorter instruction sequence exists.)
> 15 × 5 + 2 = 77: two setup instructions, then 15 iterations of a 5-instruction loop.
> (Other solutions are possible; any correct solution with fewer instructions than the original is acceptable.)
> The correct code:
> ```c
> # Initial values:
> # a0: address of x[0]
> # a7: address of x[15]
> lw a1, 0(a0)
> addi a0, a0, 4
> loop:
> lw a2, 0(a0)
> add a1, a1, a2
> sw a1, 0(a0)
> addi a0, a0, 4
> ble a0, a7, loop
> ```
4. What is the execution time of the above program in the original processor? Answer in **ns**.
> * F04 = ?
> ==1694==
> 77 × 22 = 1694 ns
5. What is the execution time of the above program in the new processor? Answer in **ns**.
> * F05 = ?
> ==1386==
> 77 × 18 = 1386 ns
---
## Problem `G`
A 2-way set-associative cache with a block size of 4 words would likely have better performance for certain applications. Below is a snapshot of this cache during the execution of some unknown code. `V` is the valid bit and `D` is the dirty bit of each line.
![](https://hackmd.io/_uploads/SJ3CS3fwj.png)
1. Would the following memory accesses result in a hit or a miss? If it results in a hit, specify what value is returned; if it is a miss, explain why in a few words or by showing your work.
* 32-Bit Byte Address: `0x50D8`
* Line index: __ G01 __
* Tag: __ G02 __ (in HEX)
* Block offset: __ G03 __
* Returned value if hit / Explanation if miss: __ G04 __
* 32-Bit Byte Address: `0x2CB0`
* Line index: __ G05 __
* Tag: __ G06 __ (in HEX)
* Block offset: __ G07 __
* Returned value if hit / Explanation if miss: __ G08 __
> * G01 = ?
> ==5==
> * G02 = ?
> ==0xA1==
> * G03 = ?
> ==2==
> * G04 = ?
> ==0xCC==
> * G05 = ?
> ==3==
> * G06 = ?
> ==0x59==
> * G07 = ?
> ==0==
> * G08 = ?
> ==0xD3==
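> The breakdowns above follow from the cache geometry in the snapshot (8 sets, 4-word blocks): 2 byte-offset bits, 2 block-offset (word) bits, 3 index bits, and the rest tag. A small C sketch (not from the quiz):
> ```c
> #include <stdio.h>
> #include <stdint.h>
>
> static void split(uint32_t addr)
> {
>     uint32_t block_off = (addr >> 2) & 0x3; /* bits [3:2]: word within block */
>     uint32_t index     = (addr >> 4) & 0x7; /* bits [6:4]: one of 8 sets */
>     uint32_t tag       = addr >> 7;         /* remaining bits */
>     printf("0x%04X -> tag=0x%X index=%u block_off=%u\n",
>            addr, tag, index, block_off);
> }
>
> int main(void)
> {
>     split(0x50D8); /* tag=0xA1 index=5 block_off=2 */
>     split(0x2CB0); /* tag=0x59 index=3 block_off=0 */
>     return 0;
> }
> ```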
Then, we want to analyze the performance of this cache on the following benchmark assembly program, which repeatedly calculates the sum of all the elements in the 64-element array `A`. The base address of array `A` is `0x3000`.
```c
// Assume the following registers are initialized:
// x2 = 64 (size of the array)
// x3 = 0x3000 (base address of array A)
. = 0x100 // The following code starts at address 0x100
sum_loop:
li x1, 0 // Start with i == 0
li x4, 0 // x4 = sum of the array
sum:
slli x5, x1, 2 // x5 = byte offset of the ith element
add x5, x5, x3 // x5 = address of A[i]
lw x5, 0(x5) // x5 = A[i]
add x4, x4, x5 // add element to sum
addi x1, x1, 1 // increment i
blt x1, x2, sum // continue looping
jal x0, sum_loop // restart the benchmark
```
Answer the following questions about the behavior of the cache during execution of the above code. Assume that the cache uses a least recently used (LRU) replacement policy, that the cache is initially empty, and that all cache lines in Way 0 are currently the least-recently used.
2. How many instruction fetches and data accesses occur per iteration of the outer loop (`sum_loop`)?
* Number of instruction fetches: __ G09 __
> * G09 = ?
> ==387==
> 6 × 64 + 3 = 387: six inner-loop instructions per iteration over 64 iterations, plus the two `li` instructions and the `jal`.
* Number of data accesses: __ G10 __
> * G10 = ?
> ==64==
> One `lw` data access per inner-loop iteration, over 64 iterations.
3. After the program has been running for many outer loop iterations, what is the steady-state hit ratio for instruction fetches and data accesses? Be aware of the total size of the cache, and the block size.
* Steady-state hit ratio for instruction fetches for the outer loop: __ G11 __
> * G11 = ?
> ==$\dfrac{386}{387}$==
> All of the inner loop instructions will be most recently used and will not be kicked out of the cache by data fetches. Since our block size is 4, this also means the two `li` instructions will remain in the cache. However, the `jal` instruction will be kicked out by data every loop, so our total instruction hit ratio is 386/387.
* Steady-state hit ratio for data accesses: __ G12 __
> * G12 = ?
> ==$\dfrac{29}{32}$==
> We will always miss on the first block of data overlapping with the inner loop instructions, so there are 4 data misses as those data blocks get pulled in. We will also miss the data fetch in the first block of set 3 because the `jal` will overwrite it, so that's another 2 data misses. That leads to a total of 6 data misses, so our hit ratio is (64 − 6) / 64 = 58/64 = 29/32.
---