Solutions
A
An algorithm called the Sieve of Eratosthenes can be used to generate a list of prime integers. The function below implements this sieve; the blanks are filled as follows:
- A01 = ?
calloc(1, n) or equivalent.
- A02 = ?
!primes or equivalent.
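Since the original listing is not reproduced here, the following is a minimal sketch of what prime_list might look like, assuming the loop bounds quoted in the analysis below (j=4; j<n; j+=2 for i=2, and an outer loop that stops at i=127 for n=16384); the exam's actual listing may differ in its details.

```c
#include <stdlib.h>

/* Hypothetical reconstruction of prime_list; primes[i] == 0 means
 * "not yet marked composite". */
char *prime_list(int n) {
    char *primes = calloc(1, n);            /* A01: zero-initialized n-byte array */
    if (!primes)                            /* A02: give up if allocation failed */
        return NULL;
    for (int i = 2; i * i < n; i++)
        if (!primes[i])                     /* i is prime: mark its multiples */
            for (int j = i + i; j < n; j += i)
                primes[j] = 1;
    return primes;
}
```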
We run this program on a 32-bit system with a direct-mapped, 4 KiB cache, with 128B blocks.
What is the T:I:O breakdown of this cache? T: A03, I: A04, O: A05
- A03 = ?
20
- A04 = ?
5
- A05 = ?
7
Each block is 128 = 2^7 bytes, so we need 7 offset bits to uniquely address each byte in a block.
The cache is direct-mapped, so each block of the cache has a unique index. The cache contains 4 KiB = 2^12 bytes, and each block is 2^7 bytes, so there are 2^12 / 2^7 = 32 blocks in the cache. We need 5 bits to uniquely address each block in the cache.
The system uses 32-bit addresses, so we have 32 − 5 − 7 = 20 bits left over for the tag.
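As a compact summary of the arithmetic:

```latex
O = \log_2 128 = 7, \qquad
I = \log_2\!\bigl(2^{12}/2^{7}\bigr) = \log_2 32 = 5, \qquad
T = 32 - I - O = 20
```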
How many misses of each type occur if we call prime_list(4096)?
- A06 = ?
32
- A07 = ?
0
- A08 = ?
0
Let's start by considering the first iteration of the outer for loop (i=2). At this point, primes[2] is 0, so the inner for loop will execute. Its bounds are j=4; j<4096; j+=2, and on each iteration, we are accessing primes[j].
primes[4] is a compulsory miss, because our cache starts cold and we have never accessed this part of memory before. This miss brings in a 128-byte block containing primes[0] to primes[127]. (Strictly speaking, the outer loop's check of primes[2] is the access that first brings in this block; either way, block 0 costs exactly one compulsory miss.)
primes[6] is a hit because of the block we just brought in. primes[8], primes[10], …, primes[126] are also hits.
primes[128] is a compulsory miss, since we have never accessed this part of memory before. This miss brings in another 128-byte block containing primes[128] to primes[255]. Since the cache is direct-mapped, and this block immediately follows the previous block in memory, the two blocks occupy different indices in the cache. (You can also verify this by checking that the index bits increment by 1 between these two blocks.)
primes[130] is a hit because of the block we just brought in. primes[132], primes[134], …, primes[254] are also hits.
This pattern repeats for all 32 blocks of the 4096-byte array: each block costs exactly one compulsory miss the first time it is touched. Since the 32-block array exactly fits the 32-block cache, nothing is ever evicted, and every later access (including all subsequent iterations of the outer loop) hits. Hence 32 compulsory misses, 0 capacity misses, and 0 conflict misses.
How many misses of each type occur if we call prime_list(16384)? You may use the function π(n) to denote the number of primes less than or equal to n. (So π(2) = 1, π(10) = 4, etc.)
- A09 = ?
128
- A10 = ?
(π(127) − 1) × 128 = 30 × 128 = 3840 or equivalent.
- A11 = ?
0
Let's start by considering the first iteration of the outer for loop (i=2). At this point, primes[2] is 0, so the inner for loop will execute. Its bounds are j=4; j<16384; j+=2, and on each iteration, we are accessing primes[j].
On this first iteration of the outer for loop, we have the same pattern as the previous part. Each block of 128 bytes forces one compulsory miss, and we iterate through every block of the array. The array is 16384 = 2^14 bytes, and each block is 128 = 2^7 bytes, so we bring in 2^14 / 2^7 = 128 blocks. This means we have 128 compulsory misses, 0 capacity misses, and 0 conflict misses from the first iteration of the outer for loop.
Unlike the previous part, where the entire array fits in the cache, the 128-block array is now four times larger than the 32-block cache. Because the cache is direct-mapped, the first quarter of the array fills up the cache, then is completely replaced by the second quarter, then the third, then the last. After the first iteration of the outer for loop, only the last quarter of the array is in the cache.
On the second iteration of the outer for loop, i=3 and primes[3] is 0, so the inner for loop will execute. Its bounds are j=6; j<16384; j+=3, and as before, we are accessing primes[j] on each iteration.
There are two important things to note about this loop. First, it starts back at the beginning of the array. Second, it accesses every 128-byte block of the array, just like the previous j+=2 loop.
This means that the cache, which currently contains the last quarter of the array, will be entirely replaced with the first quarter of the array as this loop executes. In other words, nothing in the cache at the end of the previous loop helps us in this loop; the cache effectively starts cold again. Also, because we access every block of the array again, we have another 128 misses, just like in the previous for loop. This time, the 128 misses are capacity misses, because the cache is full on every miss. (There are no compulsory or conflict misses during this loop.)
On the third iteration of the outer for loop, i=4 and primes[4] is 1, so we skip the inner for loop. This still causes one miss, due to accessing the 0th block of our array.
On the fourth iteration of the outer for loop, i=5 and primes[5] is 0, so the inner for loop executes with j=10; j<16384; j+=5. As before, we end up having to access every block of the array, which results in another 127 capacity misses. Note that because we missed on the first block in iteration i=4, we don't miss again on the first block.
In general, we miss exactly 128 times for each prime number; once for the first composite number after the previous prime, and 127 times within the inner for loop.
This pattern continues all the way to i=127, the last point where i² < 16384. In all of these iterations of the outer for loop, the inner for loop executes only when i is prime (per the check if (!primes[i])), so in total, we have π(127) (that is, 31) iterations of the inner for loop. The first of these causes compulsory misses, and the other π(127) − 1 = 30 iterations cause capacity misses.
Each execution of the inner for loop, together with the outer-loop checks that precede it, causes 128 misses in total; note that because we only go up to i=127, the inner for loop always has to access every block of the array. In other words, for a block to be skipped, the inner for loop would need to be skipping by at least 128 bytes each time, which never happens here.
Finally, we note that 127 is prime, so our last iteration of the loop is a prime number. As such, we do not have an additional miss to check a composite number after our last prime.
In total, we have (π(127) − 1) × 128 = 3840 capacity misses, 128 compulsory misses, and 0 conflict misses.
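As a sanity check, a small direct-mapped cache simulator replaying the access pattern of the sketch above reproduces both totals (32 misses for n = 4096 and 128 + 3840 = 3968 for n = 16384). This assumes the reconstructed loop structure and a block-aligned array; it is an illustration, not part of the exam.

```c
#include <stdio.h>
#include <string.h>

#define NBLOCKS 32   /* 4 KiB cache / 128 B blocks */
#define BSHIFT  7    /* log2(128 B) */

struct cache { long tag[NBLOCKS]; char valid[NBLOCKS]; long misses; };

/* Direct-mapped lookup on a byte offset into the primes array. */
static void access_byte(struct cache *c, long addr) {
    long blk = addr >> BSHIFT;   /* block number */
    int  idx = blk % NBLOCKS;    /* direct-mapped index */
    if (!c->valid[idx] || c->tag[idx] != blk) {
        c->misses++;
        c->valid[idx] = 1;
        c->tag[idx]   = blk;
    }
}

/* Replay the accesses made by the reconstructed prime_list(n). */
static long count_misses(int n) {
    static char primes[16384];
    struct cache c;
    memset(&c, 0, sizeof c);
    memset(primes, 0, sizeof primes);
    for (int i = 2; i * i < n; i++) {
        access_byte(&c, i);                  /* the primes[i] check */
        if (!primes[i])
            for (int j = i + i; j < n; j += i) {
                access_byte(&c, j);          /* the primes[j] update */
                primes[j] = 1;
            }
    }
    return c.misses;
}

int main(void) {
    printf("prime_list(4096):  %ld misses\n", count_misses(4096));   /* 32 */
    printf("prime_list(16384): %ld misses\n", count_misses(16384));  /* 3968 */
    return 0;
}
```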
B
Consider a change to the typical 5-stage, fully bypassed RISC-V processor pipeline. The data cache in our new CPU has a two-cycle latency. To accommodate this cache, the memory stage is pipelined into two stages, M1 and M2, as illustrated in Figure B-1. Additional bypasses are added to keep the pipeline fully bypassed.
Suppose we are implementing this 6-stage pipeline in a technology in which register file ports are inexpensive but bypasses are costly. We wish to reduce cost by removing some of the bypass paths, but without increasing CPI. The proposal is for all integer arithmetic instructions to write their results to the register file at the end of the Execute stage, rather than waiting until the Writeback stage. A second register file write port is added for this purpose. Remember that register file writes occur on each rising clock edge, and values can be read in the next clock cycle. The proposed change is shown in Figure B-2.
Assume that the only exceptions that can occur in this pipeline are illegal opcodes (detected in the Decode stage) and invalid memory addresses (detected at the start of the M2 stage). Additionally, assume that the control logic is optimized to stall only when necessary.
You may ignore branch and jump instructions in this problem.
Figure B-1. 6-stage pipeline. For clarity, bypass paths are not shown.
Figure B-2. 6-stage pipeline with proposed additional write port.
The second write port allows some bypass paths to be removed without adding stalls in the Decode stage. Explain how the second write port improves performance by eliminating such stalls. You MUST explicitly state which kind(s) of hazards can be resolved, along with the reasons.
- B01 = ?
The second write port improves performance by resolving some RAW hazards earlier than they would otherwise be resolved: without it, ALU operations would have to wait until Writeback to make their results available to subsequent dependent instructions. or equivalent. (An explanation is required for credit; it may be brief.)
Will the second write port help the following instruction sequence? (Answer in Yes or No)
- B02 = ?
Yes
Explain your answer to the previous question.
- B03 = ?
The important insight is that the second write port cannot resolve data hazards between immediately back-to-back instructions. An arithmetic instruction in the EX stage writes the register file only as it leaves EX; therefore, the bypass path is still necessary if the very next instruction has a RAW dependency and is allowed to leave the ID stage. or equivalent. (An explanation is required for credit; it may be brief.)
After the second write port is added, which bypass paths can be removed in this new pipeline without introducing additional stalls? List each removed bypass individually.
- B04 = ?
The bypass path from the end of M1 to the end of ID (equivalently: the bypass path from the beginning of M2 to the beginning of EX).
Additionally, ALU results no longer need to be bypassed from the end of M2 or the end of WB, but these bypass paths are still used to forward load results to earlier stages.
Are any new hazards added to the pipeline due to the earlier writeback of arithmetic instructions? Answer in Yes or No with strong reasons.
- B05 = ?
Yes. There are multiple potential WAW hazards that must be appropriately addressed by the control logic. Two instructions writing the register file at the same time must be appropriately prioritized. Also, if an arithmetic instruction is in M1 and a load with the same destination register is in M2, the later write of the older load can clobber the result of the younger arithmetic instruction, leading to an incorrect architectural state. The control logic needs to be modified to handle these situations by suppressing the writes of older instructions when they conflict with the writes of newer instructions. or equivalent. (An explanation is required for credit; it may be brief.)
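A hypothetical sequence illustrating the load-clobber case described above:

```asm
lw   x5, 0(x10)   # older instruction: writes x5 late, after M2
addi x5, x0, 1    # younger instruction: writes x5 early, at the end of EX
# Without suppression, the lw's later write overwrites the addi's result,
# leaving x5 holding the loaded value instead of 1.
```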
Without further modifications, this pipeline may not support precise exceptions. Will the code sequence below result in an imprecise exception? Answer in Yes or No.
- B06 = ?
Yes
Explain your answer to the previous question.
- B07 = ?
Illegal address exceptions are not detected until the start of the M2 stage. Since writebacks can occur at the end of the EX stage, it is possible for an arithmetic instruction following a memory access to an illegal address to have written its value back before the exception is detected, resulting in an imprecise exception. (An explanation is required for credit; it may be brief, and it may be combined with a code example, e.g.:)
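A hypothetical sequence of this kind (the exam's own example is not shown here):

```asm
lw   x5, 0(x6)    # suppose 0(x6) is an invalid address; the fault is
                  # detected only at the start of M2
addi x7, x8, 1    # younger arithmetic op: writes x7 at the end of EX,
                  # before the lw's fault is detected
# When the exception is taken, x7 has already been updated by an
# instruction younger than the faulting lw, so the exception is imprecise.
```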
C
Assume that we are dealing with a byte-addressed system that has a 2 KiB page size, 64 MiB of physical memory, and 32 GiB of virtual memory. Each page table entry is specified to be 4 bytes long, including metadata.
Assume we have a single level page table with no TLBs.
How many bits long is our page offset?
- C01 = ?
11
Each page is 2 KiB = 2^11 B large, so we need 11 bits to uniquely identify one of the bytes in this page.
How many bits are there in the PPN?
- C02 = ?
15
PPN = physical page number. Remember that the physical address is split into two parts: the PPN identifies a page of memory, and the page offset identifies a byte within that page.
Physical memory is 64 MiB = 2^26 B large, so we need 26 bits to address every byte of physical memory. Of the 26 bits, 11 bits are used as the page offset (from the previous part). That leaves 26 − 11 = 15 bits to be used as the PPN.
How many bits are there in the VPN?
- C03 = ?
24
VPN = virtual page number. Remember that the virtual address is split into two parts: the VPN identifies a page of memory, and the page offset identifies a byte within that page. Virtual memory is 32 GiB = 2^35 B large, so we need 35 bits to address every byte of virtual memory.
Of the 35 bits, 11 are used as the page offset. That leaves 35 − 11 = 24 bits to be used as the VPN.
How many PTEs will we have in our page table?
- C04 = ?
2^24 or equivalent.
PTE = page table entry. A PTE maps a virtual page number to a physical page number, so we need one PTE per virtual page number. The VPN is 24 bits long, so there are 2^24 different VPNs, which means there are 2^24 PTEs.
Noticing that this is inefficient, we decide to use a two-level hierarchical page table with no TLBs. Our VPN bits are split evenly among the two levels.
How many PTEs are in the L1 page table?
- C05 = ?
2^12 or equivalent.
There are 24 bits in the VPN. If we split these bits evenly among the two levels, we have 12 bits for uniquely identifying an L1 PTE. This gives us 2^12 possible PTEs in the L1 page table.
How many pages worth of memory does a single L2 page table take?
- C06 = ?
8
As in the previous part, splitting the bits evenly gives us 12 bits for uniquely identifying a PTE within a single L2 page table. This means that a single L2 page table contains 2^12 PTEs.
From the top of the question, each PTE is 4 bytes long, so a single L2 page table takes up 2^12 × 4 = 2^14 bytes.
One page of memory is 2 KiB = 2^11 bytes, so a single L2 page table takes up 2^14 / 2^11 = 2^3 = 8 pages of memory.
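To recap the arithmetic for this whole part:

```latex
\text{offset} = \log_2 2^{11} = 11, \quad \text{PPN} = 26 - 11 = 15, \quad \text{VPN} = 35 - 11 = 24
\#\text{PTEs (single level)} = 2^{24}, \quad
\text{L2 table size} = 2^{12} \times 4\,\text{B} = 2^{14}\,\text{B}, \quad
2^{14}\,\text{B} / (2^{11}\,\text{B/page}) = 8\ \text{pages}
```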
D
Consider the requirement to implement two new instructions for RISC-V:
- push rs2: Put the word stored in rs2 on the stack, decrementing the stack pointer as necessary.
- pop rd: Set rd to the bottom-most word of the stack, and remove that element from the stack.
It is undefined behavior to call push sp or pop sp.
We first decide to implement push and pop as pseudo-instructions.
Write a two-instruction sequence that implements push rs2. Your instructions can use rs2.
- D01 = ?
addi sp, sp, -4 AND sw rs2, 0(sp) (both instructions are required for credit)
The first instruction decrements the stack pointer by 4 bytes (one word). The second instruction stores the word in rs2 in the newly-allocated space on the stack.
Note that accessing memory below the stack pointer sp is undefined, so we have to decrement the stack pointer first, before storing to that memory.
Write a two-instruction sequence that implements pop rd. Your instructions can use rd.
- D02 = ?
lw rd, 0(sp) AND addi sp, sp, 4 (both instructions are required for credit)
The first instruction loads the bottom-most word on the stack into rd. The second instruction removes that word from the stack by incrementing the stack pointer by 4 bytes (one word).
As in the previous part, accessing memory below the stack pointer sp is undefined, so we have to load the word from memory first before incrementing the stack (which effectively deletes that word from the stack).
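To illustrate why the order matters, here is a hypothetical incorrect expansion:

```asm
# Incorrect pop: once sp has been incremented, the word lies below sp,
# where memory contents are undefined (e.g., an interrupt handler could
# clobber it before the lw executes).
addi sp, sp, 4
lw   rd, -4(sp)
```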
Instead of implementing them as pseudo-instructions, we decide to modify our datapath to make them proper instructions that run in one cycle of a single-cycle datapath. For this question, we will make modifications to the RISC-V datapath included in lecture materials. In order to implement push and pop, we decide to modify our immediate generator so that setting the ImmSel control signal to 7 always outputs an immediate with value 4.
What further changes would we need to make to our datapath in order for us to implement the push instruction with as few changes as possible? Answer N/A if there are no further changes needed.
- D03 = ?
N/A
We can reuse one of the existing instruction format types, as the push instruction only needs rs2 encoded in the instruction. For example, we could make push an S-type instruction, give it a new opcode (not in use by any other instruction), and leave the funct3, immediate, and rs1 fields unused (e.g., fill them with all zeros, or, for rs1, hard-code it to a value consistent with how we are using the rest of the datapath).
ImmGen will be used to output 4 here (so we can add 4 to sp), so a second new immediate type is not needed.
The existing Regfile has two outputs for reading two registers. The push instruction only requires reading the values in sp and rs2, so a third Regfile output is not needed.
The existing Regfile has one writeback input for writing to one register. The push instruction only requires writing to sp, so a second writeback input is not needed. The ALU will be used to subtract 4 from sp. We can use the first Regfile read output (rs1) to read the value in sp, so AMux selects the register output, not the PC. BMux selects the immediate (which always outputs 4), not the rs2 Regfile read output. Neither AMux nor BMux needs a new input, and the ALU does not need a third register input or a new operation (sub already exists).
In the existing datapath, the DMEM address input is the ALU output. In the push instruction, the ALU output is sp − 4, which coincidentally is also the address we want to store to, so an additional DMEM address input is not needed. In the existing datapath, WBMux supports writing back the ALU output to Regfile. In the push instruction, the ALU output is sp − 4, which is what we want to write back to the sp register, so no new input to WBMux is needed.
What further changes would we need to make to our datapath in order for us to implement the pop instruction with as few changes as possible? Select all that apply:
a. Add a new instruction format
b. Add a mux before the DMEM address input, including relevant inputs and selectors
c. Add a second new immediate type for the ImmGen
d. Add a new output to Regfile for a third register value
e. Add a new input to AMux, with the relevant selectors
f. Add a new input to BMux, with the relevant selectors
g. Add a new ALU input for a third register input
h. Add a new ALU operation, with a corresponding ALUSel
i. Add a new input to WBMux, with the relevant selectors
j. Add a second writeback input to Regfile, along with a second RegWEn
- D04 = ?
b, j (both must be selected for credit)
We can reuse one of the existing instruction format types, as the pop instruction only needs rd encoded in the instruction. For example, we could make pop an I-type instruction, give it a new opcode (not in use by any other instruction), and leave the funct3, immediate, and rs1 fields unused (e.g. fill them with all zeros, or hard-code them to a value consistent with how we are using the rest of the datapath).
ImmGen will be used to output 4 here (so we can add 4 to sp), so a second new immediate type is not needed.
The existing Regfile has two outputs for reading two registers. The pop instruction only requires reading the value in sp, so a third Regfile output is not needed.
The existing Regfile has one writeback input for writing to one register. The pop instruction requires writing to two registers, sp and rs2, so a second writeback input is needed.
The ALU will be used to add 4 to sp. We can use the first Regfile read output (rs1) to read the value in sp, so AMux selects the register output, not the PC. BMux selects the immediate (which always outputs 4), not the rs2 Regfile read output. Neither AMux nor BMux needs a new input, and the ALU does not need a third register input or a new operation (add already exists).
In the existing datapath, the DMEM address input is the ALU output. In the pop instruction, the ALU output is sp+4, but the address we want to load from is sp, so an additional DMEM address input is needed.
In the existing datapath, WBMux supports writing back the ALU output to Regfile. In the pop instruction, the ALU output is sp + 4, which is what we want to write back to the sp register, so no new input to WBMux is needed.
Assuming the above changes were implemented, what hazards can be caused by push if we use the typical 5-stage pipeline? Select all that apply:
a. Data hazard
b. Control hazard
c. Neither data nor control hazard
- D05 = ?
a
push can cause a data hazard if the instruction immediately following it reads sp.
push does not cause the PC to jump/branch to a different place (it just increments PC by 4), so it cannot cause control hazards.
E
Consider a cache optimization known as "sub-blocking" (also called "sectored caches"):
Suppose that we have an 8 KB, two-way set-associative cache with 4-word (16-byte) cache lines. The replacement policy is LRU. Each cache block is broken up into four smaller sub-blocks.
We will evaluate the following two loops:
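The loop listings are not reproduced here; the following is a plausible reconstruction consistent with the counts used in the solutions (4096 word accesses each, unit stride for Loop A, a 32-word stride for Loop B; the names A, sum, and the loop nesting are assumptions):

```c
/* Plausible reconstruction of the two loops (names and bounds assumed). */
int A[4096];   /* 4096 words = 16 KB, twice the 8 KB cache */
int sum;

void loop_a(void) {               /* Loop A: sequential, unit stride */
    for (int i = 0; i < 4096; i++)
        sum += A[i];
}

void loop_b(void) {               /* Loop B: 32-word stride, 32 passes */
    for (int j = 0; j < 32; j++)
        for (int i = j; i < 4096; i += 32)
            sum += A[i];
}
```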
What is the number of misses for Loop A and for Loop B with the sectored cache?
- E01 = ?
4096
- E02 = ?
4096
For both Loop A and B, all 4096 (32 × 128) memory accesses will miss.
In a sectored cache, only a single word (sub-block) is loaded on each miss, and none of the loaded sub-blocks are accessed again in either Loop A or Loop B.
What is the number of misses for Loop A and for Loop B if the cache is not sectored (i.e. no sub-blocks)?
- E03 = ?
1024
There are only compulsory misses, upon the access to the first word in each cache line. This happens once every 4 words, so the total number of misses for Loop A is 4096 / 4 = 1024.
- E04 = ?
4096
Since memory is accessed with a stride of 32 words, the loop cannot utilize the full cache (only sets 0, 8, 16, etc. are used).
The LRU policy evicts cache lines (in both ways) of the selected sets before the next access to those lines can happen.
Thus, we have all 4096 accesses resulting in a miss.
Suppose that we removed an outer-level cache to free up area on our chip. With this new area, we doubled the size of our L1 cache.
Suppose that this optimization worsened the L1 hit time from 1 cycle to 2 cycles, and increased the miss penalty from 50 cycles to 100 cycles. Before this optimization, the L1 miss rate was 10%. How low must the new miss rate be, relative to the original, for the average L1 access time to improve?
- E05 = ?
<= 40%
Let the new miss rate be 0.1x, i.e., x times the original miss rate of 0.1. For the optimization to be worthwhile, the new average access time must not exceed the old one:
1 + 0.1 × 50 ≥ 2 + (0.1x) × 100 ⇒ 6 ≥ 2 + 10x ⇒ x ≤ 0.4
Thus, the new miss rate has to be less than or equal to 40% of the original miss rate to improve the average L1 access time.
F
Consider the implementation of the single-cycle processor we saw in lecture:
The timing characteristics of all components are listed below:
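The component table is not reproduced here, but the delays can be recovered from the critical-path sums in the answers below:
- PC register clk-to-Q: 1 ns
- IMem read: 4 ns
- Decoder: 2 ns
- Register file read: 3 ns
- Mux (BSEL/WDSEL): 1 ns
- ALU: 4 ns
- DMem read: 4 ns
- Register file setup: 2 ns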
Assume that any components not listed have a delay of 0 ns. What is the minimum clock period tCLK of this processor, in ns?
- F01 = ?
22
Critical path is PC -> IMem -> Decoder -> RF (read) -> BSEL mux -> ALU -> DMem (read) -> WDSEL mux -> RF (write) => tCLK >= 1 + 4 + 2 + 3 + 1 + 4 + 4 + 1 + 2 (RF setup) = 22 ns
Then, let's implement this alternative datapath (the control logic stays the same), where the data memory's address input comes from a different place. What is the minimum clock period now, in ns?
- F02 = ?
18
Critical path is PC -> IMem -> Decoder -> RF (read) -> BSEL mux -> ALU -> WDSEL mux -> RF (write) => tCLK >= 1 + 4 + 2 + 3 + 1 + 4 + 1 + 2 (RF setup) = 18 ns
The program below takes 90 instructions to execute in the original processor. However, it produces incorrect results in the new processor. Modify the program so that it runs correctly on the new processor.
Number of executed instructions of new program?
- F03 = ?
77 (check whether an even shorter instruction sequence exists)
15 × 5 + 2 = 77 instructions.
(Other solutions are possible; accept any correct solution with fewer executed instructions than the original.)
The correct code:
What is the execution time of the above program on the original processor? Answer in ns.
- F04 = ?
1694
77 * 22 = 1694 ns
What is the execution time of the above program on the new processor? Answer in ns.
- F05 = ?
1386
77 * 18 = 1386 ns
G
A 2-way set-associative cache with a block size of 4 words would probably have better performance for certain applications. Below is a snapshot of this cache during the execution of some unknown code. V is the valid bit and D is the dirty bit of each set.
0x50D8
0x2CB0
- G01 = ?
5
- G02 = ?
0xA1
- G03 = ?
2
- G04 = ?
0xCC
- G05 = ?
3
- G06 = ?
0x59
- G07 = ?
0
- G08 = ?
0xD3
Then, we want to analyze the performance of this cache on the following benchmark assembly program, which repeatedly calculates the sum of all the elements in the 64-element array A. The base address of array A is 0x3000.
Answer the following questions about the behavior of the cache during execution of the above code. Assume that the cache uses a least recently used (LRU) replacement policy, that the cache is initially empty, and that all cache lines in Way 0 are currently the least-recently used.
How many instruction fetches and data accesses occur per iteration of the outer loop (sum_loop)?
- G09 = ?
387
6 × 64 + 3 = 387 instruction fetches.
- G10 = ?
64 (one data access per inner-loop iteration).
After the program has been running for many outer loop iterations, what is the steady-state hit ratio for instruction fetches and data accesses? Be aware of the total size of the cache, and the block size.
- G11 = ?
386/387
All of the inner-loop instructions will be most recently used and will not be kicked out of the cache by data fetches. Since our block size is 4 words, this also means the two li instructions will remain in the cache. However, the jal instruction will be kicked out by data accesses every loop, so our total instruction hit ratio is 386/387.
- G12 = ?
29/32
We will always miss on the first block of data overlapping with the inner-loop instructions, so there are 4 data misses as those data blocks get pulled in. We will also miss the data fetch in the first block of set 3 because jal will overwrite it, so that's another 2 data misses. That leads to a total of 6 data misses, so our hit ratio is (64 − 6) / 64 = 58/64 = 29/32.