---
tags: computer-arch
---

# Quiz5 of Computer Architecture (2022 Fall)

:::info
:information_source: General Information
* You are allowed to read **[lecture materials](http://wiki.csie.ncku.edu.tw/arch/schedule)**.
    * That is, this is an open-book exam.
* We are using the honor system during this quiz, and would like you to accept the following:
    1. You will not share the quiz with anyone.
    2. You will not talk to anyone about the information on the quiz until the answers are made public.
* Each answer is worth 3 points.
* You must answer in valid numeric/C99/assembly representation and/or the English alphabet, except for your formal name.
* Message **[the instructor via Facebook](https://www.facebook.com/JservFans)** if you find something wrong.
* :timer_clock: 09:10 ~ 10:20AM on Nov 29, 2022
:::

## Problem A

An algorithm called the [Sieve of Eratosthenes](https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) can be used to generate a list of prime integers. The function below implements this sieve and is defined as follows:
* Input: an integer $n$. You may assume that $n > 0$.
* Output: an array of $n$ 1-byte integers. The $i$-th element of this array is 0 if $i$ is prime, and 1 otherwise.

```c
char *prime_list(uint32_t n)
{
    char *primes = A01 /* Fill your answer here */;
    if (A02 /* Fill your answer here */)
        return NULL;
    primes[0] = 1;
    primes[1] = 1;
    for (int i = 2; i * i < n; i++) {
        if (!primes[i]) { /* If i is a prime number */
            for (int j = 2 * i; j < n; j += i)
                primes[j] = 1; /* Cross out all multiples of i */
        }
    }
    return primes;
}
```

1. What should be written in the given blanks in the C code? i.e., A01 and A02
> * A01 = ?
> * A02 = ?

We run this program on a 32-bit system with a direct-mapped, 4 KiB cache with 128 B blocks.

2. What is the T:I:O breakdown of this cache? T: A03, I: A04, O: A05
> * A03 = ?
> * A04 = ?
> * A05 = ?

3. How many misses of each type occur if we call `prime_list(4096)`?
* number of compulsory misses
> * A06 = ?
* number of capacity misses
> * A07 = ?
* number of conflict misses
> * A08 = ?

4. How many misses of each type occur if we call `prime_list(16384)`? You may use the function $\pi(n)$ to denote the number of primes less than or equal to $n$. (So $\pi(2) = 1$, $\pi(10) = 4$, etc.)
* number of compulsory misses
> * A09 = ?
* number of capacity misses
> * A10 = ?
* number of conflict misses
> * A11 = ?

---

## Problem B

Consider a change to the typical 5-stage, fully bypassed RISC-V processor pipeline. The data cache in our new CPU has a two-cycle latency. To accommodate this cache, the memory stage is pipelined into two stages, M1 and M2, as illustrated in Figure B-1. Additional bypasses are added to keep the pipeline fully bypassed.

Suppose we are implementing this 6-stage pipeline in a technology in which register file ports are inexpensive but bypasses are costly. We wish to reduce cost by removing some of the bypass paths, but without increasing CPI. The proposal is for all integer arithmetic instructions to write their results to the register file at the end of the Execute stage, rather than waiting until the Writeback stage. A second register file write port is added for this purpose. Remember that register file writes occur on each rising clock edge, and values can be read in the next clock cycle. The proposed change is shown in Figure B-2.

Assume that the only exceptions that can occur in this pipeline are illegal opcodes (detected in the Decode stage) and invalid memory addresses (detected at the start of the M2 stage). Additionally, assume that the control logic is optimized to stall only when necessary. You may ignore branch and jump instructions in this problem.

![](https://hackmd.io/_uploads/ryrDN9MDs.png)
> $\uparrow$ Figure B-1. 6-stage pipeline. For clarity, bypass paths are not shown.

![](https://hackmd.io/_uploads/SkRcV9Mvo.png)
> $\uparrow$ Figure B-2.
6-stage pipeline with proposed additional write port.

The second write port allows some bypass paths to be removed without adding stalls in the decode stage. Explain how the second write port improves performance by eliminating such stalls. You MUST explicitly mention which kind(s) of hazards can be resolved, along with the reasons.
> * B01 = ?

Will the second write port help the following instruction sequence? (Answer **Yes** or **No**)
```c
add x1, x2, x3
add x4, x5, x6
add x7, x1, x9
```
> * B02 = ?

Explain your reasoning for the previous answer.
> * B03 = ?

After the second write port is added, which bypass paths can be removed in this new pipeline without introducing additional stalls? List each removed bypass individually.
> * B04 = ?

Are any new hazards added to the pipeline due to the earlier writeback of arithmetic instructions? Answer **Yes** or **No** with strong reasons. Explain!
> * B05 = ?

Without further modifications, this pipeline may not support precise exceptions. Will the code sequence below result in an imprecise exception? Answer **Yes** or **No**.
```c
lw  x1, -1(x0)
add x2, x3, x4
```
> * B06 = ?

Explain the previous answer.
> * B07 = ?

---

## Problem C

Assume that we are dealing with a byte-addressed system that has a 2 KiB page size, 64 MiB of physical memory, and 32 GiB of virtual memory. Each page table entry (PTE) is 4 bytes long, including metadata. Assume we have a single-level page table with no TLBs.

1. How many bits long is our page offset?
> * C01 = ?
2. How many bits are there in the PPN?
> * C02 = ?
3. How many bits are there in the VPN?
> * C03 = ?
4. How many PTEs will we have in our page table?
> * C04 = ?

Noticing that this is inefficient, we decide to use a two-level hierarchical page table with no TLBs. Our VPN bits are split evenly between the two levels.

5. How many PTEs are in the L1 page table?
> * C05 = ?
6. How many pages' worth of memory does a single L2 page table take?
> * C06 = ?
---

## Problem D

Consider the requirement to implement two new instructions for RISC-V:
* `push rs2`: Put the word stored in rs2 on the stack, decrementing the stack pointer as necessary.
* `pop rd`: Set rd to the bottom-most word of the stack, and remove that element from the stack.

It is undefined behavior to call `push sp` or `pop sp`. We first decide to implement push and pop as pseudo-instructions.

1. Write a two-instruction sequence that implements `push rs2`. Your instructions can use rs2.
> * D01 = ?
2. Write a two-instruction sequence that implements `pop rd`. Your instructions can use rd.
> * D02 = ?

Instead of implementing them as pseudo-instructions, we decide to modify our datapath to make them proper instructions that run in one cycle of a single-cycle datapath. For this question, we will modify the RISC-V datapath included in the lecture materials. In order to implement push and pop, we modify our immediate generator so that setting the ImmSel control signal to 7 always outputs an immediate with value 4.

3. What further changes would we need to make to our datapath in order to implement the push instruction with as few changes as possible? Answer N/A if no further changes are needed.
> * D03 = ?
4. What further changes would we need to make to our datapath in order to implement the pop instruction with as few changes as possible? Select all that apply:
**a**. Add a new instruction format
**b**. Add a mux before the DMEM address input, including relevant inputs and selectors
**c**. Add a second new immediate type for the ImmGen
**d**. Add a new output to Regfile for a third register value
**e**. Add a new input to AMux, with the relevant selectors
**f**. Add a new input to BMux, with the relevant selectors
**g**. Add a new ALU input for a third register input
**h**. Add a new ALU operation, with a corresponding ALUSel
**i**. Add a new input to WBMux, with the relevant selectors
**j**.
Add a second writeback input to Regfile, along with a second RegWEn
> * D04 = ?

5. Assuming the above changes were implemented, what hazards can be caused by push if we use the typical 5-stage pipeline? Select all that apply:
==a==. Data hazard
==b==. Control hazard
==c==. Neither data nor control hazard
> * D05 = ?

---

## Problem E

Consider a cache optimization known as "[sub-blocking](https://courses.cs.duke.edu/fall06/cps220/lectures/PPT/lect12.pdf)" (also called "sectored caches"):
* The number of sets and ways is unchanged.
* Each cache block is broken up into smaller "sub-blocks".
    * Each sub-block has its own valid bit.
    * On a cache miss, only the cache sub-blocks accessed by the user's program are loaded in.
        * Other sub-blocks remain in the "invalid" state until they are also loaded in.
* Make sure you understand that "sets" are not "sub-blocks".

Suppose that we have an 8 KB, two-way set-associative cache with 4-word (16-byte) cache lines. The replacement policy is LRU. Each cache block is broken up into four smaller sub-blocks. We will evaluate the following two loops:

- [ ] Loop A
```c
sum = 0;
for (int i = 0; i < 128; i++)
    for (int j = 0; j < 32; j++)
        sum += buf[i * 32 + j];
```

- [ ] Loop B
```c
sum = 0;
for (int j = 0; j < 32; j++)
    for (int i = 0; i < 128; i++)
        sum += buf[i * 32 + j];
```

1. What is the number of misses for Loop A and for Loop B with the sectored cache?
* number of misses for Loop A
> * E01 = ?
* number of misses for Loop B
> * E02 = ?
2. What is the number of misses for Loop A and for Loop B if the cache is not sectored (i.e., no sub-blocks)?
* number of misses for Loop A
> * E03 = ?
* number of misses for Loop B
> * E04 = ?

Suppose that we removed an outer-level cache to free up area on our chip. With this new area, we doubled the size of our L1 cache. Suppose that this optimization worsened the L1 hit time from 1 cycle to 2 cycles, and increased the miss penalty from 50 cycles to 100 cycles. Before this optimization, the L1 miss rate was $10\%$.
3. What does the new miss rate have to be for this optimization to improve the average L1 cache access time?
> * E05 = ?

---

## Problem F

Consider the implementation of the single-cycle processor we saw in the lecture:

![](https://hackmd.io/_uploads/SJDaW3zvi.png)

The timing characteristics of all components are listed below:

![](https://hackmd.io/_uploads/ry2yM3zvi.png)

Assume that any components not listed have a delay of 0 ns.

1. What is the minimum clock cycle time of this processor? Minimum t~CLK~ = ? ns
> * F01 = ?

Then, let's implement this alternative datapath (the control logic stays the same), where the data memory's address input comes from a different place:

![](https://hackmd.io/_uploads/BJqsz2GPi.png)

2. What is the minimum clock cycle time of the new processor? Minimum t~CLK~ = ? ns
> * F02 = ?

The program below takes 90 instructions to execute on the original processor. However, it produces incorrect results on the new processor. Modify the program so that it runs correctly on the new processor.

- [ ] C code
```c
int x[16];
for (int i = 0; i < 15; i++)
    x[i + 1] = x[i] + x[i + 1];
```

- [ ] RISC-V Assembly
```c
# Initial values:
# a0: address of x[0]
# a7: address of x[15]
loop:
    lw   a1, 0(a0)
    lw   a2, 4(a0)
    add  a3, a2, a1
    sw   a3, 4(a0)
    addi a0, a0, 4
    blt  a0, a7, loop
```

3. How many instructions does the new program execute?
> * F03 = ?
4. What is the execution time of the above program on the original processor? Answer in **ns**.
> * F04 = ?
5. What is the execution time of the above program on the new processor? Answer in **ns**.
> * F05 = ?

---

## Problem G

A 2-way set-associative cache with a block size of 4 would probably have better performance for certain applications. Below is a snapshot of this cache during the execution of some unknown code. V is the valid bit and D is the dirty bit of each set.

![](https://hackmd.io/_uploads/SJ3CS3fwj.png)

1. Would the following memory accesses result in a hit or a miss?
If an access results in a hit, specify what value is returned; if it is a miss, explain why in a few words or by showing your work.
* 32-Bit Byte Address: 0x50D8
    * Line index: __ G01 __
    * Tag: __ G02 __ (in HEX)
    * Block offset: __ G03 __
    * Returned value if hit / Explanation if miss: __ G04 __
* 32-Bit Byte Address: 0x2CB0
    * Line index: __ G05 __
    * Tag: __ G06 __ (in HEX)
    * Block offset: __ G07 __
    * Returned value if hit / Explanation if miss: __ G08 __
> * G01 = ?
> * G02 = ?
> * G03 = ?
> * G04 = ?
> * G05 = ?
> * G06 = ?
> * G07 = ?
> * G08 = ?

Then, we want to analyze the performance of this cache on the following benchmark assembly program, which repeatedly calculates the sum of all the elements in the 64-element array A. The base address of array A is 0x3000.

```c
// Assume the following registers are initialized:
// x2 = 64 (size of the array)
// x3 = 0x3000 (base address of array A)
. = 0x100               // The following code starts at address 0x100
sum_loop:
    li   x1, 0          // Start with i == 0
    li   x4, 0          // x4 = sum of the array
sum:
    slli x5, x1, 2      // x5 = byte offset of the ith element
    add  x5, x5, x3     // x5 = address of A[i]
    lw   x5, 0(x5)      // x5 = A[i]
    add  x4, x4, x5     // add element to sum
    addi x1, x1, 1      // increment i
    blt  x1, x2, sum    // continue looping
    jal  x0, sum_loop   // restart the benchmark
```

Answer the following questions about the behavior of the cache during execution of the above code. Assume that the cache uses a least recently used (LRU) replacement policy, that the cache is initially empty, and that all cache lines in Way 0 are currently the least recently used.

2. How many instruction fetches and data accesses occur per iteration of the outer loop (sum_loop)?
* Number of instruction fetches: __ G09 __
> * G09 = ?
* Number of data accesses: __ G10 __
> * G10 = ?

3. After the program has been running for many outer-loop iterations, what is the steady-state hit ratio for instruction fetches and data accesses? Be aware of the total size of the cache and the block size.
* Steady-state hit ratio for instruction fetches in the outer loop: __ G11 __
> * G11 = ?
* Steady-state hit ratio for data accesses: __ G12 __
> * G12 = ?