3 points.
A
Assume that you have to design a mystery module "M" to work with the newest pipeline, but you do not know the maximum latency this module can have. You decide to consider a few scenarios to calculate the maximum latency for module "M".
For your first iteration, you are told that you must have a 2-stage pipeline whose latency is no more than 16ns. Using a minimal number of registers, please show the two-stage pipeline that maximizes the allowed latency of the "M" module. Then, calculate the maximum latency of module M and the throughput of the resulting circuit (using your value of M).
- A01 = ?
- A02 = ?
For your next iteration, you are told that you must have a 4-stage pipeline whose latency is no more than 20ns. Using a minimal number of registers, please show the four-stage pipeline that maximizes the allowed latency of the "M" module. Then, calculate the maximum latency of module M and the throughput of the resulting circuit (using your value of M).
- A03 = ?
- A04 = ?
B
Assume that we are working on building a 32-bit RISC-V processor. As part of this project, we are considering several cache designs. The hardware has a limited amount of memory available, so any cache design will hold a total of 32 32-bit words (plus any associated metadata such as tag, valid, and dirty bits when needed).
We first consider a direct-mapped cache using a block size of 4 words.
If the cache holds a total of 32 words, how many lines will be in this cache?
- B01 = ?
To properly take advantage of locality, which address bits should be used for the block offset, the cache index, and the tag field? Express your answer using minispec-style indexing into a 32-bit address. For example, the address bits for the byte offset would be expressed as addr[1:0].
- B02 = ?
- B03 = ?
- B04 = ?
- B05 = ?
- B06 = ?
- B07 = ?
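The field boundaries follow directly from the cache geometry. The sketch below is one way to derive them; the helper name `direct_mapped_fields` is hypothetical, and it assumes 4-byte words, power-of-two sizes, and the usual ordering in which the lowest address bits select the byte, then the word within the block, then the cache line, with all remaining bits forming the tag.

```python
def direct_mapped_fields(total_words, words_per_block, addr_bits=32, bytes_per_word=4):
    """Return (num_lines, fields); fields maps a field name to (msb, lsb)."""
    num_lines = total_words // words_per_block
    # For a power of two n, n.bit_length() - 1 equals log2(n).
    widths = (
        ("byte offset", bytes_per_word.bit_length() - 1),
        ("block offset", words_per_block.bit_length() - 1),
        ("index", num_lines.bit_length() - 1),
    )
    fields = {}
    lsb = 0
    for name, width in widths:
        fields[name] = (lsb + width - 1, lsb)
        lsb += width
    fields["tag"] = (addr_bits - 1, lsb)  # everything above the index
    return num_lines, fields


num_lines, fields = direct_mapped_fields(total_words=32, words_per_block=4)
print("lines:", num_lines)
for name, (msb, lsb) in fields.items():
    print(f"{name}: addr[{msb}:{lsb}]")
```

Changing `total_words` or `words_per_block` shows how the index and tag boundaries move as the geometry changes.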
We now consider switching to a 2-way set-associative cache, again using a block size of 4 words and storing 32 data words total (along with necessary metadata). The 2-way cache uses an LRU replacement policy.
(a) None / can't tell
(b) 0.5x
(c) 2x
(d) -1
(e) +1
(f) Unchanged
- B08 = ?
- B09 = ?
We decide to use a 2-way set-associative cache with 4 cache sets and a block size of 4 words for our processor and would like to test out our caching system. We write the following code to simulate checking if the array from a quicksort function is properly sorted. The code iterates over a 200-element array and checks for correct ordering. You may treat unimp as a 32-bit instruction that terminates the program.
For the rest of this problem, assume that the code is running at steady state (i.e., the code is in the middle of the array) and that the array is sorted correctly. In other words, assume that both the ble and blt branches are always taken. Also, you may assume that when execution of the code started, all cache lines were set to invalid and Way 0 was the LRU way for each cache set.
For one iteration of the loop (i.e. from the loop label to the blt instruction), how many instruction fetches and data accesses are performed?
- B10 = ?
- B11 = ?
In the steady state (i.e. ignoring any cold-start effects), what is the instruction fetch hit ratio and the data access hit ratio? Note: please provide the hit ratio and not the miss ratio. You may use the cache diagram provided below to help you, but nothing written in the diagram will be graded.
- B12 = ?
- B13 = ?
Assume that it takes 2 cycles to access the cache. We have benchmarked our code to have an average hit ratio of 0.9. If we need to achieve an average memory access time (AMAT) of at most 4 cycles, what is the upper bound on our miss penalty? Miss penalty here is defined as the amount of additional time required beyond the 2-cycle cache access to handle a cache miss.
- B14 = ?
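With the miss penalty defined as extra time on top of the cache access, the usual relation is AMAT = hit time + miss ratio × miss penalty. A minimal sketch of solving that relation for the penalty follows; the function name `max_miss_penalty` is hypothetical, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def max_miss_penalty(hit_time, hit_ratio, target_amat):
    """Largest penalty satisfying hit_time + (1 - hit_ratio) * penalty <= target_amat."""
    miss_ratio = 1 - hit_ratio
    return (target_amat - hit_time) / miss_ratio


bound = max_miss_penalty(hit_time=2, hit_ratio=Fraction("0.9"), target_amat=4)
print(bound)  # upper bound on the miss penalty, in cycles
```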
C
Consider that we are analyzing grade statistics and are performing some hefty calculations, so we suspect that a cache could improve our system's performance.
We are considering using a 2-way set-associative cache with a block size of 4 (i.e. 4 words per line). The cache can store a total of 64 words. Assume that addresses and data words are 32 bits wide. To properly make use of locality, which address bits should be used for the block offset, the cache index, and the tag field?
- C01 = ?
- C02 = ?
- C03 = ?
- C04 = ?
- C05 = ?
- C06 = ?
If we instead used a direct-mapped cache with the same total capacity (64 words) and same block size (4 words), how would the following parameters in our system change?
(a) None / can't tell
(b) 0.5x
(c) 2x
(d) -1
(e) +1
(f) Unchanged
- C07 = ?
- C08 = ?
Ultimately, we decide that the 2-way set associative cache would probably have better performance for our application, so the remainder of the problem will be considering a 2-way set associative cache. Below is a snapshot of this cache during the execution of some unknown code. V is the valid bit and D is the dirty bit of each set.
Would the following memory accesses result in a hit or a miss? If it results in a hit, specify what value is returned; if it is a miss, explain why in a few words or by showing your work.
0x4AB4
0x21E0
- C09 = ?
- C10 = ?
- C11 = ?
- C12 = ?
- C13 = ?
- C14 = ?
- C15 = ?
- C16 = ?
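Deciding hit or miss starts with splitting each address into its fields. The sketch below does only that split; the cache snapshot itself is not reproduced here, so the final tag comparison is left to the reader. The helper name `split_address` is hypothetical, and the geometry is an assumption derived from the problem statement: 64 words / 4 words per block / 2 ways = 8 sets, with 4-byte words.

```python
def split_address(addr, words_per_block=4, num_sets=8, bytes_per_word=4):
    """Split a byte address into (tag, index, block_offset, byte_offset)."""
    byte_bits = bytes_per_word.bit_length() - 1
    block_bits = words_per_block.bit_length() - 1
    index_bits = num_sets.bit_length() - 1

    byte_offset = addr & (bytes_per_word - 1)
    block_offset = (addr >> byte_bits) & (words_per_block - 1)
    index = (addr >> (byte_bits + block_bits)) & (num_sets - 1)
    tag = addr >> (byte_bits + block_bits + index_bits)
    return tag, index, block_offset, byte_offset


for addr in (0x4AB4, 0x21E0):
    tag, index, block_offset, byte_offset = split_address(addr)
    print(f"{addr:#06x}: tag={tag:#x} index={index} word={block_offset} byte={byte_offset}")
```

An access hits only if one of the two ways in the selected set is valid and holds the same tag; the word offset then picks the value to return.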
D
Consider the case where we lost the final iteration among two other prototypes while building the pipelined RISC-V processor. Show all bypasses used in each cycle where the processor is not stalled. For the processor, also determine the values of x3 and x4 after executing these 12 cycles. Below is the code and the initial state of the relevant registers in the register file for each of the three processors. Note that the values in registers x1–x5 are given in decimal while x6 is in hexadecimal. A copy of the code and initial register values is provided for the processor.
| Register | Value |
|---|---|
| x1 | 5 |
| x2 | 11 |
| x3 | 30 |
| x4 | 19 |
| x5 | 20 |
| x6 | 0x400 |
Assume the processor is built as a 5-stage pipelined RISC-V processor, which is fully bypassed and has branch annulment. Branch decisions are made in the EXE stage and branches are always predicted not taken.
What are the values of registers x3 and x4 at the start of cycle 12 (in decimal)?
- D01 = ?
- D02 = ?
E
This problem evaluates cache performance for different loop orderings. You are asked to consider the following two loops, written in C, which calculate the sum of the entries in a 128 by 32 matrix of 32-bit integers:
The matrix A is stored contiguously in memory in row-major order. Row major order means that elements in the same row of the matrix are adjacent in memory as shown in the following memory layout:
A[i][j] resides in memory location [4*(32*i + j)]
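The row-major formula can be checked with a few concrete elements. This is a small sketch; the function name `byte_offset` and the constants are illustrative names, and offsets are taken relative to the start of A.

```python
ROWS, COLS, WORD_BYTES = 128, 32, 4


def byte_offset(i, j):
    """Byte offset of A[i][j] from the start of A in row-major order."""
    return WORD_BYTES * (COLS * i + j)


# Neighbours in the same row are one word apart.
print(byte_offset(0, 1) - byte_offset(0, 0))
# Neighbours in the same column are one full row apart.
print(byte_offset(1, 0) - byte_offset(0, 0))
# Offset of the last element of the matrix.
print(byte_offset(ROWS - 1, COLS - 1))
```

The stride between elements in the same column is what makes the two loop orderings behave so differently in a cache.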
For Problems 1 to 3, assume that the caches are initially empty. Also, assume that only accesses to matrix A cause memory references and all other necessary variables are stored in registers. Instructions are in a separate instruction cache.
Consider a 4KB direct-mapped data cache with 8-word (32-byte) cache lines.
A. __ E01 __
B. __ E02 __
- E01 = ?
- E02 = ?
Consider a direct-mapped data cache with 8-word (32-byte) cache lines. Calculate the minimum number of cache lines required for the data cache if Loop A is to run without any cache misses other than compulsory misses. Calculate the minimum number of cache lines required for the data cache if Loop B is to run without any cache misses other than compulsory misses.
A: __ E03 __ cache line(s)
B: __ E04 __ cache line(s)
- E03 = ?
- E04 = ?
Consider a 4KB set-associative data cache with 4 ways, and 8-word (32-byte) cache lines. This data cache uses a first-in/first-out (FIFO) replacement policy.
A: __ E05 __
B: __ E06 __
- E05 = ?
- E06 = ?
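Miss counts for a given access order can be cross-checked with a small simulator. The sketch below is an illustration under stated assumptions: the two loops are not reproduced here, so `row_order` and `column_order` are hypothetical traversals (row index in the outer loop versus column index in the outer loop) and are not claimed to match Loop A or Loop B. The cache starts empty, A starts at address 0, and `ways=1` models a direct-mapped cache.

```python
from collections import deque

ROWS, COLS, WORD_BYTES, LINE_BYTES = 128, 32, 4, 32


def count_misses(addresses, cache_bytes=4096, ways=1):
    """Count misses in an initially empty set-associative cache with FIFO replacement."""
    num_sets = cache_bytes // (LINE_BYTES * ways)
    sets = [deque() for _ in range(num_sets)]  # oldest resident line on the left
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        resident = sets[line % num_sets]
        if line not in resident:
            misses += 1
            if len(resident) == ways:
                resident.popleft()  # FIFO: evict the line that was filled first
            resident.append(line)
        # On a hit nothing is reordered, which is what distinguishes FIFO from LRU.
    return misses


def row_order():
    return [WORD_BYTES * (COLS * i + j) for i in range(ROWS) for j in range(COLS)]


def column_order():
    return [WORD_BYTES * (COLS * i + j) for j in range(COLS) for i in range(ROWS)]


for ways in (1, 4):
    print(ways, count_misses(row_order(), ways=ways), count_misses(column_order(), ways=ways))
```

Varying `cache_bytes` with `ways=1` is also a quick way to explore how many lines each traversal needs before only compulsory misses remain.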
F
The following diagram shows a classic fully-bypassed 5-stage pipeline that has been augmented with an unpipelined divider in parallel with the ALU. Bypass paths are not shown in the diagram. This iterative divider produces 2 bits per cycle until it outputs a full 32-bit result.
What is the latency of a divide operation in cycles? __ F01 __
- F01 = ?
What is the occupancy of a divide operation in cycles? __ F02 __
- F02 = ?
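A rough sketch of the iteration count is given below. It counts only the cycles the divider spends producing result bits and assumes no extra setup or write-back cycles; the constant names are illustrative. Because the divider is unpipelined, it cannot accept a new divide until the current one finishes, so under these assumptions the same count applies to both questions.

```python
import math

RESULT_BITS = 32
BITS_PER_CYCLE = 2

# Cycles needed to produce the full result at 2 bits per cycle.
iterations = math.ceil(RESULT_BITS / BITS_PER_CYCLE)
print(iterations)
```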
G
Given the following chunk of code, analyze the hit rate given that we have a byte-addressed computer with a total memory of 1 MiB. It also features a 16 KiB direct-mapped cache with 1 KiB blocks. Assume that your cache begins cold.
How many bits make up a memory address on this computer? __ G01 __
- G01 = ?
What is the T:I:O breakdown?
- G02 = ?
- G03 = ?
- G04 = ?
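The address width and T:I:O split follow from the three sizes given in the problem. A minimal sketch, assuming power-of-two sizes and a direct-mapped cache; the helper name `tio_breakdown` is hypothetical.

```python
def tio_breakdown(memory_bytes, cache_bytes, block_bytes):
    """Return (address_bits, tag_bits, index_bits, offset_bits) for a direct-mapped cache."""
    # For a power of two n, (n - 1).bit_length() equals log2(n).
    addr_bits = (memory_bytes - 1).bit_length()
    offset_bits = (block_bytes - 1).bit_length()
    index_bits = (cache_bytes // block_bytes - 1).bit_length()
    tag_bits = addr_bits - index_bits - offset_bits
    return addr_bits, tag_bits, index_bits, offset_bits


KIB, MIB = 1 << 10, 1 << 20
addr_bits, tag_bits, index_bits, offset_bits = tio_breakdown(1 * MIB, 16 * KIB, 1 * KIB)
print(addr_bits, f"{tag_bits}:{index_bits}:{offset_bits}")
```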
Calculate the cache hit rate for the line marked as Code 1: __ G05 __
- G05 = ?
Calculate the cache hit rate for the line marked as Code 2: __ G06 __
- G06 = ?