---
tags: computer-arch
---

# Quiz6 of Computer Architecture (2021 Fall)

> Solutions

## Problem A

Assume that you have to design a mystery module "M" to work with the newest pipeline, but you do not know the maximum latency this module can have. You decide to consider a few scenarios to calculate the maximum latency for module "M".

1. For your first iteration, you are told that you must have a 2-stage pipeline whose latency is no more than 16 ns. Using a minimal number of registers, please show the two-stage pipeline that maximizes the allowed latency of the "M" module. Then, calculate the maximum latency of module M and the throughput of the resulting circuit (using your value of M).
   ![](https://hackmd.io/_uploads/HJQdxPRcF.png)
   * Max latency of "M" module (ns): __ A01 __
   * Throughput with max latency "M" (ns^-1^): __ A02 __
   > * A01 = ? ==3==
   > * A02 = ? ==$\frac{1}{8}$==
   > ![](https://hackmd.io/_uploads/ByJJZPRqt.png)
2. For your next iteration, you are told that you must have a 4-stage pipeline whose latency is no more than 20 ns. Using a minimal number of registers, please show the four-stage pipeline that maximizes the allowed latency of the "M" module. Then, calculate the maximum latency of module M and the throughput of the resulting circuit (using your value of M).
   ![](https://hackmd.io/_uploads/Bk1d-wR5K.png)
   * Max latency of "M" module (ns): __ A03 __
   * Throughput with max latency "M" (ns^-1^): __ A04 __
   > * A03 = ? ==5==
   > * A04 = ? ==$\frac{1}{5}$==
   > ![](https://hackmd.io/_uploads/rJlJGwA5K.png)

---

## Problem B

Assume that we are working on building a 32-bit RISC-V processor. As part of this project, we are considering several cache designs. The hardware has a limited amount of memory available, so any cache design will hold a total of 32 32-bit words (plus any associated metadata such as tag, valid, and dirty bits when needed).

We first consider a **direct-mapped** cache using a block size of 4 words.

1. 
If the cache holds a total of 32 words, how many lines will be in this cache?
   * Number of lines in the direct-mapped cache: __ B01 __
   > * B01 = ? ==8==
2. To properly take advantage of locality, which address bits should be used for the block offset, the cache index, and the tag field? Express your answer using [minispec](https://minispec-hdl.github.io/minispec_reference.pdf)-style indexing into a 32-bit address. For example, the address bits for the byte offset would be expressed as addr[ 1 : 0 ].
   * Address bits for block offset: addr[ __ B02 __ : __ B03 __ ]
   * Address bits for cache index: addr[ __ B04 __ : __ B05 __ ]
   * Address bits for tag field: addr[ __ B06 __ : __ B07 __ ]
   > * B02 = ? ==3==
   > * B03 = ? ==2==
   > * B04 = ? ==6==
   > * B05 = ? ==4==
   > * B06 = ? ==31==
   > * B07 = ? ==7==
   > Bits 0 and 1 are the byte offset (recall that word addresses in RV32 always end in 00).
   > 4 words/block → lg(4) = 2 bits for the block offset (bits 3:2).
   > 8 lines → lg(8) = 3 bits for the cache index (bits 6:4).
   > The rest of the bits are tag bits (bits 31:7).

We now consider switching to a 2-way **set-associative** cache, again using a block size of 4 words and storing 32 data words total (along with necessary metadata). The 2-way cache uses an LRU replacement policy.

3. How does this change affect the number of sets and the size of the tag compared to the direct-mapped cache?
   * Change in the number of sets (select one): __ B08 __
     - (a) None / can't tell
     - (b) 0.5x
     - (c) 2x
     - (d) -1
     - (e) +1
     - (f) Unchanged
   * Change in the size of the tag in bits (select one): __ B09 __
     - (a) None / can't tell
     - (b) 0.5x
     - (c) 2x
     - (d) -1
     - (e) +1
     - (f) Unchanged
   > * B08 = ? ==b==
   > * B09 = ? ==e==
   > Switching to two ways halves the number of cache sets, since each cache set now contains two cache lines (one in each way).
   > Halving the number of cache sets reduces the number of set-index bits by 1, which increases the number of tag bits by 1.
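The field boundaries above, and the effect of adding a second way in part 3, can be sanity-checked with a short script. This is a sketch; `cache_fields` is our own helper name, and it assumes 4-byte words (so bits 1:0 are always the byte offset):

```python
from math import log2

def cache_fields(total_words, block_words, ways, addr_bits=32):
    """Return (block offset, set index, tag) bit ranges as (hi, lo) pairs.

    Bits 1:0 are the byte offset for 32-bit words.
    """
    sets = total_words // block_words // ways     # lines per way
    off_bits = int(log2(block_words))             # block-offset width
    idx_bits = int(log2(sets))                    # set-index width
    off = (1 + off_bits, 2)                       # block-offset field
    idx = (off[0] + idx_bits, off[0] + 1)         # set-index field
    tag = (addr_bits - 1, idx[0] + 1)             # tag field
    return off, idx, tag

print(cache_fields(32, 4, 1))  # ((3, 2), (6, 4), (31, 7))  direct-mapped
print(cache_fields(32, 4, 2))  # ((3, 2), (5, 4), (31, 6))  2-way: one fewer index bit, one more tag bit
```

The two printed lines match B02–B07 and the B08/B09 reasoning: going 2-way drops one index bit (8 sets → 4 sets) and grows the tag by one bit.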
We decide to use a 2-way **set-associative** cache with 4 cache sets and a block size of 4 words for our processor and would like to test out our caching system. We write the following code to simulate checking if the array from a quicksort function is properly sorted. The code iterates over a 200-element array and checks for correct ordering. You may treat unimp as a 32-bit instruction that terminates the program.

```cpp
. = 0x100                 // The following code starts at address 0x100
check_sorted:             // Initialize some registers
    li   t0, 0            // t0 = loop index
    li   t1, 199          // t1 = array size - 1
    lui  t2, 0x3          // t2 = starting address of array = 0x3000
loop:
    lw   t3, 0(t2)        // Load current element
    lw   t4, 4(t2)        // Load next element
    ble  t3, t4, endif    // Assume branch is always TAKEN
if_unsorted:
    unimp                 // If two elements are out of order, terminate
endif:
    addi t0, t0, 1
    addi t2, t2, 4
    blt  t0, t1, loop     // Continue checking; assume branch is TAKEN
    ret
```

For the rest of this problem, assume that the code is running at steady state (i.e., the code is in the middle of the array) and that the array is sorted correctly. In other words, assume that both the ble and blt branches are always taken. Also, you may assume that when execution of the code started, all cache lines were set to invalid and Way 0 was the LRU way for each cache set.

4. For one iteration of the loop (i.e., from the loop label to the blt instruction), how many instruction fetches and data accesses are performed?
   * Number of Instruction Fetches: __ B10 __
   * Number of Data Accesses: __ B11 __
   > * B10 = ? ==6==
   > * B11 = ? ==2==
   > Note that the unimp instruction is never fetched as long as the ble branch is always taken, since the PC will never point to unimp.
5. In the steady state (i.e., ignoring any cold-start effects), what is the instruction fetch hit ratio and the data access hit ratio? Note: please provide the hit ratio and not the miss ratio.
You may use the cache diagram provided below to help you, but nothing written in the diagram will be graded.
   ![](https://hackmd.io/_uploads/HJ15PPC9Y.png)
   * Instruction Fetch HIT Ratio: __ B12 __
   * Data Access HIT Ratio: __ B13 __
   > * B12 = ? ==1==
   > All instructions reside in sets 0, 1, and 2 of Way 0 and are never evicted. Thus the instruction hit ratio is 1.
   > * B13 = ? ==$\frac{7}{8}$== or ==0.875==
   > Data accesses are split between Way 0 and Way 1, but ultimately they fit around all the instructions in the cache (which are always accessed more recently and so do not get evicted).
   > Each data miss causes the cache to load four consecutive words (four words/block). Each iteration performs 2 data accesses. Every four iterations, one of these accesses misses. This corresponds to 1 miss per 8 accesses, for a hit ratio of $\frac{7}{8}$.
6. Assume that it takes 2 cycles to access the cache. We have benchmarked our code to have an average hit ratio of 0.9. If we need to achieve an average memory access time (AMAT) of at most 4 cycles, what is the upper bound on our miss penalty? Miss penalty here is defined as the amount of additional time required beyond the 2-cycle cache access to handle a cache miss.
   * Maximum possible miss penalty: __ B14 __ cycles
   > * B14 = ? ==20==
   > AMAT = cache access time + miss penalty × (1 − hit ratio)
   > 4 ≥ 2 + miss penalty × (1 − 0.9)
   > miss penalty ≤ 20

---

## Problem C

Consider that we are analyzing grade statistics and are performing some hefty calculations, so we suspect that a cache could improve our system's performance.

1. We are considering using a 2-way set-associative cache with a block size of 4 (i.e., 4 words per line). The cache can store a total of 64 words. Assume that addresses and data words are 32 bits wide. To properly make use of locality, which address bits should be used for the block offset, the cache index, and the tag field?
* Address bits used for byte offset: A[ 1 : 0 ]
   * Address bits used for tag field: A[ __ C01 __ : __ C02 __ ]
   * Address bits used for block offset: A[ __ C03 __ : __ C04 __ ]
   * Address bits used for cache index: A[ __ C05 __ : __ C06 __ ]
   > * C01 = ? ==31==
   > * C02 = ? ==7==
   > * C03 = ? ==3==
   > * C04 = ? ==2==
   > * C05 = ? ==6==
   > * C06 = ? ==4==
   > Block size of 4 → 2 bits for block offset
   > 64 words / 4 words per block = 16 blocks
   > 16 blocks / 2 ways = 8 sets → 3 bits for cache index
   > 32 − 3 − 2 − 2 = 25 bits of tag
2. If we instead used a direct-mapped cache with the same total capacity (64 words) and the same block size (4 words), how would the following parameters in our system change?
   * Change in the number of cache lines (select all of the choices below that apply): __ C07 __
     - (a) None / can't tell
     - (b) 0.5x
     - (c) 2x
     - (d) -1
     - (e) +1
     - (f) Unchanged
   * Change in the number of bits in the tag field (select one of the choices below): __ C08 __
     - (a) None / can't tell
     - (b) 0.5x
     - (c) 2x
     - (d) -1
     - (e) +1
     - (f) Unchanged
   > * C07 = ? ==c,f==
   > Both "Unchanged" and "2x" were accepted, because the 2-way cache has $\frac{1}{2}$ the number of *sets* of the direct-mapped cache, but the number of cache *lines* is actually the same.
   > * C08 = ? ==d==
   > Block size of 4 → 2 bits for block offset
   > 64 words / 4 words per block = 16 blocks → 4 bits for cache line index
   > 32 − 4 − 2 − 2 = 24 bits of tag

Ultimately, we decide that the 2-way set-associative cache would probably have better performance for our application, so the remainder of the problem considers a 2-way set-associative cache. Below is a snapshot of this cache during the execution of some unknown code. V is the valid bit and D is the dirty bit of each cache line.

![](https://hackmd.io/_uploads/rkJFAwC5Y.png)

3. Would the following memory accesses result in a hit or a miss? If it results in a hit, specify what value is returned; if it is a miss, explain why in a few words or by showing your work.
- [ ] 32-Bit Byte Address: 0x4AB4
  * Line index: __ C09 __
  * Tag: __ C10 __ (in HEX)
  * Block offset: __ C11 __
  * Returned value if hit (in HEX) / Explanation if miss: __ C12 __
- [ ] 32-Bit Byte Address: 0x21E0
  * Line index: __ C13 __
  * Tag: __ C14 __
  * Block offset: __ C15 __
  * Returned value if hit (in HEX) / Explanation if miss: __ C16 __

> * C09 = ? ==3==
> * C10 = ? ==0x95==
> * C11 = ? ==1==
> * C12 = ? ==0xD4==
> * C13 = ? ==6==
> * C14 = ? ==43==
> * C15 = ? ==0==
> * C16 = ? ==miss, valid bit is 0== (answers with a similar meaning receive credit)

---

## Problem D

Consider the case where we lost the final iteration among two other prototypes while building the pipelined RISC-V processor. Show all bypasses used in each cycle where the processor is not stalled, and determine the values of x3 and x4 after executing these 12 cycles. Below are the code and the initial state of the relevant registers in the register file. Note that the values in registers x1–x5 are given in decimal while x6 is in hexadecimal.

```cpp
start:
    lw   x1, 0(x6)
    addi x2, x0, 5
    blt  x1, x2, end
    addi x3, x2, 11
    sub  x4, x3, x1
    xori x5, x6, 0x1
end:
    sub  x4, x3, x2
    addi x3, x4, 7
    addi x3, x1, 3

. = 0x400
.word 0x1
```

| Register | Value |
| -------- | ----- |
| x1       | 5     |
| x2       | 11    |
| x3       | 30    |
| x4       | 19    |
| x5       | 20    |
| x6       | 0x400 |

Assume the processor is built as a 5-stage pipelined RISC-V processor, which is fully bypassed and has branch annulment. Branch decisions are made in the EXE stage, and branches are always predicted not taken. What are the values of registers x3 and x4 at the start of cycle 12 (in decimal)?

* x3 : __ D01 __
* x4 : __ D02 __

> * D01 = ? ==32==
> * D02 = ? ==25==
> ![](https://hackmd.io/_uploads/HJ1RkKA5F.png)

---

## Problem E

This problem evaluates the cache performance of different loop orderings.
You are asked to consider the following two loops, written in C, which calculate the sum of the entries in a 128-by-32 matrix of 32-bit integers:

- [ ] Loop A

```cpp
sum = 0;
for (i = 0; i < 128; i++)
    for (j = 0; j < 32; j++)
        sum += A[i][j];
```

- [ ] Loop B

```cpp
sum = 0;
for (j = 0; j < 32; j++)
    for (i = 0; i < 128; i++)
        sum += A[i][j];
```

The matrix A is stored contiguously in memory in row-major order. Row-major order means that elements in the same row of the matrix are adjacent in memory, as shown in the following memory layout: A[i][j] resides in memory location [4*(32*i + j)].

Memory Location:
![](https://hackmd.io/_uploads/SJTUWKR5Y.png)

For Problems 1 to 3, assume that the caches are initially empty. Also, assume that only accesses to matrix A cause memory references and that all other necessary variables are stored in registers. Instructions are in a separate instruction cache.

1. Consider a 4KB direct-mapped data cache with 8-word (32-byte) cache lines.
   * Calculate the number of cache misses that will occur when running Loop A. __ E01 __
   * Calculate the number of cache misses that will occur when running Loop B. __ E02 __
   > * E01 = ? ==512==
   > * E02 = ? ==4096==
   > Each element of the 128x32 matrix A can be mapped to only one particular cache location in this direct-mapped data cache. Since each row has 32 32-bit integers, and since each cache line can hold 8 32-bit ints, a row of the matrix occupies the lines of 4 consecutive sets of the cache.
   >
   > Loop A, where each iteration of the inner loop sums a row of A, accesses memory addresses in a linear sequence. Given this access pattern, the access to the first word in each cache line will miss, but the next seven accesses will hit. After sequentially moving through a line, it will not be accessed again, so its later eviction will not cause any future misses. Therefore, Loop A will only have the compulsory misses for the 512 lines (128 rows x 4 lines per row) that matrix A spans.
>
   > The consecutive accesses in Loop B move with a stride of 32 words. Therefore, the inner loop touches the first element of 128 cache lines before the next iteration of the outer loop. While intuition might suggest that the 128 lines could all fit in the cache's 128 sets, there is a complicating factor: each row is four cache lines past the previous row, meaning that the lines accessed when traversing the first column go into set indices 0, 4, 8, 12, and so on. Since the lines containing the column compete for only one quarter of the sets, the lines loaded when starting a column are evicted by the time the column is complete, preventing any reuse. Therefore, all 4096 (128 x 32) accesses miss.
2. Consider a direct-mapped data cache with 8-word (32-byte) cache lines. Calculate the minimum number of cache lines required for the data cache if Loop A is to run without any cache misses other than compulsory misses. Calculate the minimum number of cache lines required for the data cache if Loop B is to run without any cache misses other than compulsory misses.
   * Data-cache size required for Loop A: __ E03 __ cache line(s)
   * Data-cache size required for Loop B: __ E04 __ cache line(s)
   > * E03 = ? ==1==
   > * E04 = ? ==512==
   > Since Loop A accesses memory sequentially, we can sum all the elements in a cache line and then never touch it again. Therefore, we only need to hold 1 active line at any given time to avoid all but compulsory misses.
   >
   > For Loop B to run without any cache misses other than compulsory misses, the data cache needs to be able to hold one column of matrix A. Since the consecutive accesses in the inner loop of Loop B use one out of every four cache lines, and since we have 128 rows, Loop B requires 512 (128 × 4) lines to avoid all but compulsory misses.
3. Consider a 4KB set-associative data cache with 4 ways and 8-word (32-byte) cache lines. This data cache uses a first-in/first-out (FIFO) replacement policy.
* The number of cache misses for Loop A: __ E05 __
   * The number of cache misses for Loop B: __ E06 __
   > * E05 = ? ==512==
   > * E06 = ? ==4096==
   > Loop A still only has the 512 (128 rows x 4 lines per row) compulsory misses.
   > Loop B still cannot fully utilize the cache. The first 8 accesses allocate into way 1 of sets 0, 4, 8, 12, etc.; the next 8 accesses allocate into way 2 of those same sets; and so on. After 32 accesses, all four ways are filled, and the next 8 accesses along the column evict the previous lines in way 1, preventing any reuse. Therefore, all 4096 (128 x 32) accesses miss.

---

## Problem F

The following diagram shows a classic fully-bypassed 5-stage pipeline that has been augmented with an unpipelined divider in parallel with the ALU. Bypass paths are not shown in the diagram. This iterative divider produces 2 bits per cycle until it outputs a full 32-bit result.

![](https://hackmd.io/_uploads/BkXq8t05K.png)

1. What is the latency of a divide operation in cycles? __ F01 __
   > * F01 = ? ==16==
   > 32 / 2 = 16 cycles
2. What is the occupancy of a divide operation in cycles? __ F02 __
   > * F02 = ? ==16==
   > Since the divider is unpipelined, a new divide cannot start until the current one finishes, so the occupancy equals the 16-cycle latency.

---

## Problem G

Given the following chunk of code, analyze the hit rate, given that we have a byte-addressed computer with a total memory of 1 MiB. It also features a 16 KiB direct-mapped cache with 1 KiB blocks. Assume that the cache begins cold.

```cpp
#define NUM_INTS 8192               // 2^13
int A[NUM_INTS];                    // A lives at 0x10000
int i, total = 0;
for (i = 0; i < NUM_INTS; i += 128)
    A[i] = i;                       // Code 1
for (i = 0; i < NUM_INTS; i += 128)
    total += A[i];                  // Code 2
```

1. How many bits make up a memory address on this computer? __ G01 __
   > * G01 = ? ==20==
   > We take log~2~(1 MiB) = log~2~(2^20^) = 20.
2. What is the T:I:O breakdown?
   * Offset: __ G02 __
   * Index: __ G03 __
   * Tag: __ G04 __
   > * G02 = ? ==10==
   > * G03 = ? ==4==
   > * G04 = ? ==6==
   > Offset = log~2~(1 KiB) = log~2~(2^10^) = 10
   > Index = $\log_2(\frac{16\,\text{KiB}}{1\,\text{KiB}}) = \log_2(16) = 4$
   > Tag = 20 − 10 − 4 = 6
3. Calculate the cache hit rate for the line marked as Code 1: __ G05 __
   > * G05 = ? ==50%== (or any equivalent representation)
   > The integer accesses are 4 × 128 = 512 bytes apart, which means there are 2 accesses per 1 KiB block. The first access to each block is a compulsory miss, but the second is a hit because A[i] and A[i+128] lie in the same cache block. Thus, we end up with a hit rate of 50%.
4. Calculate the cache hit rate for the line marked as Code 2: __ G06 __
   > * G06 = ? ==50%==
   > The size of A is 8192 × 4 = 2^15^ bytes. This is exactly twice the size of our cache. At the end of Code 1, we have the second half of A inside our cache, but Code 2 starts with the first half of A. Thus, we cannot reuse any of the cache data brought in by Code 1 and must start from the beginning. Our hit rate is therefore the same as for Code 1, since we access memory in exactly the same way. We do not have to consider cache hits for total, as the compiler will most likely keep it in a register. Thus, we again end up with a hit rate of 50%.

---
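The reasoning in parts 3 and 4 can be sanity-checked with a minimal simulation of the 16 KiB direct-mapped cache over both loops. This is a sketch that models only the accesses to A (total stays in a register); `hit_rate` is our own helper name, not part of the quiz:

```python
def hit_rate(passes=2, base=0x10000, n=8192, stride=128,
             cache_bytes=16 * 1024, block_bytes=1024):
    """Simulate a direct-mapped cache over Code 1 and Code 2.

    Both loops touch A[0], A[128], ... at the same byte addresses,
    so each pass replays the same access stream.
    """
    n_sets = cache_bytes // block_bytes   # 16 sets of 1 KiB blocks
    tags = [None] * n_sets                # one tag per set; None = invalid
    hits = accesses = 0
    for _ in range(passes):               # pass 0 = Code 1, pass 1 = Code 2
        for i in range(0, n, stride):
            block = (base + 4 * i) // block_bytes
            s, tag = block % n_sets, block // n_sets
            accesses += 1
            if tags[s] == tag:
                hits += 1
            else:
                tags[s] = tag             # miss: fill the set with this block
    return hits / accesses

print(hit_rate())  # 0.5 -> 50% across both loops, matching G05 and G06
```

Each 1 KiB block holds 256 ints, so the 512-byte stride yields one miss then one hit per block; and because A is twice the cache size, Code 2 finds none of Code 1's blocks still resident, reproducing the 50% rate for each loop.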