Solutions
A
Average Memory Access Time (AMAT) represents the average time required to access memory. The main formula is:

AMAT = Hit Time + Miss Rate × Miss Penalty

In a multi-level cache system, two types of miss rates are considered for each cache level:

- Local miss rate: misses in this cache divided by the accesses that reach this cache.
- Global miss rate: misses in this cache divided by all memory accesses made by the CPU.
In a two-level cache system, suppose there were 100 total accesses to the entire cache system, and the L2 cache experienced 20 misses. If the L1 cache has a local miss rate of 50%, what is the local miss rate of the L2 cache? __ A01 __
40%
Since the L1 cache has a local miss rate of 50%, the L2 cache is accessed 50 times out of the 100 total memory accesses. Given that the L2 cache experienced 20 misses, its local miss rate is 20 / 50 = 40%.
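The arithmetic above can be checked directly with the numbers from the problem:

```python
# Numbers from the problem statement.
total_accesses = 100
l1_local_miss_rate = 0.50
l2_misses = 20

l2_accesses = total_accesses * l1_local_miss_rate   # L1 misses are forwarded to L2
l2_local_miss_rate = l2_misses / l2_accesses
print(l2_local_miss_rate)  # 0.4, i.e. 40%
```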
Suppose the system has the following characteristics:
What is the local miss rate of the L2 cache? __ A02 __
25%
The number of accesses to the L2 cache equals the number of misses in the L1 cache. To find the local miss rate of the L2 cache, we divide its global miss rate by the local miss rate of the L1 cache:

Local Miss Rate (L2) = Global Miss Rate (L2) / Local Miss Rate (L1)

Substituting the values from the system characteristics yields 25%.
Suppose we aim to reduce the system's AMAT to 8 cycles or lower by introducing an L3 cache. If the L3 cache has a local miss rate of 30%, what is the maximum allowable hit time for the L3 cache? __ A03 __ cycles

30
Let H represent the hit time of the L3 cache. By extending the AMAT equation, the miss penalty of the L2 cache becomes the local AMAT of the L3 cache, i.e. H + 30% × (memory access time). To ensure the system AMAT does not exceed 8 cycles, we solve the resulting inequality for H; simplifying yields H ≤ 30. Thus, the maximum hit time for the L3 cache is 30 cycles.
Given the following block of C code, determine the cache hit rate on a byte-addressed system with a total memory size of 1 MiB. The system has a 16 KiB direct-mapped cache with 1 KiB blocks. Assume the cache starts cold.
What is the Tag:Index:Offset (TIO) breakdown? Tag: __ A04 __ bits; Index: __ A05 __ bits; Offset: __ A06 __ bits
6, 4, 10
The breakdown of the Tag:Index:Offset (T/I/O) is calculated as follows: a 1 MiB memory requires 20-bit addresses. Offset = log2(1 KiB) = 10 bits; Index = log2(16 KiB / 1 KiB) = log2(16) = 4 bits; Tag = 20 − 4 − 10 = 6 bits.
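The same breakdown can be computed mechanically from the cache parameters:

```python
import math

# TIO breakdown for a 1 MiB byte-addressed memory with a 16 KiB
# direct-mapped cache and 1 KiB blocks (values from the problem).
address_bits = int(math.log2(1 * 1024 * 1024))      # 20-bit addresses
offset_bits  = int(math.log2(1024))                 # byte offset within a block
index_bits   = int(math.log2((16 * 1024) // 1024))  # one set per block (direct-mapped)
tag_bits     = address_bits - index_bits - offset_bits
print(tag_bits, index_bits, offset_bits)  # 6 4 10
```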
Determine the cache hit rate for the line labeled as Line 7. __ A07 __
50%
Since integer accesses are spaced 512 bytes apart (128 ints × 4 bytes), each 1 KiB cache block holds two accesses. The first access to each block results in a compulsory miss, while the second access hits because both `A[i]` and `A[i + 128]` reside in the same cache block. This results in a cache hit rate of 50%.
Determine the cache hit rate for the line labeled as Line 11. __ A08 __
50%
The size of array `A` is 32 KiB (8192 ints × 4 bytes), which is exactly twice the size of the cache. After executing Line 7, the second half of `A` resides in the cache, but Line 11 starts processing from the first half of `A`, meaning none of the cached data from Line 7 can be reused. Since memory accesses in Line 11 follow the same pattern as Line 7, the cache hit rate remains the same at 50%. The variable `total` is likely kept in a register by the compiler, so it does not affect the cache hit calculation.
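A small simulation confirms the 50% figure. The original loop body is not reproduced here, so the stride (`i += 256`) and the array bound (8192 ints) are assumptions inferred from "A is twice the cache size" and "`A[i]` and `A[i + 128]` share a block":

```python
# Sketch: direct-mapped cache (16 sets of 1 KiB blocks) replaying an access
# pattern consistent with the explanation above. Loop structure is assumed.
BLOCK = 1024          # bytes per block
SETS = 16             # 16 KiB cache / 1 KiB blocks

cache = [None] * SETS  # one tag per set (direct-mapped)
hits = accesses = 0

def access(byte_addr):
    global hits, accesses
    accesses += 1
    block = byte_addr // BLOCK
    idx, tag = block % SETS, block // SETS
    if cache[idx] == tag:
        hits += 1
    else:
        cache[idx] = tag

for i in range(0, 8192, 256):      # assumed loop structure
    access(4 * i)                  # A[i], 4-byte ints
    access(4 * (i + 128))          # A[i + 128], 512 bytes later: same block

print(hits / accesses)  # 0.5
```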
Associativity refers to the number of cache slots a memory block can map to. A fully associative cache has the highest associativity, meaning any block can be placed in any cache slot. In contrast, a direct-mapped cache has the lowest associativity, being only 1-way set associative, where each block has only one specific slot.
For an N-way set-associative cache, the following properties hold: each set contains N blocks, and the number of sets equals the cache size divided by (N × block size).
Consider a system with a 2-way set-associative cache.
Classify the following memory accesses as either a cache hit (`H`), cache miss (`M`), or cache miss with replacement (`R`). If a miss occurs, specify its type (Compulsory, Capacity, or Conflict). Assume the Least Recently Used (LRU) replacement policy. Note that while LRU is used here, other policies may apply in different scenarios.
Address (Binary) | Tag / Index / Offset | Hit, Miss, Replace |
---|---|---|
0b0000 0100 | _ | _ |
0b0000 0101 | _ | _ |
0b0110 1000 | _ | _ |
0b1100 1000 | Tag: __ A09 __ / Index: __ A10 __ / Offset: __ A11 __ | __ A12 __ |
0b0110 1000 | _ | _ |
0b1101 1101 | Tag: __ A13 __ / Index: __ A14 __ / Offset: __ A15 __ | __ A16 __ |
0b0100 0101 | _ | _ |
0b0000 0100 | _ | _ |
0b0011 0000 | Tag: __ A17 __ / Index: __ A18 __ / Offset: __ A19 __ | __ A20 __ |
0b1100 1011 | _ | _ |
0b0100 0010 | _ | _ |
You should fill in the "Tag / Index / Offset" and "Hit, Miss, Replace" columns based on the cache configuration and memory access behavior.
A09 = ? A10 = ? A11 = ? A12 = ?
1100, 1, 000, Miss or M
A13 = ? A14 = ? A15 = ? A16 = ?
1101, 1, 101, Replace or R
A17 = ? A18 = ? A19 = ? A20 = ?
0011, 0, 000, Replace or R
What is the cache hit rate for the accesses listed above? __ A21 __
27.3% (3/11)
There are 3 hits out of 11 total accesses, resulting in a hit rate of approximately 27.3%.
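The whole trace can be replayed mechanically. The cache geometry below (8-bit addresses: 4-bit tag, 1-bit index, 3-bit offset; two ways per set; LRU) is taken from the tag/index/offset values in the answers above:

```python
# Sketch: replay the 11 accesses on a 2-way set-associative cache with LRU.
trace = [0x04, 0x05, 0x68, 0xC8, 0x68, 0xDD, 0x45, 0x04, 0x30, 0xCB, 0x42]

sets = {0: [], 1: []}   # each set holds up to 2 tags, most recent last
results = []
for addr in trace:
    tag, index = addr >> 4, (addr >> 3) & 1
    ways = sets[index]
    if tag in ways:
        ways.remove(tag)
        ways.append(tag)          # refresh LRU order
        results.append("H")
    elif len(ways) < 2:
        ways.append(tag)
        results.append("M")       # miss, free way available
    else:
        ways.pop(0)               # evict the least recently used tag
        ways.append(tag)
        results.append("R")       # miss with replacement

print(results)            # ['M','H','M','M','H','R','M','H','R','R','R']
print(results.count("H") / len(trace))  # ≈ 0.273
```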
B
Considering the RISC-V code provided and assuming a pipelined CPU without forwarding, determine the number and types of hazards present. Evaluate all potential hazards between the instructions.
The pipeline diagram (ignoring hazards) is:
Instruction | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
---|---|---|---|---|---|---|---|---|---|
1. `sub t1, s0, s1` | IF | ID | EX | MEM | WB | | | | |
2. `or s0, t0, t1` | | IF | ID | EX | MEM | WB | | | |
3. `sw s1, 100(s0)` | | | IF | ID | EX | MEM | WB | | |
4. `bgeu s0, s2, loop` | | | | IF | ID | EX | MEM | WB | |
5. `add t2, x0, x0` | | | | | IF | ID | EX | MEM | WB |
How many hazards are present in the code above? __ B01 __ hazard(s)
4
There are four hazards present:
- A data hazard between instructions 1 and 2 due to `t1`.
- A data hazard between instructions 2 and 3 caused by `s0`.
- A data hazard between instructions 2 and 4 also involving `s0`.
- A control hazard between instructions 4 and 5 caused by the branch instruction.
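The hazard list above can be reproduced by a simple scan. The 2-instruction RAW window is an assumption that matches a classic 5-stage pipeline without forwarding but with a double-pumped register file (a consumer 3 or more instructions later reads in the same cycle as, or after, the producer's WB):

```python
# Sketch: detect RAW data hazards (within a 2-instruction window) and
# control hazards for the instruction sequence above.
insns = [
    ("sub",  "t1", ["s0", "s1"]),   # (mnemonic, rd, sources)
    ("or",   "s0", ["t0", "t1"]),
    ("sw",   None, ["s1", "s0"]),   # stores and branches write no register
    ("bgeu", None, ["s0", "s2"]),
    ("add",  "t2", ["x0", "x0"]),
]

data_hazards = []
for i, (_, rd, _) in enumerate(insns):
    if rd is None:
        continue
    for j in range(i + 1, min(i + 3, len(insns))):   # 2-instruction RAW window
        if rd in insns[j][2]:
            data_hazards.append((i + 1, j + 1, rd))

control_hazards = [(i + 1, i + 2) for i, ins in enumerate(insns)
                   if ins[0] == "bgeu"]
print(data_hazards)      # [(1, 2, 't1'), (2, 3, 's0'), (2, 4, 's0')]
print(len(data_hazards) + len(control_hazards))  # 4
```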
How many stalls are required to resolve the data hazards if the register file supports double-pumping (write-then-read)? What about the control hazards if branch prediction is used? Assuming the ability to read and write to the register file in the same cycle:

- Each `WB→ID` data dependency is resolved by stalling until the dependent instruction's `ID` falls in the same cycle as the producer's `WB`.
- The control hazard cost depends on where the branch is resolved, anywhere from `MEM` down to one cycle before `IF`.
.B02 = ?
2
B03 = ?
1
B04 = ?
2
B05 = ?
2
B06 = ?
2
B07 = ?
3
B08 = ?
2
B09 = ?
4
B10 = ?
no (or zero)
B11 = ?
3
Instruction | Stages (C1–C9) |
---|---|
1. `sub t1, s0, s1` | IF ID EX MEM WB |
2. `or` → nop | IF ID |
2. `or s0, t0, t1` | IF ID EX MEM WB |
3. `sw s1, 100(s0)` | IF > ID EX MEM WB |
4. `bgeu s0, s2, loop` | IF ID EX MEM WB |
5. `add t2, x0, x0` | IF ID EX MEM |
C
We will use the single-cycle CPU datapath shown below.
The control signals for the single-cycle datapath may take on the following values:
- `PCSel`
- `RegWEn`
- `ImmSel`
- `BrEq`
- `BrLt`
- `ALUSel`

Complete the table by specifying the control signals for each instruction based on the datapath.
Use the `*` (don't care) symbol where a signal's value does not affect the result; where either value is acceptable, write `0/1`.
.Instruction | BrEq | BrLT | PCSel | ImmSel | BrUn | ASel | BSel | ALUSel | MemRW | RegWEn | WBSel |
---|---|---|---|---|---|---|---|---|---|---|---|
add |
* | * | 0 (PC+4) | * | * | 0 (Reg) | 0 (Reg) | add | 0 | 1 | 1 (ALU) |
ori |
* | * | 0 | I | * | 0 (Reg) | 1 (Imm) | or | 0 | 1 | 1 (ALU) |
lw |
* | * | C04 | I | * | 0 (Reg) | 1 (Imm) | add | 0 | C11 | 0 (MEM) |
sw |
* | * | C05 | S | * | 0 (Reg) | 1 (Imm) | add | 1 | C12 | * |
beq |
C01 | * | C06 | B | C09 | 1 (PC) | 1 (Imm) | add | 0 | C13 | * |
jal |
* | C02 | C07 | J | * | 1 (PC) | 1 (Imm) | add | 0 | C14 | C16 |
blt |
* | C03 | C08 | B | C10 | 1 (PC) | 1 (Imm) | add | 0 | C15 | * |
C01 = ?
0/1
C02 = ?
*
C03 = ?
0/1
C04 = ?
0
C05 = ?
0
C06 = ?
0/1
C07 = ?
1 (ALU)
C08 = ?
0/1
C09 = ?
*
C10 = ?
0
C11 = ?
1
C12 = ?
0
C13 = ?
0
C14 = ?
1
C15 = ?
0
C16 = ?
2 (PC + 4)
D
We will implement the Fast Fourier Transform (FFT) and its Inverse (IFFT) using RISC-V instructions (RVV included). The task involves translating C code into RISC-V assembly while adhering to the specified constraints and guidelines.
Part 1: bitwise reverse
Convert to RISC-V assembly:
Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions. All registers should start with `s`.
D01 = ?
sll s3, s3, s2
D02 = ?
and s3, a1, s3
D03 = ?
sll s5, s5, s4
D04 = ?
or s1, s1, s5
D05 = ?
addi s0, s0, 1
You are now allowed to use the RISC-V BitManip extension. One of its supported instructions is `clz rd, rs`, which counts the number of leading zero bits (starting from the most significant bit) of the source register `rs` and stores the result in `rd`.
Reference: RISC-V BitManip
Consider the following RISC-V function `set_log2n`: it computes `log2(N)` for a 32-bit unsigned integer stored in `a0` and writes the result to the memory location labeled `logsize`.
Inputs:

- `a0`: 32-bit unsigned integer N

Outputs:

- `log2(N)`, stored at the memory location labeled `logsize`.

Registers Modified:

- `t0`, `t1`

The `set_log2n` function can be used to calculate and store `log2(size)` for shared access by other functions.
Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions. All registers should start with `t`.
D06 = ?
sub t1, t1, t0
D07 = ?
sw t1, 0(t0)
Let’s implement the previously described `reverse` routine as follows, without branches:
Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions.
D08 = ?

and t2, a0, t0

C: `v & 0x55555555`

D09 = ?

slli t2, t2, 1

C: `(v & 0x55555555) << 1`

D10 = ?

and t2, a0, t0

C: `v & 0x33333333`

D11 = ?

slli t2, t2, 2

C: `(v & 0x33333333) << 2`

D12 = ?

srli t1, a0, 4

C: `v >> 4`

D13 = ?

and t2, a0, t0

C: `v & 0x0F0F0F0F`

D14 = ?

slli t2, t2, 4

C: `(v & 0x0F0F0F0F) << 4`

D15 = ?

srli t1, a0, 8

C: `v >> 8`

D16 = ?

and t2, a0, t0

C: `v & 0x00FF00FF`

D17 = ?

slli t2, t2, 8

C: `(v & 0x00FF00FF) << 8`

D18 = ?

16

D19 = ?

16
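The mask-and-shift steps in the D08–D19 answers implement the classic branchless 32-bit bit reversal. Expressed in Python for verification:

```python
# Branchless 32-bit bit reversal, mirroring the masks and shifts above.
def rev32(v):
    v = ((v >> 1) & 0x55555555) | ((v & 0x55555555) << 1)   # swap adjacent bits
    v = ((v >> 2) & 0x33333333) | ((v & 0x33333333) << 2)   # swap bit pairs
    v = ((v >> 4) & 0x0F0F0F0F) | ((v & 0x0F0F0F0F) << 4)   # swap nibbles
    v = ((v >> 8) & 0x00FF00FF) | ((v & 0x00FF00FF) << 8)   # swap bytes
    v = ((v >> 16) | (v << 16)) & 0xFFFFFFFF                # swap half-words
    return v

print(hex(rev32(0x00000001)))  # 0x80000000
print(hex(rev32(0x12345678)))  # 0x1e6a2c48
```

Applying `rev32` twice returns the original value, which is a convenient sanity check.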
Next, we will utilize the RISC-V Vector Extension (RVV). Consider the following code:
The above should be built with --march=rv32i_zbb_zve32x
using GNU Toolchain.
Extension | Minimum VLEN | Supported EEW | FP32 | FP64 |
---|---|---|---|---|
zve32x |
32 | 8, 16, 32 | N | N |
Reference: RISC-V "V" Vector Extension
We can implement a vectorized version of the reverse routine, vreverse
, using the RISC-V Vector Extension as follows.
Complete the above RISC-V assembly code to ensure functionality.
D20 = ?
sub a0, t0, a0
C: `log2(N) = 31 - clz(N)`
D21 = ?
vand.vx v0, v29, t3
D22 = ?
vor.vx v28, v28, t3, v0.t
D23 = ?
vmv.v.v v29, v28
E
A processor uses 16-bit addresses, 256-byte pages, and an 8-entry fully associative TLB with LRU replacement. The LRU field is represented by 3 bits, indicating the order of page access, where `0` means the most recently accessed page. At a given point in time, the TLB for the current process is in the initial state, as described in the table below. Additionally, three free physical pages are available.
Assumptions:
Your task:
Available free physical pages: `0x17`, `0x18`, `0x19` (allocated in this order in the walkthrough below).

Memory access pattern: `0x11F0` (R), `0x1301` (W), `0x20AE` (W), `0x2332` (W), `0x20FF` (R), `0x3415` (W).
Initial TLB:
VPN | PPN | Valid | Dirty | LRU |
---|---|---|---|---|
0x01 | 0x11 | 1 | 1 | 0 |
0x00 | 0x00 | 0 | 0 | 7 |
0x10 | 0x13 | 1 | 1 | 1 |
0x20 | 0x12 | 1 | 0 | 5 |
0x00 | 0x00 | 0 | 0 | 7 |
0x11 | 0x14 | 1 | 0 | 4 |
0xac | 0x15 | 1 | 1 | 2 |
0xff | 0xff | 1 | 0 | 3 |
Final TLB: (Use hexadecimal literal for address, 0/1 for dirty)
VPN | PPN | Valid | Dirty |
---|---|---|---|
0x01 | 0x11 | 1 | 1 |
E01 | E02 | 1 | E03 |
E04 | E05 | 1 | E06 |
E07 | E08 | 1 | E09 |
E10 | E11 | 1 | E12 |
E13 | E14 | 1 | E15 |
E16 | E17 | 1 | E18 |
E19 | E20 | 1 | E21 |
E01 = ? E02 = ? E03 = ?
0x13, 0x17, 1
E04 = ? E05 = ? E06 = ?
0x10, 0x13, 1
E07 = ? E08 = ? E09 = ?
0x20, 0x12, 1
E10 = ? E11 = ? E12 = ?
0x23, 0x18, 1
E13 = ? E14 = ? E15 = ?
0x11, 0x14, 0
E16 = ? E17 = ? E18 = ?
0xac, 0x15, 1
E19 = ? E20 = ? E21 = ?
0x34, 0x19, 1
- Access `0x11F0` (Read): Hit. Corresponding virtual page: `0x11`. Physical address: `0x14F0`. Updated LRUs: 1, 7, 2, 5, 7, 0, 3, 4.
- Access `0x1301` (Write): Miss. Map VPN `0x13` to PPN `0x17`; set the valid and dirty bits. Physical address: `0x1701`. Updated LRUs: 2, 0, 3, 6, 7, 1, 4, 5.
- Access `0x20AE` (Write): Hit; set the dirty bit. Physical address: `0x12AE`. Updated LRUs: 3, 1, 4, 0, 7, 2, 5, 6.
- Access `0x2332` (Write): Miss. Map VPN `0x23` to PPN `0x18`; set the valid and dirty bits. Physical address: `0x1832`. Updated LRUs: 4, 2, 5, 1, 0, 3, 6, 7.
- Access `0x20FF` (Read): Hit. Physical address: `0x12FF`. Updated LRUs: 4, 2, 5, 0, 1, 3, 6, 7.
- Access `0x3415` (Write): Miss (replacement needed). Replace the least recently used entry (the last row), map VPN `0x34` to PPN `0x19`, and set the dirty bit. Physical address: `0x1915`. Updated LRUs: 5, 3, 6, 1, 2, 4, 7, 0.
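The walkthrough can be verified by replaying the six accesses against the initial TLB state with a true-LRU policy (invalid entries are filled before any replacement, and the free pages `0x17`–`0x19` are handed out in order, as in the walkthrough):

```python
# Sketch: 8-entry fully associative TLB with true LRU.
# 16-bit addresses, 256-byte pages -> VPN is the upper 8 bits.
tlb = [  # [vpn, ppn, valid, dirty], rows in the order of the initial table
    [0x01, 0x11, 1, 1], [0x00, 0x00, 0, 0], [0x10, 0x13, 1, 1],
    [0x20, 0x12, 1, 0], [0x00, 0x00, 0, 0], [0x11, 0x14, 1, 0],
    [0xAC, 0x15, 1, 1], [0xFF, 0xFF, 1, 0],
]
lru = [0, 7, 1, 5, 7, 4, 2, 3]   # 0 = most recently used
free_ppns = [0x17, 0x18, 0x19]

def touch(i):
    for j in range(len(lru)):     # age every entry more recent than i
        if lru[j] < lru[i]:
            lru[j] += 1
    lru[i] = 0

def access(vaddr, write):
    vpn = vaddr >> 8
    for i, entry in enumerate(tlb):
        if entry[2] and entry[0] == vpn:
            break                                  # TLB hit
    else:                                          # miss: fill invalid or evict LRU
        invalid = [j for j, e in enumerate(tlb) if not e[2]]
        i = invalid[0] if invalid else max(range(8), key=lambda k: lru[k])
        tlb[i] = [vpn, free_ppns.pop(0), 1, 0]
    if write:
        tlb[i][3] = 1
    touch(i)
    return (tlb[i][1] << 8) | (vaddr & 0xFF)

pattern = [(0x11F0, 0), (0x1301, 1), (0x20AE, 1),
           (0x2332, 1), (0x20FF, 0), (0x3415, 1)]
paddrs = [access(vaddr, w) for vaddr, w in pattern]
print([hex(p) for p in paddrs])
# ['0x14f0', '0x1701', '0x12ae', '0x1832', '0x12ff', '0x1915']
```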
F
Consider a memory address space divided into 64-byte chunks, where fractions of a chunk address select bytes within the chunk. For example, address 1024 is 64 bytes away from address 1025, but only 1 byte away from the address 1024 + 1/64 = 1024.015625. In this system, standard integer binary addressing is ineffective. Instead, consider using a floating-point representation based on the IEEE-754 standard, which includes 1 sign bit. You need to determine the appropriate number of exponent and mantissa bits.
Given a memory size of 4KiB and chunk addresses labeled as 0, 1, 2, …, determine the minimum number of exponent bits required in the floating-point memory address representation to access every byte, assuming a standard bias.
F01 = ?
4
With a memory size of 4 KiB divided into 64 B chunks, there are 64 total chunks. This requires 4 exponent bits to cover chunk addresses from 0 to 63. The highest exponent encoding with 4 bits (excluding the all-ones pattern reserved for infinity/NaN) is 2^4 - 2 = 14; applying the standard bias of -7 gives a maximum exponent value of 7, sufficient to cover values up to 63 (which needs exponent 5). Conversely, with only 3 exponent bits, the maximum exponent value after applying the bias of -3 is 3, and since 2^4 = 16 > values representable there only reach below 2^4, this is insufficient to represent the value 63.
Determine the minimum number of mantissa bits required in the floating-point memory address system to accurately address each byte.
F02 = ?
11
The largest memory address to represent is byte 4095 (the last byte of 4 KiB), corresponding to the chunk address 63 + 63/64 = 0b111111.111111. Normalized, this is 1.11111111111 (binary) × 2^5, which requires 11 mantissa bits after the implicit leading 1. Thus, 11 bits of mantissa are necessary.
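A quick check of the mantissa-bit count, treating the largest chunk address as an exact binary fraction:

```python
# Largest byte address in 4 KiB is byte 4095 -> chunk address 63 + 63/64.
addr = 63 + 63 / 64              # 63.984375, exactly representable in binary
# Scaling by 64 gives the integer 4095 = 0b111111111111 (12 significant bits);
# the implicit leading 1 of the normalized form is free, leaving 11 stored bits.
bits = bin(4095)[2:]
mantissa_bits = len(bits) - 1
print(mantissa_bits)  # 11
```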
G
In many applications, it is essential not only to identify the maximum value in an array but also to determine its corresponding index, commonly referred to as the argmax. This task can be efficiently accomplished using data-level parallelism. The `argmax` function below accepts an array `arr` and its length `n`, returning the index of the maximum element. If multiple elements have the same maximum value, the function returns the earliest index.
To implement this efficiently, use pseudo-SIMD (Single Instruction, Multiple Data) intrinsics. These intrinsics operate on `vec` structs, similar to Intel's `__m128i` structs, which contain four packed integers. Complete the sections labeled `G01`, `G02`, and `G03` in the function to ensure it works correctly.
Proposed SIMD intrinsics reference:

- `vec sum_epi32(vec a, vec b)`: Returns `a + b`.
- `vec and_epi32(vec a, vec b)`: Returns `a & b`.
- `vec set_epi32(int a)`: Returns a SIMD vector with all entries set to `a`.
- `vec load_epi32(int *a)`: Returns a SIMD vector with entries `a[0]`, `a[1]`, `a[2]`, and `a[3]`.
- `int reducemax_epi32(vec a)`: Returns the maximum value in vector `a`.
- `vec maskeq_epi32(vec a, int b)`: Returns a mask vector with `0xFFFFFFFF` where `a` equals `b`, and `0` otherwise.
- `int firstv_epi32(vec a)`: Returns the index of the first entry with its lowest bit set to `1`.

Complete the above C code to ensure functionality.
G01 = ?
maskeq_epi32(tmp, curr)
G02 = ?
firstv_epi32(tmp)
G03 = ?
arr[i] > running_max
(or equivalent)
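A scalar emulation shows how the G01–G03 answers fit together. The intrinsic names follow the reference list above; the loop structure is an assumption consistent with the answers (and `n` is taken as a multiple of 4 for brevity, since the original C skeleton is not reproduced here):

```python
# Sketch: pseudo-SIMD argmax emulated with 4-wide Python lists as `vec`.
def reducemax_epi32(v):  return max(v)
def maskeq_epi32(v, b):  return [0xFFFFFFFF if x == b else 0 for x in v]
def firstv_epi32(v):     return next(i for i, x in enumerate(v) if x & 1)

def argmax(arr, n):
    running_max, idx = arr[0], 0
    for i in range(0, n, 4):
        tmp = arr[i:i + 4]                    # load_epi32(&arr[i])
        curr = reducemax_epi32(tmp)           # max of this 4-wide chunk
        if curr > running_max:                # G03: strict > keeps earliest index
            running_max = curr
            mask = maskeq_epi32(tmp, curr)    # G01: 0xFFFFFFFF where lane == max
            idx = i + firstv_epi32(mask)      # G02: earliest matching lane
    return idx

print(argmax([3, 1, 4, 1, 5, 9, 2, 6], 8))  # 5
```

The strict comparison in G03 is what preserves the earliest-index tie-breaking rule.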
H
Suppose we are designing a new microprocessor with a custom ISA. To simplify the architecture, we decide to use a single, universal instruction type called the Z-type instruction.
We aim to support the following instructions:
- `add rd1, rs1, rs2`
- `and rd1, rs1, rs2`
- `lw rd1, offset1(rs1)`
- `sw rs2, offset1(rs1)`
- `addi rd1, rs1, imm1`
- `beq rs1, rs2, offset1`
- `lui rd1, offset1`
- `jal rd1, imm`
- `stw rs3, offset1, offset2(rs1)`

The new `stw` instruction stores the contents of `rs3` into both memory locations `rs1 + offset1` and `rs1 + offset2`. Its Register Transfer Level (RTL) is:

Mem[R[rs1] + offset1] ← R[rs3]; Mem[R[rs1] + offset2] ← R[rs3]
How many bits must the opcode field have if we want to support the instructions listed above? __ H01 __ bits
H01 = ?
4
There are 9 instructions in total, so we need 4 bits to represent them, since 2^3 = 8 < 9 ≤ 16 = 2^4.
With only one instruction format, we want to enable jumps of up to 64 KiB in either direction. How many bits are required for the immediate field to support this? Assume that, like RV32, the instruction’s least significant bit is an implicit 0 and not saved.
H02 = ?
16
Jumping 64 KiB in either direction corresponds to 2^16 = 65536 bytes. Since immediate values are counted in 2-byte units (the least significant bit is an implicit 0), we need 15 bits for the offset magnitude, covering up to 2^15 = 32768 units. An extra bit is required for the sign, making the immediate field 16 bits.
Consider the following instruction format:
The instruction format is 64 bits long, where each immediate field (imm1
and imm2
) is 11 bits, and the opcode is 7 bits. What is the maximum number of registers the ISA can support?
H03 = ?
128
Total bits used by immediates and the opcode: 2 × 11 + 7 = 29 bits.

Remaining bits for registers: 64 − 29 = 35 bits.

There are 5 register fields (`rs1`, `rs2`, `rs3`, `rd1`, `rd2`). Dividing the remaining bits equally among them gives 35 / 5 = 7 bits per field.

This allows 2^7 = 128 registers.
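The field-width arithmetic in one place:

```python
# Field-width arithmetic for the 64-bit Z-type instruction format.
total_bits = 64
imm_bits = 2 * 11        # imm1 and imm2, 11 bits each
opcode_bits = 7
register_fields = 5      # rs1, rs2, rs3, rd1, rd2

remaining = total_bits - imm_bits - opcode_bits   # 64 - 29 = 35
bits_per_register = remaining // register_fields  # 35 / 5 = 7
print(2 ** bits_per_register)  # 128 registers
```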
I
You are tasked with designing a memory hierarchy using an 8-word (32-byte) cache line and 32-bit virtual and physical addresses. Consider the following loops:
Assumptions:

- `X[0][0]` is stored at address `0x0`.
- `X` and `Y` are stored contiguously in row-major order, meaning `X[i][j]` is adjacent to `X[i][j + 1]` in memory, and `X[256][64]` is next to `Y[0][0]`.
- `sum` is not stored in the cache.

Assume a 32 KiB 4-way set-associative data cache (D$) with a FIFO replacement policy. Complete the following table:
Loop | Compulsory Misses | Conflict and Capacity Misses | Total Misses |
---|---|---|---|
Loop A | I01 | I02 | I03 |
Loop B | I04 | I05 | I06 |
I01 = ?
4096 (Compulsory Misses for Loop A)
I02 = ?
0 (No Conflict or Capacity Misses for Loop A)
I03 = ?
4096 (Total Misses for Loop A)
I04 = ?
4096 (Compulsory Misses for Loop B)
I05 = ?
28672 (Conflict and Capacity Misses for Loop B)
I06 = ?
32768 (Total Misses for Loop B)
Assume a 256 KiB 8-way set-associative L2 cache with a FIFO replacement policy, inclusive of the L1 (D$). How many L2 cache misses occur when Loops A and B are run? Loop A: __ I07 __ / Loop B: __ I08 __
I07 = ?
4096 (L2 Cache Misses for Loop A)
I08 = ?
4096 (L2 Cache Misses for Loop B)
Given the following hit times:
Calculate the AMAT for Loops A and B: Loop A: __ I09 __ cycles / Loop B: __ I10 __ cycles
I09 = ?
11.75 cycles (AMAT for Loop A)
I10 = ?
22.25 cycles (AMAT for Loop B)
Assume a Virtually-Indexed Physically-Tagged (VIPT) cache with a 4 KiB page size and a 32 KiB cache size. Is it possible to have virtual address aliasing? __ I11 __ (answer in Yes or No) What is the minimum set-associativity required to avoid aliasing? __ I12 __
I11 = ?
Yes (Virtual Address Aliasing Possible)
I12 = ?
8 (Minimum Set-Associativity Required)