Solutions
A
Average Memory Access Time (AMAT) represents the average time required to access memory. The main formula is:

AMAT = Hit Time + Miss Rate × Miss Penalty
In a multi-level cache system, two types of miss rates are considered for each cache level: the local miss rate (misses at that level divided by the accesses that reach that level) and the global miss rate (misses at that level divided by all memory accesses).
In a two-level cache system, suppose there were 100 total accesses to the entire cache system, and the L2 cache experienced 20 misses. If the L1 cache has a local miss rate of 50%, what is the local miss rate of the L2 cache? __ A01 __
40%
Since the L1 cache has a local miss rate of 50%, the L2 cache is accessed 50 times out of the 100 total memory accesses. Given that the L2 cache experienced 20 misses, its local miss rate is:

20 / 50 = 40%
Suppose the system has the following characteristics:
What is the local miss rate of the L2 cache? __ A02 __
25%
The number of accesses to the L2 cache equals the number of misses in the L1 cache. To find the local miss rate of the L2 cache, we divide its global miss rate by the miss rate of the L1 cache:

Local Miss Rate(L2) = Global Miss Rate(L2) / Miss Rate(L1)
Substituting the values:
Suppose we aim to reduce the system's AMAT to 8 cycles or lower by introducing an L3 cache. If the L3 cache has a local miss rate of 30%, what is the maximum allowable hit time for the L3 cache? __ A03 __
cycles.
30
Let H represent the hit time of the L3 cache. By extending the AMAT equation, where the miss penalty of the L2 cache becomes the local AMAT of the L3 cache, we can express the system's AMAT as:

AMAT = HT(L1) + MR(L1) × (HT(L2) + MR(L2) × (H + MR(L3) × MP(L3)))
To ensure the AMAT does not exceed 8 cycles, we solve the inequality:
Simplifying this yields:
Thus, the maximum hit time for the L3 cache is 30 cycles.
Given the following block of C code, determine the cache hit rate on a byte-addressed system with a total memory size of 1 MiB. The system has a 16 KiB direct-mapped cache with 1 KiB blocks. Assume the cache starts cold.
#define N_INTS 8192
int A[N_INTS]; /* A lives at 0x10000 */
int i, total = 0;
/* Initialize array */
for (i = 0; i < N_INTS; i += 128)
A[i] = i; /* Line 7 */
/* Process array */
for (i = 0; i < N_INTS; i += 128)
total += A[i]; /* Line 11 */
What is the Tag:Index:Offset (TIO) breakdown? Tag: __ A04 __ bits; Index: __ A05 __ bits; Offset: __ A06 __ bits
6, 4, 10
The breakdown of the Tag:Index:Offset (T/I/O) is calculated as follows: a 1 MiB memory means 20-bit addresses. The 1 KiB block size gives 10 offset bits; the 16 KiB / 1 KiB = 16 blocks of the direct-mapped cache give 4 index bits; the remaining 20 − 4 − 10 = 6 bits form the tag.
Determine the cache hit rate for the line labeled as Line 7. __ A07 __
50%
Since integer accesses are spaced 128 × 4 = 512 bytes apart, each 1 KiB cache block holds two accesses. The first access to each block results in a compulsory miss, while the second access hits because both A[i] and A[i + 128] reside in the same cache block. This results in a cache hit rate of 50%.
Determine the cache hit rate for the line labeled as Line 11. __ A08 __
50%
The size of array A is 8192 × 4 = 32768 bytes (32 KiB), which is exactly twice the size of the cache. After executing Line 7, the second half of A resides in the cache, but Line 11 starts processing from the first half of A, meaning none of the cached data from Line 7 can be reused. Since memory accesses in Line 11 follow the same pattern as Line 7, the cache hit rate remains the same at 50%. The variable total is likely kept in a register by the compiler, so it does not affect the cache hit calculation.
Associativity refers to the number of cache slots a memory block can map to. A fully associative cache has the highest associativity, meaning any block can be placed in any cache slot. In contrast, a direct-mapped cache has the lowest associativity, being only 1-way set associative, where each block has only one specific slot.
For an N-way set-associative cache, each block maps to exactly one set but may occupy any of the N ways within it.
Consider a system with a 2-way set-associative cache using 8-bit addresses, 8-byte blocks, and 2 sets (4 tag bits, 1 index bit, and 3 offset bits).
Classify the following memory accesses as either a cache hit (H
), cache miss (M
), or cache miss with replacement (R
). If a miss occurs, specify its type (Compulsory, Capacity, or Conflict). Assume the Least Recently Used (LRU) replacement policy. Note that while LRU is used here, other policies may apply in different scenarios.
Address (Binary) | Tag / Index / Offset | Hit, Miss, Replace |
---|---|---|
0b0000 0100 | _ | _ |
0b0000 0101 | _ | _ |
0b0110 1000 | _ | _ |
0b1100 1000 | Tag: __ A09 __ / Index: __ A10 __ / Offset: __ A11 __ | __ A12 __ |
0b0110 1000 | _ | _ |
0b1101 1101 | Tag: __ A13 __ / Index: __ A14 __ / Offset: __ A15 __ | __ A16 __ |
0b0100 0101 | _ | _ |
0b0000 0100 | _ | _ |
0b0011 0000 | Tag: __ A17 __ / Index: __ A18 __ / Offset: __ A19 __ | __ A20 __ |
0b1100 1011 | _ | _ |
0b0100 0010 | _ | _ |
You should fill in the "Tag / Index / Offset" and "Hit, Miss, Replace" columns based on the cache configuration and memory access behavior.
A09 = ? A10 = ? A11 = ? A12 = ?
1100, 1, 000, Miss or M
A13 = ? A14 = ? A15 = ? A16 = ?
1101, 1, 101, Replace or R
A17 = ? A18 = ? A19 = ? A20 = ?
0011, 0, 000, Replace or R
What is the cache hit rate for the accesses listed above? __ A21 __
27.3% or 3/11
There are 3 hits out of 11 total accesses, resulting in a hit rate of approximately 27.3%.
B
Considering the RISC-V code provided and assuming a pipelined CPU without forwarding, determine the number and types of hazards present. Evaluate all potential hazards between the instructions.
The pipeline diagram (without stalls) is:

Instruction | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
---|---|---|---|---|---|---|---|---|---|
1. sub t1, s0, s1 | IF | ID | EX | MEM | WB | | | | |
2. or s0, t0, t1 | | IF | ID | EX | MEM | WB | | | |
3. sw s1, 100(s0) | | | IF | ID | EX | MEM | WB | | |
4. bgeu s0, s2, loop | | | | IF | ID | EX | MEM | WB | |
5. add t2, x0, x0 | | | | | IF | ID | EX | MEM | WB |
How many hazards exist among these instructions? __ B01 __ hazard(s)
4
There are four hazards present:
- A data hazard between instructions 1 and 2 due to `t1`.
- A data hazard between instructions 2 and 3 caused by `s0`.
- A data hazard between instructions 2 and 4, also involving `s0`.
- A control hazard between instructions 4 and 5 caused by the branch instruction.
How many stalls are required to resolve the data hazards if the register file supports double-pumping (write-then-read)? What about the control hazards if branch prediction is used? Assuming the ability to read and write to the register file in the same cycle:
- The `t1` dependency (instructions 1→2) is resolved once instruction 2's `ID` lines up with instruction 1's `WB` (a `WB→ID` dependency).
- The `s0` dependencies (instructions 2→3 and 2→4) are likewise resolved through the `WB→ID` dependency.
- The branch outcome is known in `MEM`, so without prediction the dependent fetch is delayed from `MEM` to one cycle before `IF`.
.B02 = ?
2
B03 = ?
1
B04 = ?
2
B05 = ?
2
B06 = ?
2
B07 = ?
3
B08 = ?
2
B09 = ?
4
B10 = ?
no (or zero)
B11 = ?
3
Instruction | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
---|---|---|---|---|---|---|---|---|---|
1. sub t1, s0, s1 | IF | ID | EX | MEM | WB | | | | |
2. or → nop | | IF | ID | | | | | | |
2. or s0, t0, t1 | | | IF | ID | EX | MEM | WB | | |
3. sw s1, 100(s0) | | | | IF | ID | EX | MEM | WB | |
4. bgeu s0, s2, loop | | | | | IF | ID | EX | MEM | WB |
5. add t2, x0, x0 | | | | | | IF | ID | EX | MEM |
C
We will use the single-cycle CPU datapath shown below.
The control signals for the single-cycle datapath may take on the following values:

- `PCSel`
- `RegWEn`
- `ImmSel`
- `BrEq`
- `BrLt`
- `ALUSel`

Complete the table by specifying the control signals for each instruction based on the datapath. Use the `*` (don't care) symbol where a signal's value does not matter; signals that may be either value are written `0/1`.
.Instruction | BrEq | BrLT | PCSel | ImmSel | BrUn | ASel | BSel | ALUSel | MemRW | RegWEn | WBSel |
---|---|---|---|---|---|---|---|---|---|---|---|
add |
* | * | 0 (PC+4) | * | * | 0 (Reg) | 0 (Reg) | add | 0 | 1 | 1 (ALU) |
ori |
* | * | 0 | I | * | 0 (Reg) | 1 (Imm) | or | 0 | 1 | 1 (ALU) |
lw |
* | * | C04 | I | * | 0 (Reg) | 1 (Imm) | add | 0 | C11 | 0 (MEM) |
sw |
* | * | C05 | S | * | 0 (Reg) | 1 (Imm) | add | 1 | C12 | * |
beq |
C01 | * | C06 | B | C09 | 1 (PC) | 1 (Imm) | add | 0 | C13 | * |
jal |
* | C02 | C07 | J | * | 1 (PC) | 1 (Imm) | add | 0 | C14 | C16 |
blt |
* | C03 | C08 | B | C10 | 1 (PC) | 1 (Imm) | add | 0 | C15 | * |
C01 = ?
0/1
C02 = ?
*
C03 = ?
0/1
C04 = ?
0
C05 = ?
0
C06 = ?
0/1
C07 = ?
1 (ALU)
C08 = ?
0/1
C09 = ?
*
C10 = ?
0
C11 = ?
1
C12 = ?
0
C13 = ?
0
C14 = ?
1
C15 = ?
0
C16 = ?
2 (PC + 4)
D
We will implement the Fast Fourier Transform (FFT) and its Inverse (IFFT) using RISC-V instructions (RVV included). The task involves translating C code into RISC-V assembly while adhering to the specified constraints and guidelines.
Part 1: bitwise reverse
/* Calculates the log2 of number */
int logint(int N)
{
int k = N, i = 0;
while (k) {
k >>= 1;
i++;
}
return i - 1;
}
/* Bitwise reverses the number */
int reverse(int N, int n)
{
int j, p = 0;
for (j = 1; j <= logint(N); j++) {
if (n & (1 << (logint(N) - j))) {
p |= 1 << (j - 1);
}
}
return p;
}
Convert to RISC-V assembly:
# Takes input N (a0), returns its log base 2 in a0
logint:
addi sp, sp, -4
sw t0, 0(sp)
add t0, a0, zero # k = N
add a0, zero, zero # i = 0
logloop:
beq t0, zero, logloop_end # Exit if k == 0
srai t0, t0, 1 # k >>= 1
addi a0, a0, 1 # i++
j logloop
logloop_end:
addi a0, a0, -1 # Return i - 1
lw t0, 0(sp)
addi sp, sp, 4
jr ra
# Takes inputs N(a0) and n(a1), reverses the number in binary
reverse:
addi sp, sp, -28
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
sw s5, 24(sp)
call logint # Now a0 has log2(N)
addi s0, zero, 1 # j = 1
add s1, zero, zero # p = 0
forloop_reverse:
bgt s0, a0, forloop_reverse_end
sub s2, a0, s0 # s2 = a0 - s0
addi s3, zero, 1
__ D01 __ # Fill your answer!
__ D02 __ # Fill your answer! Check if bit is set
beq s3, zero, elses3 # If not, skip
ifs3:
addi s4, s0, -1 # s4 = j - 1
addi s5, zero, 1
__ D03 __ # Fill your answer!
__ D04 __ # Fill your answer! Set bit in p
elses3:
__ D05 __ # Fill your answer!
j forloop_reverse
forloop_reverse_end:
add a0, s1, zero # Return p
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
lw s5, 24(sp)
addi sp, sp, 28
jr ra
Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions. All registers you introduce should start with `s`.
D01 = ?
sll s3, s3, s2
D02 = ?
and s3, a1, s3
D03 = ?
sll s5, s5, s4
D04 = ?
or s1, s1, s5
D05 = ?
addi s0, s0, 1
You are now allowed to use the RISC-V BitManip extension. One of its supported instructions is `clz rd, rs`, which counts the number of leading zero bits (starting from the most significant bit) of the source register `rs` and stores the result in `rd`.

Reference: RISC-V BitManip
Reference: RISC-V BitManip
Consider the following RISC-V function `set_log2n`: it computes `log2(N)` for a 32-bit unsigned integer stored in `a0` and writes the result to the memory location labeled `logsize`.
Inputs:
- `a0`: 32-bit unsigned integer `N`

Outputs:
- `log2(N)`, written to memory (at `logsize`).

Registers Modified:
- `t0`, `t1`
set_log2n:
clz t0, a0 # Count leading zeros in a0
li t1, 31 # Maximum index for a 32-bit register
__ D06 __ # Fill your answer! Compute log2(N)
la t0, logsize # Load address of 'logsize'
__ D07 __ # Fill your answer!
jr ra # Return to caller
logsize: .word 0
The `set_log2n` function can be used to calculate and store `log2(size)` for shared access by other functions.

Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions. All registers you introduce should start with `t`.
D06 = ?
sub t1, t1, t0
D07 = ?
sw t1, 0(t0)
Let’s implement the previously described `reverse` routine without branches:
# Function: reverse
# Reverses the binary digits of a 32-bit integer.
# Inputs:
# - a0: Input number to reverse.
# - a1: Number of significant bits to reverse (optional; default 32).
# Outputs:
# - a0: The reversed binary number.
# Clobbers:
# - t0, t1, t2
reverse:
# Swap odd and even bits
lui t0, 0x55555 # Load upper 20 bits of 0x55555555
ori t0, t0, 0x555 # Complete loading 0x55555555
srli t1, a0, 1 # v >> 1
and t1, t1, t0 # (v >> 1) & 0x55555555
__ D08 __ # Fill your answer!
__ D09 __ # Fill your answer!
or a0, t1, t2 # Store result in a0
# Swap consecutive pairs
lui t0, 0x33333 # Load upper 20 bits of 0x33333333
ori t0, t0, 0x333 # Complete loading 0x33333333
srli t1, a0, 2 # v >> 2
and t1, t1, t0 # (v >> 2) & 0x33333333
__ D10 __ # Fill your answer!
__ D11 __ # Fill your answer!
or a0, t1, t2 # Store result in a0
# Swap nibbles
lui t0, 0x0F0F0 # Load upper 20 bits of 0x0F0F0F0F
ori t0, t0, 0x0F0 # Complete loading 0x0F0F0F0F
__ D12 __ # Fill your answer!
and t1, t1, t0
__ D13 __ # Fill your answer!
__ D14 __ # Fill your answer!
or a0, t1, t2 # Store result in a0
# Swap bytes
lui t0, 0x00FF0 # Load upper 20 bits of 0x00FF00FF
ori t0, t0, 0x0FF # Complete loading 0x00FF00FF
__ D15 __ # Fill your answer!
and t1, t1, t0
__ D16 __ # Fill your answer!
__ D17 __ # Fill your answer!
or a0, t1, t2 # Store result in a0
# Swap 2-byte pairs
srli t1, a0, __ D18 __
slli t2, a0, __ D19 __
or a0, t1, t2 # Final result in a0
# Adjust number of significant bits
addi t1, zero, 32 # Load immediate value 32
sub t1, t1, a1 # Calculate shift amount
srl a0, a0, t1 # Shift right to fit bit count
jr ra # Return with result in a0
Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions.
D08 = ?
and t2, a0, t0
C: `v & 0x55555555`
D09 = ?
slli t2, t2, 1
C: `(v & 0x55555555) << 1`
D10 = ?
and t2, a0, t0
C: `v & 0x33333333`
D11 = ?
slli t2, t2, 2
C: `(v & 0x33333333) << 2`
D12 = ?
srli t1, a0, 4
C: `v >> 4`
D13 = ?
and t2, a0, t0
C: `v & 0x0F0F0F0F`
D14 = ?
slli t2, t2, 4
C: `(v & 0x0F0F0F0F) << 4`
D15 = ?
srli t1, a0, 8
C: `v >> 8`
D16 = ?
and t2, a0, t0
C: `v & 0x00FF00FF`
D17 = ?
slli t2, t2, 8
C: `(v & 0x00FF00FF) << 8`
D18 = ?
16
D19 = ?
16
Next, we will utilize the RISC-V Vector Extension (RVV). Consider the following code:
# Initialize the helper vector with sequential integers (0,1,2,...)
init_helper_vector:
la t0, helperVector # Load the base address of helperVector
lw t1, size # Load the size of the vector (N)
vsetvli t2, t1, e32, m1 # Set the vector length configuration
vid.v v0 # Generate a vector of indices
vse32.v v0, (t0) # Store the generated vector in helperVector
ret
The above should be built with `--march=rv32i_zbb_zve32x` using the GNU toolchain.
Extension | Minimum VLEN | Supported EEW | FP32 | FP64 |
---|---|---|---|---|
zve32x |
32 | 8, 16, 32 | N | N |
Reference: RISC-V "V" Vector Extension
We can implement a vectorized version of the reverse routine, vreverse
, using the RISC-V Vector Extension as follows.
# Calculate log base 2 of input
# Input: a0 = N
# Output: a0 = log2(N)
logint:
clz a0, a0 # Count leading zeros of a0
addi t0, zero, 31 # Load 31 (32-bit word size - 1)
__ D20 __ # Fill your answer!
jr ra # Return
# Bit-reverse elements of a vector
# Input: v29 = input vector, a0 = N
# Output: v29 = bit-reversed vector
vreverse:
addi sp, sp, -4 # Save return address
sw ra, 0(sp)
call logint # Compute log2(N) in a0
lw ra, 0(sp) # Restore return address
addi sp, sp, 4
li t1, 1 # Constant 1 for bit manipulation
li t0, 1 # Initialize bit position counter j
vmv.v.x v28, zero # Clear result vector v28
vreverse_loop: # Loop for j <= log2(N)
bgt t0, a0, vreverse_end # Break if j > log2(N)
sub t2, a0, t0 # t2 = log2(N) - j
sll t3, t1, t2 # t3 = (1 << (log2(N) - j))
__ D21 __ # Fill your answer! v0 = v29 & t3 (check bit)
vmsne.vx v0, v0, zero # v0 = 1 if set, else 0
addi t4, t0, -1 # t4 = j - 1
sll t3, t1, t4 # t3 = (1 << (j - 1))
__ D22 __ # Fill your answer! Set bit in v28 where mask is true
addi t0, t0, 1 # Increment bit position
j vreverse_loop
vreverse_end:
__ D23 __ # Fill your answer! Store result in v29
jr ra # Return
The usage of vreverse would look something like this:
mv a0, a2 # a0 = N, input for reverse function. Overwriting a0 is fine
vmv.v.v v29, v26 # v29 = v26 = <i>, used as input for vreverse
call vreverse # v29 now contains rev(N, <i>), keep it for later use
Complete the above RISC-V assembly code to ensure functionality.
D20 = ?
sub a0, t0, a0
C:log2(N) = 31 - clz(N)
D21 = ?
vand.vx v0, v29, t3
D22 = ?
vor.vx v28, v28, t3, v0.t
D23 = ?
vmv.v.v v29, v28
E
A processor uses 16-bit addresses, 256-byte pages, and an 8-entry fully associative TLB with LRU replacement. The LRU field is represented by 3 bits, indicating the order of page access, where 0
means the most recently accessed page. At a given point in time, the TLB for the current process is in the initial state, as described in the table below. Additionally, three free physical pages are available.
Assumptions:
Your task:
Available free physical pages: `0x17`, `0x18`, `0x19`

Memory access pattern:
1. Read `0x11F0`
2. Write `0x1301`
3. Write `0x20AE`
4. Write `0x2332`
5. Read `0x20FF`
6. Write `0x3415`
Initial TLB:
VPN | PPN | Valid | Dirty | LRU |
---|---|---|---|---|
0x01 | 0x11 | 1 | 1 | 0 |
0x00 | 0x00 | 0 | 0 | 7 |
0x10 | 0x13 | 1 | 1 | 1 |
0x20 | 0x12 | 1 | 0 | 5 |
0x00 | 0x00 | 0 | 0 | 7 |
0x11 | 0x14 | 1 | 0 | 4 |
0xac | 0x15 | 1 | 1 | 2 |
0xff | 0xff | 1 | 0 | 3 |
Final TLB: (Use hexadecimal literal for address, 0/1 for dirty)
VPN | PPN | Valid | Dirty |
---|---|---|---|
0x01 | 0x11 | 1 | 1 |
E01 | E02 | 1 | E03 |
E04 | E05 | 1 | E06 |
E07 | E08 | 1 | E09 |
E10 | E11 | 1 | E12 |
E13 | E14 | 1 | E15 |
E16 | E17 | 1 | E18 |
E19 | E20 | 1 | E21 |
E01 = ? E02 = ? E03 = ?
0x13, 0x17, 1
E04 = ? E05 = ? E06 = ?
0x10, 0x13, 1
E07 = ? E08 = ? E09 = ?
0x20, 0x12, 1
E10 = ? E11 = ? E12 = ?
0x23, 0x18, 1
E13 = ? E14 = ? E15 = ?
0x11, 0x14, 0
E16 = ? E17 = ? E18 = ?
0xac, 0x15, 1
E19 = ? E20 = ? E21 = ?
0x34, 0x19, 1
- Access `0x11F0` (Read): Hit. Physical address: `0x14F0`; updated LRUs: 1, 7, 2, 5, 7, 0, 3, 4. Corresponding virtual page: `0x11`.
- Access `0x1301` (Write): Miss. Map VPN `0x13` to PPN `0x17`; set valid and dirty bits. Physical address: `0x1701`; updated LRUs: 2, 0, 3, 6, 7, 1, 4, 5.
- Access `0x20AE` (Write): Hit; set dirty bit. Physical address: `0x12AE`; updated LRUs: 3, 1, 4, 0, 7, 2, 5, 6.
- Access `0x2332` (Write): Miss. Map VPN `0x23` to PPN `0x18`; set valid and dirty bits. Physical address: `0x1832`; updated LRUs: 4, 2, 5, 1, 0, 3, 6, 7.
- Access `0x20FF` (Read): Hit. Physical address: `0x12FF`; updated LRUs: 4, 2, 5, 0, 1, 3, 6, 7.
- Access `0x3415` (Write): Miss (replacement needed). Replace the last entry; map VPN `0x34` to PPN `0x19`; set dirty bit. Physical address: `0x1915`; updated LRUs: 5, 3, 6, 1, 2, 4, 7, 0.
F
Consider a memory address space divided into 64-byte segments. For example, address 1024 is 64 bytes away from address 1025 but only 1 byte away from the address 1024 + 1/64 = 1024.015625.
Given a memory size of 4KiB and chunk addresses labeled as 0, 1, 2, …, determine the minimum number of exponent bits required in the floating-point memory address representation to access every byte, assuming a standard bias.
F01 = ?
4
With a memory size of 4 KiB divided into 64 B chunks, there are 64 total chunks. This requires 4 exponent bits to cover values from 0 to 63. The highest exponent value before applying the standard bias (excluding the all-ones infinity/NaN pattern) is 2^4 − 2 = 14. Applying the standard bias of −7 results in a maximum exponent value of 7, sufficient to cover values up to 63. Conversely, with only 3 exponent bits, the maximum exponent value after applying a bias of −3 is 3, insufficient to represent the value 63.
Determine the minimum number of mantissa bits required in the floating-point memory address system to accurately address each byte.
F02 = ?
11
The largest memory address to represent is 4095 (the last byte of 4 KiB), corresponding to chunk address 4095/64 = 63.984375 = 1.11111111111₂ × 2⁵. To encode this value, every bit below the leading 1 must be stored, so 11 mantissa bits are necessary.
G
In many applications, it is essential not only to identify the maximum value in an array but also to determine its corresponding index, commonly referred to as the argmax. This task can be efficiently accomplished using Data Level Parallelism. The argmax
function below accepts an array arr
and its length n
, returning the index of the maximum element. If multiple elements have the same maximum value, the function returns the earliest index.
To implement this efficiently, use pseudo-SIMD (Single Instruction, Multiple Data) intrinsics. These intrinsics operate on vec
structs, similar to Intel's __m128i
structs, which contain four packed integers. Complete the sections labeled G01
, G02
, and G03
in the function to ensure it works correctly.
Proposed SIMD Intrinsics reference:
vec sum_epi32(vec a, vec b)
: Returns a + b
.vec and_epi32(vec a, vec b)
: Returns a & b
.vec set_epi32(int a)
: Returns a SIMD vector with all entries set to a
.vec load_epi32(int *a)
: Returns a SIMD vector with entries a[0]
, a[1]
, a[2]
, and a[3]
.int reducemax_epi32(vec a)
: Returns the maximum value in vector a
.vec maskeq_epi32(vec a, int b)
: Returns a mask vector with 0xFFFFFFFF
where a
equals b
, and 0
otherwise.int firstv_epi32(vec a)
: Returns the index of the first entry with its lowest bit set to 1
.int argmax(int *arr, int n)
{
int curr, index = 0, running_max = -2147483648; /* -2^31 */
for (int i = 0; i < n / 4 * 4; i += 4) {
vec tmp = load_epi32(arr + i);
curr = reducemax_epi32(tmp);
if (curr > running_max) {
tmp = G01; /* Fill in your code here */
index = i + G02; /* Fill in your code here */
running_max = curr;
}
}
for (int i = n / 4 * 4; i < n; i += 1) {
if (G03) { /* Fill in your code here */
running_max = arr[i];
index = i;
}
}
return index;
}
Complete the above C code to ensure functionality.
G01 = ?
maskeq_epi32(tmp, curr)
G02 = ?
firstv_epi32(tmp)
G03 = ?
arr[i] > running_max
(or equivalent)
H
Suppose we are designing a new microprocessor with a custom ISA. To simplify the architecture, we decide to use a single, universal instruction type called the Z-type instruction.
We aim to support the following instructions:
add rd1, rs1, rs2
and rd1, rs1, rs2
lw rd1, offset1 (rs1)
sw rs2, offset1 (rs1)
addi rd1, rs1, imm1
beq rs1, rs2, offset1
lui rd1, offset1
jal rd1, imm
stw rs3, offset1, offset2 (rs1)
The new `stw` instruction stores the contents of `rs3` into both memory locations `rs1 + offset1` and `rs1 + offset2`. Its Register Transfer Level (RTL) is:

Mem[R[rs1] + offset1] ← R[rs3]
Mem[R[rs1] + offset2] ← R[rs3]
How many bits must the opcode field have if we want to support the instructions listed above? __ H01 __ bits
H01 = ?
4
There are 9 instructions in total, so we need 4 bits to represent them, as 2^3 = 8 < 9 ≤ 16 = 2^4.
With only one instruction format, we want to enable jumps of up to 64 KiB in either direction. How many bits are required for the immediate field to support this? Assume that, like RV32, the instruction’s least significant bit is an implicit 0 and not saved.
H02 = ?
16
Jumping 64 KiB in either direction equals 2^16 = 65536 bytes. Since immediate values are counted in half-instructions (the implicit low bit is 0), we need 15 bits for the magnitude, covering offsets up to 2^15 half-instructions. An extra bit is required for the sign, making the immediate field 16 bits.
Consider the following instruction format:
imm2 imm1 rs3 rs2 rs1 rd2 rd1 opcode
The instruction format is 64 bits long, where each immediate field (imm1
and imm2
) is 11 bits, and the opcode is 7 bits. What is the maximum number of registers the ISA can support?
H03 = ?
128
Total bits used by immediates and the opcode: 2 × 11 + 7 = 29.
Remaining bits for registers: 64 − 29 = 35.
There are 5 register fields (`rs1`, `rs2`, `rs3`, `rd1`, `rd2`). Dividing the remaining bits equally among them gives 35 / 5 = 7 bits per field.
This allows 2^7 = 128 registers.
I
You are tasked with designing a memory hierarchy using an 8-word (32-byte) cache line and 32-bit virtual and physical addresses. Consider the following loops:
/* Loop A */
int sum = 0;
for (int i = 0; i < (1 << 8); i++)
for (int j = 0; j < (1 << 6); j++)
sum += X[i][j] * Y[i][j];
/* Loop B */
int sum = 0;
for (int j = 0; j < (1 << 6); j++)
for (int i = 0; i < (1 << 8); i++)
sum += X[i][j] * Y[i][j];
Assumptions:

- `X[0][0]` is stored at address `0x0`.
- `X` and `Y` are stored contiguously in row-major order, meaning `X[i][j]` is adjacent to `X[i][j + 1]` in memory, and the end of `X` is immediately followed by `Y[0][0]`.
- `sum` is not stored in the cache.

Assume a 32 KiB 4-way set-associative data cache (D$) with a FIFO replacement policy. Complete the following table:
Loop | Compulsory Misses | Conflict and Capacity Misses | Total Misses |
---|---|---|---|
Loop A | I01 | I02 | I03 |
Loop B | I04 | I05 | I06 |
I01 = ?
4096 (Compulsory Misses for Loop A)
I02 = ?
0 (No Conflict or Capacity Misses for Loop A)
I03 = ?
4096 (Total Misses for Loop A)
I04 = ?
4096 (Compulsory Misses for Loop B)
I05 = ?
28672 (Conflict and Capacity Misses for Loop B)
I06 = ?
32768 (Total Misses for Loop B)
Assume a 256 KiB 8-way set-associative L2 cache with a FIFO replacement policy, including the L1 (D$). How many L2 cache misses occur when Loops A and B are run? Loop A: __ I07 __ / Loop B: __ I08 __
I07 = ?
4096 (L2 Cache Misses for Loop A)
I08 = ?
4096 (L2 Cache Misses for Loop B)
Given the following hit times:
Calculate the AMAT for Loops A and B: Loop A: __ I09 __ cycles / Loop B: __ I10 __ cycles
I09 = ?
11.75 cycles (AMAT for Loop A)
I10 = ?
22.25 cycles (AMAT for Loop B)
Assume a Virtually-Indexed Physically-Tagged (VIPT) cache with a 4 KiB page size and a 32 KiB cache size. Is it possible to have virtual address aliasing? __ I11 __ (answer in Yes or No) What is the minimum set-associativity required to avoid aliasing? __ I12 __
I11 = ?
Yes (Virtual Address Aliasing Possible)
I12 = ?
8 (Minimum Set-Associativity Required)