You may use ptr->member instead of (*ptr).member. The allowed headers are <stdint.h>, <stddef.h>, <stdlib.h>, <stdio.h>, and <string.h>.

A
Average Memory Access Time (AMAT) represents the average time required to access memory. The main formula is:

AMAT = Hit Time + Miss Rate × Miss Penalty
In a multi-level cache system, two types of miss rates are considered for each cache level: the local miss rate (misses at this level divided by the accesses that reach this level) and the global miss rate (misses at this level divided by all accesses to the cache system).
In a two-level cache system, suppose there were 100 total accesses to the entire cache system, and the L2 cache experienced 20 misses. If the L1 cache has a local miss rate of 50%, what is the local miss rate of the L2 cache? __ A01 __
A01 = ?
Suppose the system has the following characteristics:
What is the local miss rate of the L2 cache? __ A02 __
A02 = ?
Suppose we aim to reduce the system's AMAT to 8 cycles or lower by introducing an L3 cache. If the L3 cache has a local miss rate of 30%, what is the maximum allowable hit time for the L3 cache? __ A03 __ cycles

A03 = ?
Given the following block of C code, determine the cache hit rate on a byte-addressed system with a total memory size of 1 MiB. The system has a 16 KiB direct-mapped cache with 1 KiB blocks. Assume the cache starts cold.
What is the Tag:Index:Offset (TIO) breakdown? Tag: __ A04 __ bits; Index: __ A05 __ bits; Offset: __ A06 __ bits
A04 = ? A05 = ? A06 = ?
Determine the cache hit rate for the line labeled as Line 7. __ A07 __
A07 = ?
Determine the cache hit rate for the line labeled as Line 11. __ A08 __
A08 = ?
Associativity refers to the number of cache slots a memory block can map to. A fully associative cache has the highest associativity, meaning any block can be placed in any cache slot. In contrast, a direct-mapped cache has the lowest associativity, being only 1-way set associative, where each block has only one specific slot.
For an N-way set-associative cache, the following properties hold:
Consider a system with a 2-way set-associative cache.
Classify the following memory accesses as either a cache hit (H), cache miss (M), or cache miss with replacement (R). If a miss occurs, specify its type (Compulsory, Capacity, or Conflict). Assume the Least Recently Used (LRU) replacement policy. Note that while LRU is used here, other policies may apply in different scenarios.
Address (Binary) | Tag / Index / Offset | Hit, Miss, Replace |
---|---|---|
0b0000 0100 | _ | _ |
0b0000 0101 | _ | _ |
0b0110 1000 | _ | _ |
0b1100 1000 | Tag: __ A09 __ / Index: __ A10 __ / Offset: __ A11 __ | __ A12 __ |
0b0110 1000 | _ | _ |
0b1101 1101 | Tag: __ A13 __ / Index: __ A14 __ / Offset: __ A15 __ | __ A16 __ |
0b0100 0101 | _ | _ |
0b0000 0100 | _ | _ |
0b0011 0000 | Tag: __ A17 __ / Index: __ A18 __ / Offset: __ A19 __ | __ A20 __ |
0b1100 1011 | _ | _ |
0b0100 0010 | _ | _ |
You should fill in the "Tag / Index / Offset" and "Hit, Miss, Replace" columns based on the cache configuration and memory access behavior.
A09 = ? A10 = ? … A20 = ?
What is the cache hit rate for the accesses listed above? __ A21 __
A21 = ?
B
Considering the RISC-V code provided and assuming a pipelined CPU without forwarding, determine the number and types of hazards present. Evaluate all potential hazards between the instructions.
The pipeline diagram (no stalls shown):

Instruction | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
---|---|---|---|---|---|---|---|---|---|
1. sub t1, s0, s1 | IF | ID | EX | MEM | WB | | | | |
2. or s0, t0, t1 | | IF | ID | EX | MEM | WB | | | |
3. sw s1, 100(s0) | | | IF | ID | EX | MEM | WB | | |
4. bgeu s0, s2, loop | | | | IF | ID | EX | MEM | WB | |
5. add t2, x0, x0 | | | | | IF | ID | EX | MEM | WB |
How many stalls are required to resolve data hazards if the register file supports double-pumping (write-then-read)? __ B01 __ stall(s)
B01 = ?
How many stalls are required to resolve the data hazards if the register file supports double-pumping (write-then-read)? What about the control hazards if branch prediction is used? Assuming the ability to read and write to the register file in the same cycle:

- WB→ID dependency.
- WB→ID dependency.
- MEM to one cycle before IF.

B02 = ? B03 = ? … B11 = ?
C
We will use the single-cycle CPU datapath shown below.
The control signals for the single-cycle datapath may take on the following values.
- PCSel:
- RegWEn:
- ImmSel:
- BrEq:
- BrLt:
- ALUSel:
Complete the table by specifying the control signals for each instruction based on the datapath.
If a control signal's value does not matter for an instruction, mark it with the * (don't care) symbol; otherwise, specify 0/1.

Instruction | BrEq | BrLT | PCSel | ImmSel | BrUn | ASel | BSel | ALUSel | MemRW | RegWEn | WBSel |
---|---|---|---|---|---|---|---|---|---|---|---|
add | * | * | 0 (PC+4) | * | * | 0 (Reg) | 0 (Reg) | add | 0 | 1 | 1 (ALU) |
ori | * | * | 0 | I | * | 0 (Reg) | 1 (Imm) | or | 0 | 1 | 1 (ALU) |
lw | * | * | C04 | I | * | 0 (Reg) | 1 (Imm) | add | 0 | C11 | 0 (MEM) |
sw | * | * | C05 | S | * | 0 (Reg) | 1 (Imm) | add | 1 | C12 | * |
beq | C01 | * | C06 | B | C09 | 1 (PC) | 1 (Imm) | add | 0 | C13 | * |
jal | * | C02 | C07 | J | * | 1 (PC) | 1 (Imm) | add | 0 | C14 | C16 |
blt | * | C03 | C08 | B | C10 | 1 (PC) | 1 (Imm) | add | 0 | C15 | * |
C01 = ? C02 = ? … C16 = ?
D
We will implement the Fast Fourier Transform (FFT) and its Inverse (IFFT) using RISC-V instructions (RVV included). The task involves translating C code into RISC-V assembly while adhering to the specified constraints and guidelines.
Part 1: bitwise reverse
Convert to RISC-V assembly:
Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions. All registers should start with s.
D01 = ? D02 = ? … D05 = ?
You are now allowed to use the RISC-V BitManip extension. One of its supported instructions is clz rd, rs, which counts the number of consecutive zero bits starting from the most significant bit (MSB) of the source register rs and stores the result in rd.
Reference: RISC-V BitManip
Consider the following RISC-V function set_log2n: it computes log2(N) for a 32-bit unsigned integer stored in a0 and writes the result to the memory location labeled logsize.

- Inputs: a0: 32-bit unsigned integer N
- Outputs: log2(N) (stored at logsize)
- Registers modified: t0, t1

The set_log2n function can be used to calculate and store log2(size) for shared access by other functions.
Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions. All registers should start with t.
D06 = ?
D07 = ?
Let's implement the previously described reverse routine as follows, without branches:
Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions.
D08 = ? D09 = ? … D19 = ?
Next, we will utilize the RISC-V Vector Extension (RVV). Consider the following code:
The above should be built with --march=rv32i_zbb_zve32x using the GNU toolchain.
Extension | Minimum VLEN | Supported EEW | FP32 | FP64 |
---|---|---|---|---|
zve32x | 32 | 8, 16, 32 | N | N |
Reference: RISC-V "V" Vector Extension
We can implement a vectorized version of the reverse routine, vreverse, using the RISC-V Vector Extension as follows.
The usage of vreverse would look something like this:
Complete the above RISC-V assembly code to ensure functionality.
D20 = ? D21 = ? … D23 = ?
E
A processor uses 16-bit addresses, 256-byte pages, and an 8-entry fully associative TLB with LRU replacement. The LRU field is represented by 3 bits, indicating the order of page access, where 0 means the most recently accessed page. At a given point in time, the TLB for the current process is in the initial state, as described in the table below. Additionally, three free physical pages are available.
Assumptions:
Your task:
Available free physical pages:
Memory access pattern:
Initial TLB:
VPN | PPN | Valid | Dirty | LRU |
---|---|---|---|---|
0x01 | 0x11 | 1 | 1 | 0 |
0x00 | 0x00 | 0 | 0 | 7 |
0x10 | 0x13 | 1 | 1 | 1 |
0x20 | 0x12 | 1 | 0 | 5 |
0x00 | 0x00 | 0 | 0 | 7 |
0x11 | 0x14 | 1 | 0 | 4 |
0xac | 0x15 | 1 | 1 | 2 |
0xff | 0xff | 1 | 0 | 3 |
Final TLB: (Use hexadecimal literal for address, 0/1 for dirty)
VPN | PPN | Valid | Dirty |
---|---|---|---|
0x01 | 0x11 | 1 | 1 |
E01 | E02 | 1 | E03 |
E04 | E05 | 1 | E06 |
E07 | E08 | 1 | E09 |
E10 | E11 | 1 | E12 |
E13 | E14 | 1 | E15 |
E16 | E17 | 1 | E18 |
E19 | E20 | 1 | E21 |
E01 = ? E02 = ? … E21 = ?
F
Consider a memory address space divided into 64-byte segments. For example, address 1024 is 64 bytes away from address 1025 but only 1 byte away from the address . In this system, standard integer binary addressing is ineffective. Instead, consider using floating-point representation based on the IEEE-754 standard, which includes 1 sign bit. You need to determine the appropriate number of exponent and mantissa bits.
Given a memory size of 4KiB and chunk addresses labeled as 0, 1, 2, …, determine the minimum number of exponent bits required in the floating-point memory address representation to access every byte, assuming a standard bias.
F01 = ?
Determine the minimum number of mantissa bits required in the floating-point memory address system to accurately address each byte.
F02 = ?
G
In many applications, it is essential not only to identify the maximum value in an array but also to determine its corresponding index, commonly referred to as the argmax. This task can be efficiently accomplished using Data Level Parallelism. The argmax function below accepts an array arr and its length n, returning the index of the maximum element. If multiple elements have the same maximum value, the function returns the earliest index.

To implement this efficiently, use pseudo-SIMD (Single Instruction, Multiple Data) intrinsics. These intrinsics operate on vec structs, similar to Intel's __m128i type, which contain four packed integers. Complete the sections labeled G01, G02, and G03 in the function to ensure it works correctly.
Proposed SIMD Intrinsics reference:

- vec sum_epi32(vec a, vec b): Returns a + b.
- vec and_epi32(vec a, vec b): Returns a & b.
- vec set_epi32(int a): Returns a SIMD vector with all entries set to a.
- vec load_epi32(int *a): Returns a SIMD vector with entries a[0], a[1], a[2], and a[3].
- int reducemax_epi32(vec a): Returns the maximum value in vector a.
- vec maskeq_epi32(vec a, int b): Returns a mask vector with 0xFFFFFFFF where a equals b, and 0 otherwise.
- int firstv_epi32(vec a): Returns the index of the first entry with its lowest bit set to 1.

Complete the above C code to ensure functionality.
G01 = ?
G02 = ?
G03 = ?
H
Suppose we are designing a new microprocessor with a custom ISA. To simplify the architecture, we decide to use a single, universal instruction type called the Z-type instruction.
We aim to support the following instructions:
- add rd1, rs1, rs2
- and rd1, rs1, rs2
- lw rd1, offset1(rs1)
- sw rs2, offset1(rs1)
- addi rd1, rs1, imm1
- beq rs1, rs2, offset1
- lui rd1, offset1
- jal rd1, imm
- stw rs3, offset1, offset2(rs1)
The new stw instruction stores the contents of rs3 into both memory locations rs1 + offset1 and rs1 + offset2. Its Register Transfer Level (RTL) is:
How many bits must the opcode field have if we want to support the instructions listed above? __ H01 __ bits
H01 = ?
With only one instruction format, we want to enable jumps of up to 64 KiB in either direction. How many bits are required for the immediate field to support this? Assume that, like RV32, the instruction's least significant bit is an implicit 0 and not saved.
H02 = ?
Consider the following instruction format:
The instruction format is 64 bits long, where each immediate field (imm1 and imm2) is 11 bits, and the opcode is 7 bits. What is the maximum number of registers the ISA can support?
H03 = ?
I
You are tasked with designing a memory hierarchy using an 8-word (32-byte) cache line and 32-bit virtual and physical addresses. Consider the following loops:
Assumptions:

- X[0][0] is stored at address 0x0.
- X and Y are stored contiguously in row-major order, meaning X[i][j] is adjacent to X[i][j + 1] in memory, and X[256][64] is next to Y[0][0].
- sum is not stored in the cache.

Assume a 32 KiB 4-way set-associative data cache (D$) with a FIFO replacement policy. Complete the following table:
Loop | Compulsory Misses | Conflict and Capacity Misses | Total Misses |
---|---|---|---|
Loop A | I01 | I02 | I03 |
Loop B | I04 | I05 | I06 |
I01 = ? I02 = ? … I06 = ?
Assume a 256 KiB 8-way set-associative L2 cache with a FIFO replacement policy, in addition to the L1 (D$). How many L2 cache misses occur when Loops A and B are run? Loop A: __ I07 __ / Loop B: __ I08 __
I07 = ?
I08 = ?
Given the following hit times:
Calculate the AMAT for Loops A and B: Loop A: __ I09 __ cycles / Loop B: __ I10 __ cycles
I09 = ?
I10 = ?
Assume a Virtually-Indexed Physically-Tagged (VIPT) cache with a 4 KiB page size and a 32 KiB cache size. Is it possible to have virtual address aliasing? __ I11 __ (answer in Yes or No) What is the minimum set-associativity required to avoid aliasing? __ I12 __
I11 = ?
I12 = ?