---
tags: computer-arch
---

# Quiz5 of Computer Architecture (2024 Fall)

> Solutions

## Problem `A`

Average Memory Access Time (AMAT) represents the average time required to access memory. The main formula is:
$$
AMAT = \text{Hit Time} + (\text{Miss Rate} \times \text{Miss Penalty})
$$

In a multi-level cache system, two types of miss rates are considered for each cache level:
- Global miss rate: Calculated as the number of accesses that missed at a specific cache level divided by the total number of accesses to the entire cache system.
- Local miss rate: Calculated as the number of accesses that missed at a specific cache level divided by the total number of accesses made to that cache level.

In a two-level cache system, suppose there were 100 total accesses to the entire cache system, and the L2 cache experienced 20 misses. If the L1 cache has a local miss rate of 50%, what is the local miss rate of the L2 cache? __ A01 __

> ==40%==
> Since the L1 cache has a local miss rate of 50%, the L2 cache is accessed 50 times out of the 100 total memory accesses. Given that the L2 cache experienced 20 misses, its local miss rate is:
> $$
> \text{Local Miss Rate of L2} = \frac{20 \ \text{misses}}{50 \ \text{accesses}} \times 100\% = 40\%
> $$

Suppose the system has the following characteristics:
1. An L1 cache with a hit time of 2 cycles and a local miss rate of 20%.
2. An L2 cache with a hit time of 15 cycles and a global miss rate of 5%.
3. Main memory with an access time of 100 cycles.

What is the local miss rate of the L2 cache? __ A02 __

> ==25%==
> The number of accesses to the L2 cache equals the number of misses in the L1 cache. To find the local miss rate of the L2 cache, divide its global miss rate by the miss rate of the L1 cache:
> $$
> \text{L2 Local Miss Rate} = \frac{\text{Misses in L2}}{\text{Accesses in L2}} = \frac{\frac{\text{Misses in L2}}{\text{Total Accesses}}}{\frac{\text{Misses in L1}}{\text{Total Accesses}}} = \frac{\text{Global Miss Rate of L2}}{\text{Miss Rate of L1}}
> $$
> Substituting the values:
> $$
> \text{L2 Local Miss Rate} = \frac{5\%}{20\%} = 0.25 = 25\%
> $$

Suppose we aim to reduce the system's AMAT to 8 cycles or lower by introducing an L3 cache. If the L3 cache has a local miss rate of 30%, what is the maximum allowable hit time for the L3 cache? __ A03 __ cycles

> ==30==
> Let $H$ represent the hit time of the L3 cache. Extending the AMAT equation, where the miss penalty of the L2 cache becomes the local AMAT of the L3 cache, the system's AMAT is:
> $$
> AMAT = 2 + 0.2 \times \left( 15 + 0.25 \times \left( H + 0.3 \times 100 \right) \right)
> $$
>
> To ensure the AMAT does not exceed 8 cycles, we solve the inequality:
> $$
> 2 + 0.2 \times \left( 15 + 0.25 \times \left( H + 30 \right) \right) \leq 8
> $$
>
> Simplifying this yields $H \leq 30$. Thus, the maximum hit time for the L3 cache is 30 cycles.

Given the following block of C code, determine the cache hit rate on a byte-addressed system with a total memory size of 1 MiB. The system has a 16 KiB direct-mapped cache with 1 KiB blocks. Assume the cache starts cold.

```c=
#define N_INTS 8192
int A[N_INTS]; /* A lives at 0x10000 */
int i, total = 0;

/* Initialize array */
for (i = 0; i < N_INTS; i += 128)
    A[i] = i;

/* Process array */
for (i = 0; i < N_INTS; i += 128)
    total += A[i];
```

What is the Tag:Index:Offset (TIO) breakdown?
Tag: __ A04 __ bits; Index: __ A05 __ bits; Offset: __ A06 __ bits

> ==6==, ==4==, ==10==
> The Tag:Index:Offset (T/I/O) breakdown is calculated as follows:
> - Offset = $\log_2(1 \, \text{KiB}) = \log_2(2^{10}) = 10$ bits
> - Index = $\log_2\left(\frac{16 \, \text{KiB}}{1 \, \text{KiB}}\right) = \log_2(16) = 4$ bits
> - Tag = Total address bits - Index bits - Offset bits
> $$
> \text{Tag} = 20 - 4 - 10 = 6 \, \text{bits}
> $$

Determine the cache hit rate for the line labeled as Line 7. __ A07 __

> ==50%==
> Since integer accesses are spaced $4 \times 128 = 512$ bytes apart, each cache block holds two accesses. The first access to each block results in a compulsory miss, while the second access hits because both `A[i]` and `A[i + 128]` reside in the same cache block. This results in a cache hit rate of 50%.

Determine the cache hit rate for the line labeled as Line 11. __ A08 __

> ==50%==
> The size of array `A` is $8192 \times 4 = 2^{15}$ bytes, which is exactly twice the size of the cache. After executing Line 7, the second half of `A` resides in the cache, but Line 11 starts processing from the first half of `A`, so none of the cached data from Line 7 can be reused. Since memory accesses in Line 11 follow the same pattern as Line 7, the cache hit rate remains 50%. The variable `total` is likely kept in a register by the compiler, so it does not affect the cache hit calculation.

Associativity refers to the number of cache slots a memory block can map to. A fully associative cache has the highest associativity: any block can be placed in any cache slot. In contrast, a direct-mapped cache has the lowest associativity; it is only 1-way set associative, so each block has exactly one candidate slot.
For an $N$-way set-associative cache, the following properties hold:
$$
N \times \text{\# of sets} = \text{\# of blocks}
$$
$$
\text{Index bits} = \log_2(\text{\# of sets})
$$

Consider a system with a 2-way set-associative cache:
- Address space: 8 bits
- Block size: 8 bytes
- Cache size: 32 bytes

Classify the following memory accesses as either a cache hit (`H`), cache miss (`M`), or cache miss with replacement (`R`). If a miss occurs, specify its type (Compulsory, Capacity, or Conflict). Assume the Least Recently Used (LRU) replacement policy. Note that while LRU is used here, other policies may apply in different scenarios.

| Address (Binary) | Tag / Index / Offset | Hit, Miss, Replace |
|------------------|------------------------|--------------------|
| 0b0000 0100 | _ | _ |
| 0b0000 0101 | _ | _ |
| 0b0110 1000 | _ | _ |
| 0b1100 1000 | Tag: __ A09 __ / Index: __ A10 __ / Offset: __ A11 __ | __ A12 __ |
| 0b0110 1000 | _ | _ |
| 0b1101 1101 | Tag: __ A13 __ / Index: __ A14 __ / Offset: __ A15 __ | __ A16 __ |
| 0b0100 0101 | _ | _ |
| 0b0000 0100 | _ | _ |
| 0b0011 0000 | Tag: __ A17 __ / Index: __ A18 __ / Offset: __ A19 __ | __ A20 __ |
| 0b1100 1011 | _ | _ |
| 0b0100 0010 | _ | _ |

You should fill in the "Tag / Index / Offset" and "Hit, Miss, Replace" columns based on the cache configuration and memory access behavior.

> A09 = ? A10 = ? A11 = ? A12 = ?
> ==1100==, ==1==, ==000==, ==Miss== or ==M==
> A13 = ? A14 = ? A15 = ? A16 = ?
> ==1101==, ==1==, ==101==, ==Replace== or ==R==
> A17 = ? A18 = ? A19 = ? A20 = ?
> ==0011==, ==0==, ==000==, ==Replace== or ==R==

What is the cache hit rate for the accesses listed above? __ A21 __

> ==27.3%== or $\frac{3}{11}$
> There are 3 hits out of 11 total accesses, resulting in a hit rate of approximately 27.3%.

---

## Problem `B`

Considering the RISC-V code provided and assuming a pipelined CPU without forwarding, determine the number and types of hazards present. Evaluate all potential hazards between the instructions.
The pipeline diagram without stalls is:

| Instruction | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 |
|---------------------------|------|------|------|------|------|------|------|------|------|
| 1. `sub t1, s0, s1` | IF | ID | EX | MEM | WB | | | | |
| 2. `or s0, t0, t1` | | IF | ID | EX | MEM | WB | | | |
| 3. `sw s1, 100(s0)` | | | IF | ID | EX | MEM | WB | | |
| 4. `bgeu s0, s2, loop` | | | | IF | ID | EX | MEM | WB | |
| 5. `add t2, x0, x0` | | | | | IF | ID | EX | MEM | WB |

How many hazards are present among these instructions? __ B01 __ hazard(s)

> ==4==
> There are four hazards present:
> - A data hazard between instructions 1 and 2 due to `t1`.
> - A data hazard between instructions 2 and 3 caused by `s0`.
> - A data hazard between instructions 2 and 4, also involving `s0`.
> - A control hazard between instructions 4 and 5 caused by the branch instruction.

How many stalls are required to resolve the data hazards if the register file supports double-pumping (write-then-read)? What about the control hazards if branch prediction is used?

Assuming the ability to write to and read from the register file in the same cycle:
- Data Hazards:
    - __ B02 __ stalls are needed between instructions __ B03 __ and __ B04 __ due to the `WB→ID` dependency.
    - __ B05 __ stalls are required between instructions __ B06 __ and __ B07 __ for the same `WB→ID` dependency.
    - No additional stalls are needed between instructions __ B08 __ and __ B09 __, as the stalls added for instruction 3 already resolve this dependency.
- Control Hazards:
    - If the branch prediction is correct, __ B10 __ stalls are required.
    - If the branch prediction is incorrect, __ B11 __ stalls are needed to flush the pipeline: the instructions in `IF`, `ID`, and `EX` are discarded when the branch resolves in `MEM`.

> B02 = ?
> ==2==
> B03 = ?
> ==1==
> B04 = ?
> ==2==
> B05 = ?
> ==2==
> B06 = ?
> ==2==
> B07 = ?
> ==3==
> B08 = ?
> ==2==
> B09 = ?
> ==4==
> B10 = ?
> ==no== (or zero)
> B11 = ?
> ==3==
>
> With the four data-hazard stalls inserted (`*` denotes a bubble) and the branch predicted correctly, the pipeline becomes:
>
> | Instruction | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 |
> |--------------------------|------|------|------|------|------|------|------|------|------|------|------|------|------|
> | 1. `sub t1, s0, s1` | IF | ID | EX | MEM | WB | | | | | | | | |
> | 2. `or s0, t0, t1` | | IF | * | * | ID | EX | MEM | WB | | | | | |
> | 3. `sw s1, 100(s0)` | | | | | IF | * | * | ID | EX | MEM | WB | | |
> | 4. `bgeu s0, s2, loop` | | | | | | | | IF | ID | EX | MEM | WB | |
> | 5. `add t2, x0, x0` | | | | | | | | | IF | ID | EX | MEM | WB |
>
> On a misprediction, the three instructions fetched behind the branch must be flushed when the branch resolves in `MEM`, costing 3 stalls.

---

## Problem `C`

We will use the single-cycle CPU datapath shown below.

![datapath](https://hackmd.io/_uploads/S1C0mTNEkl.png =80%x)

The control signals for the single-cycle datapath may take on the following values:
- `PCSel`:
    - 0: The next PC is OldPC + 4.
    - 1: The next PC is the ALU result (used for branches, jumps, etc.).
- `RegWEn`:
    - 0: The write-back value is not written to the register file.
    - 1: The write-back value is written to the register file.
- `ImmSel`:
    - 0-4: Used for I, B, S, J, and U type immediates.
    - 5-7: Reserved or unused.
- `BrEq`:
    - 0: Inputs are not equal.
    - 1: Inputs are equal.
- `BrLt`:
    - 0: Register rs1 is not less than rs2.
    - 1: Register rs1 is less than rs2.
- `ALUSel`:
    - Follows the same design convention as referenced in the CS61C course.

Complete the table by specifying the control signals for each instruction based on the datapath.
- If a signal's value does not impact the execution of an instruction, use the `*` (don't care) symbol.
- If a signal's value affects the instruction and can vary depending on the program, list all possible values. For example, if a signal can be either 0 or 1, write `0/1`.
| Instruction | BrEq | BrLT | PCSel | ImmSel | BrUn | ASel | BSel | ALUSel | MemRW | RegWEn | WBSel| |-------------|------|------|---------|--------|------|----------|----------|---------|--------|---------|----------| | `add` | * | * | 0 (PC+4)| * | * | 0 (Reg)| 0 (Reg)|add| 0|1| 1 (ALU)| | `ori` | * | * | 0 | I | * | 0 (Reg)| 1 (Imm) | or | 0 | 1 | 1 (ALU)| | `lw`| * | * | C04 | I | * | 0 (Reg)| 1 (Imm) | add | 0 | C11 | 0 (MEM)| | `sw`| * | * | C05 | S | * | 0 (Reg)| 1 (Imm) | add | 1 | C12 | * | | `beq` | C01 | * | C06 | B | C09 | 1 (PC) | 1 (Imm) | add | 0 | C13 | * | | `jal` | * | C02 | C07 | J | * | 1 (PC) | 1 (Imm) | add | 0 | C14 | C16 | | `blt` | * | C03 | C08 | B | C10 | 1 (PC) | 1 (Imm) |add | 0 | C15 | * | C01 = ? > ==0/1== C02 = ? > ==*== C03 = ? > ==0/1== C04 = ? > ==0== C05 = ? > ==0== C06 = ? > ==0/1== C07 = ? > ==1== (ALU) C08 = ? > ==0/1== C09 = ? > ==*== C10 = ? > ==0== C11 = ? > ==1== C12 = ? > ==0== C13 = ? > ==0== C14 = ? > ==1== C15 = ? > ==0== C16 = ? > ==2== (PC + 4) --- ## Problem `D` We will implement the [Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (FFT) and its Inverse (IFFT) using RISC-V instructions (RVV included). The task involves translating C code into RISC-V assembly while adhering to the specified constraints and guidelines. 
Part 1: bitwise reverse

```c
/* Calculates the log2 of number */
int logint(int N)
{
    int k = N, i = 0;
    while (k) {
        k >>= 1;
        i++;
    }
    return i - 1;
}

/* Bitwise reverses the number */
int reverse(int N, int n)
{
    int j, p = 0;
    for (j = 1; j <= logint(N); j++) {
        if (n & (1 << (logint(N) - j)))
            p |= 1 << (j - 1);
    }
    return p;
}
```

Convert to RISC-V assembly:

```c
# Takes input N (a0), returns its log base 2 in a0
logint:
    addi sp, sp, -4
    sw t0, 0(sp)
    add t0, a0, zero          # k = N
    add a0, zero, zero        # i = 0
logloop:
    beq t0, zero, logloop_end # Exit if k == 0
    srai t0, t0, 1            # k >>= 1
    addi a0, a0, 1            # i++
    j logloop
logloop_end:
    addi a0, a0, -1           # Return i - 1
    lw t0, 0(sp)
    addi sp, sp, 4
    jr ra

# Takes inputs N (a0) and n (a1), reverses the number in binary
reverse:
    addi sp, sp, -28
    sw ra, 0(sp)
    sw s0, 4(sp)
    sw s1, 8(sp)
    sw s2, 12(sp)
    sw s3, 16(sp)
    sw s4, 20(sp)
    sw s5, 24(sp)
    call logint               # Now a0 has log2(N)
    addi s0, zero, 1          # j = 1
    add s1, zero, zero        # p = 0
forloop_reverse:
    bgt s0, a0, forloop_reverse_end
    sub s2, a0, s0            # s2 = a0 - s0
    addi s3, zero, 1
    __ D01 __                 # Fill your answer!
    __ D02 __                 # Fill your answer! Check if bit is set
    beq s3, zero, elses3      # If not, skip
ifs3:
    addi s4, s0, -1           # s4 = j - 1
    addi s5, zero, 1
    __ D03 __                 # Fill your answer!
    __ D04 __                 # Fill your answer! Set bit in p
elses3:
    __ D05 __                 # Fill your answer!
    j forloop_reverse
forloop_reverse_end:
    add a0, s1, zero          # Return p
    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw s2, 12(sp)
    lw s3, 16(sp)
    lw s4, 20(sp)
    lw s5, 24(sp)
    addi sp, sp, 28
    jr ra
```

Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions. All registers should start with `s`.

D01 = ?
> ==`sll s3, s3, s2`==

D02 = ?
> ==`and s3, a1, s3`==

D03 = ?
> ==`sll s5, s5, s4`==

D04 = ?
> ==`or s1, s1, s5`==

D05 = ?
> ==`addi s0, s0, 1`==

You are now allowed to use the RISC-V BitManip extension.
One of its supported instructions is `clz rd, rs`, which counts the number of leading zero bits, starting from the most significant bit (MSB), of the source register `rs` and stores the result in `rd`.

> Reference: [RISC-V BitManip](https://courses.cs.washington.edu/courses/cse481a/20sp/notes/bitmanip.pdf)

Consider the following RISC-V function `set_log2n`: It computes `log2(N)` for a 32-bit unsigned integer stored in `a0` and writes the result to the memory location labeled `logsize`.

Inputs:
- `a0`: 32-bit unsigned integer `N`

Outputs:
- None (result stored in memory at `logsize`).

Registers Modified:
- `t0`, `t1`

```c
set_log2n:
    clz t0, a0                # Count leading zeros in a0
    li t1, 31                 # Maximum bit index for a 32-bit register
    __ D06 __                 # Fill your answer! Compute log2(N)
    la t0, logsize            # Load address of 'logsize'
    __ D07 __                 # Fill your answer!
    jr ra                     # Return to caller

logsize:
    .word 0
```

The `set_log2n` function can be used to calculate and store `log2(size)` for shared access by other functions. Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions. All registers should start with `t`.

D06 = ?
> ==`sub t1, t1, t0`==

D07 = ?
> ==`sw t1, 0(t0)`==

Let's implement the previously described `reverse` routine as follows without branches:

```c
# Function: reverse
# Reverses the binary digits of a 32-bit integer.
# Inputs:
#   - a0: Input number to reverse.
#   - a1: Number of significant bits to reverse (optional; default 32).
# Outputs:
#   - a0: The reversed binary number.
# Clobbers:
#   - t0, t1, t2
reverse:
    # Swap odd and even bits
    lui t0, 0x55555           # Load upper 20 bits of 0x55555555
    ori t0, t0, 0x555         # Complete loading 0x55555555
    srli t1, a0, 1            # v >> 1
    and t1, t1, t0            # (v >> 1) & 0x55555555
    __ D08 __                 # Fill your answer!
    __ D09 __                 # Fill your answer!
    or a0, t1, t2             # Store result in a0

    # Swap consecutive pairs
    lui t0, 0x33333           # Load upper 20 bits of 0x33333333
    ori t0, t0, 0x333         # Complete loading 0x33333333
    srli t1, a0, 2            # v >> 2
    and t1, t1, t0            # (v >> 2) & 0x33333333
    __ D10 __                 # Fill your answer!
    __ D11 __                 # Fill your answer!
    or a0, t1, t2             # Store result in a0

    # Swap nibbles
    lui t0, 0x0F0F1           # Load 0x0F0F1000
    addi t0, t0, -241         # 0x0F0F1000 - 0xF1 = 0x0F0F0F0F (0xF0F would be sign-extended by ori)
    __ D12 __                 # Fill your answer!
    and t1, t1, t0
    __ D13 __                 # Fill your answer!
    __ D14 __                 # Fill your answer!
    or a0, t1, t2             # Store result in a0

    # Swap bytes
    lui t0, 0x00FF0           # Load upper 20 bits of 0x00FF00FF
    ori t0, t0, 0x0FF         # Complete loading 0x00FF00FF
    __ D15 __                 # Fill your answer!
    and t1, t1, t0
    __ D16 __                 # Fill your answer!
    __ D17 __                 # Fill your answer!
    or a0, t1, t2             # Store result in a0

    # Swap 2-byte pairs
    srli t1, a0, __ D18 __
    slli t2, a0, __ D19 __
    or a0, t1, t2             # Final result in a0

    # Adjust number of significant bits
    addi t1, zero, 32         # Load immediate value 32
    sub t1, t1, a1            # Calculate shift amount
    srl a0, a0, t1            # Shift right to fit bit count
    jr ra                     # Return with result in a0
```

Complete the above RISC-V assembly code to ensure functionality without using any pseudo-instructions.

D08 = ?
> ==`and t2, a0, t0`==
> C: `v & 0x55555555`

D09 = ?
> ==`slli t2, t2, 1`==
> C: `(v & 0x55555555) << 1`

D10 = ?
> ==`and t2, a0, t0`==
> C: `v & 0x33333333`

D11 = ?
> ==`slli t2, t2, 2`==
> C: `(v & 0x33333333) << 2`

D12 = ?
> ==`srli t1, a0, 4`==
> C: `v >> 4`

D13 = ?
> ==`and t2, a0, t0`==
> C: `v & 0x0F0F0F0F`

D14 = ?
> ==`slli t2, t2, 4`==
> C: `(v & 0x0F0F0F0F) << 4`

D15 = ?
> ==`srli t1, a0, 8`==
> C: `v >> 8`

D16 = ?
> ==`and t2, a0, t0`==
> C: `v & 0x00FF00FF`

D17 = ?
> ==`slli t2, t2, 8`==
> C: `(v & 0x00FF00FF) << 8`

D18 = ?
> ==16==

D19 = ?
> ==16==

Next, we will utilize the RISC-V Vector Extension (RVV). Consider the following code:

```c
# Initialize the helper vector with sequential integers (0,1,2,...)
init_helper_vector:
    la t0, helperVector       # Load the base address of helperVector
    lw t1, size               # Load the size of the vector (N)
    vsetvli t2, t1, e32, m1   # Set the vector length configuration
    vid.v v0                  # Generate a vector of indices
    vse32.v v0, (t0)          # Store the generated vector in helperVector
    ret
```

The above should be built with `--march=rv32i_zbb_zve32x` using the GNU toolchain.

| Extension | Minimum VLEN | Supported EEW | FP32 | FP64 |
|-----------|---------------|----------------|------|------|
| `zve32x` | 32 | 8, 16, 32 | N | N |

> Reference: [RISC-V "V" Vector Extension](https://github.com/riscvarchive/riscv-v-spec/blob/master/v-spec.adoc)

We can implement a vectorized version of the reverse routine, `vreverse`, using the RISC-V Vector Extension as follows.

```c
# Calculate log base 2 of input
# Input:  a0 = N
# Output: a0 = log2(N)
logint:
    clz a0, a0                # Count leading zeros of a0
    addi t0, zero, 31         # Load 31 (32-bit word size - 1)
    __ D20 __                 # Fill your answer!
    jr ra                     # Return

# Bit-reverse elements of a vector
# Input:  v29 = input vector, a0 = N
# Output: v29 = bit-reversed vector
vreverse:
    addi sp, sp, -4           # Save return address
    sw ra, 0(sp)
    call logint               # Compute log2(N) in a0
    lw ra, 0(sp)              # Restore return address
    addi sp, sp, 4
    li t1, 1                  # Constant 1 for bit manipulation
    li t0, 1                  # Initialize bit position counter j
    vmv.v.x v28, zero         # Clear result vector v28
vreverse_loop:                # Loop for j <= log2(N)
    bgt t0, a0, vreverse_end  # Break if j > log2(N)
    sub t2, a0, t0            # t2 = log2(N) - j
    sll t3, t1, t2            # t3 = (1 << (log2(N) - j))
    __ D21 __                 # Fill your answer! v0 = v29 & t3 (check bit)
    vmsne.vx v0, v0, zero     # v0 = 1 if set, else 0
    addi t4, t0, -1           # t4 = j - 1
    sll t3, t1, t4            # t3 = (1 << (j - 1))
    __ D22 __                 # Fill your answer! Set bit in v28 where mask is true
    addi t0, t0, 1            # Increment bit position
    j vreverse_loop
vreverse_end:
    __ D23 __                 # Fill your answer! Store result in v29
    jr ra                     # Return
```

The usage of `vreverse` would look something like this:

```c
    mv a0, a2                 # a0 = N, input for reverse function; overwriting a0 is fine
    vmv.v.v v29, v26          # v29 = v26 = <i>, used as input for vreverse
    call vreverse             # v29 now contains rev(N, <i>), keep it for later use
```

Complete the above RISC-V assembly code to ensure functionality.

D20 = ?
> ==`sub a0, t0, a0`==
> C: `log2(N) = 31 - clz(N)`

D21 = ?
> ==`vand.vx v0, v29, t3`==

D22 = ?
> ==`vor.vx v28, v28, t3, v0.t`==

D23 = ?
> ==`vmv.v.v v29, v28`==

---

## Problem `E`

A processor uses 16-bit addresses, 256-byte pages, and an 8-entry fully associative TLB with LRU replacement. The LRU field is represented by 3 bits, indicating the order of page access, where `0` means the most recently accessed page.

At a given point in time, the TLB for the current process is in the initial state, as described in the table below. Additionally, three free physical pages are available.

Assumptions:
- All current page table entries are already present in the initial TLB.
- All pages are readable and writable.

Your task:
1. Update the TLB's final state based on the given access pattern, adjusting entries as needed.
2. Write the corresponding physical addresses for each accessed memory location after the TLB update.

Available free physical pages:
- 0x17
- 0x18
- 0x19

Memory access pattern:
1. Read from virtual address 0x11F0
2. Write to virtual address 0x1301
3. Write to virtual address 0x20AE
4. Write to virtual address 0x2332
5. Read from virtual address 0x20FF
6.
Write to virtual address 0x3415

Initial TLB:

| VPN | PPN | Valid | Dirty | LRU |
|------|------|-------|-------|-----|
| 0x01 | 0x11 | 1 | 1 | 0 |
| 0x00 | 0x00 | 0 | 0 | 7 |
| 0x10 | 0x13 | 1 | 1 | 1 |
| 0x20 | 0x12 | 1 | 0 | 5 |
| 0x00 | 0x00 | 0 | 0 | 7 |
| 0x11 | 0x14 | 1 | 0 | 4 |
| 0xac | 0x15 | 1 | 1 | 2 |
| 0xff | 0xff | 1 | 0 | 3 |

Final TLB: (Use hexadecimal literals for addresses, 0/1 for dirty)

| VPN | PPN | Valid | Dirty |
|------|------|-------|-------|
| 0x01 | 0x11 | 1 | 1 |
| E01 | E02 | 1 | E03 |
| E04 | E05 | 1 | E06 |
| E07 | E08 | 1 | E09 |
| E10 | E11 | 1 | E12 |
| E13 | E14 | 1 | E15 |
| E16 | E17 | 1 | E18 |
| E19 | E20 | 1 | E21 |

E01 = ? E02 = ? E03 = ?
> ==0x13==, ==0x17==, ==1==

E04 = ? E05 = ? E06 = ?
> ==0x10==, ==0x13==, ==1==

E07 = ? E08 = ? E09 = ?
> ==0x20==, ==0x12==, ==1==

E10 = ? E11 = ? E12 = ?
> ==0x23==, ==0x18==, ==1==

E13 = ? E14 = ? E15 = ?
> ==0x11==, ==0x14==, ==0==

E16 = ? E17 = ? E18 = ?
> ==0xac==, ==0x15==, ==1==

E19 = ? E20 = ? E21 = ?
> ==0x34==, ==0x19==, ==1==

> 1. Access `0x11F0` (Read): **Hit** on virtual page `0x11`, Physical Address: `0x14F0`; Updated LRUs: `1, 7, 2, 5, 7, 0, 3, 4`
> 2. Access `0x1301` (Write): **Miss**, Map VPN `0x13` to PPN `0x17`, set valid and dirty bits. Physical Address: `0x1701`; Updated LRUs: `2, 0, 3, 6, 7, 1, 4, 5`
> 3. Access `0x20AE` (Write): **Hit**, set dirty bit. Physical Address: `0x12AE`; Updated LRUs: `3, 1, 4, 0, 7, 2, 5, 6`
> 4. Access `0x2332` (Write): **Miss**, Map VPN `0x23` to PPN `0x18`, set valid and dirty bits. Physical Address: `0x1832`; Updated LRUs: `4, 2, 5, 1, 0, 3, 6, 7`
> 5. Access `0x20FF` (Read): **Hit**, Physical Address: `0x12FF`; Updated LRUs: `4, 2, 5, 0, 1, 3, 6, 7`
> 6. Access `0x3415` (Write): **Miss** (Replacement Needed). Replace the least-recently-used entry (the last row), Map VPN `0x34` to PPN `0x19`, set dirty bit. Physical Address: `0x1915`; Updated LRUs: `5, 3, 6, 1, 2, 4, 7, 0`

---

## Problem `F`

Consider a memory address space divided into 64-byte segments.
For example, address 1024 is 64 bytes away from address 1025, but only 1 byte away from the address $1024 + \frac{1}{64}$. In this system, standard integer binary addressing is ineffective. Instead, consider using a floating-point representation based on the IEEE-754 standard, which includes 1 sign bit. You need to determine the appropriate number of exponent and mantissa bits.

Given a memory size of 4 KiB and chunk addresses labeled as 0, 1, 2, ..., determine the minimum number of exponent bits required in the floating-point memory address representation to access every byte, assuming a standard bias. F01 = ?

> ==4==
> With a memory size of 4 KiB divided into 64 B chunks, there are 64 total chunks, so the largest address is just under 64 and needs an exponent of 5 (since $2^5 \leq 63 < 2^6$). With 4 exponent bits, the highest exponent field (excluding the all-ones infinity/NaN encoding) is $2^4 - 2 = 14$; applying the standard bias of $-7$ gives a maximum exponent of 7, which is sufficient.
>
> Conversely, with only 3 exponent bits, the maximum exponent value after applying the standard bias of $-3$ is 3, which is insufficient to represent the value 63.

Determine the minimum number of mantissa bits required in the floating-point memory address system to accurately address each byte. F02 = ?

> ==11==
> The largest memory address to represent is $0b11\_1111.111111$, corresponding to $63 + \frac{63}{64}$. Normalized, this is $0b1.11111111111 \times 2^5$, whose fraction has 11 bits. Thus, 11 mantissa bits are necessary.

---

## Problem `G`

In many applications, it is essential not only to identify the maximum value in an array but also to determine its corresponding index, commonly referred to as the argmax. This task can be efficiently accomplished using Data Level Parallelism.

The `argmax` function below accepts an array `arr` and its length `n`, returning the index of the maximum element.
If multiple elements have the same maximum value, the function returns the earliest index. To implement this efficiently, use pseudo-SIMD ([Single Instruction, Multiple Data](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data)) intrinsics. These intrinsics operate on `vec` structs, similar to Intel's `__m128i` structs, which contain four packed integers. Complete the sections labeled `G01`, `G02`, and `G03` in the function to ensure it works correctly.

Proposed SIMD Intrinsics reference:
- `vec sum_epi32(vec a, vec b)`: Returns `a + b`.
- `vec and_epi32(vec a, vec b)`: Returns `a & b`.
- `vec set_epi32(int a)`: Returns a SIMD vector with all entries set to `a`.
- `vec load_epi32(int *a)`: Returns a SIMD vector with entries `a[0]`, `a[1]`, `a[2]`, and `a[3]`.
- `int reducemax_epi32(vec a)`: Returns the maximum value in vector `a`.
- `vec maskeq_epi32(vec a, int b)`: Returns a mask vector with `0xFFFFFFFF` where `a` equals `b`, and `0` otherwise.
- `int firstv_epi32(vec a)`: Returns the index of the first entry with its lowest bit set to `1`.

```c
int argmax(int *arr, int n)
{
    int curr, index = 0, running_max = -2147483648; /* -2^31 */
    for (int i = 0; i < n / 4 * 4; i += 4) {
        vec tmp = load_epi32(arr + i);
        curr = reducemax_epi32(tmp);
        if (curr > running_max) {
            tmp = G01;       /* Fill in your code here */
            index = i + G02; /* Fill in your code here */
            running_max = curr;
        }
    }
    for (int i = n / 4 * 4; i < n; i += 1) {
        if (G03) { /* Fill in your code here */
            running_max = arr[i];
            index = i;
        }
    }
    return index;
}
```

Complete the above C code to ensure functionality.

G01 = ?
> ==`maskeq_epi32(tmp, curr)`==

G02 = ?
> ==`firstv_epi32(tmp)`==

G03 = ?
> ==`arr[i] > running_max`== (or equivalent)

---

## Problem `H`

Suppose we are designing a new microprocessor with a custom ISA. To simplify the architecture, we decide to use a single, universal instruction type called the Z-type instruction. We aim to support the following instructions:
1. `add rd1, rs1, rs2`
2. `and rd1, rs1, rs2`
3. `lw rd1, offset1(rs1)`
4. `sw rs2, offset1(rs1)`
5. `addi rd1, rs1, imm1`
6. `beq rs1, rs2, offset1`
7. `lui rd1, offset1`
8. `jal rd1, imm`
9. `stw rs3, offset1, offset2(rs1)`

The new `stw` instruction stores the contents of `rs3` into both memory locations `rs1 + offset1` and `rs1 + offset2`. Its Register Transfer Level (RTL) description is:
$$
\text{Mem}(R[rs1] + \text{offset1}) \leftarrow R[rs3] \quad \text{AND} \quad \text{Mem}(R[rs1] + \text{offset2}) \leftarrow R[rs3]
$$

How many bits must the opcode field have if we want to support the instructions listed above? __ H01 __ bits

H01 = ?
> ==4==
> There are 9 instructions in total, so we need $\lceil \log_2(9) \rceil = 4$ bits to represent them.

With only one instruction format, we want to enable jumps of up to 64 KiB in either direction. How many bits are required for the immediate field to support this? Assume that, like RV32, the instruction's least significant bit is an implicit 0 and not saved.

H02 = ?
> ==16==
> Jumping 64 KiB in either direction means reaching byte offsets of up to $2^{16}$. Since immediate values are counted in half-instructions (2-byte units), 15 bits cover the magnitude, and an extra bit is required for the sign, making the immediate field 16 bits.

Consider the following instruction format:
```
imm2 imm1 rs3 rs2 rs1 rd2 rd1 opcode
```
The instruction format is 64 bits long, where each immediate field (`imm1` and `imm2`) is 11 bits, and the opcode is 7 bits. What is the maximum number of registers the ISA can support?

H03 = ?
> ==128==
> Total bits used by immediates and the opcode:
> $$
> 2 \times 11 + 7 = 29 \text{ bits}
> $$
> Remaining bits for registers:
> $$
> 64 - 29 = 35 \text{ bits}
> $$
> There are 5 register fields (`rs1`, `rs2`, `rs3`, `rd1`, `rd2`). Dividing the remaining bits equally among them:
> $$
> \frac{35}{5} = 7 \text{ bits per register}
> $$
> This allows $2^7 = 128$ registers.
---

## Problem `I`

You are tasked with designing a memory hierarchy using an 8-word (32-byte) cache line and 32-bit virtual and physical addresses. Consider the following loops:

- [ ] Loop A:
```c
int sum = 0;
for (int i = 0; i < (1 << 8); i++)
    for (int j = 0; j < (1 << 6); j++)
        sum += X[i][j] * Y[i][j];
```
- [ ] Loop B:
```c
int sum = 0;
for (int j = 0; j < (1 << 6); j++)
    for (int i = 0; i < (1 << 8); i++)
        sum += X[i][j] * Y[i][j];
```

Assumptions:
- `X[0][0]` is stored at address `0x0`.
- Both matrices `X` and `Y` are stored contiguously in **row-major order**, meaning `X[i][j]` is adjacent to `X[i][j + 1]` in memory, and the end of `X` (one element past `X[255][63]`) is immediately followed by `Y[0][0]`.
- `sum` is not stored in the cache.
- Caches are initially empty, and all matrix elements are 32-bit integers.

Assume a 32 KiB 4-way set-associative data cache (`D$`) with a FIFO replacement policy. Complete the following table:

| Loop | Compulsory Misses | Conflict and Capacity Misses | Total Misses |
|----------|--------------------|------------------------------|--------------|
| Loop A | I01 | I02 | I03 |
| Loop B | I04 | I05 | I06 |

I01 = ?
> ==4096== (Compulsory Misses for Loop A)

I02 = ?
> ==0== (No Conflict or Capacity Misses for Loop A)

I03 = ?
> ==4096== (Total Misses for Loop A)

I04 = ?
> ==4096== (Compulsory Misses for Loop B)

I05 = ?
> ==28672== (Conflict and Capacity Misses for Loop B)

I06 = ?
> ==32768== (Total Misses for Loop B)

Assume a 256 KiB 8-way set-associative L2 cache with a FIFO replacement policy, behind the L1 (`D$`). How many L2 cache misses occur when Loops A and B are run?

Loop A: __ I07 __ / Loop B: __ I08 __

I07 = ?
> ==4096== (L2 Cache Misses for Loop A)

I08 = ?
> ==4096== (L2 Cache Misses for Loop B)

Given the following hit times:
- L1 hit time: 4 cycles
- L2 hit time: 12 cycles
- DRAM access time: 50 cycles

Calculate the AMAT for Loops A and B:

Loop A: __ I09 __ cycles / Loop B: __ I10 __ cycles

I09 = ?
> ==11.75== cycles (AMAT for Loop A)

I10 = ?
> ==22.25== cycles (AMAT for Loop B)

Assume a Virtually-Indexed Physically-Tagged (VIPT) cache with a 4 KiB page size and a 32 KiB cache size. Is it possible to have virtual address aliasing? __ I11 __ (answer Yes or No)

What is the minimum set-associativity required to avoid aliasing? __ I12 __

I11 = ?
> ==Yes== (With low associativity, the index uses bits above the 12-bit page offset, so two virtual pages mapping to the same physical page can land in different sets.)

I12 = ?
> ==8== (With 8 ways, each way is $32 \text{ KiB} / 8 = 4 \text{ KiB}$, so the index fits entirely within the page offset.)