---
tags: computer-arch
---

# Quiz7 of Computer Architecture (2024 Fall)

> Solutions

## Problem `A`

> 15 points

In approximately 80 words (or slightly more), describe your final project, covering the task description and your current progress.

Note: You should have already received a confirmation email from the instructor with the subject "CA2024: Term Project." If you have not received it, promptly contact the instructor via email.

$\to$ Answers must be written in English; answers written in Chinese receive no credit. Any answer of reasonable length earns full credit; this question also lets the instructor review each student's term-project content.

---

## Problem `B`

[General Matrix Multiply](https://spatial-lang.org/gemm) (GEMM) is a widely used algorithm in linear algebra, machine learning, statistics, and various other fields. It offers a more complex trade-off space compared to the previous tutorial, as there are multiple ways to structure the computation. These include techniques such as blocking, inner products, outer products, and systolic arrays.

Below is the reference implementation:
```c
void sgemm(
    int m, int n, int k,
    float *XA, int lda,
    float *XB, int ldb,
    float *XC, int ldc
)
{
    /* Local variables */
    int i, j, p;
    float alpha = 1.0, beta = 1.0;

    /* Early exit for invalid input */
    if (m == 0 || n == 0 || k == 0) return;

    for (i = 0; i < m; i++) {
        for (j = 0; j < n; j++) {
            for (p = 0; p < k; p++) {
                XC[j * ldc + i] += XA[p * lda + i] * XB[j * ldb + p];
            }
        }
    }
}
```

Take note of the variables `i`, `j`, and `p` used in the nested loops, as they play a key role in iterating over the computation. Adjusting the order of `i`, `j`, and `p` does not affect the result, but it can impact performance. Considering the cache, what is the optimal order for best performance? Answer in the form "i-j-p".
* B01 = ?

> ==j-p-i==
> ![image](https://hackmd.io/_uploads/ryYO3cF8yx.png)
> The `j-p-i` order is well-suited for cache prefetching, resulting in a higher L1 cache hit rate compared to the `i-p-j` order.

What order results in the worst performance?
* B02 = ?

> ==i-p-j==

Test the performance of a specific order across different data sizes. From the graph, it can be observed that as the data size increases, overall performance decreases (the fluctuations in the middle may be related to cache evictions and reloads).
![image](https://hackmd.io/_uploads/rJwtnctUJg.png)

In other words, FLOPS (floating-point operations per second) declines as the data size grows. Why does FLOPS decrease rapidly as the matrix size grows larger?
* B03 = ?

> When matrices A and B are smaller than the L2 cache, GEMM only needs to read memory equivalent to the size of A and B from DDR. However, when A and B exceed the size of the L2 cache, the row-major layout of B (or the column-major layout of A) is no longer accessed contiguously in memory, so GEMM reads more memory from DDR than the size of A and B. This increases cache misses, leading to degraded performance.

To make GEMM faster, let's divide a large matrix into smaller submatrices so that each submatrix can fit entirely in the cache. First, partition the matrices along the $m$- and $n$-dimensions without blocking the $k$-dimension, and then expand the computation block by block, as shown in the diagram:
![image](https://hackmd.io/_uploads/ByxOhctLkg.png)

- The $A$ matrix (the middle one) is divided along the $m$-dimension, with each submatrix sized $(MR, k)$.
- The $B$ matrix (the right one) is divided along the $n$-dimension, with each submatrix sized $(k, NR)$.

The multiplication of the submatrix $(MR, k)$ from $A$ and the submatrix $(k, NR)$ from $B$ produces a submatrix $(MR, NR)$ in $C$ (the left one).
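To make this partitioning concrete, here is a minimal sketch (not part of the quiz code); `MR`, `NR`, and the name `sgemm_blocked` are illustrative choices. It blocks only the $m$- and $n$-dimensions: each $(MR, k)$ panel of $A$ is multiplied by a $(k, NR)$ panel of $B$ to update one $(MR, NR)$ block of $C$, using the cache-friendly `j-p-i` order inside each block.

```c
#define MR 4 /* illustrative block size along the m-dimension */
#define NR 4 /* illustrative block size along the n-dimension */

/* Same indexing convention as the reference sgemm() above. */
void sgemm_blocked(int m, int n, int k,
                   const float *XA, int lda,
                   const float *XB, int ldb,
                   float *XC, int ldc)
{
    for (int jr = 0; jr < n; jr += NR) {
        int nr = (n - jr < NR) ? (n - jr) : NR;   /* handle edge blocks */
        for (int ir = 0; ir < m; ir += MR) {
            int mr = (m - ir < MR) ? (m - ir) : MR;
            /* (mr x k) panel of A times (k x nr) panel of B
             * accumulates into one (mr x nr) block of C. */
            for (int j = 0; j < nr; j++)
                for (int p = 0; p < k; p++)
                    for (int i = 0; i < mr; i++)
                        XC[(jr + j) * ldc + (ir + i)] +=
                            XA[p * lda + (ir + i)] * XB[(jr + j) * ldb + p];
        }
    }
}
```

With small `MR` and `NR`, the active block of `C` and the corresponding rows of `A` stay resident in cache while the `p` loop streams through the panel of `B`, which is the effect the blocking described above is aiming for.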
Use optimization techniques such as loop unrolling and register caching (reusing register results). Next, divide the large matrix into smaller submatrices to ensure they fit entirely in the cache, thereby improving access performance. The process involves splitting the large matrix into smaller submatrices that can fully reside in the cache for efficient access. ![image](https://hackmd.io/_uploads/ryMkZitLJg.png) The parameters must meet certain constraints: $MC \times KC$ must be less than half the size of the L2 cache. ![image](https://hackmd.io/_uploads/SyL6XsK8ke.png) Consider the corresponding code below: ```c= #include <stddef.h> #include <stdio.h> #include <stdlib.h> /* Define block sizes */ #define dgemm_mc 72 /* Block size for the m dimension */ #define dgemm_nc 4080 /* Block size for the n dimension */ #define dgemm_kc 256 /* Block size for the k dimension */ /* Helper macros */ #define min(a, b) ((a) < (b) ? (a) : (b)) /* Function to allocate aligned memory */ static inline void *malloc_aligned(size_t alignment, size_t size, size_t type_size) { return aligned_alloc(alignment, size * type_size); } /* Pack matrix A */ inline void packa_mcxkc(int m, int k, float *xa, int ldxa, float *packa) { for (int p = 0; p < k; p++) { for (int i = 0; i < m; i++) { *packa++ = *(xa + p * ldxa + i); } } } /* Pack matrix B */ inline void packb_kcxnc(int n, int k, float *xb, int ldxb, float *packb) { for (int j = 0; j < n; j++) { for (int p = 0; p < k; p++) { *packb++ = *(xb + j * ldxb + p); } } } /* Macro kernel */ void macro_kernel(int m, int n, int k, float *packa, float *packb, float *c, int ldc) { int i, p, j; for (j = 0; j < n; j++) { /* Start 2-nd loop */ for (p = 0; p < k; p++) { /* Start 1-st loop */ float elem_b = packb[j * k + p]; float *p_elem_a = &packa[p * m]; float *p_elem_c = &c[j * ldc]; for (i = 0; i < m; i++) { /* Start 0-th loop */ *p_elem_c++ += *p_elem_a++ * elem_b; } /* End 0-th loop */ } /* End 1-st loop */ } /* End 2-nd loop */ } /* SGEMM implementation */ void sgemm(int m, int n, int k, float *xa, int lda, float *xb, int ldb, float *c, /* must be aligned */ int ldc /* ldc must also be aligned */ ) { int ic, ib, jc, jb, pc, pb; float *packa, *packb; /* Early return if possible */ if (m == 0 || n == 0 || k == 0) { printf("sgemm(): early return\n"); return; } /* Allocate packing buffers */ packa = malloc_aligned(64, (dgemm_kc * (dgemm_mc + 1)), sizeof(float)); packb = malloc_aligned(64, (dgemm_kc * (dgemm_nc + 1)), sizeof(float)); for (jc = 0; jc < n; jc += dgemm_nc) { /* 5-th loop around micro-kernel */ jb = min(n - jc, dgemm_nc); for (pc = 0; pc < k; pc += pb) { /* 4-th loop around micro-kernel */ pb = min(k - pc, dgemm_kc); packb_kcxnc(jb, pb, &xb[/* B04: Fill your code here */], ldb, packb); for (ic = 0; ic < m; ic += ib) { /* 3-rd loop around micro-kernel */ ib = min(m - ic, dgemm_mc); packa_mcxkc(ib, pb, &xa[/* B05: Fill your code here */], lda, packa); macro_kernel(ib, jb, pb, packa, packb, &c[/* B05: Fill your code here */], ldc); } /* End 3-rd loop around micro-kernel */ } /* End 4-th loop around micro-kernel */ } /* End 5-th loop around micro-kernel */ free(packa); free(packb); } ``` * B04 = ? (Line 94) > ==`jc * ldb + pc`== * B05 = ? (Line 99) > ==`pc * lda + ic`== * B06 = ? 
(Line 101) > ==`jc * ldc + ic`== --- ## Problem `C` Determining the center of a peak in discrete data, such as a series of bins, can be useful for identifying the "center" of a signal, particularly in contexts like [discrete Fourier transform](https://en.wikipedia.org/wiki/Discrete_Fourier_transform) (DFT) bins, touch sensors, or detecting impulses. This method works especially well with normal distributions but can also be effective for any symmetric or nearly symmetric distribution. To locate the center, identify the highest peak in the dataset and examine the values of the bins immediately to its left and right. Using these values, you can calculate the center's position relative to the peak using the formula below, which estimates the offset $d$ from the highest cell: ```c int n = the highest value in a local data set; float lowercell = data[n-1]; float center = data[n]; float uppercell = data[n+1]; float d = (uppercell - lowercell) * 0.5 / min(uppercell - center, lowercell - center); ``` ![image](https://hackmd.io/_uploads/r1C_BsKIJl.png) This approach does not handle cases involving multiple peaks or multipath effects. In such scenarios, it is necessary to identify other peaks, account for their contributions, and iteratively refine the data to isolate and analyze additional peaks. Additionally, this method only applies to sources with a distinct hill-shaped distribution, as mathematically such centers can be detected. For computational efficiency, consider using fixed-point arithmetic instead of floating-point. Fixed-point math provides higher accuracy with the same bit-width and is generally faster, particularly in embedded systems, even when a floating-point unit (FPU) is available. This makes it a practical choice for resource-constrained environments where performance and precision are critical. Consider the code below: ```c void sum_fixed() { /* Assumes a fixed-point representation with 16 fractional bits */ int a = 132345; /* Represents 2.019424438 in floating-point (a / 65536) */ int b = 7491; /* Represents 0.114303589 in floating-point (b / 65536) */ int sum = a + b; /* Fixed-point addition */ printf("Sum: %f\n", sum / 65536.0); /* Output as floating-point: 2.133728 */ } ``` This example illustrates the use of fixed-point arithmetic with 16 fractional bits, a common technique in embedded systems for efficient numerical computation. The variables `a` and `b` are represented in a fixed-point format, where the fractional part is scaled by $2^{16} = 65536$. For instance, the value `a = 132345` corresponds to $2.019424438 = \frac{132345}{65536}$, and `b = 7491` corresponds to $0.114303589 = \frac{7491}{65536}$. The addition operation `a + b` combines the fixed-point representations directly without any conversion, keeping the result in the same fixed-point format. To display the result in a human-readable floating-point format, the program divides the sum by $65536.0$, which is the scaling factor for 16 fractional bits. The program outputs the sum as `2.133728`, demonstrating how fixed-point arithmetic preserves computational efficiency while allowing precise representation of fractional numbers. This approach is particularly valuable in resource-constrained environments where performance and accuracy are critical. The code below shows a method for computing the product while avoiding overflows. 
```c
void product_fixed() {
    int a = 132345;       /* Represents 2.019424438 in floating-point (a / 65536) */
    int big_b = 4442414;  /* Represents 67.785858154 in floating-point (big_b / 65536) */

    /* Adjusted multiplication to prevent overflow */
    int product = (/* C01: Fill your code */) * (/* C02: Fill your code */); /* Avoids overflow with slight precision loss */
    printf("Product (correct): %f\n", product / 65536.0); /* Correct result: 136.629456 */
}
```

Complete the code to ensure it works as expected. The variable `a` must be defined before the variable `big_b`. You should use shifts.
* C01 = ?

> ==`a >> 8`== or equivalent

* C02 = ?

> ==`big_b >> 8`== or equivalent

Now let's perform division in fixed-point arithmetic. Whereas the multiplication above shifts both operands to the right before multiplying, division pre-scales the numerator upward and the denominator downward to preserve precision.
```c
void division_fixed() {
    int a = 132345;  /* Represents 2.019424438 in floating-point (a / 65536) */
    int b = 7491;    /* Represents 0.114303589 in floating-point (b / 65536) */

    /* Adjust numerator and denominator to maximize precision and avoid overflows */
    int division = (/* C03: Fill your code */) / (/* C04: Fill your code */);
    printf("Division: %f\n", division / 65536.0); /* Result: 17.674072 (Close to expected) */
}
```

The use of shifts achieves three goals:
1. It avoids overflows by reducing the size of the numerator and denominator appropriately.
2. It maximizes the precision of the numerator by applying a larger shift to it.
3. It maintains the correct scaling for the fixed-point arithmetic.

Complete the code to ensure it works as expected. You should use shifts.
* C03 = ?

> ==`a << 12`== or equivalent

* C04 = ?

> ==`b >> 4`== or equivalent
> The result, when converted back to floating-point by dividing by $65536.0$, is approximately $17.674072$, which is close to the expected value. This approach demonstrates how to manage precision and avoid overflow in fixed-point division while adhering to the constraints of the representation.

---

## Problem `D`

You are expected to calculate an approximate square root by determining the next power of 2 using the `__builtin_clz` function provided by GCC/Clang.
```c
int mysqrt(int i) {
    if (i == 0) return 0;
    int x = 1 << (/* D01: Fill your code */);
    x = (x + i / x) / 2;
    return x;
}
```
![image](https://hackmd.io/_uploads/Sy_Xcjt8Jg.png)

Complete the code to ensure it works as expected. You should invoke the `__builtin_clz` function.
* D01 = ?

> ==`(32 - __builtin_clz(i)) / 2`== or equivalent

The [Fast Fourier Transform](https://en.wikipedia.org/wiki/Fast_Fourier_transform) (FFT) is a computational method used to efficiently compute the discrete Fourier transform of a sequence. This implementation also includes a function for convolution, defined as:
$$
\text{conv}(a, b) = c, \quad \text{where} \quad c[x] = \sum_{i} a[i] \cdot b[x-i].
$$
This convolution function calculates the resulting sequence $c$ by summing the products of elements from the input sequences $a$ and $b$, with one sequence reversed and shifted. It is particularly useful in signal processing and applications requiring fast computation of polynomial or data convolutions.
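As a quick sanity check on the definition, take $a = \{1, 2, 3\}$ and $b = \{4, 5\}$ (the same inputs used in the driver below); the result has length $3 + 2 - 1 = 4$:
$$
c[0] = 1 \cdot 4 = 4, \quad
c[1] = 1 \cdot 5 + 2 \cdot 4 = 13, \quad
c[2] = 2 \cdot 5 + 3 \cdot 4 = 22, \quad
c[3] = 3 \cdot 5 = 15,
$$
which matches the reference output listed after the code.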
Consider the corresponding code below: ```c= #include <math.h> #include <complex.h> #include <string.h> #define PI 3.14159265358979323846 void fft(complex double *x, complex double *roots, int N) { if (N <= 1) return; /* Allocate memory for even and odd parts */ complex double *even = malloc(N / 2 * sizeof(complex double)); complex double *odd = malloc(N / 2 * sizeof(complex double)); complex double *rs = malloc(N / 2 * sizeof(complex double)); for (int i = 0; i < N / 2; i++) { even[i] = x[2 * i]; odd[i] = x[2 * i + 1]; rs[i] = roots[2 * i]; } fft(even, rs, N / 2); fft(odd, rs, N / 2); for (int k = 0; k < N / 2; k++) { complex double t = roots[k] * odd[k]; x[k] = even[k] + t; x[k + N / 2] = even[k] - t; } free(even); free(odd); free(rs); } void conv(const double *a, const double *b, int na, int nb, double **result, int *nc) { int s = na + nb - 1; int L = /* D02: Fill your code */; int n = 1 << L; *nc = s; complex double *av = calloc(n, sizeof(complex double)); complex double *bv = calloc(n, sizeof(complex double)); complex double *roots = malloc(n * sizeof(complex double)); for (int i = 0; i < n; i++) roots[i] = cexp(-2 * PI * I * i / n); for (int i = 0; i < na; i++) av[i] = a[i]; for (int i = 0; i < nb; i++) bv[i] = b[i]; fft(av, roots, n); fft(bv, roots, n); /* Conjugate roots */ for (int i = 0; i < n; i++) roots[i] = conj(roots[i]); complex double *cv = malloc(n * sizeof(complex double)); for (int i = 0; i < n; i++) /* D03: Fill your code */; fft(cv, roots, n); *result = malloc(s * sizeof(double)); for (int i = 0; i < s; i++) (*result)[i] = creal(cv[i]) / n; free(av); free(bv); free(roots); free(cv); } int main() { double a[] = {1.0, 2.0, 3.0}; double b[] = {4.0, 5.0}; int na = sizeof(a) / sizeof(a[0]), nb = sizeof(b) / sizeof(b[0]); double *c; int nc; conv(a, b, na, nb, &c, &nc); printf("Convolution result:\n"); for (int i = 0; i < nc; i++) printf("%f\n", c[i]); free(c); return 0; } ``` The reference output: ``` Convolution result: 4.000000 13.000000 22.000000 15.000000 ``` Complete the code to ensure it works as expected. You should invoke the `__builtin_clz` function. * D02 = ? (Line 38) > ==`32 - __builtin_clz(s)`== * D03 = ? (Line 61) > ==`av[i] * bv[i]`== or equivalent --- ## Problem `E` The core concept of data-level parallelism is vectorized computation, which involves performing operations on multiple elements of a single vector simultaneously. ![vector](https://hackmd.io/_uploads/S1IqMnFUJl.png) Consider the following function, which computes the product of all elements in an array: ```c int product_naive(int n, int *a) { int product = 1; for (int i = 0; i < n; i++) product *= a[i]; return product; } ``` Your task is to vectorize this function using the RISC-V Vector extension to leverage parallel processing. ```c int product_vectorized(int n, int *a) { int result[4] = {1, 1, 1, 1}; for (int i = 0; i < n / 4 * 4; i += 4) { asm volatile ( "__ E01 __ t0, zero, e32, m1\n" "__ E02 __ v0, (%1)\n" "vmul.vv v1, v1, v0\n" "__ E03 __ v1, (%0)\n" : "=r"(result) : "r"(a + i), "r"(result) : "t0", "v0", "v1" ); } /* Handle tail case */ for (int i = n / 4 * 4; i < n; i++) result[0] *= a[i]; return result[0] * result[1] * result[2] * result[3]; } ``` Complete the code to ensure it works as expected. * E01 = ? > ==`vsetvli`== * E02 = ? > ==`vlw.v`== * E03 = ? > ==`vsw.v`== --- ## Problem `F` Your task is to implement a simulator for the RISC-V instruction set. Below is a partial C source code snippet for instruction decoding and execution. 
```c= #define SEXT(x, len) ({ struct { int64_t n : len; } __x = { .n = x }; (uint64_t)__x.n; }) #define R(i) gpr(i) #define CSR(i) csr(i) #define Mr vaddr_read #define Mw vaddr_write #define PC _info->pc #define NPC _info->dnpc #define rd decode_rd(inst) #define rs1 decode_rs1(inst) #define rs2 decode_rs2(inst) typedef struct { void (*handler)(); } funct7_item; typedef struct { funct7_item funct7_pool[0x80]; void (*handler)(); } funct3_item; typedef struct { funct3_item funct3_pool[0x8]; void (*handler)(); } opcode_item; static opcode_item pool[0x80]; static uint32_t inst; static exec_t *_info = NULL; #define IMM(TYPE) decode_imm ## TYPE(inst) #define INST_EXEC(NAME, TYPE, ...) \ static void __ ## NAME() { \ __VA_ARGS__; \ } #define _____ -1 #define _________ -1 #define RULE(NAME, TYPE, FUNCT7, FUNCT3, OPCODE) \ do { \ if (FUNCT3 == _____) { \ pool[OPCODE].handler = __ ## NAME; \ break; \ } \ if (FUNCT7 == _________) { \ pool[OPCODE].funct3_pool[FUNCT3].handler = __ ## NAME; \ break; \ } \ pool[OPCODE].funct3_pool[FUNCT3].funct7_pool[FUNCT7].handler = __ ## NAME; \ } while (0) // special handler for ones that can be matched with raw inst... static inline void __special_handler() { extern word_t isa_raise_intr(word_t no, vaddr_t epc); switch (inst) { case 0b00000000000100000000000001110011: // ebreak SET_STATE(gpr(10) ? ABORT : END); break; case 0b00000000000000000000000001110011: // ecall NPC = isa_raise_intr(0xb, PC); break; case 0b00110000001000000000000001110011: // mret NPC = CSR(MEPC); set_csr_field(MSTATUS, MSTATUS_MIE, MSTATUS_MIE_SHIFT, get_csr_field(MSTATUS, MSTATUS_MPIE, MSTATUS_MPIE_SHIFT)); set_csr_field(MSTATUS, MSTATUS_MPIE, MSTATUS_MPIE_SHIFT, 1); break; default: assert(0); } } INST_EXEC(add, R, R(rd) = R(rs1) + R(rs2)) INST_EXEC(and, R, R(rd) = R(rs1) & R(rs2)) INST_EXEC(div, R, R(rd) = (sword_t)R(rs1) / (sword_t)R(rs2)) INST_EXEC(divu, R, R(rd) = R(rs1) / R(rs2)) INST_EXEC(mret, R, __special_handler()) INST_EXEC(mul, R, R(rd) = R(rs1) * R(rs2)) INST_EXEC(mulh, R, R(rd) = (int64_t)(sword_t)R(rs1) * (int64_t)(sword_t)R(rs2) >> 32LL) INST_EXEC(mulhu, R, R(rd) = (uint64_t)R(rs1) * (uint64_t)R(rs2) >> 32ULL) INST_EXEC(or, R, R(rd) = R(rs1) | R(rs2)) INST_EXEC(rem, R, R(rd) = (sword_t)R(rs1) % (sword_t)R(rs2)) INST_EXEC(remu, R, R(rd) = R(rs1) % R(rs2)) INST_EXEC(sll, R, R(rd) = R(rs1) << R(rs2)) INST_EXEC(slt, R, R(rd) = (sword_t)R(rs1) < (sword_t)R(rs2)) INST_EXEC(sltu, R, R(rd) = R(rs1) < R(rs2)) INST_EXEC(sra, R, R(rd) = (sword_t)R(rs1) >> (R(rs2) & 0x1f)) INST_EXEC(srl, R, R(rd) = R(rs1) >> (R(rs2) & 0x1f)) INST_EXEC(sub, R, R(rd) = R(rs1) - R(rs2)) INST_EXEC(xor, R, R(rd) = R(rs1) ^ R(rs2)) INST_EXEC(addi, I, R(rd) = R(rs1) + IMM(I)) INST_EXEC(andi, I, R(rd) = R(rs1) & IMM(I)) INST_EXEC(csrrs, I, word_t imm = IMM(I); word_t t = CSR(imm); __ F01 __; R(rd) = t) INST_EXEC(csrrw, I, word_t imm = IMM(I); word_t t = CSR(imm); __ F02 __; R(rd) = t) INST_EXEC(ebreak, I, __special_handler()) INST_EXEC(ecall, I, __special_handler()) INST_EXEC(jalr, I, word_t t = PC + 4; NPC = (__ F03 __) & ~1; R(rd) = t) INST_EXEC(lb, I, R(rd) = SEXT(Mr(R(rs1) + IMM(I), 1), 8)) INST_EXEC(lbu, I, R(rd) = Mr(R(rs1) + IMM(I), 1)) INST_EXEC(lh, I, R(rd) = SEXT(Mr(R(rs1) + IMM(I), 2), __ F04 __)) INST_EXEC(lhu, I, R(rd) = Mr(R(rs1) + IMM(I), __ F05 __)) INST_EXEC(lw, I, R(rd) = Mr(R(rs1) + IMM(I), __ F06 __)) INST_EXEC(ori, I, R(rd) = R(rs1) | IMM(I)) INST_EXEC(slli, I, R(rd) = R(rs1) << (IMM(I) & __ F07 __)) INST_EXEC(slti, I, R(rd) = (sword_t)R(rs1) < (sword_t)IMM(I)) INST_EXEC(sltiu, I, R(rd) = R(rs1) 
< IMM(I)) INST_EXEC(srai, I, R(rd) = (sword_t)R(rs1) >> (IMM(I) & 0x3f)) INST_EXEC(srli, I, R(rd) = R(rs1) >> (IMM(I) & __ F08 __)) INST_EXEC(xori, I, R(rd) = R(rs1) ^ IMM(I)) INST_EXEC(sb, S, Mw(R(rs1) + IMM(S), 1, R(rs2))) INST_EXEC(sh, S, Mw(R(rs1) + IMM(S), 2, R(rs2))) INST_EXEC(sw, S, Mw(R(rs1) + IMM(S), 4, R(rs2))) INST_EXEC(beq, B, if (R(rs1) == R(rs2)) NPC = PC + IMM(B)) INST_EXEC(bge, B, if ((sword_t)R(rs1) >= (sword_t)R(rs2)) NPC = PC + IMM(B)) INST_EXEC(bgeu, B, if (R(rs1) >= R(rs2)) NPC = PC + IMM(B)) INST_EXEC(blt, B, if ((sword_t)R(rs1) < (sword_t)R(rs2)) NPC = PC + IMM(B)) INST_EXEC(bltu, B, if (R(rs1) < R(rs2)) NPC = PC + IMM(B)) INST_EXEC(bne, B, if (R(rs1) != R(rs2)) NPC = PC + IMM(B)) INST_EXEC(auipc, U, R(rd) = PC + IMM(U)) INST_EXEC(lui, U, R(rd) = IMM(U)) INST_EXEC(jal, J, R(rd) = PC + 4, __ F09 __) void init_inst_pool() { memset(pool, 0, sizeof(pool)); RULE(add, R, 0b0000000, 0b000, 0b0110011); RULE(and, R, 0b0000000, __ F10 __, 0b0110011); RULE(div, R, 0b0000001, 0b100, 0b0110011); RULE(divu, R, 0b0000001, 0b101, 0b0110011); RULE(mret, R, 0b0011000, 0b000, 0b1110011); RULE(mul, R, 0b0000001, 0b000, 0b0110011); RULE(mulh, R, 0b0000001, 0b001, 0b0110011); RULE(mulhu, R, 0b0000001, 0b011, 0b0110011); RULE(or, R, 0b0000000, __ F11 __, 0b0110011); RULE(rem, R, 0b0000001, 0b110, 0b0110011); RULE(remu, R, 0b0000001, 0b111, 0b0110011); RULE(sll, R, 0b0000000, 0b001, 0b0110011); RULE(slt, R, 0b0000000, 0b010, 0b0110011); RULE(sltu, R, 0b0000000, 0b011, 0b0110011); RULE(sra, R, 0b0100000, 0b101, 0b0110011); RULE(srl, R, 0b0000000, 0b101, 0b0110011); RULE(sub, R, 0b0100000, 0b000, 0b0110011); RULE(xor, R, 0b0000000, 0b100, 0b0110011); RULE(addi, I, _________, __ F12 __, 0b0010011); RULE(andi, I, _________, 0b111, 0b0010011); RULE(csrrs, I, _________, 0b010, 0b1110011); RULE(csrrw, I, _________, 0b001, 0b1110011); RULE(ebreak, I, 0b0000000, 0b000, __ F13 __); RULE(ecall, I, 0b0000000, 0b000, __ F14 __); RULE(jalr, I, _________, 0b000, __ F17 __); RULE(lb, I, _________, 0b000, 0b0000011); RULE(lbu, I, _________, 0b100, 0b0000011); RULE(lh, I, _________, 0b001, 0b0000011); RULE(lhu, I, _________, 0b101, 0b0000011); RULE(lw, I, _________, 0b010, 0b0000011); RULE(ori, I, _________, 0b110, 0b0010011); RULE(slli, I, 0b000000, 0b001, 0b0010011); RULE(slti, I, _________, 0b010, 0b0010011); RULE(sltiu, I, _________, 0b011, 0b0010011); RULE(srai, I, __ F15 __, 0b101, 0b0010011); RULE(srli, I, __ F16 __, 0b101, 0b0010011); RULE(xori, I, _________, 0b100, 0b0010011); RULE(sb, S, _________, 0b000, 0b0100011); RULE(sh, S, _________, 0b001, 0b0100011); RULE(sw, S, _________, 0b010, 0b0100011); RULE(beq, B, _________, 0b000, 0b1100011); RULE(bge, B, _________, 0b101, 0b1100011); RULE(bgeu, B, _________, 0b111, 0b1100011); RULE(blt, B, _________, 0b100, 0b1100011); RULE(bltu, B, _________, 0b110, 0b1100011); RULE(bne, B, _________, 0b001, 0b1100011); RULE(auipc, U, _________, _____, __ F18 __); RULE(lui, U, _________, _____, __ F19 __); RULE(jal, J, _________, _____, __ F20 __); } static inline void insn_exec() { void (*handler)() = NULL; uint32_t opcode = decode_opcode(inst); uint32_t funct3 = decode_funct3(inst); uint32_t funct7 = decode_funct7(inst); if ((handler = pool[opcode].funct3_pool[funct3].handler)) { handler(); } else if ((handler = pool[opcode].funct3_pool[funct3].funct7_pool[funct7].handler)) { handler(); } else if ((handler = pool[opcode].handler)) { handler(); } else if ((handler = pool[opcode].funct3_pool[funct3].funct7_pool[funct7 >> 1].handler)) { handler(); } else { 
Error("Failed to execute inst 0x%" PRIx32 " at PC 0x%" PRIaddr "", inst, cpu.pc); SET_STATE(ABORT); } __ F21 __; /* reset $zero */ } void inst_exec_once(exec_t *info) { extern void itrace(exec_t *info, uint32_t inst); /* Instruction fetch */ inst = vaddr_ifetch(info->snpc, sizeof(inst)); itrace(info, inst); info->snpc += (vaddr_t)sizeof(uint32_t); _info = info; _info->dnpc = _info->snpc; /* Instruction execution */ insn_exec(); } ``` * F01 = ? (Line 99) > ==`CSR(imm) = t | R(rs1)`== * F02 = ? (Line 100) > ==`CSR(imm) = R(rs1)`== * F03 = ? (Line 103) > ==`R(rs1) + IMM(I)`== * F04 = ? (Line 106) > ==`16`== * F05 = ? > ==`2`== * F06 = ? > ==`4`== * F07 = ? > ==`0x3f`== or equivalent * F08 = ? > ==`0x3f`== or equivalent * F09 = ? > ==`NPC = PC + IMM(J)`== or equivalent * F10 = ? > ==`0b111`== or equivalent * F11 = ? > ==`0b110`== or equivalent * F12 = ? > ==`0b000`== or equivalent * F13 = ? > ==`0b1110011`== or equivalent * F14 = ? > ==`0b1110011`== or equivalent * F15 = ? > ==`0b010000`== or equivalent * F16 = ? > ==`0b000000`== or equivalent * F17 = ? > ==`0b1100111`== or equivalent * F18 = ? > ==`0b0010111`== or equivalent * F19 = ? > ==`0b0110111`== or equivalent * F20 = ? > ==`0b1101111`== or equivalent * F21 = ? > ==`R(0) = 0`== or equivalent --- ## Problem `G` Our processor features an 8-entry direct-mapped TLB, indexed using the lowest-order bits of the virtual page number. The program below operates on a matrix $A$, which consists of 4 rows and 256 columns. Each element in $A$ is a 32-bit integer, laid out in row-major order (i.e., consecutive elements in the same row occupy contiguous memory locations). The program computes the sum of the entries in a top-right submatrix $B$ of matrix $A$. Assume $A$ starts at virtual address 0x0000, and the result of the sum is stored in a register. Instruction fetches can be ignored. ```c int sum = 0; for (int i = 0; i < H; i++) for (int j = 0; j < W; j++) sum += A[i][j]; ``` ![matrix](https://hackmd.io/_uploads/rJiQCht8Jl.png) How many TLB misses will this program encounter when $H = 4$ and $W = 256$ (i.e., when $B = A$)? * G01 = ? > ==32== > Matrix $A$ spans 32 contiguous 128B pages. Since all elements in $A$ are accessed sequentially in the order they are laid out in memory, there will be a total of 32 TLB misses. What is the highest possible TLB hit rate that this program can achieve? * G02 = ? > ==$\dfrac{31}{32}$== > Keep in mind that each page contains 32 elements of matrix $A$. The maximum TLB hit rate occurs when there is 1 miss to load a page into the TLB, followed by 31 hits, resulting in a hit rate of $31/32$. > To achieve this, $W$ must be $32 \times N$, where $0 < N \leq 8$, and $H$ can be any value. It has been suggested that increasing the page size could significantly reduce TLB misses in the direct-mapped TLB for this program. For which values of $H$ (G03, range) and $W$ (G04, range) would doubling the page size fail to improve the TLB hit rate? * G03 = ? > ==$\leq 32$== or equivalent * G04 = ? > ==any== > With the larger page size, each page now contains 64 elements, and a single row of $A$ spans 4 pages. For $W \leq 32$ and any $H$, the number of hits remains unchanged compared to smaller pages, as increasing the page size does not reduce compulsory misses. What is the smallest page size, as a power of 2, that ensures our program incurs at least one TLB hit for all values of $H$ and $W$ where $H \times W > 1$? * G05 = ? 
> ==2048B== or equivalent > To determine the minimum page size, consider the scenario with the lowest hit rate (0%), where accessing one element per column resulted in no reuse, as elements from different columns belonged to different pages. Therefore, the minimum page size required to guarantee at least one TLB hit must be greater than 1024B. Since page sizes must be powers of two, the smallest such size is 2048B. The program is now modified by swapping the order of the two nested loops, as shown below: ```c int sum = 0; for (int j = 0; j < W; j++) for (int i = 0; i < H; i++) sum += A[i][j]; ``` Assume $W = 8$. For which values of $H$ will doubling the page size enhance the TLB hit rate of the above program? * G06 = ? > ==$\geq 2$== or equivalent > For $H \geq 2$, doubling the page size causes TLB entry conflicts between the first and third rows, as well as between the second and fourth rows. --- ## Problem `H` This question involves analyzing an out-of-order machine. A program consisting of 50% integer and 50% floating-point instructions is profiled. The average latencies for these instructions, assuming infinite reservation station and commit queue sizes, are as follows: | Instruction Type | Decode to Issue | Decode to Commit | FU Latency | |--------------------|-----------------|------------------|------------| | Integer | 3 cycles | 8 cycles | 2 cycles | | Floating-Point | 6 cycles | 20 cycles | 12 cycles | If the processor commits an average of 0.5 instructions per cycle for this program, how many entries are occupied on average in the reservation stations and the commit queue? Integer Reservation Station: * H01 = ? > ==0.75== Floating-Point Reservation Station * H02 = ? > ==1.5== Commit Queue: * H03 =? > ==7== > We use Little's Law to calculate the occupancy (\(N\)) of each structure: $$ N = T \times L $$ where $T$ is the throughput, and $L$ is the latency. > > First, the throughput for integer ($T_\text{int}$) and floating-point ($T_\text{fp}$) instructions is 0.25 instructions per cycle each, as the total throughput is 0.5 instructions per cycle and the program consists of 50% integer and 50% floating-point instructions. > > For the integer reservation station, the latency is 3 cycles. Using Little's Law, the occupancy is: $$ N_\text{int} = 0.25 \times 3 = 0.75 $$ > For the floating-point reservation station, the latency is 6 cycles. The occupancy is: $$ N_\text{fp} = 0.25 \times 6 = 1.5 $$ > To compute the occupancy of the commit queue, we calculate the occupancy for each type of instruction separately and then sum them. > > For integer instructions, the latency is 8 cycles, so the occupancy is: $$ N_\text{commit-int} = 0.25 \times 8 = 2 $$ > For floating-point instructions, the latency is 20 cycles, so the occupancy is: $$ N_\text{commit-fp} = 0.25 \times 20 = 5 $$ > The total commit queue occupancy is: $$ N_\text{commit-total} = 2 + 5 = 7 $$ --- ## Problem `I` Suppose you are designing a branch predictor for a pipelined in-order processor with the following stages. ![processor](https://hackmd.io/_uploads/H1waX6K81g.png) The processor operates with the following characteristics: - It can issue at most one instruction per cycle. - Branch addresses are determined at the end of the B stage. - Branch conditions (taken or not taken) are resolved at the end of the E stage. Within this pipeline, control flow instructions are handled as follows: - During the A stage, the instruction at $PC+4$ is fetched by default. 
- In the B stage (Branch Address Calculation/Begin Decode), conditional branch instructions (e.g., `BLE`/`BNE`) consult the branch predictor. If a branch is predicted to be taken, all subsequent instructions in the pipeline are flushed, and the PC is redirected to the calculated branch target address. - `JAL` instructions execute the jump to the target address in the B stage, flushing all subsequent instructions in the pipeline. The implementation of `JALR` does not need to be considered. For the following questions, consider the given C code and its corresponding RISC-V assembly equivalent. Note that there are no branches other than those explicitly included in the assembly code. - [ ] C Code ```c int X[1000000]; int a, b; ... for (int i = 1000000; i > 0; i--) { a = X[i]; ... / * Some work independent of a or b */ ... if (a >= 0) { b += b + b; } else { b += b * b; } } ``` - [ ] RISC-V Assembly ```c # x1, x2, and x3 contain the values of &X, a, and b respectively. _loop: LW x2, 0(x1) # Load word: x2 = X[i] ADDI x1, x1, -1 # Decrement x1 to point to the next element ... # Some work independent of a or b ... B1: BLT x2, x0, _else # Branch to _else if a < 0 ADD x3, x3, x3 # b = b + b JAL x0, B2 # Jump to B2 _else: MUL x3, x3, x3 # b = b * b B2: BNE x1, x0, _loop # Continue the loop if x1 != 0 ``` This setup forms the basis for analyzing the control flow and behavior of the program. Assume that branch `B1` is always Not Taken, while the branch predictor predicts all branches as Taken. On average, how many instructions will the processor flush per iteration of the loop? * I01 = ? > ==13== > The loop involves two conditional branches. For the first branch, which is mispredicted as taken, 7 instructions are flushed. Additionally, the unconditional `JAL` is taken, resulting in another 3 instructions being flushed. Therefore, the total number of instructions flushed is $7 + 3 + 3 = 13$. The branch predictor uses a single branch history table (BHT) indexed by the lower 10 bits of the PC. Each entry in the table is a 1-bit counter that predicts "taken" if the value is 1 and "not taken" if the value is 0. The table is updated with the actual branch outcome at the end of the execute stage (1 for "taken," 0 for "not taken"). Given that the elements of array $X$ are uniformly distributed between $[-101, 100]$, will the predictor achieve high or low accuracy for branches $B1$ and $B2$? Branch `B1` will have a prediction accuracy of approximately __ I02 __ (rate) since the branch direction is completely random. * I02 = ? > ==50%== `B2` is __ I03 __ (condition). * I03 = ? > ==always taken== or equivalent --- ## Problem `J` To design a directory-based MSI coherence protocol, the directory uses a bit vector to represent the sharer set for each cache line. In an $N$-core system, the directory maintains an $N$-bit wide bit vector for each line it tracks. If the $i$-th bit of the bit vector is set, it indicates that core $i$'s cache holds a copy of the line in either the Shared (S) or Modified (M) state. Assume the processor has $N$ cores, each with a 1KB private cache using 32B cache lines. How many entries must the directory contain to track all the cache lines in the private caches? * J01 = ? > ==$32 \times N$== or equivalent > Each core has 32 private cache lines, and with \( N \) cores, the total number of cache lines is $32 \times N $. The design of the sharer sets is modified to replace the bit vector with a set of sharer pointers. Each pointer stores the core ID of one sharer. 
For instance, if cores 10 and 12 hold a cache line in the Shared (S) state, the sharer set in the directory will consist of two pointers, "10" and "12." Each sharer set can hold a maximum of 16 sharer pointers. In an $N$-core system, where $N$ is a power of 2, how many total bits are required to represent all sharer pointers in a sharer set? * J02 = ? > ==$16 \times \log N$== or equivalent > With 16 pointers and each pointer requiring $\log N$ bits to represent the core ID, the total number of bits needed is $16 \times \log N$. --- ## Problem `K` The following questions examine memory accesses from multiple cores in a cache-coherent shared memory system. Each question evaluates the possible outcomes under the following memory consistency models: - Sequential Consistency (SC): Ensures that memory operations appear to execute in a single global order, consistent with the program order of each core. - Total Store Order (TSO), x86-style: Allows stores to be reordered after later loads. Stores from the same core are visible in the same order to other cores. Store-to-load forwarding within the same core is permitted. - Relaxed Memory Order (RMO): Permits both loads and stores to be reordered after later loads and stores, with store-to-load forwarding allowed. Assume all registers (e.g., `r1`, `r2`, ...) and memory locations (e.g., `a`, `b`, ...) are initialized to 0. Consider a cache-coherent shared-memory system executing the following two threads on two different cores. Assume that memory locations `a`, `b`, and `c` are initialized to 0. - [ ] Thread Execution 1 | Thread 1 (T1) | Thread 2 (T2) | |--------------------------|--------------------------| | `ST (a) <- 1` | `LD r1 <- (a)` | | `LD r2 <- (b)` | `ST (b) <- 1` | | `ST (a) <- 2` | | Identify the memory consistency models where the final values `r1 = 1`, `r2 = 0`, and `(a) = 1` are possible. (Answer in SC, TSO, and RMO) * K01 = ? > ==TSO, RMO== Consider the following four threads executing on a cache-coherent shared-memory system: - [ ] Thread Execution 2 | Thread 1 (T1) | Thread 2 (T2) | Thread 3 (T3) | Thread 4 (T4) | |--------------------------|--------------------------|--------------------------|--------------------------| | `ST (a) <- 1` | `ST (a) <- 2` | `LD r1 <- (a)` | `LD r2 <- (a)` | | | | `LD r3 <- (a)` | `LD r4 <- (a)` | Determine the memory consistency models where the final values `r1 = 1`, `r2 = 2`, `r3 = 2`, and `r4 = 1` are possible. (Answer in SC, TSO, and RMO) * K02 = ? > ==RMO== Consider the following three threads executing on a cache-coherent shared-memory system: - [ ] Thread Execution 3 | Thread 1 (T1) | Thread 2 (T2) | Thread 3 (T3) | |--------------------------|--------------------------|--------------------------| | `ST (a) <- 1` | `LD r1 <- (a)` | `LD r3 <- (b)` | | | `LD r2 <- (b)` | `LD r4 <- (a)` | | `ST (b) <- 1` | | | Determine the memory consistency models where the final values `r1 = 1`, `r2 = 0`, `r3 = 1`, and `r4 = 0` are possible. (Answer in SC, TSO, and RMO) * K03 = ? > ==TSO, RMO== A [sequence lock](https://en.wikipedia.org/wiki/Seqlock) is a synchronization primitive used to allow a single writer thread to "publish" data that can be read by multiple reader threads. Its key property is that the writer is never blocked, though readers may be temporarily blocked. The code provided below is correct when executed on a system with sequential memory consistency but is not guaranteed to be correct under weaker memory consistency models. 
```c typedef struct { volatile uint64_t seq; volatile object_t object; } seqlock_t; /* Writer function, called by a single thread */ void publish(seqlock_t* seqlock, object_t* src) { atomic_add(&seqlock->seq, 1); // Increment sequence number (odd implies write in progress) memcpy(&seqlock->object, src, sizeof(object_t)); // Copy data to the shared object atomic_add(&seqlock->seq, 1); // Increment sequence number (even implies write completed) } /* Reader function, can be called by any thread */ void get_obj(seqlock_t* seqlock, object_t* dest) { while (1) { uint64_t seq = seqlock->seq; // Capture current sequence number if (seq % 2 == 1) continue; // Skip if a write is in progress memcpy(dest, &seqlock->object, sizeof(object_t)); // Copy the object if (seqlock->seq == seq) break; // Verify no writes occurred during the read } } ``` The writer thread increments the sequence number twice—once before writing and once after. An odd sequence number indicates that a write operation is ongoing. The writer ensures that readers see either the previous consistent state or the fully updated state. Readers check the sequence number before and after reading the object. If the sequence number changes during the read or is odd at the start, the reader retries. This ensures that the reader observes a consistent view of the data. Notes 1. Non-Atomic memcpy: Since `memcpy` is not atomic, readers can observe partial updates if the sequence lock's guarantees are violated. 2. No Memory Barriers: On this system, atomic operations (e.g., `atomic_add`) do not act as memory barriers, which may lead to reordering of instructions under weaker memory consistency models. 3. Object Size: The type `object_t` is assumed to be an opaque structure larger than 16 bytes, making it unsuitable for atomic memory operations. The sequence lock ensures that each reader observes a complete and consistent version of the object. If this sequence lock is used on a multicore system with TSO (Total Store Order) consistency, would the code remain correct? If yes, briefly explain why. If no, clearly specify the memory barriers that need to be added to ensure correctness. * K04 = ? (yes or no) > ==Yes== * K05 = ? (Explain) > (意思相近就給分) > The code would still be correct under TSO. TSO only allows stores to be re-ordered after loads. `get_obj()` does not have any stores to shared data, so it is clearly correct. > `publish()` is also correct, but it requires more justification. Let’s reason through the loads and stores it does to shared data (with atomics broken down into LD/ST pairs): - `LD seq` <- cannot be reordered later due to RAW dependency. - `ST seq` <- cannot be reordered later because TSO doesn’t allow ST-ST reordering. - `ST object[0] ... object[n]` <- cannot be reordered later because TSO doesn’t allow ST-LD reordering. - `LD seq` <- cannot be reordered later due to RAW dependency. - `ST seq` If this sequence lock is used on a multicore system with RMO (Relaxed Memory Order) consistency, would the code remain correct? If yes, briefly explain why. If no, clearly specify the memory barriers that must be added to ensure correctness. * K06 = ? (yes or no) > ==No== * K07 = ? (Explain) > (意思相近就給分) > With RMO, we need to add fences in both `publish()` and `get_obj()`. Specifically, we need RR fences both before and after the `memcpy()` in `get_obj()` and WW fences before and after the `memcpy()` in `publish()`. The code above uses two atomic additions in the `publish` function. 
Assuming a correct implementation on either TSO or RMO (with the necessary barriers in place), would replacing the atomic increments with ordinary non-atomic increments be sufficient? Why or why not? * K08 = ? (yes or no) > ==Yes== * K09 = ? (Explain) > Non-atomic instructions are fine. > > The problem states that only a single thread will call `publish()`. Therefore, there will never be data races on `&seqlock->seq`. We can simply regard the store to `&seqlock->seq` as the serialization point. --- ## Problem `L` This problem evaluates a virtual memory system with the following characteristics: 1. Two page sizes are supported: 4KB and 1MB (instead of the traditional 4MB). 2. Each Page Table Entry (PTE) is 16 bytes (instead of 8 bytes). The page table structure is summarized below, including the sizes of the page tables and data pages (not drawn to scale). ![virtual-memory](https://hackmd.io/_uploads/B155NCFLyg.png) The processor includes a data TLB with 64 entries, and each entry can map either a 4KB page or a 1MB page. After a TLB miss, a hardware engine walks the page table to reload the TLB. The TLB operates using a first-in/first-out (FIFO) replacement policy. The following program adds the elements from two 1MB arrays and stores the results in a third 1MB array. Assume that $1 \text{MB} = 1,048,576 \text{Bytes}$, and the starting addresses of the arrays are as follows: ```c byte A[1048576]; // 1MB array at 0x00001000000 byte B[1048576]; // 1MB array at 0x00001100000 byte C[1048576]; // 1MB array at 0x00001200000 for (int i = 0; i < 1048576; i++) C[i] = A[i] + B[i]; ``` Assume this program is the only process running on the system, and ignore instruction memory or operating system overheads. The data TLB is initially empty. The system has no cache, and each memory lookup has a latency of 100 cycles. If all data pages are 4KB, compute the ratio of cycles for address translation to cycles for data access. Address translation cycles * L01 = ? > ==300== > 100 + 100 +100 (for L1, L2 and L3 PTE) Data access cycles * L02 = ? > ==400K== or equivalent > 4K * 100 > (there is no cache, this assumes that memory access is byte-wise) If all data pages are 1MB, compute the ratio of cycles for address translation to cycles for data access. Address translation cycles * L03 = ? > 200 > 100 + 100 (for L1, L2 PTE) Data access cycles * L04 = ? > ==100M== or equivalent > 1M * 100 > (there is no cache, this assumes that memory access is byte-wise) Assume the system includes a PTE cache with a latency of one cycle. The PTE cache stores page table entries and has unlimited capacity. Calculate the ratio of cycles spent on address translation to cycles spent on data access for the case where data pages are 4KB. Address translation cycles * L05 = ? > ==77200== > (256*3 + 3 + 1) * 100 > (Note that the arrays are contiguous and share some PTE entries. 256 L3 PTEs per array * 3 arrays, 1 L2 PTE per array * 3 arrays, 1 L1 PTE) Data access cycles * L06 = ? > ==300M== or equivalent > 3M*100 What is the minimum number of entries required in the PTE cache to achieve the same performance as an unlimited PTE cache? Assume the PTE cache does not store L3 PTE entries and all data pages are 4KB. * L07 = ? 
> ==4== > (1 for L1 and 3 for L2) --- ## Problem `M` Suppose we have the following two functions, each running in its own thread: ```c #define SIZE (1 << 30) char array[SIZE]; void thread1() { while (true) { for (int i = 0; i < SIZE; i += 64) array[i] += 1; } } void thread2() { int i = 1; while (true) { // Easy to predict: loops forever int x = i++; while (x != 1) { // Medium difficulty to predict if ((x & 0x1) == 0) // Hard to predict x >>= 1; else x = 3 * x + 1; } } } ``` - [ ] Assembly Code for `thread1`: ```c thread1(): # Assume a1 holds the address of `array` and a5 holds SIZE .outer: mv a2, a1 # Set a2 to point to `array` li a3, 1 # Initialize loop counter .inner: lbu a4, 0(a2) # Load byte from `array[i]` addi a2, a2, 64 # Increment pointer by 64 addi a3, a3, 1 # Increment loop counter addi a4, a4, 1 # Increment value in `array[i]` sb a4, -64(a2) # Store updated value back to `array[i]` bne a3, a5, .inner # Repeat loop if counter < SIZE j .outer # Restart the outer loop ``` - [ ] Assembly Code for `thread2`: ```c thread2(): li a4, 0 # Initialize i li a6, 1 # Set constant for x != 1 check li a3, 3 # Set multiplier for 3 * x + 1 .outer: addi a4, a4, 1 # Increment i mv a1, a4 # Copy i to x .inner: be a1, a6, .outer # If x == 1, exit inner loop andi a2, a1, 1 # Check if x is odd bne a2, zero, .odd # If x is odd, go to .odd srai a1, a1, 1 # If x is even, divide x by 2 j .inner # Repeat inner loop .odd: mul a1, a1, a3 # Multiply x by 3 addi a1, a1, 1 # Add 1 to x j .inner # Repeat inner loop ``` Processor Design ![processor](https://hackmd.io/_uploads/HyZSoAtLJe.png) The program runs on an in-order, two-way coarse-grained multithreaded processor with the following characteristics: - Instruction and data cache hits take 1 cycle. - All code fits in the instruction cache, so there are no instruction cache misses. - Data cache misses have a latency of 100 cycles. - The data cache is 32KB with 64B cache lines. - The ALU takes 1 cycle. - The pipeline has full bypassing. - Branch misprediction penalty is 5 cycles, meaning the time from fetching the mispredicted branch's following instruction to fetching the correct target is 5 cycles. If each of these threads were executed individually on this processor, without multithreading, which thread would achieve a higher IPC? * M01 = ? > ==thread2== Explain your reasoning. * M02 = ? > (意思相近就給分) > Cache misses (100 cycle penalty) are much more expensive than branch mispredictions. So, even though thread2 can mispredict ~2 branches per iteration, the penalty is much smaller, resulting in higher IPC. Suppose the core switches between threads upon encountering a cache miss. When a data cache miss occurs, the processor flushes the pipeline, switches to the other thread, and continues executing that thread until it encounters a miss. The pipeline flush discards the load instruction and all subsequent instructions, up to 6 instructions in total (including the load). Would this design enable both threads to execute close to their peak single-threaded IPC in steady state? No. `thread2` has no memory instructions that could miss in cache. Therefore, once we switch to thread2, we would never have another cache miss again, starving thread1 entirely. To fix this, we could switch on: - both branch mispredicts and cache misses - both cache misses and responses from memory - cache misses and then switch back __ M03 __ cycles later * M03 = ? > ==100== Now, consider running multiple instances of `thread1` on this CPU. 
To achieve maximum IPC, more than 2-way multithreading will be required. What is the minimum number of `thread1` instances that must run concurrently on this pipeline to maximize IPC? * M04 = ? > ==9== > Each time a cache miss occurs, 6 instructions are flushed from the pipeline. These instructions must be re-executed when the thread is switched back, assuming the load will now hit in the cache. > In each inner-loop iteration, there are 6 instructions to execute, plus an additional 6 instructions that were flushed, resulting in a total of 12 instructions per loop iteration. To maximize IPC, at least ceil(100 / 12) = 9 threads are required. Suppose we implement a policy of switching on branch instructions. In a coarse-grained multithreaded implementation, generating a "switch" signal for the multithreading control logic can occur at various stages of the pipeline. If the switch occurs at stage $N$, stages $0$ through $N-1$ must be flushed. Now, consider running two instances of `thread2`. Which of the following switching points would result in the highest overall IPC? Select one of them. a. Switch on every branch (as predicted by the BTB) b. Switch on every branch misprediction (detected in the ALU) c. Switch on every branch (as determined during decode) d. Switch on every branch predicted taken (as determined by the branch predictor) * M05 = ? > ==a== --- ## Problem `N` Multitasking is a fundamental concept in computer systems, allowing multiple tasks or processes to run concurrently, improving resource utilization and system responsiveness. Traditional multitasking in operating systems relies on threads and processes managed by the kernel. These kernel-level threads are preemptively scheduled, meaning the operating system decides when to switch between tasks. While effective, this approach introduces significant overhead due to context switching, where the CPU saves and restores registers, memory mappings, and other state information. In scenarios where lightweight concurrency is needed, kernel-level multitasking often proves excessive. For instance, tasks such as cooperative multitasking, event-driven programming, or managing independent components within a simulation do not require preemption. Here, user-level constructs like coroutines or userspace threads become powerful tools. [Coroutines](https://en.wikipedia.org/wiki/Coroutine) provide a mechanism for cooperative multitasking, where tasks explicitly yield control to other tasks at well-defined points. Unlike kernel-level threads, coroutines operate entirely in user space, requiring minimal overhead. This makes them an excellent choice for applications like asynchronous programming, where tasks need to pause while waiting for external events such as I/O completion, without blocking other operations. Coroutines maintain their execution state, including the program counter and local variables, allowing them to resume from where they left off seamlessly. This feature makes them particularly useful in game development, where behaviors like AI or animations require frequent context switching within a single-threaded environment. Userspace threads, on the other hand, are an abstraction similar to coroutines but with a broader scope. They simulate traditional threads but are scheduled and managed entirely in user space, independent of the kernel. Userspace threads offer greater control over scheduling policies and avoid the overhead of kernel-mode context switches. 
However, unlike coroutines, userspace threads can preempt one another, making them closer in behavior to kernel threads but much more efficient for many workloads. The trade-off between coroutines and userspace threads lies in complexity and flexibility. Coroutines are simpler, focusing on cooperative multitasking, while userspace threads provide additional features like preemption. In both cases, these constructs shine in environments where performance and fine-grained control are critical. They reduce the reliance on the kernel, avoid heavy context-switching overhead, and enable developers to implement concurrency with precision tailored to the application's needs. For example, a high-performance web server might use coroutines to handle thousands of simultaneous connections without creating thousands of threads, which would strain the system's resources. Similarly, scientific simulations with many interacting entities might leverage userspace threads to model their behavior efficiently. Both approaches showcase how user-level multitasking constructs can outperform kernel-level alternatives in specific domains, enabling scalable and efficient multitasking solutions. Implementing general-purpose coroutines typically requires a second call stack, a feature not natively supported by the C language. A practical, though platform-specific, approach involves using inline assembly to manipulate the stack pointer during the coroutine's initial setup. After obtaining a second call stack, the [setjmp](https://man7.org/linux/man-pages/man3/siglongjmp.3.html) and [longjmp](https://man7.org/linux/man-pages/man3/longjmp.3p.html) functions from the standard C library can be used to switch between coroutines. These functions work by saving and restoring the stack pointer, program counter, callee-saved registers, and other internal state required by the ABI. This ensures that returning to a coroutine after yielding restores its full state, as if resuming from a paused function call. Alternatively, minimalist implementations can achieve similar results by using inline assembly to swap only the stack pointer and program counter, ignoring other registers. While this approach clobbers all other registers, it can be much faster than relying on `setjmp` and `longjmp`, which must conservatively save all registers specified by the ABI. By contrast, the minimalist method allows the compiler to manage register spilling, saving only what is actively in use. We now aim to implement a minimalist coroutine system on the RV32I architecture, assuming a 32-bit system without floating-point operations. - [ ] `main.c` ```c #include <stddef.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> /* The context_t structure holds the saved state of a thread, * including the return address (ra), stack pointer (sp), * callee-saved registers (s0-s11), the entry function pointer, * and a user-defined data pointer. */ typedef struct { uintptr_t ra; uintptr_t sp; uintptr_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11; void (*entry)(void *); void *data; } context_t[1]; extern void switch_context(context_t from, context_t to); extern void helper(void); /* Initialize the context_t structure for a thread. Sets up * the stack pointer, assigns the helper function as the return address, * and stores the thread entry function and associated data. 
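 * On RV32 (ilp32), every field of context_t is 4 bytes wide, so entry ends up
 * at byte offset 56 and data at offset 60; the helper routine reads them from
 * those offsets when the new thread first runs.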
*/ static void initialize_context(context_t ctx, void *stack_base, size_t stack_size, void (*entry)(void *data), void *data) { #define RED_ZONE 128 uintptr_t stack_end = (uintptr_t) stack_base + stack_size - RED_ZONE; stack_end &= /* N01. Fill your code here */; /* Ensure the stack is 16-byte aligned */ memset(ctx, 0, sizeof *ctx); ctx->sp = stack_end; ctx->ra = (uintptr_t) helper; ctx->data = data; ctx->entry = entry; } #define ITERATIONS (1000 * 1000) #define STACKSIZE 32768 static context_t thread1, thread2, thread3; /* * Entry function for thread1. It performs a fixed number of context * switches between thread1 and thread2, and finally prints the total * number of context switches. */ static void thread1_func(void *data) { for (int i = 0; i < ITERATIONS; ++i) switch_context(thread1, thread2); printf("%d context switches\n", 3 * ITERATIONS); } /* Entry function for thread2. It alternates between thread1 and thread3 * for a series of context switches, then loops indefinitely switching * to thread3. */ static void thread2_func(void *data) { switch_context(thread2, thread1); // Switch to thread1 switch_context(thread2, thread1); // Switch to thread1 switch_context(thread2, thread3); // Switch to thread3 switch_context(thread2, thread1); // Switch to thread1 for (int i = 0;; ++i) switch_context(thread2, thread3); // Loop switching to thread3 } /* * Entry function for thread3. It alternates between thread2 and thread1, * then loops indefinitely switching to thread1. */ static void thread3_func(void *data) { switch_context(thread3, thread2); // Switch to thread2 switch_context(thread3, thread1); // Switch to thread1 for (int i = 0;; ++i) switch_context(thread3, thread1); // Loop switching to thread1 } #define ALIGN 64 // Cache line alignment /* * Allocates memory aligned to the cache line size to ensure proper alignment. */ static void *aligned_malloc(int size) { void *mem = malloc(size + ALIGN + sizeof(void *)); void **ptr = (void **) (((uintptr_t) mem + ALIGN + sizeof(void *)) & /* N02. Fill your code here */); ptr[-1] = mem; return ptr; } #define RED_ZONE 128 /* * Sets up the stack for a new thread and initializes its context. */ static void init_context(context_t ctx, void (*entry)(void *data), void *data) { void *stack = aligned_malloc(STACKSIZE + RED_ZONE + 8) - 8; initialize_context(ctx, stack, STACKSIZE, entry, data); } /* * Main function initializes thread2 and thread3, then starts thread1. */ int main() { init_context(thread2, thread2_func, (void *) 0xDEEEECAF); init_context(thread3, thread3_func, (void *) 0xF000000D); thread1_func((void *) 0xBABEBABE); return 0; } ``` - [ ] `context.S` ```c .text .align 4 .globl switch_context .type switch_context, @function /* * Saves the callee-saved registers from the current thread context * into the memory pointed to by a0, then loads the saved state * for the next thread context from the memory pointed to by a1. 
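 * Note that a1 is not clobbered here, so when the loaded ra points at helper
 * (a freshly initialized context), helper still finds the context address in a1.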
 */
switch_context:
    sw ra, 0(a0)    // Save return address
    sw sp, 4(a0)    // Save stack pointer
    sw s0, 8(a0)    // Save callee-saved registers
    sw s1, 12(a0)
    sw s2, 16(a0)
    sw s3, 20(a0)
    sw s4, 24(a0)
    sw s5, 28(a0)
    sw s6, 32(a0)
    sw s7, 36(a0)
    sw s8, 40(a0)
    sw s9, 44(a0)
    sw s10, 48(a0)
    sw s11, 52(a0)

    lw ra, 0(a1)    // Load return address
    lw sp, 4(a1)    // Load stack pointer
    lw s0, 8(a1)    // Load callee-saved registers
    lw s1, 12(a1)
    lw s2, 16(a1)
    lw s3, 20(a1)
    lw s4, 24(a1)
    lw s5, 28(a1)
    lw s6, 32(a1)
    lw s7, 36(a1)
    lw s8, 40(a1)
    lw s9, 44(a1)
    lw s10, 48(a1)
    lw s11, 52(a1)

    ret             // Return to the next context
    .size switch_context, .-switch_context

/*
 * The helper function is the entry point for a new thread.
 * It retrieves the entry function and its argument from the saved
 * context (still addressed by a1), then transfers control to the
 * entry function.
 */
.align 4
.globl helper
.type helper, @function
helper:
    lw a0, __ N03 __ (a1)  // Load the argument (data) from the context
    lw t0, __ N04 __ (a1)  // Load the entry function pointer
    jr t0                  // Jump to the entry function
    .size helper, .-helper
```

Build the above using the command:
```shell
$ $(CROSS_COMPILE)gcc -march=rv32i -mabi=ilp32 -o coroutine.elf context.S main.c
```

Complete the code to ensure it works as expected.
* N01 = ?

> ==`-16`== or equivalent

* N02 = ?

> ==`~(ALIGN - 1)`==

* N03 = ?

> ==`60`==

* N04 = ?

> ==`56`==
> In `context_t`, `entry` and `data` are the 15th and 16th 32-bit words (byte offsets 56 and 60). `helper` must load `data` into `a0` and the `entry` pointer into `t0` before `jr t0`, so N03 is 60 and N04 is 56.
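As a cross-check on N03 and N04, the field offsets of the context structure can be verified with `offsetof`. This snippet is an illustrative addition, not part of the quiz sources; `context_layout_t` is a plain-struct duplicate of the quiz's `context_t` (which is declared as a one-element array type), introduced only so `offsetof` can be applied directly.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof */
#include <stdint.h>

/* Plain-struct copy of the fields of context_t, in the same order. */
typedef struct {
    uintptr_t ra;
    uintptr_t sp;
    uintptr_t s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11;
    void (*entry)(void *);
    void *data;
} context_layout_t;

/* entry is the 15th word and data the 16th word of the context. */
static_assert(offsetof(context_layout_t, entry) == 14 * sizeof(uintptr_t),
              "entry offset");
static_assert(offsetof(context_layout_t, data) == 15 * sizeof(uintptr_t),
              "data offset");
```

With the ilp32 ABI used by the build command above, `sizeof(uintptr_t)` is 4, so these offsets are 56 and 60 bytes, matching the loads in `helper`.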