---
tags: AI accelerators
---

# 2022q1 Lecture 2 (quiz2)

## Question `1`

This problem applies SIMD to the outer product MMM scheme of today's lecture.

- Consider outer product MMM between two 3x4 and 4x5 tiles derived by partitioning large matrices.
- The diagram below depicts the outer product between a column vector in the red tile and the corresponding row vector in the blue tile.
- Suppose that there are wide registers for SIMD operations and sufficient memory bandwidth for load/store operations to/from these registers.

Which of the following are true?

![](https://i.imgur.com/tnr0aZv.png)

- [x] `(a)` In this example, there are 4 outer products to be performed.
- [ ] `(b)` Using SIMD instructions operating on 5-long vectors, we can speed up each outer product computation 15x compared to non-SIMD scalar instructions.

:::info
Each row of the outer product scales a 5-long vector by a scalar and accumulates it, which takes 5 multiplies and 5 adds with scalar instructions. With 5-wide SIMD this becomes 1 vector multiply and 1 vector add, so the speedup is 5x, not 15x.
:::

- [ ] `(c)` None of the above.

---

## Question `2`

We have learned systolic arrays in today's lecture. Under the baseline scheme (below), a weight-stationary systolic array can perform matrix multiplication $W \times D$ where $W$ is the weight matrix and $D$ is the data matrix. (As noted in the lecture, it suffices to consider non-skewed versions of systolic arrays, given that we can later mechanically convert them to skewed versions for synchronization purposes.)

In practice, we often need to perform both $W \times D$ and $W^T \times D$ where $W^T$ is the transposed $W$, e.g., in Transformers for NLP. It turns out we can avoid storing an extra transposed version of $W$ in the systolic array by entering the rows of $D^T$ (red) from the left of the systolic array, rather than from the bottom, to meet columns of $W$ in computing $W^T \times D$, as depicted below. Now we ask which side of the systolic array the result matrix (blue) should be output from. Which of the following are true?
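The identity behind the transposed-input trick can be checked quickly. A minimal NumPy sketch (matrix sizes here are arbitrary, chosen only for illustration): feeding rows of $D^T$ from the left while $W$ stays stationary makes the array accumulate $D^T W$, and $(D^T W)^T = W^T D$, which is why the results come out transposed.

```python
import numpy as np

# Verify the identity W^T x D == (D^T x W)^T that underlies the trick:
# with W stationary and rows of D^T entering from the left, the array
# produces D^T x W, whose transpose is the desired W^T x D.
rng = np.random.default_rng(0)
W = rng.integers(0, 10, size=(4, 4))  # weight matrix (illustrative size)
D = rng.integers(0, 10, size=(4, 3))  # data matrix (illustrative size)

assert np.array_equal(W.T @ D, (D.T @ W).T)
print("W^T x D == (D^T x W)^T holds")
```

Since the array emits $D^T W$ row by row, the transposed result naturally exits along the orthogonal side of the array.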
![](https://i.imgur.com/eSlFaAC.png)
![](https://i.imgur.com/8lBKp5D.png)

- [ ] `(a)` Results are output from the right side of the systolic array

![](https://i.imgur.com/7N7dY85.png)

- [x] `(b)` Results are output from the top side of the systolic array

:::info
The results should be the transpose of the inputs, so they are output from the top.
:::

- [ ] `(c)` None of the above.

---

## Question `3`

We have learned the notion of arithmetic intensity (AI) and the roofline model. Which of the following are true?

![](https://i.imgur.com/Ia1uTk6.png =300x300)

X-Coordinate: Arithmetic Intensity (AI)
Y-Coordinate: Computation Throughput (CT)

- [ ] `(a)` We can move the flat part of the roofline higher via computation scheduling without upgrading the hardware.

:::info
The roofline is determined by the hardware; scheduling alone cannot raise it.
:::

- [ ] `(b)` We can move the ridge point to the left by increasing data reuse via scheduling

:::info
Scheduling can only increase the AI of the computation, moving its operating point to the right along the roofline; the ridge point itself is fixed by the hardware.
:::

- [x] `(c)` We can move the ridge point to the left by increasing the bandwidth to the external memory via a hardware upgrade.

:::danger
Increasing the memory bandwidth steepens the sloped (bandwidth-bound) part of the roofline, so it meets the flat part at a lower AI; hence the ridge point moves to the left.
:::

> [name=劉晉華] Not entirely sure about option (c).

---

## Question `4`

In today's lecture, we considered three cases of block shaping for MMM, as shown below. Suppose that the three cases from left to right can make full use of an increasing number of cores. Which of the following are true?

![](https://i.imgur.com/aWcJj0a.png)

- [x] `(a)` This can be accomplished without increasing the bandwidth requirement to the external memory.

:::info
This is the key feature of CAKE.
:::

- [ ] `(b)` This can be accomplished without increasing the total number of accesses per block to the external memory.

:::info
Number of external-memory accesses per block:
- `(a)` $kn + km$
- `(b)` $2kn + 2km$
- `(c)` $pkn + pkm$
:::

- [ ] `(c)` This can be accomplished without increasing the size of local memory.

:::info
Because the size of the computation block must grow, the size of the local memory must grow as well.
:::
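The access counts quoted for Question `4` option `(b)` can be sketched numerically. This is a minimal illustration under my own assumptions: an m-by-n output block with inner dimension k needs its k-by-m and k-by-n operand panels read from external memory, and if the block's work is split across p cores with each core fetching both panels, the per-block traffic scales by p (the `block_io` helper and the tile sizes are hypothetical, not from the lecture).

```python
# Per-block external-memory traffic for an m x n output block with inner
# dimension k. Assumption: each of the p cores sharing the block reads
# both operand panels (k*m and k*n elements), so traffic scales with p.
def block_io(m: int, n: int, k: int, p: int = 1) -> int:
    """External-memory accesses per output block under a p-way split."""
    return p * (k * m + k * n)

# Illustrative tile sizes:
m, n, k = 3, 5, 4
print(block_io(m, n, k, p=1))  # case (a): kn + km   -> 32
print(block_io(m, n, k, p=2))  # case (b): 2kn + 2km -> 64
```

This matches the pattern in the note: the per-block access count grows linearly with the number of cores unless the block shaping (as in CAKE) avoids re-fetching the operand panels.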