# 2024q1 Week 9 Quiz
contributed by < `cuda-chen` >
> https://hackmd.io/@sysprog/linux2024-quiz9
## Quiz 1
## Quiz 2
XFFF = `copy_from_user`
XGGG = `copy_from_user`
XHHH = `copy_to_user`
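To see why two of the blanks are `copy_from_user` and the last one is `copy_to_user`, consider the data path of the ioctl: the operand matrices live in user space and must be copied into the module before the computation, and the product must be copied back afterwards. The sketch below only illustrates that direction of flow; the argument layout (`struct matrix_args`) and the helper name are hypothetical, and which copy corresponds to which blank depends on the actual quiz listing.
```clike
#include <linux/uaccess.h> /* copy_from_user(), copy_to_user() */

/* Hypothetical layout of the user-space argument, for illustration only */
struct matrix_args {
    int a[MAT_SIZE][MAT_SIZE];   /* first operand */
    int b[MAT_SIZE][MAT_SIZE];   /* second operand */
    int out[MAT_SIZE][MAT_SIZE]; /* destination for the product */
};

static long do_compute(struct matrix_args __user *uargs)
{
    /* two copies in: the operands live in user space */
    if (copy_from_user(matrix_a, uargs->a, sizeof(matrix_a)))
        return -EFAULT;
    if (copy_from_user(matrix_b, uargs->b, sizeof(matrix_b)))
        return -EFAULT;

    /* ... multiply into result ... */

    /* one copy out: hand the product back to the caller */
    if (copy_to_user(uargs->out, result, sizeof(result)))
        return -EFAULT;
    return 0;
}
```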
### Explain how the code works
### Rewrite with CMWQ, and propose an effective access model for batched matrix multiplication
- https://hackmd.io/@RinHizakura/H1PKDev6h
- https://events.static.linuxfound.org/sites/events/files/slides/Async%20execution%20with%20wqs.pdf
- https://hackmd.io/@sysprog/linux2024-integration/%2F%40sysprog%2Flinux2024-integration-c?utm_source=preview-mode&utm_medium=rec
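Building on the CMWQ material linked above, here is a minimal sketch of how the per-tile work could be queued on a workqueue instead of spawning one kthread per tile. The names (`matmul_wq`, `struct tile_work`, `tile_work_fn`) and the decision to split only over rows and columns of `result` are my assumptions, not the reference solution; splitting over `k` as well would reintroduce concurrent `+=` updates to the same element.
```clike
#include <linux/workqueue.h>
#include <linux/slab.h>

struct tile_work {
    struct work_struct work;
    int row, col; /* top-left corner of the result tile this item owns */
};

static struct workqueue_struct *matmul_wq; /* alloc_workqueue("matmul", WQ_UNBOUND, 0) in init */

static void tile_work_fn(struct work_struct *work)
{
    struct tile_work *tw = container_of(work, struct tile_work, work);
    int i, j, k;

    /* Each item owns one SUBMAT_SIZE x SUBMAT_SIZE tile of result, so no
     * two items ever write the same element; result must be zeroed first. */
    for (i = tw->row; i < tw->row + SUBMAT_SIZE; i++)
        for (k = 0; k < MAT_SIZE; k++)
            for (j = tw->col; j < tw->col + SUBMAT_SIZE; j++)
                result[i][j] += matrix_a[i][k] * matrix_b[k][j];
    kfree(tw);
}

static int queue_tile(int row, int col)
{
    struct tile_work *tw = kmalloc(sizeof(*tw), GFP_KERNEL);

    if (!tw)
        return -ENOMEM;
    tw->row = row;
    tw->col = col;
    INIT_WORK(&tw->work, tile_work_fn);
    queue_work(matmul_wq, &tw->work);
    return 0;
}
```
The ioctl handler would call `queue_tile()` for every tile and then `flush_workqueue(matmul_wq)` to wait for all items, which replaces the per-thread completion handshake.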
### Find matrix multiplication projects on GitHub, such as matmul, matmul-bench, matmul-cpu, libxsmm, and Matrix_Multiply_using_Arm_Neon_and_Avx, compare their performance, analyze their implementations, and summarize techniques for improving matrix multiplication
The optimization work is based on the code in https://github.com/Cuda-Chen/SRCNN-cpp.
#### loop interchange
Consider the following matrix multiplication code:
```clike
for (int i = start_row; i < end_row; i++) {
    for (int j = 0; j < MAT_SIZE; j++) {
        result[i][j] = 0;
        for (int k = 0; k < MAT_SIZE; k++)
            result[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}
```
Notice that `matrix_b` is accessed in column-major order: the innermost loop varies `k`, so consecutive iterations read `matrix_b[k][j]` one row apart. Since C stores two-dimensional arrays in row-major order, this access pattern causes a large number of cache misses; with a typical 64-byte cache line and 4-byte elements, each line brought in from `matrix_b` supplies only one useful element (instead of 16) when the matrix is too large to stay in cache.
Because the partial products of each element can be accumulated in any order (addition is associative and commutative), we can interchange the `j` and `k` loops:
```clike
for (int i = start_row; i < end_row; i++) {
    for (int k = 0; k < MAT_SIZE; k++) {
        for (int j = 0; j < MAT_SIZE; j++)
            result[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}
```
With the loops interchanged, the innermost loop now walks `matrix_b[k][j]` along `j`, i.e. in row-major order. The only drawback is that `result` must be initialized (every element set to 0) before the multiplication starts, because the per-element reset no longer happens inside the loop nest:
```clike!
static long matrix_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    switch (cmd) {
    case MATRIX_IOCTL_COMPUTE: {
        /* Set each element of result to zero before the += accumulation */
        int a, b;

        for (a = 0; a < MAT_SIZE; a++) {
            for (b = 0; b < MAT_SIZE; b++)
                result[a][b] = 0;
        }
        ...
    }
    }
}
```
#### tiling/blocking
We can assign each thread one sub-matrix (tile) to compute. Because matrix multiplication decomposes into block-wise partial products, the result stays the same, and each thread keeps working on a small, contiguous region of the operands, which improves cache locality and therefore performance.
Remember to work out how many elements actually fit within the cache when choosing the tile size (see the compile-time check after the code below).
```clike!
typedef struct start_dim {
    int x;
    int y;
    int z;
} dim;

static int worker_thread(void *data)
{
    dim d = *(dim *) data;
    int start_i = d.x;
    int end_i = start_i + SUBMAT_SIZE;
    int start_j = d.y;
    int end_j = start_j + SUBMAT_SIZE;
    int start_k = d.z;
    int end_k = start_k + SUBMAT_SIZE;
    int i, j, k;

    kfree(data); /* the argument was kmalloc'ed by the ioctl handler */

    /* i-k-j order keeps the accesses to matrix_b row-major within the tile */
    for (i = start_i; i < end_i; ++i) {
        for (k = start_k; k < end_k; k++) {
            for (j = start_j; j < end_j; j++)
                result[i][j] += matrix_a[i][k] * matrix_b[k][j];
        }
    }
    complete(&computation_done);
    return 0;
}

static long matrix_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    switch (cmd) {
    case MATRIX_IOCTL_COMPUTE: {
        int i, j, k;
        ...
        /* Create one worker thread per sub-matrix */
        for (i = 0; i < MAT_SIZE; i += SUBMAT_SIZE) {
            for (j = 0; j < MAT_SIZE; j += SUBMAT_SIZE) {
                for (k = 0; k < MAT_SIZE; k += SUBMAT_SIZE) {
                    dim *thread_arg =
                        kmalloc(sizeof(struct start_dim), GFP_KERNEL);

                    thread_arg->x = i;
                    thread_arg->y = j;
                    thread_arg->z = k;
                    /* Note: threads handling different k tiles of the same
                     * (i, j) tile all update result[i][j], so the += must be
                     * serialized, or the split restricted to i and j only. */
                    kthread_run(worker_thread, thread_arg, "worker_thread");
                }
            }
        }
    }
    }
}
```
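Following up on the tile-size remark above, a quick way to sanity-check `SUBMAT_SIZE` against the cache is a compile-time assertion. The 32 KiB L1d figure and the `int` element type below are assumptions; check the actual sizes on the target machine (e.g. `getconf LEVEL1_DCACHE_SIZE`).
```clike
#include <linux/build_bug.h> /* static_assert() */

/* Assumed L1 data cache size; adjust for the target CPU */
#define L1D_CACHE_BYTES (32 * 1024)

/* The three tiles a thread touches at once (from matrix_a, matrix_b and result)
 * should fit in L1d: 3 * SUBMAT_SIZE^2 * element size <= L1D_CACHE_BYTES.
 * With 4-byte elements, SUBMAT_SIZE = 32 needs 3 * 32 * 32 * 4 = 12 KiB. */
static_assert(3 * SUBMAT_SIZE * SUBMAT_SIZE * sizeof(int) <= L1D_CACHE_BYTES,
              "tile working set exceeds the assumed L1d size");
```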
### Try using SIMD instruction sets such as SSE/AVX/NEON in the Linux kernel module and reduce data access latency
### Read "LLaMA Now Goes Faster on CPUs" and summarize the techniques it uses to speed up matrix multiplication
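One kernel-specific point for this task: SIMD/FPU registers are not preserved for kernel code, so any SSE/AVX/NEON use must be bracketed by `kernel_fpu_begin()`/`kernel_fpu_end()` (on arm64, `kernel_neon_begin()`/`kernel_neon_end()`). The inline-SSE inner loop below is only a sketch under assumptions that the quiz code does not guarantee: the matrices hold 32-bit floats and `MAT_SIZE` is a multiple of 4.
```clike
#include <asm/fpu/api.h> /* kernel_fpu_begin()/kernel_fpu_end(), x86 */

/* Sketch only: assumes float matrices and MAT_SIZE divisible by 4 */
static void matmul_rows_sse(int start_row, int end_row)
{
    int i, j, k;

    kernel_fpu_begin(); /* make it safe to clobber the SSE registers */
    for (i = start_row; i < end_row; i++) {
        for (k = 0; k < MAT_SIZE; k++) {
            for (j = 0; j < MAT_SIZE; j += 4) {
                /* result[i][j..j+3] += matrix_a[i][k] * matrix_b[k][j..j+3] */
                asm volatile("movss   %1, %%xmm0\n\t"
                             "shufps  $0, %%xmm0, %%xmm0\n\t" /* broadcast a[i][k] */
                             "movups  %2, %%xmm1\n\t"
                             "mulps   %%xmm1, %%xmm0\n\t"
                             "movups  %0, %%xmm2\n\t"
                             "addps   %%xmm2, %%xmm0\n\t"
                             "movups  %%xmm0, %0"
                             : "+m"(result[i][j])
                             : "m"(matrix_a[i][k]), "m"(matrix_b[k][j])
                             : "xmm0", "xmm1", "xmm2", "memory");
            }
        }
    }
    kernel_fpu_end();
}
```
This naive version reloads and stores `result[i][j..j+3]` on every iteration; keeping the accumulators in registers across the `k` loop (register blocking) would further reduce the data-access latency this task asks about.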
sparse matrix?
- https://justine.lol/matmul/
- https://www.cs.cmu.edu/~scandal/cacm/node9.html
## Quiz 3