# 2024q1 Week 9 Quiz

contributed by < `cuda-chen` >

> https://hackmd.io/@sysprog/linux2024-quiz9

## Problem 1

## Problem 2

- XFFF = `copy_from_user`
- XGGG = `copy_from_user`
- XHHH = `copy_to_user`

### Explain how the code works

### Rewrite with CMWQ and propose an efficient access model for batched matrix multiplication

- https://hackmd.io/@RinHizakura/H1PKDev6h
- https://events.static.linuxfound.org/sites/events/files/slides/Async%20execution%20with%20wqs.pdf
- https://hackmd.io/@sysprog/linux2024-integration/%2F%40sysprog%2Flinux2024-integration-c?utm_source=preview-mode&utm_medium=rec

A workqueue-based sketch of the tiled computation appears at the end of this note.

### Find matrix-multiplication projects on GitHub, such as matmul, matmul-bench, matmul-cpu, libxsmm, and Matrix_Multiply_using_Arm_Neon_and_Avx, compare their performance, analyze their implementations, and summarize techniques for speeding up matrix multiplication

Use the code in https://github.com/Cuda-Chen/SRCNN-cpp as the target of the optimizations.

#### loop interchange

Consider the matrix-multiplication code:

```clike
for (int i = start_row; i < end_row; i++) {
    for (int j = 0; j < MAT_SIZE; j++) {
        result[i][j] = 0;
        for (int k = 0; k < MAT_SIZE; k++)
            result[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}
```

In the innermost loop `matrix_b` is accessed in column-major order, but C lays out arrays in row-major order, so this access pattern causes a large number of cache misses.

Because the additions that accumulate each element of the result are associative and commutative, we can interchange the two inner loops and rewrite the multiplication as follows:

```clike
for (int i = start_row; i < end_row; i++) {
    for (int k = 0; k < MAT_SIZE; k++) {
        for (int j = 0; j < MAT_SIZE; j++)
            result[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}
```

With this loop order, `matrix_b` is also accessed in row-major order. The only drawback is that `result` must be zero-initialized before the multiplication starts:

```clike!
static long matrix_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    int a, b;

    switch (cmd) {
    case MATRIX_IOCTL_COMPUTE:
        /* Set every element of result to zero before accumulating */
        for (a = 0; a < MAT_SIZE; a++) {
            for (b = 0; b < MAT_SIZE; b++)
                result[a][b] = 0;
        }
        ...
    }
}
```

#### tiling/blocking

We can have each thread compute one sub-matrix (tile). Because matrix multiplication can be decomposed into products of blocks, the result stays the same, while each thread touches only the elements adjacent to the ones it is computing, which improves locality and therefore performance.

Remember to work out how many elements fit in the cache when choosing `SUBMAT_SIZE`.

```clike!
typedef struct start_dim {
    int x;
    int y;
    int z;
} dim;

static int worker_thread(void *data)
{
    dim d = *(dim *) data;
    int start_i = d.x;
    int end_i = start_i + SUBMAT_SIZE;
    int start_j = d.y;
    int end_j = start_j + SUBMAT_SIZE;
    int start_k = d.z;
    int end_k = start_k + SUBMAT_SIZE;
    int i, j, k;

    kfree(data);

    /* i-k-j order keeps both matrix_a and matrix_b accesses row major */
    for (i = start_i; i < end_i; ++i) {
        for (k = start_k; k < end_k; k++) {
            for (j = start_j; j < end_j; j++)
                result[i][j] += matrix_a[i][k] * matrix_b[k][j];
        }
    }

    complete(&computation_done);
    return 0;
}

static long matrix_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    int i, j, k;

    switch (cmd) {
    case MATRIX_IOCTL_COMPUTE:
        ...
        /* Create a worker thread for each submatrix */
        for (i = 0; i < MAT_SIZE; i += SUBMAT_SIZE) {
            for (j = 0; j < MAT_SIZE; j += SUBMAT_SIZE) {
                for (k = 0; k < MAT_SIZE; k += SUBMAT_SIZE) {
                    dim *thread_arg = kmalloc(sizeof(struct start_dim), GFP_KERNEL);
                    if (!thread_arg)
                        return -ENOMEM;
                    thread_arg->x = i;
                    thread_arg->y = j;
                    thread_arg->z = k;
                    kthread_run(worker_thread, thread_arg, "worker_thread");
                }
            }
        }
    }
}
```

Note that with this partitioning, tiles that share the same (i, j) block but have different k blocks run in separate threads and all accumulate into the same `result[i][j]`, so these updates race with one another. Either the k dimension should stay inside a single thread, or the accumulation needs per-block serialization; the workqueue sketch at the end of this note keeps the whole k loop inside each work item for this reason.

### Try SIMD instruction sets such as SSE/AVX/NEON in a Linux kernel module and reduce data-access latency

A `kernel_fpu_begin()`/`kernel_fpu_end()` sketch appears at the end of this note.

### Study "LLaMA Now Goes Faster on CPUs" and summarize its techniques for speeding up matrix multiplication

sparse matrix?

- https://justine.lol/matmul/
- https://www.cs.cmu.edu/~scandal/cacm/node9.html

## Problem 3
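
Returning to the CMWQ item in Problem 2: below is a minimal sketch of how the per-tile `kthread_run()` calls could be replaced by a CMWQ workqueue. It reuses `MAT_SIZE`, `SUBMAT_SIZE`, `result`, `matrix_a`, and `matrix_b` from the module excerpts above; the names `matmul_wq`, `struct tile_work`, `tile_work_func`, and `compute_with_cmwq` are illustrative, not taken from the reference solution. The k loop is kept inside each work item so that no two work items accumulate into the same `result[i][j]`.

```clike
#include <linux/slab.h>
#include <linux/workqueue.h>

struct tile_work {
    struct work_struct work;
    int row;                        /* starting row of this result tile */
    int col;                        /* starting column of this result tile */
};

static struct workqueue_struct *matmul_wq;

static void tile_work_func(struct work_struct *work)
{
    struct tile_work *tw = container_of(work, struct tile_work, work);
    int i, j, k;

    /* Whole k range stays inside one work item, so work items never race
     * on the same result[i][j]; result[][] is assumed zeroed beforehand. */
    for (i = tw->row; i < tw->row + SUBMAT_SIZE; i++)
        for (k = 0; k < MAT_SIZE; k++)
            for (j = tw->col; j < tw->col + SUBMAT_SIZE; j++)
                result[i][j] += matrix_a[i][k] * matrix_b[k][j];

    kfree(tw);
}

static long compute_with_cmwq(void)
{
    int i, j;

    matmul_wq = alloc_workqueue("matmul_wq", WQ_UNBOUND, 0);
    if (!matmul_wq)
        return -ENOMEM;

    /* One work item per (row, column) tile of result */
    for (i = 0; i < MAT_SIZE; i += SUBMAT_SIZE) {
        for (j = 0; j < MAT_SIZE; j += SUBMAT_SIZE) {
            struct tile_work *tw = kmalloc(sizeof(*tw), GFP_KERNEL);

            if (!tw)
                break;
            tw->row = i;
            tw->col = j;
            INIT_WORK(&tw->work, tile_work_func);
            queue_work(matmul_wq, &tw->work);
        }
    }

    flush_workqueue(matmul_wq);     /* wait until every queued tile is done */
    destroy_workqueue(matmul_wq);
    return 0;
}
```

Compared with `kthread_run()`, CMWQ lets the kernel decide how many workers to run concurrently, and `flush_workqueue()` replaces the per-thread `completion` bookkeeping.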
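
For the SSE/AVX/NEON item in Problem 2: the x86 kernel is normally built with `-mno-sse`/`-mno-avx` and does not preserve FPU/SIMD registers for kernel code, so any vector code must be placed in a translation unit compiled with the needed ISA flags and bracketed by `kernel_fpu_begin()`/`kernel_fpu_end()` from `<asm/fpu/api.h>`. The following is only a sketch under several assumptions: the matrices hold `float`, `MAT_SIZE` is a multiple of 8, this file is built with `-mavx`, and the function name `matmul_avx_rows` is illustrative.

```clike
#include <asm/fpu/api.h>
#include <immintrin.h>

/* Assumes float matrices, MAT_SIZE % 8 == 0, and this file built with -mavx;
 * result[][] must be zeroed before the call. */
static void matmul_avx_rows(int start_row, int end_row)
{
    int i, j, k;

    kernel_fpu_begin();             /* FPU/SIMD registers usable from here */
    for (i = start_row; i < end_row; i++) {
        for (k = 0; k < MAT_SIZE; k++) {
            __m256 a = _mm256_set1_ps(matrix_a[i][k]);

            for (j = 0; j < MAT_SIZE; j += 8) {
                __m256 b = _mm256_loadu_ps(&matrix_b[k][j]);
                __m256 c = _mm256_loadu_ps(&result[i][j]);

                c = _mm256_add_ps(c, _mm256_mul_ps(a, b));
                _mm256_storeu_ps(&result[i][j], c);
            }
        }
    }
    kernel_fpu_end();               /* FPU state restored; no SIMD past here */
}
```

`kernel_fpu_begin()` disables preemption on x86, so the protected region should stay short (for example one row block per begin/end pair) when `MAT_SIZE` is large; on Arm the analogous pair is `kernel_neon_begin()`/`kernel_neon_end()` from `<asm/neon.h>`.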