<style> h2.part{color:#000000;} h3.part{color:#D92424;} h4.part{color:#005BB0;} h5.part{color:#FD6F0A;} h6.part{color:#4400B0;} </style> # 2017q1 Homework5(matrix) contributed by < `paul5566` > ## 作業要求 * 詳細閱讀 [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx),試著重現實驗 * 在 GitHub 上 fork [matrix_oo](https://github.com/sysprog21/matrix_oo) 專案,指出原本設計的優缺點,並且提出改進方案 (從物件導向封裝、效能,還有資料驗證等觀點) * 提示: 可執行 `make check`,自動編譯測試程式並驗證 * 在 [matrix_oo](https://github.com/sysprog21/matrix_oo) 的基礎上,整合 [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx) 的成果,注意,應該建立新的檔案,如 `matrix_sse.c` 和 `matrix_avx.c` 並且設計新的效能分析工具,需要用到程式碼的 `Stopwatch`,時間刻度用 millisecond (ms) * 克服 [matrix_oo](https://github.com/sysprog21/matrix_oo) 裡頭只支援 4x4 矩陣的限制,考慮 sub-matrix 和 column-major 來提升 cache 使用率 * 驗證過程需要考慮不同大小的矩陣 * 需要考慮到矩陣乘法的有效性,不是任意矩陣都可相互執行乘法 * 需要一併考慮 data alignment 議題 * 學習 crypto API 並設計介面使演算法容易擴充和比較 * 學習的案例: [Linaro Cortex String benchmark for ARMv8](https://wiki.linaro.org/WorkingGroups/Kernel/ARMv8/cortex-strings) * 關於 benchmark 設計可參考: [ssvb/tinymembench](https://github.com/ssvb/tinymembench) 和 [hglm/test-arm-kernel-memcpy](https://github.com/hglm/test-arm-kernel-memcpy) * 允許透過亂數作為資料輸入,驗證矩陣乘法的效能,需要視覺化 (透過 gnuplot 或 R) * 留意記憶體存取,避免 leak,參考 [你所不知道的C語言:技巧篇](https://hackmd.io/s/HyIdoLnjl) 提到的 smart pointer * 指出原有的程式碼是否 [thread-safe](https://en.wikipedia.org/wiki/Thread_safety),並提出改進方案 ## 學習目標 - Matrix Multiplication using SIMD重現實驗 - 留意記憶體存取,避免 leak - 語法與概念整理 ### Matrix Multiplication using SIMD重現實驗 * [連結](https://hackmd.io/OzAMBMGNgRhhaAnANgKYDN4BYOXgDnQENUlgtwAjAZkiKNFACYg=) * 數據 | 版本 | naive | sub-matrix | sse | sse_prefetch | strassen | | --- | ----- | ---------- | --- | ------------ | ------------------- | | 時間 (us) | 27017801 | 8030999 | 2779563 | 2902576 | 8359757 | * avx和avx2無法支援的樣子 ### 留意記憶體存取,避免 leak 我們用[valgrind](http://blog.yslin.tw/2014/03/c-valgrind.html)去觀察 輸入指令 ``` $ valgrind --leak-check=full ./test-matrix ``` ##### key world 是 Heap Summary ``` ==3686== HEAP SUMMARY: ==3686== 64 bytes in 1 blocks are definitely lost in loss record 1 of 4 ==3686== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==3686== by 0x400CDA: assign (matrix_naive.c:16) ==3686== by 0x40063B: main (test-matrix.c:13) ==3686== ==3686== 64 bytes in 1 blocks are definitely lost in loss record 2 of 4 ==3686== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==3686== by 0x400CDA: assign (matrix_naive.c:16) ==3686== by 0x400731: main (test-matrix.c:22) ==3686== ==3686== 64 bytes in 1 blocks are definitely lost in loss record 3 of 4 ==3686== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==3686== by 0x400C56: mul (matrix_naive.c:37) ==3686== by 0x400745: main (test-matrix.c:31) ==3686== ==3686== 64 bytes in 1 blocks are definitely lost in loss record 4 of 4 ==3686== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==3686== by 0x400CDA: assign (matrix_naive.c:16) ==3686== by 0x400837: main (test-matrix.c:33) ``` 因此我們可以知道總共四次沒有free掉 allocate的memory ### 語法與概念整理 * "#" "macro" 這類的程式碼都是交給"pre-processor"可節省run time的時間 [reference](http://stackoverflow.com/questions/653466/reading-zend-engine-api-code-what-does-double-hash-mean) example sizeof () preprofessor * size_t * example:As an implication, size_t is a type guaranteed to hold any array index.array index is a["0"], a["1"], a["2"]; * 比如說這個array很大,如果我們是用int 這個type宣告array ,當 a["i"]在跑for(int i=0;i>OverFlow;i++ )會發生overflow這個問題我們就可以用size_t來取代i這個array index [reference](http://stackoverflow.com/questions/2550774/what-is-size-t-in-c) * malloc return type * RETURN VALUE(在查詢man後得到這段資訊) * The malloc( ) and calloc( ) functions return a pointer to the allocated memory, which is suitably aligned for any built-in type. On error, these functions return NULL. * 這邊的any built-in type 類似void!!? * 比較一下兩者差異 * typedef strcut * struct ### Trace Code ```clike= int a[] = {1, 2, 3, 4}; //element is an intterget value int *a[] = { }; // the element is address ``` ```clike= MatrixAlgo *matrix_providers[] = { //[]這個沒有限定可以存幾個的array &NaiveMatrixProvider, //stored many pointer of MatrixAlgo }; ``` #### 這個東西就是 一個指向structure的pointer 因為MartrixAlgo是一個structure ```clike= int main() { MatrixAlgo *algo = matrix_providers[0]; Matrix dst, m, n, fixed; algo->assign(&m, (Mat4x4) { .values = { { 1, 2, 3, 4, }, { 5, 6, 7, 8, }, { 1, 2, 3, 4, }, { 5, 6, 7, 8, }, }, }); } ``` 所以trace code到MatricALgo可以了解到這在幹麻 ```clike= typedef struct { void (*assign)(Matrix *thiz, Mat4x4); bool (*equal)(const Matrix *l, const Matrix *r); bool (*mul)(Matrix *dst, const也注入大量日本元素,並且用到北野武主演,但孭飛的那位,叫 Scarlett Johansson,明明就是飾演草薙素子, Matrix *l, const Matrix *r); } MatrixAlgo ``` 這個可以看到裡面包什麼東西 ```clike= MatrixAlgo *algo = matrix_providers[0]; //這個是存入matric_provide[0]這個位置裡面的資料 //這邊弄一個指標可以參考下面的 ``` 這邊可參考這個範例 ```clike= #include <stdio.h> int main(void){ int b = 10; int *a[] = {&b};// 裡面的element存了一個 指標指向int 型態的address int *c = a[0];//這個指標指向a[0]的位置 printf("c = %d\n", *c); return 0; } ``` ### 重點!! ```clike= #include "matrix.h" #include <stdlib.h> struct naive_priv { float values[4][4]; }; #define PRIV(x) ((struct naive_priv *) ((x)->priv)) //PRIV(x)的定義為 static void assign(Matrix *thiz, Mat4x4 data)//這個Mat4x4是在marco定義的 /* FIXME: don't hardcode row & col */ thiz->row = thiz->col = 4; thiz->priv = malloc(4 * 4 * sizeof(float)); for (int i = 0; i < 4; i++) for (int j = 0; j < 4; j++) PRIV(thiz)->values[i][j] = data.values[i][j]; } ``` #### 用ctags這套件去trace code發現 "priv"是[matrix.h] 裡Matrix的一個memeber ```clike= typedef struct { int row, col; void *priv; } Matrix; ``` 從此得知 priv這只是一個 void 的pointer可以轉形成任何資料型態 ```clike= #define PRIV(x) ((struct naive_priv *) (("x")->priv)) ``` 這邊作法就轉型 example 原本的malloc #### ((struct naive_priv *) 把Matrix "x"中priv的member轉形成(typecast)一個指向naive_priv的pointer 好讓data.values[i][j]資料存入"naive_priv"這個structure裡面 問一下男人會發現: malloc() return type RETURN VALUE: The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably aligned for any built-in type. ```clike= { /* FIXME: don't hardcode row & col */ thiz->row = thiz->col = 4; thiz->priv = malloc(4 * 4 * sizeof(float)); for (int i = 0; i < 4; i++) for (int j = 0; j < 4; j++) PRIV(thiz)->values[i][j] = data.values[i][j]; } ``` ### Memory Pool * 改善Page Fault * 恐龍本9-1 ## Reference [stanleytazi](https://hackmd.io/s/Byc9LNyTe) [jack81306](https://hackmd.io/s/S1_Zi37ae) [xdennisx](https://hackmd.io/CYJgpgjGIIYKwFoQE4IHYEBYBGxkIA4YBmCBZOGKAgNmADMZ6wg=) [jack81306](https://hackmd.io/s/S1_Zi37ae) [Valgrind &Memoey Leak](http://blog.yslin.tw/2014/03/c-valgrind.html) https://msdn.microsoft.com/en-us/library/09dwwt6y(VS.80).aspx