2017q1 Homework5(matrix)

2017q1 Homework5(matrix) === contributed by <`vtim9907`> # 開發環境 - OS : Ubuntu 16.04.2 LTS - Kernel Version : 4.8.0-36-generic - Memory : 32 GB - CPU : Intel® Core™ i5-6600 Processor - cache : - L1i cache : 4*32 KB - L1d cache : 4*32 KB - L2 cache : 4*256 KB - L3 cache : 6144 KB - L1 cache imformation ( sudo dmidecode -t cache ) : ``` Handle 0x001E, DMI type 7, 19 bytes Cache Information Socket Designation: L1 Cache Configuration: Enabled, Not Socketed, Level 1 Operational Mode: Write Back Location: Internal Installed Size: 256 kB Maximum Size: 256 kB Supported SRAM Types: Synchronous Installed SRAM Type: Synchronous Speed: Unknown Error Correction Type: Parity System Type: Unified Associativity: 8-way Set-associative ``` # 重現原實驗 ## 修改矩陣只能 4 * 4 的限制原本作業提供的 code 只能做 4 * 4 的矩陣乘法，所以必須先把這個限制拔除，才好做之後的測試。首先，我從最開始的 naive 版本開始改。 - 修改 matrix.h 開頭 ```clike= #define M 4 #define N 4 /* predefined shortcut */ #define DECLARE_MATRIX(col, row) \ typedef struct { int values[col][row]; } Mat #if (M >= N) DECLARE_MATRIX(M, M); #else DECLARE_MATRIX(N, N); #endif ``` 矩陣大小由開頭 define 的 M、N 決定。 - 修改 matrix_naive.c 因為矩陣的大小變成動態，所以必須修改原來寫死矩陣大小的 struct ```clike= struct naive_priv { int **values; }; ``` #### `assign(Matrix *thiz, int row, int col, Mat data)` 也因為改成動態，malloc 的方式也改變，主要只是用指標的指標的概念，以完成二維矩陣的 memory 配置 ```clike= struct naive_priv *temp = malloc(sizeof(struct naive_priv)); temp->values = (int **) malloc(dst->row * sizeof(int *)); for (int i = 0; i < dst->row; ++i) temp->values[i] = (int *) malloc(dst->col * sizeof(int)); dst->priv = temp; ``` #### `mul(Matrix *dst, const Matrix *l, const Matrix *r)` 然後在相乘的過程中，必須要做 error handling，因為兩個相乘的矩陣 size 可能不合，不是 $MxN * NxM$，這會造成相乘失敗，所以在相乘之前，先確認兩矩陣的 row 和 column 合不合 ```clike= if (l->col != r->row) { printf("Wrong matrix size!"); return false; } ``` - 修改 test-matrix.c 原本兩個 source matrix 的內容是寫死的，但為了更貼近真實，我改成用 random 的方式，隨機產生矩陣內的資料 ```clike= srand(time(NULL)); for (int i = 0; i < M; ++i) for (int j = 0; j < N; ++j) data1.values[i][j] = rand() % 100; for (int i = 0; i < N; ++i) for (int j = 0; j < M; ++j) data2.values[i][j] = rand() % 100; ``` >怕矩陣的 size 很大時，有可能會產生 overflow 的狀況，所以姑且先讓矩陣內的元素限制在 0 ~ 99 之間。但是因為資料不是寫死的了，所以原來的正確答案已經不適用，不過因為資料是浮動的，目前還沒有想到一個有效的方法可以動態產生確定正確的答案，所以我只能印出來，在手動確定答案是否正確＠＠試了很多次不同 size 的矩陣，結果都正確。 > 雖然手動驗證有點愚蠢，但若確定這個 naive 版本是正確的，之後就可以以這個版本為標的，未其他版本做驗證。 ### naive 版效能測試測試 64x128 * 128x256 的矩陣乘法，總共跑 50000 次，取 95% 信賴區間，篩掉誤差過大的數據，圖形如下： ![](https://i.imgur.com/3U1lFcb.png) 基本上平均時間就落在 0.0014s 左右。 ## submatrix 版本效能測試在 [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx#matrix-multiplication-using-simd) 裡的實驗，其中一個是把矩陣拆成多個 4 * 4 的小矩陣，讓 cache 較有機會發揮作用，但也因此多了一個限制，就是矩陣的 row 和 column 都必須是 4 的倍數。 > 有想到大概的解決方法，不過有點大費周章，再思考比較有效率的解決方法初步直接修改 naive 為 submatrix 版本( 沒用 AVX , SSE ...)： ```clike= for (int i = 0; i < dst->row; i += 4) { for (int j = 0; j < dst->col; j += 4) { for (int k = 0; k < N; k += 4) { for (int i2 = 0; i2 < 4; ++i2) { for (int j2 = 0; j2 < 4; ++j2) { for (int k2 = 0; k2 < 4; ++k2) { PRIV(dst)->values[i + i2][j + j2] += PRIV(l)->values[i + i2][k + k2] * PRIV(r)->values[k + k2][j + j2]; } } } } } } ``` 然後跑看看效能，跟前面條件一樣，都是跑 50000 次，取 95% 信賴區間： ![](https://i.imgur.com/frkEPCy.png) 結果跟原實驗不同，幾乎沒有加速多少，在想可能是因為 data alignment 的問題。 ### memalign() 尋找解決上面問題的過程中，在 [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx#matrix-multiplication-using-simd) 看到 memalign() 這個 function，查了一下 man page 看是在幹什麼的： ``` void *memalign(size_t alignment, size_t size); The obsolete function memalign() allocates size bytes and returns a pointer to the allocated memory. The memory address will be a multiple of alignment, which must be a power of two. ``` 恩... 很好， " obsolete function memalign() " ，記取上次 " register " 的教訓，先看官方的說明果然比較好，看來這個 function 雖然還是可以用，不過被標記為淘汰了，查了 [The GNU C Library: Aligned Memory Blocks](https://www.gnu.org/software/libc/manual/html_node/Aligned-Memory-Blocks.html) ，裡面提到 : ``` The memalign function is obsolete and aligned_alloc or posix_memalign should be used instead. ``` 雖然找不太到為什麼它被淘汰的原因，不過至少知道替代 function 該用哪個，所以又查了一下 aligned_alloc 是作什麼的： ``` void *aligned_alloc(size_t alignment, size_t size); The function aligned_alloc() is the same as memalign(), except for the added restriction that size should be a multiple of alignment. ``` 另外下方還備註 ``` The glibc malloc(3) always returns 8-byte aligned memory addresses, so these functions are only needed if you require larger alignment values. ``` 所以以我現在的狀況來看，我需要對齊的長度至少該大於 8-byte，所以用 malloc 是不夠的，所以合理覺得可以試試用 aligned_alloc 來取代 malloc 配置 matrix。 - aligned_alloc 取代 malloc 來配置 matrix 記憶體空間因為 aligned_alloc 是 c11 的東西，所以改了 Makefile ( -std=gnu11 )，然後跑跑看： ![](https://i.imgur.com/4OvACt1.png) 結果還是一樣... 我再想想是哪裡不對＠＠＠＠ > 也許要更換原本用指標的指標配置二維陣列的方式，改攤平成一維的實做方式，說不定有用？ # Reference - [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx#matrix-multiplication-using-simd)