2017q1 Homework5 (matrix)

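# 2017q1 Homework5 (matrix)
contributed by < `baimao8437` >

## Development environment
```
baimao@baimao-Aspire-V5-573G:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 69
Model name:            Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
Stepping:              1
CPU MHz:               1711.242
CPU max MHz:           2600.0000
CPU min MHz:           800.0000
BogoMIPS:              4589.38
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3
```

## ~~Reading~~ Appreciating the code
### makefile
First, a respectful read of [a strong classmate's in-depth analysis](https://hackmd.io/s/rkFhTyK3x#makefile).

### test-matrix.c
```
MatrixAlgo *matrix_providers[] = {
    &NaiveMatrixProvider,
};
```
This puts all the different implementations declared in `matrix.h` into an array, so the tests can run each of them in turn.
```
/* Available matrix providers */
extern MatrixAlgo NaiveMatrixProvider;
```
The drawback is that every implementation needs its own extern declaration. I prefer to extern a single API interface, like the polymorphism technique I used earlier in [phonebook-concurrent](https://github.com/baimao8437/phonebook-concurrent/commit/c818f0a648aad31d9d1488d6f50e4ec6360045bc), which also keeps the code concise. The existing makefile may have to be rewritten, but designing repeated measurements becomes more convenient and more familiar to me.

## Rewrite
### Stopwatch extension
The familiar stopwatch again. This time the required time unit is ms, so I wanted to add a time unit control feature: after `Stopwatch.create("ms")`, read returns the time in ms. The code looks roughly like this. There are only three units, and the if/else style feels a bit ugly...
```c
watch_p create(char *unit)
{
    /* time_unit is the number of microseconds per requested unit */
    if (!strcmp(unit, "sec"))
        S->time_unit = 1000000.0;
    else if (!strcmp(unit, "ms"))
        S->time_unit = 1000.0;
    else if (!strcmp(unit, "us"))
        S->time_unit = 1.0;
    else
        assert(NULL && "time unit error");
    ...
```

### Makefile design
Rewritten in a style similar to phonebook-concurrent. More executables are generated in the tests folder, but designing a benchmark with repeated runs becomes easier, and `matrix.h` needs only a single extern while each method's .c file supplies its own implementation of the operations.

### test-matrix_%
With the makefile change above, the tests no longer store the address of every method in an array, which reduces the use of pointers, e.g.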
`algo->assign` is written directly as `MatrixProvider.assign`.

## Standing on [the shoulders of giants](https://hackmd.io/s/Hk-llHEyx)
Grateful for the hard work of those who came before, I happily played the humble coder and brought in submatrix, sse, and sse_prefetch. With suitable modifications such as
```
(src1 + (x + 0) * src1_w + k) => &(PRIV(l)->values[(x + 0)][k])
src2[k * src2_w + y] => PRIV(r)->values[k][y]
```
they work, but avx does not yet, because only 4x4 matrices are supported for now.

## Performance analysis
- perf
    - naive
    ```
    Performance counter stats for './tests/test-matrix_naive' (10 runs):

             2,624      cache-misses       #   18.275 % of all cache refs   ( +- 23.58% )
            14,360      cache-references                                    ( +-  3.02% )

       0.000689048 seconds time elapsed                                     ( +- 16.87% )
    ```
    - submatrix
    ```
    Performance counter stats for './tests/test-matrix_submatrix' (10 runs):

             1,294      cache-misses       #    9.315 % of all cache refs   ( +- 48.59% )
            13,894      cache-references                                    ( +-  3.40% )

       0.000510149 seconds time elapsed                                     ( +-  6.42% )
    ```
    - sse
    ```
    Performance counter stats for './tests/test-matrix_sse' (10 runs):

             1,243      cache-misses       #    9.022 % of all cache refs   ( +- 26.07% )
            13,778      cache-references                                    ( +-  1.32% )

       0.000488172 seconds time elapsed                                     ( +-  6.02% )
    ```
    - sse_prefetch
    ```
    Performance counter stats for './tests/test-matrix_sse_prefetch' (10 runs):

               773      cache-misses       #    5.917 % of all cache refs   ( +- 49.66% )
            13,064      cache-references                                    ( +-  0.54% )

       0.000407139 seconds time elapsed                                     ( +-  1.86% )
    ```
- gnuplot
![](https://i.imgur.com/YfjfQUe.png)
Each method was run 1000 times, and the plot uses the results that fall within the 95% confidence interval.

## Learning a new packaging approach
This change follows [the teacher's example](https://github.com/jserv/arith_register); I wrote it on my knees the whole way.

The big rewrite above changed the packaging of the original makefile and matrix provider to match phonebook-concurrent, which I knew better. In truth, I was intimidated by the subtleties of this makefile and did not know where to start designing the benchmark, so I fell back on what I was familiar with. The teacher's example was a guiding light and showed me once again how deep the C language goes. The core of the change is:
```c
// in matrix.h
#define REGISTER_MUL(nameX)\
    MatrixAlgo MatrixAlgo_##nameX __attribute__((section("MatrixAlgo"))) = { \
        .name = #nameX, .assign = assign, .equal = equal, .mul = mul, \
    };

/* Available matrix providers */
extern MatrixAlgo __start_MatrixAlgo[], __stop_MatrixAlgo[];
#define MUL_IMPL_BEGIN __start_MatrixAlgo
#define MUL_IMPL_END __stop_MatrixAlgo
```
Adding a new method no longer requires touching matrix.h, which is a big convenience. In each matrix_*method*.c,
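once the implementation is finished, invoking the macro `REGISTER_MUL(*method name*)` automatically maps it into the MatrixAlgo interface array. In the tests, a simple
```c
for (MatrixAlgo *p = MUL_IMPL_BEGIN; p < MUL_IMPL_END; p++) {
    ...
}
```
is enough to run each implementation through the pointer p.

I am still thinking about how to handle the duplicated code (assign & equal) so that each matrix_*method*.c file only provides a different mul.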
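
One possible direction for the duplicated assign & equal, as a rough sketch and not the actual project code: keep the two shared functions in one place and let the registration macro take only the method-specific mul. Everything here is made up for illustration: the fixed 4x4 `Matrix` layout, the function signatures inside `MatrixAlgo`, and the names `matrix_assign`, `matrix_equal`, `REGISTER_MUL_WITH`, `mul_ikj`, `algo_count` and `all_algos_agree`. It also assumes GCC or Clang targeting ELF (e.g. Linux), where the linker provides the `__start_`/`__stop_` symbols for a section whose name is a valid C identifier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the real matrix type (assumption: fixed 4x4). */
typedef struct {
    float values[4][4];
} Matrix;

/* Assumed shape of the interface; the real signatures may differ. */
typedef struct {
    const char *name;
    void (*assign)(Matrix *dst, const Matrix *src);
    bool (*equal)(const Matrix *l, const Matrix *r);
    void (*mul)(Matrix *dst, const Matrix *l, const Matrix *r);
} MatrixAlgo;

/* Shared helpers, written once (e.g. in a hypothetical matrix_common.c). */
void matrix_assign(Matrix *dst, const Matrix *src) { *dst = *src; }

bool matrix_equal(const Matrix *l, const Matrix *r)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            if (l->values[i][j] != r->values[i][j])
                return false;
    return true;
}

/* Hypothetical variant of REGISTER_MUL: only mul differs per method.
 * The explicit alignment keeps entries densely packed, so walking from
 * __start_MatrixAlgo to __stop_MatrixAlgo visits whole entries only. */
#define REGISTER_MUL_WITH(nameX, mul_fn)                               \
    MatrixAlgo MatrixAlgo_##nameX                                      \
        __attribute__((section("MatrixAlgo"),                          \
                       aligned(__alignof__(MatrixAlgo)))) = {          \
            .name = #nameX, .assign = matrix_assign,                   \
            .equal = matrix_equal, .mul = mul_fn,                      \
    }

extern MatrixAlgo __start_MatrixAlgo[], __stop_MatrixAlgo[];

/* Method 1: textbook i-j-k multiplication. */
static void mul_naive(Matrix *dst, const Matrix *l, const Matrix *r)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float sum = 0;
            for (int k = 0; k < 4; k++)
                sum += l->values[i][k] * r->values[k][j];
            dst->values[i][j] = sum;
        }
}
REGISTER_MUL_WITH(naive, mul_naive);

/* Method 2: i-k-j loop order, standing in for an optimized version. */
static void mul_ikj(Matrix *dst, const Matrix *l, const Matrix *r)
{
    Matrix tmp;
    memset(&tmp, 0, sizeof tmp);
    for (int i = 0; i < 4; i++)
        for (int k = 0; k < 4; k++) {
            float a = l->values[i][k];
            for (int j = 0; j < 4; j++)
                tmp.values[i][j] += a * r->values[k][j];
        }
    *dst = tmp;
}
REGISTER_MUL_WITH(ikj, mul_ikj);

/* Number of implementations the linker collected in the section. */
size_t algo_count(void)
{
    return (size_t)(__stop_MatrixAlgo - __start_MatrixAlgo);
}

/* Run every registered mul on the same input and compare the results
 * against the first registered implementation. */
bool all_algos_agree(const Matrix *l, const Matrix *r)
{
    Matrix ref, out;
    assert(algo_count() > 0);
    __start_MatrixAlgo[0].mul(&ref, l, r);
    for (MatrixAlgo *p = __start_MatrixAlgo; p < __stop_MatrixAlgo; p++) {
        p->mul(&out, l, r);
        if (!p->equal(&out, &ref))
            return false;
    }
    return true;
}
```

In a multi-file layout, each matrix_*method*.c would then contain only its own mul plus one registration line, while the shared helpers live in a single common file. Whether that fits the existing makefile is something I have not checked.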