<style>
h2.part{color:#000000;}
h3.part{color:#D92424;}
h4.part{color:#005BB0;}
h5.part{color:#FD6F0A;}
h6.part{color:#4400B0;}
</style>
# 2017q1 Homework5(matrix)
contributed by < `paul5566` >
## 作業要求
* 詳細閱讀 [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx),試著重現實驗
* 在 GitHub 上 fork [matrix_oo](https://github.com/sysprog21/matrix_oo) 專案,指出原本設計的優缺點,並且提出改進方案 (從物件導向封裝、效能,還有資料驗證等觀點)
* 提示: 可執行 `make check`,自動編譯測試程式並驗證
* 在 [matrix_oo](https://github.com/sysprog21/matrix_oo) 的基礎上,整合 [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx) 的成果,注意,應該建立新的檔案,如 `matrix_sse.c` 和 `matrix_avx.c` 並且設計新的效能分析工具,需要用到程式碼的 `Stopwatch`,時間刻度用 millisecond (ms)
* 克服 [matrix_oo](https://github.com/sysprog21/matrix_oo) 裡頭只支援 4x4 矩陣的限制,考慮 sub-matrix 和 column-major 來提升 cache 使用率
* 驗證過程需要考慮不同大小的矩陣
* 需要考慮到矩陣乘法的有效性,不是任意矩陣都可相互執行乘法
* 需要一併考慮 data alignment 議題
* 學習 crypto API 並設計介面使演算法容易擴充和比較
* 學習的案例: [Linaro Cortex String benchmark for ARMv8](https://wiki.linaro.org/WorkingGroups/Kernel/ARMv8/cortex-strings)
* 關於 benchmark 設計可參考: [ssvb/tinymembench](https://github.com/ssvb/tinymembench) 和 [hglm/test-arm-kernel-memcpy](https://github.com/hglm/test-arm-kernel-memcpy)
* 允許透過亂數作為資料輸入,驗證矩陣乘法的效能,需要視覺化 (透過 gnuplot 或 R)
* 留意記憶體存取,避免 leak,參考 [你所不知道的C語言:技巧篇](https://hackmd.io/s/HyIdoLnjl) 提到的 smart pointer
* 指出原有的程式碼是否 [thread-safe](https://en.wikipedia.org/wiki/Thread_safety),並提出改進方案
## 學習目標
- Matrix Multiplication using SIMD重現實驗
- 留意記憶體存取,避免 leak
- 語法與概念整理
### Matrix Multiplication using SIMD重現實驗
* [連結](https://hackmd.io/OzAMBMGNgRhhaAnANgKYDN4BYOXgDnQENUlgtwAjAZkiKNFACYg=)
* 數據
| 版本 | naive | sub-matrix | sse | sse_prefetch | strassen |
| --- | ----- | ---------- | --- | ------------ | ------------------- |
| 時間 (us) | 27017801 | 8030999 | 2779563 | 2902576 | 8359757 |
* avx和avx2無法支援的樣子
### 留意記憶體存取,避免 leak
我們用[valgrind](http://blog.yslin.tw/2014/03/c-valgrind.html)去觀察
輸入指令
```
$ valgrind --leak-check=full ./test-matrix
```
##### key world 是 Heap Summary
```
==3686== HEAP SUMMARY:
==3686== 64 bytes in 1 blocks are definitely lost in loss record 1 of 4
==3686== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3686== by 0x400CDA: assign (matrix_naive.c:16)
==3686== by 0x40063B: main (test-matrix.c:13)
==3686==
==3686== 64 bytes in 1 blocks are definitely lost in loss record 2 of 4
==3686== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3686== by 0x400CDA: assign (matrix_naive.c:16)
==3686== by 0x400731: main (test-matrix.c:22)
==3686==
==3686== 64 bytes in 1 blocks are definitely lost in loss record 3 of 4
==3686== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3686== by 0x400C56: mul (matrix_naive.c:37)
==3686== by 0x400745: main (test-matrix.c:31)
==3686==
==3686== 64 bytes in 1 blocks are definitely lost in loss record 4 of 4
==3686== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==3686== by 0x400CDA: assign (matrix_naive.c:16)
==3686== by 0x400837: main (test-matrix.c:33)
```
因此我們可以知道總共四次沒有free掉 allocate的memory
### 語法與概念整理
* "#" "macro" 這類的程式碼都是交給"pre-processor"可節省run time的時間
[reference](http://stackoverflow.com/questions/653466/reading-zend-engine-api-code-what-does-double-hash-mean)
example
sizeof () preprofessor
* size_t
* example:As an implication, size_t is a type guaranteed to hold any array index.array index is a["0"], a["1"], a["2"];
* 比如說這個array很大,如果我們是用int 這個type宣告array ,當 a["i"]在跑for(int i=0;i>OverFlow;i++ )會發生overflow這個問題我們就可以用size_t來取代i這個array index
[reference](http://stackoverflow.com/questions/2550774/what-is-size-t-in-c)
* malloc return type
* RETURN VALUE(在查詢man後得到這段資訊)
* The malloc( ) and calloc( ) functions return a pointer to the allocated memory, which is suitably aligned for any built-in type. On error, these functions return NULL.
* 這邊的any built-in type 類似void!!?
* 比較一下兩者差異
* typedef strcut
* struct
### Trace Code
```clike=
int a[] = {1, 2, 3, 4}; //element is an intterget value
int *a[] = { }; // the element is address
```
```clike=
MatrixAlgo *matrix_providers[] = { //[]這個沒有限定可以存幾個的array
&NaiveMatrixProvider, //stored many pointer of MatrixAlgo
};
```
#### 這個東西就是 一個指向structure的pointer 因為MartrixAlgo是一個structure
```clike=
int main()
{
MatrixAlgo *algo = matrix_providers[0];
Matrix dst, m, n, fixed;
algo->assign(&m, (Mat4x4) {
.values = {
{ 1, 2, 3, 4, },
{ 5, 6, 7, 8, },
{ 1, 2, 3, 4, },
{ 5, 6, 7, 8, },
},
});
}
```
所以trace code到MatricALgo可以了解到這在幹麻
```clike=
typedef struct {
void (*assign)(Matrix *thiz, Mat4x4);
bool (*equal)(const Matrix *l, const Matrix *r);
bool (*mul)(Matrix *dst, const也注入大量日本元素,並且用到北野武主演,但孭飛的那位,叫 Scarlett Johansson,明明就是飾演草薙素子, Matrix *l, const Matrix *r);
} MatrixAlgo
```
這個可以看到裡面包什麼東西
```clike=
MatrixAlgo *algo = matrix_providers[0];
//這個是存入matric_provide[0]這個位置裡面的資料
//這邊弄一個指標可以參考下面的
```
這邊可參考這個範例
```clike=
#include <stdio.h>
int main(void){
int b = 10;
int *a[] = {&b};// 裡面的element存了一個 指標指向int 型態的address
int *c = a[0];//這個指標指向a[0]的位置
printf("c = %d\n", *c);
return 0;
}
```
### 重點!!
```clike=
#include "matrix.h"
#include <stdlib.h>
struct naive_priv
{
float values[4][4];
};
#define PRIV(x) ((struct naive_priv *) ((x)->priv))
//PRIV(x)的定義為
static void assign(Matrix *thiz, Mat4x4 data)//這個Mat4x4是在marco定義的
/* FIXME: don't hardcode row & col */
thiz->row = thiz->col = 4;
thiz->priv = malloc(4 * 4 * sizeof(float));
for (int i = 0; i < 4; i++)
for (int j = 0; j < 4; j++)
PRIV(thiz)->values[i][j] = data.values[i][j];
}
```
#### 用ctags這套件去trace code發現 "priv"是[matrix.h] 裡Matrix的一個memeber
```clike=
typedef struct {
int row, col;
void *priv;
} Matrix;
```
從此得知 priv這只是一個 void 的pointer可以轉形成任何資料型態
```clike=
#define PRIV(x) ((struct naive_priv *) (("x")->priv))
```
這邊作法就轉型
example 原本的malloc
#### ((struct naive_priv *)
把Matrix "x"中priv的member轉形成(typecast)一個指向naive_priv的pointer 好讓data.values[i][j]資料存入"naive_priv"這個structure裡面
問一下男人會發現:
malloc() return type
RETURN VALUE:
The malloc() and calloc() functions return a pointer to the allocated memory, which is suitably aligned for any built-in type.
```clike=
{
/* FIXME: don't hardcode row & col */
thiz->row = thiz->col = 4;
thiz->priv = malloc(4 * 4 * sizeof(float));
for (int i = 0; i < 4; i++)
for (int j = 0; j < 4; j++)
PRIV(thiz)->values[i][j] = data.values[i][j];
}
```
### Memory Pool
* 改善Page Fault
* 恐龍本9-1
## Reference
[stanleytazi](https://hackmd.io/s/Byc9LNyTe)
[jack81306](https://hackmd.io/s/S1_Zi37ae)
[xdennisx](https://hackmd.io/CYJgpgjGIIYKwFoQE4IHYEBYBGxkIA4YBmCBZOGKAgNmADMZ6wg=)
[jack81306](https://hackmd.io/s/S1_Zi37ae)
[Valgrind &Memoey Leak](http://blog.yslin.tw/2014/03/c-valgrind.html)
https://msdn.microsoft.com/en-us/library/09dwwt6y(VS.80).aspx