# 2017q1 matrix
contributed by <`twzjwang`>, <`rayleigh0407`>
GitHub: [matrix_oo](https://github.com/twzjwang/matrix_oo)
Task: following the requirements of [B10: matrix](https://hackmd.io/s/rkrrpHm2e), fully implement several accelerated matrix multiplication algorithms, together with the corresponding SIMD improvements.
# Matrix Multiplication
- Definition
    - $C = AB$ for an $n × m$ matrix $A$ and an $m × p$ matrix $B$; then $C$ is an $n × p$ matrix with entries
- $c_{ij} = \sum_{k = 1}^m {a_{ik}b_{kj}}$
- ![](https://i.imgur.com/pIL5IbO.png)
- [Matrix multiplication](https://commons.wikimedia.org/wiki/File:Matrix_multiplication_diagram.svg)
- History of matrix multiplication algorithms
- ![](https://i.imgur.com/QSM7Ruh.png)
# Algorithm
## Iterative algorithm
- time complexity : $O(n^3)$
- Method (the pseudocode written as runnable C):
```C=
/* input: A (n x m), B (m x p); output: C = A x B (n x p) */
void matmul(int n, int m, int p, int A[n][m], int B[m][p], int C[n][p])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++) {
            int sum = 0;
            for (int k = 0; k < m; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}
```
- Problem:
    - the order in which the matrices are stored (considerable impact on performance due to cache use); see the loop-order sketch below
        - row-major order
        - column-major order
        - mix of both
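For row-major storage (C's native layout), one common remedy is to reorder the loops so the innermost loop walks both $B$ and $C$ along rows. A minimal sketch, assuming plain C arrays rather than the repo's matrix type:
```C=
/* ikj loop order: the innermost loop scans B[k][*] and C[i][*]
 * sequentially, matching C's row-major layout, so each cache line is
 * fully used instead of being touched once per column as in ijk. */
void matmul_ikj(int n, int m, int p, int A[n][m], int B[m][p], int C[n][p])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++)
            C[i][j] = 0;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < m; k++) {
            int a = A[i][k];            /* reused across the whole row */
            for (int j = 0; j < p; j++)
                C[i][j] += a * B[k][j];
        }
}
```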
## Divide and conquer algorithm
- time complexity : $O(n^3)$
- relies on block partitioning
- ![](https://i.imgur.com/bFmG9AD.png)
(figure from [Introduction to Parallel Computing](https://computing.llnl.gov/tutorials/parallel_comp/#distributions))
### square matrices
- the shapes are $2^n × 2^n$ for some $n$
- ![](https://i.imgur.com/2a35f2W.png)
- ![](https://i.imgur.com/z9fXF8Q.png)
### non-square matrices
- divide each matrix into two parts of equal size, or as close to equal as possible when a dimension is odd
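In either case, one recursion step computes a $2 \times 2$ block product (restating the figures above):
$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \qquad C_{ij} = A_{i1}B_{1j} + A_{i2}B_{2j}$$
so each level performs eight half-size products, which still totals $O(n^3)$ work: $T(n) = 8\,T(n/2) + O(n^2) = O(n^3)$.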
## Sub-cubic algorithms
- algorithms exist that provide better running times than the straightforward ones
### [Strassen Algorithm](https://en.wikipedia.org/wiki/Strassen_algorithm)
- time complexity : about $O(n^{2.8074})$
    - let $N = 2^n$, i.e. $n = \log_2 N$
    - iterative algorithm : about $2N^3$ operations
    - Strassen algorithm : each level replaces one size-$N$ product with 7 size-$N/2$ products plus $O(N^2)$ additions,
    $T(N) = 7\,T(N/2) + O(N^2)$
    $\Rightarrow T(N) = O(7^{\log_2 N}) = O(N^{\log_2 7}) = O(N^{2.8074})$
- To multiply two matrices $C = AB$:
    - if $A$, $B$ are not of size $2^n × 2^n$, fill the missing rows and columns with zeros
    - partition $A$, $B$ and $C$ into equally sized block matrices, with $A_{i,j}, B_{i,j}, C_{i,j} \in R^{2^{n-1} \times 2^{n-1}}$, $i,j=1,2$
- ![](https://i.imgur.com/1u5cxlH.png)
with ![](https://i.imgur.com/JCfGitk.png)
- Goal
- $C_{11} = A_{11}B_{11}+A_{12}B_{21}$
- $C_{12} = A_{11}B_{12}+A_{12}B_{22}$
- $C_{21} = A_{21}B_{11}+A_{22}B_{21}$
- $C_{22} = A_{21}B_{12}+A_{22}B_{22}$
- define new matrices
- $M_{1} = (A_{11}+A_{22})(B_{11}+B_{22})$
- $M_{2} = (A_{21}+A_{22})B_{11}$
- $M_{3} = A_{11}(B_{12}-B_{22})$
- $M_{4} = A_{22}(B_{21}-B_{11})$
- $M_{5} = (A_{11}+A_{12})B_{22}$
- $M_{6} = (A_{21}-A_{11})(B_{11}+B_{12})$
- $M_{7} = (A_{12}-A_{22})(B_{21}+B_{22})$
- only 7 multiplications (one per $M_k$) are used instead of 8; the blocks of $C$ are then
- $C_{11} = M_{1}+M_{4}-M_{5}+M_{7}$
- $C_{12} = M_{3}+M_{5}$
- $C_{21} = M_{2}+M_{4}$
- $C_{22} = M_{1}-M_{2}+M_{3}+M_{6}$
- Analysis
    - Matrix product: for two $n \times n$ matrices, the product has $n^2$ entries; each entry takes $n$ multiplications and $n - 1$ additions, so the total step count is $$n^2 \left[ n + (n - 1) \right] = 2n^3 - n^2$$ i.e. $n^3$ multiplications and $n^3 - n^2$ additions.
    - Assume $n = 2m$, so the naive method needs $16m^3 - 4m^2$ steps. Strassen first computes $M_i\ (i = 1 \sim 7)$ with the sub-matrix products done directly, costing **7 ($m \times m$) matrix multiplications + 10 ($m \times m$) matrix additions/subtractions**; assembling $C$ from the $M_i$ costs another **8 ($m \times m$) matrix additions/subtractions** (3 + 1 + 1 + 3 for the four blocks), for a total of $$(2m^3 - m^2) \times 7 + m^2 \times 10 + m^2 \times 8 = 14m^3 + 11m^2$$ i.e. $7m^3$ multiplications and $7m^3 + 11m^2$ additions/subtractions.
    - Comparing total step counts:

| Naive | Strassen algo |
| :----- | :---------- |
| $16m^3 - 4m^2$ | $14m^3 + 11m^2$ |

    - Once $m \geq 8$, the naive step count exceeds Strassen's; for $m \gg 1$, Strassen needs about $7/8$ of the naive time.
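One Strassen step can be sanity-checked by writing it out for scalars; a minimal sketch (illustrative only, not the repo's implementation; in the real algorithm every entry below is itself an $m \times m$ sub-matrix):
```C=
#include <stdio.h>

/* One level of Strassen on a 2x2 matrix: 7 multiplications instead of 8. */
void strassen_2x2(const int A[2][2], const int B[2][2], int C[2][2])
{
    int M1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
    int M2 = (A[1][0] + A[1][1]) * B[0][0];
    int M3 = A[0][0] * (B[0][1] - B[1][1]);
    int M4 = A[1][1] * (B[1][0] - B[0][0]);
    int M5 = (A[0][0] + A[0][1]) * B[1][1];
    int M6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
    int M7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);

    C[0][0] = M1 + M4 - M5 + M7;
    C[0][1] = M3 + M5;
    C[1][0] = M2 + M4;
    C[1][1] = M1 - M2 + M3 + M6;
}

int main(void)
{
    const int A[2][2] = {{1, 2}, {3, 4}}, B[2][2] = {{5, 6}, {7, 8}};
    int C[2][2];
    strassen_2x2(A, B, C);
    printf("%d %d\n%d %d\n", C[0][0], C[0][1], C[1][0], C[1][1]);
    return 0;   /* prints 19 22 / 43 50, matching the naive product */
}
```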
### [Coppersmith–Winograd algorithm](https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm)
- time complexity : $O(n^{2.375477})$ ~ $O(n^{2.3728639})$
- the asymptotically fastest matrix multiplication algorithm known until 2010
- improves on the naive $O(n^{3})$ and on Strassen's $O(n^{2.807355})$
- rarely used in practice, because the large constant factor in its running time makes it impractical
- the exponent might be improved further, but the running time must remain at least $\Omega(n^{2})$
    - reading an $n \times n$ matrix already takes $O(n^{2})$, and every entry must be read at least once
- In 2010, Andrew Stothers gave an improvement to the algorithm, $O(n^{2.374})$
- In 2011, Virginia Williams combined a mathematical short-cut from Stothers' paper with her own insights and automated optimization on computers, improving the bound to $O(n^{2.372864})$
- In 2014, François Le Gall simplified the methods of Williams and obtained an improved bound of $O(n^{2.3728639})$
- the Coppersmith–Winograd algorithm is frequently used as a building block in other algorithms to prove theoretical time bounds
- however, unlike the Strassen algorithm, it is not used in practice, because it only provides an advantage for matrices so large that they cannot be processed by modern hardware
### [Freivalds' algorithm](https://en.wikipedia.org/wiki/Freivalds%27_algorithm)
- time complexity : $O(n^2)$
- verifies matrix multiplication (correctness)
> To check whether $C = AB$ holds, the straightforward idea is to compute $D = AB$ first and then test $C = D$,
> which takes $O(n^3)$ ~ $O(n^{2.3728639})$ time.
> Freivalds' algorithm speeds the check up to $O(n^2)$.
- a simple randomized algorithm, faster than any known deterministic solution
- Method (see the C sketch at the end of this subsection):
    - input: $n \times n$ matrices $A, B, C$
    - generate an $n \times 1$ random 0/1 vector $\vec{r}$
    - compute $\vec{P} = A \times (B\vec{r}) - C\vec{r}$
    - output "Yes" if $\vec{P} = (0,0,...,0)^T$; "No" otherwise
- Error
- If $A \times B = C$, then the algorithm always returns "Yes". If $A \times B \neq C$, then the probability that the algorithm returns "Yes" is less than or equal to one half. This is called one-sided error.
- Running time:
- $ABr = A(Br)$, so we have 3 instances of an $n \times n$ matrix times an n-vector
> $B \times \vec{r}$: $B$ ($n\times n$) times $\vec{r}$ ($n\times 1$)
> $\Rightarrow O(n\times n \times 1)$
>
> $A\times (B\vec{r})$: $A$ ($n\times n$) times $B\vec{r}$ ($n\times 1$)
> $\Rightarrow O(n\times n \times 1)$
- These are $O(n^2)$ time operations if done straightforwardly
- Total running time $O(n^2)$
- Error analysis
    - let $p$ be the probability of error
- Case $A \times B = C$: $p = 0$
- ${\begin{aligned}\vec{P} &= A \times (B\vec{r}) - C\vec{r} \\&= (A \times B)\vec{r} - C\vec{r} \\&= (A \times B - C)\vec{r} \\&= \vec{0}\end{aligned}}$
- this holds regardless of the value of $\vec{r}$, since it uses only that $A \times B - C = 0$; hence $Pr[\vec{P}\neq0]=0$
- Case $A \times B \neq C$: $p \leq 1/2$
    - let $\vec{P} = D \times \vec{r} = (p_1,p_2,...,p_n)^T$, where $D = A \times B - C = (d_{ij})$
    - since $A \times B \neq C$, some element of $D$ is nonzero
    - suppose $d_{ij} \neq 0$
- by the [definition](https://hackmd.io/s/BkTL8nWRl#矩陣乘法-matrix-multiplication) of matrix multiplication, we have
- $p_i = \sum_{k = 1}^n d_{ik}r_k = d_{i1}r_1 + \cdots + d_{ij}r_j + \cdots + d_{in}r_n = d_{ij}r_j + y.$
- for some constant $y$; using [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem), we can partition over $y$:
- $\Pr[p_i = 0] = \Pr[p_i = 0 \mid y = 0]\cdot \Pr[y = 0] + \Pr[p_i = 0 \mid y \neq 0] \cdot \Pr[y \neq 0] \quad (1)$
- We use that:
- $\Pr[p_i = 0 | y = 0] = \Pr[r_j = 0] = \frac{1}{2}$
$\Pr[p_i = 0 | y \neq 0] = \Pr[r_j = 1 \land d_{ij}=-y] \leq \Pr[r_j = 1] = \frac{1}{2}$
- plugging these into equation (1), we get:
- ${\begin{aligned}\Pr[p_{i}=0]&\leq {\frac {1}{2}}\cdot \Pr[y=0]+{\frac {1}{2}}\cdot \Pr[y\neq 0]\\&={\frac {1}{2}}\cdot \Pr[y=0]+{\frac {1}{2}}\cdot (1-\Pr[y=0])\\&={\frac {1}{2}}\end{aligned}}$
- Therefore,
${\Pr[{\vec {P}}=0]=\Pr[p_{1}=0\land \dots \land p_{i}=0\land \dots \land p_{n}=0]\leq \Pr[p_{i}=0]\leq {\frac {1}{2}}.}$
This completes the proof.
> e.g.:
> case $C = A \times B$:
> $\mathbf{A} = \begin{bmatrix}0 & 1\\2 & 3\end{bmatrix}$ , $B = \begin{bmatrix}1 & 0\\0 & 0\end{bmatrix}$ , $C = \begin{bmatrix}0 & 0\\2 & 0\end{bmatrix}$
> the randomly generated $\vec{r}$ can be $\begin{bmatrix}0 \\ 0\end{bmatrix}$, $\begin{bmatrix}0 \\ 1\end{bmatrix}$, $\begin{bmatrix}1 \\ 0\end{bmatrix}$, or $\begin{bmatrix}1 \\ 1\end{bmatrix}$
> computing $\vec{P} = A \times (B\vec{r}) - C\vec{r}$ yields $\vec{P} = \begin{bmatrix}0 \\ 0\end{bmatrix}$ in every case
> => the error probability is 0
>
>
> case $C \neq A \times B$:
> $\mathbf{A} = \begin{bmatrix}0 & 1\\2 & 3\end{bmatrix}$ , $B = \begin{bmatrix}1 & 0\\0 & 0\end{bmatrix}$ , $C = \begin{bmatrix}0 & 0\\0 & 0\end{bmatrix}$
> again $\vec{r}$ can be any of the four 0/1 vectors above
> computing $\vec{P} = A \times (B\vec{r}) - C\vec{r}$:
> for $\vec{r} = \begin{bmatrix}0 \\ 0\end{bmatrix}$ and $\vec{r} = \begin{bmatrix}0 \\ 1\end{bmatrix}$ we get $\vec{P} = \begin{bmatrix}0 \\ 0\end{bmatrix}$
> for $\vec{r} = \begin{bmatrix}1 \\ 0\end{bmatrix}$ and $\vec{r} = \begin{bmatrix}1 \\ 1\end{bmatrix}$ we get $\vec{P} \neq \begin{bmatrix}0 \\ 0\end{bmatrix}$
> => the error probability is 0.5
> (half the time $C \neq A \times B$, yet the check still produces $\vec{P} = \begin{bmatrix}0 \\ 0\end{bmatrix}$)
- Ramifications
    - simple algorithmic analysis shows that this algorithm runs in $O(n^2)$ time, beating the classical deterministic bound of $O(n^3)$
    - the error analysis also shows that running the algorithm $k$ times achieves an error bound of less than $\frac{1}{2^{k}}$, an exponentially small quantity
    - randomized algorithms can thus speed up a very slow deterministic algorithm
    - Freivalds' algorithm frequently appears in introductions to probabilistic algorithms because of its simplicity and because it illustrates how probabilistic algorithms can outperform deterministic ones in practice on some problems
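As a reference, a self-contained sketch of the check described above, assuming small square `int` matrices rather than the repo's `matrix_providers` interface (all names here are illustrative):
```C=
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2

/* One round of Freivalds' check: returns 1 for "probably A*B == C",
 * 0 for "A*B != C".  Repeating k rounds with a fresh r each time
 * pushes the one-sided error below 1/2^k. */
int freivalds(int A[N][N], int B[N][N], int C[N][N])
{
    int r[N], Br[N], ABr[N], Cr[N];

    for (int i = 0; i < N; i++)
        r[i] = rand() & 1;            /* random 0/1 vector */

    for (int i = 0; i < N; i++) {     /* Br = B*r and Cr = C*r : O(n^2) */
        Br[i] = Cr[i] = 0;
        for (int j = 0; j < N; j++) {
            Br[i] += B[i][j] * r[j];
            Cr[i] += C[i][j] * r[j];
        }
    }
    for (int i = 0; i < N; i++) {     /* ABr = A*(B*r) : O(n^2) */
        ABr[i] = 0;
        for (int j = 0; j < N; j++)
            ABr[i] += A[i][j] * Br[j];
    }
    for (int i = 0; i < N; i++)       /* P = ABr - Cr must be all zero */
        if (ABr[i] != Cr[i])
            return 0;
    return 1;
}

int main(void)
{
    /* the C != A*B example above: each round misses with probability
     * 1/2, so repeat the check ten times */
    int A[N][N] = {{0, 1}, {2, 3}};
    int B[N][N] = {{1, 0}, {0, 0}};
    int C[N][N] = {{0, 0}, {0, 0}};
    int ok = 1;

    srand(time(NULL));
    for (int k = 0; k < 10 && ok; k++)
        ok = freivalds(A, B, C);
    printf("%s\n", ok ? "probably equal" : "not equal");
    return 0;
}
```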
## Parallel and distributed algorithms
- see [Matrix multiplication algorithm § Parallel and distributed algorithms](https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Parallel_and_distributed_algorithms)
# Experiments
- Implement each of the algorithms above, improve them with SIMD, and verify the results with Freivalds' algorithm.
## 0. Freivalds' algorithm
- Verification:
```C=
algo = matrix_providers[0];
if (!algo->mul(&L_Br, &B, &r)) {
printf("B x r error!\n");
continue;
}
if (!algo->mul(&L_ABr, &A, &L_Br)) {
printf("A x Br error!\n");
continue;
}
if (!algo->mul(&R_Cr, &C, &r)) {
printf("C x r error!\n");
continue;
}
if (!algo->equal(&L_ABr, &R_Cr)) {
printf("ABr and Cr are not equal!\n");
break;
}
```
- Improvement
    - the error bound is below $\frac{1}{2^{k}}$
    - originally $k = 1$, so the error bound is $< 1/2^{1}$ (< 50%)
    - raising $k$ to 10 lowers the error bound to $< 1/2^{10}$ (< 0.1%)
```C=
int flag = 1;
algo = matrix_providers[0];
for (int i = 0; i < 10 && flag; i++) {
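        /* NOTE: for the 1/2^k bound to hold, a fresh random vector r must
         * be generated at the top of each iteration; reusing the same r
         * ten times does not reduce the error probability. */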
if (!algo->mul(&L_Br, &B, &r)) {
printf("B x r error!\n");
flag = 0;
continue;
}
if (!algo->mul(&L_ABr, &A, &L_Br)) {
printf("A x Br error!\n");
flag = 0;
continue;
}
if (!algo->mul(&R_Cr, &C, &r)) {
printf("C x r error!\n");
flag = 0;
continue;
}
if (!algo->equal(&L_ABr, &R_Cr)) {
printf("ABr and Cr are not equal!\n");
flag = 0;
break;
}
}
```
## 1. Naive
- definition : $c_{ij} = \sum_{k = 1}^m {a_{ik}b_{kj}}$
- implemented with three nested loops
```C=
for (int i = 0; i < l->row; i++)
for (int j = 0; j < r->col; j++) {
PRIV(dst)->values[i][j] = 0;
for (int k = 0; k < l->col; k++)
PRIV(dst)->values[i][j] += PRIV(l)->values[i][k] *
PRIV(r)->values[k][j];
}
```
- Experimental results
```shell
algo : Naive
matrix : 8 x 8
exe time : 0.000001 ms
matrix : 16 x 16
exe time : 0.000004 ms
matrix : 32 x 32
exe time : 0.000029 ms
matrix : 64 x 64
exe time : 0.000237 ms
matrix : 128 x 128
exe time : 0.001771 ms
matrix : 256 x 256
exe time : 0.014509 ms
matrix : 512 x 512
exe time : 0.144029 ms
matrix : 1024 x 1024
exe time : 1.456696 ms
```
### SSE
- like the naive version, but processes 4 ints (128 bits) at a time
- ![](https://i.imgur.com/0XJ6usb.gif)
- [SSE version](https://hackmd.io/s/Hk-llHEyx#sse-version)
- Intrinsics
    - `__m128i _mm_setzero_si128()`
        - sets all 128 bits to 0
    - `__m128i _mm_load_si128(__m128i const*p)`
        - loads 128 bits of contiguous memory from p (must be 16-byte aligned)
    - `__m128i _mm_set_epi32(int i3, int i2, int i1, int i0)`
        - packs four 32-bit ints into a 128-bit vector
    - `__m128i _mm_mullo_epi32( __m128i a, __m128i b)`
        - multiplies the four 32-bit ints in a and b pairwise, keeping the low 32 bits of each 64-bit product
    - `__m128i _mm_unpacklo_epi32(__m128i a, __m128i b)`
        - interleaves the low two 32-bit ints (bits 0-63) of a and b
        - e.g.
            - a : i0 i1 i2 i3
            b : j0 j1 j2 j3
            => i0 j0 i1 j1
    - `__m128i _mm_unpackhi_epi32(__m128i a, __m128i b)`
        - interleaves the high two 32-bit ints (bits 64-127) of a and b
    - `__m128i _mm_unpacklo_epi64(__m128i a, __m128i b)`
        - places the low 64-bit half (bits 0-63) of a and of b side by side
    - `__m128i _mm_unpackhi_epi64(__m128i a, __m128i b)`
        - places the high 64-bit half (bits 64-127) of a and of b side by side
    - `__m128i _mm_add_epi32(__m128i a, __m128i b)`
        - adds the four 32-bit ints in a and b pairwise (results wrap modulo $2^{32}$)
    - `void _mm_store_si128(__m128i *p, __m128i b)`
        - stores the 128 bits of b to p (must be 16-byte aligned)
- Experimental results
```shell
algo : SSE
matrix : 8 x 8
exe time : 0.000000 ms
matrix : 16 x 16
exe time : 0.000002 ms
matrix : 32 x 32
exe time : 0.000009 ms
matrix : 64 x 64
exe time : 0.000061 ms
matrix : 128 x 128
exe time : 0.000471 ms
matrix : 256 x 256
exe time : 0.003963 ms
matrix : 512 x 512
exe time : 0.035724 ms
matrix : 1024 x 1024
exe time : 0.327177 ms
```
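Putting these intrinsics together, the core of the SSE approach is a broadcast-multiply-accumulate over the rows of B. A minimal 4x4 sketch, assuming row-major `int` matrices; it uses the unaligned load/store variants and `_mm_set1_epi32` (a one-int broadcast, not listed above) for simplicity, while the repo version can use the aligned forms on 16-byte-aligned buffers:
```C=
#include <smmintrin.h> /* SSE4.1, provides _mm_mullo_epi32 */

/* C = A * B for 4x4 int matrices: for each output row i, accumulate
 * broadcast(A[i][k]) * B[k][*] over k, four ints per SSE operation. */
void sse_mul_4x4(int A[4][4], int B[4][4], int C[4][4])
{
    for (int i = 0; i < 4; i++) {
        __m128i acc = _mm_setzero_si128();
        for (int k = 0; k < 4; k++) {
            __m128i a = _mm_set1_epi32(A[i][k]);          /* broadcast a_ik */
            __m128i b = _mm_loadu_si128((__m128i *)B[k]); /* row k of B */
            acc = _mm_add_epi32(acc, _mm_mullo_epi32(a, b));
        }
        _mm_storeu_si128((__m128i *)C[i], acc);           /* row i of C */
    }
}
```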
### AVX
- like the naive version, but processes 8 ints (256 bits) at a time
- Intrinsics
    - `__m256i _mm256_setzero_si256(void)`
        - returns a vector with all 256 bits set to 0
    - `__m256i _mm256_load_si256(__m256i const *a)`
        - loads 256 bits of contiguous memory from a (must be 32-byte aligned)
    - `__m256i _mm256_set1_epi32(int)`
        - broadcasts one 32-bit int to all eight lanes of a 256-bit vector
    - `__m256i _mm256_mullo_epi32(__m256i s1, __m256i s2)`
        - multiplies the eight 32-bit ints in s1 and s2 pairwise, keeping the low 32 bits of each 64-bit product
    - `__m256i _mm256_add_epi32(__m256i a, __m256i b)`
        - adds the eight 32-bit ints in a and b pairwise
    - `void _mm256_store_si256(__m256i *a, __m256i b)`
        - stores the 256 bits of b to a (must be 32-byte aligned)
- Experimental results
```shell
algo : AVX
matrix : 8 x 8
exe time : 0.000001 ms
matrix : 16 x 16
exe time : 0.000002 ms
matrix : 32 x 32
exe time : 0.000006 ms
matrix : 64 x 64
exe time : 0.000026 ms
matrix : 128 x 128
exe time : 0.000175 ms
matrix : 256 x 256
exe time : 0.001366 ms
matrix : 512 x 512
exe time : 0.011233 ms
matrix : 1024 x 1024
exe time : 0.103360 ms
```
#### AVX loop unrolling
- Unroll the following loop:
```C=
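/* Assumed context (not shown in this excerpt): ymm0..ymm7 hold eight
 * rows of the current 8x8 tile of the right-hand matrix, and
 * ymm16..ymm23 accumulate the eight corresponding rows of the result
 * tile; x and k index the tile's position. */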
for (int i = 0; i < 8; i++) {
// broadcast each elements from source 1
ymm8 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 0)]);
ymm9 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 1)]);
ymm10 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 2)]);
ymm11 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 3)]);
ymm12 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 4)]);
ymm13 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 5)]);
ymm14 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 6)]);
ymm15 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 7)]);
// multiply
ymm8 = _mm256_mullo_epi32(ymm8, ymm0); // row 1, 2
ymm9 = _mm256_mullo_epi32(ymm9, ymm1);
ymm8 = _mm256_add_epi32(ymm8, ymm9);
ymm10 = _mm256_mullo_epi32(ymm10, ymm2); // row 3, 4
ymm11 = _mm256_mullo_epi32(ymm11, ymm3);
ymm10 = _mm256_add_epi32(ymm10, ymm11);
ymm12 = _mm256_mullo_epi32(ymm12, ymm4); // row 5, 6
ymm13 = _mm256_mullo_epi32(ymm13, ymm5);
ymm12 = _mm256_add_epi32(ymm12, ymm13);
ymm14 = _mm256_mullo_epi32(ymm14, ymm6); // row 7, 8
ymm15 = _mm256_mullo_epi32(ymm15, ymm7);
ymm14 = _mm256_add_epi32(ymm14, ymm15);
ymm8 = _mm256_add_epi32(ymm8, ymm10); // sum
ymm12 = _mm256_add_epi32(ymm12, ymm14);
ymm8 = _mm256_add_epi32(ymm8, ymm12);
switch(i) {
case 0:
ymm16 = _mm256_add_epi32(ymm16, ymm8);
break;
case 1:
ymm17 = _mm256_add_epi32(ymm17, ymm8);
break;
case 2:
ymm18 = _mm256_add_epi32(ymm18, ymm8);
break;
case 3:
ymm19 = _mm256_add_epi32(ymm19, ymm8);
break;
case 4:
ymm20 = _mm256_add_epi32(ymm20, ymm8);
break;
case 5:
ymm21 = _mm256_add_epi32(ymm21, ymm8);
break;
case 6:
ymm22 = _mm256_add_epi32(ymm22, ymm8);
break;
case 7:
ymm23 = _mm256_add_epi32(ymm23, ymm8);
break;
}
}
```
- Experimental results
- before
```shell
algo : AVX
matrix : 8 x 8
exe time : 0.000001 ms
matrix : 16 x 16
exe time : 0.000002 ms
matrix : 32 x 32
exe time : 0.000006 ms
matrix : 64 x 64
exe time : 0.000026 ms
matrix : 128 x 128
exe time : 0.000177 ms
matrix : 256 x 256
exe time : 0.001359 ms
matrix : 512 x 512
exe time : 0.011220 ms
matrix : 1024 x 1024
exe time : 0.101038 ms
```
- after
```shell
algo : AVX
matrix : 8 x 8
exe time : 0.000001 ms
matrix : 16 x 16
exe time : 0.000002 ms
matrix : 32 x 32
exe time : 0.000006 ms
matrix : 64 x 64
exe time : 0.000026 ms
matrix : 128 x 128
exe time : 0.000165 ms
matrix : 256 x 256
exe time : 0.001294 ms
matrix : 512 x 512
exe time : 0.010437 ms
matrix : 1024 x 1024
exe time : 0.096296 ms
```
- Comparison
    - $Speedup = T_{before} / T_{after}$
    - matrices larger than $128 \times 128$ speed up by roughly 5-7%
n|8|16|32|64|128|256|512|1024
-|-|-|-|-|-|-|-|-
before(ms)|0.000001|0.000002|0.000006|0.000026|0.000177|0.001359|0.011220|0.101038
after(ms)|0.000001|0.000002|0.000006|0.000026|0.000165|0.001294|0.010437|0.096296
speedup|1|1|1|1|1.07|1.05|1.07|1.05
### Naive / SSE / AVX comparison
n|8|16|32|64|128|256|512|1024
-|-|-|-|-|-|-|-|-
naive(ms)|0.000001|0.000004|0.000029|0.000237|0.001771|0.014509|0.144029|1.456696
sse(ms)|0.000000|0.000002|0.000009|0.000061|0.000471|0.003963|0.035724|0.327177
avx(ms)|0.000001|0.000002|0.000005|0.000025|0.000168|0.001288|0.010347|0.095896
> updated with AVX loop unrolling
## 2. Strassen Algorithm
- Added `seperate()`, `combine()`, `add_sub_part()`, and `mul_part()` helper functions to simplify the implementation.
- If a matrix has an odd number of rows (or columns), pad it with one extra row (or column) initialized to zero.
- Experimental results
```shell
algo : Naive
matrix : 8 x 8
exe time : 0.000001 ms
matrix : 16 x 16
exe time : 0.000004 ms
matrix : 32 x 32
exe time : 0.000032 ms
matrix : 64 x 64
exe time : 0.000220 ms
matrix : 128 x 128
exe time : 0.001543 ms
matrix : 256 x 256
exe time : 0.012644 ms
matrix : 512 x 512
exe time : 0.137741 ms
matrix : 1024 x 1024
exe time : 1.216822 ms
algo : Strassen
matrix : 8 x 8
exe time : 0.000004 ms
matrix : 16 x 16
exe time : 0.000010 ms
matrix : 32 x 32
exe time : 0.000038 ms
matrix : 64 x 64
exe time : 0.000244 ms
matrix : 128 x 128
exe time : 0.001432 ms
matrix : 256 x 256
exe time : 0.009937 ms
matrix : 512 x 512
exe time : 0.083023 ms
matrix : 1024 x 1024
exe time : 0.978982 ms
```
n|8|16|32|64|128|256|512|1024
-|-|-|-|-|-|-|-|-
naive(ms)|0.000001|0.000004|0.000032|0.000220|0.001543|0.012644|0.137741|1.216822
Strassen(ms)|0.000004|0.000010|0.000038|0.000244|0.001432|0.009937|0.083023|0.978982
Beyond roughly N > 100, the Strassen version becomes faster than the naive one.
:::warning
The original Strassen algorithm is recursive: the sub-matrix products call the algorithm itself, which is what yields the $O(n^{2.81})$ time complexity. In this implementation, however, the recursive version's execution time exploded, so only one level of partitioning is performed for now; the current suspicion is that splitting and recombining the sub-matrices costs too much time.
:::
- [Embedding SSE in the Strassen algorithm]()
Because of the flaw above, the sub-matrices are still multiplied with the plain method, so that part can use the SSE instruction set to do the products with SIMD. Experimental results:
```shell
algo : SSE
matrix : 8 x 8
exe time : 0.000000 ms
matrix : 16 x 16
exe time : 0.000001 ms
matrix : 32 x 32
exe time : 0.000008 ms
matrix : 64 x 64
exe time : 0.000056 ms
matrix : 128 x 128
exe time : 0.000430 ms
matrix : 256 x 256
exe time : 0.003593 ms
matrix : 512 x 512
exe time : 0.032536 ms
matrix : 1024 x 1024
exe time : 0.297687 ms
algo : Strassen SSE
matrix : 8 x 8
exe time : 0.000003 ms
matrix : 16 x 16
exe time : 0.000008 ms
matrix : 32 x 32
exe time : 0.000023 ms
matrix : 64 x 64
exe time : 0.000098 ms
matrix : 128 x 128
exe time : 0.000529 ms
matrix : 256 x 256
exe time : 0.003462 ms
matrix : 512 x 512
exe time : 0.026929 ms
matrix : 1024 x 1024
exe time : 0.240095 ms
```
n|8|16|32|64|128|256|512|1024
-|-|-|-|-|-|-|-|-
Strassen(ms)|0.000004|0.000010|0.000038|0.000244|0.001432|0.009937|0.083023|0.978982
SSE(ms)|0.000000|0.000001|0.000008|0.000056|0.000430|0.003593|0.032536|0.297687
Strassen SSE(ms)|0.000003|0.000008|0.000023|0.000098|0.000529|0.003462|0.026929|0.240095
For matrices larger than 256 x 256, Strassen SSE is slightly faster than plain SSE.
- [Setting a cutoff]()
Because small matrices get slower under Strassen, I tried setting a cutoff: whenever a partitioned sub-matrix is smaller than the threshold, the plain multiplication is used instead. Experimental results (in ms; see the dispatch sketch after the table):
cutoff \ n|8|16|32|64|128|256|512|1024
-|-|-|-|-|-|-|-|-
8|0.000005|0.000011|0.000089|0.000669|0.004888|0.033680|0.236442|1.664202
16|0.000005|0.000011|0.000040|0.000326|0.002450|0.017589|0.123404|0.874777
32|0.000005|0.000011|0.000040|0.000247|0.001883|0.013629|0.095878|0.681866
64|0.000004|0.000011|0.000040|0.000245|0.001461|0.010796|0.075746|0.539311
128|0.000004|0.000011|0.000040|0.000247|0.001471|0.010397|0.072306|0.510256
256|0.000005|0.000011|0.000040|0.000251|0.001485|0.010220|0.082975|0.589007
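Conceptually, the cutoff is just a size test at each partitioning step. A hypothetical sketch (`CUTOFF`, `naive_mul`, `strassen_step`, and the `Matrix` type are illustrative names, not the repo's exact API):
```C=
#define CUTOFF 64 /* 64-128 performs best in the table above */

/* Hypothetical dispatch: fall back to the plain O(n^3) kernel once a
 * sub-matrix is small enough that Strassen's split/combine overhead
 * would dominate the actual multiplication work. */
static void mul_dispatch(const Matrix *l, const Matrix *r, Matrix *dst)
{
    if (l->row <= CUTOFF)
        naive_mul(l, r, dst);      /* plain triple-loop multiply */
    else
        strassen_step(l, r, dst);  /* split, 7 sub-products, combine */
}
```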
# Comparison
![](https://i.imgur.com/FSkFHjy.png)
> updated with AVX loop unrolling
- the larger the matrix, the clearer the performance ordering:
    - avx > strassen_sse > sse > strassen > naive
# Reference
- [MathJax basic tutorial and quick reference](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference)
- [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx)
- [Matrix multiplication algorithm](https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm)
- [Matrix Algorithms](https://www.slideshare.net/AndresMendezVazquez/23-matrix-algorithms)
- [The Coppersmith-Winograd Matrix Multiplication Algorithm](http://bioinfo.ict.ac.cn/~dbu/AlgorithmCourses/Lectures/MAndersonSBarman2009.pdf)
- [On the Coppersmith–Winograd method](http://www.cs.toronto.edu/~yuvalf/Limitations.pdf)
- [Matrix Multiplication check](https://www.cs.cmu.edu/~avrim/Randalgs11/lectures/lect0209.pdf)
- [Another Randomized Algorithm](https://courses.cs.washington.edu/courses/cse312/14wi/freivalds.pdf)
- [Multiplying matrices in O(n2.373) time](https://people.csail.mit.edu/virgi/matrixmult-f.pdf)
- [What is Tensor?](https://www.youtube.com/watch?v=f5liqUk0ZTw)
- [Strassen algorithm](https://ccjou.wordpress.com/2013/06/04/%E5%88%86%E6%B2%BB%E7%9F%A9%E9%99%A3%E4%B9%98%E6%B3%95%E2%94%80%E2%94%80strassen-%E6%BC%94%E7%AE%97%E6%B3%95/)