# 2018q3 Matrix Multiplication

Contributed by < [DyslexiaS](https://github.com/DyslexiaS), [siahuat0727](https://github.com/siahuat0727), [pjchiou](https://github.com/pjchiou) >

[GitHub](https://github.com/siahuat0727/matrix-multiplication)

Task: reproduce the experiments of [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx) and, following the guidance of CS:APP Chapter 6, analyze the performance of each implementation.

---

## Outline

- Algorithms
- Alignment issues with the SSE and AVX instruction sets

## Algorithms

- [x] naive
- [x] cache-friendly
- [x] sub-matrix
- [x] strassen
- [x] SIMD-SSE
- [x] SIMD-AVX

---

### Naive algorithm

The most straightforward matrix multiplication.

- Time complexity: $O(n^3)$
- Algorithm

```clike=1
for (int i = 0; i < a.row; ++i)
    for (int j = 0; j < b.col; ++j) {
        int sum = 0;
        for (int k = 0; k < a.col; ++k)
            sum += a.values[i][k] * b.values[k][j];
        dst->values[i][j] += sum;
    }
```

---

### Cache-friendly

In the naive algorithm, `b.values[k][j]` keeps accessing elements that are `b.col` apart in memory, which is unfriendly to the cache. To exploit the cache for speed, keep two principles in mind:

1. Temporal locality: once a variable has been accessed, reuse it.
2. Spatial locality: access variables that are adjacent in memory.

- Time complexity: $O(n^3)$
- Algorithm

```clike=1
for (int i = 0; i < a.row; ++i)
    for (int k = 0; k < a.col; ++k) {
        int r = a.values[i][k];
        for (int j = 0; j < b.col; ++j)
            dst->values[i][j] += r * b.values[k][j];
    }
```

This way `a.values[i][k]` is reused, and every matrix is traversed with stride 1.

---

### Sub-matrix

A divide-and-conquer idea: split the matrices into smaller blocks, multiply the blocks, and accumulate the results.

![](https://i.imgur.com/VlBLSXb.png)

In this illustration, `A` and `B` are both square, and each sub-matrix has half the rows and columns of the original. That does not match what our naive version supports, so we adjust slightly:

1. `A` and `B` need not be square; they only have to satisfy the multiplication constraint `A.col == B.row`.
2. The sub-matrix does not have to be N/2 on a side, but it must be square.
3. The sub-matrix size does not have to divide C.row or C.col evenly.

- Time complexity: $O(n^3)$
- Algorithm
- Reference: [divide and conquer](https://www.geeksforgeeks.org/strassens-matrix-multiplication/)

```clike=1
// Calculate the sub-matrix at (c_row, c_col); the block size is stride.
void matmul_stride(const Matrix l, const Matrix r, const Matrix dst,
                   int c_row, int c_col, int stride)
{
    for (int k = 0; k < l.col; k += stride)
        for (int i = 0; i < stride && i + c_row < l.row; i++)
            for (int j = 0; j < stride && j + c_col < r.col; j++)
                for (int m = k; m < k + stride && m < l.col; m++)
                    dst.values[i + c_row][j + c_col] +=
                        l.values[i + c_row][m] * r.values[m][j + c_col];
}

void sub_matmul(const Matrix l, const Matrix r, const Matrix dst, void *ctx)
{
    CHECK(l, r, dst);
    assert(ctx != NULL);
    int stride = ((SubMatrixInfo *) ctx)->stride;
    for (int i = 0; i < l.row; i += stride)
        for (int j = 0; j < r.col; j += stride)
            matmul_stride(l, r, dst, i, j, stride);
}
```

---

### Strassen

How many multiplications and additions does a matrix product need? Since an $n \times n$ matrix $C$ has $n^2$ entries and each entry takes $n$ multiplications and $(n-1)$ additions, an $n\times n$ product uses $n^3$ multiplications and $n^3-n^2$ additions. The belief that this is the best possible held until a faster divide-and-conquer multiplication, the Strassen algorithm, was proposed.

- Consider a $2\times 2$ block matrix product

![](https://i.imgur.com/vN6s5iN.png)

where every block is square. The conventional method, $C_{ij}=A_{i1}B_{1j}+A_{i2}B_{2j}$, uses 8 block multiplications and 4 block additions. The Strassen algorithm uses 7 block multiplications and 18 block additions, as follows:

\begin{aligned}
P_1&=(A_{11}+A_{22})(B_{11}+B_{22})\\
P_2&=(A_{21}+A_{22})B_{11}\\
P_3&=A_{11}(B_{12}-B_{22})\\
P_4&=A_{22}(B_{21}-B_{11})\\
P_5&=(A_{11}+A_{12})B_{22}\\
P_6&=(A_{21}-A_{11})(B_{11}+B_{12})\\
P_7&=(A_{12}-A_{22})(B_{21}+B_{22})\\
C_{11}&=P_1+P_4-P_5+P_7\\
C_{12}&=P_3+P_5\\
C_{21}&=P_2+P_4\\
C_{22}&=P_1+P_3-P_2+P_6.
\end{aligned}

- Substituting and simplifying verifies the formulas above.

Suppose $n=2m$, so that every block above is $m\times m$.

- Conventional multiplication uses $8m^3$ multiplications and $8m^3-4m^2$ additions.
- Strassen:
  - Apply the Strassen algorithm only at the block level, i.e. each $m\times m$ block product is still computed conventionally.
  - The 7 block multiplications then use $7m^3$ multiplications and $7(m^3-m^2)$ additions; adding the $18m^2$ additions from the 18 block additions gives $7m^3$ multiplications and $7m^3+11m^2$ additions in total.
  - When $m\gg 1$, the Strassen algorithm costs only about $7/8$ of the conventional method, in both multiplications and additions.

Reference: [strassen](https://ccjou.wordpress.com/2013/06/04/分治矩陣乘法──strassen-演算法/)

---

### SIMD-SSE

- SIMD stands for ==Single Instruction Multiple Data==. Taking Intel processors as an example, the MMX SIMD instruction set processes multiple data items with a single instruction, originally to speed up multimedia workloads. Intel later introduced the SSE SIMD instruction set as an extension of MMX.
- AMD processors have their own SIMD instruction sets, but since I am on an Intel processor, I use Intel's `SSE SIMD` instructions as the example.
- SSE provides 8 registers of 128 bits each; one register can hold
  - 4 32-bit floats
  - 2 64-bit doubles
  - 4 32-bit integers
  - 8 16-bit shorts
  - 16 8-bit chars

When multiplying matrices with SIMD instructions, the flow is the same as the sub-matrix version. One thing to note: matrix multiplication is essentially vector inner products. Since we have 8 registers of 128 bits each, they can ==hold exactly 8 vectors of 4 integers==, so the sub-matrix size is $4\times4$, which makes full use of the registers.

We use the following 4 vectors each to hold the sub-matrices of A and B.

![](https://i.imgur.com/EQTRW6x.png)

Our job is to assemble, from the [SIMD intrinsics](https://software.intel.com/sites/landingpage/IntrinsicsGuide/), instructions that achieve what we need. Below we show how to obtain the topmost row; the rest follows analogously.

==Before using a function, check which header file it requires.==

1. Load matrix values into registers. Declare variables of type `__m128i` to represent a 128-bit value holding four 32-bit integers, and load with these two functions:
   - __m128i _mm_load_si128 (__m128i const* mem_addr)
   - __m128i _mm_set_epi32 (int e3, int e2, int e1, int e0)

   Since A's row vectors live in contiguous memory, use the former for A and the latter for B.
2. Multiply V1 element-wise with each of V5~V8 (the inner products are completed by the transpose and additions in the next steps), giving four product vectors `V15`, `V16`, `V17`, `V18`:
   - __m128i _mm_mullo_epi32 (__m128i a, __m128i b)

   ![](https://i.imgur.com/PTJG78o.gif)
3.
We want to transpose the matrix formed by the four vectors V15, V16, V17, V18. These are four independent `__m128i` variables that form a matrix only in our heads, so we use four SIMD instructions to achieve the equivalent effect (elements are indexed from the low end below):
   - __m128i _mm_unpackhi_epi32 (__m128i a, __m128i b)
     ```
     Interleave the upper two 32-bit elements of the two __m128i variables.
     r[0] = a[2]; r[1] = b[2];
     r[2] = a[3]; r[3] = b[3];
     ```
   - __m128i _mm_unpackhi_epi64 (__m128i a, __m128i b)
     ```
     Interleave the upper 64-bit halves of the two __m128i variables.
     r[0] = a[2]; r[1] = a[3];
     r[2] = b[2]; r[3] = b[3];
     ```
   - __m128i _mm_unpacklo_epi32 (__m128i a, __m128i b)
     ```
     Interleave the lower two 32-bit elements of the two __m128i variables.
     r[0] = a[0]; r[1] = b[0];
     r[2] = a[1]; r[3] = b[1];
     ```
   - __m128i _mm_unpacklo_epi64 (__m128i a, __m128i b)
     ```
     Interleave the lower 64-bit halves of the two __m128i variables.
     r[0] = a[0]; r[1] = a[1];
     r[2] = b[0]; r[3] = b[1];
     ```

   Group the four vectors into two pairs and apply the following four calls to obtain four new vectors:
   - S1 = _mm_unpacklo_epi32(V15,V16)
   - S2 = _mm_unpackhi_epi32(V15,V16)
   - S3 = _mm_unpacklo_epi32(V17,V18)
   - S4 = _mm_unpackhi_epi32(V17,V18)

   ![](https://i.imgur.com/rcXrkuc.gif)

   Four more calls then give the result equivalent to the transpose:
   - D1 = _mm_unpacklo_epi64(S1,S3)
   - D2 = _mm_unpackhi_epi64(S1,S3)
   - D3 = _mm_unpacklo_epi64(S2,S4)
   - D4 = _mm_unpackhi_epi64(S2,S4)

   ![](https://i.imgur.com/uV7c4MH.gif)
4. Add the four vectors, then return to step (2) using V2 against V5~V8, and so on; the final result is the matrix product.

The complete code for this part:

```clike=1
void SIMD_matmul(const Matrix l, const Matrix r, const Matrix dst, void *ctx)
{
    for (int i = 0; i < l.row; i += 4)
        for (int j = 0; j < r.col; j += 4)
            SIMD_matmul4(l, r, dst, i, j);
}

void SIMD_matmul4(const Matrix l, const Matrix r, const Matrix dst,
                  int c_row, int c_col)
{
    __m128i I[4], R[4], D[4], T[4], S[4], Sum[4];
    for (int i = 0; i < 4; i++)
        Sum[i] = _mm_setzero_si128();
    for (int k = 0; k < l.col; k += 4) {
        // Read matrix A
        for (int i = 0; i < 4; i++)
            I[i] = _mm_load_si128((__m128i *) (&l.values[i + c_row][k]));
        // Read matrix B
        for (int i = 0; i < 4; i++)
            R[i] = _mm_set_epi32(r.values[k][i + c_col],
                                 r.values[k + 1][i + c_col],
                                 r.values[k + 2][i + c_col],
                                 r.values[k + 3][i + c_col]);
        for (int i = 0; i < 4; i++) {
            // Element-wise products of a vector from A with vectors from B
            for (int j = 0; j < 4; j++)
                T[j] = _mm_mullo_epi32(I[i], R[j]);
            for (int j = 0; j < 2; j++) {
                S[j] = _mm_unpacklo_epi32(T[j * 2], T[j * 2 + 1]);
                S[j + 2] = _mm_unpackhi_epi32(T[j * 2], T[j * 2 + 1]);
            }
            for (int j = 0; j < 2; j++) {
                D[j] = _mm_unpacklo_epi64(S[2 * j], S[2 * j + 1]);
                D[j + 2] = _mm_unpackhi_epi64(S[2 * j], S[2 * j + 1]);
            }
            for (int j = 0; j < 4; j++)
                Sum[i] = _mm_add_epi32(Sum[i], D[j]);
        }
    }
    for (int i = 0; i < 4; i++)
        _mm_store_si128((__m128i *) (&dst.values[c_row + i][c_col]), Sum[i]);
}
```

Reference:
- [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)
- [Wiki-SSE](https://zh.wikipedia.org/wiki/SSE)

### SIMD-AVX

- The AVX (Advanced Vector Extensions) instruction set extends SSE. Unlike SSE, it has 16 registers of 256 bits.
- Since the registers are 256 bits wide, we can process an $8*8$ block at a time.
- The approach here is the same as cache-friendly, ==not sub-matrix==; see the approach in [AVX matrix-multiplication, or something like it](http://www.mindfruit.co.uk/2012/02/avx-matrix-multiplication-or-something.html).
- All instructions are listed in the [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/); we use the following (remember to `#include <immintrin.h>` first).
  - __m256i _mm256_setzero_si256 (void)
    Set all 256 bits to 0.
  - __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
    Load 256 bits of integer data from mem_addr; no alignment required.
  - __m256i _mm256_set1_epi32 (int a)
    Set all 8 integers in the 256 bits to `a`.
  - __m256i _mm256_mullo_epi32 (__m256i a, __m256i b)
    Multiply vectors `a` and `b` element-wise.
  - __m256i _mm256_add_epi32 (__m256i a, __m256i b)
    Add vectors `a` and `b`.
  - void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
    Store `a` to MEM[mem_addr]; no alignment required.

Suppose we want $[C]=[A]*[B]$. Below we show how to compute the top row of $[C]$; the rest of $[C]$ follows analogously.

1. Split $[A]$ and $[B]$ into $8*8$ sub-matrices, then multiply blocks of matching colors as in the figure.
   ![](https://i.imgur.com/4KNSROR.gif)
2. Let $[A]_{xy}$ be the element of $[A]$ at row x, column y, and $[B]_x$ the vector formed by the x-th row of $[B]$. After the products we have 8 vectors $V_{1x}=[A]_{1x}*[B]_{x}$, $x\in\{1,2,3,4,5,6,7,8\}$, where
   ==$[B]_x=([B]_{x1},[B]_{x2},[B]_{x3},[B]_{x4},[B]_{x5},[B]_{x6},[B]_{x7},[B]_{x8})$==
   ![](https://i.imgur.com/55QabAD.gif)
3. As the figure shows, summing the 8 vectors yields the first row of $[C]$; the remaining rows follow by substituting the later rows of $[A]$ ($A_{2x}, A_{3x}, \ldots, A_{8x}$) in turn.

The complete code:

```clike=1
void SIMD_AVX_matmul8(const Matrix l, const Matrix r, const Matrix dst,
                      int c_row, int c_col)
{
    __m256i I[8], R[8], S[8], Sum[8];
    for (int i = 0; i < 8; i++)
        Sum[i] = _mm256_setzero_si256();
    for (int k = 0; k < l.col; k += 8) {
        for (int i = 0; i < 8; i++)
            R[i] = _mm256_loadu_si256((__m256i *) (&r.values[k + i][c_col]));
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j++) {
                I[j] = _mm256_set1_epi32(l.values[c_row + i][k + j]);
                S[j] = _mm256_mullo_epi32(R[j], I[j]);
            }
            for (int j = 0; j < 8; j++)
                Sum[i] = _mm256_add_epi32(Sum[i], S[j]);
        }
    }
    for (int i = 0; i < 8; i++)
        _mm256_storeu_si256((__m256i *) (&dst.values[c_row + i][c_col]), Sum[i]);
}

void SIMD_AVX_matmul(const Matrix l, const Matrix r, const Matrix dst, void *ctx)
{
    CHECK(l, r, dst);
    for (int i = 0; i < l.row; i += 8)
        for (int j = 0; j < r.col; j += 8)
            SIMD_AVX_matmul8(l, r, dst, i, j);
}
```

## Alignment issues with the SSE and AVX instruction sets

In the [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/), the descriptions of the aligned contiguous-memory loads of the two instruction sets read:

- __m128i _mm_load_si128 (__m128i const* mem_addr)
:::info
Load 128-bits of integer data from memory into dst. mem_addr must be aligned on a ==16-byte== boundary or a general-protection exception may be generated.
:::
- __m256i _mm256_load_si256 (__m256i const * mem_addr)
:::info
Load 256-bits of integer data from memory into dst. mem_addr must be aligned on a ==32-byte== boundary or a general-protection exception may be generated.
:::

The two alignment requirements differ. On my machine (x86_64, 64-bit), every successful malloc returns an address aligned to 16 bytes, but not necessarily to 32 bytes. Calling malloc 10000 times, each time requesting 1 to 10 `sizeof(int)`, gives the result below: every allocation is 16-byte aligned, but only about half are 32-byte aligned.

![](https://i.imgur.com/iuT0crh.png)

The experiment code:

```clike=1
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define VSIZE 10000

struct AlignTester {
    int *ptr;
    int iSize;
};
typedef struct AlignTester Tester;

int main()
{
    int iAlign16 = 0, iAlign32 = 0;
    Tester box[VSIZE];
    for (int i = 0; i < VSIZE; i++) {
        box[i].iSize = (rand() % 10) + 1;
        box[i].ptr = (int *) malloc(sizeof(int) * box[i].iSize);
        if (((intptr_t) box[i].ptr) % 16 == 0)
            iAlign16++;
        if (((intptr_t) box[i].ptr) % 32 == 0)
            iAlign32++;
    }
    printf("1\t%.2f\t2\t%.2f\t0.5\n", (double) iAlign16 / VSIZE * 100.0,
           (double) iAlign32 / VSIZE * 100.0);
    for (int i = 0; i < VSIZE; i++)
        free(box[i].ptr);
    return 0;
}
```

In this situation SSE is fine, but AVX aligned loads hit a segmentation fault about half the time. There are two fixes:

- Use the functions that do not require alignment, such as `_mm256_loadu_si256` from the previous section.
- Replace `malloc` with `posix_memalign` to obtain 32-byte-aligned memory. However, if the matrix size is not a multiple of 8, i.e. some sub-matrix is not a full $8*8$ block, a segmentation fault can still occur.

## When using AVX with a matrix size that is not a multiple of 8

There are two cases:

- Near the right boundary, a row fragment cannot fill all 256 bits.
- Near the bottom boundary, there are fewer than 8 vectors to sum.

The latter is simple: the logic is unchanged, only the loop bounds need adjusting. For the former, pad the missing part with 0 and treat it as an $8*8$ block; to avoid reading past the matrix and causing a segmentation fault, we additionally use these instructions:

- __m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)
- __m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)
- void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)

The version that accepts matrix sizes that are not multiples of 8:

```clike=1
void SIMD_AVX_matmul8(const Matrix l, const Matrix r, const Matrix dst,
                      int c_row, int c_col)
{
    __m256i I[8], R[8], S[8], Sum[8];
    __m256i mask;
    for (int i = 0; i < 8; i++)
        Sum[i] = _mm256_setzero_si256();
    for (int k = 0; k < l.col; k += 8) {
        for (int i = 0; i < 8 && k + i < r.row; i++) {
            if (c_col + 8 <= r.col)
                R[i] = _mm256_loadu_si256((__m256i *) (&r.values[k + i][c_col]));
            else {
                int iMask[8];
                for (int j = 0; j < 8; j++)
                    iMask[j] = c_col + j < r.col ? 1 << 31 : 0;
                mask = _mm256_setr_epi32(iMask[0], iMask[1], iMask[2], iMask[3],
                                         iMask[4], iMask[5], iMask[6], iMask[7]);
                R[i] = _mm256_maskload_epi32((int *) (&r.values[k + i][c_col]),
                                             mask);
            }
        }
        for (int i = 0; i < 8 && c_row + i < l.row; i++) {
            for (int j = 0; j < 8 && k + j < l.col; j++) {
                I[j] = _mm256_set1_epi32(l.values[c_row + i][k + j]);
                S[j] = _mm256_mullo_epi32(R[j], I[j]);
            }
            for (int j = 0; j < 8 && k + j < l.col; j++)
                Sum[i] = _mm256_add_epi32(Sum[i], S[j]);
        }
    }
    for (int i = 0; i < 8 && c_row + i < l.row; i++) {
        if (c_col + 8 <= r.col)
            _mm256_storeu_si256((__m256i *) (&dst.values[c_row + i][c_col]),
                                Sum[i]);
        else
            _mm256_maskstore_epi32(&dst.values[c_row + i][c_col], mask, Sum[i]);
    }
}

void SIMD_AVX_matmul(const Matrix l, const Matrix r, const Matrix dst, void *ctx)
{
    CHECK(l, r, dst);
    for (int i = 0; i < l.row; i += 8)
        for (int j = 0; j < r.col; j += 8)
            SIMD_AVX_matmul8(l, r, dst, i, j);
}
```

---

## Results

Experiment environment:

```
$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Stepping:            9
CPU MHz:             826.723
CPU max MHz:         3800.0000
CPU min MHz:         800.0000
BogoMIPS:            5616.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-7
```

All experiments below use matrix size 1024.

1.
make

```bash=
$ make
gcc --std=gnu99 -O2 -DNDEBUG -msse4.1 -Wall -c -o main.o main.c
gcc --std=gnu99 -O2 -DNDEBUG -msse4.1 -Wall -mavx -march=native -c matmul.c
gcc --std=gnu99 -O2 -DNDEBUG -msse4.1 -Wall -c -o matrix.o matrix.c
gcc --std=gnu99 -O2 -DNDEBUG -msse4.1 -Wall -c -o strassen.o strassen.c
gcc --std=gnu99 -O2 -DNDEBUG -msse4.1 -Wall -o main main.o matmul.o matrix.o strassen.o
```

2. `./main`
   - The user chooses which matrix multiplication method to run.

```bash=
$ ./main
------------------------------
Print matrix? 0: No 1: Yes
Please input value from 0 to 1: 0
------------------------------
Square matrix? 0: No 1: Yes
Please input value from 0 to 1: 1
------------------------------
Matrix size?
Please input value from 1 to 4096: 1024
Matrix multiplication: (1024 x 1024) x (1024 x 1024)
------------------------------
Choose a matrix multiplication method
0: quit
1: naive
2: cache friendly naive
3: submatrix
4: simd
5: simd_avx
6: strassen
Please input value from 0 to 6: 3
------------------------------
Stride?
Please input value from 2 to 4096: 4
Submatrix(stride=4) correct!
CPU time: 3.106043 seconds
```

3. make test
   - The user can enter an arbitrary matrix size.
   - The program creates random matrices of that size.
   - It reports the runtime of every multiplication method.

```bash=
$ make test
./main --test < test_script/test.txt
Matrix multiplication: (1024 x 1024) x (1024 x 1024)
Naive correct!
CPU time: 2.213579 seconds
-----------------------------------------------
Cache_friendly_naive correct!
CPU time: 0.595068 seconds
-----------------------------------------------
Submatrix(stride=4) correct!
CPU time: 1.246918 seconds
-----------------------------------------------
Submatrix(stride=6) correct!
CPU time: 1.169062 seconds
-----------------------------------------------
SIMD correct!
CPU time: 0.654296 seconds
-----------------------------------------------
SIMD_AVX correct!
CPU time: 0.238938 seconds
-----------------------------------------------
Strassen(threshold=128)+Submatrix(stride=6) correct!
CPU time: 0.543757 seconds
-----------------------------------------------
Strassen(threshold=128)+SIMD_AVX correct!
CPU time: 0.195370 seconds
-----------------------------------------------
```

4. make perf
   Uses fork() to measure each method's cache misses:
   - The child process runs the perf command.
   - When the measurement is done, the parent kills it.

**Code:**
- The makefile adds -DPERF at compile time.

```clike=
#if defined(PERF)
    int pid = getpid();
    int cpid = fork();
    if (cpid == 0) {
        // Child process: run perf stat
        char buf[128];
        sprintf(buf,
                "perf stat -e cache-misses,cache-references,"
                "instructions,cycles,branches,branch-misses -p %d",
                pid);
        execl("/bin/sh", "sh", "-c", buf, NULL);
    }
    setpgid(cpid, 0);
#endif
    clock_t tic = clock();
    matmuls[mm](m, n, o, &matmul_ctx);
    clock_t toc = clock();
#if defined(PERF)
    kill(-cpid, SIGINT);
    wait(&(int){0});
#endif
```

**Output:**

```bash=
Matrix multiplication: (1024 x 1024) x (1024 x 1024)

Performance counter stats for process id '4064':
        61,190,796      cache-misses:u      # 3.626 % of all cache refs
     1,687,460,171      cache-references:u
     8,428,134,785      instructions:u      # 0.83 insn per cycle
    10,161,492,593      cycles:u
     1,055,317,279      branches:u
         1,028,966      branch-misses:u     # 0.10% of all branches

       2.735073074 seconds time elapsed

Naive correct!
CPU time: 2.797990 seconds
###############################################

Performance counter stats for process id '4064':
         5,209,865      cache-misses:u      # 4.085 % of all cache refs
       127,541,989      cache-references:u
     7,209,871,205      instructions:u      # 3.39 insn per cycle
     2,128,038,207      cycles:u
     1,032,419,939      branches:u
         1,005,070      branch-misses:u     # 0.10% of all branches

       0.576083717 seconds time elapsed

Cache-friendly-naive correct!
CPU time: 0.604449 seconds
###############################################

Performance counter stats for process id '4064':
        18,270,261      cache-misses:u      # 4.643 % of all cache refs
       393,530,544      cache-references:u
    12,839,811,376      instructions:u      # 2.56 insn per cycle
     5,018,055,509      cycles:u
     2,010,377,946      branches:u
            77,591      branch-misses:u     # 0.00% of all branches

       1.354679373 seconds time elapsed

Submatrix(stride=4) correct!
CPU time: 1.382032 seconds
###############################################

Performance counter stats for process id '4064':
        25,508,482      cache-misses:u      # 8.113 % of all cache refs
       314,407,369      cache-references:u
    11,108,525,213      instructions:u      # 2.36 insn per cycle
     4,704,450,795      cycles:u
     1,647,827,769      branches:u
         4,975,617      branch-misses:u     # 0.30% of all branches

       1.287659037 seconds time elapsed

Submatrix(stride=6) correct!
CPU time: 1.314957 seconds
###############################################

Performance counter stats for process id '4064':
        15,755,922      cache-misses:u      # 5.458 % of all cache refs
       288,694,157      cache-references:u
     4,489,265,404      instructions:u      # 1.55 insn per cycle
     2,902,179,862      cycles:u
       522,381,422      branches:u
            81,643      branch-misses:u     # 0.02% of all branches

       0.795661427 seconds time elapsed

SIMD correct!
CPU time: 0.828697 seconds
###############################################

Performance counter stats for process id '4064':
         6,045,090      cache-misses:u      # 9.784 % of all cache refs
        61,783,562      cache-references:u
     1,931,992,621      instructions:u      # 1.90 insn per cycle
     1,017,711,196      cycles:u
       279,494,672      branches:u
         1,976,689      branch-misses:u     # 0.71% of all branches

       0.281439317 seconds time elapsed

SIMD-AVX correct!
CPU time: 0.312442 seconds
###############################################
```

## Visualizing the results

### Runtime of the different multiplication methods

![](https://i.imgur.com/KVKjkUs.png)

### Strassen speedup

Strassen reduces the amount of multiplication but introduces some overhead. Using the combination with the cache-friendly method as an example, the following experiment varies Strassen's threshold (the matrix size below which we stop splitting) and measures its effect on runtime:

![](https://i.imgur.com/Y7PFQcJ.png)

Speedups from combining Strassen with different algorithms:

![](https://i.imgur.com/TAg6wBl.png)

The naive and cache-friendly methods take about the same time once combined with Strassen, which indicates that for 64*64 products the two methods differ little in runtime.

### Effect of the sub-matrix block size on runtime

![](https://i.imgur.com/P4W82jU.png)
![](https://i.imgur.com/i9UJG1O.png)

## Matrix Multiplication using SIMD: discussion

> I suggest watching the previous team's video first; it explains things clearly. I happened to have done something similar before ([matrix-multiplication-parallel](https://github.com/siahuat0727/matrix-multiplication-parallel), some parts are still not well written), so I had some background.
> Also, since Strassen is only a divide-and-conquer scheme that reduces the amount of computation, it still performs matrix multiplications internally, and that inner multiplication should be freely selectable (naive, simd, and so on). The previous team's code cannot do this conveniently, so for now the multiplication is declared as:
> `typedef Matrix (*MatrixMulFunc)(const Matrix, const Matrix, void*)`
> where the void* is reserved for functions that need it (e.g. for Strassen, which multiplication to use internally and the minimum split size, or later, whether to parallelize).
> I sketched a small framework for this: [matrix-multiplication](https://github.com/siahuat0727/matrix-multiplication/tree/a866e5808e7b9a19fff01d62308150765825610f). I'm not sure the design is good; questions welcome. I figured we should reach consensus on this part first?
> [name=siahuat0727]

---

```c=1
MatrixMulFuncEle *matmul_listadd(MatrixMulFuncEle *list, MatrixMulFunc func,
                                 char *name, void *ctx, size_t extra_size)
{
    MatrixMulFuncEle *node = malloc(sizeof(MatrixMulFuncEle) + extra_size);
    return_val_if_fail(node != NULL, 0);
    node->matmul = func;
    node->name = name;
    memcpy(node->ctx, ctx, extra_size);
    return node;
}
```

> What is extra_size for here? And what does ctx refer to?
> [name=pjchiou]

> ctx means context.
> About the previous team's sse and sse_prefetch: if I remember correctly the two differ little, and a large chunk of code is copy-pasted, because they used a whole function as the unit, which is hard to tweak.
> So I want to leave each method room for configuration. For example, whether sse should prefetch requires passing in a bool; or which multiplication Strassen should use and its minimum split size can be passed in as a struct. That extra space is what extra_size provides.
> Since everything gets chained together in the end, the API has to be uniform. [name=siahuat0727]
>> ok, understood.
>> [name=pjchiou]

---

```c=1
Matrix matrix_free(const Matrix m)
{
    free(m.values[0]);
    free(m.values);
    return (Matrix){0};
}
```

> What is the purpose of returning `(Matrix){0}` here?
> [name=pjchiou]

> This follows [你所不知道的C語言:技巧篇](https://hackmd.io/s/HyIdoLnjl#%E5%BE%9E%E7%9F%A9%E9%99%A3%E6%93%8D%E4%BD%9C%E8%AB%87%E8%B5%B7), which mentions [prefer-to-return-a-value-rather-than-modifying-pointers](https://github.com/mcinglis/c-style#prefer-to-return-a-value-rather-than-modifying-pointers).
>:::info
>**Prefer to return a value rather than modifying pointers**
>This encourages immutability, cultivates pure functions, and makes things simpler and easier to understand. It also improves safety by eliminating the possibility of a NULL argument.
>```clike
>// Bad: unnecessary mutation (probably), and unsafe
>void drink_mix( Drink * const drink, Ingredient const ingr ) {
>    assert( drink != NULL );
>    color_blend( &( drink->color ), ingr.color );
>    drink->alcohol += ingr.alcohol;
>}
>
>// Good: immutability rocks, pure and safe functions everywhere
>Drink drink_mix( Drink const drink, Ingredient const ingr ) {
>    return ( Drink ){
>        .color = color_blend( drink.color, ingr.color ),
>        .alcohol = drink.alcohol + ingr.alcohol
>    };
>}
>```
>This isn't always the best way to go, but it's something you should always consider.
>:::
> Since the matrix's row and column should also be reset to 0, the call site is `m = matrix_free(m)` rather than `matrix_free(&m)`.
> This is actually my first time writing it this way; thought I'd give it a try.
> I'm still missing `assert(m.values != NULL)` or similar handling here, though. [name=siahuat0727]
>> I feel this example doesn't quite match how our `matrix_free` is used, but I can't quite put my finger on the difference.
>> [name=pjchiou]
>> Yeah, I also went back and forth on it for a long time; happy to discuss.
>> [name=siahuat0727]
>> I wonder if the difference is that our struct needs allocated storage?
>> What bothers me is that with `Matrix matmul(const Matrix, const Matrix, void*)`, every resulting Matrix has to be freed, which is annoying.
>> With `void matmul(const Matrix, const Matrix, Matrix *dst, void*)` we would have to check whether dst has already been allocated and so on, but we could keep reusing the same dst and free it only at the end. Not sure whether that would be better.
>> [name=siahuat0727]
>>> I think either works; let's just pick one and focus on the computation itself. The time spent allocating memory is outside what we measure anyway. Shall we also write the other methods directly in `naive.[hc]`?
>>> [name=pjchiou]
>>>> Right, the timer wraps the whole function, so the first approach would include allocation time in the measurement. Let's go with the second one, and assume inside the function that dst is already allocated with the correct size. Putting everything in one file feels awkward when we all write together, so I'll switch to the second approach first; if we later decide merging is better, I'm fine with that too.
>>>> [name=siahuat0727]

---
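To make the convention settled on above concrete, here is a minimal sketch of the `void matmul(const Matrix, const Matrix, Matrix *dst, void *ctx)` style with a reusable, caller-owned dst, using the naive algorithm from the first section. The `Matrix` layout and the `matrix_create` helper are simplified stand-ins for illustration, not the repo's actual definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for the repo's Matrix type. */
typedef struct {
    int row, col;
    int **values;
} Matrix;

/* Allocate a zeroed row x col matrix; values[0] owns one contiguous block. */
static Matrix matrix_create(int row, int col)
{
    Matrix m = {.row = row, .col = col};
    m.values = malloc(sizeof(int *) * row);
    m.values[0] = calloc((size_t) row * col, sizeof(int));
    for (int i = 1; i < row; i++)
        m.values[i] = m.values[0] + (size_t) i * col;
    return m;
}

/* The dst-reuse convention: the caller owns dst and guarantees it is
 * already allocated with the correct size; the callee only accumulates
 * into it. ctx is reserved for per-method options, as in the discussion. */
static void naive_matmul(const Matrix a, const Matrix b, Matrix *dst, void *ctx)
{
    (void) ctx;
    for (int i = 0; i < a.row; ++i)
        for (int j = 0; j < b.col; ++j) {
            int sum = 0;
            for (int k = 0; k < a.col; ++k)
                sum += a.values[i][k] * b.values[k][j];
            dst->values[i][j] += sum;
        }
}
```

Note that because the callee accumulates with `+=`, the caller must zero dst (or accept accumulation) before reusing it across independent products.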
