2017q1 matrix

contributed by <twzjwang>, <rayleigh0407>
GitHub: matrix_oo
Task: following the B10: matrix requirements, fully implement several accelerated matrix-multiplication algorithms, along with the related SIMD improvements

Matrix multiplication

  • definition
    • $C=AB$ : for an $n \times m$ matrix $A$ and an $m \times p$ matrix $B$, $C$ is an $n \times p$ matrix with entries
    • $c_{ij}=\sum_{k=1}^{m}a_{ik}b_{kj}$
  • History of Matrix Multiplication Algorithms

Algorithm

Iterative algorithm

  • time complexity :
    $O(n^3)$
  • method :

    ```
    input  matrix A (n*m), B (m*p)
    create matrix C (n*p)
    for i = 1 : n
        for j = 1 : p
            sum = 0
            for k = 1 : m
                sum = sum + A[i][k] * B[k][j]
            C[i][j] = sum
    return C
    ```
  • problem :
    • the order in which the matrices are stored has a considerable
      impact on performance, because of how the cache is used
      • row-major order
      • column-major order
      • a mix of both

Divide and conquer algorithm

  • time complexity :
    $O(n^3)$
  • relies on block partitioning

square matrices

  • the shapes are $2^n \times 2^n$ for some $n$

non-square matrices

  • divide the matrix into two parts of equal size, or as close to equal sizes as possible in the case of odd dimensions

Sub-cubic algorithms

  • algorithms exist that provide better running times than the straightforward one

Strassen Algorithm

  • time complexity : approximately
    $O(n^{2.8074})$
    • let $N=2^n$, i.e. $n=\log_2 N$
    • iterative algorithm : $2N^3$ steps
    • Strassen algorithm : $T(n)=7\,T(n-1)+O(4^n)$, where $T(n)$ is the cost for $2^n \times 2^n$ matrices

      $(7+o(1))^n=(7+o(1))^{\log_2 N}=N^{\log_2 7+o(1)}\approx N^{2.8074}$
  • multiply the two matrices $C=AB$
  • if the matrices $A$, $B$ are not of type $2^n \times 2^n$, fill the missing rows and columns with zeros
  • partition $A$, $B$ and $C$ into equally sized block matrices

    $A=\begin{bmatrix}A_{11}&A_{12}\\A_{21}&A_{22}\end{bmatrix},\quad B=\begin{bmatrix}B_{11}&B_{12}\\B_{21}&B_{22}\end{bmatrix},\quad C=\begin{bmatrix}C_{11}&C_{12}\\C_{21}&C_{22}\end{bmatrix}$

    with $A_{i,j},B_{i,j},C_{i,j}\in R^{2^{n-1}\times 2^{n-1}},\ i,j=1,2$
  • goal
    • $C_{11}=A_{11}B_{11}+A_{12}B_{21}$
    • $C_{12}=A_{11}B_{12}+A_{12}B_{22}$
    • $C_{21}=A_{21}B_{11}+A_{22}B_{21}$
    • $C_{22}=A_{21}B_{12}+A_{22}B_{22}$
  • define new matrices
    • $M_{1}=(A_{11}+A_{22})(B_{11}+B_{22})$
    • $M_{2}=(A_{21}+A_{22})B_{11}$
    • $M_{3}=A_{11}(B_{12}-B_{22})$
    • $M_{4}=A_{22}(B_{21}-B_{11})$
    • $M_{5}=(A_{11}+A_{12})B_{22}$
    • $M_{6}=(A_{21}-A_{11})(B_{11}+B_{12})$
    • $M_{7}=(A_{12}-A_{22})(B_{21}+B_{22})$
  • only 7 multiplications (one for each $M_{k}$) are used instead of 8
    • $C_{11}=M_{1}+M_{4}-M_{5}+M_{7}$
    • $C_{12}=M_{3}+M_{5}$
    • $C_{21}=M_{2}+M_{4}$
    • $C_{22}=M_{1}-M_{2}+M_{3}+M_{6}$
  • analysis
    • multiplying two $n \times n$ matrices: the product has $n^2$ entries, and each entry takes $n$ multiplications and $n-1$ additions, so the number of steps is
      $n^2[n+(n-1)]=2n^3-n^2$
      ($n^3$ multiplications and $n^3-n^2$ additions)
    • let $n=2m$. The naive method then needs $16m^3-4m^2$ steps. The Strassen algorithm first computes $M_i\ (i=1...7)$ with the sub-matrices multiplied directly, which takes ($m \times m$ multiplication) × 7 + ($m \times m$ addition/subtraction) × 10 steps; combining the answer then takes ($m \times m$ addition/subtraction) × 8 steps, i.e.
      $(2m^3-m^2)\times 7+m^2\times 10+m^2\times 8=14m^3+11m^2$
      ($7m^3$ multiplications and $7m^3+11m^2$ additions/subtractions)

      |       | Naive        | Strassen algo |
      |-------|--------------|---------------|
      | steps | $16m^3-4m^2$ | $14m^3+11m^2$ |
    • once m >= 8, the naive step count exceeds Strassen's; for m >> 1, the Strassen algorithm takes about 7/8 of the naive time

Coppersmith–Winograd algorithm

  • time complexity :
    $O(n^{2.375477})$ ~ $O(n^{2.3728639})$
  • the asymptotically fastest known matrix multiplication algorithm until 2010
  • improves on the naive $O(n^3)$ and Strassen's $O(n^{2.807355})$
  • rarely used in practice, because the large constant factors in its running time make it impractical
  • the exponent can possibly be improved further; however, it must be at least 2
    • just reading an $n \times n$ matrix takes $O(n^2)$, since every entry must be read at least once
    • in 2010, Andrew Stothers gave an improvement to the algorithm, $O(n^{2.374})$
    • in 2011, Virginia Williams combined a mathematical short-cut from Stothers' paper with her own insights and automated optimization on computers, improving the bound to $O(n^{2.372864})$
    • in 2014, François Le Gall simplified the methods of Williams and obtained an improved bound of $O(n^{2.3728639})$
  • the Coppersmith–Winograd algorithm is frequently used as a building block in other algorithms to prove theoretical time bounds
  • however, unlike the Strassen algorithm, it is not used in practice, because it only provides an advantage for matrices so large that they cannot be processed by modern hardware

Freivalds' algorithm

  • time complexity :
    $O(n^2)$
  • verifies matrix multiplication (correctness)

To verify whether $C=AB$ holds, the straightforward idea is to first compute $AB=D$ and then check whether $C=D$.

That takes $O(n^3)$ ~ $O(n^{2.3728639})$ time.

Freivalds' algorithm speeds the check up to $O(n^2)$.

  • a simple randomized algorithm faster than any known deterministic solution
  • method :
    • input : $n \times n$ matrices $A, B, C$
    • generate an $n \times 1$ random 0/1 vector $r$
    • compute $P=A \times (Br)-Cr$
    • output "Yes" if $P=(0,0,...,0)^T$ ; "No," otherwise
    • error
      • if $A \times B=C$, then the algorithm always returns "Yes". If $A \times B \neq C$, then the probability that the algorithm returns "Yes" is less than or equal to one half. This is called one-sided error.
  • running time :
    • $ABr=A(Br)$, so we have 3 instances of an $n \times n$ matrix times an $n$-vector

      $B \times r$ : $B$ ($n \times n$) times $r$ ($n \times 1$) => $O(n \times n \times 1)$

      $A \times (Br)$ : $A$ ($n \times n$) times $Br$ ($n \times 1$) => $O(n \times n \times 1)$

      $C \times r$ : $C$ ($n \times n$) times $r$ ($n \times 1$) => $O(n \times n \times 1)$

    • these are $O(n^2)$-time operations if done straightforwardly
    • total running time : $O(n^2)$
  • error analysis
    • let $p$ be the probability of error
    • case $A \times B=C$ : $p=0$
      • $P=A \times (Br)-Cr=(A \times B)r-Cr=(A \times B-C)r=0$
      • this holds regardless of the value of $r$, since it uses only that $A \times B-C=0$. Hence the probability of error in this case is $Pr[P \neq 0]=0$
    • case $A \times B \neq C$ : $p \leq 1/2$
      • let $P=D \times r=(p_1,p_2,...,p_n)^T$, where $D=A \times B-C=(d_{ij})$
      • since $A \times B \neq C$, we have that some element of $D$ is nonzero
      • suppose that the element $d_{ij} \neq 0$
      • by the definition of matrix multiplication, we have
        • $p_i=\sum_{k=1}^{n}d_{ik}r_k=d_{i1}r_1+\cdots+d_{ij}r_j+\cdots+d_{in}r_n=d_{ij}r_j+y$
      • for some constant $y$. Using Bayes' theorem, we can partition over $y$ :
        • $Pr[p_i=0]=Pr[p_i=0 \mid y=0]\,Pr[y=0]+Pr[p_i=0 \mid y \neq 0]\,Pr[y \neq 0] \quad (1)$
      • we use that :
        • $Pr[p_i=0 \mid y=0]=Pr[r_j=0]=\frac{1}{2}$

          $Pr[p_i=0 \mid y \neq 0]=Pr[r_j=1 \wedge d_{ij}=-y] \leq Pr[r_j=1]=\frac{1}{2}$
      • plugging these into equation (1), we get :
        • $Pr[p_i=0] \leq \frac{1}{2}Pr[y=0]+\frac{1}{2}Pr[y \neq 0]=\frac{1}{2}Pr[y=0]+\frac{1}{2}(1-Pr[y=0])=\frac{1}{2}$
      • therefore,
        $Pr[P=0]=Pr[p_1=0 \wedge \cdots \wedge p_i=0 \wedge \cdots \wedge p_n=0] \leq Pr[p_i=0] \leq \frac{1}{2}$

        This completes the proof.

e.g.:

case $C = A \times B$ :
$A=\begin{bmatrix}0&1\\2&3\end{bmatrix}$, $B=\begin{bmatrix}1&0\\0&0\end{bmatrix}$, $C=\begin{bmatrix}0&0\\2&0\end{bmatrix}$

the randomly generated $r$ can be $\begin{bmatrix}0\\0\end{bmatrix}$, $\begin{bmatrix}0\\1\end{bmatrix}$, $\begin{bmatrix}1\\0\end{bmatrix}$ or $\begin{bmatrix}1\\1\end{bmatrix}$

computing $P=A \times (Br)-Cr$ yields $P=\begin{bmatrix}0\\0\end{bmatrix}$ in every case

=> the error probability is 0

case $C \neq A \times B$ :
$A=\begin{bmatrix}0&1\\2&3\end{bmatrix}$, $B=\begin{bmatrix}1&0\\0&0\end{bmatrix}$, $C=\begin{bmatrix}0&0\\0&0\end{bmatrix}$

again $r$ can be $\begin{bmatrix}0\\0\end{bmatrix}$, $\begin{bmatrix}0\\1\end{bmatrix}$, $\begin{bmatrix}1\\0\end{bmatrix}$ or $\begin{bmatrix}1\\1\end{bmatrix}$

computing $P=A \times (Br)-Cr$ :

$r=\begin{bmatrix}0\\0\end{bmatrix}$ and $r=\begin{bmatrix}0\\1\end{bmatrix}$ give $P=\begin{bmatrix}0\\0\end{bmatrix}$

$r=\begin{bmatrix}1\\0\end{bmatrix}$ and $r=\begin{bmatrix}1\\1\end{bmatrix}$ give $P\neq\begin{bmatrix}0\\0\end{bmatrix}$

=> the error probability is 0.5
(half of the time, $C \neq A \times B$ yet $P=\begin{bmatrix}0\\0\end{bmatrix}$, so the check wrongly answers "Yes")

  • ramifications
    • simple algorithmic analysis shows that the running time of this algorithm is $O(n^2)$, beating the classical deterministic algorithm's bound of $O(n^3)$
    • the error analysis also shows that if we run the algorithm $k$ times, we can achieve an error bound of less than $\frac{1}{2^k}$, an exponentially small quantity
    • utilizing randomized algorithms can speed up a very slow deterministic algorithm
    • Freivalds' algorithm frequently arises in introductions to probabilistic algorithms due to its simplicity and how it illustrates the superiority of probabilistic algorithms in practice for some problems

Parallel and distributed algorithms

https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm#Parallel_and_distributed_algorithms

Experiments

  • implement each of the algorithms above, improve them with SIMD, and verify the results with Freivalds' algorithm

0. Freivalds' algorithm

  • verification method

    ```c
    algo = matrix_providers[0];
    if (!algo->mul(&L_Br, &B, &r)) {
        printf("B x r error!\n");
        continue;
    }
    if (!algo->mul(&L_ABr, &A, &L_Br)) {
        printf("A x Br error!\n");
        continue;
    }
    if (!algo->mul(&R_Cr, &C, &r)) {
        printf("C x r error!\n");
        continue;
    }
    if (!algo->equal(&L_ABr, &R_Cr)) {
        printf("ABr and Cr are not equal!\n");
        break;
    }
    ```
  • improvement

    • the error bound is less than $\frac{1}{2^k}$
    • originally k = 1, so the error bound is < 1/2^1 (< 50%)
    • raising k to 10 brings the error bound below 1/2^10 (< 0.1%)
      (note: for this bound to hold, a fresh random r must be generated on every iteration)

    ```c
    int flag = 1;
    algo = matrix_providers[0];
    for (int i = 0; i < 10 && flag; i++) {
        if (!algo->mul(&L_Br, &B, &r)) {
            printf("B x r error!\n");
            flag = 0;
            continue;
        }
        if (!algo->mul(&L_ABr, &A, &L_Br)) {
            printf("A x Br error!\n");
            flag = 0;
            continue;
        }
        if (!algo->mul(&R_Cr, &C, &r)) {
            printf("C x r error!\n");
            flag = 0;
            continue;
        }
        if (!algo->equal(&L_ABr, &R_Cr)) {
            printf("ABr and Cr are not equal!\n");
            flag = 0;
            break;
        }
    }
    ```

1. Naive

  • definition :
    $c_{ij}=\sum_{k=1}^{m}a_{ik}b_{kj}$
  • implemented with 3 nested loops

    ```c
    for (int i = 0; i < l->row; i++)
        for (int j = 0; j < r->col; j++) {
            PRIV(dst)->values[i][j] = 0;
            for (int k = 0; k < l->col; k++)
                PRIV(dst)->values[i][j] += PRIV(l)->values[i][k] *
                                           PRIV(r)->values[k][j];
        }
    ```
  • experiment results
    ​​​​algo     : Naive
    
    ​​​​matrix   : 8 x 8
    ​​​​exe time : 0.000001 ms
    
    ​​​​matrix   : 16 x 16
    ​​​​exe time : 0.000004 ms
    
    ​​​​matrix   : 32 x 32
    ​​​​exe time : 0.000029 ms
    
    ​​​​matrix   : 64 x 64
    ​​​​exe time : 0.000237 ms
    
    ​​​​matrix   : 128 x 128
    ​​​​exe time : 0.001771 ms
    
    ​​​​matrix   : 256 x 256
    ​​​​exe time : 0.014509 ms
    
    ​​​​matrix   : 512 x 512
    ​​​​exe time : 0.144029 ms
    
    ​​​​matrix   : 1024 x 1024
    ​​​​exe time : 1.456696 ms
    

SSE

  • similar to naive, but processes 4 ints (128 bits) at a time

  • instructions

    • __m128i _mm_setzero_si128()
      • sets all 128 bits to 0
    • __m128i _mm_load_si128(__m128i const*p)
      • reads 128 bits of contiguous memory from p (which must be 16-byte aligned)
    • __m128i _mm_set_epi32(int i3, int i2, int i1, int i0)
      • sets the 128-bit content to the 4 given ints (32 bits each)
    • __m128i _mm_mullo_epi32( __m128i a, __m128i b)
      • multiplies the 4 32-bit integers of a and b pairwise, keeping the low 32 bits of each 64-bit product
    • __m128i _mm_unpacklo_epi32(__m128i a, __m128i b)
      • interleaves the low two (bits 0-63) 32-bit ints of a and b
      • e.g.
        • a : i0 i1 i2 i3
          b : j0 j1 j2 j3
          => i0 j0 i1 j1
    • __m128i _mm_unpackhi_epi32(__m128i a, __m128i b)
      • interleaves the high two (bits 64-127) 32-bit ints of a and b
    • __m128i _mm_unpacklo_epi64(__m128i a, __m128i b)
      • takes the first (bits 0-63) 64-bit int of a and of b
    • __m128i _mm_unpackhi_epi64(__m128i a, __m128i b)
      • takes the second (bits 64-127) 64-bit int of a and of b
    • __m128i _mm_add_epi32(__m128i a, __m128i b)
      • adds the 4 32-bit ints of a and b pairwise (each sum wraps around within its 32 bits)
    • void _mm_store_si128(__m128i *p, __m128i b)
      • writes b (128 bits) to contiguous memory at p (which must be 16-byte aligned)
  • experiment results

    ​​​​algo     : SSE
    
    ​​​​matrix   : 8 x 8
    ​​​​exe time : 0.000000 ms
    
    ​​​​matrix   : 16 x 16
    ​​​​exe time : 0.000002 ms
    
    ​​​​matrix   : 32 x 32
    ​​​​exe time : 0.000009 ms
    
    ​​​​matrix   : 64 x 64
    ​​​​exe time : 0.000061 ms
    
    ​​​​matrix   : 128 x 128
    ​​​​exe time : 0.000471 ms
    
    ​​​​matrix   : 256 x 256
    ​​​​exe time : 0.003963 ms
    
    ​​​​matrix   : 512 x 512
    ​​​​exe time : 0.035724 ms
    
    ​​​​matrix   : 1024 x 1024
    ​​​​exe time : 0.327177 ms
    

AVX

  • similar to naive, but processes 8 ints (256 bits) at a time

  • instructions

    • __m256i _mm256_setzero_si256(void)
      • sets all 256 bits to 0 and returns the result
    • __m256i _mm256_load_si256(__m256i const *a)
      • reads 256 bits of contiguous memory from a (which must be 32-byte aligned)
    • __m256i _mm256_set1_epi32(int)
      • broadcasts one 32-bit int to all 8 lanes of a 256-bit vector
    • __m256i _mm256_mullo_epi32(__m256i s1, __m256i s2)
      • multiplies the 8 32-bit integers of s1 and s2 pairwise, keeping the low 32 bits of each 64-bit product
    • __m256i _mm256_add_epi32(__m256i a, __m256i b)
      • adds the 8 32-bit ints of a and b pairwise (each sum wraps around within its 32 bits)
    • void _mm256_store_si256(__m256i *a, __m256i b)
      • writes b (256 bits) to contiguous memory at a (which must be 32-byte aligned)
  • experiment results

    ​​​​algo     : AVX
    
    ​​​​matrix   : 8 x 8
    ​​​​exe time : 0.000001 ms
    
    ​​​​matrix   : 16 x 16
    ​​​​exe time : 0.000002 ms
    
    ​​​​matrix   : 32 x 32
    ​​​​exe time : 0.000006 ms
    
    ​​​​matrix   : 64 x 64
    ​​​​exe time : 0.000026 ms
    
    ​​​​matrix   : 128 x 128
    ​​​​exe time : 0.000175 ms
    
    ​​​​matrix   : 256 x 256
    ​​​​exe time : 0.001366 ms
    
    ​​​​matrix   : 512 x 512
    ​​​​exe time : 0.011233 ms
    
    ​​​​matrix   : 1024 x 1024
    ​​​​exe time : 0.103360 ms
    

AVX loop unrolling

  • unrolling the following loop

    ```c
    for (int i = 0; i < 8; i++) {
        // broadcast each element from source 1
        ymm8 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 0)]);
        ymm9 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 1)]);
        ymm10 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 2)]);
        ymm11 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 3)]);
        ymm12 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 4)]);
        ymm13 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 5)]);
        ymm14 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 6)]);
        ymm15 = _mm256_set1_epi32(PRIV(l)->values[(x + i)][(k + 7)]);
        // multiply
        ymm8 = _mm256_mullo_epi32(ymm8, ymm0);   // row 1, 2
        ymm9 = _mm256_mullo_epi32(ymm9, ymm1);
        ymm8 = _mm256_add_epi32(ymm8, ymm9);
        ymm10 = _mm256_mullo_epi32(ymm10, ymm2); // row 3, 4
        ymm11 = _mm256_mullo_epi32(ymm11, ymm3);
        ymm10 = _mm256_add_epi32(ymm10, ymm11);
        ymm12 = _mm256_mullo_epi32(ymm12, ymm4); // row 5, 6
        ymm13 = _mm256_mullo_epi32(ymm13, ymm5);
        ymm12 = _mm256_add_epi32(ymm12, ymm13);
        ymm14 = _mm256_mullo_epi32(ymm14, ymm6); // row 7, 8
        ymm15 = _mm256_mullo_epi32(ymm15, ymm7);
        ymm14 = _mm256_add_epi32(ymm14, ymm15);
        ymm8 = _mm256_add_epi32(ymm8, ymm10);    // sum
        ymm12 = _mm256_add_epi32(ymm12, ymm14);
        ymm8 = _mm256_add_epi32(ymm8, ymm12);
        switch (i) {
        case 0:
            ymm16 = _mm256_add_epi32(ymm16, ymm8);
            break;
        case 1:
            ymm17 = _mm256_add_epi32(ymm17, ymm8);
            break;
        case 2:
            ymm18 = _mm256_add_epi32(ymm18, ymm8);
            break;
        case 3:
            ymm19 = _mm256_add_epi32(ymm19, ymm8);
            break;
        case 4:
            ymm20 = _mm256_add_epi32(ymm20, ymm8);
            break;
        case 5:
            ymm21 = _mm256_add_epi32(ymm21, ymm8);
            break;
        case 6:
            ymm22 = _mm256_add_epi32(ymm22, ymm8);
            break;
        case 7:
            ymm23 = _mm256_add_epi32(ymm23, ymm8);
            break;
        }
    }
    ```
  • experiment results

    • before
      ​​​​​​​​algo     : AVX
      
      ​​​​​​​​matrix   : 8 x 8
      ​​​​​​​​exe time : 0.000001 ms
      
      ​​​​​​​​matrix   : 16 x 16
      ​​​​​​​​exe time : 0.000002 ms
      
      ​​​​​​​​matrix   : 32 x 32
      ​​​​​​​​exe time : 0.000006 ms
      
      ​​​​​​​​matrix   : 64 x 64
      ​​​​​​​​exe time : 0.000026 ms
      
      ​​​​​​​​matrix   : 128 x 128
      ​​​​​​​​exe time : 0.000177 ms
      
      ​​​​​​​​matrix   : 256 x 256
      ​​​​​​​​exe time : 0.001359 ms
      
      ​​​​​​​​matrix   : 512 x 512
      ​​​​​​​​exe time : 0.011220 ms
      
      ​​​​​​​​matrix   : 1024 x 1024
      ​​​​​​​​exe time : 0.101038 ms
      
    • after
      ​​​​​​​​algo     : AVX
      
      ​​​​​​​​matrix   : 8 x 8
      ​​​​​​​​exe time : 0.000001 ms
      
      ​​​​​​​​matrix   : 16 x 16
      ​​​​​​​​exe time : 0.000002 ms
      
      ​​​​​​​​matrix   : 32 x 32
      ​​​​​​​​exe time : 0.000006 ms
      
      ​​​​​​​​matrix   : 64 x 64
      ​​​​​​​​exe time : 0.000026 ms
      
      ​​​​​​​​matrix   : 128 x 128
      ​​​​​​​​exe time : 0.000165 ms
      
      ​​​​​​​​matrix   : 256 x 256
      ​​​​​​​​exe time : 0.001294 ms
      
      ​​​​​​​​matrix   : 512 x 512
      ​​​​​​​​exe time : 0.010437 ms
      
      ​​​​​​​​matrix   : 1024 x 1024
      ​​​​​​​​exe time : 0.096296 ms
      
    • comparison
      • $Speedup=T_{before}/T_{after}$
      • matrices larger than $128 \times 128$ see a speedup of roughly 5-7%

    | n          | 8        | 16       | 32       | 64       | 128      | 256      | 512      | 1024     |
    |------------|----------|----------|----------|----------|----------|----------|----------|----------|
    | before(ms) | 0.000001 | 0.000002 | 0.000006 | 0.000026 | 0.000177 | 0.001359 | 0.011220 | 0.101038 |
    | after(ms)  | 0.000001 | 0.000002 | 0.000006 | 0.000026 | 0.000165 | 0.001294 | 0.010437 | 0.096296 |
    | speedup    | 1        | 1        | 1        | 1        | 1.07     | 1.05     | 1.07     | 1.05     |

Comparison with Naive

| n         | 8        | 16       | 32       | 64       | 128      | 256      | 512      | 1024     |
|-----------|----------|----------|----------|----------|----------|----------|----------|----------|
| naive(ms) | 0.000001 | 0.000004 | 0.000029 | 0.000237 | 0.001771 | 0.014509 | 0.144029 | 1.456696 |
| sse(ms)   | 0.000000 | 0.000002 | 0.000009 | 0.000061 | 0.000471 | 0.003963 | 0.035724 | 0.327177 |
| avx(ms)   | 0.000001 | 0.000002 | 0.000005 | 0.000025 | 0.000168 | 0.001288 | 0.010347 | 0.095896 |

(updated with AVX loop unrolling)

2. Strassen Algorithm

  • added the seperate(), combine(), add_sub_part() and mul_part() functions to ease the implementation
  • if the matrix has an odd number of rows (or columns), pad one extra row (or column) initialized to zero
  • experiment results
    ​​​​algo     : Naive
    
    ​​​​matrix   : 8 x 8
    ​​​​exe time : 0.000001 ms
    
    ​​​​matrix   : 16 x 16
    ​​​​exe time : 0.000004 ms
    
    ​​​​matrix   : 32 x 32
    ​​​​exe time : 0.000032 ms
    
    ​​​​matrix   : 64 x 64
    ​​​​exe time : 0.000220 ms
    
    ​​​​matrix   : 128 x 128
    ​​​​exe time : 0.001543 ms
    
    ​​​​matrix   : 256 x 256
    ​​​​exe time : 0.012644 ms
    
    ​​​​matrix   : 512 x 512
    ​​​​exe time : 0.137741 ms
    
    ​​​​matrix   : 1024 x 1024
    ​​​​exe time : 1.216822 ms
    
    ​​​​
    ​​​​
    ​​​​algo     : Strassen
    
    ​​​​matrix   : 8 x 8
    ​​​​exe time : 0.000004 ms
    
    ​​​​matrix   : 16 x 16
    ​​​​exe time : 0.000010 ms
    
    ​​​​matrix   : 32 x 32
    ​​​​exe time : 0.000038 ms
    
    ​​​​matrix   : 64 x 64
    ​​​​exe time : 0.000244 ms
    
    ​​​​matrix   : 128 x 128
    ​​​​exe time : 0.001432 ms
    
    ​​​​matrix   : 256 x 256
    ​​​​exe time : 0.009937 ms
    
    ​​​​matrix   : 512 x 512
    ​​​​exe time : 0.083023 ms
    
    ​​​​matrix   : 1024 x 1024
    ​​​​exe time : 0.978982 ms
    
| n            | 8        | 16       | 32       | 64       | 128      | 256      | 512      | 1024     |
|--------------|----------|----------|----------|----------|----------|----------|----------|----------|
| naive(ms)    | 0.000001 | 0.000004 | 0.000032 | 0.000220 | 0.001543 | 0.012644 | 0.137741 | 1.216822 |
| Strassen(ms) | 0.000004 | 0.000010 | 0.000038 | 0.000244 | 0.001432 | 0.009937 | 0.083023 | 0.978982 |

At roughly N > 100, the Strassen algorithm becomes faster than the naive version.

The original Strassen algorithm is recursive, i.e. the sub-matrix multiplications call the algorithm itself, which is what yields the $O(n^{2.81})$ time complexity. In this implementation, however, the execution time exploded with full recursion, so only one level of matrix splitting is performed for now; the suspected cause is that splitting and recombining the sub-matrices takes too long.

  • embedding SSE into the Strassen algorithm
    Because of the limitation above, the sub-matrices are still multiplied with the plain method, so the SSE instruction set can be used in that step to do the sub-matrix products with SIMD. Experiment results:
    ​​​​algo     : SSE
    
    ​​​​matrix   : 8 x 8
    ​​​​exe time : 0.000000 ms
    
    ​​​​matrix   : 16 x 16
    ​​​​exe time : 0.000001 ms
    
    ​​​​matrix   : 32 x 32
    ​​​​exe time : 0.000008 ms
    
    ​​​​matrix   : 64 x 64
    ​​​​exe time : 0.000056 ms
    
    ​​​​matrix   : 128 x 128
    ​​​​exe time : 0.000430 ms
    
    ​​​​matrix   : 256 x 256
    ​​​​exe time : 0.003593 ms
    
    ​​​​matrix   : 512 x 512
    ​​​​exe time : 0.032536 ms
    
    ​​​​matrix   : 1024 x 1024
    ​​​​exe time : 0.297687 ms
    
    
    ​​​​
    ​​​​algo     : Strassen SSE
    
    ​​​​matrix   : 8 x 8
    ​​​​exe time : 0.000003 ms
    
    ​​​​matrix   : 16 x 16
    ​​​​exe time : 0.000008 ms
    
    ​​​​matrix   : 32 x 32
    ​​​​exe time : 0.000023 ms
    
    ​​​​matrix   : 64 x 64
    ​​​​exe time : 0.000098 ms
    
    ​​​​matrix   : 128 x 128
    ​​​​exe time : 0.000529 ms
    
    ​​​​matrix   : 256 x 256
    ​​​​exe time : 0.003462 ms
    
    ​​​​matrix   : 512 x 512
    ​​​​exe time : 0.026929 ms
    
    ​​​​matrix   : 1024 x 1024
    ​​​​exe time : 0.240095 ms
    
| n                | 8        | 16       | 32       | 64       | 128      | 256      | 512      | 1024     |
|------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| Strassen(ms)     | 0.000004 | 0.000010 | 0.000038 | 0.000244 | 0.001432 | 0.009937 | 0.083023 | 0.978982 |
| SSE(ms)          | 0.000000 | 0.000001 | 0.000008 | 0.000056 | 0.000430 | 0.003593 | 0.032536 | 0.297687 |
| Strassen SSE(ms) | 0.000003 | 0.000008 | 0.000023 | 0.000098 | 0.000529 | 0.003462 | 0.026929 | 0.240095 |

For matrices larger than 256 x 256, Strassen SSE is slightly faster than SSE.
  • setting a cutoff
    Because the Strassen algorithm adds execution time on small matrices, I tried setting a cutoff: when a split sub-matrix is smaller than a certain size, the original matrix multiplication is used instead. Experiment results:

| cutoff \ n | 8        | 16       | 32       | 64       | 128      | 256      | 512      | 1024     |
|------------|----------|----------|----------|----------|----------|----------|----------|----------|
| 8          | 0.000005 | 0.000011 | 0.000089 | 0.000669 | 0.004888 | 0.033680 | 0.236442 | 1.664202 |
| 16         | 0.000005 | 0.000011 | 0.000040 | 0.000326 | 0.002450 | 0.017589 | 0.123404 | 0.874777 |
| 32         | 0.000005 | 0.000011 | 0.000040 | 0.000247 | 0.001883 | 0.013629 | 0.095878 | 0.681866 |
| 64         | 0.000004 | 0.000011 | 0.000040 | 0.00245  | 0.001461 | 0.010796 | 0.075746 | 0.539311 |
| 128        | 0.000004 | 0.000011 | 0.000040 | 0.00247  | 0.001471 | 0.010397 | 0.072306 | 0.510256 |
| 256        | 0.000005 | 0.000011 | 0.000040 | 0.00251  | 0.001485 | 0.010220 | 0.082975 | 0.589007 |

Comparison

(updated with AVX loop unrolling)

  • as the matrices grow, the performance ranking is
    • avx > strassen_sse > sse > strassen > naive

reference