# Optimizing matrix multiplication ###### tags: `aweimeow`, `sysprog`, `Team3` --- ## Basic Matrix Multiple Introduce ![Multiply Matrix](https://www.mathsisfun.com/algebra/images/matrix-multiply-a.gif) Let start from a short sweet multiply code: ``` for i = 1 to m for j = 1 to m for k = 1 to m C(i, j) = C(i, j) + A(i, k) * B(k, j) ``` ## Optimization Techniques ### L1 cache blocking optimizations Partition Big Matrix into uniform matrix, muitiply matrix in L1 cache, and then self-interface misses are minimized. ##### Calculate Optimal block size > $2 * (blockSize)^2 * wordSize = L1cachesize$ * blockSize: uniform matrix size > 文章中是以2的倍數,因為 L1 cache 一行有 4 words 的大小,選其他倍率效能會很差。 > 觀察從16到64之間 block sizes 效能設定。 * wordSize: Datatype in matrix (ex. double: 8 bytes) ### Compile flag Optimizations gcc provide various compiling options, and `-O4`, `-funroll-loops`, `-ffast-math`, but `-ffast-math` make opt to IEEE floating point standard will cause incorrect result. ### Software Optimizations * Software unrolling and register reuse * Unroll innermost loop, and load a common value in outer loop to register * avoid repeatedly fetched from L1 or L2 caches. * Avoiding pointer arithmetic in arrays * In array: `c[10]` is better than `*(c + 10)` * c[10] can be execute in one LOAD instructions * Avoiding inequality comparsions * BRNZ(Branch if not ZERO) extremely fast * replace inequality based for-loop to equality based for-loop * Interspacing add-multiply operations * add, multiply performed in different processing units ### Copy Optimizations 在矩陣乘法當中,乘法是以行(column)乘以列(row),且在此假設矩陣以 row-major 的方式儲存,那麼在 Matrix 的大小是 $2^N$ 的時候,column 中每一個元素的地址是被用 $2^N$ 去切開的。而也因為 L1 Cache 使用 direct mapped cache,這會導致大量的 Cache Misses。 「remedy」則是一個解決此問題的技術,在乘法運算開始之前,將第一個矩陣存成 Row-major 的格式,而第二個矩陣存成 Column-major 的格式,這種作法就達成了空間局域性(Space Locality) ### L2 Cache blocking optimizations 把 L2 Cache 也加進來算,這樣可以存放更多 Matrix 的資料 ## 閱讀資料 [Optimizing matrix multiplication](http://networks.cs.ucdavis.edu/~amitabha/optimizingMatrixMultiplication.pdf)