# CS267 HW1: Optimization of blocked matrix multiply
#### Group 07: Kiran Eiden, Luis Rangel DaCosta, Mark (Ziyao) Zhang
### Intro
Matrix multiplication is one of the most fundamental computations performed in scientific computing. Due to its importance, it has been studied extensively, and there are many strategies for improving the performance of a basic matrix multiply routine. In this report, we discuss an implementation of a blocked matrix multiplication kernel and how various optimization strategies---including parameter tuning, data reorganization, memory alignment, and vector intrinsics---affect its performance on the Knights Landing nodes available on NERSC's Cori supercomputing cluster.
### Optimization strategies
#### 1. Block size tuning
In a blocked matrix multiply algorithm, one of the first natural parameters to explore and optimize is the block size itself. In an idealized memory model, the ideal block size is such that the relevant blocks of $C$, $A$, and $B$ all fit together into the lowest level of cache. For a cache that holds $M$ elements, the ideal block dimension $b$ is then $\sqrt{\frac{M}{3}}$; for the Knights Landing architecture, with an L1 cache size of 32 kilobytes and matrices stored in double precision, $M=4096$ and thus the optimal $b$ is about 36 elements. With this in mind, we tested the basic implementation of the blocked algorithm over a range of block sizes to observe how performance evolved. The results can be seen below in Fig. 1. We note that the performance changes very little on an absolute scale, hovering around 1% of peak, over the range of block sizes from 8 through 36, which spans sizes at and below the ideal. While we would have expected the performance to be best at a block size of 32 or 36, we attribute the lack of improvement to the relatively naive implementation of the blocked algorithm, which likely does not make the best use of the block size with respect to memory movement anyway. For the rest of the analysis, we use a block size of 8: it is one of the "leading" configurations, and, anticipating future optimization strategies such as zero-padding, cache alignment, and an intrinsic kernel, keeping a block size of 8 allows for easy testing and comparison of performance.
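For reference, the baseline blocked loop nest we tuned looks roughly like the sketch below; `do_block` is a stand-in for the unblocked per-tile kernel, and the `min` handling reflects the fact that `lda` need not be a multiple of `BLOCK_SIZE`.
```c
#define BLOCK_SIZE 8
#define min(a, b) (((a) < (b)) ? (a) : (b))

// Unblocked per-tile kernel: C += A * B on an M x N x K tile (column-major).
void do_block(int lda, int M, int N, int K, double* A, double* B, double* C);

// Baseline blocked multiply on lda x lda column-major matrices.
void square_dgemm_blocked(int lda, double* A, double* B, double* C) {
    for (int i = 0; i < lda; i += BLOCK_SIZE) {
        for (int j = 0; j < lda; j += BLOCK_SIZE) {
            for (int k = 0; k < lda; k += BLOCK_SIZE) {
                int M = min(BLOCK_SIZE, lda - i);
                int N = min(BLOCK_SIZE, lda - j);
                int K = min(BLOCK_SIZE, lda - k);
                do_block(lda, M, N, K,
                         A + i + k * lda, B + k + j * lda, C + i + j * lda);
            }
        }
    }
}
```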

***Figure 1**: Performance of naive blocked-matrix multiply over a range of block sizes*
#### 2. Zero-padding and Input Transposition
For a given block size and arbitrarily sized matrix inputs, it is clear that many input matrices will not divide evenly into an integer number of blocks. This requires proper handling of boundaries, which can cause both logical overhead and irregular memory access patterns. To alleviate this mismatch in size without compromising the accuracy of the final answer, we can zero-pad the arrays until their sizes are integer multiples of the block size. This is accomplished by allocating new arrays for $A$, $B$, and $C$ of the desired size, copying the relevant data from $A$ and $B$ into their new padded versions, performing a simpler blocked multiplication without boundary handling, and copying the final results back into the original $C$ array. Given that each array only needs to be traversed once to set and copy its data, while the multiplication traverses the data many times, the cost of copying is negligible compared to the savings afforded by the regular splitting of the arrays. We observed an average performance of 2.34% of peak for the zero-padded arrays.
Since the blocked multiplication kernel iterates over elements in rows of $A$ and our data are stored column-wise, we can improve the spatial locality of $A$ by working with a transposed copy of $A$, so that we are effectively iterating over its columns and do not have to jump through memory in large strides. With the data copy already implemented for the zero-padding, it was relatively simple to implement the transposition. With the addition of transposition, we observe an increased average performance of 3.02% of peak. The behavior across array sizes for a subset of test arrays is shown below for both zero-padding alone and zero-padding with transposition of $A$. We note that, with solely zero-padding, there are strong characteristic drops in performance for arrays whose sizes are not multiples of the block size. This is partly due to the overhead incurred by padding the arrays, but, as the improvement from transposition shows, it is also due to the irregular access pattern of reading $A$ row-wise rather than column-wise.
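A minimal sketch of the padding and transposition setup is shown below; the function and buffer names are illustrative, and `ldap` denotes the padded leading dimension (rounded up to a multiple of the block size). $C$ is copied into its padded buffer as well so that any existing contents are preserved.
```c
#include <stdlib.h>

#define BLOCK_SIZE 8

// Illustrative setup: pad A, B, C to ldap x ldap and store A transposed.
void square_dgemm_padded(int lda, double* A, double* B, double* C) {
    int ldap = ((lda + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
    double* Apad = calloc((size_t) ldap * ldap, sizeof(double)); // zero-initialized
    double* Bpad = calloc((size_t) ldap * ldap, sizeof(double));
    double* Cpad = calloc((size_t) ldap * ldap, sizeof(double));

    for (int j = 0; j < lda; ++j) {
        for (int i = 0; i < lda; ++i) {
            Apad[j + i * ldap] = A[i + j * lda]; // transposed copy: rows of A become columns
            Bpad[i + j * ldap] = B[i + j * lda]; // padded copy of B
            Cpad[i + j * ldap] = C[i + j * lda]; // padded copy of C (existing contents kept)
        }
    }

    // ... blocked multiply on the padded copies (no boundary handling needed) ...

    // copy the result back into the original C array
    for (int j = 0; j < lda; ++j)
        for (int i = 0; i < lda; ++i)
            C[i + j * lda] = Cpad[i + j * ldap];

    free(Apad); free(Bpad); free(Cpad);
}
```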

***Figure 2**: Performance of matrix multiply after zero-padding input arrays and transposing $A$*
#### 3. Memory alignment
Memory alignment was one of the first ideas we tried. In theory, aligned memory should require fewer clock cycles for the CPU to fetch a given piece of data than unaligned memory. Furthermore, because the cache line on KNL processors is 64 bytes (via `getconf LEVEL1_DCACHE_LINESIZE`), setting the block size to 16 (128 bytes per column) means each block would contain exactly 32 cache lines' worth of data without fetching anything extra. Combined with zero-padding, this increased performance from the baseline to an average of 3.75% of peak. Since the block size is small, at larger array sizes we underutilize the cache and thus suffer a performance loss.
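A sketch of the aligned allocation, assuming `_mm_malloc`/`_mm_free` with 64-byte alignment (the helper name is hypothetical; `aligned_alloc` would serve equally well):
```c
#include <immintrin.h>  // _mm_malloc / _mm_free
#include <string.h>     // memset

// Allocate an ldap x ldap padded array on 64-byte (cache line) boundaries.
// Unlike calloc, _mm_malloc does not zero the memory, so clear it explicitly.
double* alloc_padded(int ldap) {
    size_t bytes = (size_t) ldap * ldap * sizeof(double);
    double* buf = (double*) _mm_malloc(bytes, 64);
    memset(buf, 0, bytes);
    return buf;
}

// ... Apad = alloc_padded(ldap); etc., and later _mm_free(Apad); ...
```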

***Figure 3**: Performance of matrix multiply with cache-aligned memory blocks and a block-size of 8*
#### 4. Utilizing FMA intrinsics
The next step in improving our performance involved utilizing the AVX512 intrinsics available on the Knights Landing architecture in order to perform vectorized SIMD instructions. Instead of operating on a single value of $A$, $B$, and $C$ at a time, the AVX512 instructions let us operate on 8 values of each at once. This would offer, in an idealized model, a speedup of 8x. The AVX512 set provides fused multiply-add (FMA) instructions, where the vector operation is $d[0:7] = a[0:7] \cdot b[0:7] + c[0:7]$ for doubles (using a 0-indexed notation). For a block size of 8, then, optimal use of the intrinsics would require only 64 FMA instructions, since each vector instruction performs 8 of the 512 scalar multiply-adds in an $8 \times 8 \times 8$ block update. One simple way to organize the instructions is to iterate over columns of $C$, which uses a column of $A$ and a scalar of $B$ broadcast across a vector. We include a sample snippet below for a single update (out of a total of eight) to a column $j$ of $C$; since the memory blocks have been aligned, we can use the aligned versions of the intrinsic calls.
```c
__m512d Cj, Ai, Bij;
Cj = _mm512_load_pd(&C[j*BLOCK_SIZE]);
Ai = _mm512_load_pd(&A[i*BLOCK_SIZE]);      // column i of A
Bij = _mm512_set1_pd(B[i + j*BLOCK_SIZE]);  // scalar B[i][j] broadcast across a vector
Cj = _mm512_fmadd_pd(Ai, Bij, Cj);
// repeat for the other seven values of i (loading a new Ai and Bij each time)
_mm512_store_pd(&C[j*BLOCK_SIZE], Cj);
```
For a block size of 8, we implement a version of this kernel in which all of the columns of $A$ are preloaded and the columns $C_j$ are looped over. In this iteration of development, we un-transposed $A$ so that the kernel accesses its data column-wise.
```c
// set intrinsic variables
__m512d Av0, Av1, Av2, Av3, Av4, Av5, Av6, Av7;
__m512d Bv0, Bv1, Bv2, Bv3, Bv4, Bv5, Bv6, Bv7;
__m512d Cv;
// load data onto registers
Av0 = _mm512_load_pd((__m512d *) &Apad[0*ldap]);
// ... load Av1 through Av7 similarly
// perform 64 FMA operations for a block size of 8
for(int i = 0; i < BLOCK_SIZE; ++i){
    Cv = _mm512_load_pd((__m512d *) &Cpad[i*ldap]);
    Bv0 = _mm512_set1_pd(Bpad[i*ldap+0]);
    // ... set Bv1 through Bv7
    Cv = _mm512_fmadd_pd(Av0, Bv0, Cv);
    Cv = _mm512_fmadd_pd(Av1, Bv1, Cv);
    // ... compute additions from Av2, Bv2, through Av7, Bv7 similarly
    // store C back into memory
    _mm512_store_pd(&Cpad[i*ldap], Cv);
}
```
This achieves an average performance of 10.85% of peak on the selected subset of test matrix sizes; a reordering of $B$ to improve the access pattern further increases the average performance to 18.72% of peak. The repacking/reordering code looked something like:
```c
double *Breo_p = &Breo[0];
for(int i = 0; i < ldap; i += KERNEL_SIZE)
{
    for(int j = 0; j < ldap; ++j)
    {
        double *Bpad_p = &Bpad[i + j*ldap];
        for(int k = 0; k < KERNEL_SIZE; ++k)
        {
            *(Breo_p + k) = *(Bpad_p + k);
        }
        Breo_p += KERNEL_SIZE;
    }
}
```
where we copy entries from the zero-padded $B$ matrix `Bpad` into the reordered version `Breo` (`ldap` is the padded version of `lda`). This packs the columns of each block of $B$ in sequential order, allowing the kernel to iterate through them contiguously. With the optimal loop permutation, performance increased to 19.67% on average.

***Figure 4**: Performance of matrix multiply using AVX512 intrinsic kernels with a block-size of 8*
It is not clear to us why we saw a performance dip around $n = 400$. This reordering of $B$ is similar to the block-major ordering discussed later, albeit less complete (that ordering also repacks $A$); the two were developed in parallel, and we opted to pursue the latter further. This version has somewhat less overhead, however, which led to higher performance for some test cases.
#### 5. Auto-Vectorization
Our initial finding regarding auto-vectorization was that adding the compiler flag `-ftree-loop-vectorize` had no impact on performance. We then realized that we (Kiran) had misinterpreted the `README.md` file, which suggested turning on optimizations in Debug mode using the `-O2` flag (the Release mode build was already compiled with `-O3`).
Examining the code with the [Compiler Explorer project](https://gcc.godbolt.org/z/v2fTDJ), it became clear that the compiler was already successfully vectorizing the innermost loop, and that a restructuring of the kernel would be needed to take full advantage of vector operations. It was not immediately clear how to do this.
In parallel, we looked into AVX-512 intrinsics and were able to restructure the code to use them effectively; this work is discussed in Section 4 above.
#### 6. Block-major ordering
After implementing the AVX512 intrinsics, and in conjunction with investigating the reordering of $B$, we implemented block-major ordering of the data. Block-major ordering was motivated by the excellent paper by Lam et al. [1]. The idea is to copy the memory block-by-block so that it is always read sequentially, as opposed to jumping in strides, when the expensive part of the algorithm executes. The first benefit is that we are able to use almost all of the data being read into the cache, so aligning memory to cache lines is no longer as crucial (even if we don't use all of a cache line in this iteration, we will use it in the next one). However, since we already have alignment implemented, the second benefit is more important: block-major ordering eliminates self-interference within a block (assuming a block fits in the cache). Self-interference is the eviction of data triggered by accesses to the same variable's data. Because KNL caches are likely not fully associative, if we march through a matrix in constant strides larger than 1, there can be a significant chance of filling a set and forcing previous entries to be evicted even though other parts of the cache are empty. If we read the data sequentially (stride 1), however, eviction only happens when the cache is actually full, regardless of its associativity. Therefore, block-major ordering should give us a significant performance boost. Indeed, we reached an average performance of 18.23% of peak over the full set of matrix sizes after block-major ordering was implemented; this performance is achieved with a full block size of 128x128 elements while keeping the FMA kernel size at 8x8 elements. We referenced the implementation in [2], though unlike [2], we did all of our reordering upfront for code clarity.
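A sketch of the kind of block-major repacking this entails is shown below; the function name and the particular block traversal order are illustrative, with `ldap` the padded leading dimension (a multiple of the block size) and `BLOCK_SIZE` the full block edge.
```c
#define BLOCK_SIZE 128  // full block edge used in this section

// Copy a padded, column-major matrix src (leading dimension ldap) into dst in
// block-major order: each BLOCK_SIZE x BLOCK_SIZE block occupies a contiguous,
// column-major chunk, so the expensive inner loops read memory with stride 1.
static void pack_block_major(int ldap, const double* src, double* dst) {
    for (int J = 0; J < ldap; J += BLOCK_SIZE) {        // block column
        for (int I = 0; I < ldap; I += BLOCK_SIZE) {    // block row
            for (int j = 0; j < BLOCK_SIZE; ++j) {      // column within block
                for (int i = 0; i < BLOCK_SIZE; ++i) {  // row within block
                    *dst++ = src[(I + i) + (J + j) * ldap];
                }
            }
        }
    }
}
```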

***Figure 5**: Performance of blocked-matrix multiply with block-major ordering*
However, we observed a significant performance dip for matrices of sizes around $n=512$. We theorize that this is due to 4k aliasing, a phenomenon where a load is stalled by a previous store to an address whose last 12 bits match, even though the full addresses differ [3]. This is due to a limitation in Intel processors where the load/store conflict detection initially compares only the last 12 bits of the addresses to decide whether to stall. To confirm our suspicion, we fired up Intel VTune to profile a run with only matrices of size $n=511$ and $n=512$. Unlike Haswell, Knights Landing processors do not have the event `LD_BLOCKS_PARTIAL.ADDRESS_ALIAS` to indicate 4k aliasing; instead, 4k aliasing events are bundled together with other store forwarding blocks under `LOAD_BLOCK.OVERLAP_STORE` [4]. Indeed, we observe a significant portion of loads blocked by store forwarding events in the results.

Looking closely at the code, we traced the source of the 4k aliasing to the 8x8 kernel:
```c
static void do_block_8x8_block_order(int ldap, double* A, double* B, double* C){
    // ...
    for(int i = 0; i < 8; ++i){
        Cv = _mm512_load_pd((__m512d *) &C[i*ldap]);
        // ...
        _mm512_store_pd(&C[i*ldap], Cv);
    }
    // ...
}
```
Here, a store in one iteration of the loop is immediately followed by a load in the next iteration. Because `ldap` is the padded matrix dimension, each iteration advances by `512 * sizeof(double) = 4096` bytes, so the accesses are exactly 4 kB apart!
To resolve this issue, we added extra logic to pad the matrix with an extra kernel's worth of zeros in both dimensions whenever the row stride would be a multiple of 4 kB. Afterwards, the dips at $n=511, 512, 1023$, and $1024$ disappeared, as shown below in the plot of performance over the full set of input sizes. We achieve an average performance of 18.15% of peak; we theorize that this average is slightly lower due to the small extra overhead of managing the 4k aliasing, and because only a small set of cases is improved.
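A sketch of the extra padding logic, under the assumption that the check is applied when computing the padded leading dimension (names are illustrative):
```c
#define KERNEL_SIZE 8  // edge length of the 8x8 SIMD kernel

// Round lda up to a multiple of the kernel size, then add one extra kernel's
// worth of padding if a full row stride would be an exact multiple of 4096
// bytes (which would make successive kernel rows alias in the low 12 bits).
int ldap = ((lda + KERNEL_SIZE - 1) / KERNEL_SIZE) * KERNEL_SIZE;
if ((ldap * sizeof(double)) % 4096 == 0) {
    ldap += KERNEL_SIZE;
}
```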

***Figure 6**: Performance of matrix multiply with block-major ordering and checks for 4k aliasing*
#### 7. Adding another level of blocking
In the previous section, we mentioned that we used a block size of 128. Because our previous block became an $8 \times 8$ SIMD kernel, changing its size is no longer trivial. Inspired by [2], we use that kernel as a minimum unit of computation and add another level of blocking so that we can make our blocks larger than $8 \times 8$. As a result, we can reuse data much more efficiently and take full advantage of the L2 cache.
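A sketch of the resulting second level of blocking is shown below; `do_block_large` is a hypothetical name, `do_block_8x8_block_order` is the SIMD kernel from the previous sections, and the indexing is written in plain column-major form for clarity (our actual implementation indexes into the block-major copies).
```c
#define BLOCK_SIZE 128  // larger block edge (e.g. 128 or 72)

// 8x8 SIMD kernel from the previous sections.
static void do_block_8x8_block_order(int ldap, double* A, double* B, double* C);

// Second level of blocking: multiply one BLOCK_SIZE x BLOCK_SIZE block pair by
// sweeping the 8x8 SIMD kernel over its tiles. A, B, C point to the top-left
// corners of the blocks; ldap is the padded leading dimension.
static void do_block_large(int ldap, double* A, double* B, double* C) {
    for (int j = 0; j < BLOCK_SIZE; j += 8) {
        for (int k = 0; k < BLOCK_SIZE; k += 8) {
            for (int i = 0; i < BLOCK_SIZE; i += 8) {
                do_block_8x8_block_order(ldap,
                                         &A[i + k * ldap],
                                         &B[k + j * ldap],
                                         &C[i + j * ldap]);
            }
        }
    }
}
```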
#### 8. Loop permutations
With the goal of improving temporal locality and minimizing cache misses, we explored different permutations of the loops in our base `square_dgemm` routine. From the outermost to the innermost loop, the original ordering was $i \rightarrow j \rightarrow k$, where $i$ is the index into rows of $A$, $j$ the index into columns of $B$, and $k$ the index into entries in the row/column pair. We suspected that there might be other orderings of the loops that would make better use of the on-chip cache, however.
The first alternate ordering we explored was $k \rightarrow j \rightarrow i$. In theory, swapping $i$ and $k$ would allow us to load each row of $A$ and column of $B$ just once for each iteration of the outermost loop, and having $i$ as the innermost loop would take advantage of the column-major ordering of the matrices. This only gave a significant performance boost for large block sizes, however (with the maximum boost being a factor $4\times$ for block size $b = \infty$, i.e., the naive algorithm).
Since the boost depended strongly on the setup of our kernel, we shelved this optimization approach while working on other improvements. Returning to it after introducing FMA intrinsics, we found that the ordering $j \rightarrow k \rightarrow i$ was ideal, increasing performance to $14.4\%$ of peak (from the $11\%$ achieved by the $i \rightarrow j \rightarrow k$ ordering). With the addition of block-major formatting of the matrices, we were able to increase that to nearly $20\%$ of maximum. This loop ordering may be optimal because it takes advantage of the FMA kernel layout, which iterates over columns of $B$ and $C$ and loads each entry of $B$ separately. Having the loop over $j$ as the outermost loop (followed by $k$) potentially lets us reuse cached columns of $B$ more easily with each kernel call.
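Concretely, the permutation applies to the outermost loops over blocks; below is a sketch in terms of the hypothetical `do_block_large` routine from the previous section, again with column-major indexing for clarity.
```c
// j -> k -> i ordering over blocks: for each block column of B/C (j) and each
// block of the inner dimension (k), sweep the block rows of A/C (i).
for (int j = 0; j < ldap; j += BLOCK_SIZE) {
    for (int k = 0; k < ldap; k += BLOCK_SIZE) {
        for (int i = 0; i < ldap; i += BLOCK_SIZE) {
            do_block_large(ldap,
                           &Apad[i + k * ldap],
                           &Bpad[k + j * ldap],
                           &Cpad[i + j * ldap]);
        }
    }
}
```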
#### 9. Loop unrolling
We also experimented with unrolling loops in the kernel, without much success. We initially hoped that tagging our `do_blocked` function with `__attribute__((optimize("unroll-loops")))` might produce some performance benefit. To our dismay, we found that it gave at best a performance increase of a few percent, comparable to the random fluctuations in performance.
Assuming a block size of $8$, we tried directly tagging the innermost loop of the kernel with `#pragma GCC unroll n` (for $n = 2, 4, 8$), without success. Perhaps unrolling would have been more effective for a different block size.
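For reference, the pragma was applied to loops of roughly this shape (the surrounding variables are illustrative):
```c
// Illustrative innermost loop of a scalar kernel, tagged with the unroll
// pragma (GCC 8+); the literal is the requested unroll factor.
#pragma GCC unroll 8
for (int k = 0; k < BLOCK_SIZE; ++k) {
    cij += A[i + k * lda] * B[k + j * lda];
}
```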
Our final implementation of the kernel with FMA intrinsics had only one loop (over the columns of $B$/$C$)---all other loops were unrolled. Manually unrolling this final loop did not provide a performance boost, so that loop was retained for conciseness.
#### 10. Summary of Final Configuration
For our final implementation, we tuned our block-major ordering to use a full block size of 72x72 elements and an intrinsic kernel of size 8, while keeping the zero-padding, the 4k aliasing prevention strategy, and the loop permutations previously described. This achieves an average performance of 19.22% of peak over the full range of test matrix sizes; the performance is plotted below in Figure 7. We settled on this final configuration because it achieves high performance among the configurations we tested while maintaining relatively smooth behavior over the full range of tested matrices.

***Figure 7**: Performance of final optimized configuration of blocked matrix multiply*
### Member Contributions
* Kiran Eiden worked on utilizing compiler optimizations (unrolling, auto-vectorization), loop permutations, and repacking/reordering of $B$ for better locality.
* Luis Rangel DaCosta worked on block size tuning, zero-padding, array transposition, and implementing FMA intrinsics.
* Mark (Ziyao) Zhang worked on memory alignment, zero-padding, block-major ordering, 4k aliasing mediation, and unit-test infrastructure.
### References
[1] M. S. Lam, E. E. Rothberg, and M. E. Wolf, "The Cache Performance and Optimizations of Blocked Algorithms." https://suif.stanford.edu/papers/lam-asplos91.pdf
[2] flame, "How to Optimize GEMM." https://github.com/flame/how-to-optimize-gemm
[3] R. Startin, "4K Aliasing." https://richardstartin.github.io/posts/4k-aliasing
[4] Intel 64 and IA-32 Architectures Optimization Reference Manual. https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf