# Cache simulation for various matrix operations

> 黃灝

In this project, the inputs are 32×32 integer matrices.

## 1. Matrix multiplication

```cpp
// naive matrix multiplication (ijk order)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        C[i][j] = 0;
        for (int k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
```

### A. Different Cache Parameters

#### N-Way Set-Associative Cache

Fixed cache parameters:

* Cache level: 1
* Block size: 4 bytes
* Number of blocks: 16
* Block replacement policy: LRU

##### Direct mapped (1-way associative)

![image](https://hackmd.io/_uploads/SkjCVznD1x.png)

This picture shows the Ripes cache configured as direct mapped (1-way) with 16 blocks of 4 bytes each.

![image](https://hackmd.io/_uploads/S1P9eFCPyl.png)

##### 2-way associative

![image](https://hackmd.io/_uploads/BkfGjMhwyl.png)

This picture shows the cache configured as 2-way associative with 16 blocks of 4 bytes each.

![image](https://hackmd.io/_uploads/B1PJbY0vkl.png)

##### 4-way associative

![image](https://hackmd.io/_uploads/H11aiG3Dyl.png)

This picture shows the cache configured as 4-way associative with 16 blocks of 4 bytes each.

![image](https://hackmd.io/_uploads/SJAbbK0Pyg.png)

We can observe that increasing the number of ways, while keeping the number of blocks constant, improves the hit rate. Higher associativity reduces conflict misses, which occur when multiple memory blocks map to the same cache set. Moving from a direct-mapped (1-way) cache to an N-way set-associative cache lets each set hold more memory blocks, making the cache more adaptable to diverse access patterns.

#### Different Block Sizes

Fixed cache parameters:

* Cache level: 1
* Placement policy: Direct mapped
* Number of blocks: 16
* Block replacement policy: LRU

| Block Sizes | Result |
| ----------- |:---------------------------------------------------:|
| 4 bytes | ![image](https://hackmd.io/_uploads/S1kflFAwke.png)|
| 8 bytes | ![image](https://hackmd.io/_uploads/H1zNgYAwyl.png)|
| 16 bytes | ![image](https://hackmd.io/_uploads/Bk2UgtCw1l.png)|

Increasing the block size improves the hit rate by exploiting spatial locality: each memory access loads more adjacent data into the cache, so subsequent accesses to neighbouring elements hit.

### B. Loop Ordering

Different loop orders lead to different cache performance, primarily because of their impact on the spatial and temporal locality of the data accesses.

Fixed cache parameters:

* Cache level: 1
* Placement policy: 2-way associative
* Number of blocks: 32
* Block size: 16 bytes
* Block replacement policy: LRU

| loop | Result |
| -------- | -------- |
| ijk |![image](https://hackmd.io/_uploads/rJvs1KAPJl.png)|
| ikj |![image](https://hackmd.io/_uploads/ByRUkYRwkg.png)|
| kij |![image](https://hackmd.io/_uploads/ryknRuRwJl.png)|
| kji |![image](https://hackmd.io/_uploads/Hk5dAuAwkg.png)|
| jik |![image](https://hackmd.io/_uploads/Skn7yYAwke.png)|
| jki |![image](https://hackmd.io/_uploads/BJKgkKAPyg.png)|

We can observe that **ikj** and **kij** have higher hit rates because they make better use of spatial and temporal locality. In these orders the access to **B[k][j]** is row-wise, which matches the row-major memory layout and reduces cache misses. In addition, **A[i][k]** is loop-invariant in the innermost loop, so it can be held in a temporary variable.
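For reference, here is a minimal sketch of the **ikj** ordering from the table above, assuming the same `A`, `B`, `C`, and `n` as the code in section 1 and that `C` has been zero-initialized beforehand:

```cpp
// ikj ordering: the innermost loop walks B[k][*] and C[i][*] row-wise,
// which matches row-major storage and exploits spatial locality.
// Assumes C has already been set to zero.
for (int i = 0; i < n; i++) {
    for (int k = 0; k < n; k++) {
        int a = A[i][k];                // loop-invariant: kept in a temporary
        for (int j = 0; j < n; j++) {
            C[i][j] += a * B[k][j];     // sequential, cache-friendly accesses
        }
    }
}
```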
## 2. Matrix transpose

```cpp
// naive matrix transpose
for (int i = 0; i < size; i++) {
    for (int j = 0; j < size; j++) {
        dst[j][i] = src[i][j];
    }
}
```

Fixed cache parameters:

* Cache level: 1
* Placement policy: 2-way associative
* Number of blocks: 32
* Block size: 16 bytes
* Block replacement policy: LRU

Naive transpose:

![image](https://hackmd.io/_uploads/HJOdw1x_1x.png)

Transpose with blocking (see the transpose sketch after the references):

![image](https://hackmd.io/_uploads/BkRzFkluyg.png)

## 3. Sparse Matrix Multiplication

Sparse Matrix Multiplication (SpMM) is a computation technique specialized for multiplying sparse matrices, which contain a large number of zero elements. Instead of operating on all elements, SpMM touches only the non-zero values, saving both memory and computation.

### Matrix Multiplication

![image](https://hackmd.io/_uploads/S1yg6_Cw1x.png)

### Sparse Matrix Multiplication

![image](https://hackmd.io/_uploads/H1DbCOCDJe.png)

Sparse matrix multiplication improves efficiency by skipping zero elements, which reduces both the computational complexity and the number of memory accesses. Concentrating work on the non-zero elements significantly boosts performance.

## 4. General Matrix Multiplication

GEMM (General Matrix Multiplication) is the general matrix-multiplication kernel; optimized implementations improve cache performance by partitioning large matrices into smaller blocks and use vectorization and multi-threading to speed up the computation.

General Matrix Multiplication is defined as the operation $C = \alpha AB + \beta C$, with $A$ and $B$ as matrix inputs, $\alpha$ and $\beta$ as scalar inputs, and $C$ as a pre-existing matrix which is overwritten by the output. A plain matrix product $AB$ is a GEMM with $\alpha$ equal to one and $\beta$ equal to zero.

naive

![image](https://hackmd.io/_uploads/SyF_4kgOJg.png)

with blocking (see the tiled multiplication sketch after the references)

![image](https://hackmd.io/_uploads/ByuVByld1x.png)

Blocking improves cache performance by enhancing temporal and spatial locality. It divides the matrices into smaller tiles, so frequently accessed data stays in the cache longer, minimizing redundant memory accesses.

## Reference

- [github-tugbaguler/Spatial-Locality](https://github.com/tugbaguler/Spatial-Locality)
- [github-iVishalr/GEMM](https://github.com/iVishalr/GEMM)
- [github-manishravula/matmul_locality](https://github.com/manishravula/matmul_locality)
- [github-mortbopet/Ripes](https://github.com/mortbopet/Ripes/blob/master/docs/cache_sim.md)
- [Quiz7 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz7)
- [Matrix Multiplication Background User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html)
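As a reference for the "Transpose with blocking" result in section 2, here is a minimal sketch of a tiled transpose over the same `src`, `dst`, and `size`; the tile size `BLOCK` (8 here) is an assumed illustration value, not necessarily the one used in the measurement:

```cpp
// Blocked (tiled) transpose: each BLOCK x BLOCK tile of src and dst is
// finished while it is still resident in the cache, reducing the misses
// caused by the column-wise writes of the naive version.
const int BLOCK = 8;  // assumed tile size; tune to the cache configuration
for (int ii = 0; ii < size; ii += BLOCK) {
    for (int jj = 0; jj < size; jj += BLOCK) {
        for (int i = ii; i < ii + BLOCK && i < size; i++) {
            for (int j = jj; j < jj + BLOCK && j < size; j++) {
                dst[j][i] = src[i][j];
            }
        }
    }
}
```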
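Likewise, a minimal sketch of the tiled multiplication behind the "with blocking" GEMM result in section 4, using the same `A`, `B`, `C`, and `n` as section 1. For brevity it takes $\alpha = 1$ and $\beta = 1$, accumulating into the existing contents of `C`; the tile size `BLOCK` is again an assumed illustration value:

```cpp
// Tiled multiplication (alpha = 1, beta = 1 for brevity): each combination of
// BLOCK x BLOCK tiles from A, B, and C is processed while the tiles fit in the
// cache, so every loaded block is reused many times before being evicted.
const int BLOCK = 8;  // assumed tile size
for (int ii = 0; ii < n; ii += BLOCK) {
    for (int kk = 0; kk < n; kk += BLOCK) {
        for (int jj = 0; jj < n; jj += BLOCK) {
            for (int i = ii; i < ii + BLOCK && i < n; i++) {
                for (int k = kk; k < kk + BLOCK && k < n; k++) {
                    int a = A[i][k];  // loop-invariant in the innermost loop
                    for (int j = jj; j < jj + BLOCK && j < n; j++) {
                        C[i][j] += a * B[k][j];
                    }
                }
            }
        }
    }
}
```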