# Cache simulation for various matrix operations
> 黃灝
In this project, the inputs are 32×32 integer matrices.
## 1. Matrix multiplication
```cpp
// Matrix multiplication
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        C[i][j] = 0;
        for (int k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
```
### A. Different Cache Parameters
#### N-Way Set-Associative Cache
Fixed Cache Parameters:
* Cache level : 1
* Block size : 4 bytes
* Number of Blocks: 16
* Block Replacement Policy: LRU
##### Direct-Mapped (1-Way Set-Associative)

This picture shows a direct-mapped (1-way) cache with 16 blocks, each 4 bytes in size (simulated in Ripes).

##### 2-Way Set-Associative

This picture shows a 2-way set-associative cache with 16 blocks, each 4 bytes in size.

##### 4-Way Set-Associative

This picture shows a 4-way set-associative cache with 16 blocks, each 4 bytes in size.

We can observe that increasing the number of ways in a cache, while keeping the number of blocks constant, can improve the hit rate. This is because a higher associativity reduces conflict misses, which occur when multiple memory blocks map to the same cache set. By transitioning from direct-mapped (1-way associative) to N-way associative caches, each set can store more memory blocks, making the cache more adaptable to diverse access patterns.
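To make the set mapping concrete, the sketch below (not part of the simulated program; the helper name `setIndex` and its parameters are illustrative) computes which set a byte address falls into for a given associativity. With the block count fixed, more ways means fewer sets, so addresses that collide in a direct-mapped cache can coexist within one set of an associative cache.
```cpp
#include <cstdint>
#include <cstdio>

// Illustrative sketch: map a byte address to a cache set index.
// With the total number of blocks fixed, using W ways gives numBlocks / W sets,
// so higher associativity lets colliding blocks share a set instead of evicting
// each other (fewer conflict misses).
uint32_t setIndex(uint32_t addr, uint32_t blockSize, uint32_t numBlocks, uint32_t ways) {
    uint32_t numSets = numBlocks / ways;
    return (addr / blockSize) % numSets;
}

int main() {
    // Two addresses 64 bytes apart map to the same set when direct-mapped,
    // so they conflict with only 1 way...
    printf("%u %u\n", setIndex(0x100, 4, 16, 1), setIndex(0x140, 4, 16, 1));
    // ...with 2 ways they still share a set, but both blocks can now coexist.
    printf("%u %u\n", setIndex(0x100, 4, 16, 2), setIndex(0x140, 4, 16, 2));
    return 0;
}
```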
#### Different Block Sizes
Fixed Cache Parameters:
* Cache level : 1
* Placement Policy : Direct Mapped
* Number of Blocks: 16
* Block Replacement Policy: LRU
| Block Size | Result |
| ---------- |:------:|
| 4 bytes    |        |
| 8 bytes    |        |
| 16 bytes   |        |
Increasing the block size improves the cache hit rate by leveraging spatial locality. Larger block sizes allow more adjacent data to be loaded into the cache with each memory access.
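As a rough worked example (assuming 4-byte integers and a purely sequential access stream): with 4-byte blocks every element falls in a new block and misses, while a 16-byte block holds 4 consecutive elements, so each miss is followed by up to 3 hits, for a best-case hit rate of $1 - 1/4 = 75\%$ on that stream.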
### B. Loop Ordering
We know that different loop orders can lead to variations in cache performance, primarily because of their impact on the spatial and temporal locality of the data accesses.
Fixed Cache Parameters:
* Cache level : 1
* Placement Policy : 2-way set-associative
* Number of Blocks: 32
* Block size : 16 bytes
* Block Replacement Policy: LRU
| Loop order | Result |
| ---------- | ------ |
| ijk        |        |
| ikj        |        |
| kij        |        |
| kji        |        |
| jik        |        |
| jki        |        |
We can observe that **ikj** and **kij** achieve higher hit rates because they make more efficient use of spatial and temporal locality. In these loop orders, **B[k][j]** is accessed row-wise in the innermost loop, which matches the row-major memory layout and reduces cache misses. Additionally, **A[i][k]** is loop-invariant in the innermost loop, so it can be kept in a temporary variable instead of being re-read from memory.
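A minimal sketch of the **ikj** ordering with the loop-invariant **A[i][k]** hoisted into a temporary (the variable name `a` is illustrative):
```cpp
// ikj ordering: both C[i][j] and B[k][j] are traversed row-wise in the
// innermost loop, matching row-major storage.
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        C[i][j] = 0;
    }
    for (int k = 0; k < n; k++) {
        int a = A[i][k];               // loop-invariant in the inner loop
        for (int j = 0; j < n; j++) {
            C[i][j] += a * B[k][j];
        }
    }
}
```
Because the innermost loop now walks **C** and **B** along rows, every cache block that is brought in is fully used before it is evicted.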
## 2. Matrix transpose
```cpp
// Matrix transpose
for (int i = 0; i < size; i++) {
    for (int j = 0; j < size; j++) {
        dst[j][i] = src[i][j];
    }
}
```
Fixed Cache Parameters:
* Cache level : 1
* Placement Policy : 2-way set-associative
* Number of Blocks: 32
* Block size : 16 bytes
* Block Replacement Policy: LRU

Transpose with blocking
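The naive transpose writes **dst** column-wise, so consecutive stores touch different cache blocks and a block is often evicted before its neighbouring elements are written. A minimal sketch of the blocked (tiled) version, assuming an illustrative tile size `BLOCK` of 8 chosen so that the touched rows of **src** and **dst** fit in the cache:
```cpp
// Blocked (tiled) transpose: handle the matrix in BLOCK x BLOCK tiles so the
// rows of src and dst touched by one tile stay resident in the cache.
const int BLOCK = 8;   // illustrative tile size
for (int ii = 0; ii < size; ii += BLOCK) {
    for (int jj = 0; jj < size; jj += BLOCK) {
        for (int i = ii; i < ii + BLOCK && i < size; i++) {
            for (int j = jj; j < jj + BLOCK && j < size; j++) {
                dst[j][i] = src[i][j];
            }
        }
    }
}
```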

## 3. Sparse Matrix Multiplication
Sparse Matrix Multiplication (SpMM) is a specialized computation technique designed for multiplying sparse matrices, which contain a significant number of zero elements. Instead of performing operations on all elements, SpMM focuses only on the non-zero values, saving both memory and computational resources.
### Matrix Multiplication

### Sparse Matrix Multiplication

Sparse matrix multiplication improves efficiency by skipping zero elements, reducing computational complexity and memory accesses. It focuses resources on the non-zero elements, which significantly boosts performance.
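A minimal sketch of sparse × dense multiplication with the sparse matrix stored in CSR (compressed sparse row) format; the structure and field names (`rowPtr`, `colIdx`, `vals`) are illustrative, not taken from the project code:
```cpp
#include <algorithm>
#include <vector>

// Illustrative CSR (compressed sparse row) representation of a sparse matrix A.
struct CSRMatrix {
    int n;                     // dimension (n x n)
    std::vector<int> rowPtr;   // size n + 1: start of each row in colIdx/vals
    std::vector<int> colIdx;   // column index of each non-zero element
    std::vector<int> vals;     // value of each non-zero element
};

// C = A * B, where A is sparse (CSR) and B, C are dense n x n matrices.
// Only the non-zero entries of A are visited, so zero elements cost nothing.
void spmm(const CSRMatrix& A,
          const std::vector<std::vector<int>>& B,
          std::vector<std::vector<int>>& C) {
    for (int i = 0; i < A.n; i++) {
        std::fill(C[i].begin(), C[i].end(), 0);
        for (int p = A.rowPtr[i]; p < A.rowPtr[i + 1]; p++) {
            int k = A.colIdx[p];         // column of this non-zero in A
            int a = A.vals[p];
            for (int j = 0; j < A.n; j++) {
                C[i][j] += a * B[k][j];  // row-wise access to B and C
            }
        }
    }
}
```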
## 4. General Matrix Multiplication
GEMM (General Matrix Multiplication) is the standard dense matrix multiplication operation; optimized GEMM implementations improve cache performance by partitioning large matrices into smaller blocks and use vectorization and multi-threading to speed up the computation.
General Matrix Multiplication is defined as the operation $C = \alpha AB + \beta C$, with $A$ and $B$ as matrix inputs, $\alpha$ and $\beta$ as scalar inputs, and $C$ as a pre-existing matrix that is overwritten by the output. A plain matrix product $AB$ is a GEMM with $\alpha$ equal to one and $\beta$ equal to zero.
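A minimal sketch of a naive GEMM loop that directly implements $C = \alpha AB + \beta C$ (the function signature and the fixed 32-column array type are illustrative, chosen to match the 32×32 integer matrices used in this project):
```cpp
// Naive GEMM: C = alpha * A * B + beta * C for n x n matrices
// (the fixed 32-column array type matches the 32x32 input used here).
void gemm_naive(int n, int alpha, int beta,
                const int A[][32], const int B[][32], int C[][32]) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            C[i][j] *= beta;                        // scale the existing C once
            for (int k = 0; k < n; k++) {
                C[i][j] += alpha * A[i][k] * B[k][j];
            }
        }
    }
}
```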
Naive

With blocking

Blocking improves cache performance by enhancing temporal and spatial locality. It divides the matrices into smaller tiles, allowing frequently accessed data to stay in the cache longer and minimizing redundant memory accesses.
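A minimal sketch of the blocked variant, assuming an illustrative tile size `T` of 8 that would in practice be tuned to the cache size:
```cpp
// Blocked GEMM: C = alpha * A * B + beta * C, computed tile by tile so that
// the A, B, and C tiles currently being worked on stay resident in the cache.
void gemm_blocked(int n, int alpha, int beta,
                  const int A[][32], const int B[][32], int C[][32]) {
    const int T = 8;                               // illustrative tile size
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i][j] *= beta;                       // scale C once up front
    for (int ii = 0; ii < n; ii += T)
        for (int kk = 0; kk < n; kk += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = ii; i < ii + T && i < n; i++)
                    for (int k = kk; k < kk + T && k < n; k++) {
                        int a = alpha * A[i][k];   // loop-invariant in j
                        for (int j = jj; j < jj + T && j < n; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```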
## Reference
[github-tugbaguler/Spatial-Locality](https://github.com/tugbaguler/Spatial-Locality)
[github-iVishalr/GEMM](https://github.com/iVishalr/GEMM)
[github-manishravula/matmul_locality](https://github.com/manishravula/matmul_locality)
[github-mortbopet/Ripes](https://github.com/mortbopet/Ripes/blob/master/docs/cache_sim.md)
[Quiz7 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz7)
[Matrix Multiplication Background User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html)