# Cache simulation for various matrix operations
> 黃灝
In this project, the inputs are 32×32 integer matrices.
## 1. Matrix multiplication
```cpp
// Matrix multiplication
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        C[i][j] = 0;
        for (int k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
```
### A. Different Cache Parameters
#### N-Way Set-Associative Cache
Fixed Cache Parameters:
* Cache level : 1
* Block size : 4 bytes
* Number of Blocks: 16
* Block Replacement Policy: LRU
##### Direct-Mapped (1-Way Set-Associative)

This picture shows a direct-mapped (1-way) cache with 16 blocks, each 4 bytes in size (simulated in Ripes).

##### 2-Way Set-Associative

This picture shows a 2-way set-associative cache with 16 blocks, each 4 bytes in size.

##### 4-Way Set-Associative

This picture shows a 4-way set-associative cache with 16 blocks, each 4 bytes in size.

We can observe that increasing the number of ways in a cache, while keeping the number of blocks constant, can improve the hit rate. This is because a higher associativity reduces conflict misses, which occur when multiple memory blocks map to the same cache set. By transitioning from direct-mapped (1-way associative) to N-way associative caches, each set can store more memory blocks, making the cache more adaptable to diverse access patterns.
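To make the set mapping concrete, the sketch below (not part of the simulated program; the helper name `setIndex` and its parameters are illustrative) computes which set a byte address falls into for a given associativity. With the block count fixed, more ways means fewer sets, so addresses that collide in a direct-mapped cache can coexist within one set of an associative cache.
```cpp
#include <cstdint>
#include <cstdio>

// Illustrative sketch: map a byte address to a cache set index.
// With the total number of blocks fixed, using W ways gives numBlocks / W sets,
// so higher associativity lets colliding blocks share a set instead of evicting
// each other (fewer conflict misses).
uint32_t setIndex(uint32_t addr, uint32_t blockSize, uint32_t numBlocks, uint32_t ways) {
    uint32_t numSets = numBlocks / ways;
    return (addr / blockSize) % numSets;
}

int main() {
    // Two addresses 64 bytes apart map to the same set when direct-mapped,
    // so they conflict with only 1 way...
    printf("%u %u\n", setIndex(0x100, 4, 16, 1), setIndex(0x140, 4, 16, 1));
    // ...with 2 ways they still share a set, but both blocks can now coexist.
    printf("%u %u\n", setIndex(0x100, 4, 16, 2), setIndex(0x140, 4, 16, 2));
    return 0;
}
```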
#### Different Block Sizes
Fixed Cache Parameters:
* Cache level : 1
* Placement Policy : Direct Mapped
* Number of Blocks: 16
* Block Replacement Policy: LRU
| Block Size | Result |
| ---------- |:------:|
| 4 bytes    |        |
| 8 bytes    |        |
| 16 bytes   |        |
Increasing the block size improves the cache hit rate by leveraging spatial locality. Larger block sizes allow more adjacent data to be loaded into the cache with each memory access.
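As a rough worked example (assuming 4-byte integers and a purely sequential access stream): with 4-byte blocks every element falls in a new block and misses, while a 16-byte block holds 4 consecutive elements, so each miss is followed by up to 3 hits, for a best-case hit rate of $1 - 1/4 = 75\%$ on that stream.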
### B. Loop Ordering
We know that different loop orders can lead to variations in cache performance, primarily because of their impact on the spatial and temporal locality of the data accesses.
Fixed Cache Parameters:
* Cache level : 1
* Placement Policy : 2-way set-associative
* Number of Blocks: 32
* Block size : 16 bytes
* Block Replacement Policy: LRU
| Loop order | Result |
| ---------- | ------ |
| ijk        |        |
| ikj        |        |
| kij        |        |
| kji        |        |
| jik        |        |
| jki        |        |
We can observe that **ikj** and **kij** achieve higher hit rates because they make more efficient use of spatial and temporal locality. In these loop orders, **B[k][j]** is accessed row-wise in the innermost loop, which matches the row-major memory layout and reduces cache misses. Additionally, **A[i][k]** is loop-invariant in the innermost loop, so it can be kept in a temporary variable instead of being re-read from memory.
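A minimal sketch of the **ikj** ordering with the loop-invariant **A[i][k]** hoisted into a temporary (the variable name `a` is illustrative):
```cpp
// ikj ordering: both C[i][j] and B[k][j] are traversed row-wise in the
// innermost loop, matching row-major storage.
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        C[i][j] = 0;
    }
    for (int k = 0; k < n; k++) {
        int a = A[i][k];               // loop-invariant in the inner loop
        for (int j = 0; j < n; j++) {
            C[i][j] += a * B[k][j];
        }
    }
}
```
Because the innermost loop now walks **C** and **B** along rows, every cache block that is brought in is fully used before it is evicted.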
## 2. Matrix transpose
```cpp
// Matrix transpose
for (int i = 0; i < size; i++) {
    for (int j = 0; j < size; j++) {
        dst[j][i] = src[i][j];
    }
}
```
Fixed Cache Parameters:
* Cache level : 1
* Placement Policy : 2-way set-associative
* Number of Blocks: 32
* Block size : 16 bytes
* Block Replacement Policy: LRU

Transpose with blocking
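The naive transpose writes **dst** column-wise, so consecutive stores touch different cache blocks and a block is often evicted before its neighbouring elements are written. A minimal sketch of the blocked (tiled) version, assuming an illustrative tile size `BLOCK` of 8 chosen so that the touched rows of **src** and **dst** fit in the cache:
```cpp
// Blocked (tiled) transpose: handle the matrix in BLOCK x BLOCK tiles so the
// rows of src and dst touched by one tile stay resident in the cache.
const int BLOCK = 8;   // illustrative tile size
for (int ii = 0; ii < size; ii += BLOCK) {
    for (int jj = 0; jj < size; jj += BLOCK) {
        for (int i = ii; i < ii + BLOCK && i < size; i++) {
            for (int j = jj; j < jj + BLOCK && j < size; j++) {
                dst[j][i] = src[i][j];
            }
        }
    }
}
```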

## 3. Sparse Matrix Multiplication
Sparse Matrix Multiplication (SpMM) is a specialized computation technique designed for multiplying sparse matrices, which contain a significant number of zero elements. Instead of performing operations on all elements, SpMM focuses only on the non-zero values, saving both memory and computational resources.
### Matrix Multiplication

### Sparse Matrix Multiplication

Sparse matrix multiplication improves efficiency by skipping zero elements, reducing computational complexity and memory accesses. It focuses resources on the non-zero elements, which significantly boosts performance.
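A minimal sketch of sparse × dense multiplication with the sparse matrix stored in CSR (compressed sparse row) format; the structure and field names (`rowPtr`, `colIdx`, `vals`) are illustrative, not taken from the project code:
```cpp
#include <algorithm>
#include <vector>

// Illustrative CSR (compressed sparse row) representation of a sparse matrix A.
struct CSRMatrix {
    int n;                     // dimension (n x n)
    std::vector<int> rowPtr;   // size n + 1: start of each row in colIdx/vals
    std::vector<int> colIdx;   // column index of each non-zero element
    std::vector<int> vals;     // value of each non-zero element
};

// C = A * B, where A is sparse (CSR) and B, C are dense n x n matrices.
// Only the non-zero entries of A are visited, so zero elements cost nothing.
void spmm(const CSRMatrix& A,
          const std::vector<std::vector<int>>& B,
          std::vector<std::vector<int>>& C) {
    for (int i = 0; i < A.n; i++) {
        std::fill(C[i].begin(), C[i].end(), 0);
        for (int p = A.rowPtr[i]; p < A.rowPtr[i + 1]; p++) {
            int k = A.colIdx[p];         // column of this non-zero in A
            int a = A.vals[p];
            for (int j = 0; j < A.n; j++) {
                C[i][j] += a * B[k][j];  // row-wise access to B and C
            }
        }
    }
}
```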
## 4. General Matrix Multiplication
GEMM (General Matrix Multiplication) is the standard dense matrix multiplication operation; optimized GEMM implementations improve cache performance by partitioning large matrices into smaller blocks and use vectorization and multi-threading to speed up the computation.
General Matrix Multiplication is defined as the operation $C = \alpha AB + \beta C$, with $A$ and $B$ as matrix inputs, $\alpha$ and $\beta$ as scalar inputs, and $C$ as a pre-existing matrix that is overwritten by the output. A plain matrix product $AB$ is a GEMM with $\alpha$ equal to one and $\beta$ equal to zero.
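A minimal sketch of a naive GEMM loop that directly implements $C = \alpha AB + \beta C$ (the function signature and the fixed 32-column array type are illustrative, chosen to match the 32×32 integer matrices used in this project):
```cpp
// Naive GEMM: C = alpha * A * B + beta * C for n x n matrices
// (the fixed 32-column array type matches the 32x32 input used here).
void gemm_naive(int n, int alpha, int beta,
                const int A[][32], const int B[][32], int C[][32]) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            C[i][j] *= beta;                        // scale the existing C once
            for (int k = 0; k < n; k++) {
                C[i][j] += alpha * A[i][k] * B[k][j];
            }
        }
    }
}
```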
Naive

With blocking

Blocking improves cache performance by enhancing temporal and spatial locality. It divides the matrices into smaller tiles, allowing frequently accessed data to stay in the cache longer and minimizing redundant memory accesses.
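A minimal sketch of the blocked variant, assuming an illustrative tile size `T` of 8 that would in practice be tuned to the cache size:
```cpp
// Blocked GEMM: C = alpha * A * B + beta * C, computed tile by tile so that
// the A, B, and C tiles currently being worked on stay resident in the cache.
void gemm_blocked(int n, int alpha, int beta,
                  const int A[][32], const int B[][32], int C[][32]) {
    const int T = 8;                               // illustrative tile size
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            C[i][j] *= beta;                       // scale C once up front
    for (int ii = 0; ii < n; ii += T)
        for (int kk = 0; kk < n; kk += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = ii; i < ii + T && i < n; i++)
                    for (int k = kk; k < kk + T && k < n; k++) {
                        int a = alpha * A[i][k];   // loop-invariant in j
                        for (int j = jj; j < jj + T && j < n; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```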
## Reference
[github-tugbaguler/Spatial-Locality](https://github.com/tugbaguler/Spatial-Locality)
[github-iVishalr/GEMM](https://github.com/iVishalr/GEMM)
[github-manishravula/matmul_locality](https://github.com/manishravula/matmul_locality)
[github-mortbopet/Ripes](https://github.com/mortbopet/Ripes/blob/master/docs/cache_sim.md)
[Quiz7 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz7)
[Matrix Multiplication Background User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html)