Assignment4: Cache

tags: riscv

Cache.s

```asm
.data
array:  .word 2048      # max array size specified in BYTES (DO NOT CHANGE)

.text
##################################################################################################
# You MAY change the code below this section
main:
    li a0, 256          # array size in BYTES (power of 2 < array size)
    li a1, 2            # step size (power of 2 > 0)
    li a2, 1            # rep count (int > 0)
    li a3, 1            # 0 - option 0, 1 - option 1
# You MAY change the code above this section
##################################################################################################
    jal accessWords     # lw/sw
    #jal accessBytes    # lb/sb
    li a7, 10           # exit (register changed from a0 to a7 for Ripes)
    ecall

# SUMMARY OF REGISTER USE:
#  a0 = array size in bytes
#  a1 = step size
#  a2 = number of times to repeat
#  a3 = 0 (W) / 1 (RW)
#  s0 = moving array ptr
#  s1 = array limit (ptr)

accessWords:
    la   s0, array      # ptr to array
    add  s1, s0, a0     # hardcode array limit (ptr)
    slli t1, a1, 2      # multiply stepsize by 4 because WORDS
wordLoop:
    beq  a3, zero, wordZero
    lw   t0, 0(s0)      # array[index/4]++
    addi t0, t0, 1
    sw   t0, 0(s0)
    j    wordCheck
wordZero:
    sw   zero, 0(s0)    # array[index/4] = 0
wordCheck:
    add  s0, s0, t1          # increment ptr
    blt  s0, s1, wordLoop    # inner loop done?
    addi a2, a2, -1
    bgtz a2, accessWords     # outer loop done?
    jr   ra
...
```

We focus on word access here.

A brief description of cache.s

# PSEUDOCODE:
#    int array[];  //Assume sizeof(int) == 4
#    for (k = 0; k < repcount; k++) {		// repeat repcount times
#      /* Step through the selected array segment with the given step size. */
#      for (index = 0; index < arraysize; index += stepsize) {
#        if(option==0)
#          array[index] = 0;			// Option 0: One cache access - write
#        else
#          array[index] = array[index] + 1;	// Option 1: Two cache accesses - read AND write
#      }
#    }

# SUMMARY OF REGISTER USE:
#  a0 = array size in bytes
#  a1 = step size
#  a2 = number of times to repeat
#  a3 = 0 (W) / 1 (RW)
#  s0 = moving array ptr
#  s1 = array limit (ptr)
  1. accessWords

     la   s0, array      # ptr to array
     add  s1, s0, a0     # hardcode array limit (ptr)
     slli t1, a1, 2      # multiply stepsize by 4 because WORDS

     s0 holds the start address of the array, and s1 holds the end address (start plus the array size in bytes).
     t1 holds the step size converted to bytes (4 × a1), i.e. how far the pointer advances each iteration.
  2. wordLoop

     beq  a3, zero, wordZero
     lw   t0, 0(s0)      # array[index/4]++
     addi t0, t0, 1
     sw   t0, 0(s0)
     j    wordCheck

     Check whether a3 equals 0; if it does, jump to wordZero.
     Otherwise, load the word at s0 (the current position in the array) into t0, add 1, and store it back (array[index] = array[index] + 1).
     Then jump to wordCheck.
  3. wordCheck

     add  s0, s0, t1          # increment ptr
     blt  s0, s1, wordLoop    # inner loop done?
     addi a2, a2, -1
     bgtz a2, accessWords     # outer loop done?
     jr   ra

     Advance the array pointer (s0) to the next position with add s0, s0, t1.
     If s0 < s1 (blt), jump back to wordLoop: the inner loop is not done and we have not yet touched the whole array segment.
     a2 holds the repeat count, so decrement it by 1 and, if it is still greater than 0 (bgtz), start the next pass from accessWords; otherwise return.
  4. wordZero

     sw   zero, 0(s0)    # array[index/4] = 0

     When the option is 0, set array[index] to 0 (a single write access per index).

Observation in Ripes

  • The program parameters setup

    #  a0 = 8, array size in bytes
    #  a1 = 1, step size --> 1 * 4 bytes = 4 bytes
    #  a2 = 1, number of times to repeat
    #  a3 = 1, 0 (W) / 1 (RW)
    


  • Observation for Memory


    1. Here the array pointer starts at address 0x10000000 and the limit pointer is 0x10000008. The total array size is 8 bytes (2 words).
    2. Since the step size is 1 word, each iteration accesses the next word.
    3. Schematic diagram

      | Array | Data | Address    |
      |-------|------|------------|
      | End   |      | 0x10000008 |
      | -     | 0x0  | 0x10000007 |
      | -     | 0x0  | 0x10000006 |
      | -     | 0x0  | 0x10000005 |
      | M[1]  | 0x1  | 0x10000004 |
      | -     | 0x0  | 0x10000003 |
      | -     | 0x0  | 0x10000002 |
      | -     | 0x8  | 0x10000001 |
      | M[0]  | 0x1  | 0x10000000 |

      Why is the data at 0x10000001 always 0x8? And why is the address 0x1000001c always accessed?

  • Another case

    #  a0 = 12, array size in bytes
    #  a1 = 2, step size --> 2 * 4 bytes = 8 bytes
    #  a2 = 3, number of times to repeat
    #  a3 = 1, 0 (W) / 1 (RW)
    


    The step size is 2 words, so the program accesses addresses 0x10000000, 0x10000008, and 0x10000010; but 0x10000010 is beyond the limit address 0x1000000c, so no data is written there.
    And the program repeats 3 times, so 0x10000000 and 0x10000008 end up holding the value 3 (see the small C model below).
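    As a cross-check, here is a small C model of this case (our own sketch of what accessWords does with a0 = 12, a1 = 2, a2 = 3, a3 = 1); it predicts exactly these final memory contents:

    ```c
    #include <stdio.h>

    int main(void) {
        int array[3] = {0, 0, 0};                      /* 12 bytes = 3 words */
        int sizeWords = 3, step = 2, reps = 3;

        for (int r = 0; r < reps; r++)                 /* outer loop: rep count */
            for (int i = 0; i < sizeWords; i += step)  /* inner loop: step size */
                array[i] += 1;                         /* option 1: read + write */

        for (int i = 0; i < sizeWords; i++)
            printf("0x%08x: %d\n", 0x10000000 + 4 * i, array[i]);
        /* prints: 0x10000000: 3, 0x10000004: 0, 0x10000008: 3
         * (the next step would be 0x10000010, past the limit 0x1000000c) */
        return 0;
    }
    ```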


Review - Hit and Miss Policies

  • Hit

    • Read hits: Return the requested data from the cache to the CPU.
    • Write hits:
      1. Write-Through Policy: Always write data to the cache and to memory (through the cache); this forces cache and memory to always be consistent, is slow, and usually includes a Write Buffer that updates memory in parallel with the processor.
      2. Write-Back Policy: Write data only to the cache, then update memory when the block is evicted; this allows cache and memory to be inconsistent, and requires a dirty bit per cache row, set if the block was written to (is dirty) and must therefore be written back.
      3. Write-Around Policy: In every situation, data is written to main memory only; if the block we're writing is present in the cache, its valid bit is set to invalid.
        Essentially, there is no such thing as a write hit in a write-around cache; a write "hit" does the same thing as a write miss.
  • Miss

    • Read misses: Stall execution, fetch block from memory, put in cache, send requested data to processor, resume.
    • Write misses:
      1. We always have to update the block in memory.
      2. Do we also bring the updated block into the cache?
        • Write Allocate: bring the block into the cache after a write miss; usually paired with write-back (good when the same address is accessed many times).
        • No Write Allocate: only change main memory after a write miss; typically paired with write-through (good for infrequent/random writes).
  • Replacement policies

    • LRU (Least recently used): When we decide to evict a cache block to make space, we select the block that has been used farthest back in time of all the other blocks.
    • Random: When we decide to evict a cache block to make space, we randomly select one of the blocks in the cache to evict.
    • MRU (Most recently used): When we decide to evict a cache block to make space, we select the block that has been used most recently of all the other available blocks.

The Venus cache simulator currently simulates a write-through, write-allocate cache.
So we need to choose the same options in Ripes for our experiment.
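To make these policies concrete, here is a minimal direct-mapped cache model in C (our own sketch, not Ripes internals): it implements write-allocate on misses and only counts hits and misses, which is all that write-through needs for hit-rate prediction. The parameters match Scenario 1 below.

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 4
#define BLOCK_SIZE 8                     /* bytes per block */

typedef struct { bool valid; unsigned tag; } Line;

static Line cache[NUM_BLOCKS];
static int hits, misses;

static void cache_access(unsigned addr) {
    unsigned block = addr / BLOCK_SIZE;  /* which memory block   */
    unsigned index = block % NUM_BLOCKS; /* which cache line     */
    unsigned tag   = block / NUM_BLOCKS; /* remaining upper bits */
    if (cache[index].valid && cache[index].tag == tag) {
        hits++;
    } else {                             /* write-allocate: install on miss */
        misses++;
        cache[index].valid = true;
        cache[index].tag   = tag;
    }
}

int main(void) {
    unsigned base = 0x10000000;
    int arraySize = 128, step = 8, reps = 4, option = 0;  /* Scenario 1 */
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < arraySize; i += step * 4) {   /* step in words */
            if (option == 1) cache_access(base + i);      /* lw (option 1) */
            cache_access(base + i);                       /* sw */
        }
    printf("hits=%d misses=%d hit rate=%.4f\n",
           hits, misses, (double)hits / (hits + misses));
    return 0;
}
```

With Scenario 1's parameters it prints hits=0, misses=16: every store maps to index 0 with a different tag.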


Exercise 1 - A Couple of Memory Access Scenarios

Task:
Simulate the following scenarios and record the final cache hit rates with the program in cache.s.
Try to reason out what the hit rate will be BEFORE running the code.

THE FOLLOWING are good questions to ask yourself as you do these exercises (not checkoff questions):

  • How big is your cache block?
  • How many consecutive accesses (taking into account the step size) fit within a single block?
  • How much data fits in the WHOLE cache?
  • How far apart in memory are blocks that map to the same set (and could create conflicts)?
  • What is your cache’s associativity?
  • Where in the cache does a particular block map to?
  • When considering why a specific access is a miss or hit: Have you accessed this piece of data before? If so, is it still in the cache or not?

Scenario 1

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 128 (bytes)
  • Step Size (a1): 8
  • Rep Count (a2): 4
  • Option (a3): 0

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 1
  • Block Size: 8
  • Number of Blocks: 4
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1
  • Block Replacement Policy: LRU
  • Cache Size (Bytes):
    32 Bytes = 8 Bytes × 4 blocks

Here we modify the program parameters for cache.s in Ripes.

Confirm that the data bits amount to exactly 32 bytes (256 bits).

The result after executing the simulator.

Compare with Venus

Tasks

  1. What combination of parameters is producing the hit rate you observe?
  • Consider the sequence of memory address accesses, starting with a cold cache:
    • M[0] = 0x10000000, M[8] = 0x10000020, M[16] = 0x10000040, M[24] = 0x10000060
    • TIO Breakdown:
      • O = log2(8) = 3 bits
      • Cache size / block size = 32 / 8 = 4 lines, so I = log2(4) = 2 bits
      • The address width is A = 32 bits, so T = 32 - 2 - 3 = 27 bits
    • e.g.
      0x10000060 = 0b0001_0000_0000_0000_0000_0000_0110_0000
      breaks into
      (Tag) 000_1000_0000_0000_0000_0000_0011 | (Index) 00 | (Offset) 000, i.e. tag = 0x00800003
      (a small C helper for this breakdown appears after this task list)
    • Because the step size is 8 words (32 bytes), the index after the TIO breakdown is always 0 for M[0], M[8], M[16], M[24]; every access collides in the same line.
    • e.g.

      Access M[0]: miss

      | Index | V | Tag        | Block 0 | Block 1 |
      |-------|---|------------|---------|---------|
      | 0     | 1 | 0x00800000 | M[0]    | M[1]    |
      | 1     | 0 | -          | -       | -       |
      | 2     | 0 | -          | -       | -       |
      | 3     | 0 | -          | -       | -       |

      Access M[8]: miss

      | Index | V | Tag        | Block 0 | Block 1 |
      |-------|---|------------|---------|---------|
      | 0     | 1 | 0x00800001 | M[8]    | M[9]    |
      | 1     | 0 | -          | -       | -       |
      | 2     | 0 | -          | -       | -       |
      | 3     | 0 | -          | -       | -       |

      Access M[16]: miss

      | Index | V | Tag        | Block 0 | Block 1 |
      |-------|---|------------|---------|---------|
      | 0     | 1 | 0x00800002 | M[16]   | M[17]   |
      | 1     | 0 | -          | -       | -       |
      | 2     | 0 | -          | -       | -       |
      | 3     | 0 | -          | -       | -       |

      Access M[24]: miss

      | Index | V | Tag        | Block 0 | Block 1 |
      |-------|---|------------|---------|---------|
      | 0     | 1 | 0x00800003 | M[24]   | M[25]   |
      | 1     | 0 | -          | -       | -       |
      | 2     | 0 | -          | -       | -       |
      | 3     | 0 | -          | -       | -       |
  • Improve the hit rate for direct mapping
    The program repeats 4 times, so keeping the array values in the cache as long as possible is a good strategy. For example:
    1. Increase the block size, so that one line holds more of the array's values and more accesses hit:

      Block size 2^4: hit rate 0.875.

      Block size 2^5: hit rate 0.9375.

    2. Because direct mapping is a one-to-one mapping, increasing the number of lines (more index bits) to reduce collisions between different addresses can also improve the hit rate:

      2^4 lines: hit rate 0.75.

  2. What is our hit rate if we increase Rep Count arbitrarily? Why?
    Ans.
    With the combination of parameters above, no access ever hits: every access maps to index 0 with a different tag. Increasing the rep count arbitrarily therefore does not improve the hit rate; it only piles up more misses.
    However, in a configuration that does allow some hits, a larger rep count yields more hits. e.g.

    Comparing rep count 4 with rep count 16, the hit rate increases from 0.75 to 0.9375.

  3. How could we modify one program parameter to increase our hit rate?
    Ans.
    We can change the step size to 1. Each miss then brings two array values (an even- and an odd-indexed word) into the cache at once, so the access to the odd-indexed word always hits.
    Therefore we get a hit rate of 0.5 by the end of the program.

    Miss when the CPU accesses an even index of the array.

    Hit when the CPU accesses an odd index of the array.
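As promised above, here is a small C helper for the tag/index/offset arithmetic (a sketch; the constants are Scenario 1's 3 offset bits and 2 index bits):

```c
#include <stdio.h>

int main(void) {
    unsigned addr       = 0x10000060;    /* M[24] */
    unsigned offsetBits = 3;             /* log2(8-byte blocks) */
    unsigned indexBits  = 2;             /* log2(4 lines)       */

    unsigned offset = addr & ((1u << offsetBits) - 1);
    unsigned index  = (addr >> offsetBits) & ((1u << indexBits) - 1);
    unsigned tag    = addr >> (offsetBits + indexBits);

    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    /* prints tag=0x800003 index=0 offset=0, matching the breakdown above */
    return 0;
}
```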

Scenario 2

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 256 (bytes)
  • Step Size (a1): 2
  • Rep Count (a2): 1
  • Option (a3): 1

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 1
  • Block Size: 16
  • Number of Blocks: 16
  • Enable?: Should be green
  • Placement Policy: N-Way Set Associative
  • Associativity: 4
  • Block Replacement Policy: LRU
  • Cache Size (Bytes):
    256 Bytes = 16 Bytes × 16 blocks

Hit rate is 0.75.

The result in Ripes. Hit rate is also 0.75.
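Extending the direct-mapped sketch from the policy review into a 4-way set-associative LRU model (again our own approximation, not Ripes internals) reproduces the 0.75 hit rate:

```c
#include <stdio.h>
#include <stdbool.h>

#define SETS  4
#define WAYS  4
#define BLOCK 16                                   /* bytes per block */

typedef struct { bool valid; unsigned tag, age; } Way;

static Way cache[SETS][WAYS];
static unsigned tick, hits, misses;

static void cache_access(unsigned addr) {
    unsigned block = addr / BLOCK, set = block % SETS, tag = block / SETS;
    tick++;
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].age = tick;              /* refresh LRU age on hit */
            hits++;
            return;
        }
    misses++;                                      /* miss: fill the LRU way */
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (!cache[set][w].valid ||
            (cache[set][victim].valid && cache[set][w].age < cache[set][victim].age))
            victim = w;
    cache[set][victim] = (Way){true, tag, tick};
}

int main(void) {
    unsigned base = 0x10000000;
    for (int i = 0; i < 256; i += 2 * 4) {         /* 256 B array, step 2 words */
        cache_access(base + i);                    /* option 1: read (lw)  */
        cache_access(base + i);                    /*           write (sw) */
    }
    printf("hits=%u misses=%u hit rate=%.2f\n",
           hits, misses, (double)hits / (hits + misses));
    /* expected: hits=48 misses=16 -> hit rate 0.75 */
    return 0;
}
```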

Tasks

  1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It's not 1.
  • Cache accesses: 2 per iteration (one read and one write, since the option is 1).
  • Main-memory accesses: 2 for an index of the form 4n (the read-miss fill plus the write-through) and 1 for an index of the form 4n+2 (only the write-through).
  • Consider the sequence of memory address accesses, starting with a cold cache:
    • M[0] = 0x10000000, M[2] = 0x10000008, M[4] = 0x10000010, M[6] = 0x10000018, M[8] = 0x10000020, M[10], …, M[60], M[62]

    • TIO Breakdown:

      • O = log2(16) = 4 bits
      • I = log2(Cache Size / Block Size / N-way) = log2(256B / 16B / 4) = 2 bits
      • T = 32 - 2 - 4 = 26 bits
    • e.g. Note: the option is 1, so each index is accessed twice (once to read, once to write).

      Access M[0]: read miss; the CPU fetches the block from main memory.

      | Set | Slot | V | Tag        | Block 0 | Block 1 | Block 2 | Block 3 |
      |-----|------|---|------------|---------|---------|---------|---------|
      | 0   | 0    | 1 | 0x00400000 | M[0]    | M[1]    | M[2]    | M[3]    |
      | 0   | 1    | 0 | -          | -       | -       | -       | -       |
      | 0   | 2    | 0 | -          | -       | -       | -       | -       |
      | 0   | 3    | 0 | -          | -       | -       | -       | -       |

      Access M[0]: write hit; the CPU still accesses main memory because of the write-through.
      Access M[2]: read hit; no main-memory access.
      Access M[2]: write hit; main memory is accessed for the write-through.

      Access M[4]: read miss

      | Set | Slot | V | Tag        | Block 0 | Block 1 | Block 2 | Block 3 |
      |-----|------|---|------------|---------|---------|---------|---------|
      | 1   | 0    | 1 | 0x00400000 | M[4]    | M[5]    | M[6]    | M[7]    |
      | 1   | 1    | 0 | -          | -       | -       | -       | -       |
      | 1   | 2    | 0 | -          | -       | -       | -       | -       |
      | 1   | 3    | 0 | -          | -       | -       | -       | -       |

      Access M[4]: write hit (the read miss above already brought the block in)
      Access M[6]: read hit
      Access M[6]: write hit

      Access M[16]: read miss

      | Set | Slot | V | Tag        | Block 0 | Block 1 | Block 2 | Block 3 |
      |-----|------|---|------------|---------|---------|---------|---------|
      | 0   | 0    | 1 | 0x00400000 | M[0]    | M[1]    | M[2]    | M[3]    |
      | 0   | 1    | 1 | 0x00400001 | M[16]   | M[17]   | M[18]   | M[19]   |
      | 0   | 2    | 0 | -          | -       | -       | -       | -       |
      | 0   | 3    | 0 | -          | -       | -       | -       | -       |

      Access M[16]: write hit
      Access M[18]: read hit
      Access M[18]: write hit

      …

      Access M[60]: read miss
      Access M[60]: write hit
      Access M[62]: read hit
      Access M[62]: write hit

  2. What is the repeating hit/miss pattern? WHY?
    Because the block size is 16 bytes (4 words), one line holds 4 array values.
    Therefore, whenever the CPU takes a miss, the fill also brings in the data that will be accessed on the next iteration, since the step size is 2 words.
    And because the option is 1, the CPU accesses the cache twice per index: an index of the form 4n costs a read miss plus a write hit, while an index of the form 4n+2 gets a read hit plus a write hit.

    e.g.
    Access M[0]: read miss; the CPU accesses main memory.

    Hits: 0, Misses: 1, Writebacks: 0

    Access M[0]: write hit; the CPU accesses main memory (write-through).

    Hits: 1, Misses: 1, Writebacks: 1

    Access M[2]: read hit; the CPU does not access main memory.

    Hits: 2, Misses: 1, Writebacks: 1

    Access M[2]: write hit; the CPU accesses main memory (write-through).

    Hits: 3, Misses: 1, Writebacks: 2

    Access M[4]: read miss

    and the pattern (miss, hit, hit, hit) repeats every 4 accesses.

  3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
    Ans.
    Each block sees 4 accesses: 1 read miss (the first read of the index of the form 4n) and 3 hits.
    So hit rate = 3/4 = 0.75.

  4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!

    The hit rate approaches 1. After the first pass, the whole 256-byte array fits in the 256-byte cache, so every access in later reps hits; with 64 accesses per rep and 16 compulsory misses, hit rate = 1 - 16/(64 × repcount) = 1 - 1/(4 × repcount).

    rep count = 10240
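    A quick sanity check of that limit, assuming (as argued above) that only the first pass misses:

    ```c
    #include <stdio.h>

    int main(void) {
        /* 64 accesses per rep, 16 compulsory misses in rep 1 only */
        int reps[] = {1, 4, 16, 256, 10240};
        for (int k = 0; k < 5; k++)
            printf("reps=%5d  hit rate=%.6f\n",
                   reps[k], 1.0 - 1.0 / (4.0 * reps[k]));
        return 0;
    }
    ```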

Scenario 3

L1 and L2 cache access.


Exercise 2 - Loop Ordering and Matrix Multiplication

If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];

In the above code, note that the loops are ordered i, j, k. If we examine the innermost loop (the one that increments k), we see that it…

  • moves through B with stride 1
  • moves through A with stride n
  • moves through C with stride 0

REMEMBER:
To compute the matrix multiplication correctly, the loop order doesn’t matter.

BUT, the order in which we choose to access the elements of the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of spatial and temporal locality, utilizing blocks already contained within our cache. Optimizing a program’s memory access patterns is essential to obtaining good performance from the memory hierarchy.

Observation for matrixMultiply.c

matrixMultiply.c
```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

/* To save you time, we are including all 6 variants of the loop ordering
   as separate functions and then calling them using function pointers.
   The reason for having separate functions that are nearly identical is
   to avoid counting any extraneous processing towards the computation
   time. This includes I/O accesses (printf) and conditionals (if/switch).
   I/O accesses are slow and conditional/branching statements could
   unfairly bias results (lower cases in switches must run through more
   case statements on each iteration). */

void multMat1( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ijk loop order. */
    for( i = 0; i < n; i++ )
        for( j = 0; j < n; j++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat2( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ikj loop order. */
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat3( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jik loop order. */
    for( j = 0; j < n; j++ )
        for( i = 0; i < n; i++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat4( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jki loop order. */
    for( j = 0; j < n; j++ )
        for( k = 0; k < n; k++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat5( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kij loop order. */
    for( k = 0; k < n; k++ )
        for( i = 0; i < n; i++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat6( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kji loop order. */
    for( k = 0; k < n; k++ )
        for( j = 0; j < n; j++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

/* uses timing features from sys/time.h that you haven't seen before */
int main( int argc, char **argv ) {
    int nmax = 1000,i;

    void (*orderings[])(int,float *,float *,float *) =
        {&multMat1,&multMat2,&multMat3,&multMat4,&multMat5,&multMat6};
    char *names[] = {"ijk","ikj","jik","jki","kij","kji"};

    float *A = (float *)malloc( nmax*nmax * sizeof(float));
    float *B = (float *)malloc( nmax*nmax * sizeof(float));
    float *C = (float *)malloc( nmax*nmax * sizeof(float));

    struct timeval start, end;

    /* fill matrices with random numbers */
    for( i = 0; i < nmax*nmax; i++ ) A[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) B[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) C[i] = drand48()*2-1;

    for( i = 0; i < 6; i++) {
        /* multiply matrices and measure the time */
        gettimeofday( &start, NULL );
        (*orderings[i])( nmax, A, B, C );
        gettimeofday( &end, NULL );

        /* convert time to Gflop/s */
        double seconds = (end.tv_sec - start.tv_sec) +
                         1.0e-6 * (end.tv_usec - start.tv_usec);
        double Gflops = 2e-9*nmax*nmax*nmax/seconds;
        printf( "%s:\tn = %d, %.3f Gflop/s\n", names[i], nmax, Gflops );
    }

    free( A );
    free( B );
    free( C );

    printf("\n\n");
    return 0;
}
```
$ make ex2

gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk:	n = 1000, 2.191 Gflop/s
ikj:	n = 1000, 0.317 Gflop/s
jik:	n = 1000, 2.097 Gflop/s
jki:	n = 1000, 13.640 Gflop/s
kij:	n = 1000, 0.317 Gflop/s
kji:	n = 1000, 13.615 Gflop/s

This will run some matrix multiplications according to the six different implementations in the file, and it will tell you the speed at which each implementation executed the operation. The unit "Gflop/s" reads, "giga-floating-point operations per second."
THE BIGGER THE NUMBER THE FASTER IT IS RUNNING!

Modification for multMat1.c

  1. multMat1.c

     ```c
     #include <stdio.h>
     #include <stdlib.h>

     int main() {
         int n = 2, id;
         int A[n*n];
         int B[n*n];
         int C[n*n];

         /* fill matrices with constant values */
         for( id = 0; id < n*n; id++ ) A[id] = 3;
         for( id = 0; id < n*n; id++ ) B[id] = 2;
         for( id = 0; id < n*n; id++ ) C[id] = 0;

         /* multMat1: this is ijk loop order. */
         int i,j,k;
         for( i = 0; i < n; i++ )
             for( j = 0; j < n; j++ )
                 for( k = 0; k < n; k++ ) {
                     C[i+j*n] += A[i+k*n]*B[k+j*n];
                 }

         return 0;
     }
     ```

     1. n = 2, i.e. a 2-by-2 matrix.
     2. Use local (stack) arrays instead of malloc, so that we can observe the data in stack memory in Ripes.
     3. Change the data type from float to int.
     4. Use constants as the initial values.
  2. Translate to RISC-V assembly code with Compiler Explorer (choose RISC-V rv32gc clang (trunk))

  3. Modify the assembly code generated by Compiler Explorer

     Assembly code that can execute in Ripes:

     ```asm
     main:                                   # @main
         addi    sp, sp, -64
         sw      ra, 60(sp)                  # 4-byte Folded Spill
         sw      s0, 56(sp)                  # 4-byte Folded Spill
         addi    s0, sp, 64
         mv      a0, zero
         sw      a0, -12(s0)
         addi    a1, zero, 2
         sw      a1, -16(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         add     a2, zero, sp
         sw      a2, -24(s0)
         slli    a2, a1, 2
         addi    a2, a2, 15
         andi    a3, a2, -16
         add     a2, zero, sp
         sub     a2, a2, a3
         sw      a2, -60(s0)                 # 4-byte Folded Spill
         add     sp, zero, a2
         sw      a1, -28(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         slli    a2, a1, 2
         addi    a2, a2, 15
         andi    a3, a2, -16
         add     a2, zero, sp
         sub     a2, a2, a3
         sw      a2, -56(s0)                 # 4-byte Folded Spill
         add     sp, zero, a2
         sw      a1, -32(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         slli    a2, a1, 2
         addi    a2, a2, 15
         andi    a3, a2, -16
         add     a2, zero, sp
         sub     a2, a2, a3
         sw      a2, -52(s0)                 # 4-byte Folded Spill
         add     sp, zero, a2
         sw      a1, -36(s0)
         sw      a0, -20(s0)
         j       .LBB0_1
     .LBB0_1:                                # =>This Inner Loop Header: Depth=1
         lw      a0, -20(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         bge     a0, a1, .LBB0_4
         j       .LBB0_2
     .LBB0_2:                                # in Loop: Header=BB0_1 Depth=1
         lw      a0, -60(s0)                 # 4-byte Folded Reload
         lw      a1, -20(s0)
         slli    a1, a1, 2
         add     a1, a1, a0
         addi    a0, zero, 3
         sw      a0, 0(a1)
         j       .LBB0_3
     .LBB0_3:                                # in Loop: Header=BB0_1 Depth=1
         lw      a0, -20(s0)
         addi    a0, a0, 1
         sw      a0, -20(s0)
         j       .LBB0_1
     .LBB0_4:
         mv      a0, zero
         sw      a0, -20(s0)
         j       .LBB0_5
     .LBB0_5:                                # =>This Inner Loop Header: Depth=1
         lw      a0, -20(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         bge     a0, a1, .LBB0_8
         j       .LBB0_6
     .LBB0_6:                                # in Loop: Header=BB0_5 Depth=1
         lw      a0, -56(s0)                 # 4-byte Folded Reload
         lw      a1, -20(s0)
         slli    a1, a1, 2
         add     a1, a1, a0
         addi    a0, zero, 2
         sw      a0, 0(a1)
         j       .LBB0_7
     .LBB0_7:                                # in Loop: Header=BB0_5 Depth=1
         lw      a0, -20(s0)
         addi    a0, a0, 1
         sw      a0, -20(s0)
         j       .LBB0_5
     .LBB0_8:
         mv      a0, zero
         sw      a0, -20(s0)
         j       .LBB0_9
     .LBB0_9:                                # =>This Inner Loop Header: Depth=1
         lw      a0, -20(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         bge     a0, a1, .LBB0_12
         j       .LBB0_10
     .LBB0_10:                               # in Loop: Header=BB0_9 Depth=1
         lw      a0, -52(s0)                 # 4-byte Folded Reload
         lw      a1, -20(s0)
         slli    a1, a1, 2
         add     a1, a1, a0
         mv      a0, zero
         sw      a0, 0(a1)
         j       .LBB0_11
     .LBB0_11:                               # in Loop: Header=BB0_9 Depth=1
         lw      a0, -20(s0)
         addi    a0, a0, 1
         sw      a0, -20(s0)
         j       .LBB0_9
     .LBB0_12:
         mv      a0, zero
         sw      a0, -40(s0)
         j       .LBB0_13
     .LBB0_13:                               # =>This Loop Header: Depth=1
         lw      a0, -40(s0)
         lw      a1, -16(s0)
         bge     a0, a1, .LBB0_24
         j       .LBB0_14
     .LBB0_14:                               # in Loop: Header=BB0_13 Depth=1
         mv      a0, zero
         sw      a0, -44(s0)
         j       .LBB0_15
     .LBB0_15:                               # Parent Loop BB0_13 Depth=1
         lw      a0, -44(s0)
         lw      a1, -16(s0)
         bge     a0, a1, .LBB0_22
         j       .LBB0_16
     .LBB0_16:                               # in Loop: Header=BB0_15 Depth=2
         mv      a0, zero
         sw      a0, -48(s0)
         j       .LBB0_17
     .LBB0_17:                               # Parent Loop BB0_13 Depth=1
         lw      a0, -48(s0)
         lw      a1, -16(s0)
         bge     a0, a1, .LBB0_20
         j       .LBB0_18
     .LBB0_18:                               # in Loop: Header=BB0_17 Depth=3
         lw      a0, -52(s0)                 # 4-byte Folded Reload
         lw      a4, -56(s0)                 # 4-byte Folded Reload
         lw      a2, -60(s0)                 # 4-byte Folded Reload
         lw      a1, -40(s0)
         lw      a5, -48(s0)
         lw      a6, -16(s0)
         mul     a3, a5, a6
         add     a3, a3, a1
         slli    a3, a3, 2
         add     a2, a2, a3
         lw      a2, 0(a2)
         lw      a3, -44(s0)
         mul     a3, a3, a6
         add     a5, a5, a3
         slli    a5, a5, 2
         add     a4, a4, a5
         lw      a4, 0(a4)
         mul     a2, a2, a4
         add     a1, a1, a3
         slli    a1, a1, 2
         add     a1, a1, a0
         lw      a0, 0(a1)
         add     a0, a0, a2
         sw      a0, 0(a1)
         j       .LBB0_19
     .LBB0_19:                               # in Loop: Header=BB0_17 Depth=3
         lw      a0, -48(s0)
         addi    a0, a0, 1
         sw      a0, -48(s0)
         j       .LBB0_17
     .LBB0_20:                               # in Loop: Header=BB0_15 Depth=2
         j       .LBB0_21
     .LBB0_21:                               # in Loop: Header=BB0_15 Depth=2
         lw      a0, -44(s0)
         addi    a0, a0, 1
         sw      a0, -44(s0)
         j       .LBB0_15
     .LBB0_22:                               # in Loop: Header=BB0_13 Depth=1
         j       .LBB0_23
     .LBB0_23:                               # in Loop: Header=BB0_13 Depth=1
         lw      a0, -40(s0)
         addi    a0, a0, 1
         sw      a0, -40(s0)
         j       .LBB0_13
     .LBB0_24:
         mv      a0, zero
         sw      a0, -12(s0)
         lw      a0, -24(s0)
         add     sp, zero, a0
         lw      a0, -12(s0)
         addi    sp, s0, -64
         lw      s0, 56(sp)                  # 4-byte Folded Reload
         lw      ra, 60(sp)                  # 4-byte Folded Reload
         addi    sp, sp, 64
         j       .EXIT
     .EXIT:
         li      a7, 10
         ecall
     ```

     We make the following modification at the end:

     ```asm
     ...
     .LBB0_24:
         mv      a0, zero
         sw      a0, -12(s0)
         lw      a0, -24(s0)
         add     sp, zero, a0
         lw      a0, -12(s0)
         addi    sp, s0, -64
         lw      s0, 56(sp)                  # 4-byte Folded Reload
         lw      ra, 60(sp)                  # 4-byte Folded Reload
         addi    sp, sp, 64
         # ret                  <--- removed
         j       .EXIT          <--- added: jump to our exit stub
     .EXIT:                     <--- added
         li      a7, 10         <--- added: exit syscall for Ripes
         ecall                  <--- added
     ```
    
  4. Check the result
    Setup (sp = 0x7fffffff0)

    The result in C:

    /* printf("i=%d, j=%d, C[i+j*n]=%d \n", i, j, C[i+j*n]); */
    i=0, j=0, C[i+j*n]=6
    i=0, j=0, C[i+j*n]=12
    i=0, j=1, C[i+j*n]=6
    i=0, j=1, C[i+j*n]=12
    i=1, j=0, C[i+j*n]=6
    i=1, j=0, C[i+j*n]=12
    i=1, j=1, C[i+j*n]=6
    i=1, j=1, C[i+j*n]=12
    

    Memory in Ripes.

  5. The access order (small case, n = 2)

  6. Result in Ripes (n = 50)

    Hit rate is 0.9174.

Task

  1. Record your results in answers.txt

    ijk:	n = 1000, 2.191 Gflop/s
    ikj:	n = 1000, 0.317 Gflop/s
    jik:	n = 1000, 2.097 Gflop/s
    **jki:	n = 1000, 13.640 Gflop/s**
    kij:	n = 1000, 0.317 Gflop/s
    kji:	n = 1000, 13.615 Gflop/s
    
  2. Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
    Ans 1. jki (multMat4), with kji (multMat6) essentially tied.
    Ans 2.
    In jki (and kji) order, the innermost loop increments i, so it moves through both A and C with stride 1 (and through B with stride 0); the CPU keeps hitting adjacent words that are already in the cache.

    n = 2,

    n = 50, Hit rate = 0.966

  3. Which ordering(s) perform the worst? Why?
    Ans 1. ikj (multMat2) and kij (multMat5)
    Ans 2.
    In these orders, the innermost loop increments j, so it moves through both B and C with stride n; when n is large, a cache block cannot cover the distance between consecutive accesses, and the CPU misses on almost every access.

    ikj (multMat2)

    n = 50, Hit rate = 0.8692

    kij (multMat5)

    n = 50, Hit rate = 0.8702

  4. How does the way we stride through the matrices with respect to the innermost loop affect performance?
    Ans.
    The larger the stride of the innermost loop, the weaker the spatial locality. A stride-1 (or stride-0) inner loop keeps reusing the cache block it just fetched, while a stride-n inner loop touches a new block on nearly every iteration, so smaller strides perform better.
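To see the stride effect in isolation, here is a micro-benchmark of our own (not part of the lab files): it sums the same n-by-n array once row-wise (stride 1 in the inner loop) and once column-wise (stride n). Exact timings vary by machine, but the stride-n loop should be clearly slower for large n.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* wall-clock time in seconds */
static double now(void) {
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + 1e-6 * t.tv_usec;
}

int main(void) {
    int n = 4000;
    float *a = malloc((size_t)n * n * sizeof(float));
    for (long i = 0; i < (long)n * n; i++) a[i] = 1.0f;

    double t0 = now();
    volatile float s1 = 0;
    for (int i = 0; i < n; i++)          /* stride 1 in the inner loop */
        for (int j = 0; j < n; j++)
            s1 += a[(long)i * n + j];
    double t1 = now();

    volatile float s2 = 0;
    for (int j = 0; j < n; j++)          /* stride n in the inner loop */
        for (int i = 0; i < n; i++)
            s2 += a[(long)i * n + j];
    double t2 = now();

    printf("stride 1: %.3f s, stride n: %.3f s\n", t1 - t0, t2 - t1);
    free(a);
    return 0;
}
```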


Exercise 3 - Cache Blocking and Matrix Transposition

Matrix Transposition
Sometimes, we wish to swap the rows and columns of a matrix. This operation is called a “transposition” and an efficient implementation can be quite helpful while performing more complicated linear algebra operations. The transpose of matrix A is often denoted as AT.

Cache Blocking
In the above code for matrix multiplication, note that we are striding across the entire A and B matrices to compute a single value of C. As such, we are constantly accessing new values from memory and obtain very little reuse of cached data! We can improve the amount of data reuse in the caches by implementing a technique called cache blocking. More formally, cache blocking is a technique that attempts to reduce the cache miss rate by further improving the temporal and/or spatial locality of memory accesses. In the case of matrix transposition we consider performing the transposition one block at a time.

Things to note: In the above image, we transpose each submatrix Aij of matrix A into its final location in the output matrix, one submatrix at a time. It is important to note that transposing each individual subsection is equivalent to transposing the entire matrix.

Since we operate on and finish transposing each submatrix one at a time, we consolidate our memory accesses to that smaller chunk of memory when transposing that particular submatrix, which increases the degree of temporal (and spatial) locality that we exhibit, which makes our cache performance better, which makes our program run faster.

This (if implemented correctly) will result in a substantial improvement in performance. For this lab, you will implement a cache blocking scheme for matrix transposition and analyze its performance.
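For reference, here is a sketch of the naive transpose that the blocked version below is measured against (the actual transpose_naive in the lab's transpose.c may differ in details):

```c
/* Naive transpose: dst becomes the transpose of src (both n-by-n).
 * The inner loop writes dst with stride 1 but reads src with stride n,
 * so for large n almost every read touches a new cache block. */
void transpose_naive(int n, int *dst, int *src) {
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            dst[y + x * n] = src[x + y * n];
}
```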

Transpose_blocking

```c
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    // YOUR CODE HERE
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < n; j += blocksize)
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize && k < n; ++k)
                for (int l = j; l < j + blocksize && l < n; ++l)
                    dst[k + l*n] = src[l + k*n];
}
```

Part 1 - Changing Array Sizes

Task

Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.

  • Record your results in answers.txt

    n = 100,
    Testing naive transpose: 0.004 milliseconds
    Testing transpose with blocking: 0.004 milliseconds

    n = 1000,
    Testing naive transpose: 0.716 milliseconds
    Testing transpose with blocking: 0.562 milliseconds

    n = 2000,
    Testing naive transpose: 12.236 milliseconds
    Testing transpose with blocking: 3.432 milliseconds

    n = 5000,
    Testing naive transpose: 113.673 milliseconds
    Testing transpose with blocking: 23.722 milliseconds

    n = 10000,
    Testing naive transpose: 600.908 milliseconds
    Testing transpose with blocking: 102.394 milliseconds
    
  • At what point does the cache blocked version of transpose become faster than the non-cache blocked version?
    Ans.
    n = 1000 (at n = 100 the two versions are equally fast; from n = 1000 on, blocking wins).
  • Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
    Ans.
    For small n the whole matrix fits in the cache, so even the naive transpose already enjoys good locality and blocking only adds loop overhead. At n = 100, for instance, each matrix is 100 × 100 × 4 B = 40 KB, which fits comfortably in a typical L2 cache; blocking pays off only once the matrices far exceed the cache size.

Part 2 - Changing Blocksize

Task

Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.

  • Record your results in answers.txt

    blocksize = 50,
    Testing naive transpose: 574.05 milliseconds
    Testing transpose with blocking: 118.785 milliseconds

    blocksize = 100,
    Testing naive transpose: 562.55 milliseconds
    Testing transpose with blocking: 111.177 milliseconds

    blocksize = 500,
    Testing naive transpose: 595.29 milliseconds
    Testing transpose with blocking: 137.427 milliseconds

    blocksize = 1000,
    Testing naive transpose: 591.831 milliseconds
    Testing transpose with blocking: 163.924 milliseconds

    blocksize = 5000,
    Testing naive transpose: 583.844 milliseconds
    Testing transpose with blocking: 580.742 milliseconds
    
  • How does performance change as blocksize increases? Why is this the case?
    Ans.
    Performance improves slightly up to blocksize = 100 and then degrades as blocksize grows further. Once a blocksize × blocksize tile no longer fits in the cache, accesses within the tile lose their locality, which is exactly the weak-locality situation blocking was meant to avoid. In the worst case (blocksize = 5000 with n = 10000, a tile of 5000 × 5000 × 4 B = 100 MB) the blocked code degenerates to nearly the speed of the naive transpose.

Notice that in neither of the last two exercises did we actually know the cache parameters of our machine. We just made the code exhibit a higher degree of locality, and this magically made things go faster! This tells us that caches, regardless of their specific parameters, will always benefit from operating on code which exhibits a high degree of locality.