Assignment4: Cache

tags: riscv

Cache.s

```asm
.data
array:  .word 2048      # max array size specified in BYTES (DO NOT CHANGE)

.text
##################################################################################################
# You MAY change the code below this section
main:
    li a0, 256          # array size in BYTES (power of 2 < array size)
    li a1, 2            # step size (power of 2 > 0)
    li a2, 1            # rep count (int > 0)
    li a3, 1            # 0 - option 0, 1 - option 1
# You MAY change the code above this section
##################################################################################################
    jal accessWords     # lw/sw
    #jal accessBytes    # lb/sb
    li a7, 10           # exit (register changed from a0 to a7 for Ripes)
    ecall

# SUMMARY OF REGISTER USE:
#  a0 = array size in bytes
#  a1 = step size
#  a2 = number of times to repeat
#  a3 = 0 (W) / 1 (RW)
#  s0 = moving array ptr
#  s1 = array limit (ptr)

accessWords:
    la   s0, array      # ptr to array
    add  s1, s0, a0     # hardcode array limit (ptr)
    slli t1, a1, 2      # multiply stepsize by 4 because WORDS
wordLoop:
    beq  a3, zero, wordZero
    lw   t0, 0(s0)      # array[index/4]++
    addi t0, t0, 1
    sw   t0, 0(s0)
    j    wordCheck
wordZero:
    sw   zero, 0(s0)    # array[index/4] = 0
wordCheck:
    add  s0, s0, t1          # increment ptr
    blt  s0, s1, wordLoop    # inner loop done?
    addi a2, a2, -1
    bgtz a2, accessWords     # outer loop done?
    jr   ra
...
```

We focus on word access here.

A brief description of cache.s

# PSEUDOCODE:
#    int array[];  //Assume sizeof(int) == 4
#    for (k = 0; k < repcount; k++) {		// repeat repcount times
#      /* Step through the selected array segment with the given step size. */
#      for (index = 0; index < arraysize; index += stepsize) {
#        if(option==0)
#          array[index] = 0;			// Option 0: One cache access - write
#        else
#          array[index] = array[index] + 1;	// Option 1: Two cache accesses - read AND write
#      }
#    }

# SUMMARY OF REGISTER USE:
#  a0 = array size in bytes
#  a1 = step size
#  a2 = number of times to repeat
#  a3 = 0 (W) / 1 (RW)
#  s0 = moving array ptr
#  s1 = array limit (ptr)
  1. accessWords

     la   s0, array      # ptr to array
     add  s1, s0, a0     # hardcode array limit (ptr)
     slli t1, a1, 2      # multiply stepsize by 4 because WORDS

     s0 holds the start address of the array, and s1 holds the end address (start plus the array size in bytes).
     t1 holds the step size converted to bytes (4 × a1), i.e. how far the pointer advances each iteration.
  2. wordLoop

     beq  a3, zero, wordZero
     lw   t0, 0(s0)      # array[index/4]++
     addi t0, t0, 1
     sw   t0, 0(s0)
     j    wordCheck

     Check whether a3 equals 0; if it does, jump to wordZero.
     Otherwise, load the word at s0 (the current position in the array) into t0, add 1, and store it back (array[index] = array[index] + 1).
     Then jump to wordCheck.
  3. wordCheck

     add  s0, s0, t1          # increment ptr
     blt  s0, s1, wordLoop    # inner loop done?
     addi a2, a2, -1
     bgtz a2, accessWords     # outer loop done?
     jr   ra

     Advance the array pointer (s0) to the next position with add s0, s0, t1.
     If s0 < s1 (blt), jump back to wordLoop: the inner loop is not done and we have not yet touched the whole array segment.
     a2 holds the repeat count, so decrement it by 1 and, if it is still greater than 0 (bgtz), start the next pass from accessWords; otherwise return.
  4. wordZero

     sw   zero, 0(s0)    # array[index/4] = 0

     When the option is 0, set array[index] to 0 (a single write access per index).

Observation in Ripes

  • The program parameters setup

    #  a0 = 8, array size in bytes
    #  a1 = 1, step size --> 1 * 4 bytes = 4 bytes
    #  a2 = 1, number of times to repeat
    #  a3 = 1, 0 (W) / 1 (RW)
    


  • Observation for Memory


    1. Here the array pointer starts at address 0x10000000 and the limit pointer is 0x10000008. The total array size is 8 bytes (2 words).
    2. Since the step size is 1 word, each iteration accesses the next word.
    3. Schematic diagram

      | Array | Data | Address    |
      |-------|------|------------|
      | End   |      | 0x10000008 |
      | -     | 0x0  | 0x10000007 |
      | -     | 0x0  | 0x10000006 |
      | -     | 0x0  | 0x10000005 |
      | M[1]  | 0x1  | 0x10000004 |
      | -     | 0x0  | 0x10000003 |
      | -     | 0x0  | 0x10000002 |
      | -     | 0x8  | 0x10000001 |
      | M[0]  | 0x1  | 0x10000000 |

      Why is the data at 0x10000001 always 0x8? And why is the address 0x1000001c always accessed?

  • Another case

    #  a0 = 12, array size in bytes
    #  a1 = 2, step size --> 2 * 4 bytes = 8 bytes
    #  a2 = 3, number of times to repeat
    #  a3 = 1, 0 (W) / 1 (RW)
    


    The step size is 2 words, so the program accesses addresses 0x10000000, 0x10000008, and 0x10000010; but 0x10000010 is beyond the limit address 0x1000000c, so no data is written there.
    And the program repeats 3 times, so 0x10000000 and 0x10000008 end up holding the value 3 (see the small C model below).
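    As a cross-check, here is a small C model of this case (our own sketch of what accessWords does with a0 = 12, a1 = 2, a2 = 3, a3 = 1); it predicts exactly these final memory contents:

    ```c
    #include <stdio.h>

    int main(void) {
        int array[3] = {0, 0, 0};                      /* 12 bytes = 3 words */
        int sizeWords = 3, step = 2, reps = 3;

        for (int r = 0; r < reps; r++)                 /* outer loop: rep count */
            for (int i = 0; i < sizeWords; i += step)  /* inner loop: step size */
                array[i] += 1;                         /* option 1: read + write */

        for (int i = 0; i < sizeWords; i++)
            printf("0x%08x: %d\n", 0x10000000 + 4 * i, array[i]);
        /* prints: 0x10000000: 3, 0x10000004: 0, 0x10000008: 3
         * (the next step would be 0x10000010, past the limit 0x1000000c) */
        return 0;
    }
    ```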


Review - Hit and Miss Policies

  • Hit

    • Read hits: Return the requested data from the cache to the CPU.
    • Write hits:
      1. Write-Through Policy: Always write data to the cache and to memory (through the cache); this forces cache and memory to always be consistent, is slow, and usually includes a Write Buffer that updates memory in parallel with the processor.
      2. Write-Back Policy: Write data only to the cache, then update memory when the block is evicted; this allows cache and memory to be inconsistent, and requires a dirty bit per cache row, set if the block was written to (is dirty) and must therefore be written back.
      3. Write-Around Policy: In every situation, data is written to main memory only; if the block we're writing is present in the cache, its valid bit is set to invalid.
        Essentially, there is no such thing as a write hit in a write-around cache; a write "hit" does the same thing as a write miss.
  • Miss

    • Read misses: Stall execution, fetch block from memory, put in cache, send requested data to processor, resume.
    • Write misses:
      1. We always have to update the block in memory.
      2. Do we also bring the updated block into the cache?
        • Write Allocate: bring the block into the cache after a write miss; usually paired with write-back (good when the same address is accessed many times).
        • No Write Allocate: only change main memory after a write miss; typically paired with write-through (good for infrequent/random writes).
  • Replacement policies

    • LRU (Least recently used): When we decide to evict a cache block to make space, we select the block that has been used farthest back in time of all the other blocks.
    • Random: When we decide to evict a cache block to make space, we randomly select one of the blocks in the cache to evict.
    • MRU (Most recently used): When we decide to evict a cache block to make space, we select the block that has been used most recently of all the other available blocks.

The Venus cache simulator currently simulates a write-through, write-allocate cache.
So we need to choose the same options in Ripes for our experiment.
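To make these policies concrete, here is a minimal direct-mapped cache model in C (our own sketch, not Ripes internals): it implements write-allocate on misses and only counts hits and misses, which is all that write-through needs for hit-rate prediction. The parameters match Scenario 1 below.

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 4
#define BLOCK_SIZE 8                     /* bytes per block */

typedef struct { bool valid; unsigned tag; } Line;

static Line cache[NUM_BLOCKS];
static int hits, misses;

static void cache_access(unsigned addr) {
    unsigned block = addr / BLOCK_SIZE;  /* which memory block   */
    unsigned index = block % NUM_BLOCKS; /* which cache line     */
    unsigned tag   = block / NUM_BLOCKS; /* remaining upper bits */
    if (cache[index].valid && cache[index].tag == tag) {
        hits++;
    } else {                             /* write-allocate: install on miss */
        misses++;
        cache[index].valid = true;
        cache[index].tag   = tag;
    }
}

int main(void) {
    unsigned base = 0x10000000;
    int arraySize = 128, step = 8, reps = 4, option = 0;  /* Scenario 1 */
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < arraySize; i += step * 4) {   /* step in words */
            if (option == 1) cache_access(base + i);      /* lw (option 1) */
            cache_access(base + i);                       /* sw */
        }
    printf("hits=%d misses=%d hit rate=%.4f\n",
           hits, misses, (double)hits / (hits + misses));
    return 0;
}
```

With Scenario 1's parameters it prints hits=0, misses=16: every store maps to index 0 with a different tag.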


Exercise 1 - A Couple of Memory Access Scenarios

Task:
Simulate the following scenarios and record the final cache hit rates with the program in cache.s.
Try to reason out what the hit rate will be BEFORE running the code.

THE FOLLOWING are good questions to ask yourself as you do these exercises (not checkoff questions):

  • How big is your cache block?
  • How many consecutive accesses (taking into account the step size) fit within a single block?
  • How much data fits in the WHOLE cache?
  • How far apart in memory are blocks that map to the same set (and could create conflicts)?
  • What is your cache’s associativity?
  • Where in the cache does a particular block map to?
  • When considering why a specific access is a miss or hit: Have you accessed this piece of data before? If so, is it still in the cache or not?

Scenario 1

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 128 (bytes)
  • Step Size (a1): 8
  • Rep Count (a2): 4
  • Option (a3): 0

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 1
  • Block Size: 8
  • Number of Blocks: 4
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1
  • Block Replacement Policy: LRU
  • Cache Size (Bytes):
    32 Bytes = 8 Bytes × 4 blocks

Here we modify the program parameters for cache.s in Ripes.

Confirm that the data bits amount to exactly 32 bytes (256 bits).

The result after executing the simulator.

Compare with Venus

Tasks

  1. What combination of parameters is producing the hit rate you observe?
  • Consider the sequence of memory address accesses, starting with a cold cache:
    • M[0] = 0x10000000, M[8] = 0x10000020, M[16] = 0x10000040, M[24] = 0x10000060
    • TIO Breakdown:
      • O = log2(8) = 3 bits
      • Cache size / block size = 32 / 8 = 4 lines, so I = log2(4) = 2 bits
      • The address width is A = 32 bits, so T = 32 - 2 - 3 = 27 bits
    • e.g.
      0x10000060 = 0b0001_0000_0000_0000_0000_0000_0110_0000
      breaks into
      (Tag) 000_1000_0000_0000_0000_0000_0011 | (Index) 00 | (Offset) 000, i.e. tag = 0x00800003
      (a small C helper for this breakdown appears after this task list)
    • Because the step size is 8 words (32 bytes), the index after the TIO breakdown is always 0 for M[0], M[8], M[16], M[24]; every access collides in the same line.
    • e.g.

      Access M[0]: miss

      | Index | V | Tag        | Block 0 | Block 1 |
      |-------|---|------------|---------|---------|
      | 0     | 1 | 0x00800000 | M[0]    | M[1]    |
      | 1     | 0 | -          | -       | -       |
      | 2     | 0 | -          | -       | -       |
      | 3     | 0 | -          | -       | -       |

      Access M[8]: miss

      | Index | V | Tag        | Block 0 | Block 1 |
      |-------|---|------------|---------|---------|
      | 0     | 1 | 0x00800001 | M[8]    | M[9]    |
      | 1     | 0 | -          | -       | -       |
      | 2     | 0 | -          | -       | -       |
      | 3     | 0 | -          | -       | -       |

      Access M[16]: miss

      | Index | V | Tag        | Block 0 | Block 1 |
      |-------|---|------------|---------|---------|
      | 0     | 1 | 0x00800002 | M[16]   | M[17]   |
      | 1     | 0 | -          | -       | -       |
      | 2     | 0 | -          | -       | -       |
      | 3     | 0 | -          | -       | -       |

      Access M[24]: miss

      | Index | V | Tag        | Block 0 | Block 1 |
      |-------|---|------------|---------|---------|
      | 0     | 1 | 0x00800003 | M[24]   | M[25]   |
      | 1     | 0 | -          | -       | -       |
      | 2     | 0 | -          | -       | -       |
      | 3     | 0 | -          | -       | -       |
  • Improve the hit rate for direct mapping
    The program repeats 4 times, so keeping the array values in the cache as long as possible is a good strategy. For example:
    1. Increase the block size, so that one line holds more of the array's values and more accesses hit:

      Block size 2^4: hit rate 0.875.

      Block size 2^5: hit rate 0.9375.

    2. Because direct mapping is a one-to-one mapping, increasing the number of lines (more index bits) to reduce collisions between different addresses can also improve the hit rate:

      2^4 lines: hit rate 0.75.

  2. What is our hit rate if we increase Rep Count arbitrarily? Why?
    Ans.
    With the combination of parameters above, no access ever hits: every access maps to index 0 with a different tag. Increasing the rep count arbitrarily therefore does not improve the hit rate; it only piles up more misses.
    However, in a configuration that does allow some hits, a larger rep count yields more hits. e.g.

    Comparing rep count 4 with rep count 16, the hit rate increases from 0.75 to 0.9375.

  3. How could we modify one program parameter to increase our hit rate?
    Ans.
    We can change the step size to 1. Each miss then brings two array values (an even- and an odd-indexed word) into the cache at once, so the access to the odd-indexed word always hits.
    Therefore we get a hit rate of 0.5 by the end of the program.

    Miss when the CPU accesses an even index of the array.

    Hit when the CPU accesses an odd index of the array.
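As promised above, here is a small C helper for the tag/index/offset arithmetic (a sketch; the constants are Scenario 1's 3 offset bits and 2 index bits):

```c
#include <stdio.h>

int main(void) {
    unsigned addr       = 0x10000060;    /* M[24] */
    unsigned offsetBits = 3;             /* log2(8-byte blocks) */
    unsigned indexBits  = 2;             /* log2(4 lines)       */

    unsigned offset = addr & ((1u << offsetBits) - 1);
    unsigned index  = (addr >> offsetBits) & ((1u << indexBits) - 1);
    unsigned tag    = addr >> (offsetBits + indexBits);

    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    /* prints tag=0x800003 index=0 offset=0, matching the breakdown above */
    return 0;
}
```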

Scenario 2

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 256 (bytes)
  • Step Size (a1): 2
  • Rep Count (a2): 1
  • Option (a3): 1

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 1
  • Block Size: 16
  • Number of Blocks: 16
  • Enable?: Should be green
  • Placement Policy: N-Way Set Associative
  • Associativity: 4
  • Block Replacement Policy: LRU
  • Cache Size (Bytes):
    256 Bytes = 16 Bytes × 16 blocks

Hit rate is 0.75.

The result in Ripes. Hit rate is also 0.75.
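Extending the direct-mapped sketch from the policy review into a 4-way set-associative LRU model (again our own approximation, not Ripes internals) reproduces the 0.75 hit rate:

```c
#include <stdio.h>
#include <stdbool.h>

#define SETS  4
#define WAYS  4
#define BLOCK 16                                   /* bytes per block */

typedef struct { bool valid; unsigned tag, age; } Way;

static Way cache[SETS][WAYS];
static unsigned tick, hits, misses;

static void cache_access(unsigned addr) {
    unsigned block = addr / BLOCK, set = block % SETS, tag = block / SETS;
    tick++;
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].age = tick;              /* refresh LRU age on hit */
            hits++;
            return;
        }
    misses++;                                      /* miss: fill the LRU way */
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (!cache[set][w].valid ||
            (cache[set][victim].valid && cache[set][w].age < cache[set][victim].age))
            victim = w;
    cache[set][victim] = (Way){true, tag, tick};
}

int main(void) {
    unsigned base = 0x10000000;
    for (int i = 0; i < 256; i += 2 * 4) {         /* 256 B array, step 2 words */
        cache_access(base + i);                    /* option 1: read (lw)  */
        cache_access(base + i);                    /*           write (sw) */
    }
    printf("hits=%u misses=%u hit rate=%.2f\n",
           hits, misses, (double)hits / (hits + misses));
    /* expected: hits=48 misses=16 -> hit rate 0.75 */
    return 0;
}
```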

Tasks

  1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It's not 1.
  • Cache accesses: 2 per iteration (one read and one write, since the option is 1).
  • Main-memory accesses: 2 for an index of the form 4n (the read-miss fill plus the write-through) and 1 for an index of the form 4n+2 (only the write-through).
  • Consider the sequence of memory address accesses, starting with a cold cache:
    • M[0] = 0x10000000, M[2] = 0x10000008, M[4] = 0x10000010, M[6] = 0x10000018, M[8] = 0x10000020, M[10], …, M[60], M[62]

    • TIO Breakdown:

      • O = log2(16) = 4 bits
      • I = log2(Cache Size / Block Size / N-way) = log2(256B / 16B / 4) = 2 bits
      • T = 32 - 2 - 4 = 26 bits
    • e.g. Note: the option is 1, so each index is accessed twice (once to read, once to write).

      Access M[0]: read miss; the CPU fetches the block from main memory.

      | Set | Slot | V | Tag        | Block 0 | Block 1 | Block 2 | Block 3 |
      |-----|------|---|------------|---------|---------|---------|---------|
      | 0   | 0    | 1 | 0x00400000 | M[0]    | M[1]    | M[2]    | M[3]    |
      | 0   | 1    | 0 | -          | -       | -       | -       | -       |
      | 0   | 2    | 0 | -          | -       | -       | -       | -       |
      | 0   | 3    | 0 | -          | -       | -       | -       | -       |

      Access M[0]: write hit; the CPU still accesses main memory because of the write-through.
      Access M[2]: read hit; no main-memory access.
      Access M[2]: write hit; main memory is accessed for the write-through.

      Access M[4]: read miss

      | Set | Slot | V | Tag        | Block 0 | Block 1 | Block 2 | Block 3 |
      |-----|------|---|------------|---------|---------|---------|---------|
      | 1   | 0    | 1 | 0x00400000 | M[4]    | M[5]    | M[6]    | M[7]    |
      | 1   | 1    | 0 | -          | -       | -       | -       | -       |
      | 1   | 2    | 0 | -          | -       | -       | -       | -       |
      | 1   | 3    | 0 | -          | -       | -       | -       | -       |

      Access M[4]: write hit (the read miss above already brought the block in)
      Access M[6]: read hit
      Access M[6]: write hit

      Access M[16]: read miss

      | Set | Slot | V | Tag        | Block 0 | Block 1 | Block 2 | Block 3 |
      |-----|------|---|------------|---------|---------|---------|---------|
      | 0   | 0    | 1 | 0x00400000 | M[0]    | M[1]    | M[2]    | M[3]    |
      | 0   | 1    | 1 | 0x00400001 | M[16]   | M[17]   | M[18]   | M[19]   |
      | 0   | 2    | 0 | -          | -       | -       | -       | -       |
      | 0   | 3    | 0 | -          | -       | -       | -       | -       |

      Access M[16]: write hit
      Access M[18]: read hit
      Access M[18]: write hit

      …

      Access M[60]: read miss
      Access M[60]: write hit
      Access M[62]: read hit
      Access M[62]: write hit

  2. What is the repeating hit/miss pattern? WHY?
    Because the block size is 16 bytes (4 words), one line holds 4 array values.
    Therefore, whenever the CPU takes a miss, the fill also brings in the data that will be accessed on the next iteration, since the step size is 2 words.
    And because the option is 1, the CPU accesses the cache twice per index: an index of the form 4n costs a read miss plus a write hit, while an index of the form 4n+2 gets a read hit plus a write hit.

    e.g.
    Access M[0]: read miss; the CPU accesses main memory.

    Hits: 0, Misses: 1, Writebacks: 0

    Access M[0]: write hit; the CPU accesses main memory (write-through).

    Hits: 1, Misses: 1, Writebacks: 1

    Access M[2]: read hit; the CPU does not access main memory.

    Hits: 2, Misses: 1, Writebacks: 1

    Access M[2]: write hit; the CPU accesses main memory (write-through).

    Hits: 3, Misses: 1, Writebacks: 2

    Access M[4]: read miss

    and the pattern (miss, hit, hit, hit) repeats every 4 accesses.

  3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
    Ans.
    Each block sees 4 accesses: 1 read miss (the first read of the index of the form 4n) and 3 hits.
    So hit rate = 3/4 = 0.75.

  4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!

    The hit rate approaches 1. After the first pass, the whole 256-byte array fits in the 256-byte cache, so every access in later reps hits; with 64 accesses per rep and 16 compulsory misses, hit rate = 1 - 16/(64 × repcount) = 1 - 1/(4 × repcount).

    rep count = 10240
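    A quick sanity check of that limit, assuming (as argued above) that only the first pass misses:

    ```c
    #include <stdio.h>

    int main(void) {
        /* 64 accesses per rep, 16 compulsory misses in rep 1 only */
        int reps[] = {1, 4, 16, 256, 10240};
        for (int k = 0; k < 5; k++)
            printf("reps=%5d  hit rate=%.6f\n",
                   reps[k], 1.0 - 1.0 / (4.0 * reps[k]));
        return 0;
    }
    ```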

Scenario 3

L1 and L2 cache access.


Exercise 2 - Loop Ordering and Matrix Multiplication

If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];

In the above code, note that the loops are ordered i, j, k. If we examine the innermost loop (the one that increments k), we see that it…

  • moves through B with stride 1
  • moves through A with stride n
  • moves through C with stride 0

REMEMBER:
To compute the matrix multiplication correctly, the loop order doesn’t matter.

BUT, the order in which we choose to access the elements of the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of spatial and temporal locality, utilizing blocks already contained within our cache. Optimizing a program’s memory access patterns is essential to obtaining good performance from the memory hierarchy.

Observation for matrixMultiply.c

matrixMultiply.c
```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

/* To save you time, we are including all 6 variants of the loop ordering
   as separate functions and then calling them using function pointers.
   The reason for having separate functions that are nearly identical is
   to avoid counting any extraneous processing towards the computation
   time. This includes I/O accesses (printf) and conditionals (if/switch).
   I/O accesses are slow and conditional/branching statements could
   unfairly bias results (lower cases in switches must run through more
   case statements on each iteration). */

void multMat1( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ijk loop order. */
    for( i = 0; i < n; i++ )
        for( j = 0; j < n; j++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat2( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ikj loop order. */
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat3( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jik loop order. */
    for( j = 0; j < n; j++ )
        for( i = 0; i < n; i++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat4( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jki loop order. */
    for( j = 0; j < n; j++ )
        for( k = 0; k < n; k++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat5( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kij loop order. */
    for( k = 0; k < n; k++ )
        for( i = 0; i < n; i++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat6( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kji loop order. */
    for( k = 0; k < n; k++ )
        for( j = 0; j < n; j++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

/* uses timing features from sys/time.h that you haven't seen before */
int main( int argc, char **argv ) {
    int nmax = 1000,i;

    void (*orderings[])(int,float *,float *,float *) =
        {&multMat1,&multMat2,&multMat3,&multMat4,&multMat5,&multMat6};
    char *names[] = {"ijk","ikj","jik","jki","kij","kji"};

    float *A = (float *)malloc( nmax*nmax * sizeof(float));
    float *B = (float *)malloc( nmax*nmax * sizeof(float));
    float *C = (float *)malloc( nmax*nmax * sizeof(float));

    struct timeval start, end;

    /* fill matrices with random numbers */
    for( i = 0; i < nmax*nmax; i++ ) A[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) B[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) C[i] = drand48()*2-1;

    for( i = 0; i < 6; i++) {
        /* multiply matrices and measure the time */
        gettimeofday( &start, NULL );
        (*orderings[i])( nmax, A, B, C );
        gettimeofday( &end, NULL );

        /* convert time to Gflop/s */
        double seconds = (end.tv_sec - start.tv_sec) +
                         1.0e-6 * (end.tv_usec - start.tv_usec);
        double Gflops = 2e-9*nmax*nmax*nmax/seconds;
        printf( "%s:\tn = %d, %.3f Gflop/s\n", names[i], nmax, Gflops );
    }

    free( A );
    free( B );
    free( C );

    printf("\n\n");
    return 0;
}
```
$ make ex2

gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk:	n = 1000, 2.191 Gflop/s
ikj:	n = 1000, 0.317 Gflop/s
jik:	n = 1000, 2.097 Gflop/s
jki:	n = 1000, 13.640 Gflop/s
kij:	n = 1000, 0.317 Gflop/s
kji:	n = 1000, 13.615 Gflop/s

This will run some matrix multiplications according to the six different implementations in the file, and it will tell you the speed at which each implementation executed the operation. The unit "Gflop/s" reads, "giga-floating-point operations per second."
THE BIGGER THE NUMBER THE FASTER IT IS RUNNING!

Modification for multMat1.c

  1. multMat1.c

     ```c
     #include <stdio.h>
     #include <stdlib.h>

     int main() {
         int n = 2, id;
         int A[n*n];
         int B[n*n];
         int C[n*n];

         /* fill matrices with constant values */
         for( id = 0; id < n*n; id++ ) A[id] = 3;
         for( id = 0; id < n*n; id++ ) B[id] = 2;
         for( id = 0; id < n*n; id++ ) C[id] = 0;

         /* multMat1: this is ijk loop order. */
         int i,j,k;
         for( i = 0; i < n; i++ )
             for( j = 0; j < n; j++ )
                 for( k = 0; k < n; k++ ) {
                     C[i+j*n] += A[i+k*n]*B[k+j*n];
                 }

         return 0;
     }
     ```

     1. n = 2, i.e. a 2-by-2 matrix.
     2. Use local (stack) arrays instead of malloc, so that we can observe the data in stack memory in Ripes.
     3. Change the data type from float to int.
     4. Use constants as the initial values.
  2. Translate to RISC-V assembly code with Compiler Explorer (choose RISC-V rv32gc clang (trunk))

  3. Modify the assembly code generated by Compiler Explorer

     Assembly code that can execute in Ripes:

     ```asm
     main:                                   # @main
         addi    sp, sp, -64
         sw      ra, 60(sp)                  # 4-byte Folded Spill
         sw      s0, 56(sp)                  # 4-byte Folded Spill
         addi    s0, sp, 64
         mv      a0, zero
         sw      a0, -12(s0)
         addi    a1, zero, 2
         sw      a1, -16(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         add     a2, zero, sp
         sw      a2, -24(s0)
         slli    a2, a1, 2
         addi    a2, a2, 15
         andi    a3, a2, -16
         add     a2, zero, sp
         sub     a2, a2, a3
         sw      a2, -60(s0)                 # 4-byte Folded Spill
         add     sp, zero, a2
         sw      a1, -28(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         slli    a2, a1, 2
         addi    a2, a2, 15
         andi    a3, a2, -16
         add     a2, zero, sp
         sub     a2, a2, a3
         sw      a2, -56(s0)                 # 4-byte Folded Spill
         add     sp, zero, a2
         sw      a1, -32(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         slli    a2, a1, 2
         addi    a2, a2, 15
         andi    a3, a2, -16
         add     a2, zero, sp
         sub     a2, a2, a3
         sw      a2, -52(s0)                 # 4-byte Folded Spill
         add     sp, zero, a2
         sw      a1, -36(s0)
         sw      a0, -20(s0)
         j       .LBB0_1
     .LBB0_1:                                # =>This Inner Loop Header: Depth=1
         lw      a0, -20(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         bge     a0, a1, .LBB0_4
         j       .LBB0_2
     .LBB0_2:                                # in Loop: Header=BB0_1 Depth=1
         lw      a0, -60(s0)                 # 4-byte Folded Reload
         lw      a1, -20(s0)
         slli    a1, a1, 2
         add     a1, a1, a0
         addi    a0, zero, 3
         sw      a0, 0(a1)
         j       .LBB0_3
     .LBB0_3:                                # in Loop: Header=BB0_1 Depth=1
         lw      a0, -20(s0)
         addi    a0, a0, 1
         sw      a0, -20(s0)
         j       .LBB0_1
     .LBB0_4:
         mv      a0, zero
         sw      a0, -20(s0)
         j       .LBB0_5
     .LBB0_5:                                # =>This Inner Loop Header: Depth=1
         lw      a0, -20(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         bge     a0, a1, .LBB0_8
         j       .LBB0_6
     .LBB0_6:                                # in Loop: Header=BB0_5 Depth=1
         lw      a0, -56(s0)                 # 4-byte Folded Reload
         lw      a1, -20(s0)
         slli    a1, a1, 2
         add     a1, a1, a0
         addi    a0, zero, 2
         sw      a0, 0(a1)
         j       .LBB0_7
     .LBB0_7:                                # in Loop: Header=BB0_5 Depth=1
         lw      a0, -20(s0)
         addi    a0, a0, 1
         sw      a0, -20(s0)
         j       .LBB0_5
     .LBB0_8:
         mv      a0, zero
         sw      a0, -20(s0)
         j       .LBB0_9
     .LBB0_9:                                # =>This Inner Loop Header: Depth=1
         lw      a0, -20(s0)
         lw      a1, -16(s0)
         mul     a1, a1, a1
         bge     a0, a1, .LBB0_12
         j       .LBB0_10
     .LBB0_10:                               # in Loop: Header=BB0_9 Depth=1
         lw      a0, -52(s0)                 # 4-byte Folded Reload
         lw      a1, -20(s0)
         slli    a1, a1, 2
         add     a1, a1, a0
         mv      a0, zero
         sw      a0, 0(a1)
         j       .LBB0_11
     .LBB0_11:                               # in Loop: Header=BB0_9 Depth=1
         lw      a0, -20(s0)
         addi    a0, a0, 1
         sw      a0, -20(s0)
         j       .LBB0_9
     .LBB0_12:
         mv      a0, zero
         sw      a0, -40(s0)
         j       .LBB0_13
     .LBB0_13:                               # =>This Loop Header: Depth=1
         lw      a0, -40(s0)
         lw      a1, -16(s0)
         bge     a0, a1, .LBB0_24
         j       .LBB0_14
     .LBB0_14:                               # in Loop: Header=BB0_13 Depth=1
         mv      a0, zero
         sw      a0, -44(s0)
         j       .LBB0_15
     .LBB0_15:                               # Parent Loop BB0_13 Depth=1
         lw      a0, -44(s0)
         lw      a1, -16(s0)
         bge     a0, a1, .LBB0_22
         j       .LBB0_16
     .LBB0_16:                               # in Loop: Header=BB0_15 Depth=2
         mv      a0, zero
         sw      a0, -48(s0)
         j       .LBB0_17
     .LBB0_17:                               # Parent Loop BB0_13 Depth=1
         lw      a0, -48(s0)
         lw      a1, -16(s0)
         bge     a0, a1, .LBB0_20
         j       .LBB0_18
     .LBB0_18:                               # in Loop: Header=BB0_17 Depth=3
         lw      a0, -52(s0)                 # 4-byte Folded Reload
         lw      a4, -56(s0)                 # 4-byte Folded Reload
         lw      a2, -60(s0)                 # 4-byte Folded Reload
         lw      a1, -40(s0)
         lw      a5, -48(s0)
         lw      a6, -16(s0)
         mul     a3, a5, a6
         add     a3, a3, a1
         slli    a3, a3, 2
         add     a2, a2, a3
         lw      a2, 0(a2)
         lw      a3, -44(s0)
         mul     a3, a3, a6
         add     a5, a5, a3
         slli    a5, a5, 2
         add     a4, a4, a5
         lw      a4, 0(a4)
         mul     a2, a2, a4
         add     a1, a1, a3
         slli    a1, a1, 2
         add     a1, a1, a0
         lw      a0, 0(a1)
         add     a0, a0, a2
         sw      a0, 0(a1)
         j       .LBB0_19
     .LBB0_19:                               # in Loop: Header=BB0_17 Depth=3
         lw      a0, -48(s0)
         addi    a0, a0, 1
         sw      a0, -48(s0)
         j       .LBB0_17
     .LBB0_20:                               # in Loop: Header=BB0_15 Depth=2
         j       .LBB0_21
     .LBB0_21:                               # in Loop: Header=BB0_15 Depth=2
         lw      a0, -44(s0)
         addi    a0, a0, 1
         sw      a0, -44(s0)
         j       .LBB0_15
     .LBB0_22:                               # in Loop: Header=BB0_13 Depth=1
         j       .LBB0_23
     .LBB0_23:                               # in Loop: Header=BB0_13 Depth=1
         lw      a0, -40(s0)
         addi    a0, a0, 1
         sw      a0, -40(s0)
         j       .LBB0_13
     .LBB0_24:
         mv      a0, zero
         sw      a0, -12(s0)
         lw      a0, -24(s0)
         add     sp, zero, a0
         lw      a0, -12(s0)
         addi    sp, s0, -64
         lw      s0, 56(sp)                  # 4-byte Folded Reload
         lw      ra, 60(sp)                  # 4-byte Folded Reload
         addi    sp, sp, 64
         j       .EXIT
     .EXIT:
         li      a7, 10
         ecall
     ```

     We make the following modification at the end:

     ```asm
     ...
     .LBB0_24:
         mv      a0, zero
         sw      a0, -12(s0)
         lw      a0, -24(s0)
         add     sp, zero, a0
         lw      a0, -12(s0)
         addi    sp, s0, -64
         lw      s0, 56(sp)                  # 4-byte Folded Reload
         lw      ra, 60(sp)                  # 4-byte Folded Reload
         addi    sp, sp, 64
         # ret                  <--- removed
         j       .EXIT          <--- added: jump to our exit stub
     .EXIT:                     <--- added
         li      a7, 10         <--- added: exit syscall for Ripes
         ecall                  <--- added
     ```
    
  4. Check the result
    Setup (sp = 0x7fffffff0)

    The result in C:

    /* printf("i=%d, j=%d, C[i+j*n]=%d \n", i, j, C[i+j*n]); */
    i=0, j=0, C[i+j*n]=6
    i=0, j=0, C[i+j*n]=12
    i=0, j=1, C[i+j*n]=6
    i=0, j=1, C[i+j*n]=12
    i=1, j=0, C[i+j*n]=6
    i=1, j=0, C[i+j*n]=12
    i=1, j=1, C[i+j*n]=6
    i=1, j=1, C[i+j*n]=12
    

    Memory in Ripes.

  5. The access order (small case, n = 2)

  6. Result in Ripes (n = 50)

    Hit rate is 0.9174.

Task

  1. Record your results in answers.txt

    ijk:	n = 1000, 2.191 Gflop/s
    ikj:	n = 1000, 0.317 Gflop/s
    jik:	n = 1000, 2.097 Gflop/s
    **jki:	n = 1000, 13.640 Gflop/s**
    kij:	n = 1000, 0.317 Gflop/s
    kji:	n = 1000, 13.615 Gflop/s
    
  2. Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
    Ans 1. jki (multMat4), with kji (multMat6) essentially tied.
    Ans 2.
    In jki (and kji) order, the innermost loop increments i, so it moves through both A and C with stride 1 (and through B with stride 0); the CPU keeps hitting adjacent words that are already in the cache.

    n = 2,

    n = 50, Hit rate = 0.966

  3. Which ordering(s) perform the worst? Why?
    Ans 1. ikj (multMat2) and kij (multMat5)
    Ans 2.
    In these orders, the innermost loop increments j, so it moves through both B and C with stride n; when n is large, a cache block cannot cover the distance between consecutive accesses, and the CPU misses on almost every access.

    ikj (multMat2)

    n = 50, Hit rate = 0.8692

    kij (multMat5)

    n = 50, Hit rate = 0.8702

  4. How does the way we stride through the matrices with respect to the innermost loop affect performance?
    Ans.
    The larger the stride of the innermost loop, the weaker the spatial locality. A stride-1 (or stride-0) inner loop keeps reusing the cache block it just fetched, while a stride-n inner loop touches a new block on nearly every iteration, so smaller strides perform better.
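To see the stride effect in isolation, here is a micro-benchmark of our own (not part of the lab files): it sums the same n-by-n array once row-wise (stride 1 in the inner loop) and once column-wise (stride n). Exact timings vary by machine, but the stride-n loop should be clearly slower for large n.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* wall-clock time in seconds */
static double now(void) {
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + 1e-6 * t.tv_usec;
}

int main(void) {
    int n = 4000;
    float *a = malloc((size_t)n * n * sizeof(float));
    for (long i = 0; i < (long)n * n; i++) a[i] = 1.0f;

    double t0 = now();
    volatile float s1 = 0;
    for (int i = 0; i < n; i++)          /* stride 1 in the inner loop */
        for (int j = 0; j < n; j++)
            s1 += a[(long)i * n + j];
    double t1 = now();

    volatile float s2 = 0;
    for (int j = 0; j < n; j++)          /* stride n in the inner loop */
        for (int i = 0; i < n; i++)
            s2 += a[(long)i * n + j];
    double t2 = now();

    printf("stride 1: %.3f s, stride n: %.3f s\n", t1 - t0, t2 - t1);
    free(a);
    return 0;
}
```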


Exercise 3 - Cache Blocking and Matrix Transposition

Matrix Transposition
Sometimes, we wish to swap the rows and columns of a matrix. This operation is called a “transposition” and an efficient implementation can be quite helpful while performing more complicated linear algebra operations. The transpose of matrix A is often denoted as AT.

Cache Blocking
In the above code for matrix multiplication, note that we are striding across the entire A and B matrices to compute a single value of C. As such, we are constantly accessing new values from memory and obtain very little reuse of cached data! We can improve the amount of data reuse in the caches by implementing a technique called cache blocking. More formally, cache blocking is a technique that attempts to reduce the cache miss rate by further improving the temporal and/or spatial locality of memory accesses. In the case of matrix transposition we consider performing the transposition one block at a time.

Things to note: In the above image, we transpose each submatrix Aij of matrix A into its final location in the output matrix, one submatrix at a time. It is important to note that transposing each individual subsection is equivalent to transposing the entire matrix.

Since we operate on and finish transposing each submatrix one at a time, we consolidate our memory accesses to that smaller chunk of memory when transposing that particular submatrix, which increases the degree of temporal (and spatial) locality that we exhibit, which makes our cache performance better, which makes our program run faster.

This (if implemented correctly) will result in a substantial improvement in performance. For this lab, you will implement a cache blocking scheme for matrix transposition and analyze its performance.
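For reference, here is a sketch of the naive transpose that the blocked version below is measured against (the actual transpose_naive in the lab's transpose.c may differ in details):

```c
/* Naive transpose: dst becomes the transpose of src (both n-by-n).
 * The inner loop writes dst with stride 1 but reads src with stride n,
 * so for large n almost every read touches a new cache block. */
void transpose_naive(int n, int *dst, int *src) {
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            dst[y + x * n] = src[x + y * n];
}
```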

Transpose_blocking

```c
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    // YOUR CODE HERE
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < n; j += blocksize)
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize && k < n; ++k)
                for (int l = j; l < j + blocksize && l < n; ++l)
                    dst[k + l*n] = src[l + k*n];
}
```

Part 1 - Changing Array Sizes

Task

Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.

  • Record your results in answers.txt

    n = 100,
    Testing naive transpose: 0.004 milliseconds
    Testing transpose with blocking: 0.004 milliseconds

    n = 1000,
    Testing naive transpose: 0.716 milliseconds
    Testing transpose with blocking: 0.562 milliseconds

    n = 2000,
    Testing naive transpose: 12.236 milliseconds
    Testing transpose with blocking: 3.432 milliseconds

    n = 5000,
    Testing naive transpose: 113.673 milliseconds
    Testing transpose with blocking: 23.722 milliseconds

    n = 10000,
    Testing naive transpose: 600.908 milliseconds
    Testing transpose with blocking: 102.394 milliseconds
    
  • At what point does the cache blocked version of transpose become faster than the non-cache blocked version?
    Ans.
    n = 1000 (at n = 100 the two versions are equally fast; from n = 1000 on, blocking wins).
  • Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
    Ans.
    For small n the whole matrix fits in the cache, so even the naive transpose already enjoys good locality and blocking only adds loop overhead. At n = 100, for instance, each matrix is 100 × 100 × 4 B = 40 KB, which fits comfortably in a typical L2 cache; blocking pays off only once the matrices far exceed the cache size.

Part 2 - Changing Blocksize

Task

Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.

  • Record your results in answers.txt

    blocksize = 50,
    Testing naive transpose: 574.05 milliseconds
    Testing transpose with blocking: 118.785 milliseconds

    blocksize = 100,
    Testing naive transpose: 562.55 milliseconds
    Testing transpose with blocking: 111.177 milliseconds

    blocksize = 500,
    Testing naive transpose: 595.29 milliseconds
    Testing transpose with blocking: 137.427 milliseconds

    blocksize = 1000,
    Testing naive transpose: 591.831 milliseconds
    Testing transpose with blocking: 163.924 milliseconds

    blocksize = 5000,
    Testing naive transpose: 583.844 milliseconds
    Testing transpose with blocking: 580.742 milliseconds
    
  • How does performance change as blocksize increases? Why is this the case?
    Ans.
    Performance improves slightly up to blocksize = 100 and then degrades as blocksize grows further. Once a blocksize × blocksize tile no longer fits in the cache, accesses within the tile lose their locality, which is exactly the weak-locality situation blocking was meant to avoid. In the worst case (blocksize = 5000 with n = 10000, a tile of 5000 × 5000 × 4 B = 100 MB) the blocked code degenerates to nearly the speed of the naive transpose.

Notice that in neither of the last two exercises did we actually know the cache parameters of our machine. We just made the code exhibit a higher degree of locality, and this magically made things go faster! This tells us that caches, regardless of their specific parameters, will always benefit from operating on code which exhibits a high degree of locality.