# Rework CacheLab

Author: 羅紹豪

In this term project I rework [Assignment4: Cache](https://hackmd.io/eBIh9LtJQguyBpSP7gYl3w?view) using the method provided in [CS 61C Lab 7](https://inst.eecs.berkeley.edu/~cs61c/su20/labs/lab07/), and I also reference the homeworks written by [魏晉成](https://hackmd.io/T_MxFzGNRMqS5b-7oGaz9w), [江松穎](https://hackmd.io/@Uduru0522/rJ-NUxC3v), and [呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD).

## Exercise 1 - A Couple of Memory Access Scenarios

We use the program `cache.s` from the [lab source code](https://github.com/61c-teach/su20-lab-starter/tree/master/lab07) to observe how the cache behaves in three kinds of situations.

:::spoiler **Pseudocode:**
```c=
int array[];    // Assume sizeof(int) == 4
for (k = 0; k < repcount; k++) {    // repeat repcount times
    /* Step through the selected array segment with the given step size. */
    for (index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0;                   // Option 0: one cache access - write
        else
            array[index] = array[index] + 1;    // Option 1: two cache accesses - read AND write
    }
}
```
:::

### Scenario 1:
Test environment: [Ripes](https://github.com/mortbopet/Ripes)

:::spoiler **Program Parameters**
* Array Size (a0): 128 (bytes)
* Step Size (a1): 8
* Rep Count (a2): 4
* Option (a3): 0
:::

Use the given program parameters to modify the assembly code in `cache.s`:
```
main:
	li a0, 128    # array size in BYTES (power of 2 < array size)
	li a1, 8      # step size (power of 2 > 0)
	li a2, 4      # rep count (int > 0)
	li a3, 0      # 0 - option 0, 1 - option 1
```
We also modify the exit sequence in `main`:
```
	li a7, 10    # exit
	ecall
```
The original code loads the system call number into `a0`, but Ripes requires:
* `a7` for specifying the system call
* `a0` for system call parameters

:::spoiler **Cache Parameters**:
* Cache Levels: 1
* Block Size: 8
* Number of Blocks: 4
* Enable?: Should be green
* Placement Policy: Direct Mapped
* Associativity: 1
* Block Replacement Policy: LRU
:::

Set the required cache parameters in Ripes (the listed parameters are for [Venus](https://venus.cs61c.org/), so the Ripes settings differ slightly):
![](https://i.imgur.com/4JhGBjE.png)

The cache will look like this:
![](https://i.imgur.com/bFrUKvM.png)

In Ripes the basic cache unit (a word) is always 4 bytes, so we set the line width to 2 words to get an 8-byte block, then treat those 2 words as one block.
![](https://i.imgur.com/AtCmdGW.png)

In the picture above the hit rate is 0 and there are 16 misses. We can also get this result by hand: 128 (bytes) / 8 (step size) / 4 (bytes per word) = 4 accesses per rep, and 4 reps give 16 accesses. Why does every access miss? The cache is direct-mapped and the stride (8 ints = 32 bytes) equals the cache size, so every block brought into the cache has index 0: each access replaces the block at index 0 while the other cache lines sit idle.
![](https://i.imgur.com/P1NRRUW.png)

If we change the placement policy to 4-way set associative, the hit rate increases to 0.75:
![](https://i.imgur.com/Bj8R7GL.png)

### Tasks
**Q1.** What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: "Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.")

**A1.** Because the cache size in bytes is exactly equal to the step size in bytes, and the cache is direct-mapped, the hit rate is 0 in this case.
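To make the conflict concrete, here is a small standalone sketch (an illustration only, not part of `cache.s` or the lab) that computes which cache set each Scenario 1 access maps to, using this cache's geometry: 8-byte blocks (3 offset bits) and 4 lines (2 index bits).

```c
#include <stdio.h>

/* Scenario 1 geometry: 8-byte blocks -> 3 offset bits; 4 lines -> 2 index bits. */
#define OFFSET_BITS 3
#define INDEX_BITS  2

int main(void) {
    /* 128-byte array, step size of 8 ints = 32-byte stride; base address 0 for simplicity. */
    for (unsigned addr = 0; addr < 128; addr += 32) {
        unsigned index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        unsigned tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("addr %3u -> index %u, tag %u\n", addr, index, tag);
    }
    return 0;
}
```

All four addresses print index 0 with tags 0 through 3, so in a direct-mapped cache each access evicts its predecessor; with 4-way associativity all four tags coexist in set 0, which is why the hit rate jumps to 0.75 once the later reps reuse them.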
**Q2.** What is our hit rate if we increase Rep Count arbitrarily? Why?

**A2.** The hit rate does not change when we increase Rep Count: each rep is just another iteration of the same access pattern, so the only change in the result is the total number of misses.

**Q3.** How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: "Change [parameter] to [value]." Note: we don't care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)

**A3.** We can change the step size to 1. In this case the cache fills with blocks, and the hit rate rises to 0.5:
![](https://i.imgur.com/s1uLAe6.png)

### Scenario 2:
Test environment: [Ripes](https://github.com/mortbopet/Ripes)

:::spoiler **Program Parameters**
* Array Size (a0): 256 (bytes)
* Step Size (a1): 2
* Rep Count (a2): 1
* Option (a3): 1
:::

As in Scenario 1, use the given program parameters to modify the assembly code in `cache.s`:
```
main:
	li a0, 256    # array size in BYTES (power of 2 < array size)
	li a1, 2      # step size (power of 2 > 0)
	li a2, 1      # rep count (int > 0)
	li a3, 1      # 0 - option 0, 1 - option 1
```

:::spoiler **Cache Parameters**:
* Cache Levels: 1
* Block Size: 16
* Number of Blocks: 16
* Enable?: Should be green
* Placement Policy: N-Way Set Associative
* Associativity: 4
* Block Replacement Policy: LRU
:::

Set the required cache parameters in Ripes (the listed parameters are for [Venus](https://venus.cs61c.org/), so the Ripes settings differ slightly):
![](https://i.imgur.com/4IyP7Mg.png)

### Tasks
**Q1.** How many memory accesses are there per iteration of the inner loop?

**A1.** Because we set the option to 1, the program executes `array[index] = array[index] + 1` instead of `array[index] = 0`, which adds a read before the write. So there are 2 memory accesses per iteration.

**Q2.** What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses.)

**A2.** Miss-Hit-Hit-Hit. The first access is a compulsory miss; the write to the same block then hits. Because the cache line is 4 words wide, the next element's read and write also hit.

**Q3.** This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.

**A3.** The hit rate is 0.75 because there are 3 hits in every 4 accesses.

**Q4.** Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!

**A4.** As Rep Count goes to infinity, the hit rate gets very close to 1. There is no replacement in this case, so the number of misses stays at 16 (all compulsory) while the number of hits keeps increasing.

Setting Rep Count to 100:
![](https://i.imgur.com/8MyvVO2.png)
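To put numbers on A4: each rep performs 256 / (2 × 4) = 32 inner iterations with 2 accesses each, i.e. 64 accesses, while the misses stay at the 16 compulsory ones (one per 16-byte block). For Rep Count $R$ the hit rate is therefore

$$
\text{hit rate} = \frac{64R - 16}{64R} = 1 - \frac{1}{4R} \longrightarrow 1 \quad (R \to \infty)
$$

For $R = 1$ this gives the 0.75 of A3, and for $R = 100$ it gives 0.9975, matching the screenshot above.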
**Q5.** Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) Rep Count times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep.) QUESTION: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others, AKA it doesn't matter whether one Rep is applied to one element before that Rep is applied to another, etc.

**A5.**
**Method 1** (provided by 江松穎): Since the array is that huge, we cannot keep every accessed address cached. Instead, we can make every "kind" of access to an element happen together: for element `array[i]`, instead of advancing `i` every time we finish task `k`, we apply all functions to `array[i]` and only then move on to the next element. This way every array element incurs only 1 miss. A sample in C:

```c=
int array[SIZE];
void (*func_ptr)(int *);

for (int i = 0; i < SIZE; i += stepsize) {
    for (int k = 0; k < rep_count; ++k) {
        next_func(&func_ptr);       // advance to the k-th function, which modifies/uses only array[i]
        (*func_ptr)(&array[i]);
    }
}
```

**Method 2** (provided by 魏晉成): Access the first 64 KByte of the array at a time and apply all of the reads/writes to that chunk so we are completely done with it before moving on, thereby keeping that 64 KByte hot in the cache and never having to circle back to it later.

### Scenario 3:
Test environment: [Venus](https://venus.cs61c.org/)

:::spoiler **Program Parameters**
* Array Size (a0): 128 (bytes)
* Step Size (a1): 1
* Rep Count (a2): 1
* Option (a3): 0
:::

As in Scenario 1, use the given program parameters to modify the assembly code in `cache.s`:
```
main:
	li a0, 128    # array size in BYTES (power of 2 < array size)
	li a1, 1      # step size (power of 2 > 0)
	li a2, 1      # rep count (int > 0)
	li a3, 0      # 0 - option 0, 1 - option 1
```

:::spoiler **Cache Parameters**:
* Cache Levels: 2
* L1 Cache Parameters:
    * Block Size: 8
    * Number of Blocks: 8
    * Placement Policy: Direct Mapped
    * Associativity: 1
    * Block Replacement Policy: LRU
* L2 Cache Parameters:
    * Block Size: 8
    * Number of Blocks: 16
    * Placement Policy: Direct Mapped
    * Associativity: 1
    * Block Replacement Policy: LRU
:::

The difference this time is that Ripes does not support multilevel caches, so we use Venus. The L1 and L2 settings are shown below.

L1:
![](https://i.imgur.com/g7wMpDQ.png)
L2:
![](https://i.imgur.com/8Q4xHlG.png)

After running the program, the status of the caches becomes:

L1:
![](https://i.imgur.com/63N9EeZ.png)
L2:
![](https://i.imgur.com/0oPu6HA.png)

In the result we can see 32 memory accesses to L1: 16 hits and 16 misses. When L1 misses, the access goes down to L2, but in this case L2 misses as well, so the access eventually goes to memory.

### Tasks
**Q1.** What is the hit rate of our L1 cache? Our L2 cache? Overall?

**A1.** The hit rate of the L1 cache is 0.5 and of the L2 cache is 0. Using `hits / total accesses` for the overall rate gives 16 / (32 + 16) = 1/3. (The students I referenced gave two different answers to this question; I think this one is correct.)

**Q2.** How many accesses do we have to the L1 cache total? How many of them are misses?

**A2.** According to the result shown in Venus, there are 32 accesses to the L1 cache in total, 16 of which are misses.

**Q3.** How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?

**A3.** According to the result shown in Venus, there are 16 accesses to the L2 cache in total: whenever we cannot find the block we want in L1 (an L1 miss), we go looking for it in L2.

**Q4.** What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?

**A4.** Increasing the Rep Count is an easy way to raise the L2 hit rate. In the first run every miss was a compulsory miss; with a larger Rep Count we reuse the blocks that now reside in L2.
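A quick check on A4's claim. The L2 cache (16 × 8 B = 128 B) exactly holds the 128-byte array, and each of its 16 blocks maps to a distinct L2 line, so nothing is evicted from L2 after the first rep. The L1 cache (8 × 8 B = 64 B) is only half the array, so every block is evicted before it is reused and L1 keeps producing 16 misses per rep. For Rep Count $R$:

$$
\text{L1 hit rate} = \frac{16R}{32R} = 0.5, \qquad
\text{L2 hit rate} = \frac{16(R-1)}{16R} = 1 - \frac{1}{R}
$$

So the L1 hit rate stays at 0.5 while the L2 hit rate climbs toward 1 as $R$ grows.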
**Q5.** What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?

**A5.** Increasing the number of blocks in the L1 cache does not by itself affect the hit rate of either L1 or L2; but if we also raise the Rep Count, the L1 hit rate increases while its miss count stays the same, so the L2 hit rate remains unchanged.

**Increase number of blocks:**
![](https://i.imgur.com/rPSakug.png)
![](https://i.imgur.com/7FpQ70m.png)

**Add Rep Count:**
![](https://i.imgur.com/E247R7c.png)
![](https://i.imgur.com/SoBbusw.png)

Increasing the L1 block size to 16 changes the access pattern from M-H-M-H to M-H-H-H, so it definitely raises the L1 hit rate; the L2 cache, however, still faces the same problem: compulsory misses.

**Increase block size:**
![](https://i.imgur.com/B05FHYE.png)
![](https://i.imgur.com/XzGKrDX.png)
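The block-size effect generalizes. With a unit stride over 4-byte ints, a $B$-byte block serves $B/4$ consecutive accesses and only the first one misses, so

$$
\text{L1 miss rate} = \frac{4}{B}
$$

which gives 1/2 for 8-byte blocks (the M-H-M-H pattern) and 1/4 for 16-byte blocks (M-H-H-H), exactly as observed above.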
## Exercise 2 - Loop Ordering and Matrix Multiplication

This exercise lets us compare the performance of the different loop orderings of matrix multiplication.

:::spoiler **matrixMultiply.c**
```c=
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

/* To save you time, we are including all 6 variants of the loop
   ordering as separate functions and then calling them using function
   pointers. The reason for having separate functions that are nearly
   identical is to avoid counting any extraneous processing towards the
   computation time. This includes I/O accesses (printf) and conditionals
   (if/switch). I/O accesses are slow and conditional/branching statements
   could unfairly bias results (lower cases in switches must run through
   more case statements on each iteration). */

void multMat1( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ijk loop order. */
    for( i = 0; i < n; i++ )
        for( j = 0; j < n; j++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat2( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ikj loop order. */
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat3( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jik loop order. */
    for( j = 0; j < n; j++ )
        for( i = 0; i < n; i++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat4( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jki loop order. */
    for( j = 0; j < n; j++ )
        for( k = 0; k < n; k++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat5( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kij loop order. */
    for( k = 0; k < n; k++ )
        for( i = 0; i < n; i++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat6( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kji loop order. */
    for( k = 0; k < n; k++ )
        for( j = 0; j < n; j++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

/* uses timing features from sys/time.h that you haven't seen before */
int main( int argc, char **argv ) {
    int nmax = 1000,i;

    void (*orderings[])(int,float *,float *,float *) =
        {&multMat1,&multMat2,&multMat3,&multMat4,&multMat5,&multMat6};
    char *names[] = {"ijk","ikj","jik","jki","kij","kji"};

    float *A = (float *)malloc( nmax*nmax * sizeof(float));
    float *B = (float *)malloc( nmax*nmax * sizeof(float));
    float *C = (float *)malloc( nmax*nmax * sizeof(float));

    struct timeval start, end;

    /* fill matrices with random numbers */
    for( i = 0; i < nmax*nmax; i++ ) A[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) B[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) C[i] = drand48()*2-1;

    for( i = 0; i < 6; i++) {
        /* multiply matrices and measure the time */
        gettimeofday( &start, NULL );
        (*orderings[i])( nmax, A, B, C );
        gettimeofday( &end, NULL );

        /* convert time to Gflop/s */
        double seconds = (end.tv_sec - start.tv_sec) +
            1.0e-6 * (end.tv_usec - start.tv_usec);
        double Gflops = 2e-9*nmax*nmax*nmax/seconds;
        printf( "%s:\tn = %d, %.3f Gflop/s\n", names[i], nmax, Gflops );
    }

    free( A );
    free( B );
    free( C );

    printf("\n\n");

    return 0;
}
```
:::

There are six orderings in the code: `ijk`, `ikj`, `jik`, `jki`, `kij`, `kji`. The corresponding column and row relationships look like this (figure provided by [林柏維](https://hackmd.io/lYbsD9D2Q2i-otiPCkB6SA?view#Exercise-2---Loop-Ordering-and-Matrix-Multiplication)):
![](https://i.imgur.com/RKFpd0B.png)

Memory is one-dimensional, so the two-dimensional matrix has to be flattened, like this (figure provided by [呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD)):
![](https://i.imgur.com/oeQsGTP.png)

Use the following command to get the results for all six orderings:
```
make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
```

Running the test three times, the results are:
```
ijk:	n = 1000, 1.746 Gflop/s
ikj:	n = 1000, 0.154 Gflop/s
jik:	n = 1000, 1.724 Gflop/s
jki:	n = 1000, 10.591 Gflop/s
kij:	n = 1000, 0.153 Gflop/s
kji:	n = 1000, 9.020 Gflop/s

ijk:	n = 1000, 1.840 Gflop/s
ikj:	n = 1000, 0.151 Gflop/s
jik:	n = 1000, 1.765 Gflop/s
jki:	n = 1000, 11.146 Gflop/s
kij:	n = 1000, 0.152 Gflop/s
kji:	n = 1000, 9.207 Gflop/s

ijk:	n = 1000, 1.841 Gflop/s
ikj:	n = 1000, 0.149 Gflop/s
jik:	n = 1000, 1.746 Gflop/s
jki:	n = 1000, 10.151 Gflop/s
kij:	n = 1000, 0.151 Gflop/s
kji:	n = 1000, 9.475 Gflop/s
```

### Tasks
**Q1.** Which ordering(s) perform best for these 1000-by-1000 matrices? Why?

**A1.** Orderings `jki` and `kji` perform better than the others. The matrices in this exercise are stored column-major, and the update is always `C[i+j*n] += A[i+k*n]*B[k+j*n]`, so when `i` is the innermost index, the element indices of A and C advance by one instead of n per iteration, which means far fewer cache misses.
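To see those strides concretely, this standalone sketch (an illustration; the `idx` helper is mine, not part of `matrixMultiply.c`) prints how far each operand's element index moves per step of each loop index under the column-major layout `X[row + col*n]`:

```c
#include <stdio.h>

/* Column-major index, as used in matrixMultiply.c: element (row, col) lives at row + col*n. */
static long idx(long row, long col, long n) { return row + col * n; }

int main(void) {
    long n = 1000;
    /* In C[i+j*n] += A[i+k*n] * B[k+j*n], stepping one loop index by 1 moves
       each operand's element index by the amounts below. */
    printf("i += 1: A %+ld, C %+ld, B +0\n", idx(1, 0, n) - idx(0, 0, n), idx(1, 0, n) - idx(0, 0, n));
    printf("j += 1: B %+ld, C %+ld, A +0\n", idx(0, 1, n) - idx(0, 0, n), idx(0, 1, n) - idx(0, 0, n));
    printf("k += 1: A %+ld, B %+ld, C +0\n", idx(0, 1, n) - idx(0, 0, n), idx(1, 0, n) - idx(0, 0, n));
    return 0;
}
```

An innermost `i` gives two unit-stride streams (A and C) and one stationary operand; an innermost `j` gives two stride-1000 streams (B and C); an innermost `k` gives one of each. This ranks `jki`/`kji` fastest and `ikj`/`kij` slowest, matching the measurements above.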
**Q2.** Which ordering(s) perform the worst? Why?

**A2.** On the other hand, when `j` is the innermost index, the element indices of B and C advance by n per iteration, which causes far more cache block replacement. So the worst cases are `ikj` and `kij`.

**Q3.** How does the way we stride through the matrices with respect to the innermost loop affect performance?

**A3.** To get higher performance, the first thing to do is lower the cache miss rate, so the choice of innermost loop index is the key: pick the index that walks down a column (unit stride in this column-major layout) for two of the three matrices as the innermost loop, and the remaining column index as the second loop. That combination gives the lowest miss rate. I think this figure really makes the point (provided by [呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD)):
![](https://i.imgur.com/LBMEoUY.png)

## Exercise 3 - Cache Blocking and Matrix Transposition

### Exercise Description
In this exercise we need to design a blocked transpose and write it into the [transpose.c](https://github.com/61c-teach/su20-lab-starter/blob/master/lab07/transpose.c) provided by [Lab 7](https://github.com/61c-teach/su20-lab-starter/tree/master/lab07). In this part I compare the code and the performance of the three implementations I referenced to figure out what makes them differ.
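For context, the naive baseline that `transpose.c` times is a plain double loop over the whole matrix, roughly like this sketch (the lab's exact source may differ in details):

```c
/* Naive n x n transpose: dst is written with stride n, so once n is large,
   every write in the inner loop touches a different cache line. */
void transpose_naive(int n, int *dst, int *src) {
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            dst[y + x * n] = src[x + y * n];
}
```

All three blocked versions below instead confine both the `src` and `dst` accesses to one blocksize-by-blocksize tile at a time, so the cache lines a tile touches stay resident until the tile is finished.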
:::spoiler Code designed by [江松穎](https://hackmd.io/@Uduru0522/rJ-NUxC3v):
```c=
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    int block_count = n / blocksize;
    for (int i = 0; i < block_count; ++i) {
        for (int j = 0; j < block_count; ++j) {
            // i := row block index
            // j := column block index
            for (int a = 0; a < blocksize; ++a) {
                for (int b = 0; b < blocksize; ++b) {
                    if (i * blocksize + a < n && j * blocksize + b < n) {
                        dst[(j * blocksize + b) + (i * blocksize + a) * n] =
                            src[(j * blocksize + b) * n + (i * blocksize + a)];
#ifdef EBUG
                        printf("Moving src[%d] ", (j * blocksize + b) * n + (i * blocksize + a));
                        printf("to dst[%d]", (j * blocksize + b) + (i * blocksize + a) * n);
                        putchar('\n');
#endif
                    }
                }
            }
        }
    }
}
```
:::

:::spoiler Performance:
    ./transpose 100 20
    Testing naive transpose: 0.004 milliseconds
    Testing transpose with blocking: 0.006 milliseconds

    ./transpose 1000 20
    Testing naive transpose: 1.154 milliseconds
    Testing transpose with blocking: 1.267 milliseconds

    ./transpose 5000 20
    Testing naive transpose: 296.269 milliseconds
    Testing transpose with blocking: 47.287 milliseconds

    ./transpose 10000 20
    Testing naive transpose: 1434.96 milliseconds
    Testing transpose with blocking: 250.862 milliseconds
:::

:::spoiler Code designed by [魏晉成](https://hackmd.io/T_MxFzGNRMqS5b-7oGaz9w):
```c=
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    for (int row = 0; row < n / blocksize; row++) {
        for (int col = 0; col < n / blocksize; col++) {
            int dst_base_offset = row * blocksize + col * blocksize * n,
                src_base_offset = col * blocksize + row * blocksize * n;
            for (int inner_row = 0; inner_row < blocksize; inner_row++) {
                for (int inner_col = 0; inner_col < blocksize; inner_col++) {
                    *(dst + dst_base_offset + inner_col * n + inner_row) =
                        *(src + src_base_offset + inner_row * n + inner_col);
                }
            }
        }
    }
}
```
:::

:::spoiler Performance:
    ./transpose 100 20
    Testing naive transpose: 0.005 milliseconds
    Testing transpose with blocking: 0.006 milliseconds

    ./transpose 1000 20
    Testing naive transpose: 1.55 milliseconds
    Testing transpose with blocking: 0.97 milliseconds

    ./transpose 5000 20
    Testing naive transpose: 314.841 milliseconds
    Testing transpose with blocking: 40.12 milliseconds

    ./transpose 10000 20
    Testing naive transpose: 1471.05 milliseconds
    Testing transpose with blocking: 175.337 milliseconds
:::

:::spoiler Code designed by [呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD):
```c=
// Function transpose() transposes a (blocksize x blocksize) block in an (n x n) matrix.
// The start indices of the block must be specified.
void transpose(int IndexX, int IndexY, int blocksize, int n, int *dst, int *src) {
    for (int y = IndexY; y < (IndexY + blocksize); y++) {
        for (int x = IndexX; x < (IndexX + blocksize); x++) {
            if (y < n && x < n) {    // prevent any operation at an out-of-bounds index
                dst[y + x * n] = src[x + y * n];
            }
        }
    }
}

// Function transpose_blocking() iterates through the matrix block by block.
// It transposes each block.
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    for (int blockX = 0; blockX < n; blockX += blocksize) {
        for (int blockY = 0; blockY < n; blockY += blocksize) {
            transpose(blockX, blockY, blocksize, n, dst, src);
        }
    }
}
```
:::

:::spoiler Performance:
    ./transpose 100 20
    Testing naive transpose: 0.005 milliseconds
    Testing transpose with blocking: 0.012 milliseconds

    ./transpose 1000 20
    Testing naive transpose: 1.105 milliseconds
    Testing transpose with blocking: 1.713 milliseconds

    ./transpose 5000 20
    Testing naive transpose: 302.005 milliseconds
    Testing transpose with blocking: 66.458 milliseconds

    ./transpose 10000 20
    Testing naive transpose: 1409.76 milliseconds
    Testing transpose with blocking: 427.206 milliseconds
:::

All three implementations follow the same logic, yet their performance differs. The first code comes in second: compared with the fastest (the second code), it recomputes the same index expressions on every inner-loop iteration, whereas that calculation can be done once at the second loop level, which I think is why the second code performs better. The third code is the slowest, but it has the advantage of being the clearest: it factors the per-block transpose into its own function and calls it from `transpose_blocking`, avoiding a fourth nested loop. That makes it friendlier to the programmer and more convenient to reuse later.

### Part 1 - Changing Array Sizes
#### Tasks
**Q1.** At what point does the cache blocked version of transpose become faster than the non-cache blocked version?

**A1.** By the test results, the best-performing code (the second) is already faster by `array size = 1000`; the others cross over somewhere between `array size = 1000` and `5000`.

**Q2.** Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?

**A2.** When the matrix is small, the naive transpose takes advantage of locality just as well as the blocked version, because the whole working set fits in the cache. As the matrix grows, accesses to the same block land farther apart, and the naive version's miss rate climbs.
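Rough footprint arithmetic backs this up (assuming a last-level cache of a few MB, which is typical but not specified by the lab). With 4-byte ints an $n \times n$ matrix occupies $4n^2$ bytes, and the transpose streams through two such matrices:

$$
4n^2 \approx
\begin{cases}
40\ \text{KB} & n = 100 \\
4\ \text{MB} & n = 1000 \\
100\ \text{MB} & n = 5000
\end{cases}
$$

At $n = 100$ both matrices fit comfortably, at $n = 1000$ the working set is borderline, and by $n = 5000$ it dwarfs any cache, which is exactly where blocking starts to win in the measurements above.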
### Part 2 - Changing Blocksize
Fix n at 10000, run the code with blocksize equal to 50, 100, 500, 1000, and 5000, and record the results.

:::spoiler Performance ([江松穎](https://hackmd.io/@Uduru0522/rJ-NUxC3v)):
    ./transpose 10000 50
    Testing naive transpose: 1420.89 milliseconds
    Testing transpose with blocking: 171.208 milliseconds

    ./transpose 10000 100
    Testing naive transpose: 1436.73 milliseconds
    Testing transpose with blocking: 146.55 milliseconds

    ./transpose 10000 500
    Testing naive transpose: 1427.1 milliseconds
    Testing transpose with blocking: 150.841 milliseconds

    ./transpose 10000 1000
    Testing naive transpose: 1448.18 milliseconds
    Testing transpose with blocking: 269.06 milliseconds

    ./transpose 10000 5000
    Testing naive transpose: 1431.75 milliseconds
    Testing transpose with blocking: 1306.34 milliseconds
:::

:::spoiler Performance ([魏晉成](https://hackmd.io/T_MxFzGNRMqS5b-7oGaz9w)):
    ./transpose 10000 50
    Testing naive transpose: 1415.84 milliseconds
    Testing transpose with blocking: 154.443 milliseconds

    ./transpose 10000 100
    Testing naive transpose: 1434.98 milliseconds
    Testing transpose with blocking: 142.22 milliseconds

    ./transpose 10000 500
    Testing naive transpose: 1424.07 milliseconds
    Testing transpose with blocking: 184.678 milliseconds

    ./transpose 10000 1000
    Testing naive transpose: 1428.87 milliseconds
    Testing transpose with blocking: 238.894 milliseconds

    ./transpose 10000 5000
    Testing naive transpose: 1452.11 milliseconds
    Testing transpose with blocking: 1179.47 milliseconds
:::

:::spoiler Performance ([呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD)):
    ./transpose 10000 50
    Testing naive transpose: 1452.17 milliseconds
    Testing transpose with blocking: 340.687 milliseconds

    ./transpose 10000 100
    Testing naive transpose: 1523.58 milliseconds
    Testing transpose with blocking: 263.248 milliseconds

    ./transpose 10000 500
    Testing naive transpose: 1461.54 milliseconds
    Testing transpose with blocking: 192.446 milliseconds

    ./transpose 10000 1000
    Testing naive transpose: 1408.23 milliseconds
    Testing transpose with blocking: 254.349 milliseconds

    ./transpose 10000 5000
    Testing naive transpose: 1407.35 milliseconds
    Testing transpose with blocking: 1122.16 milliseconds
:::

### Tasks
**Q1.** How does performance change as blocksize increases? Why is this the case?

**A1.** In all three cases the performance shows no obvious change over the moderate block sizes, but once the block becomes too big, the distance between two accesses to the same cache line grows very large and performance drops. I think this picture makes the point (provided by [江松穎](https://hackmd.io/@Uduru0522/rJ-NUxC3v)):
![](https://i.imgur.com/jWQnPSp.png)

## Reference
[江松穎 - Homework4](https://hackmd.io/@Uduru0522/rJ-NUxC3v)
[魏晉成 - Homework4](https://hackmd.io/T_MxFzGNRMqS5b-7oGaz9w?view)
[呂紹樺 - Homework4](https://hackmd.io/@horseradish1208/HkzkUEAiD)
[CS 61C Lab 7](https://inst.eecs.berkeley.edu/~cs61c/su20/labs/lab07/)
[Ripes](https://github.com/mortbopet/Ripes)
