
Assignment4: Cache

tags: Computer Architecture

Requirement

Introduction to cache

Exercise lab7

Source code of this assignment

Exercise 1 - A Couple of Memory Access Scenarios

PSEUDOCODE:

int array[];    // Assume sizeof(int) == 4
for (k = 0; k < repcount; k++) {    // repeat repcount times
    /* Step through the selected array segment with the given step size. */
    for (index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0;                   // Option 0: One cache access - write
        else
            array[index] = array[index] + 1;    // Option 1: Two cache accesses - read AND write
    }
}

source: cache.s

Scenario 1

Program Parameter: the following register parameters are set in cache.s

| parameter | register | value |
| --- | --- | --- |
| Array Size | a0 | 128 (bytes) |
| Step Size | a1 | 8 |
| Rep Count | a2 | 4 |
| Option | a3 | 0 |

Cache Parameters: the following cache parameters are set in Ripes simulator

| parameter | value or strategy |
| --- | --- |
| Cache Level | 1 |
| Block Size | 8 |
| Number of blocks | 4 |
| Enable | Should be green |
| Placement Policy | Direct Mapped |
| Associativity | 1 |
| Block Replacement Policy | LRU |

Q1: What combination of parameters is producing the hit rate you observe?

A1:

  • the resulting hit rate is shown in the Ripes cache statistics (screenshot not available)
  • explanation: tracing the code, we can observe that the outer loop iterates 4 times and the inner loop iterates 4 times, 16 accesses in total, with the inner loop stepping through the array with step size 8. On the first round the inner index goes 0 -> 8 -> 16 -> 24, and each of these first touches of the array is a compulsory miss (this happens only during the first iteration of the outer loop; the later repetitions all hit).
  • the categories of cache misses can be checked here

Q2: What is our hit rate if we increase Rep Count arbitrarily? Why?

A2: the hit rate will increase toward 100%. The misses are all compulsory misses that occur during the first repetition, so the number of misses is fixed at 4; every additional repetition adds only hits, giving a hit rate of (4r - 4)/(4r) = 1 - 1/r for Rep Count r.

Q3: How could we modify one program parameter to increase our hit rate?

A3: there are two parameters we can modify to increase the hit rate:

  • Step Size: decreasing the step size increases the hit rate.

    • here we decrease the step size to 2 (Ripes cache-statistics screenshot not available)
  • Rep Count: increasing the Rep Count will also increase the hit rate. As mentioned in Q1, the misses all occur during the first iteration, so increasing the total number of iterations raises the hit ratio by adding only hits (the miss count is fixed). Doubling Rep Count to 8 confirms this (Ripes cache-statistics screenshot not available).

Scenario 2

Program Parameters:

| Array Size | Step Size | Rep Count | Option |
| --- | --- | --- | --- |
| 256 (bytes) | 2 | 1 | 1 |

Cache Parameters:

| Parameter | value or strategy |
| --- | --- |
| Cache Level | 1 |
| Number of blocks | 16 |
| Block Size | 16 |
| Enable | Should be green |
| Placement Policy | N-Way Set Associative |
| Associativity | 4 |
| Block Replacement Policy | LRU |

Q1: How many memory accesses are there per iteration of the inner loop?

A1: each iteration of the inner loop (wordLoop) executes one lw and one sw, so it accesses memory twice per iteration.

Q2: What is the repeating hit/miss pattern? WHY?

A2: the repeating pattern is 1 miss followed by 15 hits. On a compulsory miss the whole 16-word block is loaded into a cache line; with step size 2, eight iterations land in each block, and each iteration performs both a lw and a sw (option 1), so after the missing lw the remaining 15 accesses to that block all hit.

Q3: This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.

A3: following the explanation of the previous question, the hit rate is 15/16, that is 0.9375.


Q4: Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why?

A4: as Rep Count goes to infinity, the hit rate approaches 1, because all the data is loaded into the cache during the first repetition, so the following repetitions never miss.

Scenario 3

Ripes only supports a single-level cache at the moment, so this problem should be done with Venus (TODO)

Exercise 2 - Loop Ordering and Matrix Multiplication

source: matrixMultiply.c

multiplication of two matrices

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];

To compute the matrix multiplication correctly, the loop order doesn’t matter.

  • although the loop order doesn't affect correctness, the efficiency varies greatly between orderings; one critical point in accelerating matrix multiplication is to exploit the spatial and temporal locality of the cache

Task1: Record results

  • result of running matrixMultiply.c
  • the only difference between these matrix multiplications is the order of the i, j, k loops
make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk:    n = 1000, 0.858 Gflop/s
ikj:    n = 1000, 0.155 Gflop/s
jik:    n = 1000, 0.849 Gflop/s
jki:    n = 1000, 1.451 Gflop/s
kij:    n = 1000, 0.478 Gflop/s
kji:    n = 1000, 1.448 Gflop/s
  • note that Gflop/s means the following:

Giga-floating-point-operations per second.” THE BIGGER THE NUMBER THE FASTER IT IS RUNNING!

  • we can record the index accessed in each matrix on every iteration under the different orderings
  • here we experiment with a 3*3 matrix multiplication and record the access strides
    • A_idx: i+k*n
    • B_idx: k+j*n
    • C_idx: i+j*n
ijk
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 0 0 1 3 1 0
3 0 0 2 6 2 0
4 0 1 0 0 3 3
5 0 1 1 3 4 3
6 0 1 2 6 5 3
7 0 2 0 0 6 6
8 0 2 1 3 7 6
9 0 2 2 6 8 6
10 1 0 0 1 0 1
11 1 0 1 4 1 1
12 1 0 2 7 2 1
13 1 1 0 1 3 4
14 1 1 1 4 4 4
15 1 1 2 7 5 4
16 1 2 0 1 6 7
17 1 2 1 4 7 7
18 1 2 2 7 8 7
19 2 0 0 2 0 2
20 2 0 1 5 1 2
21 2 0 2 8 2 2
22 2 1 0 2 3 5
23 2 1 1 5 4 5
24 2 1 2 8 5 5
25 2 2 0 2 6 8
26 2 2 1 5 7 8
27 2 2 2 8 8 8
A B C
stride 3 1 3
ikj
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 0 1 0 3 0 3
3 0 2 0 6 0 6
4 0 0 1 0 3 1
5 0 1 1 3 3 4
6 0 2 1 6 3 7
7 0 0 2 0 6 2
8 0 1 2 3 6 5
9 0 2 2 6 6 8
10 1 0 0 1 1 0
11 1 1 0 4 1 3
12 1 2 0 7 1 6
13 1 0 1 1 4 1
14 1 1 1 4 4 4
15 1 2 1 7 4 7
16 1 0 2 1 7 2
17 1 1 2 4 7 5
18 1 2 2 7 7 8
19 2 0 0 2 2 0
20 2 1 0 5 2 3
21 2 2 0 8 2 6
22 2 0 1 2 5 1
23 2 1 1 5 5 4
24 2 2 1 8 5 7
25 2 0 2 2 8 2
26 2 1 2 5 8 5
27 2 2 2 8 8 8
A B C
stride 3 3 3
jik
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 0 0 1 3 1 0
3 0 0 2 6 2 0
4 1 0 0 1 0 1
5 1 0 1 4 1 1
6 1 0 2 7 2 1
7 2 0 0 2 0 2
8 2 0 1 5 1 2
9 2 0 2 8 2 2
10 0 1 0 0 3 3
11 0 1 1 3 4 3
12 0 1 2 6 5 3
13 1 1 0 1 3 4
14 1 1 1 4 4 4
15 1 1 2 7 5 4
16 2 1 0 2 3 5
17 2 1 1 5 4 5
18 2 1 2 8 5 5
19 0 2 0 0 6 6
20 0 2 1 3 7 6
21 0 2 2 6 8 6
22 1 2 0 1 6 7
23 1 2 1 4 7 7
24 1 2 2 7 8 7
25 2 2 0 2 6 8
26 2 2 1 5 7 8
27 2 2 2 8 8 8
A B C
stride 3 1 1
jki
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 1 0 0 1 0 1
3 2 0 0 2 0 2
4 0 0 1 3 1 0
5 1 0 1 4 1 1
6 2 0 1 5 1 2
7 0 0 2 6 2 0
8 1 0 2 7 2 1
9 2 0 2 8 2 2
10 0 1 0 0 3 3
11 1 1 0 1 3 4
12 2 1 0 2 3 5
13 0 1 1 3 4 3
14 1 1 1 4 4 4
15 2 1 1 5 4 5
16 0 1 2 6 5 3
17 1 1 2 7 5 4
18 2 1 2 8 5 5
19 0 2 0 0 6 6
20 1 2 0 1 6 7
21 2 2 0 2 6 8
22 0 2 1 3 7 6
23 1 2 1 4 7 7
24 2 2 1 5 7 8
25 0 2 2 6 8 6
26 1 2 2 7 8 7
27 2 2 2 8 8 8
A B C
stride 1 1 1
kij
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 0 1 0 0 3 3
3 0 2 0 0 6 6
4 1 0 0 1 0 1
5 1 1 0 1 3 4
6 1 2 0 1 6 7
7 2 0 0 2 0 2
8 2 1 0 2 3 5
9 2 2 0 2 6 8
10 0 0 1 3 1 0
11 0 1 1 3 4 3
12 0 2 1 3 7 6
13 1 0 1 4 1 1
14 1 1 1 4 4 4
15 1 2 1 4 7 7
16 2 0 1 5 1 2
17 2 1 1 5 4 5
18 2 2 1 5 7 8
19 0 0 2 6 2 0
20 0 1 2 6 5 3
21 0 2 2 6 8 6
22 1 0 2 7 2 1
23 1 1 2 7 5 4
24 1 2 2 7 8 7
25 2 0 2 8 2 2
26 2 1 2 8 5 5
27 2 2 2 8 8 8
A B C
stride 1 3 3
kji
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 1 0 0 1 0 1
3 2 0 0 2 0 2
4 0 1 0 0 3 3
5 1 1 0 1 3 4
6 2 1 0 2 3 5
7 0 2 0 0 6 6
8 1 2 0 1 6 7
9 2 2 0 2 6 8
10 0 0 1 3 1 0
11 1 0 1 4 1 1
12 2 0 1 5 1 2
13 0 1 1 3 4 3
14 1 1 1 4 4 4
15 2 1 1 5 4 5
16 0 2 1 3 7 6
17 1 2 1 4 7 7
18 2 2 1 5 7 8
19 0 0 2 6 2 0
20 1 0 2 7 2 1
21 2 0 2 8 2 2
22 0 1 2 6 5 3
23 1 1 2 7 5 4
24 2 1 2 8 5 5
25 0 2 2 6 8 6
26 1 2 2 7 8 7
27 2 2 2 8 8 8
A B C
stride 1 3 1
  • we know that the smaller the stride, the better we can exploit spatial locality.
  • the influence of the different strides is reflected in their running times.

Task2: Which ordering(s) perform best for these 1000-by-1000 matrices? Why?

  • we can observe that the jki ordering runs the best, at 1.451 Gflop/s
  • as mentioned above, this ordering exploits the advantage of locality the most.

Task3: Which ordering(s) perform the worst? Why?

  • the ikj ordering performs the worst, at 0.155 Gflop/s
  • as mentioned above, this ordering exploits locality the least: the CPU has to reload data from memory in almost every iteration (here we always assume the cache isn't large enough to hold the whole matrices).

Task4: How does the way we stride through the matrices with respect to the innermost loop affect performance?

  • in the experiment, we can see that the orderings whose innermost-loop stride is 1 are consistently faster.

Exercise 3 - Cache Blocking and Matrix Transposition

source: transpose.c

C code implementation
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    // YOUR CODE HERE
    int blocknum = n / blocksize;
    for (int block_i = 0; block_i < blocknum; block_i++) {
        for (int block_j = 0; block_j < blocknum; block_j++) {
            int itr_i = block_i * blocksize;
            int itr_j = block_j * blocksize;
            for (int inner_i = 0; inner_i < blocksize; inner_i++) {
                for (int inner_j = 0; inner_j < blocksize; inner_j++) {
                    int x = itr_i + inner_i;
                    int y = itr_j + inner_j;
                    dst[y + x * n] = src[x + y * n];
                }
            }
        }
    }
    // transpose residual elements
    int nn = blocknum * blocksize;
    for (int x = 0; x < n; x++) {
        for (int y = nn; y < n; y++) {
            dst[y + x * n] = src[x + y * n];
        }
    }
    for (int x = nn; x < n; x++) {
        for (int y = 0; y < n; y++) {
            dst[y + x * n] = src[x + y * n];
        }
    }
}

Part 1: Changing Array Sizes

  • blocksize is fixed to 20, and n is varied
  • time unit is milliseconds(ms)
| n | 20 | 60 | 70 | 80 | 100 | 1000 | 2000 | 5000 | 10000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| naive | 0.001 | 0.007 | 0.01 | 0.014 | 0.021 | 2.731 | 13.802 | 106.16 | 428.3 |
| blocking | 0.001 | 0.007 | 0.009 | 0.011 | 0.017 | 1.896 | 15.806 | 92.287 | 398.101 |
  • Q1: At what point does cache blocked version of transpose become faster than the non-cache blocked version?

  • A1: from about n > 70 the advantage of the blocked transpose appears clearly, and it generally grows as n gets larger (with an exception at n = 2000 in our measurements, where the naive version was faster).

  • Q2: Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?

  • A2: if the matrix is too small, the advantage of spatial/temporal locality cannot show: the whole matrix already fits in the cache, so the naive version misses no more than the blocked one, and blocking only adds loop overhead.

Part 2: Changing Blocksize

  • n is fixed to 10000, and blocksize is varied
  • time unit is milliseconds(ms)
| blocksize | 10 | 20 | 50 | 100 | 500 | 1000 | 5000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| naive | 427.988 | 428.112 | 428.154 | 428.393 | 427.967 | 428.344 | 428.618 |
| blocking | 478.875 | 398.02 | 296.951 | 284.136 | 278.36 | 336.922 | 428.019 |
  • Q: How does performance change as blocksize increases? Why is this the case?
  • A: performance first improves as blocksize increases (better reuse of each loaded block, less per-block overhead), peaking around blocksize 500; beyond that, the advantage of temporal locality is lost and the behavior approaches the naive transpose, because the cache can no longer hold the blocks being worked on.