# Matrix Profile
## Target Architecture
Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
- L1d cache: 3.4 MiB
- L1i cache: 2.3 MiB
- L2 cache: 90 MiB
- L3 cache: 108 MiB
- Number of AVX-512 FMA units: 2
All measurements below were taken with the following input configuration.
- Time series length: 262159
- Subsequence length (sublen): 10
### Baseline
The baseline measurements were taken using the icpx compiler with the following arguments:
icpx
```
-O3 -mtune=icelake-server
```
Time taken:
```
45.41 seconds
```
Perf:
```
4.24 Giga Floating Point Operations Per Second
0.25 Cycles Per Instruction (per Logical Processor)
3.94 Instructions Per Cycle (per Logical Processor)
1.24 Floating Point Operations Per Cycle
```
### SIMD - User-Defined Reduction in OpenMP
The ```#pragma omp simd``` vectorization is fairly straightforward; because of the row dependencies, a reduction is needed to avoid race conditions. Since we want to preserve the index alongside the maximum, we implemented a custom reduction on a struct data type containing mp and mpi.
The comparison for the row maximum is done on that local variable, which is then reduced. The relevant code snippet is shown below.
```cpp
max_mp localMp{ -1, -1 };
#pragma omp declare reduction(argMax : struct max_mp : omp_out = omp_in.mp > omp_out.mp ? omp_in : omp_out)
// main computation loops
// TODO loop tiling and parallelization
for (int i = 0; i < mlen - sublen; i++) {
    // TODO loop tiling and simd parallelization
    localMp.mp = mp[i];
    localMp.mpi = mpi[i];
    #pragma omp simd reduction(argMax:localMp)
    for (int j = i + sublen; j < mlen; j++) {
        // streaming dot product
        if (i != 0)
            QT[j - i - sublen] += df[i] * dg[j] + df[j] * dg[i];
        double cr = QT[j - i - sublen] * norm[i] * norm[j];
        // updating the nearest neighbors information
        if (cr > localMp.mp) {
            localMp.mp = cr;
            localMp.mpi = j;
        }
        if (cr > mp[j]) {
            mp[j] = cr;
            mpi[j] = i;
        }
    }
    mp[i] = localMp.mp;
    mpi[i] = localMp.mpi;
}
```
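For reference, the reduction operates on a small struct along the following lines. This is a sketch of the assumed layout; in practice an ```initializer``` clause is also useful so that the private copies created by the reduction start from a sentinel value.
```cpp
// Assumed layout of the argmax payload: best correlation plus its index.
struct max_mp {
    double mp;   // highest correlation seen so far
    int    mpi;  // index of the corresponding subsequence
};

// Combiner keeps whichever copy holds the larger correlation; the initializer
// seeds every private copy with a sentinel so it can never win by accident.
#pragma omp declare reduction(argMax : max_mp :                        \
        omp_out = (omp_in.mp > omp_out.mp ? omp_in : omp_out))         \
    initializer(omp_priv = max_mp{ -1.0, -1 })
```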
#### Performance evaluation
The OMP SIMD measurements were taken using the GCC compiler (icpx/icpc did not work well with custom reductions) with the following arguments:
g++
```
-fopenmp -O3
```
Time taken:
```
44.76 seconds
```
Perf:
```
4.49 Giga Floating Point Operations Per Second
0.32 Cycles Per Instruction (per Logical Processor)
3.17 Instructions Per Cycle (per Logical Processor)
1.35 Floating Point Operations Per Cycle
```
### AVX-512 Intrinsics
#### Steps
The kernel's inner loop contains the following computations that can be vectorized:
- Row initialization
The first row is calculated using the SIMD version of the serial code.
```cpp
// accumulate one term of the initial dot product (QT_v starts from zero)
QT_v = _mm512_add_pd(QT_v,
                     _mm512_mul_pd(_mm512_sub_pd(ts_j, mu_start),
                                   _mm512_sub_pd(ts_i_j, mu_i)));
```
- Correlation Calculation
Correlation is calculated using the SIMD version of the serial code.
```cpp
// ...
QT_v = _mm512_add_pd(QT_v, _mm512_add_pd(_mm512_mul_pd(df_i_v, dg_j_v),
                                         _mm512_mul_pd(df_j_v, dg_i_v)));
__m512d cr_v = _mm512_mul_pd(QT_v, _mm512_mul_pd(norm_i_v, norm_j_v));
```
- Reduction across J
We need to reduce max(cr_v) along j and perform a manual horizontal reduction on max_cr_v at the end of the j SIMD loop (a sketch of that final step is shown after this list).
```cpp
result    = _mm512_cmp_pd_mask(cr_v, max_cr_v, _CMP_GT_OS);           // lanes where cr_v is a new maximum
max_cr_v  = _mm512_mask_blend_pd(result, max_cr_v, cr_v);             // take cr_v in those lanes
max_index = _mm512_mask_blend_epi64(result, max_index, indices_j_v);  // and remember the j index there
```
- Reduction within J
We also need to compare cr_v against mp[j] inside the j loop and update mp[j]/mpi[j] accordingly. The resulting masked stores are straightforward.
```cpp
result = _mm512_cmp_pd_mask(cr_v, mp_j_v, _CMP_GT_OS);   // lanes where cr_v beats the stored mp[j]
_mm512_mask_store_pd(&mp[j], result, cr_v);              // overwrite mp[j] in those lanes
_mm512_mask_store_epi64(&mpi[j], result, index_i_v);     // and record i as the new neighbor index
```
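The horizontal step referenced under "Reduction across J" can be sketched as follows. This is a minimal illustration assuming 8 double lanes and 64-bit neighbor indices, not our exact code.
```cpp
// Spill the per-lane maxima and their indices, then scan the 8 lanes.
alignas(64) double    lane_cr[8];
alignas(64) long long lane_idx[8];
_mm512_store_pd(lane_cr, max_cr_v);
_mm512_store_epi64(lane_idx, max_index);

double    best_cr  = mp[i];   // current row maximum
long long best_idx = mpi[i];
for (int lane = 0; lane < 8; lane++) {
    if (lane_cr[lane] > best_cr) {
        best_cr  = lane_cr[lane];
        best_idx = lane_idx[lane];
    }
}
mp[i]  = best_cr;
mpi[i] = best_idx;
```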
#### Compiler
We decided to use icpx for the intrinsics version with the following parameters:
```
-O3 -xCORE-AVX512 -mtune=icelake-server
```
#### Best Runtime
```
49.71 seconds
```
The intrinsics version is slower than the 256-bit auto-vectorized baseline. We speculate that this is due to the heavy 512-bit FMA instructions being issued from a single thread.
The results of the ```baseline``` and ```omp simd``` versions are very close to the intrinsics version.
This could be because the Icelake CPU has only 2 AVX-512 FMA units, which limits how many AVX-512 FMA instructions can execute in parallel.
#### Performance Measurements
- Perf report breakdown (the vfma* and vmulpd instructions account for a high percentage of the runtime)
```
13,69 │ vfmadd213pd (%rdx,%rcx,8),%zmm6,%zmm7
19,08 │ vfmadd231pd (%r14,%rcx,8),%zmm5,%zmm7
0,00 │ vpbroadcastq %rsi,%zmm8
19,46 │ vmulpd (%r10,%rcx,8),%zmm3,%zmm9
0,01 │ vmovupd %zmm7,(%rdx,%rcx,8)
4,69 │ vmulpd %zmm7,%zmm9,%zmm7
11,64 │ vcmpltpd %zmm7,%zmm1,%k1
1,18 │ vpaddq %zmm0,%zmm8,%zmm2{%k1}
8,12 │ vcmpgtpd (%r12,%rcx,8),%zmm7,%k1
```
The following stats were collected from Intel core/uncore events using pmu-tools (https://github.com/andikleen/pmu-tools), which gives us access to more counters than those available in perf and provides a top-down breakdown of the performance counters.
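A typical invocation for such a top-down breakdown looks like the following; the binary name ```./matrix_profile``` and the level flag are illustrative placeholders.
```
python pmu-tools/toplev.py -l3 --no-desc ./matrix_profile
```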
The relevant metrics we obtained are listed below.
- GFLOPs
```
5.14 Giga Floating Point Operations Per Second
```
GFLOPs is a derived metric: the total number of floating-point operations divided by the time taken.
We achieve a higher GFLOPs rate than the baseline because each 512-bit instruction performs more operations.
- Memory Counters
```
Backend_Bound.Memory_Bound.L3_Bound % Stalls 61.4
Info.Bottleneck Memory_Latency Scaled_Slots 39.36
```
61.4 % of the backend-bound stalls are attributed to the L3 cache, which means the kernel is memory intensive.
The total pipeline cost of memory-latency-related bottlenecks (external memory and off-core caches) is not very high. This is a derived metric computed from the DRAM_Bound, L1_Bound, PMM_Bound, Store_Bound and Hit_Latency core/uncore events (described in https://github.com/andikleen/pmu-tools/blob/master/clx_server_ratios.py).
- Miscellaneous
```
1.88 Cycles Per Instruction (per Logical Processor)
1.56 Floating Point Operations Per Cycle
```
We spend more cycles per instruction than the baseline and OMP SIMD versions because of the 512-bit SIMD instructions, but we perform more floating-point operations per cycle.
#### Memory Bandwidth
We calculate the memory bandwidth using the Intel uncore event ```iMC.MEM_BW_TOTAL```, which is derived from (CAS_COUNT.RD * 64) + (CAS_COUNT.WR * 64); CAS stands for _Column Address Strobe_. The metric is per socket and therefore contains readings for all memory channels of that socket.
pmu-tools allows us to monitor the counter over a period of time:
```
python pmu-tools/ucevent/ucevent.py iMC.MEM_BW_TOTAL --scale GB
```
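As a sanity check, the derivation described above can be written out explicitly. The helper below is hypothetical (not part of ucevent); the raw CAS counts come from the tool.
```cpp
// Each CAS_COUNT.RD / CAS_COUNT.WR event corresponds to one 64-byte
// cache-line transfer between the memory controller and DRAM.
double mem_bw_gb_per_s(double cas_rd, double cas_wr, double interval_s) {
    double bytes = (cas_rd + cas_wr) * 64.0;
    return bytes / 1e9 / interval_s;   // e.g. (3.0e6 + 1.5e6) * 64 B over 1 s ~ 0.29 GB/s
}
```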
In the intrinsics version, since we use only one thread and process blocks of data, the memory bandwidth peaks at **0.19 GB/s** at the beginning of the run and decreases steadily as more data becomes available in the caches and less has to be fetched from main memory.
### Multi-threading - OpenMP
Before explaining our parallelization, we should describe the dependencies of the matrix profile algorithm. The key point is that when the rolling dot product is used, each row depends on the computational results of the previous row: the QT array is updated according to the sliding-window principle.
For example, in the first row QT[0] holds the QT part of the correlation for i=0 and j=3 (sublen=3), and in the second row that value is reused to compute QT for i=1 and j=4 (again in QT[0]). This means the diagonal elements, going from the top left to the bottom right, depend on their predecessors.
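Written out in explicit two-index form (a restatement of the update in the code above, not new logic), the recurrence along a diagonal is:
```
QT(i, j) = QT(i-1, j-1) + df[i]*dg[j] + df[j]*dg[i]
cr(i, j) = QT(i, j) * norm[i] * norm[j]
```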
Another important point is the synchronization of the max comparison and of the writes into the mp and mpi arrays. This is discussed in more detail in the tiling section.
#### Tiling
To properly distribute the triangular workload onto the 72 hardware threads of ice2, we need to tile it. A simple automatic distribution of the index ranges would work for a rectangular workload, but not for a triangle. As the assignment document suggests, the triangle can be tiled into smaller triangles, some with the same orientation and some inverted. This is shown in the figure below.
<figure>
<img src="tiling_triangles.png" alt="figure" style="width:400px">
<figcaption>Figure: The triangular tiling</figcaption>
</figure>
It is also possible to join the inverted triangles with their neighbors to create parallelograms. This makes the programming logic simpler, makes better use of the whole range of the QT array, removes overhead and might increase cache locality. The figure below shows this tiling.
<figure>
<img src="tiling_para.png" alt="figure" style="width:400px">
<figcaption>Figure: The parallelogram tiling</figcaption>
</figure>
Now to the number of tiles and their distribution to threads. Ideally each thread gets the same amount of work, and a near-perfect distribution can be achieved with this kind of tiling. The right edge still consists of triangles, which take almost exactly half the work of a parallelogram (one diagonal less), so one thread should take either two triangles or one parallelogram.
The formula for the number of work units is related to the Gaussian sum:
```1 + 2 + ... + n + (n+1)/2``` (n an odd integer)
As an example, for 8 threads an even distribution is possible: the first row has 3 parallelograms, the second 2, the third 1. That is 6 parallelograms; at the edge there are 4 triangles, equal to the workload of 2 parallelograms, giving 8 work units overall for n=3. Here n corresponds to the number of parallelograms in the first row (or tiles - 1).
n should be odd, because otherwise fractional work units appear: for n=2 there are 3 parallelograms and 3 triangles, equaling 4.5 work units, which cannot be distributed evenly.
For our case of 72 threads we get a perfect distribution for n=11: ```1 + ... + 11 + (11+1)/2 = 72```. This gives each thread a close to equal workload, with 66 parallelograms taken by 66 threads and 12 triangles taken by 6 threads.
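To illustrate the counting, the small standalone helper below (hypothetical, not part of our implementation) enumerates the number of work units for odd n.
```cpp
#include <cstdio>

// Work units for a tiling with n parallelograms in the first row:
// n*(n+1)/2 parallelogram tiles plus (n+1) edge triangles worth half a unit each.
long workUnits(long n) {
    return n * (n + 1) / 2 + (n + 1) / 2;
}

int main() {
    // n = 3 gives 8 units, n = 11 gives 72 units (our 72-thread configuration).
    for (long n = 1; n <= 13; n += 2) {
        std::printf("n = %2ld -> %ld work units\n", n, workUnits(n));
    }
    return 0;
}
```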
We implemented the tiling with the triangle-tile case as an if branch in the same code; the fragment below shows the structure.
```cpp
for (int i = 0; i < mlen - sublen; i += tileSize) {
    for (int j = i + sublen; j < mlen; j += tileSize) {
        // initialize the QT buffer for this tile
        for (int k = 0; k < tileSize; k++) {
            QT_tile[k] = 0;
        }
        // first diagonal of the tile: full dot products
        for (int k = j; k < j + tileSize; k++) {
            for (int l = 0; l < sublen; l++) {
                //if (k - sublen >= 0)
                QT_tile[k - j] += (ts[i + l] - mu[i]) * (ts[k + l] - mu[k]);
            }
        }
        int jj_border = i + tileSize;
        for (int ii = i; ii < i + tileSize; ii++) {
            jj_border = j + ii - i + tileSize;
            if (jj_border > mlen)
                jj_border = mlen;
            for (int jj = j + ii - i; jj < jj_border; jj++) {
                // streaming dot product
                if (ii != i)
                    QT_tile[jj - j - (ii - i)] += df[ii] * dg[jj] + df[jj] * dg[ii];
                double cr = QT_tile[jj - j - (ii - i)] * norm[ii] * norm[jj];
                // updating the nearest neighbors information
                if (cr > mp_local[ii]) {
                    mp_local[ii] = cr;
                    mpi_local[ii] = jj;
                }
                if (cr > mp_local[jj]) {
                    mp_local[jj] = cr;
                    mpi_local[jj] = ii;
                }
            }
        }
    }
}
```
If mlen - sublen is not divisible by n+1, the tiling leaves a remainder, which we also handle. The figure below shows an example with mlen - sublen = 14, which leaves a remainder of 2 with 3 tiles in the first row (n=2). We first handle the upper-right triangle (the green tile) with the usual logic (triangle bounds instead of parallelogram bounds), and the red columns are handled with the parallelogram logic but treated as mirrored: in this case the red columns are processed as two rows, with QT being the size of the buffer.
There are minor differences in mp when handling the remainder because the order of operations is different (rounding errors; we checked this closely), but mpi is always the same. Without a remainder we get exactly the same results as the sequential code. A remainder has a negligible impact on performance, which is why it is not parallelized, although it could be.
<figure>
<img src="remainder.png" alt="figure" style="width:400px">
<figcaption>Figure: Remainder handling</figcaption>
</figure>
#### Parallelization
To parallelize our implementation we need to watch out for the dependencies between the tiles; the figure below shows which ranges of mp/mpi have collisions.
<figure>
<img src="collisions.png" alt="figure" style="width:100%">
<figcaption>Figure: Collisions on the mp/mpi ranges between tiles</figcaption>
</figure>
MP1-3 show local mp buffers; the ranges touched by tiles 1-3 are colored in their respective colors. The i range is the same for all 3 tiles, since they are horizontal neighbors. Due to the symmetry of the triangle, the j comparisons touch different ranges of mp. As we can see, there would be collisions if they worked on the same mp/mpi. That is why each tile works on its own local mp/mpi; in our case of 72 threads, that is 72 local buffers. At first we used an array of structs and handled it in a custom reduction, but to make it more compatible with our intrinsics code we switched to a critical section at the end of the computation, serving as a handmade reduction. This did not decrease performance, since it is executed only once per thread at the very end.
```cpp
#pragma omp parallel num_threads(numThreads)
{
    double* QT_tile   = (double*)malloc(tileSize * sizeof(double));
    double* mp_local  = (double*)malloc(mlen_tiled * sizeof(double));
    int*    mpi_local = (int*)malloc(mlen_tiled * sizeof(int));
    // initialize the whole per-thread profile buffer with sentinels
    for (int k = 0; k < mlen_tiled; k++) {
        mp_local[k] = -1;
        mpi_local[k] = -1;
    }
    #pragma omp for collapse(2) schedule(dynamic)
    for (int i = 0; i < mlen - sublen; i += tileSize) {
        for (int j = i + sublen; j < mlen; j += tileSize) {
```
The code fragment above shows the main part of the parallelization: we have local mp, mpi and QT buffers for each thread, and the tiles are distributed with one tile per thread, or two tiles for some threads (handled by schedule(dynamic)). We also tried two parallel sections, where the 66 parallelograms get 66 threads and the 12 triangles get 6 threads (possible with omp_set_nested(1)), but that code was slower.
Each thread has its own local buffers for mp/mpi/QT, so no NUMA considerations are needed (first touch). We also tried thread-local df, dg and norm buffers, but that slowed down execution.
Pinning is done with ```export OMP_PROC_BIND=close``` and ```export OMP_PLACES=cores```, which gives much better performance for smaller lengths compared to no binding.
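The handmade reduction mentioned above, which merges each thread's local buffers into the global mp/mpi inside a critical section, looks roughly like the sketch below (simplified bounds, executed once per thread after its tiles are finished).
```cpp
#pragma omp critical
{
    for (int k = 0; k < mlen_tiled; k++) {
        // keep the better correlation and its index
        if (mp_local[k] > mp[k]) {
            mp[k]  = mp_local[k];
            mpi[k] = mpi_local[k];
        }
    }
}
// per-thread cleanup
free(QT_tile);
free(mp_local);
free(mpi_local);
```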
### Parallel Performance
The parallel performance was not affected by the inner-loop vectorization: auto-vectorization, OMP SIMD and intrinsics performed nearly the same, just as in the sequential case.
ICPX
```
-O3 -xCORE-AVX512 -mtune=icelake-server -qopenmp
```
```
Main kernel: 0.752386 seconds
Whole program: 0.96 seconds
```
#### Performance Measurements
- Perf report breakdown (the AVX-512 FMA and compare instructions consume most of the time)
```
11,63 │ vmovupd (%rbx,%rcx,8),%zmm7
10,27 │ vfmadd213pd (%r8,%rcx,8),%zmm5,%zmm7
0,01 │ mov 0x1f8(%rsp),%rbx
20,42 │ vfmadd231pd (%rbx,%rcx,8),%zmm4,%zmm7
0,01 │ lea (%r11,%rcx,1),%ebx
│ movslq %ebx,%rbx
7,52 │ vmulpd (%rdx,%rcx,8),%zmm2,%zmm8
0,01 │ vmovupd %zmm7,(%r12,%rbx,8)
7,12 │ vmulpd %zmm7,%zmm8,%zmm7
20,59 │ vcmpltpd %zmm7,%zmm0,%k1 (Compiler switches the '>' to a '<' comparison)
2,10 │ vmaxpd %zmm0,%zmm7,%zmm0
```
As before, the stats were collected from Intel core/uncore events using pmu-tools (https://github.com/andikleen/pmu-tools).
The Top-Down Characterization is a hierarchical organization of event-based metrics that identifies the dominant performance bottlenecks in an application.
We also used the manufacturer's guide on performance counters from http://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf
The interesting metrics we obtained are listed below.
- GFLOPs
```
81.54 Giga Floating Point Operations Per Second
```
We achieve 81.54 GFLOPs with the tiled, parallel, intrinsics version for the 262159-element time series.
- Backend Bound Instructions
```
BE/Mem Backend_Bound.Memory_Bound % Slots 32.1 [ 1.2%]
BE/Core Backend_Bound.Core_Bound % Slots 35.3 [ 1.2%]
```
The backend-bound slots are split roughly equally between Memory Bound and Core Bound, thanks to the tiling across threads.
- Miscellaneous
```
1.06 Cycles Per Instruction (per Logical Processor)
0.94 Instructions Per Cycle (per Logical Processor)
2.38 Floating Point Operations Per Cycle
```
We achieve 2.38 floating-point operations per cycle, which is higher than both our intrinsics and OMP SIMD versions.
#### Memory Bandwidth
The memory bandwidth is measured with the same ```iMC.MEM_BW_TOTAL``` uncore event and ucevent command as described for the intrinsics version above.
#### Different Thread Configurations.
Number of elements in time series: 262159
Sub length: 10
Number of Threads: 8
```
9.06 seconds
```
Number of Threads: 32
```
2.04 seconds
```
Number of Threads: 72
```
0.96 seconds
```
The speedup scales close to linearly with the number of threads: going from 8 to 32 threads (4x) reduces the runtime from 9.06 s to 2.04 s (~4.4x), and going from 8 to 72 threads (9x) reduces it from 9.06 s to 0.96 s (~9.4x).
### Results for different data sizes
The timings are for the whole program.
- 262159: 81.54 GFLOPs, 0.96 seconds
- 524299: 139.41 GFLOPs, 6.82 seconds
- 1048591: 123.50 GFLOPs, 32.309 seconds
### Memory Bandwidth Calculation:
TODO: