# Matrix Profile
## Target Architecture
Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
- L1d cache: 3.4 MiB
- L1i cache: 2.3 MiB
- L2 cache: 90 MiB
- L3 cache: 108 MiB
- Number of AVX-512 FMA units: 2
All measurements below were taken with the following input configuration.
- Time series length: 262159
- Subsequence length (sublen): 10
### Baseline
The baseline measurements were taken using the icpx compiler with the following arguments:
icpx
```
-O3 -mtune=icelake-server
```
Time taken:
```
45.41 seconds
```
Perf:
```
4.24 Giga Floating Point Operations Per Second
0.25 Cycles Per Instruction (per Logical Processor)
3.94 Instructions Per Cycle (per Logical Processor)
1.24 Floating Point Operations Per Cycle
```
### SIMD - User-Defined Reduction in OpenMP
The ```#pragma omp simd``` vectorization is fairly straightforward; because of the row dependencies, a reduction is needed to avoid race conditions. Since we want to preserve the index alongside the maximum, we implemented a custom reduction on a struct data type containing mp and mpi.
The comparison for the row maximum is done on that local variable, which is then reduced. The relevant code snippet is shown below.
```cpp
max_mp localMp{ -1, -1 };
#pragma omp declare reduction(argMax : struct max_mp : omp_out = omp_in.mp > omp_out.mp ? omp_in : omp_out)
// main computation loops
// TODO loop tiling and parallelization
for (int i = 0; i < mlen - sublen; i++) {
    // TODO loop tiling and simd parallelization
    localMp.mp = mp[i];
    localMp.mpi = mpi[i];
    #pragma omp simd reduction(argMax:localMp)
    for (int j = i + sublen; j < mlen; j++) {
        // streaming dot product
        if (i != 0)
            QT[j - i - sublen] += df[i] * dg[j] + df[j] * dg[i];
        double cr = QT[j - i - sublen] * norm[i] * norm[j];
        // updating the nearest neighbors information
        if (cr > localMp.mp) {
            localMp.mp = cr;
            localMp.mpi = j;
        }
        if (cr > mp[j]) {
            mp[j] = cr;
            mpi[j] = i;
        }
    }
    mp[i] = localMp.mp;
    mpi[i] = localMp.mpi;
}
```
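For reference, the reduction operates on a small struct along the following lines. This is a sketch of the assumed layout; in practice an ```initializer``` clause is also useful so that the private copies created by the reduction start from a sentinel value.
```cpp
// Assumed layout of the argmax payload: best correlation plus its index.
struct max_mp {
    double mp;   // highest correlation seen so far
    int    mpi;  // index of the corresponding subsequence
};

// Combiner keeps whichever copy holds the larger correlation; the initializer
// seeds every private copy with a sentinel so it can never win by accident.
#pragma omp declare reduction(argMax : max_mp :                        \
        omp_out = (omp_in.mp > omp_out.mp ? omp_in : omp_out))         \
    initializer(omp_priv = max_mp{ -1.0, -1 })
```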
#### Performance evaluation
The OMP SIMD measurements were taken using the GCC compiler (icpx/icpc did not work well with custom reductions) with the following arguments:
g++
```
-fopenmp -O3
```
Time taken:
```
44.76 seconds
```
Perf:
```
4.49 Giga Floating Point Operations Per Second
0.32 Cycles Per Instruction (per Logical Processor)
3.17 Instructions Per Cycle (per Logical Processor)
1.35 Floating Point Operations Per Cycle
```
### AVX-512 Intrinsics
#### Steps
The kernel's inner loop contains the following computations that can be vectorized:
- Row initialization
The first row is calculated using the SIMD version of the serial code.
```cpp
// accumulate one term of the initial dot product (QT_v starts from zero)
QT_v = _mm512_add_pd(QT_v,
                     _mm512_mul_pd(_mm512_sub_pd(ts_j, mu_start),
                                   _mm512_sub_pd(ts_i_j, mu_i)));
```
- Correlation Calculation
Correlation is calculated using the SIMD version of the serial code.
```cpp
// ...
QT_v = _mm512_add_pd(QT_v, _mm512_add_pd(_mm512_mul_pd(df_i_v, dg_j_v),
                                         _mm512_mul_pd(df_j_v, dg_i_v)));
__m512d cr_v = _mm512_mul_pd(QT_v, _mm512_mul_pd(norm_i_v, norm_j_v));
```
- Reduction across J
We need to reduce max(cr_v) along j and perform a manual horizontal reduction on max_cr_v at the end of the j SIMD loop (a sketch of that final step is shown after this list).
```cpp
result    = _mm512_cmp_pd_mask(cr_v, max_cr_v, _CMP_GT_OS);           // lanes where cr_v is a new maximum
max_cr_v  = _mm512_mask_blend_pd(result, max_cr_v, cr_v);             // take cr_v in those lanes
max_index = _mm512_mask_blend_epi64(result, max_index, indices_j_v);  // and remember the j index there
```
- Reduction within J
We also need to compare cr_v against mp[j] inside the j loop and update mp[j]/mpi[j] accordingly. The resulting masked stores are straightforward.
```cpp
result = _mm512_cmp_pd_mask(cr_v, mp_j_v, _CMP_GT_OS);   // lanes where cr_v beats the stored mp[j]
_mm512_mask_store_pd(&mp[j], result, cr_v);              // overwrite mp[j] in those lanes
_mm512_mask_store_epi64(&mpi[j], result, index_i_v);     // and record i as the new neighbor index
```
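The horizontal step referenced under "Reduction across J" can be sketched as follows. This is a minimal illustration assuming 8 double lanes and 64-bit neighbor indices, not our exact code.
```cpp
// Spill the per-lane maxima and their indices, then scan the 8 lanes.
alignas(64) double    lane_cr[8];
alignas(64) long long lane_idx[8];
_mm512_store_pd(lane_cr, max_cr_v);
_mm512_store_epi64(lane_idx, max_index);

double    best_cr  = mp[i];   // current row maximum
long long best_idx = mpi[i];
for (int lane = 0; lane < 8; lane++) {
    if (lane_cr[lane] > best_cr) {
        best_cr  = lane_cr[lane];
        best_idx = lane_idx[lane];
    }
}
mp[i]  = best_cr;
mpi[i] = best_idx;
```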
#### Compiler
We decided to use icpx for the intrinsics version with the following parameters:
```
-O3 -xCORE-AVX512 -mtune=icelake-server
```
#### Best Runtime
```
49.71 seconds
```
The intrinsics version is slower than the 256-bit auto-vectorized baseline. We speculate that this is due to the heavy 512-bit FMA instructions being issued from a single thread.
The results of the ```baseline``` and ```omp simd``` versions are very close to the intrinsics version.
This could be because the Icelake CPU has only 2 AVX-512 FMA units, which limits how many AVX-512 FMA instructions can execute in parallel.
#### Performance Measurements
- Perf report breakdown (the vfma* and vmulpd instructions account for a high percentage of the runtime)
```
13,69 │ vfmadd213pd (%rdx,%rcx,8),%zmm6,%zmm7
19,08 │ vfmadd231pd (%r14,%rcx,8),%zmm5,%zmm7
0,00 │ vpbroadcastq %rsi,%zmm8
19,46 │ vmulpd (%r10,%rcx,8),%zmm3,%zmm9
0,01 │ vmovupd %zmm7,(%rdx,%rcx,8)
4,69 │ vmulpd %zmm7,%zmm9,%zmm7
11,64 │ vcmpltpd %zmm7,%zmm1,%k1
1,18 │ vpaddq %zmm0,%zmm8,%zmm2{%k1}
8,12 │ vcmpgtpd (%r12,%rcx,8),%zmm7,%k1
```
The following stats were collected from Intel core/uncore events using pmu-tools (https://github.com/andikleen/pmu-tools), which gives us access to more counters than those available in perf and provides a top-down breakdown of the performance counters.
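A typical invocation for such a top-down breakdown looks like the following; the binary name ```./matrix_profile``` and the level flag are illustrative placeholders.
```
python pmu-tools/toplev.py -l3 --no-desc ./matrix_profile
```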
The relevant metrics we obtained are listed below.
- GFLOPs
```
5.14 Giga Floating Point Operations Per Second
```
GFLOPs is a derived metric: the total number of floating-point operations divided by the time taken.
We achieve a higher GFLOPs rate than the baseline because each 512-bit instruction performs more operations.
- Memory Counters
```
Backend_Bound.Memory_Bound.L3_Bound % Stalls 61.4
Info.Bottleneck Memory_Latency Scaled_Slots 39.36
```
61.4 % of the backend-bound stalls are attributed to the L3 cache, which means the kernel is memory intensive.
The total pipeline cost of memory-latency-related bottlenecks (external memory and off-core caches) is not very high. This is a derived metric computed from the DRAM_Bound, L1_Bound, PMM_Bound, Store_Bound and Hit_Latency core/uncore events (described in https://github.com/andikleen/pmu-tools/blob/master/clx_server_ratios.py).
- Miscellaneous
```
1.88 Cycles Per Instruction (per Logical Processor)
1.56 Floating Point Operations Per Cycle
```
We spend more cycles per instruction than the baseline and OMP SIMD versions because of the 512-bit SIMD instructions, but we perform more floating-point operations per cycle.
#### Memory Bandwidth
We calculate the memory bandwidth using the Intel uncore event ```iMC.MEM_BW_TOTAL```, which is derived from (CAS_COUNT.RD * 64) + (CAS_COUNT.WR * 64); CAS stands for _Column Address Strobe_. The metric is per socket and therefore contains readings for all memory channels of that socket.
pmu-tools allows us to monitor the counter over a period of time:
```
python pmu-tools/ucevent/ucevent.py iMC.MEM_BW_TOTAL --scale GB
```
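As a sanity check, the derivation described above can be written out explicitly. The helper below is hypothetical (not part of ucevent); the raw CAS counts come from the tool.
```cpp
// Each CAS_COUNT.RD / CAS_COUNT.WR event corresponds to one 64-byte
// cache-line transfer between the memory controller and DRAM.
double mem_bw_gb_per_s(double cas_rd, double cas_wr, double interval_s) {
    double bytes = (cas_rd + cas_wr) * 64.0;
    return bytes / 1e9 / interval_s;   // e.g. (3.0e6 + 1.5e6) * 64 B over 1 s ~ 0.29 GB/s
}
```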
In the intrinsics version, since we use only one thread and process blocks of data, the memory bandwidth peaks at **0.19 GB/s** at the beginning of the run and decreases steadily as more data becomes available in the caches and less has to be fetched from main memory.
### Multi-threading - OpenMP
Before explaining our parallelization, we should describe the dependencies of the matrix profile algorithm. The key point is that when the rolling dot product is used, each row depends on the computational results of the previous row: the QT array is updated according to the sliding-window principle.
For example, in the first row QT[0] holds the QT part of the correlation for i=0 and j=3 (sublen=3), and in the second row that value is reused to compute QT for i=1 and j=4 (again in QT[0]). This means the diagonal elements, going from the top left to the bottom right, depend on their predecessors.
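Written out in explicit two-index form (a restatement of the update in the code above, not new logic), the recurrence along a diagonal is:
```
QT(i, j) = QT(i-1, j-1) + df[i]*dg[j] + df[j]*dg[i]
cr(i, j) = QT(i, j) * norm[i] * norm[j]
```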
Another important point is the synchronization of the max comparison and of the writes into the mp and mpi arrays. This is discussed in more detail in the tiling section.
#### Tiling
To properly distribute the triangular workload onto the 72 hardware threads of ice2, we need to tile it. A simple automatic distribution of the index ranges would work for a rectangular workload, but not for a triangle. As the assignment document suggests, the triangle can be tiled into smaller triangles, some with the same orientation and some inverted. This is shown in the figure below.
<figure>
<img src="tiling_triangles.png" alt="figure" style="width:400px">
<figcaption>Figure: The triangular tiling</figcaption>
</figure>
It is also possible to join the inverted triangles with their neighbors to create parallelograms. This makes the programming logic simpler, makes better use of the whole range of the QT array, removes overhead and might increase cache locality. The figure below shows this tiling.
<figure>
<img src="tiling_para.png" alt="figure" style="width:400px">
<figcaption>Figure: The parallelogram tiling</figcaption>
</figure>
Now to the number of tiles and their distribution to threads. Ideally each thread gets the same amount of work, and a near-perfect distribution can be achieved with this kind of tiling. The right edge still consists of triangles, which take almost exactly half the work of a parallelogram (one diagonal less), so one thread should take either two triangles or one parallelogram.
The formula for the number of work units is related to the Gaussian sum:
```1 + 2 + ... + n + (n+1)/2``` (n an odd integer)
As an example, for 8 threads an even distribution is possible: the first row has 3 parallelograms, the second 2, the third 1. That is 6 parallelograms; at the edge there are 4 triangles, equal to the workload of 2 parallelograms, giving 8 work units overall for n=3. Here n corresponds to the number of parallelograms in the first row (or tiles - 1).
n should be odd, because otherwise fractional work units appear: for n=2 there are 3 parallelograms and 3 triangles, equaling 4.5 work units, which cannot be distributed evenly.
For our case of 72 threads we get a perfect distribution for n=11: ```1 + ... + 11 + (11+1)/2 = 72```. This gives each thread a close to equal workload, with 66 parallelograms taken by 66 threads and 12 triangles taken by 6 threads.
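To illustrate the counting, the small standalone helper below (hypothetical, not part of our implementation) enumerates the number of work units for odd n.
```cpp
#include <cstdio>

// Work units for a tiling with n parallelograms in the first row:
// n*(n+1)/2 parallelogram tiles plus (n+1) edge triangles worth half a unit each.
long workUnits(long n) {
    return n * (n + 1) / 2 + (n + 1) / 2;
}

int main() {
    // n = 3 gives 8 units, n = 11 gives 72 units (our 72-thread configuration).
    for (long n = 1; n <= 13; n += 2) {
        std::printf("n = %2ld -> %ld work units\n", n, workUnits(n));
    }
    return 0;
}
```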
We implemented the tiling with the triangle-tile case as an if branch in the same code; the fragment below shows the structure.
```cpp
for (int i = 0; i < mlen - sublen; i += tileSize) {
    for (int j = i + sublen; j < mlen; j += tileSize) {
        // initialize the QT buffer for this tile
        for (int k = 0; k < tileSize; k++) {
            QT_tile[k] = 0;
        }
        // first diagonal of the tile: full dot products
        for (int k = j; k < j + tileSize; k++) {
            for (int l = 0; l < sublen; l++) {
                //if (k - sublen >= 0)
                QT_tile[k - j] += (ts[i + l] - mu[i]) * (ts[k + l] - mu[k]);
            }
        }
        int jj_border = i + tileSize;
        for (int ii = i; ii < i + tileSize; ii++) {
            jj_border = j + ii - i + tileSize;
            if (jj_border > mlen)
                jj_border = mlen;
            for (int jj = j + ii - i; jj < jj_border; jj++) {
                // streaming dot product
                if (ii != i)
                    QT_tile[jj - j - (ii - i)] += df[ii] * dg[jj] + df[jj] * dg[ii];
                double cr = QT_tile[jj - j - (ii - i)] * norm[ii] * norm[jj];
                // updating the nearest neighbors information
                if (cr > mp_local[ii]) {
                    mp_local[ii] = cr;
                    mpi_local[ii] = jj;
                }
                if (cr > mp_local[jj]) {
                    mp_local[jj] = cr;
                    mpi_local[jj] = ii;
                }
            }
        }
    }
}
```
If mlen - sublen is not divisible by n+1, the tiling leaves a remainder, which we also handle. The figure below shows an example with mlen - sublen = 14, which leaves a remainder of 2 with 3 tiles in the first row (n=2). We first handle the upper-right triangle (the green tile) with the usual logic (triangle bounds instead of parallelogram bounds), and the red columns are handled with the parallelogram logic but treated as mirrored: in this case the red columns are processed as two rows, with QT being the size of the buffer.
There are minor differences in mp when handling the remainder because the order of operations is different (rounding errors; we checked this closely), but mpi is always the same. Without a remainder we get exactly the same results as the sequential code. A remainder has a negligible impact on performance, which is why it is not parallelized, although it could be.
<figure>
<img src="remainder.png" alt="figure" style="width:400px">
<figcaption>Figure: Remainder handling</figcaption>
</figure>
#### Parallelization
To parallelize our implementation we need to watch out for the dependencies between the tiles; the figure below shows which ranges of mp/mpi have collisions.
<figure>
<img src="collisions.png" alt="figure" style="width:100%">
<figcaption>Figure: Collisions on the mp/mpi ranges between tiles</figcaption>
</figure>
MP1-3 show local mp buffers; the ranges touched by tiles 1-3 are colored in their respective colors. The i range is the same for all 3 tiles, since they are horizontal neighbors. Due to the symmetry of the triangle, the j comparisons touch different ranges of mp. As we can see, there would be collisions if they worked on the same mp/mpi. That is why each tile works on its own local mp/mpi; in our case of 72 threads, that is 72 local buffers. At first we used an array of structs and handled it in a custom reduction, but to make it more compatible with our intrinsics code we switched to a critical section at the end of the computation, serving as a handmade reduction. This did not decrease performance, since it is executed only once per thread at the very end.
```cpp
#pragma omp parallel num_threads(numThreads)
{
    double* QT_tile   = (double*)malloc(tileSize * sizeof(double));
    double* mp_local  = (double*)malloc(mlen_tiled * sizeof(double));
    int*    mpi_local = (int*)malloc(mlen_tiled * sizeof(int));
    // initialize the whole per-thread profile buffer with sentinels
    for (int k = 0; k < mlen_tiled; k++) {
        mp_local[k] = -1;
        mpi_local[k] = -1;
    }
    #pragma omp for collapse(2) schedule(dynamic)
    for (int i = 0; i < mlen - sublen; i += tileSize) {
        for (int j = i + sublen; j < mlen; j += tileSize) {
```
The code fragment above shows the main part of the parallelization: we have local mp, mpi and QT buffers for each thread, and the tiles are distributed with one tile per thread, or two tiles for some threads (handled by schedule(dynamic)). We also tried two parallel sections, where the 66 parallelograms get 66 threads and the 12 triangles get 6 threads (possible with omp_set_nested(1)), but that code was slower.
Each thread has its own local buffers for mp/mpi/QT, so no NUMA considerations are needed (first touch). We also tried thread-local df, dg and norm buffers, but that slowed down execution.
Pinning is done with ```export OMP_PROC_BIND=close``` and ```export OMP_PLACES=cores```, which gives much better performance for smaller lengths compared to no binding.
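The handmade reduction mentioned above, which merges each thread's local buffers into the global mp/mpi inside a critical section, looks roughly like the sketch below (simplified bounds, executed once per thread after its tiles are finished).
```cpp
#pragma omp critical
{
    for (int k = 0; k < mlen_tiled; k++) {
        // keep the better correlation and its index
        if (mp_local[k] > mp[k]) {
            mp[k]  = mp_local[k];
            mpi[k] = mpi_local[k];
        }
    }
}
// per-thread cleanup
free(QT_tile);
free(mp_local);
free(mpi_local);
```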
### Parallel Performance
The parallel performance was not affected by the inner-loop vectorization: auto-vectorization, OMP SIMD and intrinsics performed nearly the same, just as in the sequential case.
ICPX
```
-O3 -xCORE-AVX512 -mtune=icelake-server -qopenmp
```
```
Main kernel: 0.752386 seconds
Whole program: 0.96 seconds
```
#### Performance Measurements
- Perf report breakdown (the AVX-512 FMA and compare instructions consume most of the time)
```
11,63 │ vmovupd (%rbx,%rcx,8),%zmm7
10,27 │ vfmadd213pd (%r8,%rcx,8),%zmm5,%zmm7
0,01 │ mov 0x1f8(%rsp),%rbx
20,42 │ vfmadd231pd (%rbx,%rcx,8),%zmm4,%zmm7
0,01 │ lea (%r11,%rcx,1),%ebx
│ movslq %ebx,%rbx
7,52 │ vmulpd (%rdx,%rcx,8),%zmm2,%zmm8
0,01 │ vmovupd %zmm7,(%r12,%rbx,8)
7,12 │ vmulpd %zmm7,%zmm8,%zmm7
20,59 │ vcmpltpd %zmm7,%zmm0,%k1 (Compiler switches the '>' to a '<' comparison)
2,10 │ vmaxpd %zmm0,%zmm7,%zmm0
```
As before, the stats were collected from Intel core/uncore events using pmu-tools (https://github.com/andikleen/pmu-tools).
The Top-Down Characterization is a hierarchical organization of event-based metrics that identifies the dominant performance bottlenecks in an application.
We also used the manufacturer's guide on performance counters from http://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf
The interesting metrics we obtained are listed below.
- GFLOPs
```
81.54 Giga Floating Point Operations Per Second
```
We achieve 81.54 GFLOPs with the tiled, parallel, intrinsics version for the 262159-element time series.
- Backend Bound Instructions
```
BE/Mem Backend_Bound.Memory_Bound % Slots 32.1 [ 1.2%]
BE/Core Backend_Bound.Core_Bound % Slots 35.3 [ 1.2%]
```
The backend-bound slots are split roughly equally between Memory Bound and Core Bound, thanks to the tiling across threads.
- Miscellaneous
```
1.06 Cycles Per Instruction (per Logical Processor)
0.94 Instructions Per Cycle (per Logical Processor)
2.38 Floating Point Operations Per Cycle
```
We achieve 2.38 floating-point operations per cycle, which is higher than both our intrinsics and OMP SIMD versions.
#### Memory Bandwidth
The memory bandwidth is measured with the same ```iMC.MEM_BW_TOTAL``` uncore event and ucevent command as described for the intrinsics version above.
#### Different Thread Configurations.
Number of elements in time series: 262159
Sub length: 10
Number of Threads: 8
```
9.06 seconds
```
Number of Threads: 32
```
2.04 seconds
```
Number of Threads: 72
```
0.96 seconds
```
The speedup scales close to linearly with the number of threads: going from 8 to 32 threads (4x) reduces the runtime from 9.06 s to 2.04 s (~4.4x), and going from 8 to 72 threads (9x) reduces it from 9.06 s to 0.96 s (~9.4x).
### Results for different data sizes
The timings are for the whole program.
- 262159: 81.54 GFLOPs, 0.96 seconds
- 524299: 139.41 GFLOPs, 6.82 seconds
- 1048591: 123.50 GFLOPs, 32.309 seconds
### Memory Bandwidth Calculation:
TODO: