---
title: Parallel Programming HW1
---

Parallel Programming HW1
===
110550047 巫廷翰

# Part 1

## VectorOP.cpp Implementation

> The vector utilization is higher than 60% and the implementation passes the verification.

![image](https://hackmd.io/_uploads/Sy43FibAR.png)

The code uses masks so that vector lanes are only updated under certain conditions, such as `y == 0` or `exp == 1`. See the comments in the code for the logic of each line.

```cpp=
maskAll = _pp_init_ones();
maskIsZero = _pp_init_ones(0);
_pp_vload_float(x, values + i, maskAll);
_pp_vload_int(exp, exponents + i, maskAll);

// check all y == 0
_pp_veq_int(maskIsZero, exp, zeroInt, maskAll);        // if (y == 0) {
_pp_vset_float(result, 1.0f, maskIsZero);              //   result = 1.f;
maskIsNotZero = _pp_mask_not(maskIsZero);              // } else {
_pp_vmove_float(result, x, maskIsNotZero);             //   result = x;
_pp_vsub_int(exp, exp, oneInt, maskIsNotZero);         //   count = y - 1;
_pp_veq_int(maskIsZero, exp, zeroInt, maskIsNotZero);  // update masks for lanes whose count hit 0
maskIsNotZero = _pp_mask_not(maskIsZero);

while (_pp_cntbits(maskIsNotZero) > 0) {
  _pp_vmult_float(result, result, x, maskIsNotZero);   //   result *= x;
  _pp_vsub_int(exp, exp, oneInt, maskIsNotZero);       //   count--;
  _pp_veq_int(maskIsZero, exp, zeroInt, maskIsNotZero);
  maskIsNotZero = _pp_mask_not(maskIsZero);
}

__pp_mask needClamp;
_pp_vgt_float(needClamp, result, maxClamp, maskAll);   // find lanes exceeding the clamp value
_pp_vmove_float(result, maxClamp, needClamp);          // clamp them to maxClamp
_pp_vstore_float(output + i, result, maskAll);
```

One part is not mentioned above. After observing the error that occurs in the `absVector` example, I concluded that the remaining elements in the last (partial) iteration must also be handled. Therefore, I initialize those values.
```cpp=
// pad the tail so the last full-width vector iteration is harmless
for (int i = N; i < N + VECTOR_WIDTH; i++) {
  values[i] = 0.0f;
  exponents[i] = 1;
}
```

## VECTOR_WIDTH Observation

![image](https://hackmd.io/_uploads/HJQky2-RR.png)
![image](https://hackmd.io/_uploads/rkP8AsZ00.png)
![image](https://hackmd.io/_uploads/SJ1zJn-CR.png)
![image](https://hackmd.io/_uploads/B1Why3WAC.png)

+ Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

> The vector utilization drops, because the total number of vector instructions does not decline proportionally. There are two major reasons.
> First, although the vector processing part is accelerated, some essentially serial operations and instructions remain. As a result, the product `vector width * total vector instructions` grows faster than the number of utilized vector lanes.
> Second, in the extreme case where most values in a vector finish early (their exponents reach zero), each instruction wastes many lanes, and the wider the vector, the more likely this situation becomes.
> Therefore, the vector utilization drops somewhat as the vector width increases. [name=巫廷翰]

## Bonus

To compute the sum of the array, I use both the hadd and interleave functions. `_pp_hadd_float` adds each adjacent pair in the vector, and `_pp_interleave_float` moves the other pairs' results next to each other so that the next `hadd` can combine them.

```cpp=
_pp_hadd_float(result, result);       // [(0 1) (2 3)] -> [0+1 0+1 2+3 2+3]
_pp_interleave_float(result, result); // [0+1 0+1 2+3 2+3] -> [0+1 2+3 0+1 2+3]
// next round: [(0+1 2+3) (0+1 2+3)] -> [0+1+2+3 0+1+2+3 0+1+2+3 0+1+2+3]
```

# Part 2

+ Q2-1

YMM registers are 256 bits wide and can hold 8 single-precision floats (32 bits each).
+ Before

```cpp=
a = (float *)__builtin_assume_aligned(a, 16);
b = (float *)__builtin_assume_aligned(b, 16);
c = (float *)__builtin_assume_aligned(c, 16);
```

```assembly=
vmovups (%rdi,%rcx,4), %ymm0
vmovups 32(%rdi,%rcx,4), %ymm1
vmovups 64(%rdi,%rcx,4), %ymm2
vmovups 96(%rdi,%rcx,4), %ymm3
vaddps (%rsi,%rcx,4), %ymm0, %ymm0
vaddps 32(%rsi,%rcx,4), %ymm1, %ymm1
vaddps 64(%rsi,%rcx,4), %ymm2, %ymm2
vaddps 96(%rsi,%rcx,4), %ymm3, %ymm3
vmovups %ymm0, (%rdx,%rcx,4)
vmovups %ymm1, 32(%rdx,%rcx,4)
vmovups %ymm2, 64(%rdx,%rcx,4)
vmovups %ymm3, 96(%rdx,%rcx,4)
```

+ After

```cpp=
a = (float *)__builtin_assume_aligned(a, 32); // assume 32-byte (256-bit) alignment
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```

```assembly=
.LBB0_2: # Loop
vmovaps (%rdi,%rcx,4), %ymm0
vmovaps 32(%rdi,%rcx,4), %ymm1
vmovaps 64(%rdi,%rcx,4), %ymm2
vmovaps 96(%rdi,%rcx,4), %ymm3
vaddps (%rsi,%rcx,4), %ymm0, %ymm0
vaddps 32(%rsi,%rcx,4), %ymm1, %ymm1
vaddps 64(%rsi,%rcx,4), %ymm2, %ymm2
vaddps 96(%rsi,%rcx,4), %ymm3, %ymm3
vmovaps %ymm0, (%rdx,%rcx,4)
vmovaps %ymm1, 32(%rdx,%rcx,4)
vmovaps %ymm2, 64(%rdx,%rcx,4)
vmovaps %ymm3, 96(%rdx,%rcx,4)
addq $32, %rcx
cmpq $1024, %rcx # imm = 0x400
jne .LBB0_2
```

+ Q2-2
    For each case, the fastest and slowest of the five runs (struck out) are dropped; the headline number is the average of the remaining three.
    + `test1`
        + case 1: 7.09757267 sec
            + ~~6.943857 sec~~
            + 7.096060 sec
            + 7.098153 sec
            + 7.098505 sec
            + ~~7.149019 sec~~
        + case 2: 1.749802 sec
            + ~~1.724345 sec~~
            + 1.733788 sec
            + 1.745202 sec
            + 1.770416 sec
            + ~~1.814283 sec~~
        + case 3: 0.86272367 sec
            + ~~0.860116 sec~~
            + 0.861022 sec
            + 0.861979 sec
            + 0.865170 sec
            + ~~0.909127 sec~~
    + `test2`
        + case 1: 7.02531233 sec
            + ~~6.939579 sec~~
            + 6.942839 sec
            + 7.058382 sec
            + 7.074716 sec
            + ~~7.097520 sec~~
        + case 2: 1.72374767 sec
            + ~~1.721233 sec~~
            + 1.722562 sec
            + 1.723457 sec
            + 1.725224 sec
            + ~~1.728957 sec~~
        + case 3: 0.86307953 sec
            + ~~0.862152 sec~~
            + 0.862156 sec
            + 0.862844 sec
            + 0.864239 sec
            + ~~0.871524 sec~~
    + `test3` with `FASTMATH=1` added
        + case 0 (without `FASTMATH=1`): 18.869001 sec
        + case 1: 18.4590407 sec
            + ~~18.376527 sec~~
            + 18.378732 sec
            + 18.401559 sec
            + 18.596831 sec
            + ~~18.948041 sec~~
        + case 2: 4.640163 sec
            + ~~4.635100 sec~~
            + 4.637820 sec
            + 4.638402 sec
            + 4.644267 sec
            + ~~4.647704 sec~~
        + case 3: 1.26747733 sec
            + ~~1.264076 sec~~
            + 1.264872 sec
            + 1.268347 sec
            + 1.269213 sec
            + ~~1.294285 sec~~
+ What speedup does the vectorized code achieve over the unvectorized code?
> Nearly 4x. [name=巫廷翰]
+ What additional speedup does using -mavx2 give?
> Nearly 8x over the unvectorized code, i.e. nearly 2x over the vectorized (non-AVX2) code.
> In the third test, adding `FASTMATH=1` accelerates things even further, to about 14.5x. [name=巫廷翰]
+ What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?
> Judging from the speedups, the unvectorized code handles 1 float (32 bits) at a time, so it presumably uses 32-bit-wide operations.
> The default vector registers are likely 128 bits wide, capable of handling 4 floats (32-bit) at a time.
> The AVX2 registers are 256 bits wide and can handle 8 floats (32-bit) at a time, providing an additional 2x speedup over the 128-bit registers. [name=巫廷翰]
+ Q2-3

I observe that the major difference between the two versions is the use of the `maxps` instruction.

> Performs an SIMD compare of the packed single-precision floating-point values in the destination operand (first operand) and the source operand (second operand), and returns the maximum value for each pair of values to the destination operand. The source operand can be an XMM register or a 128-bit memory location. -[reference link](https://c9x.me/x86/html/file_module_x86_id_167.html)

The description above shows that `maxps` performs a SIMD compare, and the version 2 code benefits from this. The compiler does not recognize the version 1 code as a clear maximum-computation pattern that can be vectorized, so it sticks to moving the values with `movaps` and does not use `maxps` to compare them in parallel.
+ Version 1

```cpp=
for (int i = 0; i < I; i++) {
  for (int j = 0; j < N; j++) {
    /* max() */
    c[j] = a[j];     // always assign `a[j]` first
    if (b[j] > a[j])
      c[j] = b[j];   // overwrite with `b[j]` if it is larger
  }
}
```

+ This version produces a redundant assignment in the assembly: `a[j]` is always stored to `c[j]` first, even though it may be overwritten later when `b[j] > a[j]`.
+ Since there is no single unconditional select between `a[j]` and `b[j]` that can be done in parallel (as in SIMD-style operations), the compiler does not generate the `maxps` instruction here. It is simply moving values, so `movaps` is sufficient.

+ Version 2

```cpp=
for (int i = 0; i < I; i++) {
  for (int j = 0; j < N; j++) {
    /* max() */
    if (b[j] > a[j])
      c[j] = b[j];
    else
      c[j] = a[j];
  }
}
```

+ Here each `c[j]` is written exactly once with the maximum of the pair, so `maxps` is generated: the compiler uses `movaps` to load the values and `maxps` to compute the max of multiple elements in a single vectorized instruction.