# Programming Assignment I: SIMD Programming

Parallel Programming

[TOC]

### Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization.

![](https://i.imgur.com/b8iqXEc.png)
![](https://i.imgur.com/59kp5td.png)
![](https://i.imgur.com/nEyONBD.png)
![](https://i.imgur.com/30Vq0iH.png)

### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

The vector utilization decreases as VECTOR_WIDTH increases. In the loop below, the vector keeps multiplying itself until all bits of the mask are zero, so every lane has to wait for the lane with the largest remaining count. In the extreme case where one bit is non-zero and the others are all zero in the mask, the utilization of that iteration is only 1/VECTOR_WIDTH, and such partially empty masks occur more often with a larger VECTOR_WIDTH than with a smaller one. As a consequence, a larger VECTOR_WIDTH brings a lower vector utilization.

```cpp=
while (_pp_cntbits(maskIsNotZero) > 0) {                   // loop while any lane's count > 0
  _pp_vgt_int(maskIsNotZero, count, zero, maskIsNotZero);  // set maskIsNotZero = count > 0
  _pp_vmult_float(result, result, x, maskIsNotZero);       // result *= x;
  _pp_vsub_int(count, count, one, maskIsNotZero);          // count--;
}
```

:::info
The situation you mention, "one bit is non-zero and others are all zero in mask", occurs whenever VECTOR_WIDTH is greater than one. It would be clearer to explain that as VECTOR_WIDTH increases, the proportion of zero bits in the mask during the computation grows, which lowers the vector utilization.
> [name=TA]
:::
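To see the effect numerically, here is a minimal sketch of my own (not the assignment's PPintrin simulator) that replays the masked loop on random exponents and measures the fraction of active lanes; `simulateUtilization` and the exponent distribution are assumptions made for illustration.

```cpp=
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical helper: each chunk of `width` lanes keeps iterating until
// the largest exponent in it is done; count how many lanes are still
// active versus how many lane-slots were issued in total.
double simulateUtilization(const std::vector<int> &exponents, int width) {
  long long active = 0, total = 0;
  for (size_t base = 0; base < exponents.size(); base += width) {
    int iters = *std::max_element(exponents.begin() + base,
                                  exponents.begin() + base + width);
    for (int it = 0; it < iters; ++it)
      for (int lane = 0; lane < width; ++lane)
        active += (exponents[base + lane] > it);  // lane still working?
    total += (long long)iters * width;            // lane-slots issued
  }
  return total ? (double)active / total : 1.0;
}

int main() {
  std::vector<int> exponents(10000);
  for (int &e : exponents)
    e = rand() % 10;  // assumed exponent range, as in ./myexp
  for (int w : {2, 4, 8, 16})
    printf("width %2d: utilization %.3f\n", w, simulateUtilization(exponents, w));
}
```

Running this shows the measured utilization dropping monotonically as the width grows, matching the screenshots above.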
## Q2-1: Fix the code to make sure it uses aligned moves for the best performance.

AVX2 (also known as Haswell New Instructions) expands most integer instructions to 256 bits. Declaring that the data is aligned to 32 bytes therefore lets the compiler use the aligned move `vmovaps` instead of the unaligned `vmovups`.

```cpp=
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```
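Note that `__builtin_assume_aligned` is only a promise to the compiler; if the buffers are not actually 32-byte aligned, `vmovaps` can fault. A minimal sketch of one way to make the promise hold, assuming the arrays are heap-allocated (the assignment's test harness may allocate them differently):

```cpp=
#include <cstdlib>

int main() {
  const int N = 1024;  // assumed size; N * sizeof(float) is a multiple of 32
  // C++17 std::aligned_alloc: the requested size must be a multiple
  // of the alignment, and the returned pointer is 32-byte aligned.
  float *a = static_cast<float *>(std::aligned_alloc(32, N * sizeof(float)));
  float *b = static_cast<float *>(std::aligned_alloc(32, N * sizeof(float)));
  float *c = static_cast<float *>(std::aligned_alloc(32, N * sizeof(float)));
  // ... run the kernel on a, b, c ...
  std::free(a);
  std::free(b);
  std::free(c);
  return 0;
}
```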
## Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers?

* None
  > 11.6779 sec (N: 1024, I: 20000000)
* VECTORIZE=1
  > 3.26866 sec (N: 1024, I: 20000000)
  > Speedup: 3.57x
* VECTORIZE=1 AVX2=1
  > 1.47097 sec (N: 1024, I: 20000000)
  > Speedup: 7.94x

A `float` is 32 bits wide.

* The speedup with VECTORIZE=1 is close to 4x, so I infer that the bit width of the default vector registers on the PP machines is 4 times the size of a float, which is 128 bits.
* The speedup with VECTORIZE=1 AVX2=1 is close to 8x, so I infer that the bit width of the AVX2 vector registers is 8 times the size of a float, which is 256 bits.
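The inference can be sanity-checked directly: a 128-bit SSE register (`__m128`) holds 4 floats and a 256-bit AVX2 register (`__m256`) holds 8. A small sketch (compile with e.g. `g++ -mavx2`):

```cpp=
#include <cstdio>
#include <immintrin.h>

int main() {
  // 128-bit xmm register: 4 float lanes, matching the ~4x speedup.
  printf("xmm lanes: %zu\n", sizeof(__m128) / sizeof(float));  // prints 4
  // 256-bit ymm register: 8 float lanes, matching the ~8x speedup.
  printf("ymm lanes: %zu\n", sizeof(__m256) / sizeof(float));  // prints 8
  return 0;
}
```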
## Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

```cpp=
for (int j = 0; j < N; j++) {
  /* max() */
  c[j] = a[j];
  if (b[j] > a[j])
    c[j] = b[j];
}
```

In the original version, whether the second store to `c[j]` executes depends on a data-dependent branch: some iterations take the jump and some do not, so the compiler has to handle the control flow of each iteration individually. That may be why it treats the loop in a scalar way.

```diff
 /* max() */
-c[j] = a[j];
-if (b[j] > a[j])
-  c[j] = b[j];
+if (b[j] > a[j]) c[j] = b[j];
+else c[j] = a[j];
```

In the patched version, however, the compiler can take the same action in every iteration of the loop: exactly one of the two values is selected and stored, and a mask can be used to assign the two different kinds of values without branching. That lets the compiler handle the loop in a vectorized way.
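To make the "selection with a mask" idea concrete, here is a hand-written sketch of the kind of branch-free per-8-float step the compiler can emit for the patched loop (not necessarily the exact assembly it generates; a plain `vmaxps` would also work). It assumes N is a multiple of 8 and requires `-mavx2` (or `-mavx`):

```cpp=
#include <immintrin.h>

// Branch-free vectorized max: every iteration does the same work,
// selecting lanes with a comparison mask instead of jumping.
void max_avx(const float *a, const float *b, float *c, int N) {
  for (int j = 0; j < N; j += 8) {
    __m256 va = _mm256_loadu_ps(a + j);
    __m256 vb = _mm256_loadu_ps(b + j);
    __m256 gt = _mm256_cmp_ps(vb, va, _CMP_GT_OQ);          // mask: b[j] > a[j]
    _mm256_storeu_ps(c + j, _mm256_blendv_ps(va, vb, gt));  // take b where mask set, else a
  }
}
```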