HW1

Q1-1 : Answer

| VECTOR_WIDTH | Utilization |
|--------------|-------------|
| 2            | 89.80%      |
| 4            | 86.60%      |
| 8            | 84.90%      |
| 9            | 84.70%      |
| 10           | 84.60%      |
| 11           | 84.50%      |
| 12           | 84.40%      |
| 13           | 84.30%      |
| 14           | 84.20%      |
| 15           | 84.20%      |
| 16           | 84.20%      |

The vector utilization decreases as VECTOR_WIDTH increases. I think this is due to how the loop uses the _pp_cntbits() function to find the number of lanes whose exponent is still non-zero after the current pass of multiplication: a vector keeps being processed until that count reaches zero. When VECTOR_WIDTH increases, the number of passes a vector needs increases (it is set by the largest exponent among its lanes), while most of the values in the vector have already calculated their results. Those finished lanes are masked off for the remaining passes, which leads to under-utilization of the vector.

Q2-1 : Answer

Code:

```
// Note: I, fasttime_t, gettime() and tdiff() come from the assignment's support code.
void test1(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);

  // Vectorization width: 8 floats, interleave count: 4.
  // One vector of 8 floats occupies 8 x 4 bytes = 32 bytes, and
  // __builtin_assume_aligned takes the alignment in bytes, so promising
  // 32-byte alignment allows the compiler to use aligned moves.
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  fasttime_t time1 = gettime();
  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
  fasttime_t time2 = gettime();

  double elapsedf = tdiff(time1, time2);
  std::cout << "Elapsed execution time of the loop in test1():\n"
            << elapsedf << "sec (N: " << N << ", I: " << I << ")\n";
}
```

Interleaving means that unrolled iterations are interleaved within a loop. With a vectorization width of 8 and an interleave count of 4, each iteration of the vectorized loop handles 32 elements of the arrays "a" and "b" as four 8-float vector additions. Each 8-float vector is 32 bytes wide, so the aligned vector moves are only valid if the arrays are aligned to 32 bytes or a multiple of 32 bytes.
Q2-2 : Answer

| Version                     | Median runtime |
|-----------------------------|----------------|
| Unvectorized code           | 8.16466s       |
| Vectorized code             | 2.60798s       |
| Aligned and vectorized code | 1.35308s       |

Unvectorized to Vectorized Speedup : 8.16466 / 2.60798 = 3.13x = ~3x

Unaligned Vectorized to Aligned Vectorized Speedup : 2.60798 / 1.35308 = 1.93x = ~2x

Bit-width of default vector registers : 128

Bit-width of AVX2 vector registers : 256

Q2-3 : Answer

The assignment and the maximum operation are not associative with floating-point numbers. Clang cannot reorder the instructions as required for vectorization, so by default it outputs unvectorized assembly. The patch reorders the instructions explicitly, which allows Clang to vectorize the loop.