# Parallel Programming HW1
###### tags: `Multithread`
## Q1: Fix the code to make sure it uses aligned moves for the best performance.
If we use the builtin function provided in the HW1 note after adding the `__restrict` modifier, the vectorized assembly is still not aligned:
```c++=50
a = (float *)__builtin_assume_aligned(a, 16);
b = (float *)__builtin_assume_aligned(b, 16);
c = (float *)__builtin_assume_aligned(c, 16);
```
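For context, here is a minimal sketch of the loop under test; the function name, signature, and loop body are my assumption of the usual test1.cpp shape from the assignment:
```c++
// Sketch (assumed shape of test1.cpp): element-wise add of two float arrays.
void test1(float* __restrict a, float* __restrict b, float* __restrict c, int N)
{
    a = (float *)__builtin_assume_aligned(a, 16);
    b = (float *)__builtin_assume_aligned(b, 16);
    c = (float *)__builtin_assume_aligned(c, 16);
    for (int j = 0; j < N; j++)
        c[j] = a[j] + b[j];
}
```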
The resulting assembly is as follows:

We still see **vmovups**, the unaligned packed move (its aligned counterpart is **vmovaps**), so these are not aligned moves.
According to main.cpp, the dynamic arrays are allocated with an alignment value of 32.

After changing the second argument of the builtin function to match the alignment value used when allocating, the result is aligned:
```c++=50
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```

This suggests that when we allocate the dynamic arrays, we should use the same alignment value as in the builtin function. To verify this inference, we change both the second argument of the builtin function and the allocation alignment value to 16:
in main.cpp
```c++=50
float* values1 = new(std::align_val_t{ 16 }) float[N];
float* values2 = new(std::align_val_t{ 16 }) float[N];
double* values3 = new(std::align_val_t{ 16 }) double[N];
float* output = new(std::align_val_t{ 16 }) float[N];
```
in test1.cpp
```c++=50
a = (float *)__builtin_assume_aligned(a, 16);
b = (float *)__builtin_assume_aligned(b, 16);
c = (float *)__builtin_assume_aligned(c, 16);
```
Unfortunately, the result is unaligned:

Moreover, the result turns out to be aligned if the allocation alignment is 16 while the second argument of the builtin function is 32:
in main.cpp
```c++=50
float* values1 = new(std::align_val_t{ 16 }) float[N];
float* values2 = new(std::align_val_t{ 16 }) float[N];
double* values3 = new(std::align_val_t{ 16 }) double[N];
float* output = new(std::align_val_t{ 16 }) float[N];
```
in test1.cpp
```c++=50
a = (float *)__builtin_assume_aligned(a, 32);
b = (float *)__builtin_assume_aligned(b, 32);
c = (float *)__builtin_assume_aligned(c, 32);
```

By referencing the Intel® C++ Compiler Classic Developer Guide and Reference (https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions-2/overview-intrinsics-for-intel-advanced-vector-extensions-2-intel-avx2-instructions.html), **it turns out that Intel® Advanced Vector Extensions 2 (Intel® AVX2) extends Intel® AVX by promoting 128-bit (16-byte) SIMD operations to 256 bits (32 bytes). So when AVX2 is in use, a 32-byte alignment value has to be passed to the builtin function** `(float *)__builtin_assume_aligned(a, 32)` **no matter whether the alignment value is 16 or 32 when allocating the dynamic arrays**; the compiler simply trusts this promise when choosing aligned moves.
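One way to see why promising 32-byte alignment can still appear to work for a 16-byte allocation request is to inspect the actual address at runtime. A minimal sketch, noting that allocator behavior varies, so this is illustrative only:
```c++
#include <cstdint>
#include <cstdio>
#include <new>

int main()
{
    // Request only 16-byte alignment, then check what we actually got.
    float* p = new (std::align_val_t{16}) float[1024];
    printf("address %% 16 = %lu, address %% 32 = %lu\n",
           (unsigned long)((uintptr_t)p % 16),
           (unsigned long)((uintptr_t)p % 32));
    // If the second value is 0, the allocation happens to be 32-byte aligned,
    // so __builtin_assume_aligned(p, 32) will not fault at runtime.
    ::operator delete[](p, std::align_val_t{16});
    return 0;
}
```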
## Q2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers. Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.
test1 raw data & median:

||no-vectorize|vectorize|vectorize+AVX2|
|-|-|-|-|
|Elapsed time (s)|9.91341|2.89098|1.73995|
|Speedup|1x|3.4291x|5.6975x|
test2 raw data & median:

||no-vectorize|vectorize|vectorize+AVX2|
|-|-|-|-|
|Elapsed time (s)|9.813215|2.23736|1.234425|
|Speedup|1x|4.3861x|7.9496x|
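The speedups in both tables are just ratios of the median elapsed times. A minimal sketch of the calculation, using the test1 numbers above:
```c++
#include <cstdio>

int main()
{
    // Median elapsed times from the test1 table above (seconds).
    double t_novec = 9.91341, t_vec = 2.89098, t_avx2 = 1.73995;
    printf("vectorize speedup:      %.4fx\n", t_novec / t_vec);   // ~3.43x
    printf("vectorize+AVX2 speedup: %.4fx\n", t_novec / t_avx2);  // ~5.70x
    return 0;
}
```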
Assuming that the speedup reflects the bit width of the vector registers, that the actual speedup falls below the ideal speedup, and that register bit widths are powers of 2, two facts can be derived from the tables above:
1. The vectorized case is 3.4x/4.39x faster than the unvectorized case; rounding up to the next power of 2, a vector register can process 4x as much data per instruction as the scalar code.
2. The AVX2 case is 5.7x/7.95x faster than the unvectorized case, so an AVX2 vector register can process 8x as much data per instruction as the scalar code.
Per the hint, the most relevant fact is that each array element is a `float`, which is 32 bits wide. The 64-bit registers such as rbx, rcx, and r15 used in the earlier assembly diagrams (my CPU is a 9th-gen Intel Core i7) are general-purpose registers used for addresses and loop counters; they do not determine the vector width. Since the scalar code handles one 32-bit float at a time, processing 4x the data means the default vector registers hold 4 floats, i.e. 4 × 32 = 128 bits (the SSE xmm registers), and processing 8x the data under AVX2 means 8 floats per register, i.e. 8 × 32 = 256 bits (the ymm registers).
**Conclusion: general-purpose registers: 64 bits; default vector registers: 128 bits; AVX2 vector registers: 256 bits.**
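A quick way to confirm the float capacity of the two register classes, compiled with -mavx2 (as with AVX2=1 in the Makefile):
```c++
#include <immintrin.h>
#include <cstdio>

int main()
{
    // __m128 maps to an SSE xmm register, __m256 to an AVX/AVX2 ymm register.
    printf("__m128: %zu floats (%zu bits)\n",
           sizeof(__m128) / sizeof(float), sizeof(__m128) * 8);   // 4 floats, 128 bits
    printf("__m256: %zu floats (%zu bits)\n",
           sizeof(__m256) / sizeof(float), sizeof(__m256) * 8);   // 8 floats, 256 bits
    return 0;
}
```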
## Q3: Provide a theory for why the compiler is generating dramatically different assembly.
Before patching, the compile result showed that the computation was not vectorized; after patching, vectorization succeeded.
I tested the following code blocks to derive a theory about why vectorization sometimes succeeds and sometimes fails.
By default, we know that the first block of code vectorizes successfully but the second does not.
1.
```c++=50
for (int j = 0; j < N; j++)
{
    c[j] = a[j];
    if (b[j] > a[j])
        c[j] = b[j];
}
```
2.
```c++=50
for (int j = 0; j < N; j++)
{
    if (b[j] > a[j])
        c[j] = b[j];
    else
        c[j] = a[j];
}
```
At first, I guessed that a **single if statement without an else** leads to the vectorization failure. I changed the loop body to a single if by deleting line 52 in code block 1:
```c++=50
for (int j = 0; j < N; j++)
{
    if (b[j] > a[j])
        c[j] = b[j];
}
```
This block is vectorized successfully even though the single if remains, which shows that a single if statement without an else is not the reason.
Second, I changed the loop body to a single line of code:
```c++=50
for (int j = 0; j < N; j++)
{
    c[j] = b[j];
}
```
This time vectorization failed, and the assembly contains a call instruction (presumably the compiler recognized the copy loop and replaced it with a call to a library routine such as memcpy/memmove):
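In other words, the compiler plausibly rewrites the loop into something like the following sketch; this is my assumption based on the callq, and `copy_loop` is an illustrative name:
```c++
#include <cstring>

// Sketch (assumption): the pure copy loop pattern-matches to a single
// library call, so no loop is left for the vectorizer to report on.
void copy_loop(float* __restrict c, const float* __restrict b, int N)
{
    memmove(c, b, (size_t)N * sizeof(float));
}
```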


This can be fixed by adding a float near 0 to b[j] (or performing any other trivial calculation):
```c++=50
for (int j = 0; j < N; j++)
{
    c[j] = b[j] + 0.000001;
}
```
As the diagram below shows, the resulting assembly is vectorized successfully and all the callq instructions vanish.

Furthermore, by exploring the assembly diagrams above, I found that whenever the loop body reduces to depth 1 (a single trivial statement), the callq instruction appears and vectorization fails, and whenever the depth is 2, the callq instruction disappears and vectorization succeeds.
Finally, if we modify the code by expanding the `c[j] = a[j]` line into two lines, `c[j] = 0;` and `c[j] += a[j];`, which produce the same result, vectorization still succeeds:
```c++=50
for (int j = 0; j < N; j++)
{
    c[j] = 0;
    c[j] += a[j];
    if (b[j] > a[j])
        c[j] = b[j];
}
```
From the experiments above, I derive some possible reasons why the two different implementations of max() lead to dramatically different assembly:
1. The unpatched version lets the compiler replace the depth-2 inner loop with a call instruction (callq), and that call prevents vectorization. The patched version keeps the deeper loop body instead of the callq, which vectorizes successfully. Making the code slightly more complicated (adding some trivial memory assignment or calculation, as in the last code block) makes it harder for the compiler to reduce the loop depth, prevents it from emitting the callq, and thus lets vectorization succeed.
2. A write-after-write (WAW) dependency may also play a role. In the patched code, `c[j]` can be written twice in one iteration: an unconditional write followed by a conditional overwrite, a WAW pattern the compiler can if-convert into an unconditional store of a selected value. In the unpatched code, `c[j]` is written exactly once, but every write is control-dependent on the branch, which is harder for the compiler to vectorize.
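To illustrate reason 2, here is a hand-written intrinsic sketch (my illustration, not the compiler's actual output) of what the patched pattern can be if-converted into with AVX: unconditional loads, a per-lane compare, and a blend, with no branches left in the vector body:
```c++
#include <immintrin.h>  // compile with -mavx2

// Sketch: the patched "c[j] = a[j]; if (b[j] > a[j]) c[j] = b[j];" loop
// expressed as branch-free AVX operations, 8 floats per iteration.
void max_loop(float* __restrict a, float* __restrict b, float* __restrict c, int N)
{
    for (int j = 0; j <= N - 8; j += 8) {
        __m256 va   = _mm256_loadu_ps(a + j);
        __m256 vb   = _mm256_loadu_ps(b + j);
        __m256 mask = _mm256_cmp_ps(vb, va, _CMP_GT_OQ);  // b[j] > a[j], per lane
        __m256 vc   = _mm256_blendv_ps(va, vb, mask);     // pick b where mask is set
        _mm256_storeu_ps(c + j, vc);
    }
    // Remainder elements (N not a multiple of 8) handled scalar.
    for (int j = N - (N % 8); j < N; j++)
        c[j] = (b[j] > a[j]) ? b[j] : a[j];
}
```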
## Reference
- Auto-vectorization with gcc 4.7 (http://locklessinc.com/articles/vectorize/)
- A Guide to Vectorization with Intel® C++ Compilers (https://d3f8ykwhia686p.cloudfront.net/1live/intel/CompilerAutovectorizationGuide.pdf)
- Intel® C++ Compiler Classic Developer Guide and Reference (https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions-2/overview-intrinsics-for-intel-advanced-vector-extensions-2-intel-avx2-instructions.html)