###### tags: `Parallel Programming`
# Parallel Programming HW1
## Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
Ans: Overall it decreases. The larger VECTOR_WIDTH is, the higher the chance that some lanes finish their work early but must wait, masked off, for the slowest lane in the same vector. Example: lane 1 has finished, but because lane 2 has not, the vector instruction still executes and lane 1 sits idle, lowering utilization.
| Vector Width | Vector Utilization |
| ------------ | ------------------ |
| 2 | 87.9% |
| 4 | 84.1% |
| 8 | 79.4% |
| 16 | 80.2% |
## Q2-1: Fix the code to make sure it uses aligned moves for the best performance.
Ans: AVX2 uses 256-bit registers, i.e. 32 bytes. Telling the compiler that the pointers are 32-byte aligned instead of 16-byte aligned lets it emit aligned moves and avoids the misalignment penalty.
```
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 32); // original is (a, 16)
  b = (float *)__builtin_assume_aligned(b, 32); // original is (b, 16)
  c = (float *)__builtin_assume_aligned(c, 32); // original is (c, 16)
  for (int i = 0; i < I; i++) {
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
}
```
## Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
Ans:
| Case | 1st run | 2nd run | 3rd run | Average | Speedup |
| ---------------- | ---------- | ---------- | ---------- | --------- | ------- |
| Original | 8.19095 sec | 8.19238 sec | 8.19116 sec | 8.1915 sec | 1x |
| Vectorized | 2.61805 sec | 2.61716 sec | 2.61809 sec | 2.6178 sec | ~3x |
| Vectorized + AVX2 | 1.37841 sec | 1.35897 sec | 1.35931 sec | 1.3656 sec | ~6x |
The default vectorization width is 4, so the default vector registers should be 4 lanes × 32-bit floats = 128 bits.
With AVX2 the vectorization width is 8, so the AVX2 vector registers should be 8 lanes × 32-bit floats = 256 bits. The measured ~3× and ~6× speedups are close to the ideal 4× and 8×, which is consistent with these widths.
## Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.
Ans:
Code A
```
for (int i = 0; i < I; i++)
{
for (int j = 0; j < N; j++)
{
/* max() */
c[j] = a[j];
if (b[j] > a[j])
c[j] = b[j];
}
}
```
Code B
```
for (int i = 0; i < I; i++)
{
for (int j = 0; j < N; j++)
{
/* max() */
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
}
}
```
Comparing the two codes, the difference is in how the compiler handles the `if (b[j] > a[j])` part.
In Code A, `c[j]` is first unconditionally assigned `a[j]` and then conditionally overwritten with `b[j]`, so each element may take two assignments to `c`. Code B performs exactly one assignment per element: it is a pure select between `a[j]` and `b[j]`, which the compiler can lower to a single vector max over aligned vectors of `a` and `b` followed by one store to `c`.
Because Code B expresses the max more directly, the compiler exploits that simplicity and generates much simpler, fully vectorized assembly.