Parallel Programming Assignment I
===

## Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

Vector utilization decreases as VECTOR_WIDTH increases. Utilization is determined by the number of 1 bits in the mask of each vector operation, and the dominant factor in this code is the number of multiplications performed in the exponentiation loop. With a large VECTOR_WIDTH, it is more likely that a few lanes hold a high exponent, so most lanes stop multiplying after a few iterations (their mask bit is already 0) while the vector keeps iterating for the remaining lanes, which lowers utilization. A small VECTOR_WIDTH is less affected by this.

## Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What can you infer about the bit width of the default vector registers on the PP machines?

```
# case 1
$ make clean && make && ./test_auto_vectorize -t 1
8.32627sec
# case 2
$ make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1
2.66081sec
```

The vectorized code is about **3x** faster (8.32627 / 2.66081 ≈ 3.13) than the unvectorized code.
A float occupies 32 bits (4 bytes), and register sizes are generally powers of two. A speedup of about 3x is closest to 4 lanes of 32 bits each, so the default vector registers are probably **128-bit**.

## Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

Both `make test2.o ASSEMBLE=1 VECTORIZE=1` and `make test2.o ASSEMBLE=1` produce the same non-vectorized assembly code, but after the C code is modified the `movaps` and `maxps` instructions do appear (as shown below).

```
# test2.vec(before).s and test2.novec(before).s  <-- before the fix
...
.LBB0_3:                                # Parent Loop BB0_2 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
    movl    (%r15,%rcx,4), %edx
    movl    %edx, (%rbx,%rcx,4)
    movss   (%r14,%rcx,4), %xmm0        # xmm0 = mem[0],zero,zero,zero
    movd    %edx, %xmm1
    ucomiss %xmm1, %xmm0
    jbe     .LBB0_5
...
```

```
# test2.vec(after).s  <-- after the fix
...
.LBB0_3:                                # Parent Loop BB0_2 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
    movaps  (%r15,%rcx,4), %xmm0
    movaps  16(%r15,%rcx,4), %xmm1
    maxps   (%rbx,%rcx,4), %xmm0
    maxps   16(%rbx,%rcx,4), %xmm1
    movups  %xmm0, (%r14,%rcx,4)
    movups  %xmm1, 16(%r14,%rcx,4)
    movaps  32(%r15,%rcx,4), %xmm0
    movaps  48(%r15,%rcx,4), %xmm1
    maxps   32(%rbx,%rcx,4), %xmm0
    maxps   48(%rbx,%rcx,4), %xmm1
    movups  %xmm0, 32(%r14,%rcx,4)
    movups  %xmm1, 48(%r14,%rcx,4)
    addq    $16, %rcx
    cmpq    $1024, %rcx                 # imm = 0x400
    jne     .LBB0_3
...
```

Before the patch is applied, `make` emits the remark `loop not vectorized: unsafe dependent memory operations in loop.` (shown below), so the loop probably could not be vectorized because of a **data dependence** problem.

```
311551144@pp037-ubuntu:~/HW1/HW1/part2$ make clean; make test2.o ASSEMBLE=1 VECTORIZE=1
rm -f *.o *.s test_auto_vectorize *~
if [ ! -d "./assembly" ]; then mkdir "./assembly"; fi
clang++-11 -I./common -O3 -std=c++17 -Wall -S -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -c test2.cpp -o assembly/test2.vec.s
test2.cpp:14:5: remark: loop not vectorized: unsafe dependent memory operations in loop. Use #pragma loop distribute(enable) to allow loop distribution to attempt to isolate the offending operations into a separate loop [-Rpass-analysis=loop-vectorize]
    for (int j = 0; j < N; j++)
    ^
test2.cpp:14:5: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
```
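To make the data-dependence theory more concrete, here is a minimal sketch contrasting two ways of writing an element-wise max loop. The contents of `test2.cpp` are not reproduced in this report, so both loop bodies are assumptions inferred from the assembly listings: the "before" listing stores unconditionally and then compares and branches, while the "after" listing uses `maxps`. The function names `max_store_then_overwrite` and `max_if_else` are hypothetical.

```cpp
#include <cstddef>

// Assumed "before" shape (not taken from the actual test2.cpp):
// c[j] is written unconditionally, then possibly written again under a
// branch, so the same address may be stored to twice per iteration.
void max_store_then_overwrite(const float* a, const float* b, float* c,
                              std::size_t n) {
    for (std::size_t j = 0; j < n; j++) {
        c[j] = a[j];
        if (b[j] > a[j])
            c[j] = b[j];
    }
}

// Assumed "after" shape (also hypothetical): every element is stored
// exactly once, and the stored value is a plain select between a[j] and
// b[j], which a compiler can map onto a packed max instruction.
void max_if_else(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t j = 0; j < n; j++) {
        if (b[j] > a[j])
            c[j] = b[j];
        else
            c[j] = a[j];
    }
}
```

Both functions produce the same output for the same inputs; they differ only in how many stores to `c[j]` each iteration can perform. One plausible reading is that this difference is what the "unsafe dependent memory operations" remark points at.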
