---
tags: Parallel Programming, Graduate Course
---
# Homework 1
## Part 1.
#### Run `./myexp -s 10000` and sweep the vector width from 2, 4, 8, to 16
:::info
| vector width | utilization |
|:------------:| ----------- |
| 2 | 76.8% |
| 4 | 70.6% |
| 8 | 67.0% |
| 16 | 66.1% |
:::
#### **Q1-1**: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?
:::info
Vector utilization decreases as the vector width increases.
With a wider vector, all lanes must keep iterating until the lane with the largest exponent finishes, so more lanes sit idle behind their masks and it becomes harder to fully utilize the vector parallelism. A small simulation of this effect is sketched below.
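
The following is a minimal simulation sketch, not code from the assignment: each lane is given a random amount of work (standing in for the per-element exponents in `myexp`), while the whole vector must run until its slowest lane is done. The exponent range and group count are arbitrary assumptions.
```c++
#include <algorithm>
#include <cstdio>
#include <random>

int main() {
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> exp_dist(0, 9);  // assumed exponent range
  const int groups = 100000;                          // number of vectors to simulate
  const int widths[] = {2, 4, 8, 16};

  for (int width : widths) {
    long long useful = 0, total = 0;
    for (int g = 0; g < groups; g++) {
      int max_e = 0, sum_e = 0;
      for (int lane = 0; lane < width; lane++) {
        int e = exp_dist(rng);        // iterations this lane really needs
        sum_e += e;
        max_e = std::max(max_e, e);
      }
      useful += sum_e;                      // lane-iterations doing real work
      total += (long long)width * max_e;    // every lane is busy until the slowest finishes
    }
    std::printf("width %2d: utilization %.1f%%\n", width, 100.0 * useful / total);
  }
  return 0;
}
```
The printed utilization falls as the width grows, matching the trend in the table above.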
:::
## Part 2.
#### **Q2-1**: Fix the code to make sure it uses aligned moves for the best performance.
:::info
I modified `test1.cpp` as follows and recompiled:
```c++
void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  // 32-byte alignment matches the 256-bit AVX2 registers
  a = (float *)__builtin_assume_aligned(a, 32);
  b = (float *)__builtin_assume_aligned(b, 32);
  c = (float *)__builtin_assume_aligned(c, 32);

  for (int i = 0; i < I; i++) {   // I is the repeat count defined elsewhere in test1.cpp
    for (int j = 0; j < N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}
```
See the difference:
> `$ diff assembly/test1.vec.restr.align.s assembly/test1.vec.restr.align.avx2.s`
After this change, the AVX2 assembly uses the aligned move instruction `vmovaps` on the 256-bit `ymm` registers instead of the unaligned `vmovups`.
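
Note that `__builtin_assume_aligned(..., 32)` is only a promise to the compiler; the buffers passed to `test()` must actually be 32-byte aligned, since one AVX2 register is 256 bits = 32 bytes. A minimal caller sketch (an assumption, not the assignment's own driver code) using C++17 `std::aligned_alloc`:
```c++
#include <cstdlib>

void test(float* __restrict a, float* __restrict b, float* __restrict c, int N);

int main() {
  constexpr int N = 1024;
  // Allocate 32-byte-aligned buffers so the alignment promise inside test() holds.
  float* a = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
  float* b = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));
  float* c = static_cast<float*>(std::aligned_alloc(32, N * sizeof(float)));

  // ... fill a and b, then call test(a, b, c, N) ...

  std::free(a);
  std::free(b);
  std::free(c);
  return 0;
}
```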

:::
#### **Q2-2**: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.
:::info
| | Case 1 (no vectorization) | Case 2 (vectorized) | Case 3 (vectorized + AVX2) |
| ---------- | ------- | ------- | ------- |
| Time (sec) | 8.17964 | 2.61064 | 1.35484 |
| Speedup | 1x | 3.13x | 6.04x |
* The vectorized code is about 3× faster than the unvectorized code, close to the ideal 4× for four packed floats, so the default bit width of the vector registers on the PP machines should be 128 bits.
* Enabling AVX2 roughly doubles that speedup, so the AVX2 vector registers should be 256 bits wide (8 packed floats). That is also why the arrays must be assumed 32-byte aligned: one 256-bit register covers 32 bytes, which lets the compiler use aligned 256-bit moves and our program gets the full benefit of the wider registers. A quick sanity check is sketched below.
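
As a quick sanity check (not part of the assignment), the intrinsic vector types report these widths directly; compile with `-mavx2`:
```c++
#include <immintrin.h>
#include <cstdio>

int main() {
  // __m128 is the 128-bit SSE register type, __m256 the 256-bit AVX/AVX2 one.
  std::printf("__m128: %zu bits (%zu floats)\n",
              sizeof(__m128) * 8, sizeof(__m128) / sizeof(float));
  std::printf("__m256: %zu bits (%zu floats)\n",
              sizeof(__m256) * 8, sizeof(__m256) / sizeof(float));
  return 0;
}
```
It prints 128 bits (4 floats) and 256 bits (8 floats), matching the inference from the measured speedups.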
:::
#### **Q2-3**: Provide a theory for why the compiler is generating dramatically different assembly.
:::info
* The compiler analyzes the data dependences in the loop body; in addition, whether the if-else condition can be expressed as a masked/select operation determines whether the loop can exploit vector parallelism at all.
* In this case, `c[j] = a[j];` is executed unconditionally on every iteration and then conditionally overwritten inside the `if`. The compiler regards this unconditional store followed by a guarded store to the same element as unsafe to rewrite into a single vector select, and therefore it cannot vectorize the loop in its original form. The pure if-else version (effectively `c[j] = max(a[j], b[j])`) removes that obstacle, which is why the generated assembly differs so dramatically (see the sketch below).
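
For reference, a sketch of the two loop bodies being compared; this is reconstructed from the question, and the function names and exact shapes are assumptions rather than copies of `test2.cpp`:
```c++
// Variant 1: unconditional store to c[j], then a conditional overwrite of the
// same element. The two differently guarded stores make it hard for the
// compiler to treat each iteration as one clean select.
void variant1(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  for (int j = 0; j < N; j++) {
    c[j] = a[j];
    if (b[j] > a[j])
      c[j] = b[j];
  }
}

// Variant 2: every iteration is a pure select, effectively c[j] = max(a[j], b[j]),
// which maps directly onto a blend/max-style vector pattern.
void variant2(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  for (int j = 0; j < N; j++) {
    if (b[j] > a[j])
      c[j] = b[j];
    else
      c[j] = a[j];
  }
}
```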
:::
:::info
well done
> [name=劉安齊]
:::