# Programming Assignment I: SIMD Programming

## I forgot to submit to e3, so I got zero on this homework.

### Anyone who is seeing this, it is alright to copy my whole work. It's not plagiarism since I never even submitted.

#### Remember to submit your work.

## Q1

### Q1-1: Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

**ANS:** Vector utilization decreases as VECTOR_WIDTH increases.

logger.cpp:

![](https://i.imgur.com/Z1tp28A.png)

From the logger output above, we can see that vector utilization is determined by the mask values. In my code, the mask is 1 for the lanes that still need to do a multiplication (every lane whose value is greater than cnt gets mask = 1, then cnt is incremented). With a larger VECTOR_WIDTH, it becomes more likely that many lanes have already finished and hold 0 in the mask while the slowest lane keeps the whole vector iterating, so utilization drops. (A small standalone simulation of this effect is sketched at the end of this report.)

VECTOR_WIDTH = 2:

![](https://i.imgur.com/oiryAGP.png)

VECTOR_WIDTH = 4:

![](https://i.imgur.com/tR62QQJ.png)

VECTOR_WIDTH = 8:

![](https://i.imgur.com/HGG2QGZ.png)

VECTOR_WIDTH = 16:

![](https://i.imgur.com/c5xDNDQ.png)

## Q2

### Q2-1: Fix the code to make sure it uses aligned moves for the best performance. Hint: we want to see vmovaps rather than vmovups.

**ANS:**

**Attempt 1:** I included `immintrin.h` to use the AVX data types and replaced `float` with `__m128`, which is described as a "128-bit vector containing 4 floats". It works, but then I could not properly continue to the next question, so I assume this is not the way the TA intended.

**Attempt 2:** Change the alignment promised for a, b, and c in `__builtin_assume_aligned(const void *, align)` from 16 bytes to 32 bytes. This works because AVX2 uses 256-bit-wide (32-byte) SIMD registers: with only a 16-byte guarantee the compiler cannot assume the 32-byte accesses are aligned, so it emits `vmovups`. Promising 32-byte alignment lets it emit `vmovaps`. (See the alignment sketch at the end of this report.)

### Q2-2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.

**ANS:**

```sh
# case 1
$ make clean && make && ./test_auto_vectorize -t 1
```

![](https://i.imgur.com/I6oERMm.png)

```sh
# case 2
$ make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1
```

![](https://i.imgur.com/a6oMvIS.png)

```sh
# case 3
$ make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1
```

![](https://i.imgur.com/mev4mDP.png)

**From unvectorized to vectorized:** about 3x faster.
**Using AVX2:** almost another 2x on top of that.

Since a *float is 4 bytes (32 bits)* and the ***default (SSE) vector registers are 16 bytes (128 bits)***, each vector instruction handles 4 floats, so the ideal speedup is 4x and we observe roughly 3x. ***AVX2 vector registers are 32 bytes (256 bits)***, handling 8 floats per instruction, i.e. twice as much data in the same unit time, which matches the additional ~2x.

### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

**ANS:** I have two possible guesses:

1. The pattern in the first version is too hard for the compiler to recognize as a vectorizable select, while the second version is easy for it to read: build a mask from `b[j] > a[j]`, then apply the different vectorized ops according to it.
2. In the first version (c[j] = a[j], then c[j] = b[j] if b[j] > a[j]), the same element may be written twice per iteration. If vectorized carelessly, this double store could look like a data dependency problem, so the compiler falls back to much more conservative code.

(Both loop shapes are written out in a sketch at the end of this report.)

### References

[AVX - Chessprogramming Wiki](https://www.chessprogramming.org/AVX)
[Intel® Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2)
[Crunching Numbers with AVX and AVX2](https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX)
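### Appendix: Code Sketches

**Q1-1 utilization trend.** To make the trend concrete, here is a minimal standalone simulation of the masked multiply loop. It is a sketch, not the assignment's actual logger.cpp: the array size, the exponent range (`rand() % 10`), and the `simulate()` helper are all hypothetical, and only the multiply loop is modeled, so the percentages will not match the logger output exactly; the trend (utilization falling as VECTOR_WIDTH grows) is the same.

```cpp
// util_sim.cpp -- hypothetical sketch, not the assignment's logger.cpp.
// A lane stays active while its remaining exponent is greater than cnt,
// and the whole vector keeps iterating until the slowest lane finishes.
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <algorithm>

static double simulate(const std::vector<int>& exps, int vectorWidth) {
    long active = 0, total = 0;
    for (std::size_t base = 0; base < exps.size(); base += vectorWidth) {
        int iters = 0;
        for (int i = 0; i < vectorWidth; ++i)
            iters = std::max(iters, exps[base + i]);  // slowest lane decides iteration count
        for (int i = 0; i < vectorWidth; ++i)
            active += exps[base + i];                 // lane-iterations with mask == 1
        total += (long)iters * vectorWidth;           // every lane occupies a slot each iteration
    }
    return total ? 100.0 * active / total : 100.0;
}

int main() {
    std::vector<int> exps(1 << 16);                   // hypothetical problem size
    for (int& e : exps) e = rand() % 10;              // hypothetical exponent range

    for (int w : {2, 4, 8, 16})
        printf("VECTOR_WIDTH = %2d -> utilization ~ %.1f%%\n", w, simulate(exps, w));
    return 0;
}
```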
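**Q2-1 alignment fix.** A minimal sketch of the Attempt 2 change. The function name and loop shape are illustrative rather than copied from the assignment's test file, and it assumes the arrays really are allocated with 32-byte alignment (otherwise the promise is undefined behavior); the only point is changing the second argument of `__builtin_assume_aligned` from 16 to 32 so that `-mavx2` builds can use `vmovaps`.

```cpp
// aligned.cpp -- illustrative loop, not the exact assignment file.
#include <cstddef>

void addVectors(float* a, float* b, float* c, std::size_t n) {
    // Before: only 16-byte alignment was promised, so with -mavx2 the
    // compiler had to emit unaligned 256-bit moves (vmovups).
    //   a = (float*)__builtin_assume_aligned(a, 16);
    //
    // After: promise 32-byte alignment so 256-bit AVX2 loads/stores
    // can use the aligned form (vmovaps).
    a = (float*)__builtin_assume_aligned(a, 32);
    b = (float*)__builtin_assume_aligned(b, 32);
    c = (float*)__builtin_assume_aligned(c, 32);

    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```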
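**Q2-3 loop shapes.** The two versions compared in Q2-3, written out as a sketch based on the description in the answer (function names are mine, not the assignment's):

```cpp
// maxloops.cpp -- the two loop shapes discussed in Q2-3 (sketch).

// Version 1: assign first, then conditionally overwrite.
// c[j] may be stored twice per iteration, which is harder for the
// compiler to prove safe to vectorize.
void maxBranchy(float* a, float* b, float* c, int n) {
    for (int j = 0; j < n; ++j) {
        c[j] = a[j];
        if (b[j] > a[j])
            c[j] = b[j];
    }
}

// Version 2: a single select per element.
// This maps directly onto a compare + blend (or vmaxps), so the
// compiler vectorizes it cleanly.
void maxSelect(float* a, float* b, float* c, int n) {
    for (int j = 0; j < n; ++j)
        c[j] = (b[j] > a[j]) ? b[j] : a[j];
}
```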