# Programming Assignment I: SIMD Programming

### Part 1: Vectorizing Code Using Fake SIMD Intrinsics

#### Please enter the `part1` folder:

```shell
$ cd part1
```

#### Execute:

```shell
$ ./myexp
```

#### Execute & Specify N (workload size):

```shell
$ ./myexp -s 3
$ ./myexp -s 1000
$ ./myexp -s 10000
```

#### Execute & Print Log:

```shell
$ ./myexp -l
```

---

### Part 2: Vectorizing Code with Automatic Vectorization Optimizations

#### Please enter the `part2` folder:

```shell
$ cd part2
```

---

### Part 3: Q & A

### Q1-1: Does the vector utilization increase, decrease or stay the same as `VECTOR_WIDTH` changes? Why?

### <font color="green">Discuss the vector utilization in two aspects:</font>

#### From `logger.cpp`, we can derive the equations:

#### $$vector \space utilization = {utilized \space lanes \over total \space lanes} \space \space \space \space (1)$$

#### $$utilized \space lanes = \sum_{instructions} \# \space of \space 1s \space in \space mask \space \space \space \space (2)$$

#### $$total \space lanes = \# \space of \space instructions * vector \space width \space \space \space \space (3)$$

#### 1. From Equation (3):

#### When `VECTOR_WIDTH` increases, the total lanes increase proportionally, while the utilized lanes grow more slowly because more lanes sit idle under the mask. Therefore, the vector utilization decreases.

#### 2. From the Implementation Aspect:

#### When the number of 1s in the mask increases, the utilized lanes increase. In `clampedExpVector()`, I use the values of `exponents` to decide the values of the masks: if an exponent has not reached zero yet, its mask lane is set to 1; otherwise, it is set to 0. The calculation continues until every mask lane becomes zero.

#### If one exponent is much larger than the others, the other lanes will likely have finished their multiplications while that single lane is still computing. This makes the vector utilization less efficient.
With a smaller `VECTOR_WIDTH`, this situation is eased, and therefore the vector utilization is higher.

- `VECTOR_WIDTH` = 2, Vector Utilization = 90.7%

```shell
$ make && ./myexp -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 187516
Vector Utilization: 90.7%
Utilized Vector Lanes: 340001
Total Vector Lanes: 375032
************************ Result Verification *************************
ClampedExp Passed!!!
```

- `VECTOR_WIDTH` = 4, Vector Utilization = 88.7%

```shell
$ make && ./myexp -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 107072
Vector Utilization: 88.7%
Utilized Vector Lanes: 379945
Total Vector Lanes: 428288
************************ Result Verification *************************
ClampedExp Passed!!!
```

- `VECTOR_WIDTH` = 8, Vector Utilization = 87.7%

```shell
$ make && ./myexp -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 57878
Vector Utilization: 87.7%
Utilized Vector Lanes: 406001
Total Vector Lanes: 463024
************************ Result Verification *************************
ClampedExp Passed!!!
```

- `VECTOR_WIDTH` = 16, Vector Utilization = 87.2%

```shell
$ make && ./myexp -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 30093
Vector Utilization: 87.2%
Utilized Vector Lanes: 419857
Total Vector Lanes: 481488
************************ Result Verification *************************
ClampedExp Passed!!!
```

:::info
#### A1-1: The vector utilization decreases as `VECTOR_WIDTH` increases.
:::

---

### Q1 Bonus: Implement a vectorized version of `arraySumSerial` in `arraySumVector`

### <font color="green">Time Complexity:</font>

#### $$O({N \over vector \text{ } width} + \log_2{(vector \text{ } width)})$$

- `VECTOR_WIDTH` = 2, Vector Utilization = 100%

```shell
$ make && ./myexp -s 10000
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 2
Total Vector Instructions: 10000
Vector Utilization: 100.0%
Utilized Vector Lanes: 20000
Total Vector Lanes: 20000
************************ Result Verification *************************
ArraySum Passed!!!
```

- `VECTOR_WIDTH` = 4, Vector Utilization = 100%

```shell
$ make && ./myexp -s 10000
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 4
Total Vector Instructions: 5000
Vector Utilization: 100.0%
Utilized Vector Lanes: 20000
Total Vector Lanes: 20000
************************ Result Verification *************************
ArraySum Passed!!!
```

- `VECTOR_WIDTH` = 8, Vector Utilization = 100%

```shell
$ make && ./myexp -s 10000
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 8
Total Vector Instructions: 2500
Vector Utilization: 100.0%
Utilized Vector Lanes: 20000
Total Vector Lanes: 20000
************************ Result Verification *************************
ArraySum Passed!!!
```

- `VECTOR_WIDTH` = 16, Vector Utilization = 100%

```shell
$ make && ./myexp -s 10000
ARRAY SUM (bonus)
****************** Printing Vector Unit Statistics *******************
Vector Width: 16
Total Vector Instructions: 1250
Vector Utilization: 100.0%
Utilized Vector Lanes: 20000
Total Vector Lanes: 20000
************************ Result Verification *************************
ArraySum Passed!!!
```

- [Reference](https://people.cs.vt.edu/yongcao/teaching/cs5234/spring2013/slides/Lecture10.pdf)

---

### Q2-2: What speedup does the vectorized code achieve over the unvectorized code?
### <font color="green">Elapsed execution time in test1():</font>

> #### To run multiple experiments and reduce the measurement error of `gettime()`, `I` is defined as 20000000 in `test.h`.

- #### Case 1: Unvectorized Code

```shell
$ make clean && make && ./test_auto_vectorize -t 1
Running test1()...
Elapsed execution time of the loop in test1(): 11.602sec (N: 1024, I: 20000000)
```

- #### Case 2: Vectorized Code

```shell
$ make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1
Running test1()...
Elapsed execution time of the loop in test1(): 2.89274sec (N: 1024, I: 20000000)
```

:::info
#### A2-2: The vectorized code is about 4.01 times faster than the unvectorized code (11.602 / 2.89274 ≈ 4.01).
:::

---

### What can you infer about the bit width of the default vector registers on the PP machines?

### <font color="green">Derive the bit width of the default vector registers in two ways:</font>

1. #### The Speedup

#### The vectorized code is about 4 times faster than the unvectorized code, so we can infer that each vector instruction processes 4 times as much data as a scalar instruction. Since the data type is `float` (4 bytes), the bit width of the default vector registers is:

$$4 \space (elements) * 4 \space (bytes) * 8 \space (bits) = 128 \space (bits)$$

2. #### The Assembly Code

#### The instructions `movaps`, `movups` and `maxps` access data 16 bytes at a time. In addition, according to the [MOVAPS instruction reference](https://www.felixcloutier.com/x86/movaps.html), these instructions operate on the XMM registers, whose bit width is 128 bits.
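The arithmetic behind this inference can be written out as a short check. The timings are the measured values quoted above; `speedup` and `inferredBits` are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Speedup of the vectorized loop over the scalar loop.
double speedup(double scalarSec, double vectorSec) {
    return scalarSec / vectorSec;
}

// If the vectorized loop processes roughly `s` elements per scalar
// element, the vector register holds round(s) elements of elemBytes
// bytes each, i.e. round(s) * elemBytes * 8 bits.
int inferredBits(double s, int elemBytes) {
    return static_cast<int>(std::lround(s)) * elemBytes * 8;
}
```

Plugging in the measurements: `speedup(11.602, 2.89274)` is about 4.01, and with 4-byte floats `inferredBits` gives 4 × 4 × 8 = 128 bits, consistent with the XMM registers seen in the assembly.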
#### The following is the assembly code of the inner loop in the vectorized code of test2():

```shell
.LBB0_3: # Parent Loop BB0_2 Depth=1
         # => This Inner Loop Header: Depth=2
    movaps (%r15,%rcx,4), %xmm0
    movaps 16(%r15,%rcx,4), %xmm1
    maxps (%rbx,%rcx,4), %xmm0
    maxps 16(%rbx,%rcx,4), %xmm1
    movups %xmm0, (%r14,%rcx,4)
    movups %xmm1, 16(%r14,%rcx,4)
    movaps 32(%r15,%rcx,4), %xmm0
    movaps 48(%r15,%rcx,4), %xmm1
    maxps 32(%rbx,%rcx,4), %xmm0
    maxps 48(%rbx,%rcx,4), %xmm1
    movups %xmm0, 32(%r14,%rcx,4)
    movups %xmm1, 48(%r14,%rcx,4)
    addq $16, %rcx
    cmpq $1024, %rcx # imm = 0x400
    jne .LBB0_3
```

:::info
#### A: The bit width of the default vector registers on the PP machines is 128 bits.
:::

---

### Q2-3: Provide a theory for why the compiler is generating dramatically different assembly.

### <font color="green">The Assembly Code of test2():</font>

- #### Case 1: Before Modifying the Code

#### The compiler uses `movl`, `movss` and `movd` to move the values of `a`, `b` and `c`, then `ucomiss` to compare `a[j]` and `b[j]` and decide whether to store `b[j]` into `c[j]`, and finally `jbe` to branch between the operations. The resulting instructions are scalar: they process only one element at a time.

```cpp
c[j] = a[j];
if (b[j] > a[j]) c[j] = b[j];
```

```shell
.LBB0_7: # in Loop: Header=BB0_3 Depth=2
    addq $2, %rcx
    cmpq $1024, %rcx # imm = 0x400
    je .LBB0_8
.LBB0_5: # in Loop: Header=BB0_3 Depth=2
    movl 4(%r15,%rcx,4), %edx
    movl %edx, 4(%rbx,%rcx,4)
    movss 4(%r14,%rcx,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
    movd %edx, %xmm1
    ucomiss %xmm1, %xmm0
    jbe .LBB0_7
# %bb.6: # in Loop: Header=BB0_3 Depth=2
    movss %xmm0, 4(%rbx,%rcx,4)
    jmp .LBB0_7
```

- #### Case 2: After Modifying the Code

#### The compiler uses `movaps` and `maxps` to perform both the data movement and the comparison: `maxps` computes the elementwise maximum and stores it directly into the register. This approach reduces the number of instructions and registers used, and processes more data at the same time.
```cpp
if (b[j] > a[j]) c[j] = b[j];
else c[j] = a[j];
```

```shell
.LBB0_3: # Parent Loop BB0_2 Depth=1
         # => This Inner Loop Header: Depth=2
    movaps (%r15,%rcx,4), %xmm0
    movaps 16(%r15,%rcx,4), %xmm1
    maxps (%rbx,%rcx,4), %xmm0
    maxps 16(%rbx,%rcx,4), %xmm1
    movups %xmm0, (%r14,%rcx,4)
    movups %xmm1, 16(%r14,%rcx,4)
    movaps 32(%r15,%rcx,4), %xmm0
    movaps 48(%r15,%rcx,4), %xmm1
    maxps 32(%rbx,%rcx,4), %xmm0
    maxps 48(%rbx,%rcx,4), %xmm1
    movups %xmm0, 32(%r14,%rcx,4)
    movups %xmm1, 48(%r14,%rcx,4)
    addq $16, %rcx
    cmpq $1024, %rcx # imm = 0x400
    jne .LBB0_3
```

:::info
#### A2-3: We get better performance because the modified code lets the compiler emit `movaps` and `maxps` instead of `ucomiss` and `jbe`.
:::
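The two source forms compute the same result; the second simply expresses it as one select per element, which maps directly onto `maxps`, while the first hides the maximum behind a store plus a conditional overwrite. A minimal sketch demonstrating the equivalence (function names and array sizes here are illustrative, not from the assignment code):

```cpp
#include <cassert>

// Form 1: store a[j], then conditionally overwrite with b[j].
// The branch makes the elementwise-maximum pattern harder to vectorize.
void maxBranchy(const float* a, const float* b, float* c, int n) {
    for (int j = 0; j < n; ++j) {
        c[j] = a[j];
        if (b[j] > a[j]) c[j] = b[j];
    }
}

// Form 2: a pure select per element, equivalent to std::max(a[j], b[j]),
// which the compiler can lower to a single maxps per vector of floats.
void maxSelect(const float* a, const float* b, float* c, int n) {
    for (int j = 0; j < n; ++j)
        c[j] = (b[j] > a[j]) ? b[j] : a[j];
}
```

Both functions produce identical output for any input, so the rewrite changes only the form the optimizer sees, not the semantics.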