# HW1
## **Part 1**
#### Q1-1

This part of the assignment has us simulate vector operations in a high-level language. Based on the picture above, my vector utilization is 73%. First, as the vector is loaded, I use a boolean mask vector to record whether each lane holds valid data. Then I use cntbits() to check whether any lane's remaining exponent is still greater than zero; the active lanes keep multiplying until cntbits() returns zero, at which point the result is complete.
#### Q1-2
VECTOR_WIDTH = 2

VECTOR_WIDTH = 4

VECTOR_WIDTH = 8

VECTOR_WIDTH = 16

Comparing the vector utilization across the four pictures, it is obvious that utilization decreases as the vector width increases. Since each vector operation processes several elements at a time, but not every element needs the same number of operations, the lanes that finish early must wait for the lanes that have not finished, and this waiting is what lowers the utilization. The wider the vector, the more lanes sit idle waiting for the lane with the longest-running operation, so utilization is lower than with a smaller vector width.
#### Bonus

I set the vector width to 4 and N to 64.
First, I compute C, the base-2 logarithm of the vector width. Then I apply hadd() and interleave() C times; after that, every lane of the vector holds the same value, namely the sum of the vector. Last, I sum up vector.value[0] across all the vectors, and we get the answer. My vector utilization is 100%.
## **Part 2**
#### Q2-1

I change the alignment to 32 bytes, which matches the 256-bit AVX vector registers, so every vector load starts on a 32-byte boundary. vmovups is the unaligned move: it must handle data that may straddle a cache-line or page boundary, so it can require extra cache and page accesses when the data is not aligned. vmovaps assumes the address is already aligned, so each access touches a single aligned block, which eliminates that extra work.
#### Q2-2
The vectorized code achieves a speedup of 3.322 over the unvectorized code, and a speedup of 5.787 with -mavx2. I think the bit width of the default PP machine target is 128 bits, while AVX2 is 256 bits. Since the original scalar operation works on 32 bits (one float) at a time, the vector width determines the speedup, so given the measured speedups above, I think the ideal speedups are 4 and 8 respectively.
#### Q2-3
The compiler detects data dependencies between instructions, and therefore generates different assembly code to optimize the workflow.