# 2018q3 Homework6
contributed by < `allenchen8210`, `dange0`, `flawless0714`>
###### tags: `sysprog2018`

## Cycles Per Element

* Cycles Per Element (CPE) expresses program performance as the number of CPU cycles needed per element processed. This metric is well suited to repetitive computations, such as processing the pixels of an image: it describes how the cycle count grows with the amount of data, so lower is better.
* The coefficient of `n` in expressions like the following is the CPE:
    * `a + 4.0n` → CPE = 4.0
    * `a + 3.5n` → CPE = 3.5

![](https://i.imgur.com/Qv6pJxC.jpg)

### combine1

```c=
#define IDENT 1
#define OP *

void combine1(vec_ptr v, data_t *dest)
{
    long i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```

CPE test for multiply long
* combine1 : ==20.18==

Note that `vec_length(v)` on line 8 is re-evaluated on every iteration, even though the vector's length never changes inside the `for` loop, so there is no need to fetch it repeatedly.

### combine2

After hoisting the length out of the `for` loop:

```c=
#define IDENT 1
#define OP *

void combine2(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```

CPE test for multiply long
* combine1 : 20.18
* combine2 : ==11.03==

`combine2` still has room for improvement: the element access on line 11 can read the array directly instead of going through `get_vec_element`, which performs a bounds check on every call even though the indices we use can never go out of bounds.

```c=
int get_vec_element(vec_ptr v, long index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}
```

### combine3

```c=
#define IDENT 1
#define OP *

data_t *get_vec_start(vec_ptr v)
{
    return v->data;
}

void combine3(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OP data[i];
    }
}
```

CPE test for multiply long
* combine1 : 20.18
* combine2 : 11.03
* combine3 : ==11.03==

### combine4

The author then further improves `combine3` by turning `*dest = *dest OP data[i];` into `acc = acc OP data[i];`, using a local variable `acc` to hold the accumulated value. Why does this help?
In each iteration, the original `combine3` must read the accumulated value `*dest` from memory and write it back after the operation. ==But the accumulated value produced by one iteration is exactly the starting value of the next.== To avoid these redundant reads and writes, we can keep the running value in a local variable and store it to `*dest` only once, after the loop finishes. The code becomes:

```c=
void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;
    for (i = 0; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}
```

#### The difference is clearer in assembly

* Inner loop of combine3
* dest in `%rbx`
* data+i in `%rdx`
* data+length in `%rax`

```=
.L17:
    vmovsd (%rbx), %xmm0        // read dest
    vmulsd (%rdx), %xmm0, %xmm0
    vmovsd %xmm0, (%rbx)        // write dest
    addq   $8, %rdx
    cmpq   %rax, %rdx
    jne    .L17
```

==The redundancy lies in "read dest" and "write dest"==: we can avoid both of these memory accesses and write the result back just once, after all the work is done.

* Inner loop of combine4
* acc in `%xmm0`
* data+i in `%rdx`
* data+length in `%rax`

```=
.L25:
    vmulsd (%rdx), %xmm0, %xmm0
    addq   $8, %rdx
    cmpq   %rax, %rdx
    jne    .L25
```

CPE test for multiply long
* combine1 : 20.18
* combine2 : 11.03
* combine3 : 11.03
* combine4 : ==5.01==

Homework problem `5.13`: inner product of two vectors

```c=
void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(u);
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t) 0;
    for (i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}
```

Assembly:

```=
.L15:
    vmovsd 0(%rbp,%rcx,8), %xmm1
    vmulsd (%rax,%rcx,8), %xmm1, %xmm1
    vaddsd %xmm1, %xmm0, %xmm0
    addq   $1, %rcx
    cmpq   %rbx, %rcx
    jne    .L15
```

1. Graphical representation of the assembly
![](https://i.imgur.com/TfOYA91.png)
2. Critical path
![](https://i.imgur.com/dUfsbJG.png)
3. The `add` on the critical path determines the CPE; for floating point the CPE is 3.0.
4.
The `add` on the critical path determines the CPE; for integers the CPE is 1.0.

* The issue time is the minimum number of cycles between two consecutive operations of the same type. An operation may be built from several stages, some taking 2 cycles and some taking 1; for example, one operation's stages might take 1-2-1 cycles and another's 1-3-3. If the operation can be split entirely into 1-cycle stages, it is fully pipelined: a new operation of the same kind can be issued every cycle.
    * If an operation is not fully pipelined, its issue time equals its latency.
    * The reciprocal of the issue time is the maximum throughput.
    * Assuming full pipelining, one operation can be issued on every clock tick, so a `4 GHz` processor has a throughput of 4 GIPS.
        * GIPS: giga instructions per second, i.e. one billion instructions per second.
    * Multiple functional units can raise the throughput further.

![](https://i.imgur.com/9Inskdd.png)

## performancelab

We found [performancelab](https://github.com/zxc479773533/CS-APP3e-labs/tree/master/performancelab), a performance-analysis lab, on GitHub.

### trace code

The code that actually computes the CPE is `fcyc_v` in `fcyc.c`.

**[fcyc.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/master/performancelab/fcyc.c#L110)**

```cpp=110
double fcyc_v(test_funct_v f, void *params[])
{
    double result;
    init_sampler();
    if (compensate) {
        do {
            double cyc;
            if (clear_cache)
                clear();
            start_comp_counter();
            f(params);
            cyc = get_comp_counter();
            add_sample(cyc);
        } while (!has_converged() && samplecount < maxsamples);
    } else {
        do {
            double cyc;
            if (clear_cache)
                clear();
            start_counter();
            f(params);
            cyc = get_counter();
            add_sample(cyc);
        } while (!has_converged() && samplecount < maxsamples);
    }
#ifdef DEBUG
    {
        int i;
        printf(" %d smallest values: [", kbest);
        for (i = 0; i < kbest; i++)
            printf("%.0f%s", values[i], i==kbest-1 ? "]\n" : ", ");
    }
#endif
    result = values[0];
#if !KEEP_VALS
    free(values);
    values = NULL;
#endif
    return result;
}
```

- The function takes the function under test and that function's own arguments.
- init_sampler()
    - Initializes the array that will hold the samples.
- add_sample(cyc)
    - Stores the sampled cycle count into the array.
- has_converged()
    - Checks whether the collected samples have converged.
- On line 129, start_counter() and get_counter() wrap f(params) to record the difference in CPU cycles across the call to f(params); their detailed behavior is analyzed in `clock.c` below.

**[clock.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/master/performancelab/clock.c#L83)**

```cpp=83
#if IS_x86
/* $begin x86cyclecounter */
/* Initialize the cycle counter */
static unsigned cyc_hi = 0;
static unsigned cyc_lo = 0;

/* Set *hi and *lo to the high and low order bits of the cycle counter.
   Implementation requires assembly code to use the rdtsc instruction. */
void access_counter(unsigned *hi, unsigned *lo)
{
    asm("rdtsc; movl %%edx,%0; movl %%eax,%1" /* Read cycle counter */
        : "=r" (*hi), "=r" (*lo)              /* and move results to */
        : /* No input */                      /* the two outputs */
        : "%edx", "%eax");
}

/* Record the current value of the cycle counter. */
void start_counter()
{
    access_counter(&cyc_hi, &cyc_lo);
}

/* Return the number of cycles since the last call to start_counter. */
double get_counter()
{
    unsigned ncyc_hi, ncyc_lo;
    unsigned hi, lo, borrow;
    double result;

    /* Get cycle counter */
    access_counter(&ncyc_hi, &ncyc_lo);

    /* Do double precision subtraction */
    lo = ncyc_lo - cyc_lo;
    borrow = lo > ncyc_lo;
    hi = ncyc_hi - cyc_hi - borrow;
    result = (double) hi * (1 << 30) * 4 + lo;
    if (result < 0) {
        fprintf(stderr, "Error: counter returns neg value: %.0f\n", result);
    }
    return result;
}
/* $end x86cyclecounter */
#endif /* x86 */
```

- start_counter
    - Reads the x86 cycle counter via access_counter.
- get_counter
    - Computes how many cycles elapsed between the calls to start_counter and get_counter.
    - `result = (double) hi * (1 << 30) * 4 + lo` is equivalent to `result = (hi << 32) + lo`, combining the two halves into a 64-bit value; the multiplication is done in double arithmetic because left-shifting a 32-bit integer by 32 is undefined.
- access_counter
    - Uses asm() to issue the x86 instruction `rdtsc`, which reads out the cycle counter.
    - `rdtsc` returns the [Time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter) in edx and eax.
- Introduction to [asm](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html)
    - The asm statement allows you to include assembly instructions directly within C code.
    - Usage:
    ```
    asm ( AssemblerTemplate : OutputOperands [ : InputOperands [ : Clobbers ] ])
    ```
    - `:` separates the AssemblerTemplate, OutputOperands, InputOperands, and Clobbers.
    - AssemblerTemplate: the assembly code itself; `%0`, `%1`, … refer in order to the OutputOperands followed by the InputOperands.
    - Output/Input Operands: the output/input parameters; each operand is prefixed with a constraint modifier declaring how it is accessed:
        - `=`: write-only
        - `+`: read-write
        - `&`: early-clobber (written before all inputs have been consumed)
    - Clobbers: tells the compiler which registers or memory locations the assembly modifies.
- Interpreting the asm() in access_counter:
```
asm("rdtsc; movl %%edx,%0; movl %%eax,%1" /* Read cycle counter */
    : "=r" (*hi), "=r" (*lo)              /* and move results to */
    : /* No input */                      /* the two outputs */
    : "%edx", "%eax");
```
can be read as
```
rdtsc
movl %edx, hi
movl %eax, lo
```
where *hi and *lo are write-only operands placed in general registers.

From the analysis above, the CPE is computed by recording the cycle counter before and after the function under test and taking the difference, where the cycle counter itself is read with a dedicated instruction.

## Discussion

### Homework sub-problem 4

Question: explain why the CPE of both floating-point versions is 3.00 even though the multiply operation takes five clock cycles.

Every solution we consulted says the CPE is 3.00 because the ADD is on the critical path. But double-precision multiplication takes five clock cycles; doesn't that mean a single loop iteration needs at least five cycles? (One possible resolution: the multiplier is fully pipelined, so the multiplies of successive iterations overlap; only the `add` forms a loop-carried dependency, and it alone bounds the CPE.)

### Porting inner4() into performancelab for analysis

Code additionally ported into performancelab:

```c=
double *create_double_array(int size)
{
    double *array = malloc(size * sizeof(double));
    if (!array)
        return NULL;
    for (int i = 0; i < size; i++)
        array[i] = (double) rand() / (double) (RAND_MAX / 100);
    return array;
}

void inner4(double *u, double *v, long length)
{
    long i;
    data_t *udata = u;
    data_t *vdata = v;
    data_t sum = (data_t) 0;
    for (i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *v = sum;
}
```

Modified `test_rotate()` in `driver.c` (we replaced the body of the `rotate` test with the homework's `inner4`): (original: **[driver.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/8c7d4ecf61dcb69734f8afef3257d19ade3e8e38/performancelab/driver.c#L351)**)

```c=351
    /* Measure CPE */
    {
        double num_cycles, cpe;
        int tmpdim = dim;
        void *arglist[4];
        double dimension = (double) dim;
        double work = dimension*dimension;
#ifdef DEBUG
        printf("DEBUG: dimension=%.1f\n",dimension);
        printf("DEBUG: work=%.1f\n",work);
#endif
        arglist[0] = (void *) benchmarks_rotate[bench_index].tfunct;
        arglist[1] = (void *) &tmpdim;
        arglist[2] = (void *) orig;
        arglist[3] = (void *) result;
        create(dim);
        num_cycles = fcyc_v((test_funct_v)&func_wrapper, arglist);
        cpe = num_cycles / 100; /* 100 is the number of loop iterations */
        printf("custom test running!\n");
        benchmarks_rotate[bench_index].cpes[test_num] = cpe;
    }
}
```

Modified `fcyc_v()` in `fcyc.c`: (original: **[fcyc.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/8c7d4ecf61dcb69734f8afef3257d19ade3e8e38/performancelab/fcyc.c#L156)**)

```c=156
double fcyc_v(test_funct_v f, void *params[])
{
    double result;
    init_sampler();
    if (compensate) {
        do {
            double cyc;
            if (clear_cache)
                clear();
            double *arr1 = create_double_array(100);
            double *arr2 = create_double_array(100);
            if (arr1 && arr2) {
                start_comp_counter();
                //f(params);
                inner4(arr1, arr2, 100);
                cyc = get_comp_counter();
                // printf("\ncycle per times: %.0f\n", cyc);
                free(arr1);
                free(arr2);
            }
            printf("custom test running!\n");
            /*
            start_comp_counter();
            f(params);
            printf("custom test running!\n");
            cyc = get_comp_counter();
            */
            add_sample(cyc);
        } while (!has_converged() && samplecount < maxsamples);
    } else {
        do {
            double cyc;
            if (clear_cache)
                clear();
            double *arr1 = create_double_array(100);
            double *arr2 = create_double_array(100);
            if (arr1 && arr2) {
                start_counter();
                //f(params);
                inner4(arr1, arr2, 100);
                cyc = get_counter();
                // printf("\ncycle per times: %.0f\n", cyc);
                free(arr1);
                free(arr2);
            }
            //printf("\ncycle per times: %.0f\n", cyc);
            add_sample(cyc);
        } while (!has_converged() && samplecount < maxsamples);
    }
#ifdef DEBUG
    {
        int i;
        printf(" %d smallest values: [", kbest);
        for (i = 0; i < kbest; i++)
            printf("%.0f%s", values[i], i==kbest-1 ?
                   "]\n" : ", ");
    }
#endif
    result = values[0];
#if !KEEP_VALS
    free(values);
    values = NULL;
#endif
    return result;
}
```

Output (with timer-overhead compensation enabled ([compensate for timer overhead](https://github.com/zxc479773533/CS-APP3e-labs/blob/8c7d4ecf61dcb69734f8afef3257d19ade3e8e38/performancelab/driver.c#L716))):

```
Teamname: team_15
Member 1: N/A
Email 1: N/A

Rotate: Version = inner: custom evaluation:
Dim             64      128     256     512     1024    Mean
Your CPEs       3.8     3.8     3.8     3.8     3.8
Baseline CPEs   14.7    40.1    46.4    65.9    94.5
Speedup         3.9     10.6    12.4    17.6    25.2    11.8

Summary of Your Best Scores:
  Rotate: 11.8 (inner: custom evaluation)
  Smooth: 0.0 ((null))
```

The part we care about most is the row `Your CPEs 3.8 3.8 3.8 3.8 3.8`: the CPE is 0.8 higher than the homework solution's theoretical value, presumably because of the throughput bound.

Output (compensation disabled):

```
Teamname: team_15
Member 1: N/A
Email 1: N/A

Rotate: Version = inner: custom evaluation:
Dim             64      128     256     512     1024    Mean
Your CPEs       15.8    15.8    6.3     3.6     3.6
Baseline CPEs   14.7    40.1    46.4    65.9    94.5
Speedup         0.9     2.5     7.3     18.4    26.5    6.1

Summary of Your Best Scores:
  Rotate: 6.1 (inner: custom evaluation)
  Smooth: 0.0 ((null))
```

In this run the first few executions show a noticeably higher CPE, a problem the compensated version does not have. We are unsure why the CPE drops over successive runs instead of the overall average simply being higher: shouldn't the uncompensated version's CPE be higher on average? Or have we overlooked something?

---

>Ported inner4 into performancelab for testing; the CPE for every dimension is around 10.
>[source code](https://github.com/allenchen8210/team15)
>[name=jn]
>[color=green]

==Desktop==
* gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)

```
Architecture:          x86_64
CPU 作業模式:          32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
每核心執行緒數:        2
每通訊端核心數:        6
Socket(s):             1
NUMA 節點:             1
供應商識別號:          GenuineIntel
CPU 家族:              6
型號:                  158
Model name:            Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
製程:                  10
CPU MHz:               800.029
CPU max MHz:           4700.0000
CPU min MHz:           800.0000
BogoMIPS:              7392.00
虛擬:                  VT-x
L1d 快取:              32K
L1i 快取:              32K
L2 快取:               256K
L3 快取:               12288K
NUMA node0 CPU(s):     0-11
```

```
For 999-Dimensional Vector Dot Product.
Cycle = 11088.000000
CPE = 11.099099
For 999-Dimensional Vector Dot Product.
Cycle = 11080.000000
CPE = 11.091091
For 999-Dimensional Vector Dot Product.
Cycle = 11752.000000 CPE = 11.763763 For 999-Dimensional Vector Dot Product. Cycle = 11068.000000 CPE = 11.079079 For 999-Dimensional Vector Dot Product. Cycle = 11784.000000 CPE = 11.795795 For 999-Dimensional Vector Dot Product. Cycle = 11742.000000 CPE = 11.753754 For 999-Dimensional Vector Dot Product. Cycle = 11762.000000 CPE = 11.773774 For 999-Dimensional Vector Dot Product. Cycle = 11106.000000 CPE = 11.117117 For 999-Dimensional Vector Dot Product. Cycle = 11794.000000 CPE = 11.805806 For 1000-Dimensional Vector Dot Product. Cycle = 11102.000000 CPE = 11.102000 For 1000-Dimensional Vector Dot Product. Cycle = 11096.000000 CPE = 11.096000 For 1000-Dimensional Vector Dot Product. Cycle = 11092.000000 CPE = 11.092000 For 1000-Dimensional Vector Dot Product. Cycle = 11100.000000 CPE = 11.100000 For 1000-Dimensional Vector Dot Product. Cycle = 11104.000000 CPE = 11.104000 For 1000-Dimensional Vector Dot Product. Cycle = 11800.000000 CPE = 11.800000 For 1000-Dimensional Vector Dot Product. Cycle = 11092.000000 CPE = 11.092000 For 1000-Dimensional Vector Dot Product. Cycle = 11810.000000 CPE = 11.810000 For 1000-Dimensional Vector Dot Product. Cycle = 11118.000000 CPE = 11.118000 For 1000-Dimensional Vector Dot Product. Cycle = 11104.000000 CPE = 11.104000 ``` ==Macbook Air== * gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ``` allen@Mac:~/embeededHW/team15$ lscpu Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數:2 每通訊端核心數:2 Socket(s): 1 NUMA 節點: 1 供應商識別號: GenuineIntel CPU 家族: 6 型號: 61 Model name: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz 製程: 4 CPU MHz: 1634.462 CPU max MHz: 2700.0000 CPU min MHz: 500.0000 BogoMIPS: 3200.02 虛擬: VT-x L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 3072K NUMA node0 CPU(s): 0-3 ``` ``` For 984-Dimensional Vector Dot Product. Cycle = 5710.000000 CPE = 5.802845 For 984-Dimensional Vector Dot Product. 
Cycle = 5708.000000 CPE = 5.800813 For 984-Dimensional Vector Dot Product. Cycle = 5710.000000 CPE = 5.802845 For 984-Dimensional Vector Dot Product. Cycle = 5724.000000 CPE = 5.817073 For 984-Dimensional Vector Dot Product. Cycle = 5710.000000 CPE = 5.802845 For 984-Dimensional Vector Dot Product. Cycle = 5708.000000 CPE = 5.800813 For 984-Dimensional Vector Dot Product. Cycle = 5710.000000 CPE = 5.802845 For 984-Dimensional Vector Dot Product. Cycle = 5710.000000 CPE = 5.802845 For 984-Dimensional Vector Dot Product. Cycle = 5726.000000 CPE = 5.819106 For 984-Dimensional Vector Dot Product. Cycle = 5708.000000 CPE = 5.800813 For 985-Dimensional Vector Dot Product. Cycle = 5714.000000 CPE = 5.801015 For 985-Dimensional Vector Dot Product. Cycle = 5728.000000 CPE = 5.815228 For 985-Dimensional Vector Dot Product. Cycle = 5758.000000 CPE = 5.845685 For 985-Dimensional Vector Dot Product. Cycle = 5716.000000 CPE = 5.803046 For 985-Dimensional Vector Dot Product. Cycle = 5714.000000 CPE = 5.801015 For 985-Dimensional Vector Dot Product. Cycle = 5714.000000 CPE = 5.801015 For 985-Dimensional Vector Dot Product. Cycle = 5714.000000 CPE = 5.801015 For 985-Dimensional Vector Dot Product. Cycle = 5728.000000 CPE = 5.815228 For 985-Dimensional Vector Dot Product. Cycle = 5712.000000 CPE = 5.798985 For 985-Dimensional Vector Dot Product. Cycle = 5714.000000 CPE = 5.801015 For 986-Dimensional Vector Dot Product. 
Cycle = 5734.000000
CPE = 5.815416
```

* Gnuplot

![](https://i.imgur.com/ePU4d54.png)

>The two machines clearly differ in CPE by almost a factor of two, and it is actually the Intel® Core™ i5-5250U that achieves the better (lower) CPE.
>[name=jn]
>[color=blue]

---

## test inner4 function

* Test Environment
    * gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
    * lscpu

```
Architecture:          x86_64
CPU 作業模式:          32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
每核心執行緒數:        2
每通訊端核心數:        6
Socket(s):             1
NUMA 節點:             1
供應商識別號:          GenuineIntel
CPU 家族:              6
型號:                  158
Model name:            Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
製程:                  10
CPU MHz:               799.936
CPU max MHz:           4700.0000
CPU min MHz:           800.0000
BogoMIPS:              7392.00
虛擬:                  VT-x
L1d 快取:              32K
L1i 快取:              32K
L2 快取:               256K
L3 快取:               12288K
NUMA node0 CPU(s):     0-11
```

`$ gcc -S 5_13.c`

```cpp=
float inner4(double *u, double *v, long length)
{
    long i = 0;
    double sum = 200;
    for (i = 0; i < length; i++) {
        sum = sum + u[i] * v[i];
    }
    return 0;
}
```

* assembly code

```
inner4:
.LFB3:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	movq	%rdi, -24(%rbp)
	movq	%rsi, -32(%rbp)
	movq	%rdx, -40(%rbp)
	movq	$0, -16(%rbp)
	movsd	.LC1(%rip), %xmm0
	movsd	%xmm0, -8(%rbp)
	movq	$0, -16(%rbp)
	jmp	.L7
.L8:
	movq	-16(%rbp), %rax
	leaq	0(,%rax,8), %rdx
	movq	-24(%rbp), %rax
	addq	%rdx, %rax
	movsd	(%rax), %xmm1    // load u[i]
	movq	-16(%rbp), %rax
	leaq	0(,%rax,8), %rdx
	movq	-32(%rbp), %rax
	addq	%rdx, %rax
	movsd	(%rax), %xmm0    // load v[i]
	mulsd	%xmm1, %xmm0
	movsd	-8(%rbp), %xmm1
	addsd	%xmm1, %xmm0
	movsd	%xmm0, -8(%rbp)  // store to sum
	addq	$1, -16(%rbp)    // i++
.L7:
	movq	-16(%rbp), %rax
	cmpq	-40(%rbp), %rax
	jl	.L8
	pxor	%xmm0, %xmm0
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE3:
	.size	inner4, .-inner4
	.globl	inner5
	.type	inner5, @function
```

### for loop in inner4

```cpp=
for (i = 0; i < length; i++) {
    sum = sum + u[i] * v[i];
}
```

```
.L8:
	movq	-16(%rbp), %rax
	leaq	0(,%rax,8), %rdx
	movq	-24(%rbp), %rax
	addq	%rdx, %rax
	movsd	(%rax), %xmm1    // load u[i]
	movq	-16(%rbp), %rax
	leaq	0(,%rax,8), %rdx
	movq	-32(%rbp), %rax
	addq	%rdx, %rax
	movsd	(%rax), %xmm0    // load v[i]
	mulsd	%xmm1, %xmm0
	movsd	-8(%rbp), %xmm1
	addsd	%xmm1, %xmm0
	movsd	%xmm0, -8(%rbp)  // store to sum
	addq	$1, -16(%rbp)    // i++
```

```cpp=
for (i = 0; i < length; i++) {
    *dest = *dest + u[i] * v[i];
}
```

```
.L8:
	movq	-64(%rbp), %rax
	movl	(%rax), %edx
	movq	-16(%rbp), %rax
	leaq	0(,%rax,4), %rcx
	movq	-40(%rbp), %rax
	addq	%rcx, %rax
	movl	(%rax), %ecx
	movq	-16(%rbp), %rax
	leaq	0(,%rax,4), %rsi
	movq	-48(%rbp), %rax
	addq	%rsi, %rax
	movl	(%rax), %eax
	imull	%ecx, %eax
	addl	%eax, %edx
	movq	-64(%rbp), %rax
	movl	%edx, (%rax)
	addq	$1, -16(%rbp)
```

### double v.s. int

![](https://i.imgur.com/3bpPbIG.png)

### dereference v.s. storing directly in sum

![](https://i.imgur.com/nvEUko5.png)

>We can observe the CPE difference between int and double: int is lower. We can also see the difference between dereferencing `*dest` and accumulating directly in `sum`; the gap is small, but accumulating in `sum` is still visibly better. We later raised the optimization level to `-O3`, which lowers the CPE by one more unit, but only once the element count exceeds 8000.
>[name=jn]
>[color=blue]

---

* Test Environment
    * gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
    * lscpu

```
Architecture:          x86_64
CPU 作業模式:          32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數:        2
每通訊端核心數:        2
Socket(s):             1
NUMA 節點:             1
供應商識別號:          GenuineIntel
CPU 家族:              6
型號:                  61
Model name:            Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
製程:                  4
CPU MHz:               2500.197
CPU max MHz:           2700.0000
CPU min MHz:           500.0000
BogoMIPS:              3199.98
虛擬:                  VT-x
L1d 快取:              32K
L1i 快取:              32K
L2 快取:               256K
L3 快取:               3072K
NUMA node0 CPU(s):     0-3
```

### double v.s. int

![](https://i.imgur.com/aSA18Rw.png)

### dereference v.s. storing directly in sum

![](https://i.imgur.com/SoDBUzz.png)

---

### Results under different compiler optimization levels

- Test environment: `Lubuntu 18.10 (x86_64)`, `gcc (Ubuntu 7.3.0-27ubuntu1~18.04)`
- Test [code](https://github.com/flawless0714/team15/blob/master/5_13.c)
- Assembly output without optimization (partial listing; the full output is stored on GitHub; compile command: `$ gcc -S 5_13.c -o no_opt.s -Wall`):

```asm=
inner4:
.LFB6:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$80, %rsp
	movq	%rdi, -56(%rbp)
	movq	%rsi, -64(%rbp)
	movq	%rdx, -72(%rbp)
	pxor	%xmm0, %xmm0
	movsd	%xmm0, -32(%rbp)
	movq	-56(%rbp), %rax
	movq	%rax, -24(%rbp)
	movq	-64(%rbp), %rax
	movq	%rax, -16(%rbp)
	pxor	%xmm0, %xmm0
	movsd	%xmm0, -8(%rbp)
	movl	$0, %eax
	call	start_comp_counter@PLT
	movq	$0, -40(%rbp)
	jmp	.L7
.L8:
	movq	-40(%rbp), %rax
	leaq	0(,%rax,8), %rdx
	movq	-24(%rbp), %rax
	addq	%rdx, %rax
	movsd	(%rax), %xmm1
	movq	-40(%rbp), %rax
	leaq	0(,%rax,8), %rdx
	movq	-16(%rbp), %rax
	addq	%rdx, %rax
	movsd	(%rax), %xmm0
	mulsd	%xmm1, %xmm0
	movsd	-32(%rbp), %xmm1
	addsd	%xmm1, %xmm0
	movsd	%xmm0, -32(%rbp)
	addq	$1, -40(%rbp)
.L7:
	movq	-40(%rbp), %rax
	cmpq	-72(%rbp), %rax
	jl	.L8
	movl	$0, %eax
	call	get_comp_counter@PLT
	movq	%xmm0, %rax
	movq	%rax, -8(%rbp)
	cvtsi2sdq	-72(%rbp), %xmm0
	movsd	-8(%rbp), %xmm1
	divsd	%xmm0, %xmm1
	movapd	%xmm1, %xmm0
	cvtsd2ss	%xmm0, %xmm2
	movss	%xmm2, -44(%rbp)
	movq	-8(%rbp), %rdx
	movq	-72(%rbp), %rax
	movq	%rdx, -80(%rbp)
	movsd	-80(%rbp), %xmm0
	movq	%rax, %rsi
	leaq	.LC2(%rip), %rdi
	movl	$1, %eax
	call	printf@PLT
	cvtss2sd	-44(%rbp), %xmm0
	leaq	.LC3(%rip), %rdi
	movl	$1, %eax
	call	printf@PLT
	movss	-44(%rbp), %xmm0
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
```

From this output we can see that start\_comp\_counter() and get\_comp\_counter() correctly enclose the code we want to measure.

- Assembly output with `-O2` (partial listing; the full output is stored on GitHub; compile command: `$ gcc -O2 -S 5_13.c -o opt.s -Wall`):

```asm=
inner4:
.LFB42:
	.cfi_startproc
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	xorl	%eax, %eax
	movq	%rdx, %rbx
	subq	$16, %rsp
	.cfi_def_cfa_offset 32
	call	start_comp_counter@PLT
	xorl	%eax, %eax
	call	get_comp_counter@PLT
	pxor	%xmm1, %xmm1
	leaq	.LC1(%rip), %rsi
	movapd	%xmm0, %xmm2
	movq	%rbx, %rdx
	pxor	%xmm3, %xmm3
	movl	$1, %edi
	cvtsi2sdq	%rbx, %xmm1
	movl	$1, %eax
	divsd	%xmm1, %xmm2
	cvtsd2ss	%xmm2, %xmm3
	movss	%xmm3, 12(%rsp)
	call	__printf_chk@PLT
	pxor	%xmm0, %xmm0
	leaq	.LC2(%rip), %rsi
	movl	$1, %edi
	movl	$1, %eax
	cvtss2sd	12(%rsp), %xmm0
	call	__printf_chk@PLT
	movss	12(%rsp), %xmm0
	addq	$16, %rsp
	.cfi_def_cfa_offset 16
	popq	%rbx
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
```

### double v.s. int

![](https://i.imgur.com/EJCqE8t.png)

In this output, start\_comp\_counter() and get\_comp\_counter() no longer enclose the code under observation; the loop has been replaced by a single `xorl %eax, %eax`. This should be why the measurements (the full data are stored on GitHub because they are too large) average about 9 CPE in one case and about 0.2 CPE in the other. Why the compiler performs this optimization puzzled us at first; a likely explanation is that `sum` is never used after the loop and `inner4` always returns 0, so `-O2` is free to eliminate the whole loop as dead code. In addition, other team members found that different CPUs produce different CPE when running the unoptimized test program, and that the better-performing CPU (i7-8700K) shows the higher CPE.