Dange
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Versions and GitHub Sync Note Insights Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       owned this note    owned this note      
    Published Linked with GitHub
    2
    Subscribed
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    Subscribe
    # 2018q3 Homework6 contributed by < `allenchen8210`, `dange0`, `flawless0714`> ###### tags: `sysprog2018` ## Cycles Per Element * Cycles Per Element (CPE) 來表示程式性能,這種度量方式對重複計算相當合適,例如處理影像的像素。簡而言之,就是 CPE 可以用來描述隨著處理量的增加,CPU 需要的 cycle 數量,越低約好。運算每元素所需要的周期數。 * 這些項中的係數稱為 CPE * `a+4.0n` CPE = 4.0 * `a+3.5n` CPE = 3.5 ![](https://i.imgur.com/Qv6pJxC.jpg) ### combine1 ```c= #define INDENT 1 #define OP * void combine1(vec_ptr, data *dest) { long i; *desst = IDENT; for(i=0; i < vec_length(v); i++){ data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; } } ``` CPE test for multiply long * combine1 : ==20.18== 可以看到第 8 行`vec_length(v)`一直在做重複取值的動作,陣列的長度不會在 `for` 迴圈裡面改變,沒有必要一直取值的動作 ### combine2 修改 `for` 迴圈後 ```c= #define INDENT 1 #define OP * void combine2(vec_ptr, data *dest) { long i; long length = vec_length(v); *desst = IDENT; for(i=0; i < length; i++){ data_t val; get_vec_element(v, i, &val); *dest = *dest OP val; } } ``` CPE test for multiply long * combine1 : 20.18 * combine2 : ==11.03== 可以發現到 `combine2` 仍然有改進的空間,在第 11 行取值的方法可以直接改成直接取數值,不用透過 `get_vec_element` 取值,裡面還要做邊接檢測,本來取的數值就不會超過邊界。 ```c= int get_vec_element(vec_ptr v, long index, data_t *dest) { if(index < 0 || index >= v->len) return 0; *dest = v->data[index] return 1; } ``` ### combine3 ```c= #define INDENT 1 #define OP * data_t *get_vec_start(vec_ptr v) { return v->data; } void combine3(vec_ptr, data *dest) { long i; long length = vec_length(v); data_t *data = get_vec_start(v); *dest = IDENT; for(i=0; i < length; i++){ *dest = *dest OP data[i]; } } ``` CPE test for multiply long * combine1 : 20.18 * combine2 : 11.03 * combine3 : ==11.03== ### combine4 而作者又進一步將`combine3` 的`*dest = *dest OP data[i];`改進成 `acc = acc OP data[i]` 利用 `acc` 來儲存累積的值,為什麼要這麼做呢? 原本的 `combine3` 在每次迭代時候都要將累積的數值 `dest` 從記憶體讀出來然後做完在寫回去。==但是每做完一次的累積值就是下一次開始的累積值==。為了避免重複讀寫,我可以直接先將累積值放在一個地方,等全部做完在放回去 `*dest` 。 就變成如下程式碼 ```c= void combine4(vec_ptr, data *dest) { long i; long length = vec_length(v); data_t *data = get_vec_start(v); data_t acc = IDENT; for(i=0; i < length; i++){ acc = acc OP data[i]; } *dest = acc; } ``` #### 在組合語言我們可以看的更清楚 * Inner loop of combine3 * dest in `%rbx` * data+i in `%rdx` * data+length in `%rax` ```= .L17 vmovsd (%rbx), %xmm0 // read dest vmulsd (%rbx), %xmm0, %xmm0 vmovsd %xmm0, (%rbx) // write dest addq $8, %rdx cmpq %rax, %rdx jne .L17 ``` ==重複點在 read dest 與 write dest==,我們可以避免這兩個讀與寫,最後全部做完在寫就好。 * Inner loop of combine4 * acc in `%xmm0` * data+i in `%rdx` * data+length in `%rax` ```= .L25 vmulsd (%rdx), %xmm0, %xmm0 addq $8, %rdx cmpq %rax, %rdx jne .L25 ``` CPE test for multiply long * combine1 : 20.18 * combine2 : 11.03 * combine3 : 11.03 * combine4 : ==5.01== 作業題目`5.13` 兩個向量內積 ```c= void inner4(vec_ptr u, vec_ptr v, data_t *dest) { long i; long length = veclength(u) data_t *udata = get_vec_start(u); data_t *vdata = get_vec_start(v); data_t sum = (data_t) 0; for(i = 0; i < length; i++){ sum = sum + udata[i] * vdata[i]; } *dest = sum; } ``` 組合語言 ```= .L15: vmovsd 0(%rbp,%rcx,8), %mm1 vmulsd (%rax,%rcx,8), %xmm1, %xmm1 vaddsd %xmm1, %xmm0, %xmm0 addq $1, %rcx cmpq %rbx, %rcx\ jne .L15 ``` 1. 組合語言圖形化 ![](https://i.imgur.com/TfOYA91.png) 2.Critical Path ![](https://i.imgur.com/dUfsbJG.png) 3. 決定關鍵路徑的 CPE 是 add ,浮點數 CPE 是 3.0 4. 決定關鍵路徑的 CPE 是 add ,整數 CPE 是 1.0 * 發射時間 (issue time) 表示兩個連續同類型的運算之間需要的最小周期數。一個運算可能是好幾個 blocks 組成,有些 block 需要 2 cycles 有些需要 1 cycle,例如有些運算需要 1 2 1 ,有些運算需要 1 3 3 等多種不一樣的組合。如果是可以完全切成 1 1 1 的話就是完全流水線化,等於是一個 cycle 就可以在塞一個同樣的運算進去。 * 若不是完全流水線化:發射時間等於他的延遲 * 發射時間倒數為最大吞吐量 * 假設是完全流水線化,代表隨著 clock 觸發一次,運算就丟進去一個,如果一個`4GHZ` 處理器,帶表他吞吐量是 4 GIPS  * GIPS 每秒千兆條指令,就是每秒十億條指令。 * 具有多個功能單元可以進一步提高吞吐量。 ![](https://i.imgur.com/9Inskdd.png) ## performancelab 於 github 上找到分析效能之 [performancelab](https://github.com/zxc479773533/CS-APP3e-labs/tree/master/performancelab) ### trace code 主要計算 CPE 之程式碼可以參考 `fcyc.c` 中的 `fcyc_v` **[fcyc.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/master/performancelab/fcyc.c#L110)** ```cpp=110 double fcyc_v(test_funct_v f, void *params[]) { double result; init_sampler(); if (compensate) { do { double cyc; if (clear_cache) clear(); start_comp_counter(); f(params); cyc = get_comp_counter(); add_sample(cyc); } while (!has_converged() && samplecount < maxsamples); } else { do { double cyc; if (clear_cache) clear(); start_counter(); f(params); cyc = get_counter(); add_sample(cyc); } while (!has_converged() && samplecount < maxsamples); } #ifdef DEBUG { int i; printf(" %d smallest values: [", kbest); for (i = 0; i < kbest; i++) printf("%.0f%s", values[i], i==kbest-1 ? "]\n" : ", "); } #endif result = values[0]; #if !KEEP_VALS free(values); values = NULL; #endif return result; } ``` - 該函式的傳入值為要做效能測試之函式與待測函式的傳入值 - init_sampler() - 初始化之後要放 sample 的陣列 - add_sample(cyc) - 將 sample 到的 cycle 數放進陣列中 - has_converged() - 於 129 行,start_counter() 與 get_counter() 將 f(params) 包住,去記錄進入 f(params) 之前之 cpu cycle 差異,詳細行為將在 `clock.c` 中分析 **[clock.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/master/performancelab/clock.c#L83)** ```cpp=83 #if IS_x86 /* $begin x86cyclecounter */ /* Initialize the cycle counter */ static unsigned cyc_hi = 0; static unsigned cyc_lo = 0; /* Set *hi and *lo to the high and low order bits of the cycle counter. Implementation requires assembly code to use the rdtsc instruction. */ void access_counter(unsigned *hi, unsigned *lo) { asm("rdtsc; movl %%edx,%0; movl %%eax,%1" /* Read cycle counter */ : "=r" (*hi), "=r" (*lo) /* and move results to */ : /* No input */ /* the two outputs */ : "%edx", "%eax"); } /* Record the current value of the cycle counter. */ void start_counter() { access_counter(&cyc_hi, &cyc_lo); } /* Return the number of cycles since the last call to start_counter. */ double get_counter() { unsigned ncyc_hi, ncyc_lo; unsigned hi, lo, borrow; double result; /* Get cycle counter */ access_counter(&ncyc_hi, &ncyc_lo); /* Do double precision subtraction */ lo = ncyc_lo - cyc_lo; borrow = lo > ncyc_lo; hi = ncyc_hi - cyc_hi - borrow; result = (double) hi * (1 << 30) * 4 + lo; if (result < 0) { fprintf(stderr, "Error: counter returns neg value: %.0f\n", result); } return result; } /* $end x86cyclecounter */ #endif /* x86 */ ``` - start_counter - 透過 access_counter 去得到 x86 的 cycle counter - get_counter - 去計算 start_counter 到 get_counter 之間經過了幾個 cycle - `result = (double) hi * (1 << 30) * 4 + lo` 相當於 `result = hi << 32 + lo` 形成一個 64-bits 的結果 - access_counter - 利用asm() 呼叫 x86 的指令 `rdtsc`,將 cycle counter 取出 - `rdtsc` 會將 [Time stamp counters](https://en.wikipedia.org/wiki/Time_Stamp_Counter) 回傳至 edx 與 eax - [asm](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html) 介紹 - The asm statement allows you to include assembly instructions directly within C code. - 使用方法如下: ``` asm ( AssemblerTemplate : OutputOperands [ : InputOperands [ : Clobbers ] ]) ``` - 用 `:` 將 AssemblerTemplate, OutputOperands, InputOperands, clobber Clobbers 隔開 - AssemblerTemplate:放 asm code 的地方,並用 `%0` `%1` 等依序存取 OutputOperands 與 InputOperands - Output/Input Operands:放 output/input paramater 的地方,每個參數前面會先宣告參數的讀寫權限 - `=`:Write-only - `+`:Read-Write - `&`:Output-only - Clobbers:告訴 compiler 有哪些 register 或 memory 會被修改 - 解讀 access_counter 中的 asm() ``` asm("rdtsc; movl %%edx,%0; movl %%eax,%1" /* Read cycle counter */ : "=r" (*hi), "=r" (*lo) /* and move results to */ : /* No input */ /* the two outputs */ : "%edx", "%eax"); ``` 可以被解讀為 ``` rdtsc movl %edx, hi movl %eax, lo ``` 其中 *hi 與 *lo 的屬性為 write-only 的 general register 經過上述分析,可以得知 CPE 的計算方式為紀錄待測函式執行前後的 cycle counter,並計算其差異。 其中取得 cycle counter 的方式為透過現成的指令。 ## 問題討論 ### 作業第四小題 題目: 請解釋雖然乘法操作需要五個時鐘週期,但是為什麼兩個浮點版本的 CPE 都是3.00? 參考了幾份解答都說因為 ADD 在關鍵路徑上,所以 CPE 為3.00。可是雙精度浮點數的乘法需要五個時鐘週期,這樣不是單次迴圈就需要至少五個時鐘週期了嗎? ### 嘗試將 inner4() 移植到 performancelab 中分析 額外移植進 performancelab 之 code: ```=c double *create_double_array(int size) { double *array = malloc(size * sizeof(double)); if (!array) return NULL; for (int i = 0; i < size; i++) array[i] = (float) rand() / (float)(RAND_MAX / 100); return array; } void inner4(double *u, double *v, long length) { long i; data_t *udata = u; data_t *vdata = v; data_t sum = (data_t) 0; for (i = 0; i < length; i++) { sum = sum + udata[i] * vdata[i]; } *v = sum; } ``` 修改後 `driver.c` 中的 `test_rotate()`(我們將 `rotate` 的測試內容改成作業的 `inner4` 函數) : (原為 **[driver.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/8c7d4ecf61dcb69734f8afef3257d19ade3e8e38/performancelab/driver.c#L351)**) ```c=351 /* Measure CPE */ { double num_cycles, cpe; int tmpdim = dim; void *arglist[4]; double dimension = (double) dim; double work = dimension*dimension; #ifdef DEBUG printf("DEBUG: dimension=%.1f\n",dimension); printf("DEBUG: work=%.1f\n",work); #endif arglist[0] = (void *) benchmarks_rotate[bench_index].tfunct; arglist[1] = (void *) &tmpdim; arglist[2] = (void *) orig; arglist[3] = (void *) result; create(dim); num_cycles = fcyc_v((test_funct_v)&func_wrapper, arglist); cpe = num_cycles / 100; /* 100 is the cycle of iteration */ printf("custom test running!\n"); benchmarks_rotate[bench_index].cpes[test_num] = cpe; } } ``` 修改後 `fycy.c` 中的 `fcyc_v()` : (原為 **[fcyc.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/8c7d4ecf61dcb69734f8afef3257d19ade3e8e38/performancelab/fcyc.c#L156)**) ```c=156 double fcyc_v(test_funct_v f, void *params[]) { double result; init_sampler(); if (compensate) { do { double cyc; if (clear_cache) clear(); double* arr1 = create_double_array(100); double* arr2 = create_double_array(100); if (arr1 && arr2) { start_comp_counter(); //f(params); inner4(arr1, arr2, 100); cyc = get_comp_counter(); // printf("\ncycle per times: %.0f\n", cyc); free(arr1); free(arr2); } printf("custom test running!\n"); /* start_comp_counter(); f(params);printf("custom test running!\n"); cyc = get_comp_counter(); */ add_sample(cyc); } while (!has_converged() && samplecount < maxsamples); } else { do { double cyc; if (clear_cache) clear(); double* arr1 = create_double_array(100); double* arr2 = create_double_array(100); if (arr1 && arr2) { start_counter(); //f(params); inner4(arr1, arr2, 100); cyc = get_counter(); // printf("\ncycle per times: %.0f\n", cyc); free(arr1); free(arr2); } //printf("\ncycle per times: %.0f\n", cyc); add_sample(cyc); } while (!has_converged() && samplecount < maxsamples); } #ifdef DEBUG { int i; printf(" %d smallest values: [", kbest); for (i = 0; i < kbest; i++) printf("%.0f%s", values[i], i==kbest-1 ? "]\n" : ", "); } #endif result = values[0]; #if !KEEP_VALS free(values); values = NULL; #endif return result; } } ``` 輸出結果: (開啟補償 ([compensate for timer overhead](https://github.com/zxc479773533/CS-APP3e-labs/blob/8c7d4ecf61dcb69734f8afef3257d19ade3e8e38/performancelab/driver.c#L716))) ``` Teamname: team_15 Member 1: N/A Email 1: N/A Rotate: Version = inner: custom evaluation: Dim 64 128 256 512 1024 Mean Your CPEs 3.8 3.8 3.8 3.8 3.8 Baseline CPEs 14.7 40.1 46.4 65.9 94.5 Speedup 3.9 10.6 12.4 17.6 25.2 11.8 Summary of Your Best Scores: Rotate: 11.8 (inner: custom evaluation) Smooth: 0.0 ((null)) ``` 其中我們較關心的部份為 `Your CPEs 3.8 3.8 3.8 3.8 3.8`,CPE 比作業解答理論值多出 0.8,應該是 throughput bound。 輸出結果: (關閉補償) ``` Teamname: team_15 Member 1: N/A Email 1: N/A Rotate: Version = inner: custom evaluation: Dim 64 128 256 512 1024 Mean Your CPEs 15.8 15.8 6.3 3.6 3.6 Baseline CPEs 14.7 40.1 46.4 65.9 94.5 Speedup 0.9 2.5 7.3 18.4 26.5 6.1 Summary of Your Best Scores: Rotate: 6.1 (inner: custom evaluation) Smooth: 0.0 ((null)) ``` 由本次輸出可以發現到頭幾次執行的 CPE 比較高,相較之下有補償的版本就沒有這問題,這邊對 CPE 是逐次下降而不是總平均升高有疑問,沒有補償的版本 CPE 不是應該平均提高嗎?還是說有哪裡少考慮到了。 --- >把 performanclab 移植進去 inner4 測試,發現各個 dimension 的 CPE 在 10 左右 >[source code](https://github.com/allenchen8210/team15) >[name=jn] >[color=green] ==Desktop== * gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ``` Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 12 On-line CPU(s) list: 0-11 每核心執行緒數:2 每通訊端核心數:6 Socket(s): 1 NUMA 節點: 1 供應商識別號: GenuineIntel CPU 家族: 6 型號: 158 Model name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 製程: 10 CPU MHz: 800.029 CPU max MHz: 4700.0000 CPU min MHz: 800.0000 BogoMIPS: 7392.00 虛擬: VT-x L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 12288K NUMA node0 CPU(s): 0-11 ``` ``` For 999-Dimensional Vector Dot Product. Cycle = 11088.000000 CPE = 11.099099 For 999-Dimensional Vector Dot Product. Cycle = 11080.000000 CPE = 11.091091 For 999-Dimensional Vector Dot Product. Cycle = 11752.000000 CPE = 11.763763 For 999-Dimensional Vector Dot Product. Cycle = 11068.000000 CPE = 11.079079 For 999-Dimensional Vector Dot Product. Cycle = 11784.000000 CPE = 11.795795 For 999-Dimensional Vector Dot Product. Cycle = 11742.000000 CPE = 11.753754 For 999-Dimensional Vector Dot Product. Cycle = 11762.000000 CPE = 11.773774 For 999-Dimensional Vector Dot Product. Cycle = 11106.000000 CPE = 11.117117 For 999-Dimensional Vector Dot Product. Cycle = 11794.000000 CPE = 11.805806 For 1000-Dimensional Vector Dot Product. Cycle = 11102.000000 CPE = 11.102000 For 1000-Dimensional Vector Dot Product. Cycle = 11096.000000 CPE = 11.096000 For 1000-Dimensional Vector Dot Product. Cycle = 11092.000000 CPE = 11.092000 For 1000-Dimensional Vector Dot Product. Cycle = 11100.000000 CPE = 11.100000 For 1000-Dimensional Vector Dot Product. Cycle = 11104.000000 CPE = 11.104000 For 1000-Dimensional Vector Dot Product. Cycle = 11800.000000 CPE = 11.800000 For 1000-Dimensional Vector Dot Product. Cycle = 11092.000000 CPE = 11.092000 For 1000-Dimensional Vector Dot Product. Cycle = 11810.000000 CPE = 11.810000 For 1000-Dimensional Vector Dot Product. Cycle = 11118.000000 CPE = 11.118000 For 1000-Dimensional Vector Dot Product. Cycle = 11104.000000 CPE = 11.104000 ``` ==Macbook Air== * gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ``` allen@Mac:~/embeededHW/team15$ lscpu Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數:2 每通訊端核心數:2 Socket(s): 1 NUMA 節點: 1 供應商識別號: GenuineIntel CPU 家族: 6 型號: 61 Model name: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz 製程: 4 CPU MHz: 1634.462 CPU max MHz: 2700.0000 CPU min MHz: 500.0000 BogoMIPS: 3200.02 虛擬: VT-x L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 3072K NUMA node0 CPU(s): 0-3 ``` ``` For 984-Dimensional Vector Dot Product. Cycle = 5710.000000 CPE = 5.802845 For 984-Dimensional Vector Dot Product. Cycle = 5708.000000 CPE = 5.800813 For 984-Dimensional Vector Dot Product. Cycle = 5710.000000 CPE = 5.802845 For 984-Dimensional Vector Dot Product. Cycle = 5724.000000 CPE = 5.817073 For 984-Dimensional Vector Dot Product. Cycle = 5710.000000 CPE = 5.802845 For 984-Dimensional Vector Dot Product. Cycle = 5708.000000 CPE = 5.800813 For 984-Dimensional Vector Dot Product. Cycle = 5710.000000 CPE = 5.802845 For 984-Dimensional Vector Dot Product. Cycle = 5710.000000 CPE = 5.802845 For 984-Dimensional Vector Dot Product. Cycle = 5726.000000 CPE = 5.819106 For 984-Dimensional Vector Dot Product. Cycle = 5708.000000 CPE = 5.800813 For 985-Dimensional Vector Dot Product. Cycle = 5714.000000 CPE = 5.801015 For 985-Dimensional Vector Dot Product. Cycle = 5728.000000 CPE = 5.815228 For 985-Dimensional Vector Dot Product. Cycle = 5758.000000 CPE = 5.845685 For 985-Dimensional Vector Dot Product. Cycle = 5716.000000 CPE = 5.803046 For 985-Dimensional Vector Dot Product. Cycle = 5714.000000 CPE = 5.801015 For 985-Dimensional Vector Dot Product. Cycle = 5714.000000 CPE = 5.801015 For 985-Dimensional Vector Dot Product. Cycle = 5714.000000 CPE = 5.801015 For 985-Dimensional Vector Dot Product. Cycle = 5728.000000 CPE = 5.815228 For 985-Dimensional Vector Dot Product. Cycle = 5712.000000 CPE = 5.798985 For 985-Dimensional Vector Dot Product. Cycle = 5714.000000 CPE = 5.801015 For 986-Dimensional Vector Dot Product. Cycle = 5734.000000 CPE = 5.815416 ``` * Gnuplot ![](https://i.imgur.com/ePU4d54.png) >可以看到兩部電腦明顯 CPE 差了快一半,反而是 Intel® Core™ i5-5250U CPE 較佳 >[name=jn] >[color=blue] --- ## test inner4 function * Test Environment * gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) * lscpu ``` Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 12 On-line CPU(s) list: 0-11 每核心執行緒數:2 每通訊端核心數:6 Socket(s): 1 NUMA 節點: 1 供應商識別號: GenuineIntel CPU 家族: 6 型號: 158 Model name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 製程: 10 CPU MHz: 799.936 CPU max MHz: 4700.0000 CPU min MHz: 800.0000 BogoMIPS: 7392.00 虛擬: VT-x L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 12288K NUMA node0 CPU(s): 0-11 ``` `$ gcc -S 5_13.c` ```cpp= float inner4(double *u, double *v, long length) { long i = 0; double sum = 200; for (i = 0; i < length; i++) { sum = sum + u[i] * v[i]; } return 0; } ``` * assembly code ``` inner4: .LFB3: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movq %rdi, -24(%rbp) movq %rsi, -32(%rbp) movq %rdx, -40(%rbp) movq $0, -16(%rbp) movsd .LC1(%rip), %xmm0 movsd %xmm0, -8(%rbp) movq $0, -16(%rbp) jmp .L7 .L8: movq -16(%rbp), %rax leaq 0(,%rax,8), %rdx movq -24(%rbp), %rax addq %rdx, %rax movsd (%rax), %xmm1 // 取到 u[i] movq -16(%rbp), %rax leaq 0(,%rax,8), %rdx movq -32(%rbp), %rax addq %rdx, %rax movsd (%rax), %xmm0 // 取到 v[i] mulsd %xmm1, %xmm0 movsd -8(%rbp), %xmm1 addsd %xmm1, %xmm0 movsd %xmm0, -8(%rbp) // 儲存到 sum addq $1, -16(%rbp) // i++ .L7: movq -16(%rbp), %rax cmpq -40(%rbp), %rax jl .L8 pxor %xmm0, %xmm0 popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc .LFE3: .size inner4, .-inner4 .globl inner5 .type inner5, @function ``` ### for loop in inner4 ```cpp= for (i = 0; i < length; i++) { sum = sum + u[i] * v[i]; } ``` ``` .L8: movq -16(%rbp), %rax leaq 0(,%rax,8), %rdx movq -24(%rbp), %rax addq %rdx, %rax movsd (%rax), %xmm1 // 取到 u[i] movq -16(%rbp), %rax leaq 0(,%rax,8), %rdx movq -32(%rbp), %rax addq %rdx, %rax movsd (%rax), %xmm0 // 取到 v[i] mulsd %xmm1, %xmm0 movsd -8(%rbp), %xmm1 addsd %xmm1, %xmm0 movsd %xmm0, -8(%rbp) // 儲存到 sum addq $1, -16(%rbp) // i++ ``` ```cpp= for (i = 0; i < length; i++) { *dest = *dest + u[i] * v[i]; } ``` ``` .L8: movq -64(%rbp), %rax movl (%rax), %edx movq -16(%rbp), %rax leaq 0(,%rax,4), %rcx movq -40(%rbp), %rax addq %rcx, %rax movl (%rax), %ecx movq -16(%rbp), %rax leaq 0(,%rax,4), %rsi movq -48(%rbp), %rax addq %rsi, %rax movl (%rax), %eax imull %ecx, %eax addl %eax, %edx movq -64(%rbp), %rax movl %edx, (%rax) addq $1, -16(%rbp) ``` ### double v.s. int ![](https://i.imgur.com/3bpPbIG.png) ### dereference v.s. 直接儲存在 sum ![](https://i.imgur.com/nvEUko5.png) >可以觀察到 int 與 double 在 CPE 的差別, int 比較低,也可以看到dereference 與 直接儲存在 sum 的差別,雖然差別不太明顯,但還是可以看得出直接儲存在 sum 較佳。後來有把最佳化開到 `-o3` CPE 還可以在往下降一單位,但要在 elements 超過 8000 後。 >[name=jn] >[color=blue] --- * Test Environment * gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) * lscpu ``` Architecture: x86_64 CPU 作業模式: 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 每核心執行緒數:2 每通訊端核心數:2 Socket(s): 1 NUMA 節點: 1 供應商識別號: GenuineIntel CPU 家族: 6 型號: 61 Model name: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz 製程: 4 CPU MHz: 2500.197 CPU max MHz: 2700.0000 CPU min MHz: 500.0000 BogoMIPS: 3199.98 虛擬: VT-x L1d 快取: 32K L1i 快取: 32K L2 快取: 256K L3 快取: 3072K NUMA node0 CPU(s): 0-3 ``` ### double v.s. int ![](https://i.imgur.com/aSA18Rw.png) ### dereference v.s. 直接儲存在 sum ![](https://i.imgur.com/SoDBUzz.png) --- ### 不同編譯器最佳化等級之結果 - 測試環境為 `Lubuntu 18.10(x86_64)`, `gcc (Ubuntu 7.3.0-27ubuntu1~18.04)` - 測試 [程式碼](https://github.com/flawless0714/team15/blob/master/5_13.c) - 未最佳化輸出之組合語言: (僅擷取部分程式碼, 完整輸出存放於 github 中, 編譯 command `$ gcc -S 5_13.c -o no_opt.s -Wall`) ```asm= inner4: .LFB6: .cfi_startproc pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 subq $80, %rsp movq %rdi, -56(%rbp) movq %rsi, -64(%rbp) movq %rdx, -72(%rbp) pxor %xmm0, %xmm0 movsd %xmm0, -32(%rbp) movq -56(%rbp), %rax movq %rax, -24(%rbp) movq -64(%rbp), %rax movq %rax, -16(%rbp) pxor %xmm0, %xmm0 movsd %xmm0, -8(%rbp) movl $0, %eax call start_comp_counter@PLT movq $0, -40(%rbp) jmp .L7 .L8: movq -40(%rbp), %rax leaq 0(,%rax,8), %rdx movq -24(%rbp), %rax addq %rdx, %rax movsd (%rax), %xmm1 movq -40(%rbp), %rax leaq 0(,%rax,8), %rdx movq -16(%rbp), %rax addq %rdx, %rax movsd (%rax), %xmm0 mulsd %xmm1, %xmm0 movsd -32(%rbp), %xmm1 addsd %xmm1, %xmm0 movsd %xmm0, -32(%rbp) addq $1, -40(%rbp) .L7: movq -40(%rbp), %rax cmpq -72(%rbp), %rax jl .L8 movl $0, %eax call get_comp_counter@PLT movq %xmm0, %rax movq %rax, -8(%rbp) cvtsi2sdq -72(%rbp), %xmm0 movsd -8(%rbp), %xmm1 divsd %xmm0, %xmm1 movapd %xmm1, %xmm0 cvtsd2ss %xmm0, %xmm2 movss %xmm2, -44(%rbp) movq -8(%rbp), %rdx movq -72(%rbp), %rax movq %rdx, -80(%rbp) movsd -80(%rbp), %xmm0 movq %rax, %rsi leaq .LC2(%rip), %rdi movl $1, %eax call printf@PLT cvtss2sd -44(%rbp), %xmm0 leaq .LC3(%rip), %rdi movl $1, %eax call printf@PLT movss -44(%rbp), %xmm0 leave .cfi_def_cfa 7, 8 ret .cfi_endproc ``` 由此輸出可觀察到 start\_comp\_counter() 與 get\_comp\_counter() 有正確涵蓋到欲測試之程式碼。 - `-O2` 最佳化輸出之組合語言: (僅擷取部分程式碼, 完整輸出存放於 github 中, 編譯 command `$ gcc -O2 -S 5_13.c -o opt.s -Wall`) ```asm= inner4: .LFB42: .cfi_startproc pushq %rbx .cfi_def_cfa_offset 16 .cfi_offset 3, -16 xorl %eax, %eax movq %rdx, %rbx subq $16, %rsp .cfi_def_cfa_offset 32 call start_comp_counter@PLT xorl %eax, %eax call get_comp_counter@PLT pxor %xmm1, %xmm1 leaq .LC1(%rip), %rsi movapd %xmm0, %xmm2 movq %rbx, %rdx pxor %xmm3, %xmm3 movl $1, %edi cvtsi2sdq %rbx, %xmm1 movl $1, %eax divsd %xmm1, %xmm2 cvtsd2ss %xmm2, %xmm3 movss %xmm3, 12(%rsp) call __printf_chk@PLT pxor %xmm0, %xmm0 leaq .LC2(%rip), %rsi movl $1, %edi movl $1, %eax cvtss2sd 12(%rsp), %xmm0 call __printf_chk@PLT movss 12(%rsp), %xmm0 addq $16, %rsp .cfi_def_cfa_offset 16 popq %rbx .cfi_def_cfa_offset 8 ret .cfi_endproc ``` ### double v.s. int ![](https://i.imgur.com/EJCqE8t.png) 由此輸出可觀察到 start\_comp\_counter() 與 get\_comp\_counter() 並沒有包覆著欲觀測之程式碼, 取而代之的是一行 `xorl %eax, %eax`, 這應該就是為什麼測量數據(詳細數據因過於龐大故儲存於 github)會有平均9個 CPE 跟 平均0.2個 CPE 的原因, 至於為什麼編譯器會做出這樣的優化仍是個謎。另外其他組員有發現到不同 CPU 在執行未優化的測試程式會有不同的 CPE ,並且效能越好的 CPU(i7-8700K) 會有較高的 CPE 。 ## 參考資料 - [gcc -lm option](https://blog.csdn.net/Aguangg_6655_la/article/details/62042810)

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully