owned this note
owned this note
Published
Linked with GitHub
# 2018q3 Homework6
contributed by < `allenchen8210`, `dange0`, `flawless0714`>
###### tags: `sysprog2018`
## Cycles Per Element
* Cycles Per Element (CPE) 來表示程式性能,這種度量方式對重複計算相當合適,例如處理影像的像素。簡而言之,就是 CPE 可以用來描述隨著處理量的增加,CPU 需要的 cycle 數量,越低約好。運算每元素所需要的周期數。
* 這些項中的係數稱為 CPE
* `a+4.0n` CPE = 4.0
* `a+3.5n` CPE = 3.5
![](https://i.imgur.com/Qv6pJxC.jpg)
### combine1
```c=
#define INDENT 1
#define OP *
void combine1(vec_ptr, data *dest)
{
long i;
*desst = IDENT;
for(i=0; i < vec_length(v); i++){
data_t val;
get_vec_element(v, i, &val);
*dest = *dest OP val;
}
}
```
CPE test for multiply long
* combine1 : ==20.18==
可以看到第 8 行`vec_length(v)`一直在做重複取值的動作,陣列的長度不會在 `for` 迴圈裡面改變,沒有必要一直取值的動作
### combine2
修改 `for` 迴圈後
```c=
#define INDENT 1
#define OP *
void combine2(vec_ptr, data *dest)
{
long i;
long length = vec_length(v);
*desst = IDENT;
for(i=0; i < length; i++){
data_t val;
get_vec_element(v, i, &val);
*dest = *dest OP val;
}
}
```
CPE test for multiply long
* combine1 : 20.18
* combine2 : ==11.03==
可以發現到 `combine2` 仍然有改進的空間,在第 11 行取值的方法可以直接改成直接取數值,不用透過 `get_vec_element` 取值,裡面還要做邊接檢測,本來取的數值就不會超過邊界。
```c=
int get_vec_element(vec_ptr v, long index, data_t *dest)
{
if(index < 0 || index >= v->len)
return 0;
*dest = v->data[index]
return 1;
}
```
### combine3
```c=
#define INDENT 1
#define OP *
data_t *get_vec_start(vec_ptr v)
{
return v->data;
}
void combine3(vec_ptr, data *dest)
{
long i;
long length = vec_length(v);
data_t *data = get_vec_start(v);
*dest = IDENT;
for(i=0; i < length; i++){
*dest = *dest OP data[i];
}
}
```
CPE test for multiply long
* combine1 : 20.18
* combine2 : 11.03
* combine3 : ==11.03==
### combine4
而作者又進一步將`combine3` 的`*dest = *dest OP data[i];`改進成 `acc = acc OP data[i]` 利用 `acc` 來儲存累積的值,為什麼要這麼做呢?
原本的 `combine3` 在每次迭代時候都要將累積的數值 `dest` 從記憶體讀出來然後做完在寫回去。==但是每做完一次的累積值就是下一次開始的累積值==。為了避免重複讀寫,我可以直接先將累積值放在一個地方,等全部做完在放回去 `*dest` 。
就變成如下程式碼
```c=
void combine4(vec_ptr, data *dest)
{
long i;
long length = vec_length(v);
data_t *data = get_vec_start(v);
data_t acc = IDENT;
for(i=0; i < length; i++){
acc = acc OP data[i];
}
*dest = acc;
}
```
#### 在組合語言我們可以看的更清楚
* Inner loop of combine3
* dest in `%rbx`
* data+i in `%rdx`
* data+length in `%rax`
```=
.L17
vmovsd (%rbx), %xmm0 // read dest
vmulsd (%rbx), %xmm0, %xmm0
vmovsd %xmm0, (%rbx) // write dest
addq $8, %rdx
cmpq %rax, %rdx
jne .L17
```
==重複點在 read dest 與 write dest==,我們可以避免這兩個讀與寫,最後全部做完在寫就好。
* Inner loop of combine4
* acc in `%xmm0`
* data+i in `%rdx`
* data+length in `%rax`
```=
.L25
vmulsd (%rdx), %xmm0, %xmm0
addq $8, %rdx
cmpq %rax, %rdx
jne .L25
```
CPE test for multiply long
* combine1 : 20.18
* combine2 : 11.03
* combine3 : 11.03
* combine4 : ==5.01==
作業題目`5.13`
兩個向量內積
```c=
void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
long i;
long length = veclength(u)
data_t *udata = get_vec_start(u);
data_t *vdata = get_vec_start(v);
data_t sum = (data_t) 0;
for(i = 0; i < length; i++){
sum = sum + udata[i] * vdata[i];
}
*dest = sum;
}
```
組合語言
```=
.L15:
vmovsd 0(%rbp,%rcx,8), %mm1
vmulsd (%rax,%rcx,8), %xmm1, %xmm1
vaddsd %xmm1, %xmm0, %xmm0
addq $1, %rcx
cmpq %rbx, %rcx\
jne .L15
```
1. 組合語言圖形化
![](https://i.imgur.com/TfOYA91.png)
2.Critical Path
![](https://i.imgur.com/dUfsbJG.png)
3. 決定關鍵路徑的 CPE 是 add ,浮點數 CPE 是 3.0
4. 決定關鍵路徑的 CPE 是 add ,整數 CPE 是 1.0
* 發射時間 (issue time) 表示兩個連續同類型的運算之間需要的最小周期數。一個運算可能是好幾個 blocks 組成,有些 block 需要 2 cycles 有些需要 1 cycle,例如有些運算需要 1 2 1 ,有些運算需要 1 3 3 等多種不一樣的組合。如果是可以完全切成 1 1 1 的話就是完全流水線化,等於是一個 cycle 就可以在塞一個同樣的運算進去。
* 若不是完全流水線化:發射時間等於他的延遲
* 發射時間倒數為最大吞吐量
* 假設是完全流水線化,代表隨著 clock 觸發一次,運算就丟進去一個,如果一個`4GHZ` 處理器,帶表他吞吐量是 4 GIPS
* GIPS 每秒千兆條指令,就是每秒十億條指令。
* 具有多個功能單元可以進一步提高吞吐量。
![](https://i.imgur.com/9Inskdd.png)
## performancelab
於 github 上找到分析效能之 [performancelab](https://github.com/zxc479773533/CS-APP3e-labs/tree/master/performancelab)
### trace code
主要計算 CPE 之程式碼可以參考 `fcyc.c` 中的 `fcyc_v`
**[fcyc.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/master/performancelab/fcyc.c#L110)**
```cpp=110
double fcyc_v(test_funct_v f, void *params[])
{
double result;
init_sampler();
if (compensate) {
do {
double cyc;
if (clear_cache)
clear();
start_comp_counter();
f(params);
cyc = get_comp_counter();
add_sample(cyc);
} while (!has_converged() && samplecount < maxsamples);
} else {
do {
double cyc;
if (clear_cache)
clear();
start_counter();
f(params);
cyc = get_counter();
add_sample(cyc);
} while (!has_converged() && samplecount < maxsamples);
}
#ifdef DEBUG
{
int i;
printf(" %d smallest values: [", kbest);
for (i = 0; i < kbest; i++)
printf("%.0f%s", values[i], i==kbest-1 ? "]\n" : ", ");
}
#endif
result = values[0];
#if !KEEP_VALS
free(values);
values = NULL;
#endif
return result;
}
```
- 該函式的傳入值為要做效能測試之函式與待測函式的傳入值
- init_sampler()
- 初始化之後要放 sample 的陣列
- add_sample(cyc)
- 將 sample 到的 cycle 數放進陣列中
- has_converged()
- 於 129 行,start_counter() 與 get_counter() 將 f(params) 包住,去記錄進入 f(params) 之前之 cpu cycle 差異,詳細行為將在 `clock.c` 中分析
**[clock.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/master/performancelab/clock.c#L83)**
```cpp=83
#if IS_x86
/* $begin x86cyclecounter */
/* Initialize the cycle counter */
static unsigned cyc_hi = 0;
static unsigned cyc_lo = 0;
/* Set *hi and *lo to the high and low order bits of the cycle counter.
Implementation requires assembly code to use the rdtsc instruction. */
void access_counter(unsigned *hi, unsigned *lo)
{
asm("rdtsc; movl %%edx,%0; movl %%eax,%1" /* Read cycle counter */
: "=r" (*hi), "=r" (*lo) /* and move results to */
: /* No input */ /* the two outputs */
: "%edx", "%eax");
}
/* Record the current value of the cycle counter. */
void start_counter()
{
access_counter(&cyc_hi, &cyc_lo);
}
/* Return the number of cycles since the last call to start_counter. */
double get_counter()
{
unsigned ncyc_hi, ncyc_lo;
unsigned hi, lo, borrow;
double result;
/* Get cycle counter */
access_counter(&ncyc_hi, &ncyc_lo);
/* Do double precision subtraction */
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
result = (double) hi * (1 << 30) * 4 + lo;
if (result < 0) {
fprintf(stderr, "Error: counter returns neg value: %.0f\n", result);
}
return result;
}
/* $end x86cyclecounter */
#endif /* x86 */
```
- start_counter
- 透過 access_counter 去得到 x86 的 cycle counter
- get_counter
- 去計算 start_counter 到 get_counter 之間經過了幾個 cycle
- `result = (double) hi * (1 << 30) * 4 + lo` 相當於 `result = hi << 32 + lo` 形成一個 64-bits 的結果
- access_counter
- 利用asm() 呼叫 x86 的指令 `rdtsc`,將 cycle counter 取出
- `rdtsc` 會將 [Time stamp counters](https://en.wikipedia.org/wiki/Time_Stamp_Counter) 回傳至 edx 與 eax
- [asm](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html) 介紹
- The asm statement allows you to include assembly instructions directly within C code.
- 使用方法如下:
```
asm ( AssemblerTemplate
: OutputOperands
[ : InputOperands
[ : Clobbers ] ])
```
- 用 `:` 將 AssemblerTemplate, OutputOperands, InputOperands, clobber Clobbers 隔開
- AssemblerTemplate:放 asm code 的地方,並用 `%0` `%1` 等依序存取 OutputOperands 與 InputOperands
- Output/Input Operands:放 output/input paramater 的地方,每個參數前面會先宣告參數的讀寫權限
- `=`:Write-only
- `+`:Read-Write
- `&`:Output-only
- Clobbers:告訴 compiler 有哪些 register 或 memory 會被修改
- 解讀 access_counter 中的 asm()
```
asm("rdtsc; movl %%edx,%0; movl %%eax,%1" /* Read cycle counter */
: "=r" (*hi), "=r" (*lo) /* and move results to */
: /* No input */ /* the two outputs */
: "%edx", "%eax");
```
可以被解讀為
```
rdtsc
movl %edx, hi
movl %eax, lo
```
其中 *hi 與 *lo 的屬性為 write-only 的 general register
經過上述分析,可以得知 CPE 的計算方式為紀錄待測函式執行前後的 cycle counter,並計算其差異。
其中取得 cycle counter 的方式為透過現成的指令。
## 問題討論
### 作業第四小題
題目: 請解釋雖然乘法操作需要五個時鐘週期,但是為什麼兩個浮點版本的 CPE 都是3.00?
參考了幾份解答都說因為 ADD 在關鍵路徑上,所以 CPE 為3.00。可是雙精度浮點數的乘法需要五個時鐘週期,這樣不是單次迴圈就需要至少五個時鐘週期了嗎?
### 嘗試將 inner4() 移植到 performancelab 中分析
額外移植進 performancelab 之 code:
```=c
double *create_double_array(int size)
{
double *array = malloc(size * sizeof(double));
if (!array)
return NULL;
for (int i = 0; i < size; i++)
array[i] = (float) rand() / (float)(RAND_MAX / 100);
return array;
}
void inner4(double *u, double *v, long length)
{
long i;
data_t *udata = u;
data_t *vdata = v;
data_t sum = (data_t) 0;
for (i = 0; i < length; i++) {
sum = sum + udata[i] * vdata[i];
}
*v = sum;
}
```
修改後 `driver.c` 中的 `test_rotate()`(我們將 `rotate` 的測試內容改成作業的 `inner4` 函數) : (原為 **[driver.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/8c7d4ecf61dcb69734f8afef3257d19ade3e8e38/performancelab/driver.c#L351)**)
```c=351
/* Measure CPE */
{
double num_cycles, cpe;
int tmpdim = dim;
void *arglist[4];
double dimension = (double) dim;
double work = dimension*dimension;
#ifdef DEBUG
printf("DEBUG: dimension=%.1f\n",dimension);
printf("DEBUG: work=%.1f\n",work);
#endif
arglist[0] = (void *) benchmarks_rotate[bench_index].tfunct;
arglist[1] = (void *) &tmpdim;
arglist[2] = (void *) orig;
arglist[3] = (void *) result;
create(dim);
num_cycles = fcyc_v((test_funct_v)&func_wrapper, arglist);
cpe = num_cycles / 100; /* 100 is the cycle of iteration */
printf("custom test running!\n");
benchmarks_rotate[bench_index].cpes[test_num] = cpe;
}
}
```
修改後 `fycy.c` 中的 `fcyc_v()` : (原為 **[fcyc.c](https://github.com/zxc479773533/CS-APP3e-labs/blob/8c7d4ecf61dcb69734f8afef3257d19ade3e8e38/performancelab/fcyc.c#L156)**)
```c=156
double fcyc_v(test_funct_v f, void *params[])
{
double result;
init_sampler();
if (compensate) {
do {
double cyc;
if (clear_cache)
clear();
double* arr1 = create_double_array(100);
double* arr2 = create_double_array(100);
if (arr1 && arr2) {
start_comp_counter();
//f(params);
inner4(arr1, arr2, 100);
cyc = get_comp_counter();
// printf("\ncycle per times: %.0f\n", cyc);
free(arr1);
free(arr2);
}
printf("custom test running!\n");
/*
start_comp_counter();
f(params);printf("custom test running!\n");
cyc = get_comp_counter();
*/
add_sample(cyc);
} while (!has_converged() && samplecount < maxsamples);
} else {
do {
double cyc;
if (clear_cache)
clear();
double* arr1 = create_double_array(100);
double* arr2 = create_double_array(100);
if (arr1 && arr2) {
start_counter();
//f(params);
inner4(arr1, arr2, 100);
cyc = get_counter();
// printf("\ncycle per times: %.0f\n", cyc);
free(arr1);
free(arr2);
}
//printf("\ncycle per times: %.0f\n", cyc);
add_sample(cyc);
} while (!has_converged() && samplecount < maxsamples);
}
#ifdef DEBUG
{
int i;
printf(" %d smallest values: [", kbest);
for (i = 0; i < kbest; i++)
printf("%.0f%s", values[i], i==kbest-1 ? "]\n" : ", ");
}
#endif
result = values[0];
#if !KEEP_VALS
free(values);
values = NULL;
#endif
return result;
}
}
```
輸出結果: (開啟補償 ([compensate for timer overhead](https://github.com/zxc479773533/CS-APP3e-labs/blob/8c7d4ecf61dcb69734f8afef3257d19ade3e8e38/performancelab/driver.c#L716)))
```
Teamname: team_15
Member 1: N/A
Email 1: N/A
Rotate: Version = inner: custom evaluation:
Dim 64 128 256 512 1024 Mean
Your CPEs 3.8 3.8 3.8 3.8 3.8
Baseline CPEs 14.7 40.1 46.4 65.9 94.5
Speedup 3.9 10.6 12.4 17.6 25.2 11.8
Summary of Your Best Scores:
Rotate: 11.8 (inner: custom evaluation)
Smooth: 0.0 ((null))
```
其中我們較關心的部份為 `Your CPEs 3.8 3.8 3.8 3.8 3.8`,CPE 比作業解答理論值多出 0.8,應該是 throughput bound。
輸出結果: (關閉補償)
```
Teamname: team_15
Member 1: N/A
Email 1: N/A
Rotate: Version = inner: custom evaluation:
Dim 64 128 256 512 1024 Mean
Your CPEs 15.8 15.8 6.3 3.6 3.6
Baseline CPEs 14.7 40.1 46.4 65.9 94.5
Speedup 0.9 2.5 7.3 18.4 26.5 6.1
Summary of Your Best Scores:
Rotate: 6.1 (inner: custom evaluation)
Smooth: 0.0 ((null))
```
由本次輸出可以發現到頭幾次執行的 CPE 比較高,相較之下有補償的版本就沒有這問題,這邊對 CPE 是逐次下降而不是總平均升高有疑問,沒有補償的版本 CPE 不是應該平均提高嗎?還是說有哪裡少考慮到了。
---
>把 performanclab 移植進去 inner4 測試,發現各個 dimension 的 CPE 在 10 左右
>[source code](https://github.com/allenchen8210/team15)
>[name=jn]
>[color=green]
==Desktop==
* gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
```
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
每核心執行緒數:2
每通訊端核心數:6
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 158
Model name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
製程: 10
CPU MHz: 800.029
CPU max MHz: 4700.0000
CPU min MHz: 800.0000
BogoMIPS: 7392.00
虛擬: VT-x
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 12288K
NUMA node0 CPU(s): 0-11
```
```
For 999-Dimensional Vector Dot Product. Cycle = 11088.000000
CPE = 11.099099
For 999-Dimensional Vector Dot Product. Cycle = 11080.000000
CPE = 11.091091
For 999-Dimensional Vector Dot Product. Cycle = 11752.000000
CPE = 11.763763
For 999-Dimensional Vector Dot Product. Cycle = 11068.000000
CPE = 11.079079
For 999-Dimensional Vector Dot Product. Cycle = 11784.000000
CPE = 11.795795
For 999-Dimensional Vector Dot Product. Cycle = 11742.000000
CPE = 11.753754
For 999-Dimensional Vector Dot Product. Cycle = 11762.000000
CPE = 11.773774
For 999-Dimensional Vector Dot Product. Cycle = 11106.000000
CPE = 11.117117
For 999-Dimensional Vector Dot Product. Cycle = 11794.000000
CPE = 11.805806
For 1000-Dimensional Vector Dot Product. Cycle = 11102.000000
CPE = 11.102000
For 1000-Dimensional Vector Dot Product. Cycle = 11096.000000
CPE = 11.096000
For 1000-Dimensional Vector Dot Product. Cycle = 11092.000000
CPE = 11.092000
For 1000-Dimensional Vector Dot Product. Cycle = 11100.000000
CPE = 11.100000
For 1000-Dimensional Vector Dot Product. Cycle = 11104.000000
CPE = 11.104000
For 1000-Dimensional Vector Dot Product. Cycle = 11800.000000
CPE = 11.800000
For 1000-Dimensional Vector Dot Product. Cycle = 11092.000000
CPE = 11.092000
For 1000-Dimensional Vector Dot Product. Cycle = 11810.000000
CPE = 11.810000
For 1000-Dimensional Vector Dot Product. Cycle = 11118.000000
CPE = 11.118000
For 1000-Dimensional Vector Dot Product. Cycle = 11104.000000
CPE = 11.104000
```
==Macbook Air==
* gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
```
allen@Mac:~/embeededHW/team15$ lscpu
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
每核心執行緒數:2
每通訊端核心數:2
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 61
Model name: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
製程: 4
CPU MHz: 1634.462
CPU max MHz: 2700.0000
CPU min MHz: 500.0000
BogoMIPS: 3200.02
虛擬: VT-x
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 3072K
NUMA node0 CPU(s): 0-3
```
```
For 984-Dimensional Vector Dot Product. Cycle = 5710.000000
CPE = 5.802845
For 984-Dimensional Vector Dot Product. Cycle = 5708.000000
CPE = 5.800813
For 984-Dimensional Vector Dot Product. Cycle = 5710.000000
CPE = 5.802845
For 984-Dimensional Vector Dot Product. Cycle = 5724.000000
CPE = 5.817073
For 984-Dimensional Vector Dot Product. Cycle = 5710.000000
CPE = 5.802845
For 984-Dimensional Vector Dot Product. Cycle = 5708.000000
CPE = 5.800813
For 984-Dimensional Vector Dot Product. Cycle = 5710.000000
CPE = 5.802845
For 984-Dimensional Vector Dot Product. Cycle = 5710.000000
CPE = 5.802845
For 984-Dimensional Vector Dot Product. Cycle = 5726.000000
CPE = 5.819106
For 984-Dimensional Vector Dot Product. Cycle = 5708.000000
CPE = 5.800813
For 985-Dimensional Vector Dot Product. Cycle = 5714.000000
CPE = 5.801015
For 985-Dimensional Vector Dot Product. Cycle = 5728.000000
CPE = 5.815228
For 985-Dimensional Vector Dot Product. Cycle = 5758.000000
CPE = 5.845685
For 985-Dimensional Vector Dot Product. Cycle = 5716.000000
CPE = 5.803046
For 985-Dimensional Vector Dot Product. Cycle = 5714.000000
CPE = 5.801015
For 985-Dimensional Vector Dot Product. Cycle = 5714.000000
CPE = 5.801015
For 985-Dimensional Vector Dot Product. Cycle = 5714.000000
CPE = 5.801015
For 985-Dimensional Vector Dot Product. Cycle = 5728.000000
CPE = 5.815228
For 985-Dimensional Vector Dot Product. Cycle = 5712.000000
CPE = 5.798985
For 985-Dimensional Vector Dot Product. Cycle = 5714.000000
CPE = 5.801015
For 986-Dimensional Vector Dot Product. Cycle = 5734.000000
CPE = 5.815416
```
* Gnuplot
![](https://i.imgur.com/ePU4d54.png)
>可以看到兩部電腦明顯 CPE 差了快一半,反而是 Intel® Core™ i5-5250U CPE 較佳
>[name=jn]
>[color=blue]
---
## test inner4 function
* Test Environment
* gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
* lscpu
```
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
每核心執行緒數:2
每通訊端核心數:6
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 158
Model name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
製程: 10
CPU MHz: 799.936
CPU max MHz: 4700.0000
CPU min MHz: 800.0000
BogoMIPS: 7392.00
虛擬: VT-x
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 12288K
NUMA node0 CPU(s): 0-11
```
`$ gcc -S 5_13.c`
```cpp=
float inner4(double *u, double *v, long length)
{
long i = 0;
double sum = 200;
for (i = 0; i < length; i++) {
sum = sum + u[i] * v[i];
}
return 0;
}
```
* assembly code
```
inner4:
.LFB3:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq %rdi, -24(%rbp)
movq %rsi, -32(%rbp)
movq %rdx, -40(%rbp)
movq $0, -16(%rbp)
movsd .LC1(%rip), %xmm0
movsd %xmm0, -8(%rbp)
movq $0, -16(%rbp)
jmp .L7
.L8:
movq -16(%rbp), %rax
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm1 // 取到 u[i]
movq -16(%rbp), %rax
leaq 0(,%rax,8), %rdx
movq -32(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm0 // 取到 v[i]
mulsd %xmm1, %xmm0
movsd -8(%rbp), %xmm1
addsd %xmm1, %xmm0
movsd %xmm0, -8(%rbp) // 儲存到 sum
addq $1, -16(%rbp) // i++
.L7:
movq -16(%rbp), %rax
cmpq -40(%rbp), %rax
jl .L8
pxor %xmm0, %xmm0
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE3:
.size inner4, .-inner4
.globl inner5
.type inner5, @function
```
### for loop in inner4
```cpp=
for (i = 0; i < length; i++) {
sum = sum + u[i] * v[i];
}
```
```
.L8:
movq -16(%rbp), %rax
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm1 // 取到 u[i]
movq -16(%rbp), %rax
leaq 0(,%rax,8), %rdx
movq -32(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm0 // 取到 v[i]
mulsd %xmm1, %xmm0
movsd -8(%rbp), %xmm1
addsd %xmm1, %xmm0
movsd %xmm0, -8(%rbp) // 儲存到 sum
addq $1, -16(%rbp) // i++
```
```cpp=
for (i = 0; i < length; i++) {
*dest = *dest + u[i] * v[i];
}
```
```
.L8:
movq -64(%rbp), %rax
movl (%rax), %edx
movq -16(%rbp), %rax
leaq 0(,%rax,4), %rcx
movq -40(%rbp), %rax
addq %rcx, %rax
movl (%rax), %ecx
movq -16(%rbp), %rax
leaq 0(,%rax,4), %rsi
movq -48(%rbp), %rax
addq %rsi, %rax
movl (%rax), %eax
imull %ecx, %eax
addl %eax, %edx
movq -64(%rbp), %rax
movl %edx, (%rax)
addq $1, -16(%rbp)
```
### double v.s. int
![](https://i.imgur.com/3bpPbIG.png)
### dereference v.s. 直接儲存在 sum
![](https://i.imgur.com/nvEUko5.png)
>可以觀察到 int 與 double 在 CPE 的差別, int 比較低,也可以看到dereference 與 直接儲存在 sum 的差別,雖然差別不太明顯,但還是可以看得出直接儲存在 sum 較佳。後來有把最佳化開到 `-o3` CPE 還可以在往下降一單位,但要在 elements 超過 8000 後。
>[name=jn]
>[color=blue]
---
* Test Environment
* gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
* lscpu
```
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
每核心執行緒數:2
每通訊端核心數:2
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 61
Model name: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
製程: 4
CPU MHz: 2500.197
CPU max MHz: 2700.0000
CPU min MHz: 500.0000
BogoMIPS: 3199.98
虛擬: VT-x
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 3072K
NUMA node0 CPU(s): 0-3
```
### double v.s. int
![](https://i.imgur.com/aSA18Rw.png)
### dereference v.s. 直接儲存在 sum
![](https://i.imgur.com/SoDBUzz.png)
---
### 不同編譯器最佳化等級之結果
- 測試環境為 `Lubuntu 18.10(x86_64)`, `gcc (Ubuntu 7.3.0-27ubuntu1~18.04)`
- 測試 [程式碼](https://github.com/flawless0714/team15/blob/master/5_13.c)
- 未最佳化輸出之組合語言: (僅擷取部分程式碼, 完整輸出存放於 github 中, 編譯 command `$ gcc -S 5_13.c -o no_opt.s -Wall`)
```asm=
inner4:
.LFB6:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $80, %rsp
movq %rdi, -56(%rbp)
movq %rsi, -64(%rbp)
movq %rdx, -72(%rbp)
pxor %xmm0, %xmm0
movsd %xmm0, -32(%rbp)
movq -56(%rbp), %rax
movq %rax, -24(%rbp)
movq -64(%rbp), %rax
movq %rax, -16(%rbp)
pxor %xmm0, %xmm0
movsd %xmm0, -8(%rbp)
movl $0, %eax
call start_comp_counter@PLT
movq $0, -40(%rbp)
jmp .L7
.L8:
movq -40(%rbp), %rax
leaq 0(,%rax,8), %rdx
movq -24(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm1
movq -40(%rbp), %rax
leaq 0(,%rax,8), %rdx
movq -16(%rbp), %rax
addq %rdx, %rax
movsd (%rax), %xmm0
mulsd %xmm1, %xmm0
movsd -32(%rbp), %xmm1
addsd %xmm1, %xmm0
movsd %xmm0, -32(%rbp)
addq $1, -40(%rbp)
.L7:
movq -40(%rbp), %rax
cmpq -72(%rbp), %rax
jl .L8
movl $0, %eax
call get_comp_counter@PLT
movq %xmm0, %rax
movq %rax, -8(%rbp)
cvtsi2sdq -72(%rbp), %xmm0
movsd -8(%rbp), %xmm1
divsd %xmm0, %xmm1
movapd %xmm1, %xmm0
cvtsd2ss %xmm0, %xmm2
movss %xmm2, -44(%rbp)
movq -8(%rbp), %rdx
movq -72(%rbp), %rax
movq %rdx, -80(%rbp)
movsd -80(%rbp), %xmm0
movq %rax, %rsi
leaq .LC2(%rip), %rdi
movl $1, %eax
call printf@PLT
cvtss2sd -44(%rbp), %xmm0
leaq .LC3(%rip), %rdi
movl $1, %eax
call printf@PLT
movss -44(%rbp), %xmm0
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
```
由此輸出可觀察到 start\_comp\_counter() 與 get\_comp\_counter() 有正確涵蓋到欲測試之程式碼。
- `-O2` 最佳化輸出之組合語言: (僅擷取部分程式碼, 完整輸出存放於 github 中, 編譯 command `$ gcc -O2 -S 5_13.c -o opt.s -Wall`)
```asm=
inner4:
.LFB42:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
xorl %eax, %eax
movq %rdx, %rbx
subq $16, %rsp
.cfi_def_cfa_offset 32
call start_comp_counter@PLT
xorl %eax, %eax
call get_comp_counter@PLT
pxor %xmm1, %xmm1
leaq .LC1(%rip), %rsi
movapd %xmm0, %xmm2
movq %rbx, %rdx
pxor %xmm3, %xmm3
movl $1, %edi
cvtsi2sdq %rbx, %xmm1
movl $1, %eax
divsd %xmm1, %xmm2
cvtsd2ss %xmm2, %xmm3
movss %xmm3, 12(%rsp)
call __printf_chk@PLT
pxor %xmm0, %xmm0
leaq .LC2(%rip), %rsi
movl $1, %edi
movl $1, %eax
cvtss2sd 12(%rsp), %xmm0
call __printf_chk@PLT
movss 12(%rsp), %xmm0
addq $16, %rsp
.cfi_def_cfa_offset 16
popq %rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
```
### double v.s. int
![](https://i.imgur.com/EJCqE8t.png)
由此輸出可觀察到 start\_comp\_counter() 與 get\_comp\_counter() 並沒有包覆著欲觀測之程式碼, 取而代之的是一行 `xorl %eax, %eax`, 這應該就是為什麼測量數據(詳細數據因過於龐大故儲存於 github)會有平均9個 CPE 跟 平均0.2個 CPE 的原因, 至於為什麼編譯器會做出這樣的優化仍是個謎。另外其他組員有發現到不同 CPU 在執行未優化的測試程式會有不同的 CPE ,並且效能越好的 CPU(i7-8700K) 會有較高的 CPE 。
## 參考資料
- [gcc -lm option](https://blog.csdn.net/Aguangg_6655_la/article/details/62042810)