owned this note
owned this note
Published
Linked with GitHub
2016q3 Homework2 (phonebook-concurrent)
===
contributed by <ktvexe>
- [ ] concurrency 教材的認知
- [ ] 程式正確性驗證
- [ ] 效能分析實驗 (必須要有圖表)
- [ ] 以及充份說明你如何改善效能
## Code Refactoring
[Martin Fowler](https://en.wikipedia.org/wiki/Martin_Fowler) 在書中定義 Refactor 如下:
> 在不改變軟體外部行為的前提下,改變其內部結構,使其更容易理解且易於修改
Refactoring之誤解-
* 為軟體增加功能、改善安全性或是提升效能皆不能算上重構,其中改善效能算是最佳化,不是重構。重構的主要目的就是為了提升可讀性,以及為了日後有了需求的變化時,可以更容易因應修改或是擴充。
## Lock-Free Programming
在An Introduction to Lock-Free Programming
一文中提到:
programming without mutexes只是lock-free的一小部份,由下圖可以清楚了解重點在“locking up”
![image alt](http://preshing.com/images/its-lock-free.png)
## Concurrency
Concurrency系列一文中提到:
Concurrency是所謂的並行性,並不是Parallelism,正確的處理不同thread之間交互執行衍伸而來的問題,就是concurrency探討的方向。
在冼鏡光的[並行計算投影片](http://blog.dcview.com/article.php?a=BTgBYw1qUWIAaQ%3D%3D)中就舉了很好懂且生活化的例子,以下摘錄片段:
譬如說,一邊走路一邊講電話就同時做兩件事(走路和講電話),一邊寫報告一邊上網聊天也是如此。然而這兩個同時做兩件事的例子卻有著很微妙的差異,前者我們真的是在每一瞬間都是同時進行兩件事(走路和講電話);後者卻不然,大多人都是寫一段報告、就跳過去聊天、再寫報告、再聊天、等等,幾乎不可能在每一瞬間都同時在寫報告和聊天。
兩個例子中前者(兩件事齊頭並進)的處理方式叫做平行(parallel),後者在兩件事中交錯執行、但每一瞬間每件事都有點進展;並行(concurrent,此地採用國內的譯名)指的是交錯執行和平行兩者,所以並行比平行在意義上較廣。
### user-level thread system
說到thread就會想到恐龍書中講到的LWP實作,那現在來討論以coroutine的方式實作thread。
比較一下coroutine與thread的差別:
* thread的切換後暫停是靠context switch實現的,當然process也是,不過process context switch的成本又比thread更高,因為thread共用process的大量資源(相同address space),所以需要儲存的資料較少;而coroutine是沒有context switch的,所以理論上切換成本會最低。
* thread_create是有成本的,應該說再POSIX THREAD的實作下,這個成本應該是比coroutine還要來的大,這可以設計實驗(待續)
![image alt](https://hackpad-attachments.s3.amazonaws.com/embedded2016.hackpad.com_K6DJ0ZtiecH_p.537916_1460615044111_undefined)
上面這張圖寫著thread拿來講coroutine實在讓人不舒服,但卻有表達出其意。
Knuth's coroutines:
Knuth是偉大的電腦科學家,在Coroutines in C一文中提到,他提出的coroutine是一個拿掉了stack觀念的作法,且routine間不再有caller與callee的關係,是透過jump與儲存PC的方式實現;但作者提到這只能在assembl的層級實現,再Unix like的系統上實現並不實際。
Stack-based coroutines:
正因如此,以C語言的情況,必定有一者為caller,另一者為callee,直接舉例:在此我們希望能夠收到0~9的回傳值,很明顯下方的程式無法辦到這點。
```clike=
int function(void) {
int i;
for (i = 0; i < 10; i++)
return i; /* won't work, but wouldn't it be nice */
}
```
所以在Coroutines in C中使用到goto與label這種接近assembly寫法的方式實現,這段程式碼很明顯沒有善用C語言的特性,使用了switch又用goto,而且應該會在i==10時出現問題。
```clike=
int function(void) {
static int i, state = 0;
switch (state) {
case 0: goto LABEL0;
case 1: goto LABEL1;
}
LABEL0: /* start of function */
for (i = 0; i < 10; i++) {
state = 1; /* so we will come back to LABEL1 */
return i;
LABEL1:; /* resume control straight after the return */
}
}
```
>本來想說會出現問題停掉,結果發現這段程式根本不會停,將asm倒出來看看。[name=劉亮谷]
```asm=
cmpl $1, %eax
je .L10
jmp .L2
.L9:
nop
.L2:
movl $0, i.2286(%rip)
jmp .L6
.L8:
movl $1, state.2287(%rip)
movl i.2286(%rip), %eax
jmp .L1
.L10:
nop
.L5:
movl i.2286(%rip), %eax
addl $1, %eax
movl %eax, i.2286(%rip)
.L6:
movl i.2286(%rip), %eax
cmpl $9, %eax
jle .L8
.L1:
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
```
Duff's device:
```clike=
switch (count % 8) {
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while ((count -= 8) > 0);
}
```
## SuperMalloc
## mmap
在linux下使用mmap實在是相當方便,看到linux manual,除了解決memory management 的問題外,還因為是VMA,所以使用fd讀寫時已經確保過image segamentation正確,這如果是在FreeRTOS上就會是相當大的工程。
`void *mmap(void *addr, size_t length, int prot, int flags,int fd, off_t offset);`
`int munmap(void *addr, size_t length);`
所以在append時,下方第12行不需要再做malloc的動作,因為已經使用過mmap獲取空間與紀錄尾端位址。
```clike=
append_a *new_append_a(char *ptr, char *eptr, int tid, int ntd,
entry *start)
{
append_a *app = (append_a *) malloc(sizeof(append_a));
app->ptr = ptr;
app->eptr = eptr;
app->tid = tid;
app->nthread = ntd;
app->entryStart = start;
app->pHead = (app->pLast = app->entryStart);
return app;
}
```
說到append就想到在找mmap末端時,`main`中使用`fsize()`取得檔案大小:
`off_t fs = fsize(ALIGN_FILE);`
這個fs取名取的過於簡略,影響閱讀,先換成fileOffset。
同樣的問題發生在append_a上,將其修改
```clike=
typedef struct _append_a {
char *ptr;
char *ptrEnd;
int threadId;
int threadNum;
entry *entryStart;
entry *pHead;
entry *pLast;
} append_a;
```
## Asynchronous
## Execution
**threadNUM= 4**
```
orig:
execution time of append() : 0.049369 sec
execution time of findName() : 0.006308 sec
opt:
size of entry : 24 bytes
execution time of append() : 0.004550 sec
execution time of findName() : 0.003804 sec
Performance counter stats for './phonebook_opt' (100 runs):
918,201 cache-misses # 53.284 % of all cache refs
2,158,402 cache-references
291,523,621 instructions # 1.31 insns per cycle
271,427,123 cycles
0.252327449 seconds time elapsed ( +- 2.08% )
```
![](https://i.imgur.com/heIWRVo.png)
**threadNUM=2**
```
size of entry : 136 bytes
execution time of append() : 0.049411 sec
execution time of findName() : 0.005607 sec
size of entry : 24 bytes
execution time of append() : 0.004385 sec
execution time of findName() : 0.001574 sec
Performance counter stats for './phonebook_opt' (100 runs):
481,146 cache-misses # 54.327 % of all cache refs
803,873 cache-references
272,596,218 instructions # 1.47 insns per cycle
182,661,768 cycles
0.249541210 seconds time elapsed ( +- 1.79% )
```
![](https://i.imgur.com/MJDsxK0.png)
**threadNUM=1**
```
size of entry : 136 bytes
execution time of append() : 0.050384 sec
execution time of findName() : 0.005666 sec
size of entry : 24 bytes
execution time of append() : 0.006198 sec
execution time of findName() : 0.003526 sec
```
![](https://i.imgur.com/lcWA0XJ.png)
執行後先來確認正確性,原先想寫個輸出append結果的function,結果發現header中已經有了`show_entry()`,就直接拿來加寫檔功能。
輸出後先下`sort`在用`colordiff`來進行比較,結果如下:
```shell
$ sort new.txt >sort_new.txt
$ colordiff sort_new.txt dictionary/words.txt
0a1,4
> aaaa
> aaaaa
> aaaaaa
> aaaaaaa
```
看到結果會發現append少了前頭的4個elements,這是在thread=4的情況下。
那將thread=2來diff看看。
```shell=
$ sort new.txt >sort.txt
$ colordiff sort.txt dictionary/words.txt
0a1,2
> aaaa
> aaaaa
```
就只有前頭兩個elements缺少,看到這應該就可以知道是哪裡出錯。
一開始再trace code時,就覺得這邊有點不直覺,直接append上`pHead->pNext`,想說應該是某個地方有處理過head,看到diff結果就直接對此處進行改寫,直接append pHead。
```clike=
if (i == 0) {
pHead = app[i]->pHead->pNext;
dprintf("Connect %d head string %s %p\n", i,
app[i]->pHead->pNext->lastName, app[i]->ptr);
} else {
etmp->pNext = app[i]->pHead->pNext;
dprintf("Connect %d head string %s %p\n", i,
app[i]->pHead->pNext->lastName, app[i]->ptr);
```
在trace`debug.h`時發現有趣的東西
```clike=
#define eprintf(...) \
do { \
fprintf(stderr, "ERROR: "); fprintf(stderr, __VA_ARGS__); \
} while (0)
#else
```
這個 `do{}while(0)`我一開始不知道他的用途,就直接忽視,剛剛把他丟google後看到stackoverflow上有討論,覺得實在是太厲害的作法了,見識到真實世界的程式:
當在macro中有多個statement時,如下:
```clike=
#define FOO(x) foo(x); bar(x)
if (condition)
FOO(x);
else // syntax error here
...;
```
注意上方bar()後方分號,如果改寫成下方形式,FOO後方就不能有分號,這都跟平常的C語言格式有所衝突:
```clike=
#define FOO(x) { foo(x); bar(x); }
if (condition)
FOO(x)
else
...
```
所以才使用do{}while(0)改寫成:
```clike=
#define FOO(x) do { foo(x); bar(x); } while (0)
if (condition)
FOO(x);
else
....
```
`pthread_setconcurrency(THREAD_NUM + 1);`
然後來討論thread的使用:
這邊的`join`會使得所有的thread在等待最慢的一個thread,附上manu
The pthread_join() function waits for the thread specified by thread to terminate. If that thread has already terminated, then pthread_join() returns immediately. The thread specified by thread must be joinable.
```clike=
clock_gettime(CLOCK_REALTIME, &mid);
for (int i = 0; i < THREAD_NUM; i++)
pthread_create( &tid[i], NULL, (void *) &append, (void *) app[i]);
for (int i = 0; i < THREAD_NUM; i++)
pthread_join(tid[i], NULL);
```
所以改成thread pool:使用threadpool-mbrossard
結果:
![](https://i.imgur.com/Y5uKfrP.png)
```shell
Performance counter stats for './phonebook_opt' (100 runs):
793,725 cache-misses # 45.454 % of all cache refs
1,806,443 cache-references
290,825,345 instructions # 1.33 insns per cycle
220,308,926 cycles
0.244764659 seconds time elapsed ( +- 2.04% )
```
很明顯效能沒有改進,思考原因~
在找原因時覺得append中的i,j可讀性也很差,
改寫`new append`與`append`
```shell=
Performance counter stats for './phonebook_opt' (100 runs):
437,951 cache-misses # 66.591 % of all cache refs
628,081 cache-references
304,163,942 instructions # 1.58 insns per cycle
187,373,621 cycles
0.249370244 seconds time elapsed ( +- 2.35% )
```
將`append`拆成row major與col major,對`append`而言,row major比較快,可是col major的cache miss 卻比較少。
![](https://i.imgur.com/QoN6R3P.png)
```
Performance counter stats for './phonebook_opt_row' (100 runs):
420,299 cache-misses # 62.773 % of all cache refs
707,135 cache-references
305,033,514 instructions # 1.58 insns per cycle
208,815,659 cycles
0.247210751 seconds time elapsed ( +- 2.32% )
Performance counter stats for './phonebook_opt_col' (100 runs):
907,305 cache-misses # 49.503 % of all cache refs
1,813,183 cache-references
291,582,947 instructions # 1.29 insns per cycle
221,915,978 cycles
0.248214764 seconds time elapsed ( +- 1.20% )
```
## Reference
* [Locks, Mutexes, and Semaphores: Types of Synchronization Objects](https://www.justsoftwaresolutions.co.uk/threading/locks-mutexes-semaphores.html)
* [An Introduction to Lock-Free Programming](http://preshing.com/20120612/an-introduction-to-lock-free-programming/)
* [MMAP(2) Linux Programmer's Manual ](http://man7.org/linux/man-pages/man2/mmap.2.html)
* [do { … } while (0) — what is it good for? [duplicate]](http://stackoverflow.com/questions/257418/do-while-0-what-is-it-good-for)