2017q1 Homework4 (phonebook-concurrent)

contributed by <yangyang95>

tags: `sysprog2017` `HW4` `yangyang95`

開發環境看這裡
作業要求看這裡
github page 看這裡

程式碼重構

(1) 避免 warning

在 make 的時候噴出以下 warning
phonebook_opt.c:48:12: warning: variable ‘cpu_time’ set but not used [-Wunused-but-set-variable] double cpu_time;

原因是 cpu_time 變數是 DEBUG 的時候用的，沒有加入 DEBUG 標籤就會產生 warning
所以用 #ifdef DEBUG 把它包起來，make 的時候就不會有 warning 了

(2) 相同標籤程式碼重新排列

原程式碼中有許多 #ifndef OPT #else #endif 分散在各處，導致程式碼相當難閱讀，將相同標籤的程式排列在一起，讓閱讀程式的時候能更有邏輯。

在重新排列時也發現有趣的現象
















#ifndef OPT
{
	int i=0;
	......
}
#else
{
	......
}
#endif

.......

#ifndef	OPT
	printf("%d", i);
#endif

原本是想用 { } 把 OPT 中間的程式碼框起來，結果竟然編譯不過
變數 i 的生命周期只在 scope 裡面，跑到下面 printf 地方的時候，i 是沒有定義的

之前看到程式碼這樣寫還以為是為了好看，沒想到是一個特別的技巧

(3) loop 改進

程式碼裡面有一段長這樣









/* Connect the linked list of each thread */
for (int i = 0; i < THREAD_NUM; i++) {
    if (i == 0) {
        pHead = thread_args[i]->lEntry_head->pNext;
    } else {
        e->pNext = thread_args[i]->lEntry_head->pNext;
    }
    e = thread_args[i]->lEntry_tail;
}

很明顯 i == 0 只會執行一次，可以移到 for 迴圈外面







pHead = thread_args[0]->lEntry_head->pNext;
e = thread_args[0]->lEntry_tail;

for (int i = 1; i < THREAD_NUM; i++) {    
    e->pNext = thread_args[i]->lEntry_head->pNext;
    e = thread_args[i]->lEntry_tail;
}

效能分析

$ ./phonebook_orig
size of entry : 136 bytes
execution time of append() : 0.065837 sec
execution time of findName() : 0.005717 sec

$ ./phonebook_opt (4 threads)
size of entry : 24 bytes
orginal file size = 3206080
execution time of append() : 0.008602 sec
execution time of findName() : 0.004075 sec

新增新的規則 $ make thread_test_plot，執行從 1 ~ 128 thread 的 phonebook_opt

makeflie:













ifeq ($(strip $(THREAD_TEST)),1)
CFLAGS_opt  += -D THREAD_TEST
endif

thread_test: main.c
	for i in `seq 1 128`; \
	do $(MAKE) phonebook_opt THREAD_TEST=1 THREAD=$$i --silent; \
	./phonebook_opt; echo "\n"; \
	rm -rf phonebook_opt; \
	done;

thread_test_plot: clean thread_test opt.txt
	gnuplot scripts/thread_test.gp

main.c: 為了方便做圖在輸出檔案加入執行緒數量





#if defined(THREAD_TEST)
    fprintf(output, "%d append() findName() %lf %lf\n", THREAD_NUM, cpu_time1, cpu_time2);
#else
    fprintf(output, "append() findName() %lf %lf\n", cpu_time1, cpu_time2);
#endif

thread_test.gp

reset
set terminal png size 2000
set ylabel 'time(sec)'
set style fill solid
set title 'Different thread append() perfomance comparision (from 1 to 128 thread)'
set term png enhanced font 'Verdana,10'
set output 'thread_test_append.png'

plot [:][:0.02]'opt.txt' using 4:xtic(1) with histogram title 'optimized'

set title 'Different thread findName() perfomance comparision (from 1 to 128 thread)'
set output 'thread_test_findName.png'

plot [:][:0.02]'opt.txt' using 5:xtic(1) with histogram title 'optimized'

那趕快來欣賞我 ~~做到快吐血~~ 的圖形吧

append() 放大圖
 findName() 放大圖

append() 的圖形，我其實看不出明顯的規律。貌似(!?)有些微周期的振蕩（~~其實是我的錯覺~~）
不過有一點可以確定的是 3 個執行緒 append() 的效果最好

findName() 的圖形可以看出有周期性 15,30,44,56,64,71,73,77,88,95…(100後的數字看到眼花)
這些數字所需的時間都相當低

如果 append 的部分使用 3 threads，findName 也選擇時間短的 thread，應當可以將時間降到相當低。不過，我不確定這樣的方式是否實用。
畢竟，對程式每個區段進行量測，感覺不太實際。程式維護上也有可能造成困難。

目前也還不知道圖形如此呈現的原因，得試着思考看看

參考資料：

作業 1 補充

延續 2017q1 Homework1 (phonebook) 的不足，持續改進

Cache-miss 分析

Performance counter stats for './phonebook_orig' (100 runs):

    2,290,073   cache-misses        # 85.325 % of all cache refs   ( +-  0.08% )
    2,683,939   cache-references                                   ( +-  0.15% )
  321,828,538   instructions        # 1.20  insn per cycle         ( +-  0.03% )
  267,943,910   cycles                                             ( +-  0.37% )

       0.089032849 seconds time elapsed                            ( +-  0.39% )

 Performance counter stats for './phonebook_opt' (100 runs):  HW1

      410,488   cache-misses        # 66.447 % of all cache refs   ( +-  0.28% )
      617,763   cache-references                                   ( +-  0.59% )
  243,373,348   instructions        # 1.63 insn per cycle          ( +-  0.04% )
  149,450,713   cycles                                             ( +-  0.28% )

       0.052516323 seconds time elapsed                            ( +-  1.28% )

Performance counter stats for './phonebook_opt' (100 runs):

    1,126,509   cache-misses        # 45.986 % of all cache refs   ( +-  0.61% )
    2,449,667   cache-references                                   ( +-  0.85% )
  248,939,912   instructions        # 1.02  insn per cycle         ( +-  0.09% )
  242,920,594   cycles                                             ( +-  0.68% )

       0.073554961 seconds time elapsed                            ( +-  0.73% )

Comparison	Size	append	findName	cache miss	total (average)
original	136 bytes	0.065837 sec	0.005717 sec	85.325 %	0.089032849 sec
struct_reduce	32 bytes	0.069770 sec	0.002947 sec	66.447 %	0.052516323 sec
concurrent	24 bytes	0.008602 sec	0.004075 sec	45.986 %	0.073554961 sec

可以看到 cahch miss 隨 structure size 減少而降低
總時間 concurrent > struct_reduce 估計是因為 text_align() 的原因，之後可以實驗看看

問題回答

本例選用的 dataset (也就是輸入電話簿的姓名) 是否存在問題？

有代表性嗎？跟真實英文姓氏差異大嗎？

Ans：

還真的沒看過外國的電話簿，所以在網路上找了個 1969 年的電話簿圖片來參考 [source]

給定的字典一個名字只有出現一次，但電話簿中卻很可能有同名同姓的人。甚至其中還有數字、縮寫、店家的名字。每個 entry 的字數也不一樣，有的是 3 個字，有的是 4 個字。
如果依照實作只使用 last name 搜尋的話，絕對會產生問題。
資料難道「數大就是美」嗎？

Ans：資料正確性是首要考量，不過也要有一定規模才比較具有參考價值
如果我們要建立符合真實電話簿情境的輸入，該怎麼作？

Ans：可以在網路上尋找 open data 來參考實際電話簿資訊

放大圖

SURNAME	FORENAME	TITLE	AKA	PHONE	EMAIL	DIVISION
Tully	Susan	Ms			st@soton.ac.uk	CZ
Cornick	Carlene Jean Linda	Mrs		+442380596808	C.J.L.Geddes@soton.ac.uk	NP
Gramatyka	Jakub Julian	Mr			J.Gramatyka@soton.ac.uk	MM
Ryan	Matthew George	Dr	Matt	+442380593308	M.G.Ryan@soton.ac.uk	CC
Stone	Nicole Clare	Mrs		+442380597770	N.C.Stone@soton.ac.uk	JW

配合適當的資料結構，可加入對電話簿結構的封裝與方法，如：插入、刪除…等功能

老師建議

在 commit b6a2385 jserv 提供意見，因為強制 Push 時留言可能會不見，所以在這裡備份

+ifeq ($(strip $(THREAD_TEST)),1)
 +CFLAGS_opt  += -D THREAD_TEST
 ...

You don't have to rebuild with THREAD_TEST. Instead, you can enhance build system to generate executable files with various configurations. They can exist suffixing "_g" (for debugging symbols), "_prof" (for profiler enabled), etc. jserv

What's the benefit for doing so? It's seems like I still have to use wildcard or some technique, since there is a output difference to file in main.c when running thread_test.
Or maybe I just misunderstand your words. yangyang95

You can automate all test plans if the executable files are prepare well in advance. Once continuous integration (CI) is set up in such manner and configurations, everything can be manipulated.
Example: https://github.com/chewing/libchewing/blob/master/.travis.yml
It triggers Travis-CI and automate build and verification process. jserv

試了幾次，結果不會用 Travis CI，一直因為 clock_gettime build 失敗… orz
只好先關掉，~~逃避雖然可恥，但是有用 XD~~

多試了幾次後發現把-lrt flag 加到 make 的最後面就可以編譯成功

可以在專案內加入美美的測試圖案，會根據測試是否通過改變狀態

如此以來就可以在遠端測試，不同環境下是否能夠進行編譯，還有執行測試的 script

順帶放上 Travis-Ci 使用的 .travis.yml 檔案

language: c
sudo: false

branches:
  only:
    - master

cache: apt

addons:
  apt:
    sources:
    - ubuntu-toolchain-r-test
    packages:
    - gcc-4.8
    - gcc-5
    - gcc-6

matrix:
  include:
  - os: linux
    compiler: gcc
  - os: linux
    compiler: gcc-4.8
  - os: linux
    compiler: gcc-5
  - os: linux
    compiler: gcc-6

script:
  - make && ./phonebook_orig && ./phonebook_opt