2016q3 Homework2 (phonebook-concurrent)

contributed by <ruby0109>

反省這兩個禮拜糟糕的自己, 完全不行阿QAQQQ
參考了CheHsuan 和 TempoJiJi兩位同學的筆記之後, 發現自己沒有驗證原本的程式正確性, mmap沒有下去探討, 學習thread pool等沒有先從小程式實作，和遇到問題不會拆解所以卡住。以下加入預期目標, 繼續完成作業。

之前進度：phonebook

預期目標

學習驗證程式正確性
concurrency 認知整理
學習其他人共筆
thread pool

理解作業程式碼

影片

開始實作之前, 我先看了吳彥寬同學程式解說video, 筆記如下：

fgets 本身就可以讀行, 所以可以簡化

補充fgets基礎知識：
根據c99 規格書,
#include <stdio.h>
char *fgets(char * restrict s, int n,FILE * restrict stream);
description:
The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additionalcharacters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.

理解：
fgets讀取stream指向的stream最多n-1個字元到s指向的陣列中, 遇到換行符號或是檔案結尾字元就不會再讀, '\0' 會寫在陣列最後一個字元之後。
strncasecmp：

#include <strings.h>
int strcasecmp(const char *s1, const char *s2);int strncasecmp(const char *s1, const char *s2, size_t n);

from linux manual:
strcasecmp() ：function compares the two strings s1 and s2, ignoring the case of the characters. It returns an integer less than, equal to, or greater than zero if s1 is found, respectively, to be less than, to match, or be greater than s2.The strncasecmp() function is similar, except it compares the only first n bytes of s1

strncasecmp比較s1前n bytes和s2, 比較時忽略大小寫, 若s1和s2相同傳回0; s1長度大於s2傳大於0值; 小於則傳負值。
~~memory pool 和 pthread 這兩個部份看影片仍然不太明白~~
~~可以在findname放其他資料,而不是在append整個放進去~~（findname或許會些微上升, 可是不會很多）append進去的只有lastName 的指標, findname時再放入lastName
整個append用pthread下去做而不是只有mmap(還不懂為什麼原本只做mmap時間會上升)

檢測程式正確性

先執行一次看看, make plot之後：
Performance counter stats for './phonebook_orig' (100 runs):

  4,457,101  cache-misses     #   91.778 % of all cache refs( +-  0.07% )
  4,856,373  cache-references                               ( +-  0.05% )
  263,262,629  instructions     #    1.69  insns per cycle    ( +-  0.02% )
  155,898,150  cycles                                         ( +-  0.06% )

 	0.076228638 seconds time elapsed                     ( +-  0.31% )

Performance counter stats for './phonebook_opt' (100 runs):

  3,548,725  cache-misses     # 61.592 % of all cache refs( +-  0.25% )
  5,761,661  cache-references                             ( +-  0.46% )
  224,875,100  instructions   #  1.93  insns per cycle    ( +-  0.04% )
  116,299,224  cycles                                     ( +-  0.42% )

 	0.278607164 seconds time elapsed                 ( +-  2.06% )

優化後的cache miss略下降一些。

file align
先看在main.c中被include的file.c產生的align.txt, 直接點開的話裡面只有aaaa,用vim 的話則是:
```
aaaa
^@^@^@^@^@^@^@^@^@^@^@aaaaa
^@^@^@^@^@^@^@^@^@^@aaaaaa
^@^@^@^@^@^@^@^@^@aaaaaaa
^@^@^@^@^@^@^@^@aaaaaaaa
^@^@^@^@^@^@^@aaaaaaaah
^@^@^@^@^@^@aaaaaaauugh
^@^@^@^@aaaaaagh
^@^@^@^@^@^@^@aaaaaahhhhh
^@^@^@^@aaaaaaugh
^@^@^@^@^@^@aaaaagh
^@^@^@^@^@^@^@^@aaaaah
^@^@^@^@^@^@^@^@^@aaaarthur
^@^@^@^@^@^@aaaaw
^@^@^@^@^@^@^@^@^@^@aaagghhhh
^@^@^@^@^@^@aaah
^@^@^@^@^@^@^@^@^@^@^@aaaugh
^@^@^@^@^@^@^@^@^@aaccf
```
不是很懂^@是什麼, 參考what is ^@ character in vi?才知道是NULL bytes。不過奇怪的是，首先第一行沒有^@,再來每一行的行數不同,像第7行有17個,第8行只有12個。

先用GDB觀察main.c跳進file_align function後的執行情形。在wbuf和rbuf都有align, 顯示出來的值也都正確。rbuf的大小似乎可以調小一點, 畢竟也有措施避免strcpy的overflow。
linked list的資料是否正確
試試看原本同學寫的debug.h,跑完之後發現在終端機不太方便測試是否有遺漏的資訊, ~~要怎麼比對呢？~~ 可以直接用findname

    // use findname to test whether all elements in the list
    // open file for test and origin word.txt
    FILE *test,*org;
    test = fopen("test","a");
    org = fopen(DICT_FILE, "r");

    char line[MAX_LAST_NAME_SIZE];
    // use findname to test whether all elements in the list
    while(fgets(line, sizeof(line),org)){
        e = pHead;
        if(findName(line, e) == NULL) fprintf(test,"%s not found\n",line);
    }
    fclose(org);
    fclose(test);

直接把words.txt檔的東西丟下去找, 可是這樣若找尋350000筆，每筆0.03897要尋找約22分鐘。發現TempoJiJi大大直接另外寫一個測, 才發現原來這個版本沒有Hash

寫了entry_test.c 可是有segmentation fault

debug還是卡了好久, 用GDB下去看發現是在append_target中, 第60行的地方, malloc執行到第三個word就會直接掛掉。

把他改成小一點的程式, 縮小錯誤範圍：












































#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <ctype.h>

#define MAX_LAST_NAME_SIZE 16
#define HASH_SIZE 42737
#define DICT_FILE "./dictionary/words.txt"

typedef struct __PHONE_BOOK_ENTRY {
    char *lastName;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;


entry *append_target(char lastName[], entry *e)
{
    e->pNext = (entry *)malloc(sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;

    return e;
}

int main(void)
{
    FILE *org;
    org = fopen("./dictionary/words.txt","r");

    char line[MAX_LAST_NAME_SIZE];
    entry *target, *tHead;

    tHead = (entry *)malloc(sizeof(entry));
    target = tHead;
    target -> pNext = NULL;
    while(fgets(line,sizeof(line),org)) {
        target = append_target(line,target);
    }

    fclose(org);
    free(tHead);
    free(tHead->pNext);
}

gdb跑出的結果：

Program received signal SIGSEGV, Segmentation fault.
__strcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S:605
605	../sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: 沒有此一檔案或目錄.

發現是struct 忘記改><
感謝徐朝逸大大幫我看codeQAQ

幾經波折, 把所有菜鳥可以犯的錯犯過一遍然後debug完後, 發現有一長串都沒有在裡面, 這裡擷取一小部份：

aaaa
 not found
angelika
 not found
anklet
 not found
antescript
 not found
antirobin
 not found
assiento
 not found

發現我的程式是錯的！！雖然有幾個字母帶進原程式會發現的確沒在裡面, 可以很多都是跑得出來的。把append和findname找不到的資料互相比較, 發現其實有append進去, 可是findName沒有找到，　原來是findName的地方有寫錯, 這樣代表之前自己的code也有錯…

正確性解決
把程式改好之後, 跑出來的結果：

   aaaa
   not found
   aaaaa
   not found
   aaaaaa
   not found
   aaaaaaa
   not found

因為pthrea跑的方式是
thread0:0 4 8..
thread1:1 5 9..
thread2:2 6 10..
thread3:3 7 11..

所以可以推測這可能是pthread各自list的開頭有問題。
仔細看之後發現在main.c 中的pHead = stack[i]->pHead->pNext和etmp->pNext = stack[i]->pHead->pNext 應該要把pNext去掉!!

實作可以改良的地方

在語法的部份:
```
  if ((suffix = (pad - strlen(rbuf))) != 0)
  strcpy(wbuf, rbuf);
```
應該要大於0, 以防超過陣列大小
- 沒有註解看起來好痛苦QAQQ : 加入註解
- 變數在main.c中和.c file不好對照, 有點難讀
- 把new_append_a改成ThrdInitial(char *StartAdrs, char *EndAdrs, int tid, int nthrd, entry *pptr)
- append_a改成

typedef struct _Thread_Stack_ {
    char *StartAdrs;
    char *EndAdrs;
    int tid;
    int nthread;
    entry *PoolPtr;
    entry *pHead;
    entry *pTail;
} ThrdStack;

concurrency

筆記：

Concurrency is not Parallelism, by Rob Pike
- sometime people run a program with may processes, but it gets slow, why?
  what is concurrency?
  - concurrency in computer world is a way to build things, concurrent is a composition of
    independently executing things typically funcitons but they don't have to be(independently executing processes)這裡的process泛指各種thread, coroutine等
    理解：concurrency就像是一種方法, 由各個獨立執行的函式(包含process, thread, coroutine?等等)組成, 可是方法不一定要採用。
  - parallelism is simultaneously executing lots of multiple thing, possible related possible not
    理解：parallelism是同時執行很多東西，可能相關，也可能互相獨立。
  - concurrency is about dealing with a lot of things at once and
    parallelism is about doing a lot of things at once
    理解：concurrency是一次處理很多東西, 但不一定要同時執行, parallelism 就是同時執行很多東西。
  - concurrency is about the structure
    parallelism is about the execution
  - concurrency is a structure things that you can maybe use parallelism to do a better job
    but parallism is not the goal of concurrency
    concurrency's goal is a good structure
    ex. OS deal with lots of things, but these things aren't necessarily parallel, only one processor only one of them is
    ever running the time, so there are concurrencty structure, but not necessary parallel
    理解：parallism不是concurrency的目標, 就像是OS也是一次處理很多東西, 而且不一定要都同時執行。也就是concurrency是一個結構，可是不一定要用parallism的方式執行。
  - parallelism像是一個箭頭,可以分成不同的方向
  - concurrency gives you a structure program different pieces, but we need to find out how to coordinate those pieces
    and communication
    理解：concurrency給我們一個架構, 可以把程式分成不同的區塊, 可是我們必須找出怎麼協調和讓這些區塊可以溝通

後面的GO就聽不是很懂了><

The Free Lunch Is Over
A Fundamental Turn Toward Concurrency in Software
By Herb Sutter
- 這30年來, CPU 設計者主要藉由改善以下來達到效能的提昇
  - clock speed
  - execution optimization-為了改善效能, 甚至會有風險改變程式
    - pipelining
    - branch prediction
    - executing multiple instructions in the same clock cycle(s)
    - reordering the instruction stream for out-of-order execution.
  - cache
- 可是發展到了現在 clock speed很難再提昇, 因為如果再提昇會有無法散去高熱, 太耗能跟電流液出的情形
- Myths and Realities: 2 x 3GHz < 6 GHz
  - 即使有兩個thread在兩個實體處理器上運作, 也不會得到兩倍的效果, 為什麼呢？
    - 要維持cache coherency
    - 除非兩個core在跑不同的程式, thread run independently所以不需要等待彼此不然這些core不會被良好的運用
      - 雖然系統還是會被潛在的廣告軟體和間諜軟體佔用
- 未來可以藉由以下來繼續提高效能
  - hyperthreading- 在一個CPU上運行數個thread in parallel
    - 但仍然只有一個 cache, 一個 integer math unit, 一個 FPU, and in general just one each of most basic CPU features.
    - 很難提昇兩倍效能, 而且對於single-threaded應用沒有幫助。
  - multicore
    - 很難提昇兩倍效能, 而且對於single-threaded應用沒有幫助。
  - cache
    - 應該可以繼續成長，因為space is speed
    - 一個cache miss就要多花10~50倍的時間(相較於去cache拿data)
- Concurrency is the next major revolution in how we write software
Concurrency的Youtube Lecture
- Locks: 擁有權, 擁有者才能unlock。

作業中的concurrency:

第一次看到pthread_setconcurrency, 查manual理解之下是在linuxthread沒有太大作用, 可是可以增加相容性。

學習其他人共筆

LanKuDot (李昆憶)共筆

參考筆記並閱讀The Free Lunch Is Over, 對於重構的兩則常見誤解
valgrind：
- 安裝：$ sudo apt-get install valgrind
- 說明：$ valgrind –help
- 參考資料：分析工具valgrind的使用(這篇的指令tools要改成tool)，Valgrind is NOT a leak checker
- 用途：
  - 偵測undefined behavior
    valgrind ./phonebook_opt
  - function and memory profiler
  - data-race detection(覺得這好酷！！)
    valgrind --tool=helgrind ./phonebook_opt
  - 可以用來看memory leak
    --leak-check=full
14872 Memcheck, a memory error detector
14872 Copyright © 2002-2013, and GNU GPL'd, by Julian Seward et al.
14872 Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
14872 Command: ./phonebook_opt
14872
size of entry : 24 bytes
execution time of append() : 0.118666 sec
execution time of findName() : 0.002815 sec
14872
14872 HEAP SUMMARY:
14872 in use at exit: 2,215 bytes in 16 blocks
14872 total heap usage: 28 allocs, 12 frees, 8,408,637 bytes allocated
14872
14872 16 bytes in 1 blocks are definitely lost in loss record 1 of 13
14872 at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
14872 by 0x400F23: file_align.4663 (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872 by 0x40101B: main (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872
14872 16 bytes in 1 blocks are definitely lost in loss record 2 of 13
14872 at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
14872 by 0x40166E: findName (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872 by 0x4013F6: main (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872
14872 16 bytes in 1 blocks are definitely lost in loss record 3 of 13
14872 at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
14872 by 0x40166E: findName (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872 by 0x40142A: main (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872
14872 16 bytes in 1 blocks are definitely lost in loss record 4 of 13
14872 at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
14872 by 0x40166E: findName (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872 by 0x40147E: main (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872
14872 24 bytes in 1 blocks are definitely lost in loss record 5 of 13
14872 at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
14872 by 0x401050: main (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872
14872 107 bytes in 1 blocks are definitely lost in loss record 8 of 13
14872 at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
14872 by 0x4016B1: findName (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872 by 0x4013F6: main (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872
14872 107 bytes in 1 blocks are definitely lost in loss record 9 of 13
14872 at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
14872 by 0x4016B1: findName (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872 by 0x40142A: main (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872
14872 107 bytes in 1 blocks are definitely lost in loss record 10 of 13
14872 at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
14872 by 0x4016B1: findName (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872 by 0x40147E: main (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872
14872 192 bytes in 4 blocks are definitely lost in loss record 11 of 13
14872 at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
14872 by 0x401709: ThrdInitial (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872 by 0x4011EC: main (in /home/ruby/2016_autumn/W2/phonebook-concurrent/phonebook_opt)
14872
14872 LEAK SUMMARY:
14872 definitely lost: 601 bytes in 12 blocks
14872 indirectly lost: 0 bytes in 0 blocks
14872 possibly lost: 0 bytes in 0 blocks
14872 still reachable: 1,614 bytes in 4 blocks
14872 suppressed: 0 bytes in 0 blocks
14872 Reachable blocks (those to which a pointer was found) are not shown.
14872 To see them, rerun with: –leak-check=full –show-leak-kinds=all
14872
14872 For counts of detected and suppressed errors, rerun with: -v
14872 ERROR SUMMARY: 9 errors from 9 contexts (suppressed: 0 from 0)
問題：看到一些文章說free並不是必須, 可是大部分是好的, 比如In C, is it necessary to free a pointer at exit?, What REALLY happens when you don't free after malloc?, Does calling free or delete ever release memory back to the “system” 要視情況而定, 可是有種很難體會的感覺。
Solve memory leak
在findName中, 不需要把lastname放進entry->lastName 中, 因為strncasecmp就是用pointer即可。
學習李昆憶大大的重構方式
- 在每個段落加上這個段落的作用, 比如file preprocessing, release the memory等等方便看的人快速理解程式內容
- 把在程式碼中的indlude header移到最上方
- 用參數讓輸入進函式的資料更容易解讀
- 去掉不需要的參數如etmp
- 改掉dprintf
優化append
- 原本是12341234的方式(row-major)
- 改成把資料分成四大段, 符合cache的排列方式

結果：

Thread Pool

參考C 的 Thread Pool 筆記
以下使用threadpool-mbrossard的程式碼
理解ThreadPool
這個程式看起來好舒服!!!原來這就叫做程式碼><!!
- 理解過程中程式筆記
  - 在每個function前說明function作用, 和argument說明
  - 學到goto用法
  - 原來可以這樣寫：pool->head = pool->tail = pool->count = 0;
- 程式運作：
  先來看wiki 使用thread pool處理Task的圖
  
  再來看程式的struct, 首先threadpool中會紀錄兩個array, 一個是threads另一個是task queue。

struct threadpool_t {
    pthread_mutex_t lock;
    pthread_cond_t notify;
    pthread_t *threads;//放置所有thread空間的指標
    threadpool_task_t *queue;//放置所有task空間的指標
    int thread_count;
    int queue_size;
    int head;
    int tail;
    int count;
    int shutdown;
    int started;
};

threads的struct是pthread_t;
task queue的struct則是由funciton和argument組成

typedef struct {
    void (*function)(void *);
    void *argument;
} threadpool_task_t;

再來看執行的流程圖

TypeError: y.shiftX is not a function

不懂的地方：
next = (pool->tail + 1) % pool->queue_size;
pool->head = (pool->head + 1) % pool->queue_size;
上面這個是像一個ring, 有head和tail在兩端移動

Implementation
在implement的時候, 我遇到的問題是不知道thread pool的大小要設多大, 參考TempoJiJi是8, 先試試看
- 把thread pool大小設成thread的兩倍大

從圖上可以發現四個thread的時候最快, 8個以上時間就變得相當高

改進：

學習github.ignore
- 參考資料-Git 教學(1) : Git 的基本使用
  - vim .gitignore 裡面加上想要忽略的檔案
thread pool影響
- 比較(row-major opt 和thread pool)
  
  好像沒有變快, 為什麼呢？

參考資料

待做筆記

semaphore