2018q1 Homework3 (list)

2018q1 Homework3 (list) === contributed by < `TingL7` > ## 生日悖論 ### 如何透過近似的運算式，估算上述機率計算呢？附上有效的 LaTeX 算式生日悖論是指 23 人以上，出現兩個人生日同一天的機率高於 50 % * 參考 [Birthday problem](https://en.wikipedia.org/wiki/Birthday_problem) * 假設 * 所有人的生日都是隨機分佈 * 一年 365 天 * P(n) : n 人中至少兩人生日是同一天 * $n 人中生日都不同天 = \dfrac{365}{365} \cdot \dfrac{364}{365} \cdot \dfrac{363}{365} \cdot ...\dfrac{366-n}{365} = \dfrac{365!}{365^{n} \cdot (365-n)!}$ * $P(n) = 1 - n 人中生日都不同天 = 1 - \dfrac{365!}{365^{n} \cdot (365-n)!}$ * $P(23) = 1 - \dfrac{365!}{365^{23} \cdot 342!}\\ \approx 1 - 0.49270276567 = 0.50729723432$ #### 參考資料 * [HackMD – LaTeX 語法與示範](https://hackmd.io/s/B1RwlM85Z#) ### [Birthday problem](https://en.wikipedia.org/wiki/Birthday_problem) 對資訊安全的影響在哪些層面？ * 主要的影響在於 hash function 的應用 * hash function 的碰撞，利用窮舉法製作出具有相同 hash value 的檔案或合同，取代真正的檔案以進行欺騙 * 有一長度 n 的字串 t1 透過 hash function 得到 f(t1) * 要找到字串 t2 ，使得 f(t1) = f(t2) ，可以透過大量製造內容不同但意義相同（如，增減標點或空白）的字串 * 從生日悖論中，可以知道只要有 2^n/2^ 個字串就會有很大的機率發生碰撞 * 因此，只要製造出 2^n/2^ 個字串就有機會找到具有相同雜湊值的字串 * 應對方法：增加字串長度 * 參考[Birthday attack](https://en.wikipedia.org/wiki/Birthday_attack) 、 [生日攻击是什么，有什么用？](https://www.zhihu.com/question/54307104) 、 [hash碰撞，数字签名，生日攻击](http://blog.renren.com/share/250153535/16381267487/0) ### 像是 [Random Number Generator](http://stattrek.com/statistics/random-number-generator.aspx) 這樣的產生器，是否能產生「夠亂」的亂數呢？ * 產生器產生的亂數，是透過 seed 來產生亂數表，因此 seed 相同的話，得到的亂數表都是相同的。 * 如果每次的亂數種子都設不一樣就可以得到不同的亂數表，進而達到看起來像亂數的效果。 * 像是 c 語言中常用的 `srand(time(NULL));` ，以時間作為亂數種子，使得每次亂數種子都不同，得到不同亂數表，進而得到「亂」數 * 而 `rand()` 測試幾次，會發現每次得到的亂數是相同的，因為他的種子每次都相同 ## Linux 風格的 linked list ### 為何 Linux 採用 macro 來實作 linked list？一般的 function call 有何成本？ * 參考 [What is the difference between a macro and a function in C?](https://stackoverflow.com/questions/4990362/what-is-the-difference-between-a-macro-and-a-function-in-c) 中提到，`The basic difference is that function is compiled and macro is preprocessed.` * macro 在前置處理的過程中，就完成程式碼的替換。 * function 在被編譯後執行，每次呼叫時都需要時間進行 stack 的 pop 、 push，比較耗費時間。 * 因此，採用 macro 來實作 linked list 可以避免過多的 function call 時間成本 ### Linux 應用 linked list 在哪些場合？舉三個案例並附上對應程式碼，需要解說，以及揣摩對應的考量 ### GNU extension 的 typeof 有何作用？在程式碼中扮演什麼角色？ * 參考 [GCC typeof在kernel中的使用](http://deltamaster.is-programmer.com/posts/37253.html) 、 [Referring to a Type with typeof](https://gcc.gnu.org/onlinedocs/gcc/Typeof.html) * `int a;typeof（a) b;` 宣告與變數 `a` 相同 type 的變數 `b` * 可以達到 c++ 中「多型」的效果 * 可以根據函式傳入不同的 type 而產生不同的行為 e.g. ```c #define max(a,b) \ ({ typeof (a) _a = (a); \ typeof (b) _b = (b); \ _a > _b ? _a : _b; }) ``` ### 解釋以下巨集的原理 ```c #define container_of(ptr, type, member) \ __extension__({ \ const __typeof__(((type *) 0)->member) *__pmember = (ptr); \ (type *) ((char *) __pmember - offsetof(type, member)); \ }) ``` * 在 [Alternate Keywords ](https://gcc.gnu.org/onlinedocs/gcc-3.0.1/gcc_5.html#SEC109) 中，提到編譯參數 `-traditional` 、 `-ansi` 、 `-std=c99` 會使一些關鍵字失效，如 `typeof` ，在GNU C 中，在字的前後加上 `__` 可以解決關鍵字失效的問題 * 定義 `container_of(ptr, type, member)` 以 `__extension__()` 取代。 * `const __typeof__(((type *) 0)->member) *__pmember = (ptr);` * `(type *) 0` : 型態轉成 type * 的 0 * `((type *) 0)->member` : type * 的 0 指向 member * 由 `->` 可以猜測 type 是 structure * `__typeof__(((type *) 0)->member)` : type 中 member 的型態 * 0 可以替換成其他數字，只是為了知道 member 的 type 而使用 * `const __typeof__(((type *) 0)->member) *` : 型態為 const type 中 member 的型態指標 * `const __typeof__(((type *) 0)->member) *__pmember` : 宣告變數__pmember，型態為 const type 中 member 的型態指標 * `const __typeof__(((type *) 0)->member) *__pmember = (ptr);` ：宣告變數__pmember，型態為 const type 中 member 的型態指標，指向 ptr 指的位置 * `(type *) ((char *) __pmember - offsetof(type, member));` * `(char *) __pmember` : 型態轉換成 char * 的 __pmember * `offsetof(type, member)` : type 中 member 的位址大小 * `#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)` 計算出 struct 和其 member 的距離 * `((char *) __pmember - offsetof(type, member))` : 指向 ptr 也是 struct typr 中的 member的位址減去 struct 開頭與 member 的距離，也就是 struct 開頭的位址 * `(type *) ((char *) __pmember - offsetof(type, member));` ：以(type *) 儲存 type 開頭的位址 ### 除了你熟悉的 add 和 delete 操作，list.h 還定義一系列操作，為什麼呢？這些有什麼益處？ * 除了一般的 add 和 delete ，還定義這些操作 * list_add_tail * list_del_init : 在刪除節點之前，先將其初始化，這樣可以避免錯誤的操作 * list_empty * list_is_singular : 是否只有一個節點 * list_splice ：把 list 的一些節點插入到另一個 list 開頭 * list_splice_tail * list_splice_init * list_cut_position : 將 list 開頭移到另一個 list * list_move : 把指定的節點移到 list 開頭 * list_move_tail * 由於這是環狀 doublely-linked list ，會有針對開頭或尾端的操作，這樣能夠更方便的使用，也能提升效能 ### LIST_POISONING 這樣的設計有何意義？ * [list poisoning](https://en.wikipedia.org/wiki/List_poisoning) 是指 mail list 出現無效的郵件地址導致資源的消耗 * LIST_POISONING 被定義在 [linux-list/include/list.h](https://github.com/TingL7/linux-list/blob/master/include/list.h)中 ```clike=119 /** * list_del() - Remove a list node from the list * @node: pointer to the node * * The node is only removed from the list. Neither the memory of the removed * node nor the memory of the entry containing the node is free'd. The node * has to be handled like an uninitialized node. Accessing the next or prev * pointer of the node is not safe. * * Unlinked, initialized nodes are also uninitialized after list_del. * * LIST_POISONING can be enabled during build-time to provoke an invalid memory * access when the memory behind the next/prev pointer is used after a list_del. * This only works on systems which prohibit access to the predefined memory * addresses. */ static inline void list_del(struct list_head *node) { struct list_head *next = node->next; struct list_head *prev = node->prev; next->prev = prev; prev->next = next; #ifdef LIST_POISONING node->prev = (struct list_head *) (0x00100100); node->next = (struct list_head *) (0x00200200); #endif } ``` * 第 130 行的註解有提到， LIST_POISONING 的設計是因為刪除節點時，只有刪除 list 中的節點，並沒有真正的釋放記憶體，因此在重新使用這塊記憶體時，可能會去使用到不該用的 next / prev ### linked list 採用環狀是基於哪些考量？ * 在使用上效能比較好，例如要使用到 linked list 尾端，直接由 head 就可以找到，不必再使用 loop 去找到 `node->next = NULL;` ### list_for_each_safe 和 list_for_each 的差異在哪？“safe” 在執行時期的影響為何？ * [linux-list/include/list.h](https://github.com/TingL7/linux-list/blob/master/include/list.h)中 ```clike=401 /** * list_for_each_safe - iterate over list nodes and allow deletes * @node: list_head pointer used as iterator * @safe: list_head pointer used to store info for next entry in list * @head: pointer to the head of the list * * The current node (iterator) is allowed to be removed from the list. Any * other modifications to the the list will cause undefined behavior. */ #define list_for_each_safe(node, safe, head) \ for (node = (head)->next, safe = node->next; node != (head); \ node = safe, safe = node->next) ``` ```clike=370 /** * list_for_each - iterate over list nodes * @node: list_head pointer used as iterator * @head: pointer to the head of the list * * The nodes and the head of the list must must be kept unmodified while * iterating through it. Any modifications to the the list will cause undefined * behavior. */ #define list_for_each(node, head) \ for (node = (head)->next; node != (head); node = node->next) ``` * 從程式碼來看最大的差異在於 safe 版本多了一個變數 safe 儲存了當下節點的 next * 多了一個 safe 的好處在於，可以對當下節點進行刪除的動作，也不會影響整個迴圈的進行，因為 safe 會事先儲存下一個節點的位址，不會使 list 斷掉 ### for_each 風格的開發方式對程式開發者的影響為何？ * 與 c 語言中 for 的寫法相比， foreach 簡潔許多，可讀性也較高 * foreach 可以簡單的走訪過整個資料結構 * Perl foreach 用法 ``` foreach my $i (@array){;} ``` * 可以發現，參數僅有控制變數 i 跟要操作的 array ，而在 c 的 for 的寫法中可以自訂變數初始動作、條件、每次內容做完後的動作，彈性比較大，但相對的寫法就比較複雜 ### 程式註解裡頭大量存在 @ 符號，這有何意義？你能否應用在後續的程式開發呢？ * [Doxygen](https://en.wikipedia.org/wiki/Doxygen) 是產生參考文件的工具，會抓取程式內容與註解來自動產生說明文件。因此，只要按照 Doxygen 的格式撰寫註解，就可以不用另外寫說明文件，可以由 Doxygen 自動產生 * [Doxygen 說明文件](https://www.stack.nl/~dimitri/doxygen/manual/commands.html#cmda) 中提到，`@` 符號表示註解的開頭，而 `\` 對於 Doxygen 來說，與 `@` 意義相同 * `@` 會用在參數的開頭，或是一些關鍵字的開頭 (如 `@struct`) ### tests/ 目錄底下的 unit test 的作用為何？就軟體工程來說的精神為何？ * tests/ 目錄底下有許多測試 function 的檔案，可以先確認每個 function 的功能正確無誤，再應用到程式中，使得 debug 更容易 * 就軟體工程來說，希望將開發流程、大程式切割成許多小部份，可以實做一些就測試分析一些，在維護、找錯、團隊合作上都會進行的更容易、更順利 ### tests/ 目錄的 unit test 可如何持續精進和改善呢？ ## 針對 linked list 的排序演算法 ### 對 linked list 進行排序的應用場合為何？ * linked list 的優點在於插入、刪除、共用節點，而且需要的記憶體空間是動態的 * 需要以不同的優先權依據重新排列 * 排程的管理 * 需要以優先順序排列（優先權依據可能是必須馬上做的或者比較重要的） * 有新的事項進來要插入容易 * 事項的數量不固定 ### 考慮到 linked list 在某些場合幾乎都是幾乎排序完成，這樣貿然套用 quick sort 會有什麼問題？如何確保在平均和最差的時間複雜度呢？ * 參考 [快速排序(Quick Sort) – 寫點科普，請給指教。](https://hellolynn.hpd.io/2017/08/03/%E5%BF%AB%E9%80%9F%E6%8E%92%E5%BA%8F-quick-sort/) * quick sort 不適用於幾乎排序好的資料 * quick sort 在使用時，會選擇最後一個當分割的 pivot ，如果是幾乎排序好的每次在挑 pivot 時，容易選到最大值，以此做分割會造成沒有分割成兩組數的效果 * 因此，對於幾乎排序完成的資料，在使用 quick sort 的時間複雜度通常是 worst case O(n^2^) * 可能的解法： Middle-of-Three 方法 * 由資料的最左值、最右值、中間值三個中挑出中間值作為 pivot * 可以避免取到最大值造成分割沒有效果的情況，甚至接近 best case ，因為是幾乎排序好的資料，容易選到中間值，分割出數量相等的兩組數 ### 能否找出 Linux 核心原始程式碼裡頭，對 linked list 排序的案例？ * 利用 [搜尋工具](https://livegrep.com/search/linux) ，搜尋關鍵字 `list_sort` * [/lib/list_sort.c](https://github.com/torvalds/linux/blob/v4.16/lib/list_sort.c#L1) * 以 merge sort 來實作 list sort * [/drivers/gpu/drm/drm_modes.c](https://github.com/torvalds/linux/blob/v4.16/drivers/gpu/drm/drm_modes.c#L34) * mode 的 list 需要以 favorability 的順序排序，因此用到 list sort ### 透過 [skiplist](https://en.wikipedia.org/wiki/Skip_list) 這樣的變形，能否加速排序？ ![](https://i.imgur.com/rGiwDyt.png) * skip list 是一種針對 sorted linked list 的資料結構，透過不同 level 做出不同的 linked list 達到加速搜尋的效果 e.g. 搜尋 `7` 1. 從 level 4 比對， 1 <= 7 ， null 跳 level 3 2. level 3 比對 4 <= 7 , 6 <= 7, null 跳 level 2 3. level 2 比對 9 >= 7 跳 level 1 4. level 1 比對 7 == 7 比對 5 次，比直接查詢（7 次）還要快 * 對於 insertion sort ，這樣子的資料結構，能夠加速，因為每次插入時都需要搜尋插入的位置，這時候 skip list 就派上用場了 * 對於其他的排序法就沒有明顯的幫助 ### 研讀 [How to find the kth largest element in an unsorted array of length n in O(n)?](https://stackoverflow.com/questions/251781/how-to-find-the-kth-largest-element-in-an-unsorted-array-of-length-n-in-on)，比照上方連結，是否能設計出針對 linked list 有效找出第 k 大 (或小) 元素的演算法？ * 參考連結中[ hoder ](https://stackoverflow.com/users/2525478/hoder) 的回答，試著設計找出第 k 小元素的 pseudocode: ``` QUICKSELECT(list, k) 建立 l1, l2 兩個空的 linled list r = 從 list 中隨機挑出一個節點中的 val foreach (node, head) if noed->val < r then add(node, l1) if noed->val > r then add(node, l2) if k <= len(l1) return QUICKSORT(l1, k) else if k > len(list)-len(l2) return QUICKSORT(l2, k - (len(list) - len(l2))) else return r ``` ### [linux-list](https://github.com/sysprog21/linux-list) 裡頭 examples 裡頭有兩種排序實作（insertion 和 quick sort)，請簡述其原理 * insertion : 分成兩部份 list_insert_sorted 及 list_insertsort * 原理：將未排序好的 list 中所有節點，一個個插入到已排序的 list 中適合的位置 * list_insertsort ```clike static void list_insertsort(struct list_head *head) { struct list_head list_unsorted; struct listitem *item = NULL, *is = NULL; INIT_LIST_HEAD(&list_unsorted); list_splice_init(head, &list_unsorted); list_for_each_entry_safe (item, is, &list_unsorted, list) { list_del(&item->list); list_insert_sorted(item, head); } } ``` 這個函式主要做兩件事 1. 將 list 移到另一個 list (list_unsorted) 2. 先將節點從 list 中刪除，再將節點插入排序好的 list 中 * list_insert_sorted ```clike static void list_insert_sorted(struct listitem *entry, struct list_head *head) { struct listitem *item = NULL; if (list_empty(head)) { list_add(&entry->list, head); return; } list_for_each_entry (item, head, list) { if (cmpint(&entry->i, &item->i) < 0) { list_add_tail(&entry->list, &item->list); return; } } list_add_tail(&entry->list, head); } ``` 要將節點插入排序好的 list 中，分 3 種情況 1. 第一個：當 list 是空的話， item 直接放進 list 中 2. 中間：找到比 item 小的節點,將 item 放到它後面 3. 最後一個： list 中最小的元素 * quick sort * 原理：設定一個 pivot ，將 list 以 pivot 為刀切割成兩部份，一部分都比 pivot 大，一部分比 pivot 小，在將切割出來的部份遞迴做 quick sort 直到切出來的部份為空 * list_qsort ```clike= static void list_qsort(struct list_head *head) { struct list_head list_less, list_greater; struct listitem *pivot; struct listitem *item = NULL, *is = NULL; if (list_empty(head) || list_is_singular(head)) return; INIT_LIST_HEAD(&list_less); INIT_LIST_HEAD(&list_greater); pivot = list_first_entry(head, struct listitem, list); list_del(&pivot->list); list_for_each_entry_safe (item, is, head, list) { if (cmpint(&item->i, &pivot->i) < 0) list_move_tail(&item->list, &list_less); else list_move(&item->list, &list_greater); } list_qsort(&list_less); list_qsort(&list_greater); list_add(&pivot->list, head); list_splice(&list_less, head); list_splice_tail(&list_greater, head); } ``` 由第 13 行可以知道， pivot 挑選第一個 entry 第 26~28 行是將切割過的 list 重新收集、排好