2018q1 Homework3 (list)

contributed by <rwe0214>

tags: `Linux`, `rwe0214`, `NCKU`

作業說明

D05: list

生日悖論

如何透過近似的運算式，估算上述機率計算呢？附上有效的 LaTeX 算式

假設 1 年有 365 天
n 個人生日不重複的事件為
$C_{n}^{365} \times n!$
樣本數為
$365^{n}$
機率
$p$ 為
$\frac{C_{n}^{365} \times n!}{365^{n}}$
則 n 個人生日重複的機率
$p^{'}$ 為
$1 - \frac{C_{n}^{365} \times n!}{365^{n}}$
- $C_{n}^{365}$ 怎麼計算？
將
$p$ 展開可得
$\frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times . . . \times \frac{365 - n + 1}{365} = 1 \times (1 - \frac{1}{365}) \times (1 - \frac{2}{365}) \times (1 - \frac{3}{365}) \times . . . \times (1 - \frac{n - 1}{365})$
- 引入泰勒級數且因為
  $x << 1$ ，所以
  $e^{x} \approx 1 - x$
- 原式變成
  $1 \times e^{\frac{- 1}{365}} \times e^{\frac{- 2}{365}} \times e^{\frac{- 3}{365}} \times . . . \times e^{\frac{- n}{365}} = e^{\frac{- n (n - 1) / 2}{365}} = e^{\frac{- n (n - 1)}{730}}$
- $p^{'} = 1 - e^{\frac{- n (n - 1)}{730}}$
將 n = 22 代入
$p^{'} \approx 1 - e^{\frac{- 462}{730}} = 1 - e^{\frac{- 231}{365}} = 1 - e^{- 0.6328} \approx 1 - 0.531 = 0.469$
將 n = 23 代入
$p^{'} \approx 1 - e^{\frac{- 506}{730}} = 1 - e^{\frac{- 253}{365}} = 1 - e^{- 0.69315} \approx 1 - 0.4999 = 0.5001$ ，機率為
$50.01 ％$

Birthday problem 對資訊安全的影響在哪些層面

生日攻擊即利用在 hash function 中的 collision 現象做攻擊。
一個設計良好的 hash function 不容易利用 hash value 反推回原本的明文
$t_{1}$ ，但是若是這個 hash value
$h (t_{1})$ 的 bit 數不夠多（即值域不夠大），那麼我們可以透過生日悖論的想法，去枚舉找到另一份明文
$t_{2}$ 其
$h (t_{2}) = h (t_{1})$
也許枚舉這個動作聽來不怎麼可怕，但是將生日悖論求得的數學式代入會發現
- $p = e^{\frac{- m (m - 1) / 2}{2^{n}}} \approx e^{\frac{m^{2}}{2^{n}}}$
- $p^{'} \approx 1 - e^{\frac{m^{2}}{2^{n}}}$
- 若
  $p^{'} > 0.5$ 則
  $m \approx 2^{n / 2}$
- 換句話說，我只需要枚舉
  $2^{n / 2}$ 的
  $t_{2}$ 就有超過一半的機率使得
  $h (t_{2}) = h (t_{1})$

像是 Random Number Generator 這樣的產生器，是否能產生「夠亂」的亂數呢？

在這個亂數產生器中的 FAQ 有寫到一點：

Although no computer algorithm can produce numbers that are truly random, 
Stat Trek's Random Number Generator produces numbers that are nearly random.

換句話說這個產生器也只能產生接近真的亂數的偽亂數而已
那要如何產生真正的亂數呢？
- 在這個產生器裡，亂數產生是根據使用者輸入的 seed 所 "計算" 出來的。
- 所以我們只要在每次產生亂數的時候，隨機的輸入 seed 的就行了！
  - ？？？
  - 產生亂數的時候要先隨機輸入 seed，這不就是蛋生雞還是雞生蛋一樣的問題了嗎！
參考資料
- 隨機亂數生成演算法獲突破
- 你的程式夠「亂」嗎？

Linux 風格的 linked list

為何 Linux 採用 macro 來實作 linked list？一般的 function call 有何成本？

function call 會在程式呼叫時將 function 所需 argument 和 pc 返回地址 push 進 stack，並且在 function 結束時從 stack pop 出正確的 pc 返回地址。
而 macro 則是會在程式呼叫的地方直接展開插入程式碼，而省略了 push 和 pop 的動作。

Linux 應用 linked list 在哪些場合？舉三個案例並附上對應程式碼，需要解說，以及揣摩對應的考量

在老師的講座中有提到，linux 在排程管理上使用了 linked list 來實做，所以我便以這個方向去查詢。

linux/kernel/sched/sched.h







/*
 * This is the priority-queue data structure of the RT scheduling class:
 */
struct rt_prio_array {
	DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
	struct list_head queue[MAX_RT_PRIO];
};

功能：process 管理
依據註解的說明這個 struct 是 real-time schedule 的 priority-queue，而這個 queue 是以 linked list 來實做

linux/mm/slab.h























/*
 * Common fields provided in kmem_cache by all slab allocators
 * This struct is either used directly by the allocator (SLOB)
 * or the allocator must include definitions for all fields
 * provided in kmem_cache_common in their definition of kmem_cache.
 *
 * Once we can do anonymous structs (C11 standard) we could put a
 * anonymous struct definition in these allocators so that the
 * separate allocations in the kmem_cache structure of SLAB and
 * SLUB is no longer needed.
 */
struct kmem_cache {
	unsigned int object_size;/* The original size of the object */
	unsigned int size;	/* The aligned/padded/added on size  */
	unsigned int align;	/* Alignment as calculated */
	slab_flags_t flags;	/* Active flags on the slab */
	unsigned int useroffset;/* Usercopy region offset */
	unsigned int usersize;	/* Usercopy region size */
	const char *name;	/* Slab name for sysfs */
	int refcount;		/* Use counter */
	void (*ctor)(void *);	/* Called on object slot creation */
	struct list_head list;	/* List of all slab caches on the system */
};

功能：kernel memory 的管理
由註解中可以看到在這個 struct 中也是以 linked list 穿接起系統中所有的 slab cache
What is slab cache ?
- Slab 算是 Kernel 裡最早的一種演算法，事實上，除了 Slab 之外，Linux Kernel 已經實作出數種演算法，以優化記憶體的配置機制。前面提到，Kernel 的底層是以 page 為記憶體配置的單位，因此，哪怕只是需要 1 Byte 的空間，都會佔用一個 page，所以良好的 Slab 演算法，才能盡可能避免記憶體浪費和破碎的問題。
- Slab Allocator 實作了一個 cache 的架構（這裡的 cache 並非意指快取），對系統程式下的記憶體需求群組化，以達到管理的目的。一般的程式可以藉由向 Slab 註冊 cache，以得到記憶體的配置。
- 參考資料：Linux Kernel 記憶體管理機制之美

linux/mm/swap.c

















/**
 * put_pages_list() - release a list of pages
 * @pages: list of pages threaded on page->lru
 *
 * Release a list of pages which are strung together on page.lru.  Currently
 * used by read_cache_pages() and related error recovery code.
 */
void put_pages_list(struct list_head *pages)
{
	while (!list_empty(pages)) {
		struct page *victim;

		victim = list_entry(pages->prev, struct page, lru);
		list_del(&victim->lru);
		put_page(victim);
	}
}

功能：釋放一連串的 page ( 以 lru 法則挑選 )
不僅行程排班是以 linked list 實做，連記憶體的 page 結構也是以 linked list 來完成。

GNU extension 的 typeof 有何作用？在程式碼中扮演什麼角色？

回傳一個 type，e.g. typeof(a) b 就會以 a 的型態宣告一個變數 b
可以幫助我們以更高層次的角度去封裝程式碼
參考資料：Using the GNU Compiler Collection (GCC): Typeof

解釋以下巨集的原理

#define container_of(ptr, type, member)                            
    __extension__({                                                
        const __typeof__(((type *) 0)->member) *__pmember = (ptr); 
        (type *) ((char *) __pmember - offsetof(type, member));    
    })

這個 macro 定義了 ptr 去指向 member 這個結構成員的地址，這個地方也使用 typeof 來讓 ptr 可以是使用者所指定的 type。
最後再減掉這個 member 在原本 struct 中的記憶體位移量來算出 struct 的記憶體起始位置。

除了你熟悉的 add 和 delete 操作，list.h 還定義一系列操作，為什麼呢？這些有什麼益處？

linux/inlude/list.h
一打開便在開頭發現了註解

/*
 * Simple doubly linked list implementation.
 *
 * Some of the internal functions ("__xxx") are useful when
 * manipulating whole lists rather than single entries, as
 * sometimes we already know the next/prev entries and we can
 * generate better code by using them directly rather than
 * using the generic single-entry routines.
 */

說明在某些已經知道 pre/next 節點的情況下，使 __xxx 版本的 linked list 操作會更 useful。

`LIST_POISONING` 這樣的設計有何意義？

linux/include/poison.h







/*
 * These are non-NULL pointers that will result in page faults
 * under normal circumstances, used to verify that nobody uses
 * non-initialized list entries.
 */
#define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
#define LIST_POISON2 ((void *) 0x200 + POISON_POINTER_DELTA)

當程式 access 到 LIST_POISONING 時便會發出 page fault 以便確保沒有人會使用到沒有初始化過得 list entry ( 被刪除掉的 )
而在 list_del() 中可發現被刪除的 list entry 都會指向 LIST_POISONING






static inline void list_del(struct list_head *entry)
{
	__list_del_entry(entry);
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

linked list 採用環狀是基於哪些考量？

因為 linux 使用 doubly linked list 使用 circular 可以省掉總是需要兩個指標去指向頭和尾的空間，並在操作上也省去這些步驟。

`list_for_each_safe` 和 `list_for_each` 的差異在哪？"safe" 在執行時期的影響為何？

參考 LIST_POISONING 這樣的設計有何意義？
在 list_for_each_safe 中多了一個變數儲存 pos 的地址，這樣可以確保如果在迴圈中使用 list_del 刪除節點後，程式指標不會發生找不到 head 而無限迴圈的現象。





#define list_for_each(pos, head) 
for (pos = (head)->next; pos != (head); pos = pos->next)
    
#define list_for_each_safe(pos, n, head) 
for (pos = (head)->next, n = pos->next; pos != (head); pos = n, n = pos->next)

for_each 風格的開發方式對程式開發者的影響為何？

參考資料：Foreach loop | Programmer's Wiki | FANDOM powered by Wikia
以程式開發者來說 for 已經相當方便並且可以自由的制定範圍，但是若只是單純想要走訪一個資料結構的全部資料，使用 for 迴圈可能需要撰寫額外的程式碼，但是若使用 for_each 可以使程式碼更簡潔。

程式註解裡頭大量存在 `@` 符號，這有何意義？你能否應用在後續的程式開發呢？

參考資料：Doxygen Manual: Special Commands
Doxyygen 是一個可以自動根據程式碼的註解內容產生說明文件的產生器，如果依照其規定的格式撰寫註解，這樣就可以省下額外多寫一份說明文件的步驟。
而 @ 符號是 Doxygen 裡規定的註解開頭。

`tests/` 目錄底下的 unit test 的作用為何？就軟體工程來說的精神為何？

:~/linux-list/tests$ ls
common.h           list_add_tail.ok       list_del.ok            list_for_each_entry.o         list_for_each_safe.ok   list_init-local.ok    list_move.ok          list_splice.ok
containerof        list_cut_position      list_entry             list_for_each_entry.o.d       list_init-explicit      list_is_singular      list_move_tail        list_splice_tail
containerof.c      list_cut_position.c    list_entry.c           list_for_each_entry.ok        list_init-explicit.c    list_is_singular.c    list_move_tail.c      list_splice_tail.c
containerof.o      list_cut_position.o    list_entry.o           list_for_each_entry_safe      list_init-explicit.o    list_is_singular.o    list_move_tail.o      list_splice_tail_init
containerof.o.d    list_cut_position.o.d  list_entry.o.d         list_for_each_entry_safe.c    list_init-explicit.o.d  list_is_singular.o.d  list_move_tail.o.d    list_splice_tail_init.c
containerof.ok     list_cut_position.ok   list_entry.ok          list_for_each_entry_safe.o    list_init-explicit.ok   list_is_singular.ok   list_move_tail.ok     list_splice_tail_init.o
list_add           list_del               list_first_entry       list_for_each_entry_safe.o.d  list_init-global        list_last_entry       list_splice           list_splice_tail_init.o.d
list_add.c         list_del.c             list_first_entry.c     list_for_each_entry_safe.ok   list_init-global.c      list_last_entry.c     list_splice.c         list_splice_tail_init.ok
list_add.o         list_del_init          list_first_entry.o     list_for_each.o               list_init-global.o      list_last_entry.o     list_splice_init      list_splice_tail.o
list_add.o.d       list_del_init.c        list_first_entry.o.d   list_for_each.o.d             list_init-global.o.d    list_last_entry.o.d   list_splice_init.c    list_splice_tail.o.d
list_add.ok        list_del_init.o        list_first_entry.ok    list_for_each.ok              list_init-global.ok     list_last_entry.ok    list_splice_init.o    list_splice_tail.ok
list_add_tail      list_del_init.o.d      list_for_each          list_for_each_safe            list_init-local         list_move             list_splice_init.o.d
list_add_tail.c    list_del_init.ok       list_for_each.c        list_for_each_safe.c          list_init-local.c       list_move.c           list_splice_init.ok
list_add_tail.o    list_del.o             list_for_each_entry    list_for_each_safe.o          list_init-local.o       list_move.o           list_splice.o
list_add_tail.o.d  list_del.o.d           list_for_each_entry.c  list_for_each_safe.o.d        list_init-local.o.d     list_move.o.d         list_splice.o.d

可以看到在 test/ 目錄底下有許多小程式，且幾乎每個裡面皆有使用 assert() ，這時查看 MAKEFILE 發現在 make 的時候便會執行所有的小程式，並測試這些程式能不能正確執行完畢。
這些小程式的目的即檢測 list.h 內的 function 能不能正確執行。
當我們在撰寫大型專案的時候，常常會寫許多工具函式，如果沒有先確保這些工具函式的正確性，往後向上層設計整個系統的時候便會不知道是在何處出錯，是使用的工具函式的邏輯錯誤還是整個工具函式有問題
如果我們能夠先行利用 tests 的手法確認函式的正確性，更能增進往後撰寫程式的效率。

`tests/` 目錄的 unit test 可如何持續精進和改善呢？

參考 rex662624
可以增加變異性，並測試一些 boundary condition。或是故意測試一些會出錯的值，看程式碼是否能夠抓到並印出 warning。

針對 linked list 的排序演算法

對 linked list 進行排序的應用場合為何？

因 linux 行程管理是使用 linked list 串接，所以假設我們需要查看現在是哪些行程佔用了 cpu，並比較何行程佔的比例最大，就會使用到 linked list 的排序。

考慮到 linked list 在某些場合幾乎都是幾乎排序完成，這樣貿然套用 quick sort 會有什麼問題？如何確保在平均和最差的時間複雜度呢？

quick-sort 的時間複雜度，取決於一開始的 pivot 值能否有效的把資料有效切割。因為一般情形下 quick-sort 的 pivot 值通常是取第一個或是最後一個 element，所以如果對已經排序過的資料做 quick-sort，取出的 pivot 就會是最大值或最小值，此情形即為 quick-sort 的 worst-case。
如果能在取 pivot 的過程動一些些手腳，使 pivot 不為最大或最小，即能提昇時間複雜度，例如使用 medium-of-mediums 的方法。

能否找出 Linux 核心原始程式碼裡頭，對 linked list 排序的案例？

參閱 rex662624 的工具可以直接搜尋 linux 的程式碼
輸入指令 path: list_sort.h 可以看到所有使用 list_sort.h 函式庫的程式。
總共 28 matches found

透過 skiplist 這樣的變形，能否加速排序？

研讀 How to find the kth largest element in an unsorted array of length n in O(n)?，比照上方連結，是否能設計出針對 linked list 有效找出第 k 大 (或小) 元素的演算法？

linux-list 裡頭 `examples` 裡頭有兩種排序實作（insertion 和 quick sort)，請簡述其原理

insert-sort 是將資料從第一個 element 插入進一個新的 linked list ( head )，而新的 list 中是已經排序過的資料，並從頭比對資料大小做插入排序，而因為這個過程須走訪每個節點，所以時間複雜度為
$O (n^{2})$ 。
quick-sort 是利用 pivot 值將原本的資料序列做分割，然後分割後的 list 再遞迴呼叫 quick-sort 排序。
- 分割過程為將資料和 pivot 比大小，將比 pivot 小的是值搬到 list 頭，將比 pivot 小的值搬到 list 尾。
- 此過程是要確保在 pivot 之前資料都是小於 pivot 的，反之在 pivot 後的資料皆大於 pivot。
平均情形：
$O (n l o g n)$
- $T (n) = \frac{1}{n} \sum_{s = 1}^{n} (T (s) + T (n - s)) + c n$
- $T (n) = \frac{1}{n} [T (1) + T (n - 1)] + [T (2) + T (n - 2)] + . . . + [T (n - 1) + T (1)] + [T (n) + T (0)] + c n$
- $T (n) = \frac{2}{n} [T (1) + . . . + T (n - 1)] + \frac{T (n)}{n} + c n$
- $n T (n) = 2 [T (1) + . . . + T (n - 1)] + T (n) + c n^{2}$ ––––- ①
- $(n - 1) T (n - 1) = 2 [T (1) + . . . + T (n - 2)] + T (n - 1) + c (n - 1)^{2}$ ––––- ②
- ① - ② 可得
  $(n - 1) T (n) = n T (n - 1) + c (2 n - 1)$
- $\frac{T (n)}{n} = \frac{T (n - 1)}{n - 1} + c (\frac{1}{n} + \frac{1}{n - 1})$
- $\frac{T (n)}{n} = c (\frac{1}{n} + \frac{1}{n - 1} + . . . + \frac{1}{2}) + c (\frac{1}{n - 1} + \frac{1}{n - 2} + . . . + \frac{1}{1})$
- $\frac{T (n)}{n} = c (H_{n} - 1) + c (H_{n} - \frac{1}{n})$
- $T (n) = c n (H_{n} - 1) + c (n H_{n} - 1)$
- $T (n) = 2 c n H_{n} - c n - c$
- $T (n) = 2 c n l o g n - c n - c$
- $T (n) = O (n l o g n)$

實作測試程式，批次建立包含若干大小的 linked list，隨機放入數值為介於 1 到 366 之間 (包含) 的節點 (代表 day number) ，至少應該涵蓋 10¹, 10², 10³, … 10⁶ 等數量，透過前述的 insert, quick, merge sort 進行排序，分析執行時間並解讀其行為

檢驗程式碼
程式執行結果：

$ ./analyze
size	insert-sort	quick-sort	merge-sort
10^1	0.0000020971	0.0000040781	0.0000070296
10^2	0.0000390438	0.0000373689	0.0000674257
10^3	0.0018955263	0.0003176225	0.0005918828
10^4	0.2778701301	0.0058945494	0.0080722889
10^5	59.8833043724	0.2961517506	0.1274873439

可以看到 insert-sort 在時間上耗費相當的大，所以之後的比較省略 insert-sort。

$ ./analyze
size	quick-sort	merge-sort	which is fast
2^1	0.0000004582	0.0000008401	quick-sort (1.83 X)
2^2	0.0000011706	0.0000018430	quick-sort (1.57 X)
2^3	0.0000029139	0.0000042060	quick-sort (1.44 X)
2^4	0.0000065810	0.0000112332	quick-sort (1.71 X)
2^5	0.0000103636	0.0000182269	quick-sort (1.76 X)
2^6	0.0000133417	0.0000234677	quick-sort (1.76 X)
2^7	0.0000292101	0.0000531666	quick-sort (1.82 X)
2^8	0.0000681951	0.0001245088	quick-sort (1.83 X)
2^9	0.0001520672	0.0002769880	quick-sort (1.82 X)
2^10	0.0003306429	0.0005956724	quick-sort (1.80 X)
2^11	0.0007398336	0.0013145460	quick-sort (1.78 X)
2^12	0.0017331422	0.0028466553	quick-sort (1.64 X)
2^13	0.0043816538	0.0063244468	quick-sort (1.44 X)
2^14	0.0121712947	0.0136517630	quick-sort (1.12 X)
2^15	0.0380133697	0.0310422692	merge-sort (1.22 X)
2^16	0.1327709013	0.0761860735	merge-sort (1.74 X)

可以發現 merge-sort 在資料量大的時候表現比 quick-sort 優異許多。

2018q1 Homework3 (list)

tags: Linux, rwe0214, NCKU

作業說明

生日悖論

如何透過近似的運算式，估算上述機率計算呢？附上有效的 LaTeX 算式

Birthday problem 對資訊安全的影響在哪些層面

像是 Random Number Generator 這樣的產生器，是否能產生「夠亂」的亂數呢？

Linux 風格的 linked list

為何 Linux 採用 macro 來實作 linked list？一般的 function call 有何成本？

Linux 應用 linked list 在哪些場合？舉三個案例並附上對應程式碼，需要解說，以及揣摩對應的考量

GNU extension 的 typeof 有何作用？在程式碼中扮演什麼角色？

解釋以下巨集的原理

除了你熟悉的 add 和 delete 操作，list.h 還定義一系列操作，為什麼呢？這些有什麼益處？

LIST_POISONING 這樣的設計有何意義？

linked list 採用環狀是基於哪些考量？

list_for_each_safe 和 list_for_each 的差異在哪？"safe" 在執行時期的影響為何？

for_each 風格的開發方式對程式開發者的影響為何？

程式註解裡頭大量存在 @ 符號，這有何意義？你能否應用在後續的程式開發呢？

tests/ 目錄底下的 unit test 的作用為何？就軟體工程來說的精神為何？

tests/ 目錄的 unit test 可如何持續精進和改善呢？

針對 linked list 的排序演算法

對 linked list 進行排序的應用場合為何？

考慮到 linked list 在某些場合幾乎都是幾乎排序完成，這樣貿然套用 quick sort 會有什麼問題？如何確保在平均和最差的時間複雜度呢？

能否找出 Linux 核心原始程式碼裡頭，對 linked list 排序的案例？

透過 skiplist 這樣的變形，能否加速排序？

研讀 How to find the kth largest element in an unsorted array of length n in O(n)?，比照上方連結，是否能設計出針對 linked list 有效找出第 k 大 (或小) 元素的演算法？

linux-list 裡頭 examples 裡頭有兩種排序實作 （insertion 和 quick sort)，請簡述其原理

設計 deduplication 程式，將上述輸入中的數值挑出重複的節點並與刪除，透過統計分析，解讀執行時間分佈，以及探討 data deduplication 的效益

Read more

NCTU Machine Learning

LINUX 核心設計 課程用書筆記

Play in a Sandbox!?

2020 Machine Learning Homework 7

tags: `Linux`, `rwe0214`, `NCKU`

`LIST_POISONING` 這樣的設計有何意義？

`list_for_each_safe` 和 `list_for_each` 的差異在哪？"safe" 在執行時期的影響為何？

程式註解裡頭大量存在 `@` 符號，這有何意義？你能否應用在後續的程式開發呢？

`tests/` 目錄底下的 unit test 的作用為何？就軟體工程來說的精神為何？

`tests/` 目錄的 unit test 可如何持續精進和改善呢？

linux-list 裡頭 `examples` 裡頭有兩種排序實作（insertion 和 quick sort)，請簡述其原理

LINUX 核心設計課程用書筆記