D05: list

contributed by < rex662624 >

生日悖論

如何透過近似的運算式，估算上述機率計算呢？附上有效的 LaTeX 算式

若今天進入一個有22個人的房間內，加上自己有23個人。若用一年365天來計算，則在這個房間至少有一個人與我生日相同的機率已經超過50% 。而在這裡我利用 Poisson distribution 作解。
首先 Poisson distribution 敘述一事件發生的機率為

P (k e v e n t s i n i n t e r v a l) = e^{- λ} \frac{λ^{k}}{k!} 其 中 1. λ i s t h e a v e r a g e n u m b e r o f e v e n t s p e r i n t e r v a l 2. e i s t h e n u m b e r 2.71828 . . . (E u l e r^{'} s n u m b e r) t h e b a s e o f t h e n a t u r a l l o g a r i t h m s 3. k t a k e s v a l u e s 0, 1, 2, \dots 4. k! = k \times (k - 1) \times (k - 2) \times \dots \times 2 \times 1 i s t h e f a c t o r i a l o f k .

先寫出我們所求的結果應該為:

P (k > 0) = 1 - P (k = 0)

意思為，算出至少有一個 event k 發生(k>0)的機率，這個 event k 指的是幾個人與我的生日相同。而我們要算的

1 - P (k = 0)

已經有了

k = 0

，現在還缺了

λ

。這裡透過 Poisson approximation for the binomial知道

λ = n p 。 其 中

n = C_{2}^{23} ， 是 23 人 中 取 某 2 人 生 日 相 同 ， p = 1 / 365 ， 是 兩 個 人 生 日 在 某 一 天 的 機 率 。

因 此 λ = n p = \frac{C_{2}^{23}}{365} = 0.6932

。

結 果 : P (k > 0) = 1 - P (k = 0) = 1 - e^{- 0.6932} \frac{- {0.6932}^{0}}{0!} \approx 1 - 0.499998 \approx 0.500002

可以得證在23人共處一房間時，至少兩人生日相同的機率>50%。

Birthday problem 對資訊安全的影響在哪些層面？

參考birthday attack，這裡寫道通常birthday attack用於尋找hash function 的 collision 。因為若是做訊息通信，可能會先加密再傳出去，傳到空氣中的電磁波(或是在網路上傳遞的數據)，在中間若是被攔截，就可能會被透過這種方式被猜出未加密的訊息是甚麼。因此在有限長度之下，要產生一組 hash key 有可能重覆的機率就大幅的提升。

維基百科也寫道，To avoid this attack, the output length of the hash function used for a signature scheme can be chosen large enough so that the birthday attack becomes computationally infeasible，若要防範 birthady attack，則信息或是密碼的長度上限可以設大一點，大到 birthday attack 無法運算，或是運算出來要花十分長的時間。

因此 hash 的安全標準也可以由此驗證，一個 40bits 長的訊息用大約一百萬次隨機Hash可至少以1/2的概率找到一個 collision 。安全的 Hash 的標準輸出長度選爲 160 bits 是出於這種考慮。

像是 Random Number Generator 這樣的產生器，是否能產生「夠亂」的亂數呢？

首先，這個網址自己有講到Stat Trek's Random Number Generator produces numbers that are nearly random.。首先觀察這個 generator，發現它在 min=0 max=2147483647 時會在下面的 table 顯示 error (m>=0就不會)，2147483648時會在 calculate 上面顯示紅色error。大致可以推測它只能產生 32-bit 數字。而沒有放 seed 的時候每次產生的亂數都不同，放 seed 則是每次相同，因此他也說給 seed 只是 short-term convenience，不適用於長期。

沒有任何的亂數產生器可以產生完全 random 的數字，但是這個 generator 保證可以產生近乎 random 的數字。因此來想想這句話是否真的保證正確。

首先，維基百科寫道，大部分計算機上的偽隨機數(pseudo-random numbers)，並不是真正的隨機數，只是重複的周期比較大的數列，是按一定的算法和種子值生成的。。我認為這句話是關鍵所在。最上面我寫的那段只是說明了能夠處理的數字大小 max 是 32-bit 以內。但是究竟能處理「多少數字」。回去觀察generator 果然發現這個 tool 最多只能產生1000個數字。因此我推測它與維基百科解釋的相同，是先由 seed 算出並用一個很長的數列來儲存輸出的 random number 。因此他說的Stat Trek's Random Number Generator produces numbers that are nearly random.我覺得是正確的，因為這個 tool 本身就限定最多只能產生1000個數字。

但若是遇到要產生非常巨大數量的 random number ，我覺得類似這樣的 generator 就不可行，因為若用數列產生，一定會有走到數列盡頭的一天。(用birthday problem也可以知道數列長度365，只要23個數字就有50%機率有重複)。
但一定也有解套的方法，例如若要產生1000個 6位數，內建 random number list 長度是100，則seed產生100個數字後，前三位先從100個數字隨機選一個，後三位再從100個數字隨機選一個，用如此的方法切割數字，只要存100個數字就有100*100種可能。比起原本若是只印出產生的 list ，會重複10次相同的數列。或許用類似的方法，就能突破 generator 會受限於內建數列長度的瓶頸。

結論:要能夠定義數字「夠亂」到底是甚麼程度，還有針對甚麼數據。這類的 random number generator，若是針對小量的數字，可以產生近乎亂數的數字，但是若是需求是超大量的數量，可能就要有另外的演算法突破。

Linux 風格的 linked list

為何 Linux 採用 macro 來實作 linked list？一般的 function call 有何成本？
function call 在執行時，必須要將傳進來的 argument copy 一份 push 進 stack ，而且必須先儲存本來的 state ，等等 return 時才能 pop出來。因此如此的在 stack 上 push pop 的動作，時間成本較多。而 marco 在 compile 時就會展開程式碼到呼叫的地方，因此雖然有了空間上的成本，但是時間上卻可以免去 push pop 的動作。
Linux 應用 linked list 在哪些場合？舉三個案例並附上對應程式碼，需要解說，以及揣摩對應的考量

linux/include/linux/fs.h – 作用:管理 file system
根據檔案名稱，大致可以猜測到他是用來管理 file system 相關的 code 。
其中的 1002~1006行( linux kernel 隨時會更新，不一定準確) code 為





struct file_lock {
	struct file_lock *fl_next;	/* singly linked list for this inode  */
	struct list_head fl_list;	/* link into file_lock_context */
	struct hlist_node fl_link;	/* node in global lists */
	struct list_head fl_block;	/* circular list of blocked processes */

觀察註解寫道 struct file_lock represents a generic "file lock". It's used to represent POSIX byte range locks, BSD (flock) locks, and leases.
這個結構應該是在管理 file 被某一個 process 寫入時，其他的 process 就不能寫入這個 file 的 file lock 角色。可以發現這裡的第5行就用到 linked list ，把其他想用此 file 資源但被 block 的 process ，串成一個 linked list 。

linux/fs/inode.c – 作用:處理 inode 的各種操作
首先要先釐清 inode 到底是甚麼東西。根據 OS 課程張大緯老師的講義 ch11 其中的P8 寫道:
File Control Blocks (FCB) is a data structure contains information about a file. In UNIX, an FCB is called an inode. 可以知道它是用來儲存每個檔案的一些資訊。
再看看其中一段關於產生新inode 的 code:
```
struct inode *new_inode(struct super_block *sb)
{
struct inode *inode;

spin_lock_prefetch(&sb->s_inode_list_lock);

inode = new_inode_pseudo(sb);
if (inode)
	inode_sb_list_add(inode);
return inode;
}
```
可以看到第7行首先呼叫函式產生新的 inode 結構，而後第9行把他加進list裡面。繼續往下看呼叫的inode_sb_list_add(inode);，可以發現的確呼叫了 list add 把他串進 linked list 中，而推測這個 list 是用於做索引，讓系統可以一次走訪所有 inode。
```
void inode_sb_list_add(struct inode *inode)
{
spin_lock(&inode->i_sb->s_inode_list_lock);
list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
spin_unlock(&inode->i_sb->s_inode_list_lock);
}
```

linux/kernel/sched/fair.c– 作用:管理CPU排班
我推測 CPU 排班本來就會用到 list (至少我的OS作業是這樣做)，因此去找cpu 排班的code，但是找到後一直不知道為何是用 fair.c 作為檔名，後來才發現 linux 系統預設排班法是用Completely Fair Scheduler(CFS)，因此也蠻合理的。
首先我們知道 CFS 排班法是選擇花費 CPU 時間最少的 process 起來執行，因此讓我們看看選擇 process 的 function pick_next_entity()。


























pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
struct sched_entity *left = __pick_first_entity(cfs_rq);
struct sched_entity *se;

/*
 * If curr is set we have to see if its left of the leftmost entity
 * still in the tree, provided there was anything in the tree at all.
 */
if (!left || (curr && entity_before(curr, left)))
	left = curr;

se = left; /* ideally we run the leftmost entity */

 .......(程式碼省略)

/*
 * Someone really wants this to run. If it's not unfair, run it.
 */
if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
	se = cfs_rq->next;

clear_buddies(cfs_rq, se);

return se;
}

因為linux是利用 rbtree 來維護 process ，所以一開始找 leftmost ，最左邊的process，就是有最小 runtime 的process ，這裡第20行的 next 我本來以為就是linked list 的應用，結果發現他這個cfs_rq->next指的只是下一個要跑的人，並不是一個一個指下去的 list 關係。而這裡是用 rbtree 結構來維護 process，因此應該不算 linked list 。

參考這裡總算找到這份 fair.c 用 list_add(第11行) 應用 linked list 。這裡寫到， cfs_task 這個 list 是用來做 CPU Load balancing ，意思就是用來處理多核心狀況下，把負擔較大的CPU 分配工作給其他閒置的CPU 。















account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_add(&cfs_rq->load, se->load.weight);
if (!parent_entity(se))
	update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
#ifdef CONFIG_SMP
if (entity_is_task(se)) {
	struct rq *rq = rq_of(cfs_rq);

	account_numa_enqueue(rq, task_of(se));
	list_add(&se->group_node, &rq->cfs_tasks);
}
#endif
cfs_rq->nr_running++;
}

GNU extension 的 typeof 有何作用？在程式碼中扮演什麼角色？
typeof(a) 會 return a 的 type 。在程式碼中可以用來判斷是否兩者的 type 相同。
解釋以下巨集的原理






#define container_of(ptr, type, member)                            \
    __extension__({                                                \
        const __typeof__(((type *) 0)->member) *__pmember = (ptr); \
        (type *) ((char *) __pmember - offsetof(type, member));    \
    })

參考這裡，它的作用是用來取得 struct 結構的起始點，只要知道該結構的任一個成員的位址，就可以使用 container_of 巨集來算出該結構的起始位址。因此我們只要有指向某個 member 的 pointer，就能透過此巨集獲取指向某 struct 的 pointer 。而觀察程式碼，總共做了兩個動作。

先定義一個指向 member 這個 type 的指標 __pmember，然後將參數 ptr (就是我們的 input ，指向某個 member 的指標)餵給該指標。
然後將__pmember值(裡面存的是我們 input member 的address)，減去他在整個 struct 的 offset 值(另外有定義巨集 offsetof ，用來獲取某 member 與 struct 開頭的 offset)，就可以得到整個 struct 的起始 address。

除了你熟悉的 add 和 delete 操作，list.h 還定義一系列操作，為什麼呢？這些有什麼益處？
list.h 光是 list_add，就有 list_add_valid、hlist_add_behind、hlist_add_before、list_tail等等。而觀察程式碼一開始的註解寫到
```
/*
 * Simple doubly linked list implementation.
 *
 * Some of the internal functions ("__xxx") are useful when
 * manipulating whole lists rather than single entries, as
 * sometimes we already know the next/prev entries and we can
 * generate better code by using them directly rather than
 * using the generic single-entry routines.
 */
```
可以知道呼叫後面有 __XXX 的函式，比起 general 的 add ，這種特定操作的add 能帶來效能上的助益，因為當我們已經知道 next 或 prev 的 entry，我們或許可以善加利用這些資訊，而非只是用 general 的 add。

LIST_POISONING 這樣的設計有何意義？
List poisoning 在 wikipedia 上面的定義是用於電子郵件攻擊，被攻擊者用一些 invalid mail address 攻擊。而參考 linux-like-list ，裡面寫到了
















/**
* hlist_del() - Remove a hlist node from the hlist
* @node: pointer to the node
*
* The node is only removed from the hlist. Neither the memory of the removed
* node nor the memory of the entry containing the node is free'd. The node
* has to be handled like an uninitialized node. Accessing the next or pprev
* pointer of the node is not safe.
*
* Unlinked, initialized nodes are also uninitialized after hlist_del.
*
* LIST_POISONING can be enabled during build-time to provoke an invalid memory
* access when the memory behind the next/prev pointer is used after an
* hlist_del. This only works on systems which prohibit access to the predefined
* memory addresses.
*/

可以看到 12~15行說使用了hlist_del後，可能會發生LIST_POISONING，回頭看5~8行也有寫原因是因為用了這個函數後， node 只是從 list 裡面移除，list 所佔的 memory 或是裡面的資料都沒有被 free，因此若是此時又去訪問他的 next 或是 prev ，則這是不合法的，因為他被從 list 移除，操作他的行為應該要與操作一塊空白記憶體相同。
此外可以發現 linux/list.h 之中，list_del 將要刪除的 entry 的 next 和 prev 都指向 LIST_POISON 。

linked list 採用環狀是基於哪些考量？
1. 不需要 maintain 兩個 pointer front 和 rear ，只需要一個 last，我們就知道他的下一個 node 是 front。
2. 設計演算法較一般的 linked list 容易，彈性也較大，因為沒有第一個或是最後一個的問題，例如如果要改 last 到另一個 node ，一般的 list 需要把新 last 搬到後面，last->next=NULL 等動作，而 circular list 只要改 last指標就好，結構完全不需變動。

list_for_each_safe 和 list_for_each 的差異在哪？“safe” 在執行時期的影響為何？
首先比較程式碼







/**
 * list_for_each	-	iterate over a list
 * @pos:	the &struct list_head to use as a loop cursor.
 * @head:	the head for your list.
 */
#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)









/**
 * list_for_each_safe - iterate over a list safe against removal of list entry
 * @pos:	the &struct list_head to use as a loop cursor.
 * @n:		another &struct list_head to use as temporary storage
 * @head:	the head for your list.
 */
#define list_for_each_safe(pos, n, head) \
    for (pos = (head)->next, n = pos->next; pos != (head); \
	    pos = n, n = pos->next)

原本的版本處理 list_del 時會出問題。首先，我在LIST_POISONING那道題目最後講到，list_del 將要刪除的 entry 的 next 和 prev 都指向 LIST_POISON。因此如果是用 list_each 來 delete ，delete 完目標之後，下次回來for會做的是 pos =pos->next，但此時的 pos 已經是 LIST_POISON ，因此就會錯誤。
但是在 list_each_safe 的版本中，用一個 temp 指標n，在回到 for 的一開始就做 pos = n ，因此若是刪除 pos 這個 node ，下次回來會先執行 pos = n，而此時的n 是上一個循環存的pos->next，也就是刪除的人原本的 next ，因此就能成功的指向正常 list。

for_each 風格的開發方式對程式開發者的影響為何？
for each，for_each(InputIterator _First, InputIterator _Last, Function _Func);，可以讓使用者針對範圍( first 到 last )去做特定的 function ，比起原來用 for 迴圈，再 call function，程式碼簡潔得多。
程式註解裡頭大量存在 @ 符號，這有何意義？你能否應用在後續的程式開發呢？
參考 doxygen 。 Doxygen 是一個 document generator ，可以將程式中的註解轉換成為說明文件。如果我們寫註解的時候，可以依據某種格式，就可以不用另外寫說明文件，可以透過這個 generator 直接產生說明文件，若今天程式碼做修改，我們也不用再去維護文件，而是只要用 Doxygen 產生新版的文件就好。
而根據 doxygen 說明文件裡面第一句話就寫到了All commands in the documentation start with a backslash (\) or an at-sign (@).用@作為註解開頭，是 Doxygen command 規定的特定語法。我也思考 linux kernel 為什麼不用\作為開頭而是用@作為開頭，因為文件裡面寫兩者其實都可以。推測是因為 C 語言裡面的\是作為 escape character 。例如我們打printf(%s,"\"")，印出來的是"，因此本身C程式碼可能就會出現很多\，如果我們要搜尋整份程式碼寫的 doxygen 註解，用@才不會搜尋到我們不想要的資訊。
tests/ 目錄底下的 unit test 的作用為何？就軟體工程來說的精神為何？
tests/裡面分為許多檔案，而每個檔案的用處就是針對某個 list.h 的某個功能做測試。這樣的測試就軟體工程來說可以方便 debug 。先確定每個小模組是正確無誤的，然後再兜起來如果發生錯誤，我們就可以排除是某個小模組出錯的可能，而是可能在兜起來的邏輯上有錯誤。
tests/ 目錄的 unit test 可如何持續精進和改善呢？
可以增加變異性，並測試一些 boundary condition 。或是故意測試一些會出錯的值，看程式碼是否能夠抓到並印出 warning。

針對 linked list 的排序演算法

對 linked list 進行排序的應用場合為何？
1. 若使用者想知道 process消耗的 CPU 資源，可能就須經過 sort 把消耗資源比率由大排到小。
2. 管理 Memory 時，可能需要知道分散的 memory space 由小排到大的關係，才能用作例如 Best-Fit 或是 Worse-fit 等分配。
3. 第一個作業的 phonebook ，如果本來數據是打亂的，要把它變回原本按照英文字母排序的形式儲存在 list ，也要使用 sort。
考慮到 linked list 在某些場合幾乎都是幾乎排序完成，這樣貿然套用 quick sort 會有什麼問題？如何確保在平均和最差的時間複雜度呢？
首先看到 linux kernel 的 list sort 是用 Merge sort，而不是用 quicksort 。比較一下兩者的 Complexity 差異:來源

	mergesort	quicksort
best case	$O (n l o g n)$	$O (n l o g n)$
average case	$O (n l o g n)$	$O (n l o g n)$
worst case	$O (n l o g n)$	$O (n^{2})$
可以發現主要是考慮到 Worse case ， quicksort 的效能將會大幅下降，因此在不清楚資料分布的狀況下選用 Mergesort 是較好的選擇。而 Quick sort 遇到「幾乎已排好」或「幾乎正好排顛倒」的 data 時會遇到 worse case。

能否找出 Linux 核心原始程式碼裡頭，對 linked list 排序的案例？
利用這個工具可以從 linux 程式碼找關鍵字，另外想起上作業系統時，老師第一堂課有介紹 linux lsr 應該也是用來尋找關鍵字的網站。首先 linux 的 sort function 是命名為 list_sort。因此我鍵入關鍵字看能夠找到甚麼案例。
發現檔名比較親切的只有 linux/fs/ext4/fsmap.c，裡面的 line 451 有用到 list_sort。乍看之下是用來顯示 file system 的 information (參考這裡和這裡)。而裡面的 function ext4_getfsmap_find_fixed_metadata 裡面用到 sort，這個 function 的目的是獲取 filesystem 的 metadata。先收集完全部之後存在 meta_list ，然後會用 list_sort 做一次排序。
透過 skiplist 這樣的變形，能否加速排序？

skiplist是上圖這樣的資料結構。最底下一層是普通的 list ，儲存著所有的 element，而上面幾層則越來越少，每層都是下面一層的 subset。
若是對一個還沒排好的 skiplist 進行排序，我不認為比起一般的 list 可以帶來顯著的加速。因為若是先對上面的 list 做 sort ，到下面一層後還是要看上面已經排序好的那些 element ，才能知道確切要放在什麼位置。
研讀 How to find the kth largest element in an unsorted array of length n in O(n)?，比照上方連結，是否能設計出針對 linked list 有效找出第 k 大 (或小) 元素的演算法？
參考演算法講義09_Medians and Order Statistics P19 ，(或是上方連結有寫到)但這種演算法是尋找第 i 小的元素，因此稍加修改可以做出尋找第 i 大的元素
找出第 i 大的元素 pseudocode 為 :

Select(i)
{
   /*找 median-of-medians*/
   將n個元素分成 ⌈n/5⌉ 個由5個元素構成的小群
   找出每群的中位數
   利用 Select 函數找出這 ⌈n/5⌉ 個中位數的中位數 x
   
   /*找出第 i大的元素*/
   利用 x 作為 Partition 使用的 Pivot ， Partition 後 x=A[k]
   如 i=k 則 return x
   如 i>k 則對大於 x 的那些元素做 Select(i)
   如 i<k 則對小於 x 的那些元素做 Select(i-(n-k)) //因為全部index > k 的人都比他大，共 n-k 個。
}

linux-list 裡頭 examples 裡頭有兩種排序實作（insertion 和 quick sort)，請簡述其原理
程式碼:
































static void list_insert_sorted(struct listitem *entry,     struct list_head *head)
{
struct listitem *item = NULL;

if (list_empty(head)) {
    list_add(&entry->list, head);
    return;
}

list_for_each_entry (item, head, list) {
    if (cmpint(&entry->i, &item->i) < 0) {
        list_add_tail(&entry->list, &item->list);
        return;
    }
}

list_add_tail(&entry->list, head);
}

static void list_insertsort(struct list_head *head)
{
struct list_head list_unsorted;
struct listitem *item = NULL, *is = NULL;

INIT_LIST_HEAD(&list_unsorted);
list_splice_init(head, &list_unsorted);

list_for_each_entry_safe (item, is, &list_unsorted, list) {
    list_del(&item->list);
    list_insert_sorted(item, head);
}
}

首先看上方的 insertion sort ， list_insertsort()在 30 行，會把 item 傳進 list_insert_sorted() 內，而意義就是 insertion sort 的把目前要 insert 的 key 與目前已經排好的 list 做比較。因此在第10~14行，就是走訪目前 list ，直到找到比 key 還大的人就停下來插入 list。





























static void list_qsort(struct list_head *head)
{
struct list_head list_less, list_greater;
struct listitem *pivot;
struct listitem *item = NULL, *is = NULL;

if (list_empty(head) || list_is_singular(head))
    return;

INIT_LIST_HEAD(&list_less);
INIT_LIST_HEAD(&list_greater);

pivot = list_first_entry(head, struct listitem, list);
list_del(&pivot->list);

list_for_each_entry_safe (item, is, head, list) {
    if (cmpint(&item->i, &pivot->i) < 0)
        list_move_tail(&item->list, &list_less);
    else
        list_move(&item->list, &list_greater);
}

list_qsort(&list_less);
list_qsort(&list_greater);

list_add(&pivot->list, head);
list_splice(&list_less, head);
list_splice_tail(&list_greater, head);
}

以上是 quicksort 程式碼，可以注意到13~14行是在做選出 pivot key。然後16~21行，是對 list 中的每個數字，若是比 pivot 大就加入大的 list ，反之則加入小的 list。而23~24行對大小 list 做 recursive call 。最後終止條件在第7~8行。