2018q1 Homework5 (list)

tags: `linyunwen`

contributed by <LinYunWen>

D05: list

生日悖論

如何透過近似的運算式，估算上述機率計算呢？附上有效的 LaTeX 算式

理解
- 因為如文章裡面所述說道的，直觀上會認為機率就是
  $\frac{n}{365}$ ，但是實際上，我們所要找的是兩人生日相同的機率是多少。而不是你進入了一個有著 22 個人的房間，房間裡有人會和你有相同生日的機率，因此我們先照文章中所假設的人數為 23 人。
- 接下來就是選擇生日，因為第一個人可以從 365 中任選一天，所以是
  $\frac{365}{365} = 1$ ，第二個人因為不能重複，所以它只能從 365 天中的其他 364 天，故是
  $\frac{364}{365}$
- 以此類推，依序就會是
  $\frac{363}{365}$ 、
  $\frac{362}{365}$ 、
  $\frac{361}{365}$ … ，因為只有23個人，所以最後到
  $\frac{343}{365}$
- 因此這 23 人不在同一天生日的機率是
  
  $\frac{365}{365} \times \frac{364}{365} \times \frac{364}{365} . . . \frac{343}{365}$
- 而我們所要尋找的在同一天生日的機率即為
$\begin{aligned} 1 - \frac{365}{365} \times \frac{364}{365} \times \frac{364}{365} . . . \frac{343}{365} \\ = 1 - [1 \times (1 - \frac{1}{365}) \times (1 - \frac{2}{365}) . . . \times (1 - \frac{23}{365})] \end{aligned}$
- 因為自然常數次方的泰勒展開式為
  
  $e^{x} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + . . . e^{x} \approx 1 + x S u p p o s e x i s - \frac{a}{365}, 1 \leq a \leq 23 e^{- \frac{a}{365}} \approx 1 - \frac{a}{365}$
- 因此
  
  $p (n) \approx 1 \times e^{- \frac{1}{365}} \times e^{- \frac{2}{365}} \times e^{- \frac{3}{365}} \times . . . e^{- \frac{23}{365}} = e^{- \frac{(1 + 2 + 3 + . . . + 23)}{365}} = e^{- \frac{276}{365}} \approx 0.46946366059$
  
  $1 - 0.46946366059 = 0.53053633941$
- 若是只有 22 人則為
  
  $1 - \frac{365}{365} \times \frac{364}{365} \times \frac{364}{365} . . . \frac{344}{365} = 1 - \frac{P_{22}^{365}}{365^{22}} = 1 - e^{- \frac{253}{365}} \approx 0.49999824781$
  
  $1 - 0.49999824781 = 0.50000175219$
- 若只有 21 人
  
  $1 - \frac{365}{365} \times \frac{364}{365} \times \frac{364}{365} . . . \frac{345}{365} = 1 - e^{- \frac{231}{365}} \approx 0.53106188922$
  
  $1 - 0.53106188922 = 0.46893811078$

Birthday problem 對資訊安全的影響在哪些層面？

一開始我真的想不到有什麼資訊安全跟 birthday paradox 有關係，所以我先上網查詢，最先查到的就是 birthday attack ，但是有些看不懂，只能大概知道他在寫跟機率有相關的部份，於是繼續查才比較了解
Reference: The Birthday Problem🎈

birthday attack 介紹：

首先要先了解一下雜湊 (hash)

所謂『雜湊函數』（Hash Function），是將不定長度訊息的輸入，演算成固定長度雜湊值的輸出，且所計算出來的雜湊值必須符合兩個主要條件：
(1) 由雜湊值是無法反推出原來的訊息
(2) 雜湊值必須隨明文改變而改變。
也就是說，不同明文所計算出來的雜湊值必須是不相同的，甚至僅改變明文中任何一個字元時，雜湊值的輸出也必須差異很大。由此可見，期望中的雜湊值就好像是一個無法冒充的識別碼，且每一份文件都有其特殊的雜湊值，且無法被偽造，故有數位指紋Digital Fingerprint之稱。

一般雜湊函數必須具備下列功能：

雜湊函數必須對任意長度的訊息輸入，產生固定長度的雜湊值輸出。

對於任意訊息 M，雜湊函數可以輕易的計算出 H(M)，並且可經由硬體或軟體來實現。

如果給予雜湊值 h，在計算上是無法找出原訊息 M，使其符合 h = H(M)，此特性稱之為『單向雜湊』（One-way Hash）。

對於一個訊息 M1，在計算上是無法找出另一個訊息 M2 （≠ M1），使得 H(M1) = H(M2)。也就是說，給定一個雜湊值（H(M1)），須無法找出另一個訊息（M2），使其所產生的雜湊值相同。

就兩訊息 M1、M2 而言，若他們的雜湊值相等，亦即 H(M1) = H(M2)，則 M1 與 M2 兩訊息也一定相等（M1 = M2）；同理，若 H(M1)≠H(M2)，則 M1≠M2。也就是說，給定一個明文與雜湊值，須保證無法找出另一個訊息來產生同樣的雜湊值。

第4項是雜湊函數最基本功能，亦即給予一段不定長度的訊息，能輕易計算出一個固定長度的雜湊值。如果反過來，給予一個固定長度的雜湊值，是無法計算出原來訊息的，也無法找出可以產生同樣雜湊值的另一段訊息，如能達到此功能，一般稱之為弱雜湊函數 (Weak Hash Function)。若再涵蓋第 5 項功能，則稱為強雜湊函數（Strong Hash Function），其表示有了明文與雜湊值，也無法另外偽造一個明文來產生相同的雜湊值。

基本上，雜湊函數是將原來訊息打亂，得到另一個亂無章法的雜湊值，而且是越亂越好。很不幸的，我們發現大部分的傳輸訊息都有一定的格式，譬如:信件一開始就有 “Dear”，結束時又出現 “Your Best Regards”。或者資料檔案都有一定的格式，攻擊者既然知道了明文，就可以依照這些標準格式去修改內容，以達到相同的雜湊值。

我們用一個例子來說明，假設春嬌希望約志明明天見面（See you tomorrow），寫好了信件，並建立了雜湊值（也許經過加密），世雄找出 {See you yesterday}、{See you upset}、{don’t see you again} 等等，只要能符合相同的雜湊值即可，再將偽造信與春嬌簽署的簽署碼（加密的雜湊值）一起發送給志明。

Reference: 雜湊

了解完 hash 之後，知道說如果今天想要破解 hash value 就是找出相同的 hash value ，但是以數學的觀點來看訊息經由雜湊演算法計算後，所產生的雜湊碼越長，則可能遺失的訊息就越少；如果採用較短的雜湊碼，可能遺失的訊息相對較多，不同訊息之間產生相同雜湊碼的機會也相對較大，這種觀念如同加密演算法，雜湊碼的位元數越長就越安全，因此攻擊者嘗試用不同的訊息來產生相同的 hash value ，有點像是暴力攻擊法，但在雜湊演算法常稱為生日攻擊法 (birthday attack)
數學上， birthday paradox 是要 n 個人同時在場，他們至少有一組人，生日在同一天的機率會超過 50％以上，而現在換成 hash value 問題，變成

hacker 要嘗試 n 個訊息，即會有 50％以上的機率產生至少一組相同 hash value

birthday attack 數學運算機率

假設 hash value 的長度為 64 bits ，照該明文的格式修改其內容，譬如，保留原來空白鍵、使用較接近的文字，製造出其它偽造明文；並經過雜湊演算法計算出是否與原明文相同的雜湊值。
因此每次產生不同 hash value 的機率是

$\frac{2^{64}}{2^{64}} \times \frac{2^{64} - 1}{2^{64}} \times \frac{2^{64} - 2}{2^{64}} . . . \frac{2^{64} - n}{2^{64}}$
會產生相同的機率就是用全部的機率減掉產生不同的機率，並且求得大於一半的 n

$1 - \frac{2^{64} \times (2^{64} - 1) \times (2^{64} - 2) . . . (2^{64} - n)}{2^{64 \times n}} > 0.5$

$p (n) \approx 1 \times e^{- \frac{1}{2^{64}}} \times e^{- \frac{2}{2^{64}}} \times e^{- \frac{3}{2^{64}}} \times . . . e^{- \frac{n}{2^{64}}} = e^{- \frac{(1 + 2 + 3 + . . . + n)}{2^{64}}} = e^{- \frac{(1 + n) n / 2}{2^{64}}} = e^{- \frac{(1 + n) n}{2^{65}}}$

$1 - e^{- \frac{(1 + n) n}{2^{65}}} > \frac{1}{2} \frac{1}{2} > e^{- \frac{(1 + n) n}{2^{65}}} - l n 2 > - \frac{(1 + n) n}{2^{65}} . . . 同取 l n 2^{65} l n 2 < (1 + n) n l n 2 \approx 0.69314718056 \approx \frac{1}{2} 2^{65} \times \frac{1}{2} < (1 + n) n 2^{64} < (1 + n) n n \approx 2^{32} \approx 4294967296$

而可以看到一般電腦跑 46 億次，差不多 10s~1m，所以這個有一定程度的危險
















#include <stdio.h>
#include <time.h>

int main() {
    long int i = 0;
    time_t start, end;

    time(&start);
    while (i < 4294967296) {
        i++;
    }
    time(&end);
    printf("time: %lf\n", difftime(end, start));
    return 0;
}
// output: time: 9.000000

像是 Random Number Generator 這樣的產生器，是否能產生「夠亂」的亂數呢？

基本上操作適當的話是可以作到夠亂的亂數，如 c 裡的 rand() 他的種子預設為 0 ，因此如果重複執行，每次的結果都會一樣，如以下範例：















#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
    unsigned long int arr[10] = {0};
    for (unsigned long int i = 0; i < 100000000; i++) {
        arr[rand()%10]++;
    }

    for (int i = 0; i < 10; i++) {
        printf("%d: %lu\n", i, arr[i]);
    }
    return 0;
}

得出來的結果皆會是










0: 9999188
1: 10000364
2: 9996614
3: 10003534
4: 10000558
5: 9992621
6: 10002100
7: 9999641
8: 9998927
9: 10006453

因此多會加上利用時間來作為 rand() 的種子


srand(time(NULL));

因為基本上時間都會是唯一的，因此適合拿來做種子，但是若這隨機的程式攸關安全性，應該要具備不可預測性，而不是任何人都可以猜到的時間，因此有些程式或是硬體

為了滿足更嚴格的需求，現代的作業系統會定期蒐集硬體訊號攪拌入系統內建的位元池當中，供給應用程式作為產生亂數的依據。一些可能被使用的資訊像是：兩次鍵盤觸發的時間間隔、滑鼠移動距離、中斷觸發的時間間隔、甚至是各種硬體雜訊。這些資訊非常難以全盤掌握而且持續在更新，所以遠比純粹數學的方法安全，但也不是完全不可能破解。如果一個系統接受外部資訊的管道很少，理論上有可能被有心人士摸清內部狀態，甚至這些管道也有機會被操弄。
Reference: 電腦的隨機數是如何做到的?

但是事實上，這些亂數方式，看似是隨機，其實仍然是由一套明確的操作運算出來，如 c rand() 原始碼





























long int
random (void)
{
  if (rand_type == TYPE_0)
    {
      state[0] = ((state[0] * 1103515245) + 12345) & LONG_MAX;
      return state[0];
    }
  else
    {
      long int i;
      *fptr += *rptr;
      /* Chucking least random bit.  */
      i = (*fptr >> 1) & LONG_MAX;
      ++fptr;
      if (fptr >= end_ptr)
    {
      fptr = state;
      ++rptr;
    }
      else
    {
      ++rptr;
      if (rptr >= end_ptr)
        rptr = state;
    }
      return i;
    }
}

Reference: gcc(github)、What common algorithms are used for C's rand()?

可以很明顯的看到這是一組可預期的數學運算，這一來其未必能稱作是真正的隨機，真正的隨機是不可預期，所以這些看似隨機卻並不是真正隨機的方式，稱具有偽隨機性(Pseudorandomness)，如果直接下 man rand 看到說明第一句也程述出
```
The rand() function returns a pseudo-random integer in the range 0 to
RAND_MAX inclusive (i.e., the mathematical range [0, RAND_MAX]).
```
因此對現在一般的電腦來說是不可能運算出真正的隨機，頂多使得這個類似隨機的機制夠亂、夠不可預期，或許如老師所說，之後的量子電腦將會成功時做出真正的亂數產生器
Reference: 你的程式夠「亂」嗎？、man rand(3)、量子電腦－曙光乍現、【量子密碼】亂數試金石

Linux 風格的 linked list

為何 Linux 採用 macro 來實作 linked list？一般的 function call 有何成本？

首先可以先理解兩者的差別
最主要的差別在於 macro 會在 compile 前，行處理，把該體換的地方替換掉，因此不像 function 一樣需要傳入參數，或是呼叫時要先紀錄狀態等等，相對的執行速度上就會較快，但是其是以空間換取時間的方式進行，而且 macro 的置換有相對的問題存在，如：operators priority、condition flow 等等

e.g.

#define square(a) a*a
square(5) --> 5*5 --> 25
square(1+2) --> 1+2*1+2 --> 5 // unexpected

整理出來就是
- Advantages
  - It's preprocessed.
  - Not need to pass arguments like function.
  - Time efficiency.
- Disadvantages
  - Very hard to debug in large code.
  - Take more memory in stack compare to function.
因此可以推測經常性使用的 linked list 用 macro 方式去編寫，可以提高速度
Reference: What is the difference between a macro and a function in C？、Macros vs Functions

Linux 應用 linked list 在哪些場合？舉三個案例並附上對應程式碼，需要解說，以及揣摩對應的考量

linux 的 file system 即是一個 linked list 的應用















struct file {
    ...
    struct path		f_path;
    struct inode		*f_inode;	/* cached value */
    const struct file_operations	*f_op;
    
    ...
    spinlock_t		f_lock;
    enum rw_hint		f_write_hint;
    atomic_long_t		f_count;
    unsigned int 		f_flags;
    fmode_t			f_mode;
    struct mutex		f_pos_lock;
    loff_t			f_pos;
    ...

CPU 的排程
// TODO: complete here
Reference: Linux 當中的 VFS 虛擬檔案系統、The Linux Virtual File System

GNU extension 的 typeof 有何作用？在程式碼中扮演什麼角色？

typeof 是回傳這個變數的 type 的 function
- e.g. typeof(int a) 回傳 int

解釋以下巨集的原理





#define container_of(ptr, type, member)                            \
    __extension__({                                                \
        const __typeof__(((type *) 0)->member) *__pmember = (ptr); \
        (type *) ((char *) __pmember - offsetof(type, member));    \
    })

首先可以看到，在 linux/kernel.h 中也有類似的片段



#define container_of(ptr, type, member) ({          \
     const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
     (type *)( (char *)__mptr - offsetof(type,member) );})

根據其 macro 的說明意指 container_of 是給定資料結構的成員，回傳其資料結構的位址

/**
* container_of - cast a member of a structure out to the containing structure
* @ptr:    the pointer to the member.
* @type:   the type of the container struct this is embedded in.
* @member: the name of the member within the struct.
*
*/

e.g. 給定一個資料結構 container




struct container {
  int some_other_data;
  int this_data;
}

有一個變數 int *my_ptr 指向 this_data ，用 container_of 可以找到裝此位址的變數位址，如下


struct container *my_container;
my_container = container_of(my_ptr, struct container, this_data);

那回頭來看程式碼怎麼作到的，一開始先宣告出一個相同型態的指標，並且給定位址
```
const typeof( ((type *)0)->member ) *__mptr = (ptr);
```
再來就是要計算位移偏差量
```
offsetof(type,member)
```

最後將指標位置減去偏差量，就可以得到此資料結構的位址

struct container
+-----------------+    ^
+   some_data     +    |  offset
+-----------------+ <---- my_ptr 
+   this_data     + 
+-----------------+

Reference: Understanding container_of macro in the Linux kernel、Linked List in Linux Kernel、Linux的container_of 與 offsetof巨集

除了你熟悉的 add 和 delete 操作，list.h 還定義一系列操作，為什麼呢？這些有什麼益處？

還有像 list_replace 、list_empty 、 list_move_tail 這接雖然未必是必要的存在，但是如果有這些 function 可以使得工程師撰寫程式的負擔下降

而且光是 add 操作就有很多種 list_add_valid 、hlist_add_behind 、hlist_add_before 等等，可以方便性為寫出較好的程式碼，如一開始的註解說明









/*
 * Simple doubly linked list implementation.
 *
 * Some of the internal functions ("__xxx") are useful when
 * manipulating whole lists rather than single entries, as
 * sometimes we already know the next/prev entries and we can
 * generate better code by using them directly rather than
 * using the generic single-entry routines.
 */

LIST_POISONING 這樣的設計有何意義？

參考 ecsv/linux-like-list/list.h

























/**
 * list_del() - Remove a list node from the list
 * @node: pointer to the node
 * 
 * ...
 * ...
 * 
 * LIST_POISONING can be enabled during build-time to provoke an invalid memory
 * access when the memory behind the next/prev pointer is used after a list_del.
 * This only works on systems which prohibit access to the predefined memory
 * addresses.
 */
static __inline__ void list_del(struct list_head *node)
{
    struct list_head *next = node->next;
    struct list_head *prev = node->prev;

    next->prev = prev;
    prev->next = next;

#ifdef LIST_POISONING
    node->prev = (struct list_head *)(0x00100100);
    node->next = (struct list_head *)(0x00200200);
#endif
}












/**
* hlist_del() - Remove a hlist node from the hlist
* @node: pointer to the node
*
* ...
* ...
* 
* LIST_POISONING can be enabled during build-time to provoke an invalid memory
* access when the memory behind the next/prev pointer is used after an
* hlist_del. This only works on systems which prohibit access to the predefined
* memory addresses.
*/

此外 linux/list.h 裡也定義







/*
 * These are non-NULL pointers that will result in page faults
 * under normal circumstances, used to verify that nobody uses
 * non-initialized list entries.
 */
#define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
#define LIST_POISON2  ((void *) 0x200 + POISON_POINTER_DELTA)

是希望避免在 node 被 delete 之後，其記憶體區塊還未被釋放，若此時取用其 next or prev， access 到不合法的記憶體區段

linked list 採用環狀是基於哪些考量？

它可以雙向，並且從任一點，從頭到尾，也可以從尾到頭
而且到 tail 要回到 head 只需要 O(1) 的操作

list_for_each_safe 和 list_for_each 的差異在哪？“safe” 在執行時期的影響為何？

可以看到在說明敘述中， list_for_each_save 是安全的走訪每一個 node ，避免刪除的節點索引









/**
 * list_for_each_safe - iterate over a list safe against removal of list entry
 * @pos:	the &struct list_head to use as a loop cursor.
 * @n:		another &struct list_head to use as temporary storage
 * @head:	the head for your list.
 */
#define list_for_each_safe(pos, n, head) \
    for (pos = (head)->next, n = pos->next; pos != (head); \
        pos = n, n = pos->next)

可以看到兩者相比，多了第 444 行，就是避免 pos 節點被刪除，但是卻又在 for 之後 pos=pos->next; 就會是錯誤的狀況，所以先複製一個 pos 暫存於 n 再往下走訪

for_each 風格的開發方式對程式開發者的影響為何？

for_each 的開發方式對工程師而言，算是比較和善的，因為相較普通的 for loop 需要給定明確的邊界範圍，而 for_each 可以自動判斷，都是要對物件中的每一個元素做相同的操作，如果程式可以自行判斷範圍在哪，對工程師來說會是比較容易的，也使得程式碼較為簡潔。

程式註解裡頭大量存在 @ 符號，這有何意義？你能否應用在後續的程式開發呢？

這個非常的特別，要說到 Doxygen 工具，他是一套可以將註解自動轉換成說明文件的工具，有點類似 python 的 docstring。
因為在開發中若要讓人快速理解函式是在做什麼的就要那個 function 的說明，但是若編寫程式而要另外再寫一份說明文件著實有點麻煩，因此利用 Doxygen 可以將 function 上方的註解區塊轉換成 function 說明文件，因次 Doxygen 的語法中對參數的定義即是 @params ，因此才會在註解中看到許多＠。








/**
 * container_of() - Calculate address of object that contains address ptr
 * @ptr: pointer to member variable
 * @type: type of the structure containing ptr
 * @member: name of the member variable in struct @type
 *
 * Return: @type pointer of object containing ptr
 */

而不只 c 有在使用，它也可以用在 C++/C#/Java
Reference: Doxygen parameter

tests/ 目錄底下的 unit test 的作用為何？就軟體工程來說的精神為何？

先以 list_add.c 來做說明





































#include <assert.h>
#include "list.h"

#include "common.h"

int main(void)
{
    struct list_head testlist;
    struct listitem item1;
    struct listitem item2;
    struct listitem item3;
    struct listitem item4;
    struct listitem *item = NULL;
    size_t i;

    item1.i = 1;
    item2.i = 2;
    item3.i = 3;
    item4.i = 4;

    INIT_LIST_HEAD(&testlist);
    assert(list_empty(&testlist));

    list_add(&item1.list, &testlist);
    list_add(&item3.list, &item1.list);
    list_add(&item2.list, &item1.list);
    list_add(&item4.list, &item3.list);

    i = 1;
    list_for_each_entry (item, &testlist, list) {
        assert(item->i == i);
        i++;
    }
    assert(i == 5);

    return 0;
}

可以看到這段程式碼，主要宣告變數、初始化，並且重點是 24～27 用了四次 list_add ，然後再去用 assert() 檢查，所作的跟變化跟預期有沒有相同，因此這個 unit test 檔就是用來測試 list_add 是否有誤，其他在 tests/ 下的檔案皆是這個概念。
單元測試在軟體工程來說是非常重要的一環，有了單元測試在做 refactor 時，會相對比較容易而且比較不會出錯，如果對更進階的想法來說，編寫程式同時編寫測試案例，甚至有時會先撰寫測試案例，這就比較偏向 TDD (Test-Driven-Develop)，就是因為你會先知道要測試什麼，而實做出來的功能才不會開發過度，甚至知道有什麼功能應該要捨棄。
Reference: 【單元測試】改變了我程式設計的思維方式、不會寫 unit tests

tests/ 目錄的 unit test 可如何持續精進和改善呢？

首先感覺可以把測試的單元再縮小，盡量一個檔案裡呼叫到一個就好，比如 list_add.c 中除了 list_add() ，還有第 30 行呼叫到 list_for_each_entry() 如果在測試時，出錯的卻是 list_for_each_entry() 會造成不必要的麻煩，所以應該盡量用原生的方式來做






for (item = list_entry((&testlist)->next, __typeof__(*item), list); 
     &entry->list != (&testlist);                                     
     item = list_entry(item->list.next, __typeof__(*item), list))  {
    assert(item->i == i);
    i++;
}

如果真的無法再拆開的話，被呼叫到的函數應該放在較前面的測試案例，也就是說 Makefile 裡， list_for_each_entry() ，要在 list_add 前測試
























TESTS = \
    containerof \
    list_entry \
    list_init-explicit \
    list_init-local \
    list_init-global \
    list_add \
    list_add_tail \
    list_del \
    list_del_init \
    list_first_entry \
    list_last_entry \
    list_is_singular \
    list_for_each_safe \
    list_for_each_entry_safe \
    list_for_each \
    list_for_each_entry \
    list_move \
    list_move_tail \
    list_splice \
    list_splice_tail \
    list_splice_init \
    list_splice_tail_init \
    list_cut_position