第 4, 5, 6 週課堂問答簡記

jim12312321

$\log_{2} x$ 的英文

the logarithm x to the base 2.

kdnvt

為什麼 list_sort 中串列大小的比例差距限制是二比一

因為目的是為了避免合併差距過大，因此使用二比一而不是更大三比一、四比一很合理。需要考慮的是為什麼不用三比二或四比三這種比例。

先簡述一下 list_sort 的演算法：

第一階段：
將節點一個一個讀取，如果確定合併的大小不會超過整體的特定比例，就將之前的串列合併。因為本質還是 bottom up merge sort ，所以此階段的合併都是同大小，所有子串列也都會是二的冪。

第二階段：
當所有節點都讀進並盡可能的同大小合併後，就來到必須合併不同大小的階段。這個階段會將子串列由小大到一個接一個合併。

從以上兩階段可以看出，因為第一階段都是二的冪，在維持

2 m : 2 m + 1

比例的情況下，第一階段結束時最大的子串列就是

2^{⌊ l o g_{2} 2 m n / (2 m + 1) ⌋}

。

在第二篇論文 The cost distribution of queue-mergesort, optimal mergesorts, and
power-of-two rules 中看到以下這段話：

Note that
$2^{⌊ l o g_{2} 2 n / 3 ⌋}$ is the unique power of two lying between
$n / 3$ and
$2 n / 3$ and that the choice of rationals other than
$2 / 3$ will not be more balanced. For example, if
$2^{⌊ l o g_{2} n / 2 ⌋}$ or
$2^{⌊ l o g_{2} 5 n / 9 ⌋}$ then the sizes of two subproblems are
$(2, 5)$ for
$n = 7$ while the balanced power-of-two gives
$(3, 4)$ .

這段話其實沒有很精準，因為當

n = 3 \times 2^{k}

時，在包括的情況下介於

n / 3

以及

2 n / 3

的二的冪會有兩個，再不包括的情況下卻又會不存在。如果是寫成

n / 3 < 2^{⌊ l o g_{2} 2 n / 3 ⌋} \leq 2 n / 3

就符合這段話的描述。
而這段敘述講述了一件事，

2^{⌊ l o g_{2} 2 n / 3 ⌋}

可以找到最接近

n / 2

的二的冪。

如果今天是用

2^{⌊ l o g_{2} 3 n / 5 ⌋}

去求

2^{k}

，則當

n = 13

時，

2^{⌊ l o g_{2} 3 n / 5 ⌋} = 4

，而最接近

13 / 2

的二的冪卻是

8

。
這代表如果在第一階段用

3 : 2

的比例去限制，在第二階段合併中，最後一次合併會是

9 : 4

，這個比例差距已經超過第一階段的

3 : 2

，而同樣的反例在

2 m : 2 m + 1, m > 2

中也都會發生。因此這樣的演算法只能用在

2 : 1

的情況。

證明 list_sort 的演算法是 optimal mergesort：

optimal merge sort 定義：在最差的情況下，比較次數小於等於所有其他 mergesort 。

對所有 mergesort ，都可以繪製出一個 merge tree 來表示其合併的順序及長度，每一個節點的權重代表當下的串列長度。方形的是 external node (leaf) ，圓形的是 internal node 。如下圖所示：

graphviz參考連結

internal node 的數量為 leaf 數減一。在最差的情況下，合併 (x,y) 的次數為 x+y-1 ，因此最差情況下總共的比較次數就是

\sum_{v}^{T} (w (v) - 1) = \sum_{v}^{T} w (v) - (n - 1)

其中

n

為這次排序的節點總數，也是 leaf 的總數。而

v

為所有的 internal node 。

w (v)

為該 internal node 的權重（也就是長度）。在這邊因為對所有合併排序 n - 1 都是固定的，所以只要看

\sum_{v}^{T} w (v)

就好。

引述論文 Queue-Mergesort 中以下段落：

We use the fact that the weight of a merge tree is equal to its external path length. The height h(f) of a node I in a tree is the distance of a path from 1 to the root. The external path length of a tree T’ is the sum E(T’) =
$\sum_{l a l e a f o f T^{'}} h (l)$

以及以下段落：

Thus a merge tree T describes an optimal merge-sort on n items if and only if T has minimum external path length
$\sum_{l a l e a f} h (l)$ . It is known that this occurs if and only if the following condition is satisfied: all of T’s leaves are located on its bottom two levels.

因此我們可以得知，只要證明 list_sort 的 merge tree 符合所有 leaf 都在最下面兩層這個條件，就可以證明它是 optimal merge sort 。

分析 list_sort 的演算法，可以得出以下兩點：

第一點：在合併的過程中，最多只有一個子串列不是二的冪（第二階段排最後的子串列）。

第二點：在合併的過程中，兩者差異不大於兩倍。

證明合併過程中對所有長度 n 的子串列，其 merge tree 的所有 leaf 都在最下面兩層。

證明過程：

第一階段所有合併皆為二的冪，符合命題。

用數學歸納法證明第二階段：

最小端的子串列只可能是 (1,1) 或 (1,2)，兩者合併都符合命題。

對第二階段過程中的任意合併，假設其兩個子串列都符合命題。

則合併的過程中，由第一點可知，其中一個子串列一定為二的冪，因此其 merge tree 為 full binary tree ，所有 leaf 都在最底層。另其長度為

2^{k_{1}}

，則 merge tree 的高度為

k_{1}

。

當另一個子串列如果也是二的冪，因為第二點中兩者差異不大於兩倍，因此其大小只可能是

2^{k_{1} - 1}

或

2^{k_{1} + 1}

。 merge tree 的高度

k_{2} = k_{1} \pm 1

，符合命題。

當第二的子串列不為二的冪，那根據假設，此串列的 merge tree 所有 leaf 都在最下面兩層，而其中一層就是

k_{1}

，否則會違反第二點兩者差異不大於兩倍的描述，因此也符合命題。

根據數學歸納法，對第二階段過程中的任意合併，其 merge tree 的所有 leaf 都在最下面兩層。所以可以得出最後一次合併所產生的 merge tree 也符合命題。

根據論文中所述，可得知 list_sort 中的演算法也是 optimal merge sort 。

Uduru0522

第 4 週測驗題題目 1 當 x = 0，答案錯誤

idoleat

list_sort 是怎麼減少 cmp 的函式呼叫？每次呼叫的成本是什麼？

在 lib/sort.c 和 lib/list_sort.c 中 cmp 是讓使用者自訂排序方式的函式指標，需要傳入三個 arguments：

一個 cmp 函式呼叫至少會有以下指令的執行 (以 x86 為例)：

push   rbp
mov    rbp,rsp
mov    DWORD PTR [rbp-4], edi	# args
mov    DWORD PTR [rbp-8], esi
mov    DWORD PTR [rbp-12], edx
# 使用 rbp-4, rbp-8 和 rbp-12 進行比較
pop    rbp
ret

由於現代處理器架構複雜，可能會有 pipeline, out-of-order execution 等機制，沒辦法直接估算所需的 cycle 數量，但是

記憶體上一個 stack frame 會使用以下空間：

return address
old rbp
local variables

(WIP)

Nomad1230

第 4 週測驗題題目 2:

#define BITS_PER_BYTE 8
#define BITS_PER_LONG (sizeof(unsigned long) * BITS_PER_BYTE)

#include <stddef.h>

static inline size_t ffs(unsigned long x)
{
    if (x == 0)
        return 0;

    size_t o = 1;
    unsigned long t = ~0UL;
    size_t shift = BITS_PER_LONG;

    shift >>= 1;
    t >>= shift;

    while (shift) {
        if ((x & t) == 0) {
            x >>= shift;
            o += shift;
        }

        shift >>= 1;
        t >>= shift;
    }

    return o;
}

此函式的功能是找出變數的 bit pattern 中最低位元的 1 位於哪個位置，例如 ffs(0xF0) 其值為 4 。

此演算法採用 binary search，在檢查 bit pattern 時會將其切成兩半並檢查其中一邊，再從尚未檢查的那一邊再切一半，以此類推去計數，以 0xFF00000 為例，示意圖如下:

t 變數為 bit mask ，每次迴圈都遮住 x 的 bit pattern 的左半邊，若右半邊皆為 0 則將 o 加上 shift ，最後將 o 回傳。

ray90514

第 4 週測驗題 1

int ceil_log2(uint32_t x)
{
    uint32_t r, shift;
    x--;
    r = (x > 0xFFFF) << 4;
    x >>= r;
    shift = (x > 0xFF) << 3;
    x >>= shift;
    r |= shift;
    shift = (x > 0xF) << 2;
    x >>= shift;
    r |= shift;
    shift = (x > 0x3) << 1;
    x >>= shift;
    return (EXP1) + 1;       
}

對於 x 的二進位表示如下

x_{k}, . . ., x_{1}, x_{0}

，

x_{k}

為最高位的 1 ，也可以寫成以下形式，

2^{k} + x_{k - 1} 2^{k_{1}} + . . . + x_{1} 2^{1} + x_{0} 2^{0}

，求

l o g_{2}

其實就是在找最高位的 1 ，因為是向上取整所以最後將結果加一。 2 的冪不需要加一，所以一開始會將 x 減一，但這個方法不適用於 x = 0 的情況。

每次將 x 分成高位和低位的兩半邊，如果最高位的 1 在高位的半邊，代表 x 會大於低位的半邊全為 1 的值，如果在高位的半邊，就將 x 向右位移一半，這樣不管是在高位還低位，接下來都只要看低位。

r 負責紀錄

l o g_{2}

的結果，如果最高位在高位的半邊，代表

l o g_{2}

至少有位數的一半，就將該值加入至 r。

依照以上的邏輯，上述程式還缺少三行，而 EXP1 為計算

l o g_{2}

的結果。

r |= shift;
shift = (x > 0x1) << 0;
r |= shift;

若合併這三行也就是 EXP1 ，因為 x 最後只剩下兩位，看第二位就可以知有沒有大於一。

r | shift | x > 0x1
r | shift | x >> 1

idoleat

即便在一整個 CPU 都被一個 process 使用了，還有什麼事情會影響效能

minor page fault
TLB

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Context switch 發生在什麼時候

在 preamptive schedular 切換 tasks 的時候
在 Interrupt 之後 (恢復到原本的工作)

不可以只背課本上的定義，需要知道詳細的使用場景

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

mode switch 不等於 context switch

Mode switch: 當一個 user process 想要執行 privileged 的操作時，需透過 system call，也就是已經小心設計好給使用者的功能。早期 UNIX system call 有二十多個，現代作業系統則常有上百個，詳情可見 POSIX 標準。

OSTEP Ch6: Limited Direct Execution Protocol
trap instruction
return-from-trap instruction
trap table
trap handlers

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Context switch:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

為了減少 mode switch 有沒有可能把 kernel mode 的 tasks 累積到一定數量再一次做完？

mode switch 的成本因為要防禦各種 CPU side-channel attack 手段而變高。於是，System call aggregation 就將多個系統呼叫合併成一個來做，從而縮減系統呼叫整體的成本，參見 Open Source Summit Japan 2021 的演講 Reduce System Call Overhead for Event-Driven Servers

Google 的 M:N threading model 研究 "ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling"

freshLiver

為什麼 clone 參數中的 parent_tid 以及 child_tid 需要傳入 address 而不是 TID value 而已

`clone()` Prototype

/* Prototype for the glibc wrapper function */

#define _GNU_SOURCE
#include <sched.h>

int clone(int (*fn)(void *), 
          void *stack, 
          int flags, 
          void *arg, 
          ...
         /* 
          * pid_t *parent_tid, 
          * void *tls, 
          * pid_t *child_tid 
          */ 
         );

說明

在呼叫 clone() 時給的 flags 中包含 CLONE_PARENT_SETTID 與 CLONE_CHILD_CLEARTID。而根據 clone() 的 Man Page 中對於 Flags 的說明，可以知道：

當 Flags 中包含 CLONE_PARENT_SETTID 時：

在建立執行緒時，會將 clone() 參數中的 parent_tid 的值修改成新建立的執行緒的 Thread ID。
當 Flags 中包含 CLONE_CHILD_CLEARTID 時：

當建立的執行緒結束時，會將 clone() 時傳送的參數 child_tid 的值修改為 0。

因此若不是傳送建立執行緒時新增的 node * 物件 insertedNode 中的 tid 的話，就無法在 thread_kill() 或是 killAllThreads() 函式中透過檢查 tid 來確認執行緒是否已經結束。

2020leon

偵測 unsigned addition 的 overflow
https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html

在無號整數加法之前偵測溢位

bool add_overflow(uint32_t a, uint32_t b, uint32_t *c)
{
    if (UINT32_MAX - a < b) {
        return true;
    } else {
        *c = a + b;
        return false;
    }
}

對應的最佳化組合語言如下（ Intel syntax ）。

add_overflow1:
	mov	ecx, edi
	mov	al, 1
	not	ecx
	cmp	ecx, esi
	jb	.L4
	add	edi, esi
	xor	eax, eax
	mov	DWORD PTR [rdx], edi
.L4:
	ret

在無號整數加法之後偵測溢位

bool add_overflow(uint32_t a, uint32_t b, uint32_t *c)
{
    *c = a + b;
    return *c < a;
}

bool add_overflow(uint32_t a, uint32_t b, uint32_t *c)
{
    return __builtin_uadd_overflow(a, b, c);
}

對應的最佳化組合語言如下（ Intel syntax ）。

add_overflow:
	add	edi, esi
	mov	DWORD PTR [rdx], edi
	setc	al
	ret

其中 setc 指令即 Set if carry ，便是將 carry bit 存入指定的暫存器中。

kevinshieh0225

降低 GCD 的 branch 數量
TODO: REACTO for gcd

第 4, 5, 6 週課堂問答簡記

jim12312321

kdnvt

證明 list_sort 的演算法是 optimal mergesort：

Uduru0522

idoleat

Nomad1230

ray90514

idoleat

freshLiver

clone() Prototype

說明

2020leon

kevinshieh0225

`clone()` Prototype