# 2024q1 Homework2 (quiz1+2)
contributed by < [MathewSu-001](https://github.com/MathewSu-001) >
## 第一周測驗1
### 程式碼理解
參考[Optimized QuickSort — C Implementation (Non-Recursive)](https://alienryderflex.com/quicksort/)以及測驗題的程式碼
```c
node_t *p = pivot->next;
pivot->next = NULL;
while (p) {
node_t *n = p;
p = p->next;
list_add(n->value > value ? &right : &left, n);
}
```
上述可知以 pivot 為分隔點,比 pivot 小的值放到鏈結串列 left ;大的值放到鏈結串列 right。
```graphviz
digraph {
rankdir=TD;
node [style=filled, color="lightblue", fontname="Arial"]
edge [color="black", fontname="Arial"]
// 添加節點
Node1 [label="2", shape=box]
Node2 [label="3", shape=box]
Node3 [label="1", shape=box]
Node4 [label="4", shape=box]
Node5 [label="7", shape=box]
Node6 [label="5", shape=box]
left[shape=plaintext, style=""]
pivot[shape=plaintext, style=""]
right[shape=plaintext, style=""]
// 將所有節點排在同一行上
{rank=same; Node1; Node2; Node3; Node4; Node5; Node6;}
// 添加字指向節點的箭頭
left -> Node1
pivot -> Node4
right -> Node5
//將節點串接
Node1 -> Node2
Node2 -> Node3
Node5 -> Node6
}
```
```c
begin[i] = left;
end[i] = list_tail(&left);
begin[i + 1] = pivot;
end[i + 1] = pivot;
begin[i + 2] = right;
end[i + 2] = list_tail(&right);
left = right = NULL;
i += 2;
```
分類完後,以 left -> pivot -> right 分別將鏈結開頭跟尾端存入陣列 begin 跟 end 中,因為是用堆疊 stack 的方式,會一直確保取出最大值,串接到 result 裡。
### 延伸問題
我思考後覺得可以改進的地方有兩個:
- 更改`max_length`大小
- 決定 pivot 從哪裡取得
首先在看完程式碼後,因為 begin 跟 end 都是取鏈結串列的頭尾,所以總長度應該不超過 list 大小,所以更改程式碼如下:
```diff
void quick_sort(node_t **list) {
int n = list_length(list);
int value;
int i = 0;
- int max_level = 2 * n;
- node_t *begin[max_level], *end[max_level];
+ // int max_level = 2 * n;
+ node_t *begin[n], *end[n];
```
參考 [Linux 效能分析工具: Perf](https://wiki.csie.ncku.edu.tw/embedded/perf-tutorial),使用`perf` 觀察兩者的 instruction 、 cpu cycle 、 cache miss/references。
介紹上述四種性能計數器統計信息及其含義:
- cache-misses: CPU 核心的快取未命中次數。表示在執行程式時,CPU 試圖訪問快取中不存在的數據或指令的次數。
- cache-references: CPU 核心的快取參考次數。表示 CPU 試圖訪問快取的次數,無論訪問是否成功。
- instructions: CPU 核心的指令執行數量。
- cpu cycle: CPU 執行程式的總時鐘周期數。
```
$perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./qsort
```
| | before | after |
|-----------------|--------------|-------------|
|**cache-misses** | 2,375,967 |957,352 |
|**cache-references** |21,873,934 |21,795,217 |
|**instructions** |843,644,727 |820,959,299 |
|**cycles** |778,685,714 |692,183,813 |
|**elapsed time(s)**|0.034892 +- 0.000432 |0.03038 +- 0.00105|
那從數據表中,可以知道有最大的變化是 cache-misses 。參考[現代處理器設計: Cache 原理和實際影響](https://hackmd.io/@sysprog/HkW3Dr1Rb?type=view#%E6%94%B9%E8%AE%8A-array-size)可以知道原因是出在Capacity misses(空間性失誤):
> 如果有一個 64KB 的陣列需要重複存取,因為陣列大小遠大於 cache 大小,所以沒辦法全部放入 cache 。第一次存取陣列發生的 cache misses 全都是 compulsory misses 。但之後存取陣列發生的 cache misses 全都是 capacity misses ,這時的 cache 已存滿,容量不足以儲存全部的資料。
決定 pivot 從哪裡取得也是很重要的因素。考慮一個已經排序好的數列,在每次選擇的 pivot 都會是最小或最大值,這種情況下,快速排序的效能將退化為最壞的時間複雜度,即 $O(n^2)$。
### perf 執行錯誤解決
接下來我就遇到 perf top 出現問題:
![Screenshot from 2024-03-15 10-52-29](https://hackmd.io/_uploads/SJ3ld4WAa.png)
根據 [Linux 效能分析工具: Perf](https://wiki.csie.ncku.edu.tw/embedded/perf-tutorial)裡所提到,推測原因應該是我重開機後有更新到 kernel 。所以現在我來紀錄修復過程。
在編譯過程中,安裝相依的套件出現的錯誤:
```
BUILD: Doing 'make -j20' parallel build
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/sound/asound.h include/uapi/sound/asound.h
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
diff -u tools/arch/arm64/include/asm/cputype.h arch/arm64/include/asm/cputype.h
Makefile.config:465: No libdw DWARF unwind found, Please install elfutils-devel/libdw-dev >= 0.158 and/or set LIBDW_DIR
Makefile.config:470: No libdw.h found or old libdw.h found or elfutils is older than 0.138, disables dwarf support. Please install new elfutils-devel/libdw-dev
Makefile.config:612: No sys/sdt.h found, no SDT events are defined, please install systemtap-sdt-devel or systemtap-sdt-dev
Makefile.config:698: Warning: Disabled BPF skeletons as clang (clang) is missing
Makefile.config:810: slang not found, disables TUI support. Please install slang-devel, libslang-dev or libslang2-dev
Makefile.config:857: Missing perl devel files. Disabling perl scripting support, please install perl-ExtUtils-Embed/libperl-dev
Makefile.config:1021: No libcap found, disables capability support, please install libcap-devel/libcap-dev
Makefile.config:1093: No libbabeltrace found, disables 'perf data' CTF format support, please install libbabeltrace-dev[el]/libbabeltrace-ctf-dev
Makefile.config:1158: libpfm4 not found, disables libpfm4 support. Please install libpfm4-dev
Makefile.config:1176: *** ERROR: libtraceevent is missing. Please install libtraceevent-dev/libtraceevent-devel or build with NO_LIBTRACEEVENT=1. Stop.
make[1]: *** [Makefile.perf:261: sub-make] Error 2
make: *** [Makefile:70: all] Error 2
```
可以根據[Minimal requirements to compile the Kernel](https://github.com/torvalds/linux/blob/e5eb28f6d1afebed4bb7d740a797d0390bd3a357/Documentation/process/changes.rst)去看說電腦哪些相依套件沒有安裝到,另外比較大問題是是缺少 `libtraceevent`函式庫
```
sudo apt install libtraceevent-dev # Debian/Ubuntu 系統
```
安裝完後編譯就成功了~
最後將編譯好後的 perf 掛接到 `/usr/bin/perf` 裡就可以重新運作了
```
$sudo ln -s /path/linux-6.8/tools/perf/perf /usr/bin/perf
```
## 第一周測驗2
### timsort.c程式碼理解
- `merge` function
```c
static struct list_head *merge(void *priv,
list_cmp_func_t cmp,
struct list_head *a,
struct list_head *b)
{
struct list_head *head;
struct list_head **tail = &head;
for (;;) {
/* if equal, take 'a' -- important for sort stability */
if (cmp(priv, a, b) <= 0) {
*tail = a;
tail = &a->next;
a = a->next;
if (!a) {
*tail = b;
break;
}
} else {
*tail = b;
tail = &b->next;
b = b->next;
if (!b) {
*tail = a;
break;
}
}
}
return head;
}
```
在 merge 函式裡,會先初始化宣告`head`,以及間接指標`tail`指向`head`的位址。
```graphviz
digraph structs {
node[shape=record]
{rank=same; structa,structd}
rankdir=LR;
structadptr [label="&(&head)|<adptr> tail"]
structptr [label="<name_ptr> &head|<ptr> head"];
structa [label="<A> a1|<n1>1"];
structb [label="<B> a2|<n2>3"];
structc [label="<C> a3|<n3>5"];
structd [label="<D> b1|<n4>2"];
structe [label="<E> b2|<n5>4"];
structadptr:adptr -> structptr:name_ptr:nw
structa:n1 -> structb:n2
structb:n2 -> structc:n3
structd:n4 -> structe:n5
a[shape=plaintext, style=""]
b[shape=plaintext, style=""]
a -> structa:A
b -> structd:D
}
```
通過 cmp 函式比較兩個鏈表節點的值。因為是用間接指標,所以 tail = a 這個指令會讓`head`去指向 a1 ,然後將`tail`指向 a->next 的位址,a 指向 a 的下一個節點。
```graphviz
digraph structs {
node[shape=record]
{rank=same; structa,structd}
rankdir=LR;
structadptr [label="<name> &a1.next|<adptr> tail"]
structptr [label="<name_ptr> &head|<ptr> head"];
structa [label="<A> a1|<n1>1"];
structb [label="<B> a2|<n2>3"];
structc [label="<C> a3|<n3>5"];
structd [label="<D> b1|<n4>2"];
structe [label="<E> b2|<n5>4"];
structptr:name -> structa:A
structa:n1 -> structb:n2
structb:n2 -> structc:n3
structd:n4 -> structe:n5
a_list[label="a", shape=plaintext, style=""]
b[shape=plaintext, style=""]
a_list -> structb:B
b -> structd:D
}
```
再利用 cmp 函式比較兩個鏈表節點的值。因為`tail`為間接指標,所以會更動 a1->next 指向 b1,這樣就會 a1 與 b1 相接 ,然後將`tail`指向 b1->next 的位址,b 指向 b 的下一個節點。
```graphviz
digraph structs {
node[shape=record]
rankdir=LR;
structadptr [label="<name> &b1.next|<adptr> tail"]
structptr [label="<name_ptr> &head|<ptr> head"];
structa [label="<A> a1|<n1>1"];
structb [label="<B> a2|<n2>3"];
structc [label="<C> a3|<n3>5"];
structd [label="<D> b1|<n4>2"];
structe [label="<E> b2|<n5>4"];
structptr:name -> structa:A
structa:n1 -> structd:n4
structb:n2 -> structc:n3
structd:n4 -> structe:n5
a[shape=plaintext, style=""]
b[shape=plaintext, style=""]
a -> structb:B
b -> structe
}
```
如此一來降低記憶體開銷 : 只會對 2 個 runs 中,其範圍重疊的部份進行合併,額外空間只需要配置二個要合併的部份中跟較小的那一部份元素數量的記憶體。
- `find_run` function
再來解釋`find_run`(某些程式碼已省略)
```c
static struct pair find_run(void *priv,
struct list_head *list,
list_cmp_func_t cmp)
{
size_t len = 1;
struct list_head *next = list->next, *head = list;
struct pair result;
if (cmp(priv, list, next) > 0) {
/* decending run, also reverse the list */
struct list_head *prev = NULL;
do {
len++;
list->next = prev;
prev = list;
list = next;
next = list->next;
head = list;
} while (next && cmp(priv, list, next) > 0);
list->next = prev;
} else {
/* ascending run, delete */
}
head->prev = NULL;
head->next->prev = (struct list_head *) len;
result.head = head, result.next = next;
return result;
}
```
`find_run`這個函式的用途就是去尋找連續的遞減或遞增數列,讓該數列成為一個 run。這邊以圖解的方式來說明遞減數列,遞增數列原裡差不多就不多贅述。
```graphviz
digraph {
rankdir=TD;
node [color="lightblue", fontname="Arial"]
edge [color="black", fontname="Arial"]
// 添加節點
Node1 [label="5", shape=box]
Node2 [label="4", shape=box]
Node3 [label="1", shape=box]
Node4 [label="2", shape=box]
Node5 [label="3", shape=box]
head[shape=plaintext, style=""]
next[shape=plaintext, style=""]
prev[shape=plaintext, style=""]
list[shape=plaintext, style=""]
// 將所有節點排在同一行上
{rank=same; Node1; Node2; Node3; Node4; Node5;}
// 添加字指向節點的箭頭
list -> Node1
head -> Node1
next -> Node2
//將節點串接
Node1 -> Node2
Node2 -> Node3
Node3 -> Node4
Node4 -> Node5
}
```
在 do while 迴圈裡就是將 list 指向的節點指派到 prev 鏈結串列,然後重新比較下兩個是否有符合遞減的條件。
```graphviz
digraph {
rankdir=TD;
node [color="lightblue", fontname="Arial"]
edge [color="black", fontname="Arial"]
// 添加節點
Node1 [label="5", shape=box]
Node2 [label="4", shape=box]
Node3 [label="1", shape=box]
Node4 [label="2", shape=box]
Node5 [label="3", shape=box]
head[shape=plaintext, style=""]
next[shape=plaintext, style=""]
prev[shape=plaintext, style=""]
list[shape=plaintext, style=""]
// 將所有節點排在同一行上
{rank=same; Node1; Node2; Node3; Node4; Node5;}
// 添加字指向節點的箭頭
list -> Node2
head -> Node2
next -> Node3
prev -> Node1
//將節點串接
Node2 -> Node3
Node3 -> Node4
Node4 -> Node5
}
```
當 list 與 next 的條件不再符合時,就將 list 和 prev 掛接在一起完成 reverse。並且宣告一個 result 的 head 為該 run 的開端;next 為 run 的尾端的下一個節點(方便在找下一個 run)。
```graphviz
digraph {
rankdir=TD;
node [color="lightblue", fontname="Arial"]
edge [color="black", fontname="Arial"]
// 添加節點
Node1 [label="5", shape=box]
Node2 [label="4", shape=box]
Node3 [label="1", shape=box]
Node4 [label="2", shape=box]
Node5 [label="3", shape=box]
head[shape=plaintext, style=""]
next[shape=plaintext, style=""]
list[shape=plaintext, style=""]
"result.head"[shape=plaintext, style=""]
"result.next"[shape=plaintext, style=""]
// 將所有節點排在同一行上
{rank=same; Node1; Node2; Node3; Node4; Node5;}
// 添加字指向節點的箭頭
list -> Node3
head -> Node3
next -> Node4
"result.head" -> Node3
{rank=same; "result.head", Node3, "result.next"}
"result.next" -> Node4
//將節點串接
Node2 -> Node1
Node3 -> Node2
Node4 -> Node5
}
```
接著,timsort 將檢視已找到的 runs 中,是否有存在可以合併的 runs,這也是 timsort 一個顯著特點,它會在分割過程中進行合併,以維持合併的平衡。
### 延伸問題
將 timsort 合併進 lob0-c 當中,運行 perf 後的結果如下
首先比較運算時間(四捨五入到小數點第三位)(單位 : s)
| | q_sort | timsort |
|---------|----------|-------------|
|第一次 | 0.884 | 0.846 |
|第二次 | 0.989 | 0.985 |
|第三次 | 0.926 | 1.072 |
|第四次 | 0.891 | 0.953 |
|第五次 | 0.892 | 0.863 |
|平均值 | 0.916 | 0.878 |
使用perf 觀察兩者的 instruction 、 cpu cycle 、 cache miss/references
```
$perf stat --repeat 5 -e cache-misses,cache-references,instructions,cycles ./qtest -f ./traces/trace-test-listsort.cmd
```
| | q_sort | q_listsort |
|---------------------|----------------|-------------|
|**cache-misses** |54,910,492 |48,025,172 |
|**cache-references** |89,689,860 |70,943,655 |
|**instructions** |4,914,653,177 |4,665,065,530|
|**cycles** |6,515,988,823 |6,657,494,108|
|**elapsed time(s)** |1.3629 +- 0.0272|1.4029 +- 0.0476|
## 第二周測驗1
### 解釋程式碼運作
以 [105. Construct Binary Tree from Preorder and Inorder Traversal](https://leetcode.com/problems/construct-binary-tree-from-preorder-and-inorder-traversal/description/)的題目作為例子,首先會遍歷 inorder 的所有值,並以[Linux 核心的 hash table 實作](https://hackmd.io/@sysprog/linux-hashtable)篇中的 Division method 作為找尋 key 的方法。
僅將部分圖畫出:
```graphviz
digraph G {
rankdir=LR;
node[shape=record];
hash [label = "<f> hash table|<f0> 0|<f1> 1|<f2> 2|<f3> 3|<f4> 4",height=2.5];
node[shape=none]
null1 [label=NULL]
null2 [label=NULL]
null4 [label=NULL]
subgraph cluster_2 {
rank=same;
style=filled;
color=lightgrey;
node [shape=record];
hn4 [label="15 | {<prev>pprev | <next>next}"];
hn3 [label="20 | {<prev>pprev | <next>next}"];
label="hash_key 0"
}
subgraph cluster_3 {
style=filled;
color=lightgrey;
node [shape=record];
hn2 [label="3 | {<prev>pprev | <next>next}"];
label="hash_key 3"
}
subgraph cluster_4 {
style=filled;
color=lightgrey;
node [shape=record];
hn5 [label="9 | {<prev>pprev | <next>next}"];
label="hash_key 4"
}
hash:f0 -> hn4
hn4:next -> hn3
hn3:next -> null1
hn3:prev:s -> hn4:next:s
hash:f3 -> hn2
hn2:next -> null2
hash:f4 -> hn5
hn5:next -> null4
hn4:prev:s -> hash:f0
hn5:prev:s -> hash:f4
}
```
由前序可知哪個節點在上面 (越前面的越上面)、而由中序可知哪個節點靠左(越前面的越左邊),於是可得知前序的第一個元素一定是 root 的值,且該元素對應到中序後,左邊的值就在左邊,右邊的值就在右邊。
```c
static struct TreeNode *dfs(int *preorder,
int pre_low,
int pre_high,
int *inorder,
int in_low,
int in_high,
struct hlist_head *in_heads,
int size)
{
if (in_low > in_high || pre_low > pre_high)
return NULL;
struct TreeNode *tn = malloc(sizeof(*tn));
tn->val = preorder[pre_low];
int idx = find(preorder[pre_low], size, in_heads);
tn->left = dfs(preorder, pre_low + 1, pre_low + (idx - in_low), inorder,
in_low, idx - 1, in_heads, size);
tn->right = dfs(preorder, pre_high - (in_high - idx - 1), pre_high, inorder,
idx + 1, in_high, in_heads, size);
return tn;
}
```
所以`dfs`函式所做的事就是找 preorder 的開端值,是位在 inorder 的哪個位置上,左邊的陣列與右邊的陣列就分別不斷遞迴`dfs`函式,直到二原樹畫完。
```c
struct hlist_node *p;
int hash = (num < 0 ? -num : num) % size;
hlist_for_each (p, &heads[hash]) {
struct order_node *on = list_entry(p, struct order_node, node);
if (num == on->val)
return on->idx;
}
```
`find`函式所做的事便是要找到位在 inorder 的哪個位置上,因此會利用`hlist_for_each`,去遍歷同一個 hash_key 找到相同值,回傳 on->idx。
- 程式碼可改進之處
根據[Linux 核心的 hash table 實作](https://hackmd.io/@sysprog/linux-hashtable),設計雜湊函數使其盡可能接近 perfect function,讓查找 hash table 中的資料時間複雜度為 $O(1)$,是關鍵要素。
使用 Multiplication Method ,且設定 A = [golden ratio](https://en.wikipedia.org/wiki/Golden_ratio) 時, 此雜湊函數稱為 Fibonacci hashing,且 key 經此函數轉換得到的 index 相當分散,減少碰撞次數。
## 第二周測驗2
### 解釋程式碼運作
首先要先了解`LRUCache`的結構是長怎樣。裡面包含:
- capacity: 該 cache 可存取的量
- count: 計算已經存取幾個 LRUNode
- dhead: 連結所有 LRUNode->link,用來判斷當 capacity 滿時,需要 evict 哪個 LRUNode
- hhead: 用來快速尋找 LRUNode 的 key 與 value
```graphviz
digraph structs{
node [shape="record"];
rankdir = "LR"
subgraph cluster_1
{
label = "LRUCache";
struct1 [label="capacity"];
struct2 [label="count"];
struct3 [label="dhead"];
struct4 [label="<head>hhead[0]|hhead[1]|...|hhead[capacity - 1]"];
struct3 -> struct3 [label="prev"];
}
subgraph cluster_2 {
style=filled;
color=lightgrey;
struct5 [label="key=0"];
struct6 [label="value"];
node [shape=record];
link1 [label="{<prev>prev | <next>next}"];
node1 [label="a | {<prev>pprev | <next>next}"];
label="LRUNode"
}
struct4:head:s -> node1
struct3 -> link1 [label="next"];
}
```
根據[146. LRU Cache](https://leetcode.com/problems/lru-cache/description/),`push`的功能為:
> Update the value of the key if the key exists. Otherwise, add the key-value pair to the cache. If the number of keys exceeds the capacity from this operation, evict the least recently used key.
所以在程式碼分成了兩個區塊:
1. Update the value of the key if the key exists.
```c
int hash = key % obj->capacity;
struct hlist_node *pos;
hlist_for_each (pos, &obj->hhead[hash]) {
LRUNode *c = list_entry(pos, LRUNode, node);
if (c->key == key) {
list_move(&c->link, &obj->dhead);
cache = c;
}
}
```
透過`hlist_for_each`來遍歷 hhead[hash] ,若 key 本來就存在就更新值,並且將該 LRUNode->link 移到 dhead 佇列開端。
2. If capacity is full, evict the least recently used key; otherwise, add the key-value pair to the cache.
```c
if (obj->count == obj->capacity) {
cache = list_last_entry(&obj->dhead, LRUNode, link);
list_move(&cache->link, &obj->dhead);
hlist_del(&cache->node);
hlist_add_head(&cache->node, &obj->hhead[hash]);
} else {
cache = malloc(sizeof(LRUNode));
hlist_add_head(&cache->node, &obj->hhead[hash]);
list_add(&cache->link, &obj->dhead);
obj->count++;
}
```
如果 capacity 滿了,提取出 dhead 佇列尾端(最近最少使用的 key)重新放到佇列尾端,修改 key 、 value為新的組合,hhead[hash]也是同樣步驟。
根據[146. LRU Cache](https://leetcode.com/problems/lru-cache/description/),`get`的功能為:
> Return the value of the key if the key exists, otherwise return -1.
所以程式碼的話就是用`hlist_for_each`來遍歷 hhead[hash],將該 LRUNode->link 移到 dhead 佇列開端,且 return LRUNode->value。
```c
int lRUCacheGet(LRUCache *obj, int key)
{
int hash = key % obj->capacity;
struct hlist_node *pos;
hlist_for_each (pos, &obj->hhead[hash]) {
LRUNode *cache = list_entry(pos, LRUNode, node);
if (cache->key == key) {
list_move(&cache->linkI, &obj->dhead);
return cache->value;
}
}
return -1;
}
```
## 第二周測驗3
### 解釋程式碼運作
```c
static inline unsigned long find_nth_bit(const unsigned long *addr,
unsigned long size,
unsigned long n)
{
if (n >= size)
return size;
if (small_const_nbits(size)) {
unsigned long val = *addr & GENMASK(size - 1, 0);
return val ? fns(val, n) : size;
}
return FIND_NTH_BIT(addr[idx], size, n);
}
```
在解讀主要函式`find_nth_bit`時,對於`small_const_nbits`的用途抱有很大的疑惑。
```c
#define small_const_nbits(nbits) \
(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)
```
從定義上來看,是必須要滿足:
1. 搜索的位元的最大數量小於等於64
2. size 必須為正數
3. 要滿足 __builtin_constant_p(?)
那麼`__builtin_constant_p`的用途是甚麼呢?根據[6.61 Other Built-in Functions Provided by GCC](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html)
> You can use the built-in function to determine if the expression is known to be constant at compile time and hence that GCC can perform constant-folding on expressions involving that value. The argument of the function is the expression to test. The expression is not evaluated, side-effects are discarded. The function returns the integer 1 if the argument is known to be a compile-time constant and 0 if it is not known to be a compile-time constant. Any expression that has side-effects makes the function return 0. A return of 0 does not indicate that the expression is not a constant, but merely that GCC cannot prove it is a constant within the constraints of the active set of optimization options.
`__builtin_constant_p`是 GCC 提供的內建函式,用於在編譯時判斷表達式是否是一個常數。如果參數被證明是編譯時常數,函式返回整數 1,如果不能確定是編譯時常數,則返回 0。
:::info
```c
#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))
```
當 `BITS_PER_LONG` 為 64(64-bit),輸出一個可以用來取回最後 `nbits` 的 bit mask。
假設 `nbits` 為 3,`BITS_PER_LONG` 為 16:
-(nbits) = (-3) = 0xFD
-(nbits) & (BITS_PER_LONG - 1) = 0xFD & (16-1) = 0xFD & 0xF = 0xD
(~0UL >> (-(nbits) & (BITS_PER_LONG - 1))) = (0xFF >> (0xD)) = (0xFFFF >> 13) = 0x7
也就是可以取回最後 3 個 bit 的 mask
:::