2020q3 第 6 週測驗題

# [2020q3](http://wiki.csie.ncku.edu.tw/sysprog/schedule) 第 6 週測驗題 ###### tags: `sysprog2020` :::info 目的: 檢驗學員對 [浮點數運算](https://hackmd.io/@sysprog/c-floating-point), [C 語言前置處理器應用篇](https://hackmd.io/@sysprog/c-preprocessor), [C 語言程式設計技巧](https://hackmd.io/@sysprog/c-trick) 的認知 ::: ==[作答表單](https://docs.google.com/forms/d/e/1FAIpQLSdutF5od2TPTTz-_V5dQBBcKXEo84BLpwTuf2Pu5vRd5RyxaQ/viewform)== ### 測驗 `1` [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) 浮點數格式由 Google 公司發展，最初用於該公司第三代 Tensor 處理單元 ([Cloud TPU](https://cloud.google.com/tpu))。bfloat16 的主要想法是提供 16 位元浮點數格式，其動態範圍與標準 IEEE 754 的 [FP32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) (Single-precision floating-point format) 相同，但精度較低，相當於指數區和 FP32 保持相同的 8 位元，並將 FP32 的 fraction 區域縮減到 7 位元。 ![](https://i.imgur.com/4phc5LY.png) BFloat16 的範例 * 4049~Hex~ = 0 10000000 1001001 = 3.140625~10~ ≈ $\pi$ 下列是個轉換程式: ```cpp float fp32tobf16(float x) { float y = x; int *py = (int *) &y; unsigned int exp, man; exp = *py & 0x7F800000u; man = *py & 0x007FFFFFu; if (!exp && !man) /* zero */ return x; if (exp == 0x7F800000u) /* infinity or NaN */ return x; /* Normalized number. round to nearest */ float r = x; int *pr = (int *) &r; *pr &= BB1; r /= 256; y = x + r; *py &= BB2; return y; } ``` 對應的測試程式: ```cpp void print_hex(float x) { int *p = (int *) &x; printf("%f=%x\n", x, *p); } int main() { float a[] = {3.140625, 1.2, 2.31, 3.46, 5.63}; for (int i = 0; i < sizeof(a) / sizeof(a[0]); i++) { print_hex(a[i]); float bf_a = fp32tobf16(a[i]); print_hex(bf_a); } return 0; } ``` 參考執行結果: ``` 3.140625=40490000 3.140625=40490000 1.200000=3f99999a 1.203125=3f9a0000 2.310000=4013d70a 2.312500=40140000 3.460000=405d70a4 3.453125=405d0000 5.630000=40b428f6 5.625000=40b40000 ``` 請補完程式碼。 ==作答區== BB1 = ? * `(a)` 0x7f800000 * `(b)` 0x007fffff * `(c)` 0xff800000 * `(d)` 0x0000ff80 * `(e)` 0xffff0000 * `(f)` 0x0000ffff BB2 = ? * `(a)` 0x7f800000 * `(b)` 0x007fffff * `(c)` 0xff800000 * `(d)` 0x0000ff80 * `(e)` 0xffff0000 * `(f)` 0x0000ffff 參考資料: * [FP64, FP32, FP16, BFLOAT16, TF32, and other members of the ZOO](https://medium.com/@moocaholic/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407) * TensorFlow 原始程式碼 - [tensorflow/core/framework/bfloat16.h](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/bfloat16.h) - [tensorflow/core/framework/bfloat16.cc](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/bfloat16.cc) :::success 延伸題目: 1. 解釋上述程式碼的運作機制，需要搭配 [BFloat16: The secret to high performance on Cloud TPUs](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus) 和 [Making floating point math highly efficient for AI hardware](https://engineering.fb.com/ai-research/floating-point-math/) 解說; 2. 改進上述程式碼，實作 DP/SP 轉換程式，並針對 BFloat16 撰寫測試程式，應儘量涵蓋有效的數值輸入和輸出範圍; 2. 考慮批次轉換 FP32/FP64 為 BFloat16 的需求，請改進上述程式碼，得以提高轉換的效率; ::: --- ### 測驗 `2` 考慮以下 [ring buffer](https://en.wikipedia.org/wiki/Circular_buffer) 的實作: ```cpp #define RINGBUF_DECL(T, NAME) \ typedef struct { \ int size; \ int start, end; \ T *elements; \ } NAME #define RINGBUF_INIT(BUF, S, T) \ { \ static T static_ringbuf_mem[S + 1]; \ BUF.elements = static_ringbuf_mem; \ } \ BUF.size = S; \ BUF.start = 0; \ BUF.end = 0; #define NEXT_START_INDEX(BUF) \ (((BUF)->start != (BUF)->size) ? ((BUF)->start + RB1) : 0) #define NEXT_END_INDEX(BUF) (((BUF)->end != (BUF)->size) ? ((BUF)->end + RB2) : 0) #define is_ringbuf_empty(BUF) ((BUF)->end == (BUF)->start) #define is_ringbuf_full(BUF) (NEXT_END_INDEX(BUF) == (BUF)->start) #define ringbuf_write_peek(BUF) (BUF)->elements[(BUF)->end] #define ringbuf_write_skip(BUF) \ do { \ (BUF)->end = NEXT_END_INDEX(BUF); \ if (is_ringbuf_empty(BUF)) \ (BUF)->start = NEXT_START_INDEX(BUF); \ } while (0) #define ringbuf_read_peek(BUF) (BUF)->elements[(BUF)->start] #define ringbuf_read_skip(BUF) (BUF)->start = NEXT_START_INDEX(BUF); #define ringbuf_write(BUF, ELEMENT) \ do { \ ringbuf_write_peek(BUF) = ELEMENT; \ ringbuf_write_skip(BUF); \ } while (0) #define ringbuf_read(BUF, ELEMENT) \ do { \ ELEMENT = ringbuf_read_peek(BUF); \ ringbuf_read_skip(BUF); \ } while (0) ``` 其測試程式如下: ```cpp #include <assert.h> RINGBUF_DECL(int, int_buf); int main() { int_buf my_buf; RINGBUF_INIT(my_buf, 2, int); assert(is_ringbuf_empty(&my_buf)); ringbuf_write(&my_buf, 37); ringbuf_write(&my_buf, 72); assert(!is_ringbuf_empty(&my_buf)); int first; ringbuf_read(&my_buf, first); assert(first == 37); int second; ringbuf_read(&my_buf, second); assert(second == 72); return 0; } ``` 請補完程式碼，使得測試程式得以正確執行。 ==作答區== RB1 = ? * `(a)` 1 * `(b)` 0 RB2 = ? * `(a)` 1 * `(b)` 0 :::success 延伸題目: 1. 解釋上述程式碼的運作機制，包含 `do { ... } while (0)` 使用的考量 > 確認詳閱 [C 語言前置處理器應用篇](https://hackmd.io/@sysprog/c-preprocessor) 和 [C 語言程式設計技巧](https://hackmd.io/@sysprog/c-trick) 2. 研讀 [Super Fast Circular Ring Buffer Using Virtual Memory trick](https://medium.com/@abhinavagarwal1996/a-fast-circular-ring-buffer-4d102ef4d4a3)，指出效能改善的機制並用 C99/C11 (可搭配 GNU extension) 重新實作; ::: --- ### 測驗 `3` 考慮到以下靜態初始化的 singly-linked list 實作: ```cpp #include <stdio.h> /* clang-format off */ #define cons(x, y) (struct llist[]){{y, x}} /* clang-format on */ struct llist { int val; struct llist *next; }; void sorted_insert(struct llist **head, struct llist *node) { if (!*head || (*head)->val >= node->val) { SS1; SS2; return; } struct llist *current = *head; while (current->next && current->next->val < node->val) current = current->next; node->next = current->next; current->next = node; } void sort(struct llist **head) { struct llist *sorted = NULL; for (struct llist *current = *head; current;) { struct llist *next = current->next; sorted_insert(&sorted, current); current = next; } *head = sorted; } int main() { struct llist *list = cons(cons(cons(cons(NULL, A), B), C), D); struct llist *p; for (p = list; p; p = p->next) printf("%d", p->val); printf("\n"); sort(&list); for (p = list; p; p = p->next) printf("%d", p->val); printf("\n"); return 0; } ``` 其執行結果為: ``` 9547 4579 ``` > 這裡的 `9547` 不是周星馳電影裡面的密碼，而是指[小行星 9547](https://zh.wikipedia.org/wiki/%E5%B0%8F%E8%A1%8C%E6%98%9F9547) 請補完程式碼，使程式碼和執行結果相匹配。 ==作答區== A = ? * `(a)` 9 * `(b)` 5 * `(c)` 4 * `(d)` 7 B = ? * `(a)` 9 * `(b)` 5 * `(c)` 4 * `(d)` 7 C = ? * `(a)` 9 * `(b)` 5 * `(c)` 4 * `(d)` 7 D = ? * `(a)` 9 * `(b)` 5 * `(c)` 4 * `(d)` 7 SS1 = ? * `(a)` `*head = node->next` * `(b)` `*head = node` * `(c)` `node = *head` * `(d)` `node->next = *head` SS2 = ? * `(a)` `*head = node->next` * `(b)` `*head = node` * `(c)` `node = *head` * `(d)` `node->next = *head` :::success 延伸題目: 1. 解釋上述程式碼的運作機制，參照 [C 語言程式設計技巧](https://hackmd.io/@sysprog/c-trick)，需要列出相關的 C 語言規格描述並解讀; 2. 研讀 [Doubly Linked Lists in C](http://locklessinc.com/articles/dlist/) 並比對 Linux 核心的 linked list 實作予以解說; 3. 練習 LeetCode [148. Sort List](https://leetcode.com/problems/sort-list/)，嘗試將 [Optimizing merge sort](https://en.wikipedia.org/wiki/Merge_sort#Optimizing_merge_sort) 的手法引入到 C 程式中實作 * 用長度為 `n` 的 `list* l`，分析 `sort(l)` 的時間複雜度，得到 $O(n^2)$ 的結果 $\begin{split} T(n) &= T(1)+T(n-1)+cn, where\ c\ is\ a\ constant \\ &= 2\cdot{T(1)} + T(n-2) + c\cdot(n+(n-1)) \\ &= (n-1)\cdot{T(1)} + T(1) + c\cdot(n+(n-1)+(n-2)+...+2) \\ &= O(n^2) \end{split}$ * [Fundamental Sorting Algorithms](https://victoramelkin.com/ta/cs32/pa/2/pa2.html) 提到 tiled merge sort 此種 cache-aware 演算法的考量點，對於排序大量資料而言，quick sort 和 merge sort 勝過 insertion sort ($O(n\cdot{log(n)}) < O(n^2)$)，對於小量資料則取決於執行環境。因此在 merge sort 將問題切成 L1 cache size 後，即不再往下遞迴，改用 insertion sort 進行排序。 * 交叉閱讀共筆 [eecheng87/quiz3](https://hackmd.io/@eecheng/ryqwNCoVI) ::: --- ### 測驗 `4` LeetCode [287. Find the Duplicate Number](https://leetcode.com/problems/find-the-duplicate-number/) 給定一個整數序列，其中會有一個數值重複，請找出。已知條件： 1. 假設陣列長度為 $n$，數值的範圍是 $1$ 到 $n-1$ 2. 重複的數值不一定只重複一次考慮以下程式碼是可能的解法: ```cpp int findDuplicate(int *nums, int numsSize) { int res = 0; const size_t log_2 = 8 * sizeof(int) - __builtin_clz(numsSize); for (size_t i = 0; i < log_2; i++) { int bit = 1 << i; int c1 = 0, c2 = 0; for (size_t k = 0; k < numsSize; k++) { if (k & bit) ++c1; if (nums[k] & bit) ++c2; } if (CCC) res += bit; } return res; } ``` CCC = ? * `(a)` `c1 == c2` * `(b)` `c1 < c2` * `(c)` `c1 > c2` * `(d)` `c1 & c2` * `(e)` `c1 ^ c2` :::success 延伸題目: 1. 解釋上述 LeetCode 題目和程式碼的運作機制，指出改進空間並實作; 2. 實作不同的程式碼 (合法的 C 程式，可運用 GNU extension)，使其表現優於上述程式碼; ::: ---

Read more

Linux 核心的紅黑樹

執行古老的 QPE 執行檔

Git 教學和 GitHub 設定指引

你所不知道的 C 語言：bitwise 操作