# [2020q3](http://wiki.csie.ncku.edu.tw/sysprog/schedule) 第 6 週測驗題
###### tags: `sysprog2020`
:::info
目的: 檢驗學員對 [浮點數運算](https://hackmd.io/@sysprog/c-floating-point), [C 語言前置處理器應用篇](https://hackmd.io/@sysprog/c-preprocessor), [C 語言程式設計技巧](https://hackmd.io/@sysprog/c-trick) 的認知
:::
==[作答表單](https://docs.google.com/forms/d/e/1FAIpQLSdutF5od2TPTTz-_V5dQBBcKXEo84BLpwTuf2Pu5vRd5RyxaQ/viewform)==
### 測驗 `1`
[bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) 浮點數格式由 Google 公司發展,最初用於該公司第三代 Tensor 處理單元 ([Cloud TPU](https://cloud.google.com/tpu))。bfloat16 的主要想法是提供 16 位元浮點數格式,其動態範圍與標準 IEEE 754 的 [FP32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) (Single-precision floating-point format) 相同,但精度較低,相當於指數區和 FP32 保持相同的 8 位元,並將 FP32 的 fraction 區域縮減到 7 位元。
![](https://i.imgur.com/4phc5LY.png)
BFloat16 的範例
* 4049~Hex~ = 0 10000000 1001001 = 3.140625~10~ ≈ $\pi$
下列是個轉換程式:
```cpp
float fp32tobf16(float x) {
float y = x;
int *py = (int *) &y;
unsigned int exp, man;
exp = *py & 0x7F800000u;
man = *py & 0x007FFFFFu;
if (!exp && !man) /* zero */
return x;
if (exp == 0x7F800000u) /* infinity or NaN */
return x;
/* Normalized number. round to nearest */
float r = x;
int *pr = (int *) &r;
*pr &= BB1;
r /= 256;
y = x + r;
*py &= BB2;
return y;
}
```
對應的測試程式:
```cpp
void print_hex(float x) {
int *p = (int *) &x;
printf("%f=%x\n", x, *p);
}
int main() {
float a[] = {3.140625, 1.2, 2.31, 3.46, 5.63};
for (int i = 0; i < sizeof(a) / sizeof(a[0]); i++) {
print_hex(a[i]);
float bf_a = fp32tobf16(a[i]);
print_hex(bf_a);
}
return 0;
}
```
參考執行結果:
```
3.140625=40490000
3.140625=40490000
1.200000=3f99999a
1.203125=3f9a0000
2.310000=4013d70a
2.312500=40140000
3.460000=405d70a4
3.453125=405d0000
5.630000=40b428f6
5.625000=40b40000
```
請補完程式碼。
==作答區==
BB1 = ?
* `(a)` 0x7f800000
* `(b)` 0x007fffff
* `(c)` 0xff800000
* `(d)` 0x0000ff80
* `(e)` 0xffff0000
* `(f)` 0x0000ffff
BB2 = ?
* `(a)` 0x7f800000
* `(b)` 0x007fffff
* `(c)` 0xff800000
* `(d)` 0x0000ff80
* `(e)` 0xffff0000
* `(f)` 0x0000ffff
參考資料:
* [FP64, FP32, FP16, BFLOAT16, TF32, and other members of the ZOO](https://medium.com/@moocaholic/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407)
* TensorFlow 原始程式碼
- [tensorflow/core/framework/bfloat16.h](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/bfloat16.h)
- [tensorflow/core/framework/bfloat16.cc](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/bfloat16.cc)
:::success
延伸題目:
1. 解釋上述程式碼的運作機制,需要搭配 [BFloat16: The secret to high performance on Cloud TPUs](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus) 和 [Making floating point math highly efficient for AI hardware](https://engineering.fb.com/ai-research/floating-point-math/) 解說;
2. 改進上述程式碼,實作 DP/SP 轉換程式,並針對 BFloat16 撰寫測試程式,應儘量涵蓋有效的數值輸入和輸出範圍;
2. 考慮批次轉換 FP32/FP64 為 BFloat16 的需求,請改進上述程式碼,得以提高轉換的效率;
:::
---
### 測驗 `2`
考慮以下 [ring buffer](https://en.wikipedia.org/wiki/Circular_buffer) 的實作:
```cpp
#define RINGBUF_DECL(T, NAME) \
typedef struct { \
int size; \
int start, end; \
T *elements; \
} NAME
#define RINGBUF_INIT(BUF, S, T) \
{ \
static T static_ringbuf_mem[S + 1]; \
BUF.elements = static_ringbuf_mem; \
} \
BUF.size = S; \
BUF.start = 0; \
BUF.end = 0;
#define NEXT_START_INDEX(BUF) \
(((BUF)->start != (BUF)->size) ? ((BUF)->start + RB1) : 0)
#define NEXT_END_INDEX(BUF) (((BUF)->end != (BUF)->size) ? ((BUF)->end + RB2) : 0)
#define is_ringbuf_empty(BUF) ((BUF)->end == (BUF)->start)
#define is_ringbuf_full(BUF) (NEXT_END_INDEX(BUF) == (BUF)->start)
#define ringbuf_write_peek(BUF) (BUF)->elements[(BUF)->end]
#define ringbuf_write_skip(BUF) \
do { \
(BUF)->end = NEXT_END_INDEX(BUF); \
if (is_ringbuf_empty(BUF)) \
(BUF)->start = NEXT_START_INDEX(BUF); \
} while (0)
#define ringbuf_read_peek(BUF) (BUF)->elements[(BUF)->start]
#define ringbuf_read_skip(BUF) (BUF)->start = NEXT_START_INDEX(BUF);
#define ringbuf_write(BUF, ELEMENT) \
do { \
ringbuf_write_peek(BUF) = ELEMENT; \
ringbuf_write_skip(BUF); \
} while (0)
#define ringbuf_read(BUF, ELEMENT) \
do { \
ELEMENT = ringbuf_read_peek(BUF); \
ringbuf_read_skip(BUF); \
} while (0)
```
其測試程式如下:
```cpp
#include <assert.h>
RINGBUF_DECL(int, int_buf);
int main()
{
int_buf my_buf;
RINGBUF_INIT(my_buf, 2, int);
assert(is_ringbuf_empty(&my_buf));
ringbuf_write(&my_buf, 37);
ringbuf_write(&my_buf, 72);
assert(!is_ringbuf_empty(&my_buf));
int first;
ringbuf_read(&my_buf, first);
assert(first == 37);
int second;
ringbuf_read(&my_buf, second);
assert(second == 72);
return 0;
}
```
請補完程式碼,使得測試程式得以正確執行。
==作答區==
RB1 = ?
* `(a)` 1
* `(b)` 0
RB2 = ?
* `(a)` 1
* `(b)` 0
:::success
延伸題目:
1. 解釋上述程式碼的運作機制,包含 `do { ... } while (0)` 使用的考量
> 確認詳閱 [C 語言前置處理器應用篇](https://hackmd.io/@sysprog/c-preprocessor) 和 [C 語言程式設計技巧](https://hackmd.io/@sysprog/c-trick)
2. 研讀 [Super Fast Circular Ring Buffer Using Virtual Memory trick](https://medium.com/@abhinavagarwal1996/a-fast-circular-ring-buffer-4d102ef4d4a3),指出效能改善的機制並用 C99/C11 (可搭配 GNU extension) 重新實作;
:::
---
### 測驗 `3`
考慮到以下靜態初始化的 singly-linked list 實作:
```cpp
#include <stdio.h>
/* clang-format off */
#define cons(x, y) (struct llist[]){{y, x}}
/* clang-format on */
struct llist {
int val;
struct llist *next;
};
void sorted_insert(struct llist **head, struct llist *node)
{
if (!*head || (*head)->val >= node->val) {
SS1;
SS2;
return;
}
struct llist *current = *head;
while (current->next && current->next->val < node->val)
current = current->next;
node->next = current->next;
current->next = node;
}
void sort(struct llist **head)
{
struct llist *sorted = NULL;
for (struct llist *current = *head; current;) {
struct llist *next = current->next;
sorted_insert(&sorted, current);
current = next;
}
*head = sorted;
}
int main()
{
struct llist *list = cons(cons(cons(cons(NULL, A), B), C), D);
struct llist *p;
for (p = list; p; p = p->next)
printf("%d", p->val);
printf("\n");
sort(&list);
for (p = list; p; p = p->next)
printf("%d", p->val);
printf("\n");
return 0;
}
```
其執行結果為:
```
9547
4579
```
> 這裡的 `9547` 不是周星馳電影裡面的密碼,而是指[小行星 9547](https://zh.wikipedia.org/wiki/%E5%B0%8F%E8%A1%8C%E6%98%9F9547)
請補完程式碼,使程式碼和執行結果相匹配。
==作答區==
A = ?
* `(a)` 9
* `(b)` 5
* `(c)` 4
* `(d)` 7
B = ?
* `(a)` 9
* `(b)` 5
* `(c)` 4
* `(d)` 7
C = ?
* `(a)` 9
* `(b)` 5
* `(c)` 4
* `(d)` 7
D = ?
* `(a)` 9
* `(b)` 5
* `(c)` 4
* `(d)` 7
SS1 = ?
* `(a)` `*head = node->next`
* `(b)` `*head = node`
* `(c)` `node = *head`
* `(d)` `node->next = *head`
SS2 = ?
* `(a)` `*head = node->next`
* `(b)` `*head = node`
* `(c)` `node = *head`
* `(d)` `node->next = *head`
:::success
延伸題目:
1. 解釋上述程式碼的運作機制,參照 [C 語言程式設計技巧](https://hackmd.io/@sysprog/c-trick),需要列出相關的 C 語言規格描述並解讀;
2. 研讀 [Doubly Linked Lists in C](http://locklessinc.com/articles/dlist/) 並比對 Linux 核心的 linked list 實作予以解說;
3. 練習 LeetCode [148. Sort List](https://leetcode.com/problems/sort-list/),嘗試將 [Optimizing merge sort](https://en.wikipedia.org/wiki/Merge_sort#Optimizing_merge_sort) 的手法引入到 C 程式中實作
* 用長度為 `n` 的 `list* l`,分析 `sort(l)` 的時間複雜度,得到 $O(n^2)$ 的結果
$\begin{split}
T(n) &=
T(1)+T(n-1)+cn, where\ c\ is\ a\ constant \\
&= 2\cdot{T(1)} + T(n-2) + c\cdot(n+(n-1)) \\
&= (n-1)\cdot{T(1)} + T(1) + c\cdot(n+(n-1)+(n-2)+...+2) \\
&= O(n^2)
\end{split}$
* [Fundamental Sorting Algorithms](https://victoramelkin.com/ta/cs32/pa/2/pa2.html) 提到 tiled merge sort 此種 cache-aware 演算法的考量點,對於排序大量資料而言,quick sort 和 merge sort 勝過 insertion sort ($O(n\cdot{log(n)}) < O(n^2)$),對於小量資料則取決於執行環境。因此在 merge sort 將問題切成 L1 cache size 後,即不再往下遞迴,改用 insertion sort 進行排序。
* 交叉閱讀共筆 [eecheng87/quiz3](https://hackmd.io/@eecheng/ryqwNCoVI)
:::
---
### 測驗 `4`
LeetCode [287. Find the Duplicate Number](https://leetcode.com/problems/find-the-duplicate-number/) 給定一個整數序列,其中會有一個數值重複,請找出。
已知條件:
1. 假設陣列長度為 $n$,數值的範圍是 $1$ 到 $n-1$
2. 重複的數值不一定只重複一次
考慮以下程式碼是可能的解法:
```cpp
int findDuplicate(int *nums, int numsSize)
{
int res = 0;
const size_t log_2 = 8 * sizeof(int) - __builtin_clz(numsSize);
for (size_t i = 0; i < log_2; i++) {
int bit = 1 << i;
int c1 = 0, c2 = 0;
for (size_t k = 0; k < numsSize; k++) {
if (k & bit)
++c1;
if (nums[k] & bit)
++c2;
}
if (CCC)
res += bit;
}
return res;
}
```
CCC = ?
* `(a)` `c1 == c2`
* `(b)` `c1 < c2`
* `(c)` `c1 > c2`
* `(d)` `c1 & c2`
* `(e)` `c1 ^ c2`
:::success
延伸題目:
1. 解釋上述 LeetCode 題目和程式碼的運作機制,指出改進空間並實作;
2. 實作不同的程式碼 (合法的 C 程式,可運用 GNU extension),使其表現優於上述程式碼;
:::
---