2020q3 Homework6 (quiz6)

contributed by < Veternal1226 >

tags: `sysprog2020`

測驗 `1`

bfloat16 浮點數格式由 Google 公司發展，最初用於該公司第三代 Tensor 處理單元 (Cloud TPU)。bfloat16 的主要想法是提供 16 位元浮點數格式，其動態範圍與標準 IEEE 754 的 FP32 (Single-precision floating-point format) 相同，但精度較低，相當於指數區和 FP32 保持相同的 8 位元，並將 FP32 的 fraction 區域縮減到 7 位元。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

BFloat16 的範例

4049Hex = 0 10000000 1001001 = 3.14062510 ≈
π

下列是個轉換程式:





















float fp32tobf16(float x) {
    float y = x;
    int *py = (int *) &y;
    unsigned int exp, man;
    exp = *py & 0x7F800000u;
    man = *py & 0x007FFFFFu;
    if (!exp && !man) /* zero */           
        return x;
    if (exp == 0x7F800000u) /* infinity or NaN */
        return x;

    /* Normalized number. round to nearest */
    float r = x;
    int *pr = (int *) &r;
    *pr &= BB1;
    r /= 256;
    y = x + r;

    *py &= BB2;
    return y;
}

對應的測試程式:















void print_hex(float x) {
    int *p = (int *) &x;
    printf("%f=%x\n", x, *p);
}

int main() {
    float a[] = {3.140625, 1.2, 2.31, 3.46, 5.63};
    for (int i = 0; i < sizeof(a) / sizeof(a[0]); i++) {
        print_hex(a[i]);
        float bf_a = fp32tobf16(a[i]);
        print_hex(bf_a);
    }

    return 0;
}

參考執行結果:

3.140625=40490000
3.140625=40490000
1.200000=3f99999a
1.203125=3f9a0000
2.310000=4013d70a
2.312500=40140000
3.460000=405d70a4
3.453125=405d0000
5.630000=40b428f6
5.625000=40b40000

請補完程式碼。

作答區

BB1 = (c) 0xff800000
BB2 = (e) 0xffff0000

參考資料:

FP64, FP32, FP16, BFLOAT16, TF32, and other members of the ZOO
TensorFlow 原始程式碼
- tensorflow/core/framework/bfloat16.h
- tensorflow/core/framework/bfloat16.cc

解題想法

*pr &= BB1;
作用:取出sign+exp 所以BB1選 (c) 0xff800000
目標:把23位元砍到變7位元，且不能動到sign&exp

因為bitwise是對整數，所以要強制轉型(int *pr = (int *) &r;)

fraction 長度從 23 變成 7 是右位移 16bits，所以除以256(r /= 256;)

把剛剛保留的 sign & exp 與縮短完的數結合，把最低的16bit去掉，即完成 float32 to bfloat16 轉換，因此BB2選 (e) 0xffff0000

延伸問題:

解釋上述程式碼的運作機制，需要搭配 BFloat16: The secret to high performance on Cloud TPUs 和 Making floating point math highly efficient for AI hardware 解說;
改進上述程式碼，實作 DP/SP 轉換程式，並針對 BFloat16 撰寫測試程式，應儘量涵蓋有效的數值輸入和輸出範圍;
考慮批次轉換 FP32/FP64 為 BFloat16 的需求，請改進上述程式碼，得以提高轉換的效率;

//TODO

測驗 `2`

考慮以下 ring buffer 的實作:













































#define RINGBUF_DECL(T, NAME) \
    typedef struct {          \
        int size;             \
        int start, end;       \
        T *elements;          \
    } NAME

#define RINGBUF_INIT(BUF, S, T)             \
    {                                       \
        static T static_ringbuf_mem[S + 1]; \
        BUF.elements = static_ringbuf_mem;  \
    }                                       \
    BUF.size = S;                           \
    BUF.start = 0;                          \
    BUF.end = 0;

#define NEXT_START_INDEX(BUF) \
    (((BUF)->start != (BUF)->size) ? ((BUF)->start + RB1) : 0)
#define NEXT_END_INDEX(BUF) (((BUF)->end != (BUF)->size) ? ((BUF)->end + RB2) : 0)

#define is_ringbuf_empty(BUF) ((BUF)->end == (BUF)->start)
#define is_ringbuf_full(BUF) (NEXT_END_INDEX(BUF) == (BUF)->start)

#define ringbuf_write_peek(BUF) (BUF)->elements[(BUF)->end]
#define ringbuf_write_skip(BUF)                   \
    do {                                          \
        (BUF)->end = NEXT_END_INDEX(BUF);         \
        if (is_ringbuf_empty(BUF))                \
            (BUF)->start = NEXT_START_INDEX(BUF); \
    } while (0)

#define ringbuf_read_peek(BUF) (BUF)->elements[(BUF)->start]
#define ringbuf_read_skip(BUF) (BUF)->start = NEXT_START_INDEX(BUF);

#define ringbuf_write(BUF, ELEMENT)        \
    do {                                   \
        ringbuf_write_peek(BUF) = ELEMENT; \
        ringbuf_write_skip(BUF);           \
    } while (0)

#define ringbuf_read(BUF, ELEMENT)        \
    do {                                  \
        ELEMENT = ringbuf_read_peek(BUF); \
        ringbuf_read_skip(BUF);           \
    } while (0)

其測試程式如下:
























#include <assert.h>

RINGBUF_DECL(int, int_buf);

int main()
{
    int_buf my_buf;
    RINGBUF_INIT(my_buf, 2, int);
    assert(is_ringbuf_empty(&my_buf));

    ringbuf_write(&my_buf, 37);
    ringbuf_write(&my_buf, 72);
    assert(!is_ringbuf_empty(&my_buf));

    int first;
    ringbuf_read(&my_buf, first);
    assert(first == 37);

    int second;
    ringbuf_read(&my_buf, second);
    assert(second == 72);

    return 0;
}

請補完程式碼，使得測試程式得以正確執行。

作答區

RB1 = (a) 1
RB2 = (a) 1

解題想法

第 17 & 18 行的 macro 作用為回傳下一個可使用的 start 位置，若buffer內有元素則回傳(BUF)->start+1 (表下一個)，若該位置還沒有元素則回傳 0。
因此 RB1 = (a) 1
第 19 行也是相同機制，macro 作用為回傳下一個可使用的 end 位置，若能使用則回傳(BUF)->end+1 (表下一個)，若 buffer 已滿則回傳 0。
因此 RB2 = (a) 1

延伸問題:

解釋上述程式碼的運作機制，包含 do { … } while (0) 使用的考量

確認詳閱 C 語言前置處理器應用篇和 C 語言程式設計技巧
研讀 Super Fast Circular Ring Buffer Using Virtual Memory trick，指出效能改善的機制並用 C99/C11 (可搭配 GNU extension) 重新實作;

//TODO

測驗 `3`

考慮到以下靜態初始化的 singly-linked list 實作:


















































#include <stdio.h>

/* clang-format off */
#define cons(x, y) (struct llist[]){{y, x}}
/* clang-format on */

struct llist {
    int val;
    struct llist *next;
};

void sorted_insert(struct llist **head, struct llist *node)
{
    if (!*head || (*head)->val >= node->val) {
        SS1;
        SS2;	
        return;
    }
    struct llist *current = *head;
    while (current->next && current->next->val < node->val)
        current = current->next;
    node->next = current->next;
    current->next = node;
}

void sort(struct llist **head)
{
    struct llist *sorted = NULL;
    for (struct llist *current = *head; current;) {
        struct llist *next = current->next;
        sorted_insert(&sorted, current);
        current = next;
    }
    *head = sorted;
}

int main()
{
    struct llist *list = cons(cons(cons(cons(NULL, A), B), C), D);
    struct llist *p;
    for (p = list; p; p = p->next)
        printf("%d", p->val);
    printf("\n");

    sort(&list);
    for (p = list; p; p = p->next)
        printf("%d", p->val);
    printf("\n");
    return 0;
}

其執行結果為:

9547
4579

這裡的 9547 不是周星馳電影裡面的密碼，而是指小行星 9547

請補完程式碼，使程式碼和執行結果相匹配。

作答區

A = (d) 7
B = (c) 4
C = (b) 5
D = (a) 9
SS1 = (d) node->next = *head
SS2 = (b) *head = node

解題想法

第39行宣告完後，list如下圖所示：

因此[D,C,B,A]應依序填入[9,5,4,7]。
A = (d) 7
B = (c) 4
C = (b) 5
D = (a) 9

第12~24行的函式目的是將新node插入list中正確的位置而不破壞由小到大的排序，14~18是在做例外處理，用來處理*head指向NULL的情況、與新增的數值(val)小於list中最小節點的值時如何處理，也就是將新node插在list的前面，圖解如下：

新增一個llist節點

先將*head assign 給新節點的 next，翻成程式碼即為 node->next = *head，
因此 SS1 = (d) node->next = *head

將*head 指到此新node，翻成程式碼即為 *head = node ，
因此SS2 = (b) *head = node

延伸問題:

解釋上述程式碼的運作機制，參照 C 語言程式設計技巧，需要列出相關的 C 語言規格描述並解讀;
研讀 Doubly Linked Lists in C 並比對 Linux 核心的 linked list 實作予以解說;
練習 LeetCode 148. Sort List，嘗試將 Optimizing merge sort 的手法引入到 C 程式中實作
- 用長度為 n 的 list* l，分析 sort(l) 的時間複雜度，得到
  $O (n^{2})$ 的結果
  
  $T (n) = T (1) + T (n - 1) + c n, w h e r e c i s a c o n s t a n t$
  
  $= 2 \cdot T (1) + T (n - 2) + c \cdot (n + (n - 1))$
  
  $= (n - 1) \cdot T (1) + T (1) + c \cdot (n + (n - 1) + (n - 2) + . . . + 2)$
  
  $= O (n^{2})$
- Fundamental Sorting Algorithms 提到 tiled merge sort 此種 cache-aware 演算法的考量點，對於排序大量資料而言，quick sort 和 merge sort 勝過 insertion sort (
  $O (n \cdot l o g (n)) < O (n^{2})$ )，對於小量資料則取決於執行環境。因此在 merge sort 將問題切成 L1 cache size 後，即不再往下遞迴，改用 insertion sort 進行排序。
- 交叉閱讀共筆 eecheng87/quiz3

//TODO

測驗 `4`

LeetCode 287. Find the Duplicate Number 給定一個整數序列，其中會有一個數值重複，請找出。
已知條件：

假設陣列長度為
$n$ ，數值的範圍是
$1$ 到
$n - 1$
重複的數值不一定只重複一次

考慮以下程式碼是可能的解法:


















int findDuplicate(int *nums, int numsSize)
{
    int res = 0;
    const size_t log_2 = 8 * sizeof(int) - __builtin_clz(numsSize);
    for (size_t i = 0; i < log_2; i++) {
        int bit = 1 << i;
        int c1 = 0, c2 = 0;
        for (size_t k = 0; k < numsSize; k++) {
            if (k & bit)
                ++c1;
            if (nums[k] & bit)
                ++c2;
        }
        if (CCC)
            res += bit;
    }
    return res;
}

CCC = (b) c1 < c2

解題想法

以數列[5,2,5]舉例

k	num[k]	bit=0001(c1,c2)	0010	0100	1000
0000	0101	0,1	0,0	0,1	0,0
0001	0010	1,1	0,1	0,1	0,0
0010	0101	1,2	1,1	0,2	0,0
	預期結果	1	0	1	0

因此推測可能的判斷式為 (b) c1 < c2

延伸問題:

解釋上述 LeetCode 題目和程式碼的運作機制，指出改進空間並實作;
實作不同的程式碼 (合法的 C 程式，可運用 GNU extension)，使其表現優於上述程式碼;

//TODO

2020q3 Homework6 (quiz6)

tags: sysprog2020

測驗 1

解題想法

測驗 2

解題想法

測驗 3

解題想法

測驗 4

解題想法

Read more

Sony X Performance (F8132) ROOT 筆記

Android (2021 summer)

2020q3 Quiz 15

2020q3 Homework2 (quiz2)

tags: `sysprog2020`

測驗 `1`

測驗 `2`

測驗 `3`

測驗 `4`