2020q3 Homework6 (quiz6)

contributed by < JimmyLiu0530 >

tags: `進階電腦系統理論與實作`

測驗1: fp32 轉 bfloat16

bfloat16 浮點數格式由 Google 公司發展，最初用於該公司第三代 Tensor 處理單元 (Cloud TPU)。bfloat16 的主要想法是提供 16 位元浮點數格式，其動態範圍與標準 IEEE 754 的 FP32 (Single-precision floating-point format) 相同，但精度較低，相當於指數區和 FP32 保持相同的 8 位元，並將 FP32 的 fraction 區域縮減到 7 位元。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

BFloat16 的範例

4049Hex = 0 10000000 1001001 = 3.14062510 ≈ π

下列是個轉換的程式:





















float fp32tobf16(float x) {
    float y = x;
    int *py = (int *) &y;
    unsigned int exp, man;
    exp = *py & 0x7F800000u;
    man = *py & 0x007FFFFFu;
    if (!exp && !man) /* zero */           
        return x;
    if (exp == 0x7F800000u) /* infinity or NaN */
        return x;

    /* Normalized number. round to nearest */
    float r = x;
    int *pr = (int *) &r;
    *pr &= BB1;
    r /= 256;
    y = x + r;

    *py &= BB2;
    return y;
}

對應的測試程式:















void print_hex(float x) {
    int *p = (int *) &x;
    printf("%f=%x\n", x, *p);
}

int main() {
    float a[] = {3.140625, 1.2, 2.31, 3.46, 5.63};
    for (int i = 0; i < sizeof(a) / sizeof(a[0]); i++) {
        print_hex(a[i]);
        float bf_a = fp32tobf16(a[i]);
        print_hex(bf_a);
    }

    return 0;
}

參考執行結果:

3.140625=40490000
3.140625=40490000
1.200000=3f99999a
1.203125=3f9a0000
2.310000=4013d70a
2.312500=40140000
3.460000=405d70a4
3.453125=405d0000
5.630000=40b428f6
5.625000=40b40000

程式說明

程式上半段主要處理 0 、 infinite 及 NaN。BFloat16 的這三種特別值的編碼與 float32 差不多，只差在 significand 的位數而已，如下圖所示:

val	BFloat16
+0	0 00000000 0000000
-0	1 00000000 0000000
+inf	0 11111111 0000000
-inf	1 11111111 0000000
+NaN	0 11111111 非0
-NaN	1 11111111 非0

程式碼下半段則是處理進位，這裡是採進位到最靠近的數 (Round to nearest)。那要如何操作呢? 我們先觀察下面的幾個例子:

舉 1.200000 為例，16進位為 0x3f99999a，因為此值超過 0x3f998000 (即 0x3f990000 與 0x3f9a0000 的中點)，因此將來會進位成 0x3F9AXXXX。
舉 3.460000 為例，16進位為 0x405d70a4，因為此值沒超過 0x405d8000 (即 0x405d0000 與 0x405e0000 的中點)，因此將來不會進位，仍是 0x405dXXXX。

由上面的例子可以看出，是否會進位到下一個數其實是由第 15 個位元決定的，因為若第 15 個位元為 1，則代表此值一定在中點上或超過中點，因此需要進位到下一個數值。若第 15 個位元為 0，則代表不超過中點，因此不需要進位。綜合上面兩種可能，我們只需要對第 15 位元 +1，即可實現進位或不進位。

那要如何針對原本的 float32 的第 15 位元 +1 ? (假設浮點數為

(- 1)^{S} * (1. F) * 2^{E}

)
首先，取得該值的 exp，對應到程式碼 line 15，因此 BB1 = 0xff800000。此時，r 會等於

(- 1)^{S} * 2^{E}

。

接著再執行 r/= 256;，r 會變成

(- 1)^{S} * 2^{E - 8}

。

之後 line 17:

y = x + r

(- 1)^{S} * (1. F) * 2^{E} + (- 1)^{S} * 2^{E - 8}

(- 1)^{S} * 2^{E - 8} * ((1. F) * 2^{8} + 1)

，而

(1. F) * 2^{8} + 1

就會針對第 15 個位元+1。

最後因為 BFloat16 只使用 16 位元，BB2 = 0xffff0000，只留下最左邊 16 個位元。

Note
可以將





    float r = x;
    int *pr = (int *) &r;
    *pr &= BB1;
    r /= 256;
    y = x + r;

替換成


    int mask = 0x00008000;
    *py += mask;

會得到相同的結果。

延伸問題

1. 改進上述程式碼，實作 DP/SP 轉換程式，並針對 BFloat16 撰寫測試程式，應儘量涵蓋有效的數值輸入和輸出範圍

2. 考慮批次轉換 FP32/FP64 為 BFloat16 的需求，請改進上述程式碼，得以提高轉換的效率

測驗2: Ring buffer

考慮以下 ring buffer 的實作:













































#define RINGBUF_DECL(T, NAME) \
    typedef struct {          \
        int size;             \
        int start, end;       \
        T *elements;          \
    } NAME

#define RINGBUF_INIT(BUF, S, T)             \
    {                                       \
        static T static_ringbuf_mem[S + 1]; \
        BUF.elements = static_ringbuf_mem;  \
    }                                       \
    BUF.size = S;                           \
    BUF.start = 0;                          \
    BUF.end = 0;

#define NEXT_START_INDEX(BUF) \
    (((BUF)->start != (BUF)->size) ? ((BUF)->start + RB1) : 0)
#define NEXT_END_INDEX(BUF) (((BUF)->end != (BUF)->size) ? ((BUF)->end + RB2) : 0)

#define is_ringbuf_empty(BUF) ((BUF)->end == (BUF)->start)
#define is_ringbuf_full(BUF) (NEXT_END_INDEX(BUF) == (BUF)->start)

#define ringbuf_write_peek(BUF) (BUF)->elements[(BUF)->end]
#define ringbuf_write_skip(BUF)                   \
    do {                                          \
        (BUF)->end = NEXT_END_INDEX(BUF);         \
        if (is_ringbuf_empty(BUF))                \
            (BUF)->start = NEXT_START_INDEX(BUF); \
    } while (0)

#define ringbuf_read_peek(BUF) (BUF)->elements[(BUF)->start]
#define ringbuf_read_skip(BUF) (BUF)->start = NEXT_START_INDEX(BUF);

#define ringbuf_write(BUF, ELEMENT)        \
    do {                                   \
        ringbuf_write_peek(BUF) = ELEMENT; \
        ringbuf_write_skip(BUF);           \
    } while (0)

#define ringbuf_read(BUF, ELEMENT)        \
    do {                                  \
        ELEMENT = ringbuf_read_peek(BUF); \
        ringbuf_read_skip(BUF);           \
    } while (0)

其測試程式如下:
























#include <assert.h>

RINGBUF_DECL(int, int_buf);

int main()
{
    int_buf my_buf;
    RINGBUF_INIT(my_buf, 2, int);
    assert(is_ringbuf_empty(&my_buf));

    ringbuf_write(&my_buf, 37);
    ringbuf_write(&my_buf, 72);
    assert(!is_ringbuf_empty(&my_buf));

    int first;
    ringbuf_read(&my_buf, first);
    assert(first == 37);

    int second;
    ringbuf_read(&my_buf, second);
    assert(second == 72);

    return 0;
}

程式說明

ring buffer 也就是我們熟知的 circular queue，如下圖所示:

這種資料結構有兩個索引 start及 end 分別紀錄第一個及最後插入元素的索引。以下是將 start 改指向下一個元素的操作:


#define NEXT_START_INDEX(BUF) \
(((BUF)->start != (BUF)->size) ? ((BUF)->start + 1) : 0)

，因此 RB1 = 1。這邊因為是 ring buffer 的結構，所以如果原本 start 已經等於 (BUF)->size，那麼下一個元素就會回到 index = 0。

同理，end 的下一個元素也會是:


#define NEXT_END_INDEX(BUF) \
(((BUF)->end != (BUF)->size) ? ((BUF)->end + 1) : 0)

，因此 RB2 = 1。

此外，ring buffer 採用 FIFO，所以在 ringbuf_write 中分成兩個動作:

將元素寫入 end 目前所指的格子
再將 end 移往下一格，並檢查移動完的 end 是否跟 start 指到同一格。若是，則將來寫入時會覆蓋掉此格，因此 start 需移往下一格。

同樣地，在 ringbuf_read 中也是分成兩個動作，只是變成:

將讀出的元素 assign 到 ELEMENT
再將 start 移往下一格

最後，為何在 macro 中會使用到 do{...} while(0)，這看似多餘的寫法背後到底隱含甚麼意義?
先考慮下面的例子:














#define swapint(x,y)     \
{
    int temp;           \
    temp = x;           \
    x = y;              \
    y = temp;           \
}
int main(void)
{
    if (...)
        swapint(x,y);
    else
        ...
}

當此程式 compile 時，macro 會被展開成下面的樣子:












int main(void)
{
    if (...)
    {
        int temp;           
        temp = x;          
        x = y;             
        y = temp;   
    }; /* compile error */
    else
        ...
}

那如果換成這樣:














#define swapint(x,y)     \
{
    int temp;           \
    temp = x;           \
    x = y;              \
    y = temp;           \
}
int main(void)
{
    if (...)
        swapint(x,y) /* remove ';' */
    else
        ...
}

雖然為了解決這個問題，避免了 compile error，但不符合一般的寫法，也不易讀。
最好的方法就是:















#define swapint(x,y)    \
do                      \
{                       \
    int temp;           \
    temp = x;           \
    x = y;              \
    y = temp;           \
} while(0)
int main(void)
{
    if (...)
        swapint(x,y);
    else
        ...
}

展開之後會變成













int main(void)
{
    if (...)
        do                     
        {                       
            int temp;          
            temp = x;           
            x = y;              
            y = temp;           
        } while(0);
    else
        ...
}

就不會再跑出 compile error了。

延伸問題

1. 研讀 Super Fast Circular Ring Buffer Using Virtual Memory trick，指出效能改善的機制並用 C99/C11 (可搭配 GNU extension) 重新實作

測驗3: Static Linked-list Sorting

考慮到以下靜態初始化的 singly-linked list 實作:


















































#include <stdio.h>

/* clang-format off */
#define cons(x, y) (struct llist[]){{y, x}}
/* clang-format on */

struct llist {
    int val;
    struct llist *next;
};

void sorted_insert(struct llist **head, struct llist *node)
{
    if (!*head || (*head)->val >= node->val) {
        SS1;
        SS2;	
        return;
    }
    struct llist *current = *head;
    while (current->next && current->next->val < node->val)
        current = current->next;
    node->next = current->next;
    current->next = node;
}

void sort(struct llist **head)
{
    struct llist *sorted = NULL;
    for (struct llist *current = *head; current;) {
        struct llist *next = current->next;
        sorted_insert(&sorted, current);
        current = next;
    }
    *head = sorted;
}

int main()
{
    struct llist *list = cons(cons(cons(cons(NULL, A), B), C), D);
    struct llist *p;
    for (p = list; p; p = p->next)
        printf("%d", p->val);
    printf("\n");

    sort(&list);
    for (p = list; p; p = p->next)
        printf("%d", p->val);
    printf("\n");
    return 0;
}

其執行結果為:


9547
4579

程式說明

透過觀察執行結果的第一行 9547，我們可以知道這個鏈結串列會是:

再根據 line 4 及 39，就可以得到 A = 7, B = 4, C = 5, D = 9。

其原理就是利用 (參考 C 語言程式設計技巧)

compound literal 來建立 static linked list
可針對未知大小的 array/struct 初始化，也可以直接對特定位置的 array 或 struct 的內容賦值
compound literal 可以做 lvalue ，可以傳特定型別或自己定義的 struct
e.g.


struct S{int x; int y;};
struct S A = (struct S){1, 2};

意同於



struct S A;
A.x = 1;
A.y = 2;

C99 [6.5.2.5] Compound literals

The type name shall specify an object type or an array of unknown size, but not a variable length array type

A postfix expression that consists of a parenthesized type name followed by a braceenclosed list of initializers is a compound literal. It provides an unnamed object whose value is given by the initializer list.

If the type name specifies an array of unknown size, the size is determined by the initializer list as specified in 6.7.8, and the type of the compound literal is that of the completed array type. Otherwise (when the type name specifies an object type), the type of the compound literal is that specified by the type name. In either case, the result is an lvalue.

接著，當呼叫 sort 後會有兩個指標 current 及 next 分別指向目前的節點及下一個節點，再呼叫 sorted_insert 將已經排序好的鏈結串列 sorted 與 current 進行排序。結束後， current 往下一個節點移動，重複以上步驟直到 current 指向 NULL。

在 sorted_insert 中，如果傳入的 sorted 串列為空，或是傳入的 node->val 比目前以排序好的串列首的 val 還小，那麼直接 BB1 = node->next = *head; ，BB2 = *head = node;。

傳入的 sorted 串列為空:

或傳入的 node->val 比目前以排序好的串列首的 val 還小 (2 < 5):

執行後會變成:

或

否則，node 要跟 sorted 內的節點一個一個比，直到遇到一個節點其 val 比 node->val 大的，並差在這個節點的前面。

延伸問題

1. 研讀 Doubly Linked Lists in C 並比對 Linux 核心的 linked list 實作予以解說

2. 練習 LeetCode 148. Sort List，嘗試將 Optimizing merge sort 的手法引入到 C 程式中實作

Fundamental Sorting Algorithms 提到 tiled merge sort 此種 cache-aware 演算法的考量點，對於排序大量資料而言，quick sort 和 merge sort 勝過 insertion sort (
$O (n l o g (n) < O (n^{2}))$ ，對於小量資料則取決於執行環境。因此在 merge sort 將問題切成 L1 cache size 後，即不再往下遞迴，改用 insertion sort 進行排序。

交叉閱讀共筆 eecheng87/quiz3

測驗4: LeetCode 287. Find the Duplicate Number

給定一個整數序列，其中會有一個數值重複，請找出。
已知條件：

假設陣列長度為 n，數值的範圍是 1 到 n-1
重複的數值不一定只重複一次，可能兩次、三次等等

考慮以下程式碼是可能的解法:


















int findDuplicate(int *nums, int numsSize)
{
    int res = 0;
    const size_t log_2 = 8 * sizeof(int) - __builtin_clz(numsSize);
    for (size_t i = 0; i < log_2; i++) {
        int bit = 1 << i;
        int c1 = 0, c2 = 0;
        for (size_t k = 0; k < numsSize; k++) {
            if (k & bit)
                ++c1;
            if (nums[k] & bit)
                ++c2;
        }
        if (CCC)
            res += bit;
    }
    return res;
}

程式說明

根據鴿籠原理，長度 n，數值範圍:[1, n-1] 的陣列至少會有兩元素重複，因此接下來我們要找出重複的數值是多少。

此方法就是一個位元一個位元去比較所有 nums[] 以及 [0,..,n-1] 1 的個數。

如果在某位元 k, nums[] 比 [0,1,...,n-1] 擁有更多的 1，那麼就代表重複的數值的第 k 位元必為 1，所以 CCC = c2 > c1。如此重複地檢查每個位元，即可求出重複值。

至於需要檢查多少個位元，在 line 4-5 因為陣列長度為 numsSize，所以所有數值皆小於 numsSize。而 numsSize 需要

⌈ l o g_{2} n u m s S i z e ⌉

位元來表示其數值，所以只需檢查這

⌈ l o g_{2} n u m s S i z e ⌉

個位元即可。

執行效能:
此方法的時間複雜度:

O (n l o g n)

n

為陣列大小。

延伸問題

1. 實作不同的程式碼 (合法的 C 程式，可運用 GNU extension)，使其表現優於上述程式碼















int findDuplicate(int* nums, int numsSize) {
    int slow = nums[0], fast = nums[nums[0]];
        while(slow != fast)
        {
            slow = nums[slow];
            fast = nums[nums[fast]];
        }
        fast = 0;
        while(fast != slow)//find the start of the loop which means at least two integer are the same value;
        {
            slow = nums[slow];
            fast = nums[fast];
        }
        return slow;
}

程式說明:
由於陣列中的數值被限制在 [1, n-1] 之間，因此這些數值又可以視為陣列的索引。當我們從 index = 0 開始走訪此陣列，過一陣子會發現我們無法離開這個陣列，這是因為陣列中有重複數值的緣故 (重複值代表他們指向同一個 index)，才會形成一個 circle，而 entry of circle 就會是重複值。因此，整個問題轉化成找此 circle 的 entry point。
Entry of circle (Floyd's Tortoise and Hare algorithm)

簡單來說，一開始都從起始點 (list start) 開始走，烏龜一次走一步，兔子一次走兩步，經過一段時間會在 meeting point 相遇 (對應到程式 line 3-7)。相遇後，將兔子的速度降至與烏龜相同，並且將兔子移至起始點後，再讓他們繼續走 (對應到程式 line 9-13)。最後他們第一次相遇的地方就會是 entry of circle。

詳細的說明這邊有。

執行效能:
此方法的時間複雜度:
$O (n)$ ,
$n$ 為陣列大小

2020q3 Homework6 (quiz6)

tags: 進階電腦系統理論與實作

測驗1: fp32 轉 bfloat16

程式說明

延伸問題

1. 改進上述程式碼，實作 DP/SP 轉換程式，並針對 BFloat16 撰寫測試程式，應儘量涵蓋有效的數值輸入和輸出範圍

2. 考慮批次轉換 FP32/FP64 為 BFloat16 的需求，請改進上述程式碼，得以提高轉換的效率

測驗2: Ring buffer

程式說明

延伸問題

1. 研讀 Super Fast Circular Ring Buffer Using Virtual Memory trick，指出效能改善的機制並用 C99/C11 (可搭配 GNU extension) 重新實作

測驗3: Static Linked-list Sorting

程式說明

延伸問題

1. 研讀 Doubly Linked Lists in C 並比對 Linux 核心的 linked list 實作予以解說

2. 練習 LeetCode 148. Sort List，嘗試將 Optimizing merge sort 的手法引入到 C 程式中實作

測驗4: LeetCode 287. Find the Duplicate Number

程式說明

延伸問題

1. 實作不同的程式碼 (合法的 C 程式，可運用 GNU extension)，使其表現優於上述程式碼

Read more

台灣萬萬稅

CS:APP 第一章重點整理 - Introduction

GDB介紹

Ubuntu 20.04 wifi 選項消失之解決方法

tags: `進階電腦系統理論與實作`