2022q1 Homework2 (quiz2)

contributed by < cy023 >

測驗題

測驗 `1`

Version 1 : 可能 Overflow

先計算 a + b 再除以 2，但有機會在計算 a + b 時就發生溢位，導致計算結果錯誤。

uint32_t average(uint32_t a, uint32_t b)
{
    return (a + b) / 2;
}

Version 2 : 避免 Overflow

可以避免在 Version 1 中將 a, b 直接相加溢位而導致的錯誤。
但是使用這個函式前提是需要確保大數小數傳參傳入的順序。
例如呼叫 average(100, 10); 在函式內會計算 (10 - 100) 因為傳參的型態都是無號數 (unsigned) ，所以也會導致預期外的溢位。

uint32_t average(uint32_t low, uint32_t high)
{
    return low + (high - low) / 2;
}

試著將傳參改為 int32_t 型態想解決上述問題，但發現 (high - low) 為奇數時，大數小數傳入的順序不同，甚至可能導致輸出結果不同。

eg.

uint32_t average21(int32_t low, int32_t high)
{
    return low + (high - low) / 2;
}

當 (high - low) < 0 會將 high - low 向上取整。
- 呼叫 average21(100, 9); 結果為 55
  
  100 + (9 - 100) / 2 = 100 + (int32_t)-45.5 = 100 + (-45) = 55
當 (high - low) > 0 會將 high - low 向下取整。
- 呼叫 average21(9, 100); 結果為 54
  
  9 + (100 - 9) / 2 = 9 + (int32_t)45.5 = 9 + 45 = 54

Version 3 : 將 Version 1 改寫為 `bitwise` 操作

其中 EXP1 為 a, b, 1 (數字) 進行某種特別 bitwise 操作，限定用 OR, AND, XOR 這三個運算子。

uint32_t average(uint32_t a, uint32_t b)
{
    return (a >> 1) + (b >> 1) + (EXP1);
}

(a >> 1), (b >> 1) 直接將 a, b 兩數值分別除以 2 (進行 >> 運算)，考慮到 a, b 兩數值為 uint32_t 型態，若無法整除，會有小數被捨去。

如果 a, b 皆為奇數，計算

\frac{a}{2}

與

\frac{b}{2}

，分別會被捨去 0.5 (共捨去 1)，因此計算平均時需要將 a, b 皆為奇數的情況下補回 1。

a & 1 運算可以檢查數字 a 的 LSB 是否為 1，等效於檢查 a 是否為奇數。

根據以上，EXP1 敘述，為當 a, b 都是奇數時需要補回的數，填入 a & b & 1 檢查 a, b 是不是都是奇數。

uint32_t average(uint32_t a, uint32_t b)
{
    return (a >> 1) + (b >> 1) + (a & b & 1);
}

Version 4 : `bitwise` 操作

其中 EXP2 和 EXP3 為 a 和 b 進行某種特別 bitwise 操作，限定用 OR, AND, XOR 這三個運算子。

uint32_t average(uint32_t a, uint32_t b)
{
    return (EXP2) + ((EXP3) >> 1);
}

參考二進位加法的概念。

先考慮 a, b 為 1 bit
a & b 用於檢查 a, b 是不是同為 1，如果是則需要進位。

a b a & b Description

0 0 0 a + b = 0 不需要進位

0 1 0 a + b = 1 不需要進位

1 0 0 a + b = 1 不需要進位

1 1 1 a + b = 2 需要進位
a ^ b 用於將兩數相加，但不會保留進位的資訊。

a b a ^ b Description

0 0 0 a + b = 0

0 1 1 a + b = 1

1 0 1 a + b = 1

1 1 0 a + b = 2 (需要進位變成
$10_{b}$ )
討論 a, b 為 uint32_t 型態
可以將 a + b 表示成相似題目表達式的，((a & b) << 1) + (a ^ b) (可以想成小學的直式加法，需要進位時，在比當前高一位的數字加一)。
(a + b) / 2 可以表示成，(((a & b) << 1) + (a ^ b)) >> 1，簡化成 (a & b) + (a ^ b) >> 1

a	b	a & b	Description
0	0	0	a + b = 0 不需要進位
0	1	0	a + b = 1 不需要進位
1	0	0	a + b = 1 不需要進位
1	1	1	a + b = 2 需要進位

a	b	a ^ b	Description
0	0	0	a + b = 0
0	1	1	a + b = 1
1	0	1	a + b = 1
1	1	0	a + b = 2 (需要進位變成 $10_{b}$ )

uint32_t average(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}

延伸問題:

1. 解釋上述程式碼運作的原理
2. 比較上述實作在編譯器最佳化開啟的狀況，對應的組合語言輸出，並嘗試解讀 (可研讀 CS:APP 第 3 章)
3. 研讀 Linux 核心原始程式碼 include/linux/average.h，探討其 Exponentially weighted moving average (EWMA) 實作，以及在 Linux 核心的應用

移動平均（Moving average)，又稱滾動平均值、滑動平均，在統計學中是種藉由建立整個資料集合中不同子集的一系列平均數，來分析資料點的計算方法。
移動平均通常與時間序列資料一起使用，以消除短期波動，突出長期趨勢或周期。短期和長期之間的閾值取決於應用，移動平均的參數將相應地設置。例如，它通常用於對財務數據進行技術分析，如股票價格、收益率或交易量。它也用於經濟學中研究國內生產總值、就業或其他宏觀經濟時間序列。

測驗 `2`

改寫〈解讀計算機編碼〉一文的「不需要分支的設計」一節提供的程式碼 min，我們得到以下實作 (max):

#include <stdint.h>
uint32_t max(uint32_t a, uint32_t b)
{
    return a ^ ((EXP4) & -(EXP5));
}

當 a > b
max 需要回傳 a ，也就是 a ^ 0 (任意數字對 0 作 XOR 操作，結果與原數值相同)
此時 ((EXP4) & -(EXP5)) 為 0
當 a < b
max 需要回傳 b ，也就是 a ^ a ^ b (b 對 a 作亮次 XOR 操作，結果為 b)
此時 ((EXP4) & -(EXP5)) 為 a ^ b
觀察 ((EXP4) & -(EXP5)) 結構。
- EXP4 填入 a ^ b
- EXP5 作為 a, b 兩數值比較的判斷
  - 當 a > b
    - -(EXP5) 要是 0 (
      $00000000000000000000000000000000_{b}$ )
  - 當 a < b
    - -(EXP5) 要是 -1 (
      $11111111111111111111111111111111_{b}$ )
  - 取 EXP5 為 a < b

#include <stdint.h>
uint32_t max(uint32_t a, uint32_t b)
{
    return a ^ ((a ^ b) & -(a < b));
}

延伸問題:

1. 解釋上述程式碼運作的原理
2. 針對 32 位元無號/有號整數，撰寫同樣 branchless 的實作

3. Linux 核心也有若干 branchless / branch-free 的實作，例如 lib/sort.c:

/*
 * Logically, we're doing "if (i & lsbit) i -= size;", but since the
 * branch is unpredictable, it's done with a bit of clever branch-free
 * code instead.
 */
__attribute_const__ __always_inline
static size_t parent(size_t i, unsigned int lsbit, size_t size)
{
    i -= size;
    i -= size & -(i & lsbit);
    return i / 2;
}

請在 Linux 核心原始程式碼中，找出更多類似的實作手法。請善用 git log 檢索。

測驗 `3`

輾轉相除法

#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v)
{
    if (!u || !v) return u | v;
    int shift;
    for (shift = 0; !((u | v) & 1); shift++) {
        u /= 2, v /= 2;
    }
    while (!(u & 1))
        u /= 2;
    do {
        while (!(v & 1))
            v /= 2;
        if (u < v) {
            v -= u;
        } else {
            uint64_t t = u - v;
            u = v;
            v = t;
        }
    } while (COND);
    return RET;
}

註: GCD 演算法

If both x and y are 0, gcd is zero
$g c d (0, 0) = 0$ .
$g c d (x, 0) = x$ and
$g c d (0, y) = y$ because everything divides 0.
If x and y are both even,
$g c d (x, y) = 2 * g c d (\frac{x}{2}, \frac{y}{2})$ because
$2$ is a common divisor. Multiplication with
$2$ can be done with bitwise shift operator.
If x is even and y is odd,
$g c d (x, y) = g c d (\frac{x}{2}, y)$ .

On similar lines, if x is odd and y is even, then

g c d (x, y) = g c d (x, \frac{y}{2})

. It is because

2

is not a common divisor.

while (!(u & 1))
    u /= 2;
do {
    while (!(v & 1))
        v /= 2;
    if (u < v) {
        v -= u;
    } else {
        uint64_t t = u - v;
        u = v;
        v = t;
    }
} while (COND);

將 u 保持

If both x and y are 0, gcd is zero
$g c d (0, 0) = 0$ .

g c d (x, 0) = x

and

g c d (0, y) = y

because everything divides 0.

if (!u || !v)
    return u | v;

If x and y are both even,
$g c d (x, y) = 2 * g c d (\frac{x}{2}, \frac{y}{2})$ because
$2$ is a common divisor. Multiplication with
$2$ can be done with bitwise shift operator.
```
for (shift = 0; !((u | v) & 1); shift++) {
    u /= 2, v /= 2;
}
```
其中 !((u | v) & 1) 判斷 u, v 是否同為偶數。shift 變數用來紀錄要將後續結果作幾次 *2 (<< 1) 的運算。
If x is even and y is odd,
$g c d (x, y) = g c d (\frac{x}{2}, y)$ .
On similar lines, if x is odd and y is even, then
$g c d (x, y) = g c d (x, \frac{y}{2})$ . It is because
$2$ is not a common divisor.

延伸問題:

1. 解釋上述程式運作原理;
2. 在 x86_64 上透過 __builtin_ctz 改寫 GCD，分析對效能的提升;
3. Linux 核心中也內建 GCD (而且還不只一種實作)，例如 lib/math/gcd.c，請解釋其實作手法和探討在核心內的應用場景。

測驗 `4`

參考 Introduction to Low Level Bit Hacks

Bit Hack #7. Isolate the rightmost 1-bit.
y = x & (-x)

This bit hack finds the rightmost 1-bit and sets all the other bits to 0. The end result has only that one rightmost 1-bit set. For example, 01010100 (rightmost bit in bold) gets turned into 00000100.

#include <stddef.h>
size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
    size_t pos = 0;
    uint64_t bitset;
    for (size_t k = 0; k < bitmapsize; ++k) {
        bitset = bitmap[k];
        while (bitset != 0) {
            uint64_t t = EXP6;
            int r = __builtin_ctzll(bitset);
            out[pos++] = k * 64 + r;
            bitset ^= t;                           
        }
    }
    return pos;
}

bitset & -bitset

測驗 `5`

LeetCode 提交版本

#include <stdbool.h>                       
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

////////////////////////////////////////////////////////////
// #include "list.h"

struct list_head {
    struct list_head *prev;
    struct list_head *next;
};

static inline void INIT_LIST_HEAD(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static inline void list_add(struct list_head *node, struct list_head *head)
{
    struct list_head *next = head->next;

    next->prev = node;
    node->next = next;
    node->prev = head;
    head->next = node;
}

#define container_of(ptr, type, member)                            \
    __extension__({                                                \
        const __typeof__(((type *) 0)->member) *__pmember = (ptr); \
        (type *) ((char *) __pmember - offsetof(type, member));    \
    })

#define list_entry(node, type, member) container_of(node, type, member)

#define list_for_each_entry(entry, head, member)                       \
    for (entry = list_entry((head)->next, __typeof__(*entry), member); \
         &entry->member != (head);                                     \
         entry = list_entry(entry->member.next, __typeof__(*entry), member))

////////////////////////////////////////////////////////////

struct rem_node {
    int key;
    int index;
    struct list_head link;
};  
    
static int find(struct list_head *heads, int size, int key)
{
    struct rem_node *node;
    int hash = key % size;
    list_for_each_entry (node, &heads[hash], link) {
        if (key == node->key)
            return node->index;
    }
    return -1;
}

char *fractionToDecimal(int numerator, int denominator)
{
    int size = 1024;
    char *result = malloc(size);
    char *p = result;

    if (denominator == 0) {
        result[0] = '\0';
        return result;
    }

    if (numerator == 0) {
        result[0] = '0';
        result[1] = '\0';
        return result;
    }

    /* using long long type make sure there has no integer overflow */
    long long n = numerator;
    long long d = denominator;

    /* deal with negtive cases */
    if (n < 0)
        n = -n;
    if (d < 0)
        d = -d;

    bool sign = (float) numerator / denominator >= 0;
    if (!sign)
        *p++ = '-';

    long long remainder = n % d;
    long long division = n / d;

    sprintf(p, "%ld", division > 0 ? (long) division : (long) -division);
    if (remainder == 0)
        return result;

    p = result + strlen(result);
    *p++ = '.';

    /* Using a map to record all of reminders and their position.
     * if the reminder appeared before, which means the repeated loop begin,
     */
    char *decimal = malloc(size);
    memset(decimal, 0, size);
    char *q = decimal;

    size = 1333;
    struct list_head *heads = malloc(size * sizeof(*heads));
    for (int i = 0; i < size; i++)
        INIT_LIST_HEAD(&heads[i]);

    for (int i = 0; remainder; i++) {
        int pos = find(heads, size, remainder);
        if (pos >= 0) {
            while (pos-- > 0)
                *p++ = *decimal++;
            *p++ = '(';
            while (*decimal != '\0')
                *p++ = *decimal++;
            *p++ = ')';
            *p = '\0';
            return result;
        }
        struct rem_node *node = malloc(sizeof(*node));
        node->key = remainder;
        node->index = i;

        list_add(&node->link, &heads[remainder % size]);

        *q++ = (remainder * 10) / d + '0';
        remainder = (remainder * 10) % d;
    }

    strcpy(p, decimal);
    return result;
}

測驗 `6`

/*
 * ALIGNOF - get the alignment of a type
 * @t: the type to test
 *
 * This returns a safe alignment for the given type.
 */
#define ALIGNOF(t) \
    ((char *)(&((struct { char c; t _h; } *)0)->M) - (char *)X)

請補完上述程式碼，使得功能與 GNU extension: __alignof__ 等價。

將表示式拆解

// 將地址 0x0 轉型為指向 struct { char c; t _h; } 型態的指標
// 也可以理解成把結構開頭對齊到地址 0x0
(struct { char c; t _h; } *)0

接著往外層看

// 此處 `M` 應為 `struct { char c; t _h; }` 的成員
&((struct { char c; t _h; } *)0)->M

// M 填入 _h 可以得到型態 t 的地址，也就是相對於地址 0x0 的對齊
&((struct { char c; t _h; } *)0)->_h

最後

// 這裡的 X 填入 0，可以被省略，不太清楚為何要這樣設計
((char *)(&((struct { char c; t _h; } *)0)->M) - (char *)X)

延伸問題

__alignof__

$ grep -r "__alignof__" | wc -l
260

drivers/of/fdt.c

/* Allocate memory for the expanded device tree */
    mem = dt_alloc(size + 4, __alignof__(struct device_node));

kernel/sched/sched.h

/*
 * Helper to define a sched_class instance; each one is placed in a separate
 * section which is ordered by the linker script:
 *
 *   include/asm-generic/vmlinux.lds.h
 *
 * Also enforce alignment on the instance, not the type, to guarantee layout.
 */
#define DEFINE_SCHED_CLASS(name) \
const struct sched_class name##_sched_class \
    __aligned(__alignof__(struct sched_class)) \
    __section("__" #name "_sched_class")

include/linux/align.h

/* @a is a power of 2 value */
#define ALIGN(x, a)		__ALIGN_KERNEL((x), (a))
#define ALIGN_DOWN(x, a)	__ALIGN_KERNEL((x) - ((a) - 1), (a))
#define __ALIGN_MASK(x, mask)	__ALIGN_KERNEL_MASK((x), (mask))
#define PTR_ALIGN(p, a)		((typeof(p))ALIGN((unsigned long)(p), (a)))
#define PTR_ALIGN_DOWN(p, a)	((typeof(p))ALIGN_DOWN((unsigned long)(p), (a)))
#define IS_ALIGNED(x, a)		(((x) & ((typeof(x))(a) - 1)) == 0)

include/uapi/linux/const.h

#define __ALIGN_KERNEL(x, a)		__ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask)	(((x) + (mask)) & ~(mask))

tools/testing/selftests/net/tcp_mmap.c

#define ALIGN_UP(x, align_to)	(((x) + ((align_to)-1)) & ~((align_to)-1))

tools/testing/selftests/vm/pkey-helpers.h

#define ALIGN_UP(x, align_to)	(((x) + ((align_to)-1)) & ~((align_to)-1))
#define ALIGN_DOWN(x, align_to) ((x) & ~((align_to)-1))
#define ALIGN_PTR_UP(p, ptr_align_to)	\
    ((typeof(p))ALIGN_UP((unsigned long)(p), ptr_align_to))
#define ALIGN_PTR_DOWN(p, ptr_align_to)	\
    ((typeof(p))ALIGN_DOWN((unsigned long)(p), ptr_align_to))

延伸問題:

1. 解釋上述程式碼運作原理
2. 在 Linux 核心原始程式碼中找出 __alignof__ 的使用案例 2 則，並針對其場景進行解說
3. 在 Linux 核心源使程式碼找出 ALIGN, ALIGN_DOWN, ALIGN_UP 等巨集，探討其實作機制和用途，並舉例探討 (可和上述第二點的案例重複)。思索能否對 Linux 核心提交貢獻，儘量共用相同的巨集

測驗 `7`

static inline bool is_divisible(uint32_t n, uint64_t M)
{
    return n * M <= M - 1;
}

static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;

int main(int argc, char **argv)
{
    for (size_t i = 1; i <= 100; i++) {
        uint8_t div3 = is_divisible(i, M3);
        uint8_t div5 = is_divisible(i, M5);
        unsigned int length = (2 << KK1) << KK2;

        char fmt[9];
        strncpy(fmt, &"FizzBuzz%u"[(9 >> div5) >> (KK3)], length);
        fmt[length] = '\0';

        printf(fmt, i);
        printf("\n");
    }
    return 0;
}

先從以下程式碼段落推敲出判斷整除的核心邏輯

static inline bool is_divisible(uint32_t n, uint64_t M)
{
    return n * M <= M - 1;
}

static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;

利用到編碼循環的特性 (但不是很好理解 :(

is_divisible(2, M3);
// 2 * M3 <= M3 - 1
// 2 * (UINT64_MAX / 3 + 1) <= (UINT64_MAX / 3)
// false

is_divisible(3, M3);
// 3 * M3 <= M3 - 1
// 3 * (UINT64_MAX / 3 + 1) <= (UINT64_MAX / 3)
// 3 * (UINT64_MAX / 3 + 1) 會溢位，變成 3 - 1
// true

is_divisible(4, M3);
// 4 * M3 <= M3 - 1
// 4 * (UINT64_MAX / 3 + 1) <= (UINT64_MAX / 3)
// 4 * (UINT64_MAX / 3 + 1) 會溢位，變成 M3 + 3 - 1
// false

簡單測試

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

static inline bool is_divisible(uint32_t n, uint64_t M)
{
    return n * M <= M - 1;
}

static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;

int main(int argc, char **argv)
{
    printf("UINT64_MAX = 0x%lx\n", UINT64_MAX);
    printf("M3 = 0x%lx\n", M3);
    printf("M5 = 0x%lx\n\n", M5);
    
    int i;
    for (i = 0; i <= 10; i++) {
        printf("%d * M3 = 0x%lx \n", i, i * M3);
        printf("M3 - 1 = 0x%lx\n\n", M3 - 1);
    }
    return 0;
}

簡單來說 3 如果可以整除 i，is_divisible(i, M3) 會回傳 true

再來看主要功能，

unsigned int length = (2 << KK1) << KK2;
strncpy(fmt, &"FizzBuzz%u"[(8 >> div5) >> (KK3)], length);

Condition	length	(8>>div5)>>(KK3)
3 的倍數 (非 5 的倍數)	4	0
5 的倍數 (非 3 的倍數)	4	4
15 的倍數	8	0
其他	digit length	8

length: (2 << div3) << div5
index: (8 >> div5) >> (div3 << 2)

延伸問題:

1. 解釋上述程式運作原理並評估 naive.c 和 bitwise.c 效能落差
避免 stream I/O 帶來的影響，可將 printf 更換為 sprintf
2. 分析 Faster remainders when the divisor is a constant: beating compilers and libdivide 一文的想法 (可參照同一篇網誌下方的評論)，並設計另一種 bitmask，如「可被 3 整除則末位設為 1」「可被 5 整除則倒數第二位設定為 1」，然後再改寫 bitwise.c 程式碼，試圖運用更少的指令來實作出 branchless;
參照 fastmod: A header file for fast 32-bit division remainders on 64-bit hardware
3. 研讀 The Fastest FizzBuzz Implementation 及其 Hacker News 討論，提出 throughput (吞吐量) 更高的 Fizzbuzz 實作
4. 解析 Linux 核心原始程式碼 kernel/time/timekeeping.c 裡頭涉及到除法運算的機制，探討其快速除法的實作 (注意: 你可能要對照研讀 kernel/time/ 目錄的標頭檔和程式碼)

過程中，你可能會發現可貢獻到 Linux 核心的空間，請充分討論

2022q1 Homework2 (quiz2)

測驗 1

Version 1 : 可能 Overflow

Version 2 : 避免 Overflow

Version 3 : 將 Version 1 改寫為 bitwise 操作

Version 4 : bitwise 操作

測驗 2

測驗 3

測驗 4

測驗 5

測驗 6

測驗 7

Read more

2022q1 Homework3 (quiz3)

2022q1 Homework3 (fibdrv)

2022q1 Homework1 (lab0)

2022q1 Homework1 (quiz1)

測驗 `1`

Version 3 : 將 Version 1 改寫為 `bitwise` 操作

Version 4 : `bitwise` 操作

測驗 `2`

測驗 `3`

測驗 `4`

測驗 `5`

測驗 `6`

測驗 `7`