2020q3 Homework4 (quiz4)

contributed by < Rsysz >

tags: `sysprog`

測驗題

1.

LeetCode 461. Hamming Distance 提及，兩個整數間的 Hamming distance 為其二進位的每個位元的差。請計算輸入參數兩整數 x 與 y 的 Hamming distance，例如整數 1 的二進位為 0 0 0 1，而整數 4 的二進位為 0 1 0 0，則 1 與 4 的 Hamming distance 為 2。

運用第 3 周測驗提到的 __builtin_popcount (gcc extension 之一)，我們可實作如下:

int hammingDistance(int x, int y) {
    return __builtin_popcount(x OP y);
}

因 Hamming Distance 取兩者不同的位數用來計算距離，因此OP = (c) ^

延伸問題

int hammingDistance(int x, int y) {
	int num = x ^ y;
    num = (num & 0x55555555) + ((num >> 1) & 0x55555555);
    num = (num & 0x33333333) + ((num >> 2) & 0x33333333);
    num = (num & 0x0F0F0F0F) + ((num >> 4) & 0x0F0F0F0F);
    num = (num & 0x00FF00FF) + ((num >> 8) & 0x00FF00FF);
    num = (num & 0x0000FFFF) + ((num >> 16) & 0x0000FFFF);
    return num;

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

如同 quiz3 一樣，每個 byte 獨立計算 1 的數量再予以加總輸出

Leetcode 477. Total Hamming Distance

最簡單的寫法就是每兩兩計算一次 hammingDistance ，然後很不意外的 TLE 了，代表更有效率的解法是存在的，因此我們先把範例中的數值列出以表格做觀察

Number	Bits
4	0100
14	1110
2	0010

hammingDistance 就是求取兩兩不同的 bit 數目的合，因此可以發現若將每位 bit 裡 0 與 1 的數目分別加總並相乘，剛好會等於 totalHammingDistance

2_{0} \times 1_{1} + 1_{0} \times 2_{1} + 1_{0} \times 2_{1} = 6

因此我們可以將程式改寫為這樣

int totalHammingDistance(int* nums, int numsSize){
    int res = 0;
    for(int j = 0, c = 0; j < 30; j++, c = 0){
        for(int i = 0; i < numsSize; nums[i] >>= 1, i++)
            c += (nums[i] & 1);
    	res += c * (numsSize - c);
    }
    return res;
}

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

ECC, Hamming Distance & Hamming Weight

因雜訊，碟片損傷等，資料傳輸過程中需要面對部分資訊遺失或錯誤的風險，因此錯誤更正碼應運而生，錯誤更正碼的精神在如何確保資訊正確的情況下，盡量減少 redundant bits 的使用。

如下圖所示以 (15, 11) Hamming code 為例，透過 1, 2, 4, 8 ， 4 個 redundant bits 也就是 Q1 ~ Q4 的相互比較確認 1bit 錯誤發生的地方，可以對 15 bits 內發生的 1 bit 進行糾正，但若發生兩個 bits 以上的錯誤便無法偵測

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

但若使用 0 號 bit 檢查整體的 1-bits 是否為偶數，雖然無法進行更正，但能夠檢測出兩個 bits 以上的錯誤，是為 extended Hamming code

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

2.

LeetCode 1483. Kth Ancestor of a Tree Node 大意:

給你一棵樹，樹上有 n 個節點，編號自
$0 到 n - 1$
。樹以父節點陣列的形式提供，其中 parent[i] 是節點 i 的父節點。樹的根節點是編號為 0 的節點。請設計 treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k) 函式，回傳節點 node 的第 k 個祖先節點。若不存在這樣的祖先節點，回傳 -1。樹節點的第 k 個祖先節點是從該節點到根節點路徑上的第 k 個節點

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

Input:

["TreeAncestor","getKthAncestor","getKthAncestor","getKthAncestor"]
[[7,[-1,0,0,1,1,2,2]],[3,1],[5,2],[6,3]]

Output:

[null,1,0,-1]

#include <stdlib.h>

typedef struct {
    AAA;
    int max_level;
} TreeAncestor;

TreeAncestor *treeAncestorCreate(int n, int *parent, int parentSize)
{
    TreeAncestor *obj = malloc(sizeof(TreeAncestor));
    obj->parent = malloc(n * sizeof(int *));

    int max_level = 32 - __builtin_clz(n) + 1;
    for (int i = 0; i < n; i++) {
        obj->parent[i] = malloc(max_level * sizeof(int));
        for (int j = 0; j < max_level; j++)
            obj->parent[i][j] = -1;
    }
    for (int i = 0; i < parentSize; i++)
        obj->parent[i][0] = parent[i];

    for (int j = 1;; j++) {
        int quit = 1;
        for (int i = 0; i < parentSize; i++) {
            obj->parent[i][j] = obj->parent[i][j + BBB] == -1
                                    ? -1
                                    : obj->parent[obj->parent[i][j - 1]][j - 1];
            if (obj->parent[i][j] != -1) quit = 0;
        }
        if (quit) break;
    }
    obj->max_level = max_level - 1;
    return obj;
}

int treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k)
{
    int max_level = obj->max_level;
    for (int i = 0; i < max_level && node != -1; ++i)
        if (k & (CCC))
            node = obj->parent[node][i];
    return node;
}

void treeAncestorFree(TreeAncestor *obj)
{
    for (int i = 0; i < obj->n; i++)
        free(obj->parent[i]);
    free(obj->parent);
    free(obj);
}

本題透過建立 Ancestor Table 儲存每個節點的每

2^{n}

以上父代直到 root 也就是 0 ，因此可以判斷 parent 為二維陣列，AAA = (b) int **parent

    for (int i = 0; i < parentSize; i++)
        obj->parent[i][0] = parent[i];

程式運行到這行之後便會建立如下圖所示的 Ancestor Table ， [i] 為 row 而 [j] 為 column

parent[i][j]	[0]	[1]	[2]	[3]
[0]	-1	-1	-1	-1
[1]	0	-1	-1	-1
[2]	0	-1	-1	-1
[3]	1	-1	-1	-1
[4]	1	-1	-1	-1
[5]	2	-1	-1	-1
[6]	2	-1	-1	-1

但這張 Ancestor Table 缺少祖父輩以上節點的資訊，若以此進行查詢將會花費過多的時間，因此透過下方程式碼更新 Ancestor Table ，BBB = (b) (-1)

...
obj->parent[i][j] = obj->parent[i][j + BBB] == -1
                                    ? -1
                                    : obj->parent[obj->parent[i][j - 1]][j - 1];
...

parent[i][j]	[0]	[1]	[2]	[3]
[0]	-1	-1	-1	-1
[1]	0	-1	-1	-1
[2]	0	-1	-1	-1
[3]	1	0	-1	-1
[4]	1	0	-1	-1
[5]	2	0	-1	-1
[6]	2	0	-1	-1

以 GetKthAncestor(5, 2) 為例，目標在於找尋 5 節點的上 2 代，也就是上

2^{1}

代，觀察 node = obj->parent[node][i] 與上方 Ancestor Table ，因 Table 內直接存有目標節點，又

2^{i} = 2^{1}

因此最有效率的找法就是直接令 node = parent[5][1]

代表 if (k & (CCC)) 必須在 i = 1 時成立，其餘則不成立，以 Bits 型式觀察 2 ，可以得到
CCC = (f) 1 << i。

Number	Bits
2	0010

int treeAncestorGetKthAncestor(TreeAncestor *obj, int node, int k)
{
    int max_level = obj->max_level;
    for (int i = 0; i < max_level && node != -1; ++i)
        if (k & (CCC))
            node = obj->parent[node][i];
    return node;
}

延伸問題

typedef struct {
    int **parent;
    int max_level;
} TreeAncestor;

TreeAncestor* treeAncestorCreate(int n, int* parent, int parentSize) {
    TreeAncestor *obj = malloc(sizeof(TreeAncestor));
    
    obj->max_level = 32 - __builtin_clz(n);
    obj->parent = malloc(obj->max_level * sizeof(int *));
    
    for(int i = 1; i < obj->max_level; i++)
        obj->parent[i] = malloc(n * sizeof(int));
    obj->parent[0] = parent;
    
    for(int i = 1; i < obj->max_level; ++i){
        for(int j = 0; j < n; ++j)
            obj->parent[i][j] = obj->parent[i - 1][j] == -1 ? -1 : obj->parent[i-1][obj->parent[i-1][j]];
    }
    return obj;
}

int treeAncestorGetKthAncestor(TreeAncestor* obj, int node, int k) {
  for (int i = 0; i < obj->max_level && node != -1; ++i)
      if (k & (1 << i))
        node = obj->parent[i][node];
    return node;
}

void treeAncestorFree(TreeAncestor* obj) {
    for (int i = 0; i < obj->max_level; i++)
        free(obj->parent[i]);
    free(obj->parent);
    free(obj);
}

將原來 parent[i][j] 型式翻轉為 parent[j][i] ，這個小改動可以使得原來

...
    for (int i = 0; i < n; i++) {
        obj->parent[i] = malloc(max_level * sizeof(int));
        for (int j = 0; j < max_level; j++)
            obj->parent[i][j] = -1;
    }
    for (int i = 0; i < parentSize; i++)
        obj->parent[i][0] = parent[i];
...

改動為

...
    obj->parent[0] = parent;
...

減少存取陣列元素的次數，也可以避免重複對陣列寫入數值，此外因原始程式碼中沒有對

    for (int j = 1;; j++) {
        int quit = 1;
        for (int i = 0; i < parentSize; i++) {
            obj->parent[i][j] = obj->parent[i][j + BBB] == -1
                                    ? -1
                                    : obj->parent[obj->parent[i][j - 1]][j - 1];
            if (obj->parent[i][j] != -1) quit = 0;
        }
        if (quit) break;
    }

中的 j 設邊界條件，此舉雖然減少了平衡樹 assign 的數量，但也因此需要對 max_level 額外加 1 來避免 j 的越界存取。

舉例來說，以 [5, [-1, 0, 1, 2, 3]] 代入，若 max_level 沒有加 1 的話， j 便會存取到 Highligh 部分的記憶體，從而導致 heap-buffer-overflow。

parent[i][j]	[0]	[1]	[2]	[3]
[0]	-1	-1	-1	-1
[1]	0	-1	-1	-1
[2]	1	0	-1	-1
[3]	2	1	-1	-1
[4]	3	2	0	-1

3.

貌似簡單卻蘊含實作深度的 FizzBuzz 題目:

從 1 數到 n，如果是 3的倍數，印出 “Fizz”
如果是 5 的倍數，印出 “Buzz”
如果是 15 的倍數，印出 “FizzBuzz”
如果都不是，就印出數字本身

#include <stdio.h>
int main() {
    for (unsigned int i = 1; i < 100; i++) {
        if (i % 3 == 0) printf("Fizz");
        if (i % 5 == 0) printf("Buzz");
        if (i % 15 == 0) printf("FizzBuzz");
        if ((i % 3) && (i % 5)) printf("%u", i);
        printf("\n");
    }
    return 0;
}

觀察 printf 的(格式)字串，可分類為以下三種:

整數格式字串 "%d" : 長度為 2 B
“Fizz” 或 “Buzz” : 長度為 4 B
“FizzBuzz” : 長度為 8 B

#define MSG_LEN 8

static inline bool is_divisible(uint32_t n, uint64_t M) {
    return n * M <= M - 1;
}

static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;

int main(int argc, char *argv[]) {
    for (size_t i = 1; i <= 100; i++) {
        uint8_t div3 = is_divisible(i, M3);
        uint8_t div5 = is_divisible(i, M5);
        unsigned int length = (2 << div3) << div5;

        char fmt[MSG_LEN + 1];
        strncpy(fmt, &"FizzBuzz%u"[(MSG_LEN >> KK1) >> (KK2 << KK3)], length);
        fmt[length] = '\0';

        printf(fmt, i);
        printf("\n");
    }
    return 0;
}

is_divisible : 透過先前 quiz2 學過的 fast divide 技巧，快速判斷 i 是否為 3 或 5 的倍數。

length有三種情形

當 i 是 15 的倍數， length = 8
當 i 是 3 或 5 的倍數，且非 15 的倍數， length = 4
當 i 非 3 或 5 的倍數， length = 2

且根據前面提到的印出值因此，當 i 有因數 3 時 MSG_LEN 要右移 3 以上，當 i 有因數 5 時 MSG_LEN 要右移 1 ，答案裡所有組合只有 KK1 = (a) div5, KK2 = (d) div3, KK3 = (c) 2 能達到目的。

延伸問題

將兩者 printf 修改為 sprintf 後以 sudo perf stat 量測兩者的差異，因為兩者最大的差異應是，進行大數除法所耗費的時間，因此將輸入 i 設定為從30億到40億
但沒想到 naive 耗費的時間遠小於 bitwise ，然而在 branch-misses 上因 bitwise 是使用 branchless 的寫法，因此有所改善。

 Performance counter stats for './naive':

         3,7177.70 msec task-clock                #    1.000 CPUs utilized
               130      context-switches          #    0.003 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                49      page-faults               #    0.001 K/sec
    1542,8953,8117      cycles                    #    4.150 GHz
    4501,7560,0779      instructions              #    2.92  insn per cycle
     756,0705,1347      branches                  # 2033.667 M/sec
       4,8374,4583      branch-misses             #    0.64% of all branches

      37.178599565 seconds time elapsed

      37.177932000 seconds user
       0.000000000 seconds sys

 Performance counter stats for './bitwise':

         5,3985.27 msec task-clock                #    1.000 CPUs utilized
               167      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
                50      page-faults               #    0.001 K/sec
    2241,7260,7612      cycles                    #    4.152 GHz
    6612,6007,9371      instructions              #    2.95  insn per cycle
    1122,7667,1320      branches                  # 2079.765 M/sec
       4,7134,9005      branch-misses             #    0.42% of all branches

      53.986089902 seconds time elapsed

      53.985592000 seconds user
       0.000000000 seconds sys

4.

考慮到整數 PAGE_QUEUES 可能有以下數值:

1, 2, 3, 4, 5, 6, 7, 8 \times (2^{n})

, n from 1 to 14

給定 size_t offset ，試著將原本運算:

#include <stdint.h>
size_t blockidx = offset / PAGE_QUEUES;

由於 PAGE_QUEUES 的數值分佈已知，在整數除法時，可思考加速運算的機制。

我們可預先進行計算，從而避免整數除法指令的使用: (假設在 x86_64 硬體執行 GNU/Linux 並透過 gcc/clang 編譯)

#include <stdint.h>
#define ffs32(x)  ((__builtin_ffs(x)) - 1)
size_t blockidx;
size_t divident = PAGE_QUEUES;
blockidx = offset >> ffs32(divident);
divident >>= ffs32(divident);
switch (divident) {
    CASES
}

其中 CASES 可能形如:

    case 2: blockidx /= 2; 
        break;

請用最少的 case-break 敘述做出同等 size_t blockidx = offset / PAGE_QUEUES; 效果的程式碼。

因 PAGE_QUEUES 為某數

\times 2^{n}

因此，可將整體

2^{n}

部分先行提出並先進行除法再除以剩下的數加速運算過程。
首先透過 offset >> ffs32(divident) 將， PAGE_QUEUES 數值中

2^{n}

先行除去，再以 switch-case 除去剩下的部分，因

3, 5, 6, 7 非 2^{n}

，可提出，又

6 可 以 拆 為 2 \times 3

因此最終剩下

3, 5, 7

CASES = (b) 3, (d) 5, (f) 7

延伸問題

N-Queens 問題是非常有名的題目，早在高中時期就聽過有這東西，不過懶散的我一直沒去深度鑽研這題目的奧妙，回歸正題首先我們知道這題有三個需要回傳的數值，分別是:

int *returnSize : 代表回傳陣列 ans 的長度。
int **returnColumnSizes : 代表回傳陣列 ans[i] ， 0 < i < returnSize 的長度。
char ***ans : 回傳陣列本身。

此外還要求每個陣列都必須使用 malloc 取得，參考RinHizakura大神的筆記可以知道， n > 18 時解的數量會超過 int 能容納的上限，因此可以將 sizes[] 先行列出以確保 malloc 到足夠的空間，避免重複呼叫 realloc 。

為了加速運算，使用 Bitwise 操作

upperlim : 代表棋盤大小
vd : 代表垂直上的限制
ld : 代表右上到左下斜線上的限制
rd : 代表左上到右下斜線上的限制

如下圖所示藍色是 ld ，紅色是 rd透過由下往上不斷遞迴擺放，根據棋盤限制，窮舉出每種解並放入 ans 。

static int N;
static int sol;
static int upperlim;
static int sizes[] = {1,      1,       0,        0,        2,        10,    4,
               40,     92,      352,      724,      2680,     14200, 73712,
               365596, 2279184, 14772512, 95815104, 666090624};

static char ***ans;
static char **ret;

//vd: Vertical
//ld: Left slash
//rd: Right slash
void solve(int level, int vd, int ld, int rd){
    if (level == N){
        for (int i = 0; i < N; i++)
            memcpy(ans[sol][i], ret[i], N + 1);
        sol++;
        return;
    }
    
    int limit = (vd | ld | rd) ^ upperlim;
    int pos;
    
    while(limit){
        pos = limit & (-limit);        //LSB
        limit = limit & (limit - 1);       //cutLSB
        
        ret[level][N - __builtin_ffs(pos)] = 'Q';
        solve(level + 1, vd | pos, ((ld | pos) >> 1) & upperlim, ((rd | pos) << 1) & upperlim);
        ret[level][N - __builtin_ffs(pos)] = '.';
    }
}

char *** solveNQueens(int n, int *returnSize, int **returnColumnSizes){
    *returnSize = sizes[n];
    *returnColumnSizes = malloc(sizes[n] * sizeof(int));
    for (int i = 0; i < sizes[n]; i++)
        (*returnColumnSizes)[i] = n;
    N = n;
    sol = 0;
    upperlim = 1;
    
    ans = malloc(sizes[n] * sizeof(char **));
    for (int i = 0; i < sizes[n]; i++)
        ans[i] = malloc(n * sizeof(char *));
    for (int i = 0; i < sizes[n]; i++){
        for(int j = 0; j < n; j++)
            ans[i][j] = malloc((n + 1) * sizeof(char));     // 1 for '\0'
    }
    
    ret = malloc(n * sizeof(char *));
    for(int i = 0; i < n; i++){
        ret[i] = malloc((n + 1) * sizeof(char));
        for(int j = 0; j < n; j++)
            ret[i][j] = '.';
        ret[i][n] = '\0';
    }
    
    upperlim = (upperlim << n) - 1;
    
    solve(0, 0, 0, 0);
    
    for (int i = 0; i < n; i++)
        free(ret[i]);
    free(ret);
    
    return ans;
}

Leetcode Runtime Error
另外比較值得注意的是如果有使用全域或以靜態宣告的變數時，必須在每次進入函數時對其初始化，否則就會遇上

等問題，因為 Leetcode 使用同一個實體對所有 test case 做測試，而這些變數會影響到下個 test cas 的運行結果。

parent[i][j]	[0]	[1]	[2]	[3]
[0]	-1	-1	-1	-1
[1]	0	-1	-1	-1
[2]	0	-1	-1	-1
[3]	1	-1	-1	-1
[4]	1	-1	-1	-1
[5]	2	-1	-1	-1
[6]	2	-1	-1	-1

parent[i][j]	[0]	[1]	[2]	[3]
[0]	-1	-1	-1	-1
[1]	0	-1	-1	-1
[2]	0	-1	-1	-1
[3]	1	0	-1	-1
[4]	1	0	-1	-1
[5]	2	0	-1	-1
[6]	2	0	-1	-1

parent[i][j]	[0]	[1]	[2]	[3]
[0]	-1	-1	-1	-1
[1]	0	-1	-1	-1
[2]	0	-1	-1	-1
[3]	1	-1	-1	-1
[4]	1	-1	-1	-1
[5]	2	-1	-1	-1
[6]	2	-1	-1	-1

parent[i][j]	[0]	[1]	[2]	[3]
[0]	-1	-1	-1	-1
[1]	0	-1	-1	-1
[2]	0	-1	-1	-1
[3]	1	0	-1	-1
[4]	1	0	-1	-1
[5]	2	0	-1	-1
[6]	2	0	-1	-1

2020q3 Homework4 (quiz4)

tags: sysprog

1.

延伸問題

ECC, Hamming Distance & Hamming Weight

2.

延伸問題

3.

延伸問題

4.

延伸問題

Read more

MTK / Bluetooth

ONVM-IPS / Hyperscan

VPP & DPDK

2020q3 Gameboy-emu

tags: `sysprog`

parent[i][j]	[0]	[1]	[2]	[3]
[0]	-1	-1	-1	-1
[1]	0	-1	-1	-1
[2]	0	-1	-1	-1
[3]	1	-1	-1	-1
[4]	1	-1	-1	-1
[5]	2	-1	-1	-1
[6]	2	-1	-1	-1

parent[i][j]	[0]	[1]	[2]	[3]
[0]	-1	-1	-1	-1
[1]	0	-1	-1	-1
[2]	0	-1	-1	-1
[3]	1	0	-1	-1
[4]	1	0	-1	-1
[5]	2	0	-1	-1
[6]	2	0	-1	-1