2020q1 Homework3 (quiz3)

contributed by < KYG-yaya573142 >

第 3 週測驗題

測驗 1

XOR linked list 原理

XOR 運算特性為

\(A \oplus A = 0\)
\(A \oplus 0 = A\)
\(A \oplus 1 = \neg A\)
\((A \oplus B) \oplus B = A\)

typedef struct __list list;
struct __list {
    int data;
    struct __list *addr;
};

當初填做答表單會全錯就是因為我沒理解 XOR linked list 實作的核心邏輯，在 XOR linked list 中 node 儲存的 addr 內容是 address of previous node (之後簡稱 prev) 及 address of next node (之後簡稱 next) 以 XOR 運算的結果 addr = prev ^ next，根據上述 XOR 的運算特性，可以用以下方式在 node 間移動

addr ^ prev = next
addr ^ next = prev

而在 list head 和 list tail 的 node，根據 \(A \oplus 0 = A\) 可知其儲存的 addr 剛好都是鄰近的唯一 node 的指標

0 ^ next = addr = next
prev ^ 0 = addr = prev

#define XOR(a, b) ((list *) ((uintptr_t) (a) ^ (uintptr_t) (b)))

void insert_node(list **l, int d) {
    list *tmp = malloc(sizeof(list));
    tmp->data = d;

    if (!(*l)) {
        tmp->addr = NULL;
    } else {
        (*l)->addr = XOR(tmp, (*l)->addr);
        tmp->addr = *l;
    }
    *l = tmp;
}

從 list head 新增 node，所以原本 list head 的 *l 需要更新儲存的 addr 為 next node 與 prev node 的 XOR
- *l 原本位於 list head，(*l)->addr 直接是 next node 的位置
- 新增的 tmp 就是 prev node
若尚無任何 node，新增的 node 會是唯一的 node 所以儲存的指標為 NULL

void delete_list(list *l) {
    while (l) {
        list *next = l->addr;
        if (next)
            next->addr = XOR(next->addr, l);
        free(l);
        l = next;
    }
}

從 list head 開始依序 free 掉 node
l 原本位於 list head，l->addr 直接是 next node 的位置
free 掉 l 以前須先更新 next node 中儲存的位置，因為原本的 next node 會變成新的 list head
- XOR(next->addr, l) 的結果會是 next 的下個 node 的位置
while 直到 free 完所有 node

解題思路












list *sort(list *start)
{
    if (!start || !start->addr)
        return start;

    list *left = start, *right = start->addr;
    left->addr = NULL;
    right->addr = XOR(right->addr, left);

    left = sort(left);
    right = sort(right);
...

6~11 行的部分利用遞迴的方式分割 list，但一次遞迴只分割出一個 node
- left 指向分割出的單一 node
- right 指向剩下的 node
- start 會指向已經排序完成的 list head


















    ...
    for (list *merge = NULL; left || right;) {
        if (!right || (left && left->data < right->data)) {
            list *next = left->addr;
            if (next)
                next->addr = XOR(left, next->addr);

            if (!merge) {
                start = merge = left;
                merge->addr = NULL;
            } else {
                merge->addr = LL1;
                left->addr = LL2;
                merge = left;
            }
            left = next;
        } else {
...

第 13 行開始執行 merge sort
if (!right || (left && left->data < right->data)) 是實行排序判斷的位置，採用遞增順序所以數值小的排在前面
- 先探討將 left 的 node 接上 merge 的情況 (也就是 left->data 比較小，或是奇數 node 的情況)
- 15~17 行先將 node 從 list 分離出來
  - 此時 left 指向這個分離出來的單一 node
  - next 指向剩餘的 list
- 19~21 行代表此時 merge 無任何 node
  - left 會是唯一的 node，所以儲存的指標為 NULL
  - start 會指向已經排序完成的 list head，所以也是指向 left
- 22~25 行要將 left 接在 merge 的前面
  - merge 是原本的 list head，merge->addr 直接是 next node 的位置
  - merge 的 prev node 就是前面接上的 left，
  - merge->addr 須更新為 prev node ^ next node，因此 LL1 應填入 XOR(merge->addr, left)
  - left 會是新的 list head，所以 left->addr 會直接指向 next node，因此 LL2 應填入 merge




















...
        } else {
            list *next = right->addr;
            if (next)
                next->addr = XOR(right, next->addr);

            if (!merge) {
                start = merge = right;
                merge->addr = NULL;
            } else {
                merge->addr = RR1;
                right->addr = RR2;
                merge = right;
            }
            right = next;
        }
    }

    return start;
}

整體的邏輯與將 left 的 node 接上 merge 時的邏輯一樣，只是改為將 right 接上，就不再贅述原理
RR1 應填入 XOR(merge->addr, right)
RR2 應填入 merge

改善程式碼

`delete_list`

後來實際使用時，發現 delete_list(queue) 後不會將 queue 指標設為 NULL，這會導致接下來 insert_node(&queue) 會發生 segmentation fault

int main(int argc, char const *argv[])
{
    list *queue = NULL;
    insert_node(&queue, 666);
    delete_node(queue);
    insert_node(&queue, 777); //seg. fault
}

因此我將 delete_list 改為使用 pointer to pointer 來實作

void delete_list(list **l) {
    while (*l) {
        list *next = (*l)->addr;
        if (next)
            next->addr = XOR(next->addr, *l);
        free(*l);
        *l = next;
    }
}

`show`

原本的實作缺乏簡單展示 list 內容的函式，而且 XOR linked-list 處理起來較複雜，因此實作一個輔助函式來顯示 list 的所有內容

void show(list *target)
{
    if (!target) {
        printf("empty list\n");
        return;
    }
    
    printf("head-> ");
    list *prev = NULL, *next = target;
    while(target) {
        printf ("%d ", target->data);
        next = XOR(target->addr, prev);
        prev = target;
        target = next;
    }
    printf("<-tail\n");
    return;
}

最佳化策略 - 修改 `sort` 分割 node 的方式

原本 sort 中每次只會分割出一個 node，參照 lab-0 中 merge sort 的做法，修改為每次會將 list 分為一半。實作的程式碼如下，明顯有使用過多變數的缺點，但現階段還想不到更漂亮的寫法

list *merge_sort(list *start)
{
    if (!start || !start->addr)
        return start;

    list *left = start, *right = start->addr;
    list *tmp1 = start, *tmp2 = start->addr;
    list *prev = NULL, *next = start;
    /* split list into half */
    while (right && XOR(right->addr, tmp1)) {
        /* advance right by 2 nodes */
        right = XOR(right->addr, tmp1);
        tmp1 = right;
        right = XOR(right->addr, tmp2);
        tmp2 = right;
        /* advance left by 1 node */
        next = XOR(left->addr, prev);
        prev = left;
        left = next;
    }
    right = XOR(left->addr, prev);
    right->addr = XOR(right->addr, left);
    left->addr = XOR(left->addr, right);

    left = merge_sort(start);
    right = merge_sort(right);
...

以下列程式碼檢查，結果正確

#define total_nodes 100
int main(int argc, char const *argv[])
{
    list *queue = NULL;
    srand(time(NULL));
    
    for (int i = 0; i < total_nodes; i++) {
        insert_node(&queue, (rand() % 100));
    }
    
    queue = merge_sort(queue);
    show(queue);
}

效能差異

使用以下程式碼進行測試

#define total_nodes 1000
#define sample_size 100

int main(int argc, char const *argv[])
{
    struct timespec t_start, t_end;
    list *queue1 = NULL;
    srand(time(NULL));
    FILE *fp = fopen("./out_statistic", "w");

    for (int i = 1; i < total_nodes + 1; i++) {
        double t1[sample_size] = {0};
        double mean1 = 0.0, sd1 = 0.0, result1 = 0.0;
        int count1 = 0;
        /* measure sample_size times of data and
         * remove outliers based on the 95% confidence level
         */
        for (int n = 0; n < sample_size; n++) {
            /* generate linked-list for testing */
            for (int node = 0; node < i; node++)
                insert_node(&queue1, (rand() % 100));
            clock_gettime(CLOCK_MONOTONIC, &t_start);
            queue1 = merge_sort(queue1, i); /* measurement */
            clock_gettime(CLOCK_MONOTONIC, &t_end);
            t1[n] = (long long)(t_end.tv_sec * 1e9 + t_end.tv_nsec)
                    - (long long)(t_start.tv_sec * 1e9 + t_start.tv_nsec);
            mean1 += t1[n]; //sum
            delete_list(&queue1);
        }
        mean1 /= sample_size; //mean

        for (int n = 0; n < sample_size; n++) {
            sd1 += (t1[n] - mean1) * (t1[n] - mean1);
        }
        sd1 = sqrt(sd1 / (sample_size - 1)); //standard deviation

        for (int n = 0; n < sample_size; n++) { //remove outliers
            if (t1[n] <= (mean1 + 2 * sd1) && t1[n] >= (mean1 - 2 * sd1)) {
                result1 += t1[n];
                count1++;
            }
        }
        result1 /= count1;
        fprintf(fp, "%d %.5lf samples: %d\n", i, (result1/1000), count1);
        assert(count1 > 80); /* remove too many samples, considered unrepresentative */
    }
    fclose(fp);
    return 0;
}

排序所需時間明顯降低

最佳化策略 - 使用 tiled merge sort

Wiki 的 Merge sort 中談到 Optimizing merge sort 的策略

For example, the tiled merge sort algorithm stops partitioning subarrays when subarrays of size S are reached, where S is the number of data items fitting into a CPU's cache. Each of these subarrays is sorted with an in-place sorting algorithm such as insertion sort, to discourage memory swaps, and normal merge sort is then completed in the standard recursive fashion.

先將子序列分割，直到子序列小於某個閾值 S 就停止分割
S 值為可完整放進處理器 cache 中的元素數量
每個大小為 S 的子序列，都透過適用於小規模 in-place (即不需要額外配置記憶體) 排序演算法進行排序，減少在記憶體內交換特定區域的次數

CPU 硬體組態

Intel® Core™ i7-3770 CPU @ 3.40GHz
- 4 CPU cores
- 8 threads
L1i cache: 32 KB (x4 cores)
L1d cache: 32 KB (x4 cores)
- 8-way set associative
L2 cache: 256 KB (x4 cores)
- 8-way set associative
L3 cache: 8 MB
- 16-way set associative shared cache

剛好可以用 CSAPP Cache memories 講義中的範例

64 bytes per block (cache line)
8 blocks per set
total 64 sets
node 的大小為 16 bytes
- 一個 block 可以存放 4 nodes
- \(4 (nodes)\times 64(sets)=256\) 個連續的 node 會用掉所有 set 中的一個 block (因為是 set associative)
- \(256\times 8=2048\) 個 node 會剛好用完 L1 cache
- 但這只是理想的算法，快取還需要存放除了 node 以外的資料，且每個 node 的記憶體位置不一定是連續的

in-place algorithm - insertion sort

我選用 insertion sort 做為 in-place 排序演算法

list *insert_sort(list *head)
{
    if (!head || !head->addr)
        return head;

    /* pick first node from head */
    list *result = head;
    head = head->addr;
    head->addr = XOR(result, head->addr);
    result->addr = NULL;

    while (head) {
        /* detach node from head */
        list *next = head->addr;
        if (next)
            next->addr = XOR(head, next->addr);
        
        /* insertion sort */
        list *node = result, *prev = NULL, *tmp = NULL;
        while (node && node->data < head->data) {
            tmp = node;
            node = XOR(prev, node->addr);
            prev = tmp;
        }
        if (!prev) { /* insert at head */
            head->addr = result;
            result->addr = XOR(result->addr, head);
            result = head;
        }
        if (!node) { /* insert at tail */
            head->addr = prev;
            prev->addr = XOR(prev->addr, head);
        }
        if (prev && node) { /* insert in between prev and node */
            prev->addr = XOR(XOR(prev->addr, node), head);
            node->addr = XOR(head, XOR(node->addr, prev));
            head->addr = XOR(prev, node);
        }
        head = next;
    }
    return result;
}

找出適當的 S

修改原本 merge sort 部分的程式碼

list *merge_sort(list *start, int count)
{
...
    /* split list */
...
    /* use in-place alg. when list size <= threshold */
    if ((count >> 1) > threshold) {
        left = merge_sort(start, (count >> 1) + (count & 0x1));
        right = merge_sort(right, count >> 1);
    } else {
        left = insert_sort(start);
        right = insert_sort(right);
    }
...

當子序列數量低於 S 就切換成使用 insertion sort
threshold 就是閾值 S，直接設為 global variable 以方便實驗時進行更動
為了方便計算子序列資料數量，直接把總資料數量 count 當作參數帶進 merge_sort
最後一樣用 merge sort 來合併所有子序列

實驗的程式碼如下

#define total_nodes 1000
int main(int argc, char const *argv[])
{
    list *queue = NULL;
    struct timespec t_start, t_end;
    srand(time(NULL));
    
    for (threshold = 0; threshold < 500; threshold++) { /* use different threshold */
        for (int i = 0; i < total_nodes; i++) { /* generate random list */
            insert_node(&queue, (rand() % 100));
        }
        clock_gettime(CLOCK_MONOTONIC, &t_start);
        queue = merge_sort(queue, total_nodes);
        clock_gettime(CLOCK_MONOTONIC, &t_end);
        long long t1 = (long long)(t_end.tv_sec * 1e9 + t_end.tv_nsec)
                        - (long long)(t_start.tv_sec * 1e9 + t_start.tv_nsec);
        printf("%d %lld\n", threshold, (t1/1000));
        delete_list(&queue);
    }
    return 0;
}

量測不同 S 時，排序所需的時間
實驗的 list 總共有 1000 nodes

水平黑線是原本 merge_sort 排序 1000 個 nodes 的所需時間
分別在 62、125、250、500 有明顯的速度落差
- 當 S 在 125 以內時，排序所需的時間會快於單純使用 merge sort
- S 在 61 以內時具有最快的速度

原本以為這漂亮的圖有特別的意義，後來才想到一件事情

每次 merge sort 都會把 list 分為一半
所以我使用的 1000 nodes 經過分割後，產生的子序列只會有幾種固定的大小，剛好是 500、250、125、62 …
例如當 S 在 126~250 之間時，實際能被 insertion sort 處理的子序列大小永遠是 125，因為上一個子序列大小是 250

比對原本的 merge sort 可以看到所需時間降低
會有階梯狀的原因如上述說明，是因為子序列大小每次都是對半分造成的

使用 CSAPP 的 cache lab

本來在思考是否有方法可以了解 merge sort 在不同 S 下使用 cache 的情況，發現可以使用 Valgrind - Cachegrind，但同時也想到之前在 CSAPP 的 cache lab 中有提供一個模擬 cache 行為的程式，因此決定先嘗試這個比較簡單的方法

首先在程式碼本身盡量移除跟 merge sort 無關的部分，避免干擾

#define total_nodes 1000
int main(int argc, char const *argv[])
{
    list *queue = NULL;
    srand(time(NULL));
    for (int i = 0; i < total_nodes; i++) { /* generate random list */
        insert_node(&queue, (rand() % 100));
    }
    queue = merge_sort(&queue, total_nodes);
    return 0;
}

使用 valgrind 來紀錄不同 S 值下執行程式的 address trace

$ valgrind --log-fd=1 --tool=lackey -v --trace-mem=yes ./test > out

接著使用 CSAPP 的作業參考範例 csim-ref 來模擬，使用的參數是對照 CPU-硬體組態設置的

$ ./csim-ref -s 6 -E 8 -b 6 -t out

S = 0 (單純 merge sort)

hits:609821 misses:3846 evictions:3334

S = 50

hits:519978 misses:3764 evictions:3252

S = 100

hits:591181 misses:3759 evictions:3247

S = 200

hits:737422 misses:3733 evictions:3221

S = 400

hits:1140759 misses:3730 evictions:3218

cache eviction 和 cache miss 都沒有明顯的差異，但是 cache hit 隨著 S 提高而增加
在假設模擬器正確的前提下，可以推斷出
- 在測試的範圍內，都不會出現 L1 cache 不夠用的情況
- 提高 S 確實能讓處理器更善用 L1 cache，但由於 insertion sort 比 merge sort 慢，過高的 S 會導致 insertion sort 的效能劣勢將 cache 的優勢抵銷，甚至是降低整體的效能

修改 lab0-c 裡頭的 harness.c (待補)

測驗 2

解題思路

#include <stddef.h>
struct node {
    int data;
    struct node *next;
} *head;

void insert(struct node *newt) {
    struct node *node = head, *prev = NULL;
    while (node != NULL && node->data < newt->data) {
        prev = node;
        node = node->next;
    }
    newt->next = node;
    if (prev == NULL)
        head = newt;
    else
        prev->next = newt;
}

while 迴圈實際上是在排序，將 newt 根據資料大小插入 list 中適當的位置

void insert(struct node *newt)
{
    struct node **link = &head;
    while (*link && AA)
        BB;
    newt->next = *link;
    *link = newt;
}

link 是 pointer to pointer，可以理解為一般 pointer 儲存的是某個變數的位置，而 pointer to pointer 儲存的就是某個指標的位置，因此 *link 是指標
while 迴圈做的事情一樣是排序，必須先對 link 進行 dereference 一次才會得到指向 list element 的指標，因此 AA 應填入 (*link)->data < newt->data
pointer to pointer 儲存的是指標的位置，因此必須使用取址運算子 & 來取用指標的位置，因此 BB 應填入 link = &(*link)->next

對執行時間與 Branch predictor 的影響

使用以下程式碼進行實驗，分別測試兩種不同的 insert 的所需時間

int main(int argc, char const *argv[])
{
    struct timespec t_start, t_end;
    head = NULL;

    for (int i = 0; i < 1000; i++) {
        struct node *newt = malloc(sizeof(struct node));
        newt->data = i;
        newt->next = NULL;
        clock_gettime(CLOCK_MONOTONIC, &t_start);
        //insert(newt);
        insert_ptp(newt);
        clock_gettime(CLOCK_MONOTONIC, &t_end);
        long long t1 = (long long)(t_end.tv_sec * 1e9 + t_end.tv_nsec)
                        - (long long)(t_start.tv_sec * 1e9 + t_start.tv_nsec);
        printf("%d %lld\n", i, t1);
    }

    return 0;
}

無顯著的差異，可能要進一步提升測試的資料總數

使用 perf 來分析 branch 的差異

$ sudo perf stat --repeat 5 -e branch-instructions:u,branch-misses:u ./test

insert

 Performance counter stats for './test' (5 runs):

         1,383,346      branch-instructions:u                                         ( +-  0.01% )
            10,871      branch-misses:u           #    0.79% of all branches          ( +-  1.04% )

          0.004962 +- 0.000679 seconds time elapsed  ( +- 13.68% )

insert_ptp

 Performance counter stats for './test' (5 runs):

         1,382,351      branch-instructions:u                                         ( +-  0.00% )
            10,659      branch-misses:u           #    0.77% of all branches          ( +-  0.75% )

          0.004709 +- 0.000558 seconds time elapsed  ( +- 11.85% )

insert_ptp 在 branch-instructions:u 還是 branch-misses:u 有些微的改善，但是不明顯

嘗試改用 perf record & perf report 來分析

$ sudo perf record -F 100000 -e branch-misses:u,branch-instructions:u ./test
$ sudo perf report

insert

333 branch-misses:u          ◆
919 branch-instructions:u

Samples: 919  of event 'branch-instructions:u', Event count (approx.): 1386496
Overhead  Command  Shared Object     Symbol
  69.75%  test     test              [.] insert
   6.16%  test     libc-2.27.so      [.] vfprintf
   5.70%  test     libc-2.27.so      [.] _IO_file_xsputn@@GLIBC_2.2.5
   2.57%  test     libc-2.27.so      [.] _int_malloc

Samples: 333  of event 'branch-misses:u', Event count (approx.): 11279
Overhead  Command  Shared Object  Symbol
  16.23%  test     libc-2.27.so   [.] _dl_addr
  12.54%  test     test           [.] insert
  11.38%  test     libc-2.27.so   [.] printf
   9.02%  test     libc-2.27.so   [.] vfprintf
   8.80%  test     libc-2.27.so   [.] _IO_do_write@@GLIBC_2.2.5
   7.87%  test     test           [.] main

insert_ptp

365 branch-misses:u       ◆
947 branch-instructions:u

Samples: 947  of event 'branch-instructions:u', Event count (approx.): 1385333
Overhead  Command  Shared Object     Symbol
  72.25%  test     test              [.] insert_ptp
   7.83%  test     libc-2.27.so      [.] vfprintf
   3.23%  test     libc-2.27.so      [.] _int_malloc

Samples: 365  of event 'branch-misses:u', Event count (approx.): 11535
Overhead  Command  Shared Object  Symbol
  17.59%  test     libc-2.27.so   [.] _dl_addr
  15.87%  test     test           [.] main
  14.58%  test     libc-2.27.so   [.] _IO_file_xsputn@@GLIBC_2.2.5
  11.35%  test     libc-2.27.so   [.] _IO_file_write@@GLIBC_2.2.5
   9.71%  test     libc-2.27.so   [.] vfprintf
   7.63%  test     libc-2.27.so   [.] _IO_do_write@@GLIBC_2.2.5
   5.21%  test     test           [.] insert_ptp

從 branch-misses:u 的詳細報告就可以看出差異
- insert 造成的 branch-misses 占了 333 筆資料的 12.54%
- insert_ptp 造成的 branch-misses 占了 365 筆資料的 5.21%

2020q1 Homework3 (quiz3)

測驗 1

XOR linked list 原理

解題思路

改善程式碼

delete_list

show

最佳化策略 - 修改 sort 分割 node 的方式

效能差異

最佳化策略 - 使用 tiled merge sort

CPU 硬體組態

in-place algorithm - insertion sort

找出適當的 S

使用 CSAPP 的 cache lab

S = 0 (單純 merge sort)

S = 50

S = 100

S = 200

S = 400

修改 lab0-c 裡頭的 harness.c (待補)

測驗 2

解題思路

對執行時間與 Branch predictor 的影響

tags: linux 2020

Read more

2020q1 Homework2 (fibdrv)

shell lab

課堂作業與專題

malloc lab

`delete_list`

`show`

最佳化策略 - 修改 `sort` 分割 node 的方式

tags: `linux 2020`