# 2025q1 Homework1 (lab0)
contributed by < `SimonLiu423` >
{%hackmd NrmQUGbRQWemgwPfhzXj6g %}
## Development Environment
```shell
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Common KVM processor
CPU family: 15
Model: 6
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
BogoMIPS: 5999.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 16 MiB (4 instances)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Mitigation; PTE Inversion
Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Vulnerable
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Srbds: Not affected
Tsx async abort: Not affected
```
## Implementation of queue operations
### q_insert_head()
Initial code:
```c
bool q_insert_head(struct list_head *head, char *s)
{
    if (!head)
        return false;
    /* Create element node with value copied from s */
    element_t *node = malloc(sizeof(element_t));
    if (!node)
        return false;
    char *val = malloc((strlen(s) + 1) * sizeof(char));
    if (!val) {
        free(node);
        return false;
    }
    strcpy(val, s);
    node->value = val;
    /* Add element to the head of queue */
    list_add(&node->list, head);
    return true;
}
```
In the code above, the "Create element" segment will also be used in `q_insert_tail()`. To follow the [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) principle, I extracted that part into a separate function.
```c
static inline element_t *create_element(char *s)
{
    element_t *node = malloc(sizeof(element_t));
    if (!node)
        return NULL;
    char *val = malloc((strlen(s) + 1) * sizeof(char));
    if (!val) {
        free(node);
        return NULL;
    }
    strcpy(val, s);
    node->value = val;
    return node;
}
```
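With the helper factored out, `q_insert_head()` reduces to a few lines. The sketch below inlines minimal versions of `list_head`, `list_add`, and `element_t` so it compiles on its own; in the lab these come from `list.h` and `queue.h`, so treat this as a model rather than the committed code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Minimal stand-ins for list.h / queue.h so the sketch is self-contained */
struct list_head {
    struct list_head *prev, *next;
};
typedef struct {
    char *value;
    struct list_head list;
} element_t;

static void INIT_LIST_HEAD(struct list_head *head)
{
    head->prev = head->next = head;
}

static void list_add(struct list_head *node, struct list_head *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

static element_t *create_element(char *s)
{
    element_t *node = malloc(sizeof(element_t));
    if (!node)
        return NULL;
    char *val = malloc(strlen(s) + 1);
    if (!val) {
        free(node);
        return NULL;
    }
    strncpy(val, s, strlen(s) + 1);
    node->value = val;
    return node;
}

bool q_insert_head(struct list_head *head, char *s)
{
    if (!head)
        return false;
    element_t *node = create_element(s);
    if (!node)
        return false;
    list_add(&node->list, head);
    return true;
}
```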
This function is declared `static inline`. `static` gives the function internal linkage, so it can only be used within this file and is inaccessible from outside, providing **encapsulation**; `inline` suggests that the compiler replace each call site with the function body, avoiding function-call overhead and improving efficiency.
/* TODO: notes to finish once I fully understand the spec */
For `static`, see C99 **6.2.2 Linkages of identifiers**, ...
For `inline`, see **6.7.4 Function specifiers** ...
/******************/
After finishing, running `git commit -a` was blocked by the `pre-commit` hook. The reason: `strcpy()` is unsafe because it does not bound the copy by the destination buffer's size, so it may write to unexpected memory addresses.
The fix is to use `strncpy` instead; the modified code is shown below.
```diff
  char *val = malloc((strlen(s) + 1) * sizeof(char));
  if (!val) {
      free(node);
      return NULL;
  }
- strcpy(val, s);
+ strncpy(val, s, strlen(s) + 1);
  node->value = val;
  return node;
```
### q_delete_mid()
[jserv's tutorial](https://hackmd.io/@sysprog/c-linked-list#Merge-Sort-%E7%9A%84%E5%AF%A6%E4%BD%9C) uses the fast/slow-pointer technique to find the midpoint of a list; part of the code is shown below:
```c
node_t *slow = head;
for (node_t *fast = head->next; fast && fast->next; fast = fast->next->next)
    slow = slow->next;
node_t *mid = slow->next;
slow->next = NULL;
```
When the queue size is even, this code yields a `mid` that is one node too far to the right; for example, with `size = 4`, `mid` points to the third element of the queue. I also found this version hard to read, so I reimplemented it with a do-while loop.
Code after addressing both points: [Commit ddf0ada](https://github.com/SimonLiu423/lab0-c/commit/ddf0ada5881af469be1cfa5711e3a505b0ec180f)
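One possible do-while formulation of the two-pointer scan on the circular, sentinel-headed list (minimal definitions inlined so the snippet stands alone; not necessarily identical to the committed code). For an even length it stops at the left of the two middle nodes:

```c
#include <assert.h>

struct list_head {
    struct list_head *prev, *next;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Return the middle node of a non-empty circular list; for an even
 * length, the left of the two middle nodes (e.g. the 2nd of 4). */
static struct list_head *find_mid(struct list_head *head)
{
    struct list_head *slow = head, *fast = head;
    do {
        slow = slow->next;
        fast = fast->next->next;
    } while (fast != head && fast->next != head);
    return slow;
}
```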
### q_swap()
[Commit 18c4e7b](https://github.com/SimonLiu423/lab0-c/commit/18c4e7bec777223693e64a8e168a2a6e583a9e1a)
The approach: traverse the first node of every pair in the list, remove it with `list_del` (the node's own `prev` and `next` pointers remain unchanged at this point), then use `list_add` to insert the node after the next node.
```c
struct list_head *node = head->next;
while (node != head && node->next != head) {
    list_del(node);
    list_add(node, node->next);
    node = node->next;
}
```
In `list_add(node, node->next)`, the second argument `node->next` is not the list head, which is inconsistent with the parameter described in the function's documentation:
```c
/**
 * list_add - Insert a node at the beginning of a circular list
 * @node: Pointer to the list_head structure to add.
 * @head: Pointer to the list_head structure representing the list head.
 *
 * Adds the specified @node immediately after @head in a circular doubly-linked
 * list, effectively placing it at the beginning. The existing first node, if
 * any, shifts to follow @node, and the list's circular structure is maintained.
 */
```
However, looking closely at the implementation, the function can equally well be read as inserting `node` after an arbitrary `head` node, not necessarily at the front of the list.
The description of `list_add` in [torvalds/linux/include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h#L183) likewise says that `new` is inserted after `head`.
```c
/**
 * list_add - add a new entry
 * @new: new entry to be added
 * @head: list head to add it after
 *
 * Insert a new entry after the specified head.
 * This is good for implementing stacks.
 */
```
I opened pull request [#225](https://github.com/sysprog21/lab0-c/pull/225) to update the description of `list_add` accordingly.
---
### q_reverseK()
The strategy here is to traverse the list and, for every `k` nodes, form a local circular list by attaching the first and last node to a temporary head.
Then, use `q_reverse()` with the temporary head to reverse the `k` elements.
Finally, reattach the reversed nodes to the original list and continue the process for the remaining nodes.
Note that when `k = 2`, this operation is equivalent to `q_swap()`.
```c
if (k == 2) {
q_swap(head);
return;
}
```
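The cut-reverse-splice loop described above can be sketched as follows. Minimal list definitions are inlined so the snippet compiles on its own; the committed code uses `list.h`'s helpers, so this is a model of the approach rather than the exact implementation:

```c
#include <assert.h>

struct list_head {
    struct list_head *prev, *next;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Reverse a circular list in place by swapping next/prev in every node,
 * including the sentinel head. */
static void q_reverse(struct list_head *head)
{
    struct list_head *cur = head;
    do {
        struct list_head *next = cur->next;
        cur->next = cur->prev;
        cur->prev = next;
        cur = next;
    } while (cur != head);
}

/* Reverse every group of k nodes; a trailing group shorter than k
 * is left untouched. */
static void q_reverseK(struct list_head *head, int k)
{
    struct list_head *group_prev = head;
    for (;;) {
        /* Find the last node of the next k-group */
        struct list_head *last = group_prev;
        for (int i = 0; i < k; i++) {
            last = last->next;
            if (last == head)
                return; /* fewer than k nodes remain */
        }
        struct list_head *next_group = last->next;

        /* Detach the k nodes onto a temporary circular head */
        struct list_head tmp;
        tmp.next = group_prev->next;
        tmp.prev = last;
        group_prev->next->prev = &tmp;
        last->next = &tmp;

        q_reverse(&tmp); /* reverse the detached group */

        /* Splice the reversed group back into the original list */
        group_prev->next = tmp.next;
        tmp.next->prev = group_prev;
        tmp.prev->next = next_group;
        next_group->prev = tmp.prev;

        group_prev = tmp.prev;
    }
}
```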
### q_ascend() & q_descend()
The goal of these two functions is to remove every node that has a node with a strictly smaller (for `q_ascend()`) or larger (for `q_descend()`) value anywhere to its right.
In other words, `q_ascend()` creates a monotonically non-increasing sequence, while `q_descend()` creates a monotonically non-decreasing sequence, both starting from the last element working backward.
The only difference between the two is whether the resulting sequence is non-increasing or non-decreasing, so both call a shared function, `monotonic_from_right()`, with an enumeration value specifying the desired order.
```c
typedef enum _order { NON_DECREASING = 1, NON_INCREASING = -1 } Order;
```
Implementation for `q_ascend()`:
```c
int q_ascend(struct list_head *head)
{
    return monotonic_from_right(head, NON_INCREASING);
}
```
`q_descend()`:
```c
int q_descend(struct list_head *head)
{
    return monotonic_from_right(head, NON_DECREASING);
}
```
For implementation of `monotonic_from_right()`, see [Commit ba136d7](https://github.com/SimonLiu423/lab0-c/commit/ba136d7b7b76fa906fbb2fbd97d730b0937da7c7)
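To illustrate the right-to-left scan independent of the list plumbing, here is the same comparison logic over a plain array of ints. This is a simplified model, not the committed `monotonic_from_right()` (which traverses the list backward and frees the removed nodes); `monotonic_keep` is a hypothetical name:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { NON_DECREASING = 1, NON_INCREASING = -1 } Order;

/* Mark which elements survive a right-to-left monotonic filter.
 * For NON_INCREASING (used by q_ascend), an element survives iff no
 * strictly smaller value exists to its right; for NON_DECREASING
 * (q_descend), iff no strictly larger value exists to its right.
 * Returns the number of surviving elements. */
static int monotonic_keep(const int *a, int n, Order ord, bool *keep)
{
    int count = 0;
    int extreme = 0; /* running min (NON_INCREASING) or max (NON_DECREASING) */
    for (int i = n - 1; i >= 0; i--) {
        if (count == 0 || ord * (a[i] - extreme) >= 0) {
            keep[i] = true;
            extreme = a[i];
            count++;
        } else {
            keep[i] = false;
        }
    }
    return count;
}
```

Scanning from the right keeps the check O(1) per node: only the current extreme value needs to be remembered.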
### q_sort()
Originally, inspired by [jserv's implementation](https://hackmd.io/Z-vcRTCjQdKqu4UAyswd5g?view#q_sort), I created an array of pointers to the nodes in the list and used bottom-up merge sort. However, since the array size is fixed, this approach has limitations. If the length of the queue is too small, it wastes memory; if it's too large, the program can't handle it.
To address this, I considered implementing a C++-style `vector` with an **exponential growth policy** (doubling the capacity on reallocation). Unfortunately, this assignment forbids the use of `malloc` here, so I came up with an alternative: maintain a stack of sorted runs and merge the top two runs whenever their lengths are equal. This reduces the auxiliary space from `O(n)` to `O(log n)`.
For more details, see [Commit 621bf36](https://github.com/SimonLiu423/lab0-c/commit/621bf360a2f18977f7b6ab2ffd20970099646629).
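The O(log n) bound on the run stack can be seen with a toy model that tracks run lengths only: push a length-1 run per element and merge the top two runs whenever their lengths are equal. After `n` pushes the stack holds one run per set bit of `n`, so its depth never exceeds ⌈log2(n)⌉ + 1. This sketch (with the hypothetical name `merge_stack_depth`) models only the bookkeeping, not the actual node merging:

```c
#include <assert.h>

/* Toy model of the merge stack: push 1-element runs, and merge the two
 * topmost runs whenever their lengths are equal. Returns the stack depth
 * after n pushes (at most one run per power of two => O(log n)). */
static int merge_stack_depth(int n)
{
    int stack[64]; /* run lengths; 64 slots far exceed log2(n) for any int */
    int top = 0;   /* number of runs currently on the stack */
    for (int i = 0; i < n; i++) {
        stack[top++] = 1; /* push a new run of length 1 */
        /* merge while the two topmost runs have equal length */
        while (top >= 2 && stack[top - 1] == stack[top - 2]) {
            stack[top - 2] += stack[top - 1];
            top--;
        }
    }
    return top;
}
```

The final depth equals the number of set bits in `n`, which is why the space usage stays logarithmic.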
### q_merge()
The approach is simple: move all nodes to the first queue and then run `q_sort()`.
If there's only one queue in the chain, we can return immediately, as the queue is already sorted by definition.
For more details, see [Commit d1ef5be](https://github.com/SimonLiu423/lab0-c/commit/d1ef5becbf314ba771115a2901c5353e6121ece5).
## Dude, is my code constant time?
### About the paper
#### Problem
Timing attacks are a type of side-channel attack used to extract secrets, such as cryptographic keys or passwords, by analyzing execution time. They are relatively easy to perform compared to other side-channel attacks.
Existing detection methods have a major drawback: they require hardware modeling, which is extremely difficult.
#### Solution
This method detects timing leakage by analyzing execution time. It measures execution time for **two different input data classes** and then checks whether their **distributions** are **statistically** different.
##### Step 1: Measure execution time
- Collect two sets of execution time measurements for two different input data classes.
- Use *fix-vs-random* tests, where the fixed class may be chosen to trigger edge cases.
- Use cycle counters to measure execution time.
- To minimize the effect of environmental changes in the results, randomly assign each measurement to a class.
- Interleave class sequences during evaluation to prevent interference from concurrent processes.
##### Step 2: Apply post-processing

- Remove outliers by cropping measurements that exceed a fixed threshold.
- Measurements may be positively skewed due to artifacts such as OS interruptions or other extraneous activities.
- Apply higher-order preprocessing techniques.
##### Step 3: Apply statistical test
- Disprove the null hypothesis *"The two timing distributions are equal"*.
- Use statistical methods such as:
- t-test
- Non-parametric test
### Issues
Originally, constant-time measurement passed on my local machine but failed on GitHub Actions. After analyzing `dudect/fixture.c` and `dudect/constant.c` and comparing them with [oreparaz/dudect](https://github.com/oreparaz/dudect), I found some poorly implemented code that could lead to inaccurate results.
#### 1. Measurement dropping
As mentioned earlier, the sequence of classes should be interleaved during evaluation. However, the original code did not do this correctly.
In `dudect/constant.c`:
```c
for (size_t i = DROP_SIZE; i < N_MEASURES - DROP_SIZE; i++) {
    ...
    before_ticks[i] = cpucycles();
    /* Target function to measure */
    after_ticks[i] = cpucycles();
    ...
}
```
the loop **skipped the first and last `DROP_SIZE` iterations entirely** instead of performing the measurements and then discarding their results, so the intended interleaving and warm-up behavior was lost.
To fix this, I made the following changes:
```diff
+ int measure_idx = 0;
+ int64_t a_ticks, b_ticks;
  for (size_t i = 0; i < N_MEASURES; i++) {
      ...
-     before_ticks[i] = cpucycles();
+     b_ticks = cpucycles();
      /* Target function to measure */
-     after_ticks[i] = cpucycles();
+     a_ticks = cpucycles();
      ...
+     if (i < DROP_SIZE || i >= N_MEASURES - DROP_SIZE)
+         continue;
+     before_ticks[measure_idx] = b_ticks;
+     after_ticks[measure_idx] = a_ticks;
+     measure_idx++;
  }
```
This ensures that the measurements are still performed even when their results are about to be discarded.
#### 2. Cropping
In **oreparaz's version**, multiple tests are performed:
- one first-order uncropped test,
- `DUDECT_NUMBER_PERCENTILES` cropped tests,
- one second-order test.
The highest *t-value* among these tests determines whether the target function is considered constant-time.
In this assignment, only the **uncropped test** was performed; the cropping mechanism was missing entirely. So I copied and adapted the implementation from [oreparaz/dudect](https://github.com/oreparaz/dudect).
Now I compute the **cutoff value** with `PERCENTILE_CUTOFF = 0.9`, so that only measurements **below the cutoff** enter the t-test:
```c
static void prepare_cutoff_value(int64_t *exec_times, double *cutoff_value)
{
    qsort(exec_times, VALID_MEASURES, sizeof(int64_t),
          (int (*)(const void *, const void *)) cmp);
    *cutoff_value = percentile(exec_times, PERCENTILE_CUTOFF, VALID_MEASURES);
}
```
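For reference, here is a sketch of a `percentile()` helper consistent with the call above. This is an assumption about its behavior (return the element at the given fraction of the sorted sample); the actual helper in the tree may round differently:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp(const int64_t *a, const int64_t *b)
{
    return (*a > *b) - (*a < *b);
}

/* Return the element at fraction `which` of a sorted sample, e.g.
 * which = 0.9 returns a value >= roughly 90% of the measurements. */
static int64_t percentile(const int64_t *sorted, double which, size_t size)
{
    size_t pos = (size_t) (which * size);
    if (pos >= size)
        pos = size - 1;
    return sorted[pos];
}
```

Cropping at the 0.9 percentile removes the long positive tail caused by OS interruptions before the t-test is applied.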
```diff
- /* do a t-test on the execution time */
- t_push(t, difference, classes[i]);
+ if (difference < cutoff_value)
+     /* do a t-test on the execution time */
+     t_push(t, difference, classes[i]);
```
#### 3. Potential bug in oreparaz's dudect_main()
While modifying the code, I noticed a potential bug in oreparaz's `dudect_main()`:
```c
if (first_time) {
    // throw away the first batch of measurements.
    // this helps warming things up.
    prepare_percentiles(ctx);
} else {
    update_statistics(ctx);
    ret = report(ctx);
}
```
This code intends to discard the first batch of measurements to warm up the system. However, **it still computes percentiles using that first batch**.
I believe this is what caused this GitHub Actions [run](https://github.com/SimonLiu423/lab0-c/actions/runs/13758105737) to get stuck while testing `remove_head()`.
Additionally, I think the percentiles should be **calculated for each batch** rather than depending only on the first batch.
Final Fix:
```diff
- if (*cutoff_value == -1) {
-     prepare_cutoff_value(exec_times, cutoff_value);
+ if (first_time) {
+     /* Discard first measurement to warm things up */
+     first_time = false;
  } else {
-     update_statistics(exec_times, classes, *cutoff_value);
+     prepare_cutoff_value(exec_times, &cutoff_value);
+     update_statistics(exec_times, classes, cutoff_value);
+     ret &= report();
  }
```
## Shuffle
The queue shuffling function is implemented using the [Fisher-Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle).
See [Commit 4a047ee](https://github.com/SimonLiu423/lab0-c/commit/4a047ee10a4733454262e31b299e02ba5753b8ab) for details.
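For reference, the array form of the algorithm (the committed version walks the queue and swaps element values instead, and seeds the PRNG once at startup rather than per call):

```c
#include <assert.h>
#include <stdlib.h>

/* Fisher-Yates: for i from n-1 down to 1, swap a[i] with a[j], where
 * j is uniform in [0, i]. Assuming an unbiased rand(), each of the n!
 * permutations is equally likely. */
static void shuffle(int *a, int n)
{
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1); /* modulo bias is negligible for small i */
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```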
### Experiments
To verify if the shuffle implementation produces **uniform random permutations**, I used two methods:
- Plotting the frequency distribution
- [Chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test).
*(**6** elements shuffled **500400** times)*
#### 1. Plotting
- **X-axis**: Permutation index
- **Y-axis**: Frequency of each permutation after shuffling
Since we are shuffling ***6 elements***, there are a total of **6! = 720** unique permutations.
The permutation index is computed using the following code ([reference](https://www.geekviewpoint.com/python/numbers/permutation_index)):
```python
import math

def get_permutation_index(n, perm):
    index = 0
    for i in range(n):
        smaller_count = sum(1 for j in range(i + 1, n) if perm[j] < perm[i])
        index += smaller_count * math.factorial(n - 1 - i)
    return index
```

Apparently, the result was **not uniform**. After debugging, I found the issue was due to calling `srand(time(NULL))` on **every** shuffle call.
After removing it, the distribution becomes much more reasonable.

#### 2. Chi-squared test
To further validate uniformity, I conducted a `Chi-squared test` with null hypothesis: *"The shuffle algorithm produces a uniform distribution."*
**Formula:**
$$
X^2 = \sum_{i=1}^{n} {(H_i - E)^2 \over E}
$$
Where:
- $H_i$ = Observed count of permutation *i*
- $E$ = Expected count = ${500400\over 6!}=695$
The computed $X^2$ value was around $684.4$.
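For reference, the statistic is straightforward to compute from the observed counts; a sketch with the hypothetical helper name `chi_squared`:

```c
#include <assert.h>
#include <math.h>

/* Pearson's chi-squared statistic against a uniform expectation over
 * k categories, given the total number of trials. */
static double chi_squared(const long *observed, int k, long total)
{
    double expected = (double) total / k;
    double x2 = 0;
    for (int i = 0; i < k; i++) {
        double d = observed[i] - expected;
        x2 += d * d / expected;
    }
    return x2;
}
```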
Checking the p-value requires the correct degrees of freedom: with $720$ permutation categories, $df = 720 - 1 = 719$ (not $df = 5$, which would apply to only $6$ categories). For $X^2 \approx 684.4$ and $df = 719$, the p-value is roughly $0.8$, well above $\alpha = 0.05$.
Thus, we **fail to reject the null hypothesis**: the observed frequencies are consistent with a uniform shuffle.
:::info
Failing to reject the null hypothesis does not prove uniformity. Two factors still deserve scrutiny:
1. **Virtual machine environment**: the test was run on a VM, which may introduce environmental factors affecting randomness.
2. **`rand()` function**: `rand()` is not a high-quality generator and may not produce perfectly uniformly distributed values.
Further experiments (more trials, or a better PRNG) would strengthen the conclusion.
:::
#### Reference
[Amherst College - Introduction to Computer Science II Lab-06](https://sfkaplan.people.amherst.edu/courses/2017/spring/COSC-112/assignments/lab-6.pdf)
## Address Sanitizer & Valgrind
Running `make SANITIZER=1 test` and `make valgrind` found no errors.
## Profiling `simulation` Heap Usage With Massif
> Reference [YM Chen](https://hackmd.io/@cymizer/2021q1_lab0), [salmoniscute](https://hackmd.io/@salmonii/linux2025-homework1)
[Massif](https://valgrind.org/docs/manual/ms-manual.html) is a heap profiler from Valgrind. It measures how much heap memory our program uses.
I conducted some experiments with Massif to see whether `simulation` has any memory-usage issues.
The graphs below are plotted with `massif-visualizer`, which can be installed with
```shell
$ sudo apt install massif-visualizer
```
### All
At first, I measured `./qtest -f traces/trace-17-complexity.cmd` by running the following command:
```shell
$ valgrind --tool=massif ./qtest -f traces/trace-17-complexity.cmd
```
Result:

However, this is not very informative, since four different simulation tasks (`it`, `ih`, `rh`, `rt`) are measured together. So I broke things down and measured each operation individually.
Additionally, the default `--max-snapshots` is `100`; to get better resolution, the following experiments set it to `1000`.
### Insert tail
```shell
$ valgrind --tool=massif --max-snapshots=1000 ./qtest -f sim-it.cmd
```
where `sim-it.cmd`:
```
option simulation 1
it
option simulation 0
```
Result:

It shows that memory usage is inconsistent.
After looking into `measure` in `dudect/constant.c`:
```c
...
dut_insert_head(
    get_random_string(),
    *(uint16_t *) (input_data + i * CHUNK_SIZE) % 10000);
...
```
it appears that before each set of measurements, a random string is inserted into the list a random number of times (**0-9999**), causing a different amount of memory to be allocated across measurements.
Now, fix the count to `100` and measure again:
```diff
  ...
  dut_insert_head(
      get_random_string(),
-     *(uint16_t *) (input_data + i * CHUNK_SIZE) % 10000);
+     100);
  ...
```
Result:

The result is still not fully consistent, but better than before.
:::warning
We can see that `test_malloc` is responsible for the inconsistent memory usage.
However, when the number of random strings inserted into the list before the measurements is set to a smaller value such as `50`, the memory usage becomes consistent.
Further experiments are needed to confirm the root cause.
Result when number is set to `50`:

:::
### Insert head
```shell
$ valgrind --tool=massif --max-snapshots=1000 ./qtest -f sim-ih.cmd
```
where `sim-ih.cmd`:
```
option simulation 1
ih
option simulation 0
```
Result:

Result after fixing insert count to `100`:

:::warning
Similar to insert tail, when the insert count is set to a small value, the memory usage becomes consistent.
Result for insert count `50`:

:::
### Remove head
```shell
$ valgrind --tool=massif --max-snapshots=1000 ./qtest -f sim-rh.cmd
```
where `sim-rh.cmd`:
```
option simulation 1
rh
option simulation 0
```
Result:

Result after fixing insert count to `100`:

:::warning
Similar to insert tail, when the insert count is set to a small value, the memory usage becomes consistent.
Result for insert count `50`:

:::
### Remove tail
```shell
$ valgrind --tool=massif --max-snapshots=1000 ./qtest -f sim-rt.cmd
```
where `sim-rt.cmd`:
```
option simulation 1
rt
option simulation 0
```
Result:

Result after fixing insert count to `100`:

:::warning
Similar to insert tail, when the insert count is set to a small value, the memory usage becomes consistent.
Result for insert count `50`:

:::