# 2025q1 Homework1 (lab0)
contributed by < `SimonLiu423` >
{%hackmd NrmQUGbRQWemgwPfhzXj6g %}
## Development Environment
```shell
$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Common KVM processor
CPU family: 15
Model: 6
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 1
BogoMIPS: 5999.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 16 MiB (4 instances)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX unsupported
L1tf: Mitigation; PTE Inversion
Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Vulnerable
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Srbds: Not affected
Tsx async abort: Not affected
```
## Implementation of queue operations
### q_insert_head()
Initial code:
```c
bool q_insert_head(struct list_head *head, char *s)
{
    if (!head)
        return false;
    /* Create element node with value copied from s */
    element_t *node = malloc(sizeof(element_t));
    if (!node)
        return false;
    char *val = malloc((strlen(s) + 1) * sizeof(char));
    if (!val) {
        free(node);
        return false;
    }
    strcpy(val, s);
    node->value = val;
    /* Add element to the head of queue */
    list_add(&node->list, head);
    return true;
}
```
In the code above, the "Create element" segment will also be used in `q_insert_tail()`. To follow the [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) principle, I extracted that part into a separate function.
```c
static inline element_t *create_element(char *s)
{
    element_t *node = malloc(sizeof(element_t));
    if (!node)
        return NULL;
    char *val = malloc((strlen(s) + 1) * sizeof(char));
    if (!val) {
        free(node);
        return NULL;
    }
    strcpy(val, s);
    node->value = val;
    return node;
}
```
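With the helper factored out, `q_insert_head()` reduces to a few lines. The sketch below inlines minimal versions of `list_head`, `list_add`, and `element_t` so it compiles on its own; in the lab these come from `list.h` and `queue.h`, so treat this as a model rather than the committed code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Minimal stand-ins for list.h / queue.h so the sketch is self-contained */
struct list_head {
    struct list_head *prev, *next;
};
typedef struct {
    char *value;
    struct list_head list;
} element_t;

static void INIT_LIST_HEAD(struct list_head *head)
{
    head->prev = head->next = head;
}

static void list_add(struct list_head *node, struct list_head *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

static element_t *create_element(char *s)
{
    element_t *node = malloc(sizeof(element_t));
    if (!node)
        return NULL;
    char *val = malloc(strlen(s) + 1);
    if (!val) {
        free(node);
        return NULL;
    }
    strncpy(val, s, strlen(s) + 1);
    node->value = val;
    return node;
}

bool q_insert_head(struct list_head *head, char *s)
{
    if (!head)
        return false;
    element_t *node = create_element(s);
    if (!node)
        return false;
    list_add(&node->list, head);
    return true;
}
```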
This function is declared `static inline`. `static` gives the function internal linkage, so it can only be used within this file and is inaccessible from outside, providing **encapsulation**; `inline` suggests that the compiler replace each call site with the function body, avoiding function-call overhead and improving efficiency.
/* TODO: notes to finish once I fully understand the spec */
For `static`, see C99 **6.2.2 Linkages of identifiers**, ...
For `inline`, see **6.7.4 Function specifiers** ...
/******************/
After finishing, running `git commit -a` was blocked by the `pre-commit` hook. The reason: `strcpy()` is unsafe because it does not bound the copy by the destination buffer's size, so it may write to unexpected memory addresses.
The fix is to use `strncpy` instead; the modified code is shown below.
```diff
  char *val = malloc((strlen(s) + 1) * sizeof(char));
  if (!val) {
      free(node);
      return NULL;
  }
- strcpy(val, s);
+ strncpy(val, s, strlen(s) + 1);
  node->value = val;
  return node;
```
### q_delete_mid()
[jserv's tutorial](https://hackmd.io/@sysprog/c-linked-list#Merge-Sort-%E7%9A%84%E5%AF%A6%E4%BD%9C) uses the fast/slow-pointer technique to find the midpoint of a list; part of the code is shown below:
```c
node_t *slow = head;
for (node_t *fast = head->next; fast && fast->next; fast = fast->next->next)
    slow = slow->next;
node_t *mid = slow->next;
slow->next = NULL;
```
When the queue size is even, this code yields a `mid` that is one node too far to the right; for example, with `size = 4`, `mid` points to the third element of the queue. I also found this version hard to read, so I reimplemented it with a do-while loop.
Code after addressing both points: [Commit ddf0ada](https://github.com/SimonLiu423/lab0-c/commit/ddf0ada5881af469be1cfa5711e3a505b0ec180f)
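One possible do-while formulation of the two-pointer scan on the circular, sentinel-headed list (minimal definitions inlined so the snippet stands alone; not necessarily identical to the committed code). For an even length it stops at the left of the two middle nodes:

```c
#include <assert.h>

struct list_head {
    struct list_head *prev, *next;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Return the middle node of a non-empty circular list; for an even
 * length, the left of the two middle nodes (e.g. the 2nd of 4). */
static struct list_head *find_mid(struct list_head *head)
{
    struct list_head *slow = head, *fast = head;
    do {
        slow = slow->next;
        fast = fast->next->next;
    } while (fast != head && fast->next != head);
    return slow;
}
```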
### q_swap()
[Commit 18c4e7b](https://github.com/SimonLiu423/lab0-c/commit/18c4e7bec777223693e64a8e168a2a6e583a9e1a)
The approach: traverse the first node of every pair in the list, remove it with `list_del` (the node's own `prev` and `next` pointers remain unchanged at this point), then use `list_add` to insert the node after the next node.
```c
struct list_head *node = head->next;
while (node != head && node->next != head) {
    list_del(node);
    list_add(node, node->next);
    node = node->next;
}
```
In `list_add(node, node->next)`, the second argument `node->next` is not the list head, which is inconsistent with the parameter described in the function's documentation:
```c
/**
 * list_add - Insert a node at the beginning of a circular list
 * @node: Pointer to the list_head structure to add.
 * @head: Pointer to the list_head structure representing the list head.
 *
 * Adds the specified @node immediately after @head in a circular doubly-linked
 * list, effectively placing it at the beginning. The existing first node, if
 * any, shifts to follow @node, and the list's circular structure is maintained.
 */
```
However, looking closely at the implementation, the function can equally well be read as inserting `node` after an arbitrary `head` node, not necessarily at the front of the list.
The description of `list_add` in [torvalds/linux/include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h#L183) likewise says that `new` is inserted after `head`.
```c
/**
 * list_add - add a new entry
 * @new: new entry to be added
 * @head: list head to add it after
 *
 * Insert a new entry after the specified head.
 * This is good for implementing stacks.
 */
```
I opened pull request [#225](https://github.com/sysprog21/lab0-c/pull/225) to update the description of `list_add` accordingly.
---
### q_reverseK()
The strategy here is to traverse the list and, for every `k` nodes, form a local circular list by attaching the first and last node to a temporary head.
Then, use `q_reverse()` with the temporary head to reverse the `k` elements.
Finally, reattach the reversed nodes to the original list and continue the process for the remaining nodes.
Note that when `k = 2`, this operation is equivalent to `q_swap()`.
```c
if (k == 2) {
q_swap(head);
return;
}
```
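The cut-reverse-splice loop described above can be sketched as follows. Minimal list definitions are inlined so the snippet compiles on its own; the committed code uses `list.h`'s helpers, so this is a model of the approach rather than the exact implementation:

```c
#include <assert.h>

struct list_head {
    struct list_head *prev, *next;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Reverse a circular list in place by swapping next/prev in every node,
 * including the sentinel head. */
static void q_reverse(struct list_head *head)
{
    struct list_head *cur = head;
    do {
        struct list_head *next = cur->next;
        cur->next = cur->prev;
        cur->prev = next;
        cur = next;
    } while (cur != head);
}

/* Reverse every group of k nodes; a trailing group shorter than k
 * is left untouched. */
static void q_reverseK(struct list_head *head, int k)
{
    struct list_head *group_prev = head;
    for (;;) {
        /* Find the last node of the next k-group */
        struct list_head *last = group_prev;
        for (int i = 0; i < k; i++) {
            last = last->next;
            if (last == head)
                return; /* fewer than k nodes remain */
        }
        struct list_head *next_group = last->next;

        /* Detach the k nodes onto a temporary circular head */
        struct list_head tmp;
        tmp.next = group_prev->next;
        tmp.prev = last;
        group_prev->next->prev = &tmp;
        last->next = &tmp;

        q_reverse(&tmp); /* reverse the detached group */

        /* Splice the reversed group back into the original list */
        group_prev->next = tmp.next;
        tmp.next->prev = group_prev;
        tmp.prev->next = next_group;
        next_group->prev = tmp.prev;

        group_prev = tmp.prev;
    }
}
```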
### q_ascend() & q_descend()
The goal of these two functions is to remove every node that has a node with a strictly smaller (for `q_ascend()`) or larger (for `q_descend()`) value anywhere to its right.
In other words, `q_ascend()` creates a monotonically non-increasing sequence, while `q_descend()` creates a monotonically non-decreasing sequence, both starting from the last element working backward.
The only difference between the two is whether the resulting sequence is non-increasing or non-decreasing, so both call a shared function, `monotonic_from_right()`, with an enumeration value specifying the desired order.
```c
typedef enum _order { NON_DECREASING = 1, NON_INCREASING = -1 } Order;
```
Implementation for `q_ascend()`:
```c
int q_ascend(struct list_head *head)
{
    return monotonic_from_right(head, NON_INCREASING);
}
```
`q_descend()`:
```c
int q_descend(struct list_head *head)
{
    return monotonic_from_right(head, NON_DECREASING);
}
```
For implementation of `monotonic_from_right()`, see [Commit ba136d7](https://github.com/SimonLiu423/lab0-c/commit/ba136d7b7b76fa906fbb2fbd97d730b0937da7c7)
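To illustrate the right-to-left scan independent of the list plumbing, here is the same comparison logic over a plain array of ints. This is a simplified model, not the committed `monotonic_from_right()` (which traverses the list backward and frees the removed nodes); `monotonic_keep` is a hypothetical name:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { NON_DECREASING = 1, NON_INCREASING = -1 } Order;

/* Mark which elements survive a right-to-left monotonic filter.
 * For NON_INCREASING (used by q_ascend), an element survives iff no
 * strictly smaller value exists to its right; for NON_DECREASING
 * (q_descend), iff no strictly larger value exists to its right.
 * Returns the number of surviving elements. */
static int monotonic_keep(const int *a, int n, Order ord, bool *keep)
{
    int count = 0;
    int extreme = 0; /* running min (NON_INCREASING) or max (NON_DECREASING) */
    for (int i = n - 1; i >= 0; i--) {
        if (count == 0 || ord * (a[i] - extreme) >= 0) {
            keep[i] = true;
            extreme = a[i];
            count++;
        } else {
            keep[i] = false;
        }
    }
    return count;
}
```

Scanning from the right keeps the check O(1) per node: only the current extreme value needs to be remembered.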
### q_sort()
Originally, inspired by [jserv's implementation](https://hackmd.io/Z-vcRTCjQdKqu4UAyswd5g?view#q_sort), I created an array of pointers to the nodes in the list and used bottom-up merge sort. However, since the array size is fixed, this approach has limitations. If the length of the queue is too small, it wastes memory; if it's too large, the program can't handle it.
To address this, I considered implementing a C++-style `vector` with an **exponential growth policy** (doubling the capacity on reallocation). Unfortunately, this assignment forbids the use of `malloc` here, so I came up with an alternative: maintain a stack of sorted runs and merge the top two runs whenever their lengths are equal. This reduces the auxiliary space from `O(n)` to `O(log n)`.
For more details, see [Commit 621bf36](https://github.com/SimonLiu423/lab0-c/commit/621bf360a2f18977f7b6ab2ffd20970099646629).
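The O(log n) bound on the run stack can be seen with a toy model that tracks run lengths only: push a length-1 run per element and merge the top two runs whenever their lengths are equal. After `n` pushes the stack holds one run per set bit of `n`, so its depth never exceeds ⌈log2(n)⌉ + 1. This sketch (with the hypothetical name `merge_stack_depth`) models only the bookkeeping, not the actual node merging:

```c
#include <assert.h>

/* Toy model of the merge stack: push 1-element runs, and merge the two
 * topmost runs whenever their lengths are equal. Returns the stack depth
 * after n pushes (at most one run per power of two => O(log n)). */
static int merge_stack_depth(int n)
{
    int stack[64]; /* run lengths; 64 slots far exceed log2(n) for any int */
    int top = 0;   /* number of runs currently on the stack */
    for (int i = 0; i < n; i++) {
        stack[top++] = 1; /* push a new run of length 1 */
        /* merge while the two topmost runs have equal length */
        while (top >= 2 && stack[top - 1] == stack[top - 2]) {
            stack[top - 2] += stack[top - 1];
            top--;
        }
    }
    return top;
}
```

The final depth equals the number of set bits in `n`, which is why the space usage stays logarithmic.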
### q_merge()
The approach is simple: move all nodes to the first queue and then run `q_sort()`.
If there's only one queue in the chain, we can return immediately, as the queue is already sorted by definition.
For more details, see [Commit d1ef5be](https://github.com/SimonLiu423/lab0-c/commit/d1ef5becbf314ba771115a2901c5353e6121ece5).
## Dude, is my code constant time?
### About the paper
#### Problem
Timing attacks are a type of side-channel attack used to extract secrets, such as cryptographic keys or passwords, by analyzing execution time. They are relatively easy to perform compared to other side-channel attacks.
Existing detection methods have a major drawback: they require hardware modeling, which is extremely difficult.
#### Solution
This method detects timing leakage by analyzing execution time. It measures execution time for **two different input data classes** and then checks whether their **distributions** are **statistically** different.
##### Step 1: Measure execution time
- Collect two sets of execution time measurements for two different input data classes.
- Use *fix-vs-random* tests, where the fixed class may be chosen to trigger edge cases.
- Use cycle counters to measure execution time.
- To minimize the effect of environmental changes in the results, randomly assign each measurement to a class.
- Interleave class sequences during evaluation to prevent interference from concurrent processes.
##### Step 2: Apply post-processing

- Remove outliers by cropping measurements that exceed a fixed threshold.
- Measurements may be positively skewed due to artifacts such as OS interruptions or other extraneous activities.
- Apply higher-order preprocessing techniques.
##### Step 3: Apply statistical test
- Disprove the null hypothesis *"The two timing distributions are equal"*.
- Use statistical methods such as:
- t-test
- Non-parametric test
### Issues
Originally, constant-time measurement passed on my local machine but failed on GitHub Actions. After analyzing `dudect/fixture.c` and `dudect/constant.c` and comparing them with [oreparaz/dudect](https://github.com/oreparaz/dudect), I found some poorly implemented code that could lead to inaccurate results.
#### 1. Measurement dropping
As mentioned earlier, the sequence of classes should be interleaved during evaluation. However, the original code did not do this correctly.
In `dudect/constant.c`:
```c
for (size_t i = DROP_SIZE; i < N_MEASURES - DROP_SIZE; i++) {
    ...
    before_ticks[i] = cpucycles();
    /* Target function to measure */
    after_ticks[i] = cpucycles();
    ...
}
```
the loop **skipped the first and last `DROP_SIZE` iterations entirely** instead of performing the measurements and then discarding their results, so the intended interleaving and warm-up behavior was lost.
To fix this, I made the following changes:
```diff
+ int measure_idx = 0;
+ int64_t a_ticks, b_ticks;
  for (size_t i = 0; i < N_MEASURES; i++) {
      ...
-     before_ticks[i] = cpucycles();
+     b_ticks = cpucycles();
      /* Target function to measure */
-     after_ticks[i] = cpucycles();
+     a_ticks = cpucycles();
      ...
+     if (i < DROP_SIZE || i >= N_MEASURES - DROP_SIZE)
+         continue;
+     before_ticks[measure_idx] = b_ticks;
+     after_ticks[measure_idx] = a_ticks;
+     measure_idx++;
  }
```
This ensures that the measurements are still performed even when their results are about to be discarded.
#### 2. Cropping
In **oreparaz's version**, multiple tests are performed:
- one first-order uncropped test,
- `DUDECT_NUMBER_PERCENTILES` cropped tests,
- one second-order test.
The highest *t-value* among these tests determines whether the target function is considered constant-time.
In this assignment, only the **uncropped test** was performed; the cropping mechanism was missing entirely. So I copied and adapted the implementation from [oreparaz/dudect](https://github.com/oreparaz/dudect).
Now I compute the **cutoff value** with `PERCENTILE_CUTOFF = 0.9`, so that only measurements **below the cutoff** enter the t-test:
```c
static void prepare_cutoff_value(int64_t *exec_times, double *cutoff_value)
{
    qsort(exec_times, VALID_MEASURES, sizeof(int64_t),
          (int (*)(const void *, const void *)) cmp);
    *cutoff_value = percentile(exec_times, PERCENTILE_CUTOFF, VALID_MEASURES);
}
```
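For reference, here is a sketch of a `percentile()` helper consistent with the call above. This is an assumption about its behavior (return the element at the given fraction of the sorted sample); the actual helper in the tree may round differently:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp(const int64_t *a, const int64_t *b)
{
    return (*a > *b) - (*a < *b);
}

/* Return the element at fraction `which` of a sorted sample, e.g.
 * which = 0.9 returns a value >= roughly 90% of the measurements. */
static int64_t percentile(const int64_t *sorted, double which, size_t size)
{
    size_t pos = (size_t) (which * size);
    if (pos >= size)
        pos = size - 1;
    return sorted[pos];
}
```

Cropping at the 0.9 percentile removes the long positive tail caused by OS interruptions before the t-test is applied.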
```diff
- /* do a t-test on the execution time */
- t_push(t, difference, classes[i]);
+ if (difference < cutoff_value)
+     /* do a t-test on the execution time */
+     t_push(t, difference, classes[i]);
```
#### 3. Potential bug in oreparaz's dudect_main()
While modifying the code, I noticed a potential bug in oreparaz's `dudect_main()`:
```c
if (first_time) {
    // throw away the first batch of measurements.
    // this helps warming things up.
    prepare_percentiles(ctx);
} else {
    update_statistics(ctx);
    ret = report(ctx);
}
```
This code intends to discard the first batch of measurements to warm up the system. However, **it still computes percentiles using that first batch**.
I believe this is what caused this GitHub Actions [run](https://github.com/SimonLiu423/lab0-c/actions/runs/13758105737) to get stuck while testing `remove_head()`.
Additionally, I think the percentiles should be **calculated for each batch** rather than depending only on the first batch.
Final Fix:
```diff
- if (*cutoff_value == -1) {
-     prepare_cutoff_value(exec_times, cutoff_value);
+ if (first_time) {
+     /* Discard first measurement to warm things up */
+     first_time = false;
  } else {
-     update_statistics(exec_times, classes, *cutoff_value);
+     prepare_cutoff_value(exec_times, &cutoff_value);
+     update_statistics(exec_times, classes, cutoff_value);
+     ret &= report();
  }
```
## Shuffle
The queue shuffling function is implemented using the [Fisher-Yates shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle).
See [Commit 4a047ee](https://github.com/SimonLiu423/lab0-c/commit/4a047ee10a4733454262e31b299e02ba5753b8ab) for details.
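For reference, the array form of the algorithm (the committed version walks the queue and swaps element values instead, and seeds the PRNG once at startup rather than per call):

```c
#include <assert.h>
#include <stdlib.h>

/* Fisher-Yates: for i from n-1 down to 1, swap a[i] with a[j], where
 * j is uniform in [0, i]. Assuming an unbiased rand(), each of the n!
 * permutations is equally likely. */
static void shuffle(int *a, int n)
{
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1); /* modulo bias is negligible for small i */
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```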
### Experiments
To verify if the shuffle implementation produces **uniform random permutations**, I used two methods:
- Plotting the frequency distribution
- [Chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test).
*(**6** elements shuffled **500400** times)*
#### 1. Plotting
- **X-axis**: Permutation index
- **Y-axis**: Frequency of each permutation after shuffling
Since we are shuffling ***6 elements***, there are a total of **6! = 720** unique permutations.
The permutation index is computed using the following code ([reference](https://www.geekviewpoint.com/python/numbers/permutation_index)):
```python
import math

def get_permutation_index(n, perm):
    index = 0
    for i in range(n):
        smaller_count = sum(1 for j in range(i + 1, n) if perm[j] < perm[i])
        index += smaller_count * math.factorial(n - 1 - i)
    return index
```

Apparently, the result was **not uniform**. After debugging, I found the issue was due to calling `srand(time(NULL))` on **every** shuffle call.
After removing it, the distribution becomes much more reasonable.

#### 2. Chi-squared test
To further validate uniformity, I conducted a `Chi-squared test` with null hypothesis: *"The shuffle algorithm produces a uniform distribution."*
**Formula:**
$$
X^2 = \sum_{i=1}^{n} {(H_i - E)^2 \over E}
$$
Where:
- $H_i$ = Observed count of permutation *i*
- $E$ = Expected count = ${500400\over 6!}=695$
The computed $X^2$ value was around $684.4$.
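For reference, the statistic is straightforward to compute from the observed counts; a sketch with the hypothetical helper name `chi_squared`:

```c
#include <assert.h>
#include <math.h>

/* Pearson's chi-squared statistic against a uniform expectation over
 * k categories, given the total number of trials. */
static double chi_squared(const long *observed, int k, long total)
{
    double expected = (double) total / k;
    double x2 = 0;
    for (int i = 0; i < k; i++) {
        double d = observed[i] - expected;
        x2 += d * d / expected;
    }
    return x2;
}
```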
Checking the p-value requires the correct degrees of freedom: with $720$ permutation categories, $df = 720 - 1 = 719$ (not $df = 5$, which would apply to only $6$ categories). For $X^2 \approx 684.4$ and $df = 719$, the p-value is roughly $0.8$, well above $\alpha = 0.05$.
Thus, we **fail to reject the null hypothesis**: the observed frequencies are consistent with a uniform shuffle.
:::info
Failing to reject the null hypothesis does not prove uniformity. Two factors still deserve scrutiny:
1. **Virtual machine environment**: the test was run on a VM, which may introduce environmental factors affecting randomness.
2. **`rand()` function**: `rand()` is not a high-quality generator and may not produce perfectly uniformly distributed values.
Further experiments (more trials, or a better PRNG) would strengthen the conclusion.
:::
#### Reference
[Amherst College - Introduction to Computer Science II Lab-06](https://sfkaplan.people.amherst.edu/courses/2017/spring/COSC-112/assignments/lab-6.pdf)
## Address Sanitizer & Valgrind
Running `make SANITIZER=1 test` and `make valgrind` found no errors.
## Profiling `simulation` Heap Usage With Massif
> Reference [YM Chen](https://hackmd.io/@cymizer/2021q1_lab0), [salmoniscute](https://hackmd.io/@salmonii/linux2025-homework1)
[Massif](https://valgrind.org/docs/manual/ms-manual.html) is a heap profiler from Valgrind. It measures how much heap memory our program uses.
I conducted some experiments with Massif to see whether `simulation` has any memory-usage issues.
The graphs below are plotted with `massif-visualizer`, which can be installed with
```shell
$ sudo apt install massif-visualizer
```
### All
At first, I measured `./qtest -f traces/trace-17-complexity.cmd` by running the following command:
```shell
$ valgrind --tool=massif ./qtest -f traces/trace-17-complexity.cmd
```
Result:

However, this is not very informative, since four different simulation tasks (`it`, `ih`, `rh`, `rt`) are measured together. So I broke things down and measured each operation individually.
Additionally, the default `--max-snapshots` is `100`; to get better resolution, the following experiments set it to `1000`.
### Insert tail
```shell
$ valgrind --tool=massif --max-snapshots=1000 ./qtest -f sim-it.cmd
```
where `sim-it.cmd`:
```
option simulation 1
it
option simulation 0
```
Result:

It shows that memory usage is inconsistent.
After looking into `measure` in `dudect/constant.c`:
```c
...
dut_insert_head(
    get_random_string(),
    *(uint16_t *) (input_data + i * CHUNK_SIZE) % 10000);
...
```
it appears that before each set of measurements, a random string is inserted into the list a random number of times (**0-9999**), causing a different amount of memory to be allocated across measurements.
Now, fix the count to `100` and measure again:
```diff
  ...
  dut_insert_head(
      get_random_string(),
-     *(uint16_t *) (input_data + i * CHUNK_SIZE) % 10000);
+     100);
  ...
```
Result:

The result is still not fully consistent, but better than before.
:::warning
We can see that `test_malloc` is responsible for the inconsistent memory usage.
However, when the number of random strings inserted into the list before the measurements is set to a smaller value such as `50`, the memory usage becomes consistent.
Further experiments are needed to confirm the root cause.
Result when number is set to `50`:

:::
### Insert head
```shell
$ valgrind --tool=massif --max-snapshots=1000 ./qtest -f sim-ih.cmd
```
where `sim-ih.cmd`:
```
option simulation 1
ih
option simulation 0
```
Result:

Result after fixing insert count to `100`:

:::warning
Similar to insert tail, when the insert count is set to a small value, the memory usage becomes consistent.
Result for insert count `50`:

:::
### Remove head
```shell
$ valgrind --tool=massif --max-snapshots=1000 ./qtest -f sim-rh.cmd
```
where `sim-rh.cmd`:
```
option simulation 1
rh
option simulation 0
```
Result:

Result after fixing insert count to `100`:

:::warning
Similar to insert tail, when the insert count is set to a small value, the memory usage becomes consistent.
Result for insert count `50`:

:::
### Remove tail
```shell
$ valgrind --tool=massif --max-snapshots=1000 ./qtest -f sim-rt.cmd
```
where `sim-rt.cmd`:
```
option simulation 1
rt
option simulation 0
```
Result:

Result after fixing insert count to `100`:

:::warning
Similar to insert tail, when the insert count is set to a small value, the memory usage becomes consistent.
Result for insert count `50`:

:::