# 2022q1 Homework3 (quiz3)
contributed by < [cantfindagoodname](https://github.com/cantfindagoodname) >
> [題目](https://hackmd.io/@sysprog/linux2022-quiz3)
## 實驗環境
```shell
$ cat /proc/version
Linux version 5.13.0-30-generic (buildd@lcy02-amd64-003) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #33~20.04.1-Ubuntu SMP Mon Feb 7 14:25:10 UTC 2022
$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
```
## 測驗1
```c
/* GENMASK(h, l) */
/* h l */
/* 76543210 */
/* 01110000 */
```
Fill in `LEFT`, `RIGHT`:
```c
#define GENMASK(h, l) \
(((~0UL) >> (LEFT)) & ((~0UL) >> (l) << (RIGHT)))
```
From `man operator` :
```shell
Operator Associativity
<< >> left to right
& left to right
```
`~0UL` would produce a stream of set bits, which length depends on length of `long`.
> [ISO/IEC 9899[6.5.7:5]](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf)
> The result of $E1$ >> $E2$ is E1 right-shifted $E2$ bit positions. If $E1$ has an unsigned type or if $E1$ has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of $E1 / 2^{E2}$.
`(~0UL) >> LEFT` would clear the left most `LEFT` bits. Resulting `000...011...`, where number of clear bit would be `LEFT`.
`(~0UL) >> (l)` would discard the right most `l` bits, let the result be `r`.
> [ISO/IEC 9899[6.5.7:4]](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf)
> The result of $E1$ << $E2$ is $E1$ left-shifted $E2$ bit positions; vacated bits are filled with zeros.
`r << RIGHT` would clear the right-most `RIGHT` bits.
The macro should be doing the operation as followed :
```c
/* h l */
/* 01111111 */
/* & 11111000 */
```
`LEFT` : `63 - h` ( Given 64-bit architecture )
`RIGHT`: `l`
However, as stated from above, the vacated bits for right shift are filled with zero.
`(((~0UL) >> (63 - h)) & ((~0UL) << (l)))` should yield the same result.
### 延伸問題
#### 解釋上述程式碼運作原理
```c
#define __GENMASK(h, l) \
(((~UL(0)) - (UL(1) << (l)) + 1) & \
(~UL(0) >> (BITS_PER_LONG - 1 - (h))))
#define GENMASK(h, l) \
(GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l))
```
`GENMASK_INPUT_CHECK` uses `build_bug.h` library to force compilation error.
`(BITS_PER_LONG - 1 - (h))` is equivalent to `63 - h`, `BITS_PER_LONG` would not assume the word size as a constant, it instead takes the word size that is known at compile time.
`((~UL(0)) - (UL(1) << (l)) + 1)` uses the property that the first clear bit would clear all prior set bit(s) when add by 1, i.e. `0bXX0111 + 1 = 0bXX1000`. It generates the less significant bit mask.
`UL(X)` instead of `XUL` as 1UL and 0UL are hard-coded, therefore they are not available for assembly. [Source](https://github.com/torvalds/linux/commit/95b980d62d52c4c1768ee719e8db3efe27ef52b2)
#### 舉出 Linux 核心原始程式碼中 GENMASK 巨集和 include/linux/bitfield.h 的應用案例
`GENMASK` is very popular when programming in lower level, typically used when consecutive bits are used as flags.
e.g. [i2c](https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/drivers/i2c/busses/i2c-fsi.c)
## 測驗2
```c
struct foo;
struct fd {
struct foo *foo;
unsigned int flags;
};
enum {
FOO_DEFAULT = 0,
FOO_ACTION,
FOO_UNLOCK,
} FOO_FLAGS;
static inline struct fd to_fd(unsigned long v)
{
return (struct fd){EXP1, v & 3};
}
```
> [align_down](https://www.boost.org/doc/libs/1_78_0/doc/html/align/reference.html#align.reference.functions.align_down)
> Requires : alignment shall be power of 2
> [ISO/IEC9899:3.2](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf)
> **alignment**
> requirement that objects of a particular type be located on storage boundaries with addresses that are particular multiples of a byte address
The smallest addressable unit of memory is 1 byte, when align for 4 bytes, the address would be multiple of 4.
`EXP` : `v - v % 4`
We are required to not use `%` (modulo) operator. For power of 2, we can replace modulo with `&` mask.
`EXP` : `v & ~0x3`
The compiler would give a warning message `makes pointer from integer without a cast`.
We can cast the `intptr_t` value to `struct foo *` to dodge the warning.
`EXP` : `(struct foo *)(v & ~0x3)`
### 延伸問題
#### 在 Linux 核心原始程式碼找出類似技巧的實作
Any `intptr_t` reference more or less manipulates address as an integer type.
e.g. in [kernel.h](https://github.com/torvalds/linux/blob/master/include/linux/kernel.h), there is a macro `u64_to_user_ptr(x)` dedicated to using this technique.
The program also uses compound literals, it allows us to use initializer list to create an unnamed object.
e.g. in [list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h), `LIST_HEAD_INIT(name)` would initialize the head of list with compound literals.
The implementation of `ALIGN_DOWN` can be found in [linux/align.h](https://github.com/torvalds/linux/blob/master/include/linux/align.h)
`ALIGN_DOWN` is referenced in several files, a documented use of `ALIGN_DOWN` would be [fs/btrfs/send.c](https://github.com/torvalds/linux/blob/master/fs/btrfs/send.c), where it states that trying to clone a block would corrupt the data in destination file.
## 測驗3
```c
x = (x >> 4) | (x << 4);
x = ((x & 0xCC) >> 2) | (EXP2);
x = ((x & 0xAA) >> 1) | (EXP3);
```
`EXP2` : `((x & 0x33) << 2)`
`EXP3` : `((x & 0x55) << 1)`
```c
x = ((x & 0xF0) >> 4) | ((x & 0x0F) << 4); /* AAAA BBBB => BBBB AAAA */
x = ((x & 0xCC) >> 2) | ((x & 0x33) << 2); /* AABB CCDD => BBAA DDCC */
x = ((x & 0xAA) >> 1) | ((x & 0x55) << 1); /* ABCD EFGH => BADC FEHG */
```
Take a look at each index :
|Dec|Bin|
|:-:|:-:|
|0|000|
|1|000|
|2|010|
|3|011|
|4|100|
|5|101|
|6|110|
|7|111|
Index `i` would be moved to `(N-1) - i`, which will be equilvalent to `~i`, as `N` is power of 2.
Take `6 (110)` as an example, `~6` would be `001`, which means index 6 would do `>> 4`, `>> 2`, then `<< 1`.
As for the mask encoding, it simply encodes which bit is set.
i.e.
take `0xCC` and `0x33` as example, they are `11001100` and `00110011` respectively.
`11001100` would mask for `2`,`3`,`6`, and`7`, where all their second bit are set. Which means they would later shifted by `0b10`, which is why they do `>> 2`.
This uses the divide and conquer technique.
It splits the data by half every time it does a swap.
i.e.
```
A X X X|X X X X
A X|X X
A|X
A
```
### 延伸問題
#### 在 Linux 核心原始程式碼找出類似技巧的實作並解說其應用場景
Linux kernel would apply this technique when doing conversion between Big Endian and Little Endian.
The macros are defined in [byteorder/generic.h](https://github.com/torvalds/linux/blob/master/include/linux/byteorder/generic.h) and the implementation could be seen in [swab.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/swab.h)
Aside referencing the macros in conversion of endianness in interface between two hardware structure such as utility file [endian.h](https://github.com/torvalds/linux/blob/master/tools/include/tools/endian.h), and network interface [ipv4/esp4.c](https://github.com/torvalds/linux/blob/master/net/ipv4/esp4.c), encoding algorithm like secure hash [SHA-256](https://github.com/ilbers/linux/blob/master/arch/x86/purgatory/sha256.c) would also reference the macro to reverse bytes.
## 測驗4
```c
#include <assert.h>
#define _foreach_no_nullval(i, p, arr) \
assert((i) >= sizeof(arr) / sizeof(arr[0]) || (p))
#define foreach_int(i, ...) \
for (unsigned _foreach_i = (((i) = ((int[]){__VA_ARGS__})[0]), 0); \
_foreach_i < sizeof((int[]){__VA_ARGS__}) / sizeof(int); \
(i) = ((int[]){__VA_ARGS__, 0})[EXP4])
#define foreach_ptr(i, ...) \
for (unsigned _foreach_i = \
(((i) = (void *) ((typeof(i)[]){__VA_ARGS__})[0]), 0); \
(i); (i) = (void *) ((typeof(i)[]){__VA_ARGS__, \
NULL})[EXP5], \
_foreach_no_nullval(_foreach_i, i, \
((const void *[]){__VA_ARGS__})))
```
From [gcc documentation](https://gcc.gnu.org/onlinedocs/cpp/Variadic-Macros.html)
> Variadic Macros
> A macro can be declared to accept a variable number of arguments much as a function can.
The preprocessor would expand `__VA_ARGS__` to argument(s) in `...` (ellipsis) operator.
From [ISO/IEC9899:6.5.17](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf)
> Comma operator
> The left operand of a comma operator is evaluated as a void expression; there is a sequence point after its evaluation. Then the right operand is evaluated; the result has its type and value.
Then take a look at `foreach_int`
```c
/* Declaration part of for statement */
unsigned _foreach_i = (((i) = ((int[]){__VA_ARGS__})[0]), 0)
```
`_foreach_i` is an unsigned integer type, which would be initialized as 0
`i` is dereferencing a pointer to integer array, which points to first element in `__VA_ARGS__`
```c
/* In controlling expression of for statement */
_foreach_i < sizeof((int[]){__VA_ARGS__}) / sizeof(int)
```
The LHS indicates `_foreach_i` would be the iterator.
The RHS would be identical to `sizeof (int *) / sizeof(int)`, which indicates the number of elements.
```c
/* Final expression of for statement */
(i) = ((int[]){__VA_ARGS__, 0})[EXP4]
```
The statement would let `i` access the next element to the one it is currently accessing, which would be `[_foreach_i + 1]`.
Right before the statement is executed in the first iteration, the loop body would let `i` access the 0th element, and `_foreach_i` is `0`. Hence, `EXP4` should be a prefix `++` operator. By then, the loop would terminates when `_foreach_i == sizeof((int[]){__VA_ARGS__}) / sizeof(int)`, and would not cause out-of-bound memory access.
`EXP4` : `++_foreach_i`
The macro `foreach_ptr` is very similar to `foreach_int`, except it would pass pointer instead.
`EXP5` : `++_foreach_i`
One difference is the macro would check if the pointer is `NULL`
```c
#define _foreach_no_nullval(i, p, arr) \
assert((i) >= sizeof(arr) / sizeof(arr[0]) || (p))
```
From [ISO/IEC9899:6.5.14/4](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf)
> Logical OR operator
> If the first operand compares unequal to 0, the second operand is not evaluated.
The first argument is `_foreach_i`
assert would not give warning before `_foreach_i` reaches the last element.`p` `NULL` check would not be evaluated.
Once `_foreach_i` reaches the last element, the compiler would handle cases when the element isn't `NULL`.
## 測驗5
`EXP6` : `dvs << shift`
`EXP7` : `dvd -= dvs << shift`
```c
unsigned int dvd = dividend;
if (dividend < 0) {
signal *= -1;
dvd = ~dvd + 1;
}
unsigned int dvs = divisor;
if (divisor < 0) {
signal *= -1;
dvs = ~dvs + 1;
}
```
The first part of the program would ensure the division is operated for two positive numbers.
`signal` is recorded to let the program follow the rule for sign of multiplication.
```c
int shift = 0;
while (dvd > (dvs << shift))
shift++;
unsigned int res = 0;
while (dvd >= dvs) {
while (dvd < (dvs << shift))
shift--;
res |= (unsigned int) 1 << shift;
dvd -= dvs << shift;
}
```
The remaining part of would be doing long division in binary.
The progarm would align msb of `dvd` and `dvs` then subtract using `shift` (quotient is set after subtraction).
### 延伸問題
Improvement 1 :
```c
while (dvd > (dvs << shift))
shift++;
```
The above program would align both `dvd` and `dvs` by looping each binary digits, use `clz` would be faster. $O(digit)\rightarrow O(1)$
```c
shift = __builtin_clz(dvs) - __builtin_clz(dvd);
```
The while loop would do at most once. (Could be replaced with `if`)

Improvement 2 :
Apply hardware binary division
(Reference : [Computer Organization and Design MIPS EDITION .6E](https://www.tenlong.com.tw/products/9780128201091))

The algorithm is same as before, however, `shift` is replaced by bitmask (assume 32-bit) and shift by 1.

The result won't be faster in best case, however, it is more consistenly lower than the previous version (improvent 1).
### 延伸問題2
Fixed Point Arithmetic
When the precision is known, fixed point arithmetic would be better than floating point to not make the value "float".
This ensures the value would be correct every calculation, and the arithmetic is faster than floating point arithmetic.
The idea is as shown :
```c
val = n << precision
/* get value */
val >> precision . val & mask
```
One example could be seen at [quiz2/1.EMWA](https://hackmd.io/@sysprog/linux2022-quiz2#%E6%B8%AC%E9%A9%97-1)
## 測驗6
variables :
```c
int pointsSize; /* @args : number of points */
int *dup_cnts = malloc(pointsSize * sizeof(int));
int *hori_cnts = malloc(pointsSize * sizeof(int));
int *vert_cnts = malloc(pointsSize * sizeof(int));
int slope_size = pointsSize * pointsSize / 2 + 133;
int *slope_cnts = malloc(slope_size * sizeof(int));
```
From the name (and its implementation), `$(N)##_cnts` would mean number of points that shares some property
`dup` : same coordinates, (including self)
`hori`: same x-coordinates
`vert`: same y-coordinates
We can deduce `slope` is the remaining points, seperated by their slope/gradient.
:::info
the theoretical value for number of slope would be $\dfrac{{pointsSize}*{(pointsSize-1)}}{2}$, not including inverse for slope : (x,y) shares the same line with (y,x).
Not sure why `+133` is needed
:::
The last part of program simply takes the list with largest number :
```c
int max_num = 0;
for (i = 0; i < pointsSize; i++) {
if (hori_cnts[i] + 1 > max_num)
max_num = hori_cnts[i] + 1;
if (vert_cnts[i] + 1 > max_num)
max_num = vert_cnts[i] + 1;
}
for (i = 0; i < slope_size; i++) {
if (slope_cnts[i] > max_num)
max_num = slope_cnts[i];
}
```
The code to find group points with slope :
```c
/* i and j is every combination of two independent input points */
if (points[i].x != points[j].x && points[i].y != points[j].y) {
/* find slope as irreducible fraction */
int dx = points[j].x - points[i].x;
int dy = points[j].y - points[i].y;
int tmp = gcd(dx, dy);
dx /= tmp;
dy /= tmp;
/* Start of code segment */
int hash = dx * dy - 1333 * (dx + dy);
if (hash < 0)
hash = -hash;
hash %= slope_size;
if (can_insert(&heads[hash], i, j)) {
struct point_node *pn = malloc(sizeof(*pn));
pn->p1 = i;
pn->p2 = j;
EXP9;
}
}
/* can_insert */
static bool can_insert(struct list_head *head, int p1, int p2)
{
struct point_node *pn;
list_for_each_entry (pn, head, link)
return EXP8;
return true;
}
```
`EXP9` in the branch `can_insert`, it should be inserting into the list
`EXP9` : `list_add(&pn->link, &head[hash])`
Without knowing how the hash is calculated, assume `heads[hash]` each be list of points on the same straight line.
`can_insert` is very likely to be a function to prevent recounting line segments. Take a straight line for example :
```c
/* a--b--c */
/* 0 1 2 */
```
The current program would tranverse from front to back with :
```c
for (int i = 0;i < sz;++i) /* should be i < sz - 1 */
for (int j = i + 1;j < sz;++j)
/* ... */
```
which means, the line segment `a-b`, `a-c`, and `b-c`. In which `a-c` and `b-c` is repeated.
`can_insert` prevents this by counting points with same slope with 1 point.
`EXP8` : `pn->p1 == p1`
After finding every slopes, the program include duplicates :
```c
for (i = 0; i < slope_size; i++) {
int index = -1;
struct point_node *pn;
list_for_each_entry (pn, &heads[i], link) {
index = pn->p1;
slope_cnts[i]++;
}
if (index >= 0)
slope_cnts[i] += dup_cnts[index];
}
```
### 延伸問題
which is the implementation of hash function could cause false positive errors.
```c
printf("%d\n", maxPoints(
(struct Point []){
[0].x=0,[0].y=0,
[1].x=1,[1].y=1,
[2].x=12,[2].y=19,
[3].x=24,[3].y=38,
[4].x=36,[4].y=57,
[5].x=40,[5].y=12,
[6].x=5,[6].y=6,
[7].x=6,[7].y=7,
[8].x=7,[8].y=8,
[9].x=8,[9].y=9,
}, 10));
```

The function above would produce an output `5`, as `(dx,dy)` = `(1,1)` and `(12,19)` when `slope_count` = 10 would produce hash 103, they both would pass point `(0,0)`, hence the output is `5` instead of `4` (straight line are grouped together)
Note : mint colored line starts with `(0,0)`, blue colored line intersects x-axis at `(0,1)`
Hash map makes time complexity goes from $O(n^3)$ to $O(n^2)$, rough estimate :
n - tranverse point, n - tranverse point to compare, n (to 1 as hash) check slope list = $O(n^2)$
To satisfy space requirement, there should be time trade-off.
Change hash table to linked-list and the space complexity is immediately lowered if not worst-case, but the time complexity would goes up a dimension with the random access property from hash table lost.
We could potentially make the program more efficient by tackling the algorithm
There are few ways to confirm three points are collinear, such as :
find line equation (slope being one of the method),
calculate angle (180 degree or perpendicular bisector) (angle in computer program would have low precision, unless apply special formulas) (same with slope but potentially time save),
area = 0 (determinant) ($O(1)$ time complexity in calculation but has large hidden constant, $O(n^3) + O(1)$ vs $O(n^3) + O(logn)$)
## 測驗7
From the description, we need to find :
> the minimum number of bits required to store an unsigned 32-bit value without any leading zero bits
Which the function returns output similar to :
```c
32 - __builtin_clz(v)
```
With the similar structure repeated few times
```c
m = (v > 0xFFFU) << 4;
/* ... */
m = (v > 0xFFU) << 3
/* ... */
m = (v > 0xFU) << 2
```
We can expect similar result below :
```c
int ilog32(uint32_t v)
{
int ret = v > 0;
int m = (v > 0xFFFFU) << 4;
v >>= m;
ret |= m;
m = (v > 0xFFU) << 3;
v >>= m;
ret |= m;
m = (v > 0xFU) << 2;
v >>= m;
ret |= m;
m = (v > 0x3) << 1;
v >>= m;
ret |= m;
ret += v > 1;
return ret;
}
```
The program would use technique similar to binary search, in which it divides the value by half and determine which portion (upper or lower) the data is in.
As an example :
```c
m = (v > 0xFFFFU) << 4;
v >>= m;
```
v would be an unsigned 32-bit integer, hence the maximum value would be `0xFFFFFFFF`
`v > 0xFFFFU` would check if the most significant half of v has value. If so, at least (16 + 1) bits is needed to store the number.
e.g.
```c
/*
* base-16
* XXXX YYYY
* XXXX is not clear
* YYYY must be stored (16-bits)
*/
(v > 0xFFFFU) << 4;
1 << 4; // 16
```
Hence, we got `(v > 0xFFFFU) << 4`, which should be `(16)`
```c
m = (v > 0xFFU) << 3;
v >>= m;
```
The following part of the program would do the same, as the program already did `v >>= m`, it would check if 8 additional bits should be present to store the number, in addition to last calculation.
There should be 1 additional bit for each check
```c
/* v = 0x10000 */
m = (v > 0xFFFFU) << 4; // 16 but at least 17 bits are needed
```
The program would add the bit in the beginning :
```c
int ret = v > 0;
```
All `ret |= m ` would just add `m` to `ret`, for all those additional bits, however, as m must be power of 2, bitwise OR `|` operator is used instead.
To make the program more readable and maintainable,`ret |= m` could be replaced by `ret += m`
### 延伸問題
As stated earlier, `ilog32` did the same thing as `32 - __builtin_clz(v)`
> Tested with
```c
int main() {
for (uint32_t i = 0;i < UINT_MAX;++i){
printf("%d - Result : %d, Expected : %d\n", i, ilog32(i), 32 - __builtin_clz(i));
fflush(stdout);
assert(ilog32(i) == 32 - __builtin_clz(i));
printf("\x1b[1F\x1b[2K");
}
printf("ilog32 is no different to 32 - clz\n");
}
```
The code can be found in [bitops.h](https://github.com/torvalds/linux/blob/master/arch/arc/include/asm/bitops.h)
```c
/*
* fls = Find Last Set in word
* @result: [1-32]
* fls(1) = 1, fls(0x80000000) = 32, fls(0) = 0
*/
static inline __attribute__ ((const)) int fls(unsigned int x)
{
if (__builtin_constant_p(x))
return constant_fls(x);
return 32 - clz(x);
}
```
---
The idea of deBrujin is to find a hash function that generates unique output for each set bit.
$h(x) = (x * deBrujin) >> (n - lgn)$
```c
/* 8-bit value */
/* n = 8, lgn = 3 */
/* 1 2 3 4 5 6 7 8 */
```
${ >> (n - lgn)}$ ensures we get $lgn$ bits possible outputs
[deBrujin sequence](https://en.wikipedia.org/wiki/De_Bruijn_sequence), $B(n, lgn)$ makes sure every single output is an unique output.
This satisfies all three requirement of hash function
> we need
> - the hash table to be small
> - the hash function to be easily computable
> - the hash function to produce no collisions
:::info
The [paper](http://supertech.csail.mit.edu/papers/debruijn.pdf) gives an example to do `ffs()` (find first set) with `x & -x`, `fls()` could be done using similar technique
:::
## 測驗8
The point is to replace poniter operation with pointer to pointer operation
```c
/* Original version */
if (d < p->data) {
q = p;
p = p->left;
} else {
q = p;
p = p->right;
}
/* Modified version */
if (d < (*p)->data)
p = &(*p)->left;
else
p = &(*p)->right;
```
```c
/* Original version */
while (r->left) {
q = r;
r = r->left;
}
/* Modified version */
while ((*r)->left)
r = (*r)->left;
```
tested with :
```c
tree insert(tree &t, int d)
{
if (t == nullptr)
return new tnode(d);
if (d < t->data)
t->left = insert(t->left, d);
else
t->right = insert(t->right, d);
return t;
}
/* lvr inorder */
void print(tree &t)
{
if (t == nullptr)
return;
if (t->left)
print(t->left);
std::cout << t->data;
if (t->right)
print(t->right);
}
```
`EXP12` : `&(*p)->left`
`EXP13` : `&(*p)->right`
`EXP14` : `&(*r)->left`
:::info
Even though `*p = (*p)->left` looks indentical to `&(*p)->left`, they are completely different.
Difference :
> original

> `*p = (*p)->left`

> `p = &(*p)->left`

:::
### 延伸問題
[Static B-Tree](https://en.algorithmica.org/hpc/data-structures/s-tree/)
// TODO : Revision 2,3 Tree, 2,3,4 Tree (B-Tree), B+ Tree
> Reference
> [Fundamentals of Data Structures in C, 2/e](https://www.tenlong.com.tw/products/9780929306407)
## 測驗9
Alignment macro that rounds up would maps the least multiple of alignment greater than or equal to the argument.
i.e. $Align(x) = min\{n \in \mathbb{Z}\ |\ {ALIGNMENT}*n\ge x\}$
```c
/* maximum alignment needed for any type on this platform, rounded up to a
power of two */
#define MAX_ALIGNMENT 16
/* Given a size, round up to the next multiple of sizeof(void *) */
#define ROUND_UP_TO_ALIGNMENT_SIZE(x) \
(((x) + MAX_ALIGNMENT - MMM) & ~(NNN))
```
The alignment is 16, which means we would need to have a bitmask `1111`.
The mask could be calculated by doing `MAX_ALIGNMENT - 1`
A very straight forward way of rounding up would be :
```c
x & ~(MAX_ALIGNMENT - 1) + (MAX_ALIGNMENT ^ -(x > 0))
```
Where the lower order bits are cleared to align with the alignment, then round up the result if the argument is non-zero.
The test should use the same logic, but with a different approach.
The bitwise AND operator `&` should be clearing the lower order bits to align the input, hence :
`NNN` : `MAX_ALIGNMENT - 1`
Whats remaining would be rounding up if the argument is non-zero.
The lower order bits would later be cleared therefore only the most significant carry bit matters.
```c
x + MAX_ALIGNMENT - MMM
```
should result in a value that is between `MAX_ALIGNMENT` and `MAX_ALIGNMENT << 1`, if `x` is non-zero, and is `MAX_ALIGNMENT - 1` if `x` is zero. And so,
`MMM` : `1`
### 延伸問題
round down would be very straight forward, which is
```c
#define ROUND_DOWN_TO_ALIGNMENT_SIZE(x) \
((x) % MAX_ALIGNMENT)
```
Conveniently, `MAX_ALIGNMENT` is power of 2, hence the macro could be redefined as:
```c
#define ROUND_DOWN_TO_ALIGNMENT_SIZE(x) \
((x) & (MAX_ALIGNMENT - 1))
```
---
Not a program from Linux Kernel but still applying the idea :
In [2022q1 fibdrv](https://hackmd.io/@sysprog/linux2022-fibdrv), we are required to write a program to do arithmetic on big number that surpass the 64-bit range of `unsigned long` in LP64 Data Model.
One common approach is to encode the value as a character array then define a new function to do arithmetic operation. In those function, code segment to tranverse the array would be used extensively.
In modern CPU architecture, a character is normally 1 byte wide and the width of data bus (1 word) goes up to 8 bytes wide.
Everytime we do an operation on a character, the data bus of CPU is not fully utilized, hence we could do the following optimization :
```c
int l = strlen(s);
switch (l && 0x3) {
for (; l > 0; l -= 4) {
case 0: /* do something */
case 3: /* do something */
case 2: /* do something */
case 1: /* do something */
}
}
```
Even though the implementation above would make use of the 32-bit wide bus, there would be repeating code in `/* do something */` and `/* do something */` is called multiple times.
It is wise to explicitly cast the character to integer.
```c
int l = 0;
uint32_t v;
while((v - 0x01010101) & ~v & (0x80808080)) {
v = *(uint32_t *)s;
/* do something */
s += 4; // move pointer
}
if (((v) & 0xFF) == '\0') /* do something on s */;
if (((v >> 8) & 0xFF) == '\0') /* do something on s, s+1 */;
if (((v >> 16) & 0xFF) == '\0') /* do something on s, s+1, s+2 */;
```
Even though both of the implementations are almost identical to each other, with the latter one being more efficient in theory ( less `/* do something */` call ).
In reality, even though the latter program sometimes outperforms the previous implenentation, often more than not it suffers from severe performance degration.
Upon further investigation, I found out that this is caused operation on unaligned memory blocks.

As shown as figure, `SP` means the 4 bytes accessed in previous iteration, `SC` means 4 bytes accessed in current iteration, and `SN` means 4 bytes accessed in next iteration.
The program would have performance issue as `s += 4` sometimes makes the program access memory accross multiple blocks.
This is evident when the following code is added :
```c
while (&s != ROUND_DOWN_TO_ALIGNMENT_SIZE(&s)) {
/* do something */
++s;
}
```
## 測驗10
```c
#define DIVIDE_ROUND_CLOSEST(x, divisor) \
({ \
typeof(x) __x = x; \
typeof(divisor) __d = divisor; \
(((typeof(x)) -1) > 0 || ((typeof(divisor)) -1) > 0 || \
(((__x) > 0) == ((__d) > 0))) \
? ((RRR) / (__d)) \
: ((SSS) / (__d)); \
})
```
`(typeof(x)) -1) > 0 || ((typeof(divisor)) -1) > 0` clearly indicates that both types are unsigned. Combined with `(__x) > 0) == ((__d) > 0)` we can deduce the following :
`((RRR) / (__d))` if both `x` and `divisor` are positive,
`((SSS) / (__d))` if `x` is negative, ( `divisor` negtive undefined from description)
Take `3` as divisor :
|${\ \ x\ \ }$|$f(x,3)$|
|:-:|:-:|
|1|0|
|2|1|
|3|<font color="#F00">**1**</font>|
|4|1|
|5|2|
|6|<font color="#F00">**2**</font>|
|7|2|
|8|3|
|9|<font color="#F00">**3**</font>|
round nearest would mean $dvd/dvs$ is closer to ${\ n*dvs\ }$ or ${\ (n+1)*dvs\ }$
i.e. `5` is closer to `6` or `3`, `5` is at upper portion or lower portion of `4.5` (`(3+6)/2`)
As C truncate the result of division, this could be found by adding `dvd` by `dvs / 2`. Hence,
`RRR` : `(__x) + (__dvs >> 1)`
When dividend is negative, the magnitude of number would be increased by subtracting. Hence,
`SSS` : `(__x) - (__dvs >> 1)`
## 測驗11
`fls` would do find last set (zero-indexed) as the maximum value is 63
It will match bit-mask to check from first half (similar to test7, but opposite direction)
```c
unsigned long i_sqrt(unsigned long x)
{
unsigned long b, m, y = 0;
if (x <= 1)
return x;
m = 1UL << (fls(x) & ~1UL);
while (m) {
b = y + m;
XXX;
if (x >= b) {
YYY;
y += m;
}
ZZZ;
}
return y;
}
```
`i_sqrt` is equilvalent to `floor(sqrt(x))`
```c
m = 1UL << (fls(x) & ~1UL)
```
It is not obvious what this statement does so I design a simple experiment to see what this statement would do.
```c
for (unsigned int i = 1; i ;i <<= 1)
printf("%10u %10lu\n", i, 1UL << (fls(i) & ~1UL));
```
output :
```c
1 1
2 1
4 4
8 4
16 16
32 16
64 64
128 64
256 256
512 256
1024 1024
2048 1024
4096 4096
8192 4096
16384 16384
32768 16384
65536 65536
131072 65536
262144 262144
524288 262144
1048576 1048576
2097152 1048576
4194304 4194304
8388608 4194304
16777216 16777216
33554432 16777216
67108864 67108864
134217728 67108864
268435456 268435456
536870912 268435456
1073741824 1073741824
2147483648 1073741824
```
We can see that the result is a sequence of $2^{2n}$, repeating twice every time.
`(fls(x) & ~1UL)` is clear last bit of fls(x), which is the reason why every two of the output will yield the same result, as the sequence of `fls` would go in odd-even pair.
The sequence for $2^{2n}$ is chosen as we can easily find its square root as shown : $\sqrt{2^{2n}} = 2^{\frac{2n}{2}} = 2^n$
```c
m = 1UL << (fls(x) & ~1UL);
while (m) {
b = y + m;
XXX;
if (x >= b) {
YYY;
y += m;
}
ZZZ;
}
```
The condition of the while loop is `m`, and `m` is not modified originally.
Hence, if the loop is following good coding practice, `m` should be modified in `ZZZ`
`m` must be one possible output of `1UL << (fls(x) & ~1UL)` to maintain loop invariant.
`m` would initially be upper bound of `sqrt(x)`, as shown :
```c
printf("%-16s %-16s\n", "n", "n^2");
for (unsigned int i = 0;i < 17;++i) {
print_bin16(i);
print_bin16(i * i);
printf("%10lu\n", 1UL << (fls(i * i) & ~1UL));
}
```
Output :
```c
n n^2
0000000000000000 0000000000000000 1
0000000000000001 0000000000000001 1
0000000000000010 0000000000000100 4
0000000000000011 0000000000001001 4
0000000000000100 0000000000010000 16
0000000000000101 0000000000011001 16
0000000000000110 0000000000100100 16
0000000000000111 0000000000110001 16
0000000000001000 0000000001000000 64
0000000000001001 0000000001010001 64
0000000000001010 0000000001100100 64
0000000000001011 0000000001111001 64
0000000000001100 0000000010010000 64
0000000000001101 0000000010101001 64
0000000000001110 0000000011000100 64
0000000000001111 0000000011100001 64
0000000000010000 0000000100000000 256
```
Observation :
```c
n n^2
0000000000000011 0000000000001001 4
0000000000000100 0000000000010000 16
/* ... */
0000000000000111 0000000000110001 16
0000000000001000 0000000001000000 64
```
every time `n` needs one more bit to represent the number, `n^2` would need two.
Resulting in `fls(2^n) & ~1UL`
By every loop `m` would be divided by $2^2$, hence,
`ZZZ` : `m >>= 2`
How the algorithm works :
Reference : [二進位系統的平方根](https://hackmd.io/@sysprog/sqrt#%E4%BA%8C%E9%80%B2%E4%BD%8D%E7%B3%BB%E7%B5%B1%E7%9A%84%E5%B9%B3%E6%96%B9%E6%A0%B9), [How to calculate a square root without a calculator](https://www.homeschoolmath.net/teaching/square-root-algorithm.php), [Why the square root algorithm works](https://www.homeschoolmath.net/teaching/sqr-algorithm-why-works.php)