# 2019q3 Homework1 (review)
contributed by < `yaohsiaopid` >
## 測驗 `1`
考慮以下 C 程式的 `align4` 巨集的作用是,針對 4-byte alignment,找到給定地址的 round up alignment address。
```cpp
#include <stdio.h>
#define align4(x) (((x) + K) & (-4))
int main(void) {
int p = 0x1997;
printf("align4(p) is %08x\n", align4(p));
return 0;
}
```
預期程式輸出 `align4(p) is 00001998`
`K` = `(3)`
### Discussion
#### Why it works
For `sizeof(int) = 4`, The macro will become `(sizeof(n) + 3) & ~3`. Then we can prove the macro works by working out the following scenario:
| sizeof(n) | intermediate steps | result | comments |
| :--------: | :--------: | :--------: | :--------: |
| `4k` | `(4k+3)& 1..11100` | `4k` |
| `4k+1` | `(4k+4)& 1..11100` | `4(k+1)` |
| `4k+2` | `(4k+4+1) & 1..11100`| `4(k+1)` | the last two digits are `01`(from 5) and `01 & 00 = 0` |
| `4k+3` | `(4k+6) & 1..11100` | `4(k+1)` | likewise last two digits are `10 & 00 = 0`
#### 延伸題目: 在 Linux 核心原始程式碼找出類似的 alignment 巨集並解釋其作用
* `include/linux/kernel.h` line 33:
``` c
/* @a is a power of 2 value */
#define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
```
* `include/uapi/linux/kernel.h` line 10:
``` c
#define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
```
**last line matches almost perfectly with the problem with mask = 3 !!**
* Where `ALIGN` is used ?
* `grep -rnw '.' -e 'ALIGN'` yields 2443 results.
* Of these results, let's take a look at `./mm/page_alloc.c`
* `./mm/page_alloc.c` - Linux page allocator
* `./include/linux/mm.h:216`: The macro is to "align the pointer to the (next) page boundary"
```c
#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
```
* an example involved `ALIGN`:
`alloc_pages_exact > make_alloc_exact > PAGE_ALIGN > ALIGN`
* `void *alloc_pages_exact(size_t size, gfp_t gfp_mask)`:
```c
void *alloc_pages_exact(size_t size, gfp_t gfp_mask)
{
unsigned int order = get_order(size);
unsigned long addr;
if (WARN_ON_ONCE(gfp_mask & __GFP_COMP))
gfp_mask &= ~__GFP_COMP;
addr = __get_free_pages(gfp_mask, order);
return make_alloc_exact(addr, order, size);
}
```
* Goal: allocate an exact number physically-contiguous pages.
* `size`: **number of bytes to allocate**
* different from `alloc_pages()`, which can only allocate memory in **power-of-two pages**
* how this function works:
1. Firstly it asks for power-of-two pages by `__get_free_pages(gfp_mask, order)` where `order = get_order(size)`
At this point, it obtains $2^{order}$ pages where $2^{order}\times \text{PAGE_SIZE} > size$
2. Free up the extra pages it asked in previous step with `make_alloc_exact`:
```c
static void *make_alloc_exact(unsigned long addr, unsigned int order,
size_t size)
{
if (addr) {
unsigned long alloc_end = addr + (PAGE_SIZE << order);
unsigned long used = addr + PAGE_ALIGN(size);
split_page(virt_to_page((void *)addr), order);
while (used < alloc_end) {
free_page(used);
used += PAGE_SIZE;
}
}
return (void *)addr;
}
```
* Since the allocator only distributes complete page, `PAGE_ALIGN(size)` is needed above to find out the size of pages that is able to accomodate `size` number of bytes. In return we get `used`, which is page-aligned.
* In the `while` loop, it then sequentially frees the extra pages one by one.
* Question/To do on this function:
* `split_page` not yet investigated
* A possible worst-case can be that size = `PAGE_SIZE * (2^x + 1)`, then `order = (x+1)` and `while` will unfortunately runs for `2^(x+1) -(2^x+1) = 2^x + 1` times. Not so sure if this O(size) is miserable, but one thing certain is that allocating 'exact' number of bytes can quite troublesome.
* Reference
* [LinuxMM: Page Allocation](https://linux-mm.org/PageAllocation)
* [Physical Page Allocation](https://www.kernel.org/doc/gorman/html/understand/understand009.html)
* https://ufal.mff.cuni.cz/~jernej/2018/docs/predavanja07.pdf
:::danger
So, what is the benefit? Can you expriment?
:notes: jserv
:::
---
## 測驗 `2`
考慮以下 C 程式可取得以 `sizeof(int)` 為 alignment 基礎的整數空間,也就是說,當 `sizeof(int)` 為 `4` 的時候:
* 給定 `char arr[13];`,`INT_SIZE_OF(arr)` 要得到 `16`;
* 給定 `short int s;`,`INT_SIZE_OF(s)` 要得到 `4`
```cpp
#include <stdint.h>
#define INT_SIZE_OF(n) \
((sizeof(n) + sizeof(int) + X) & ~(sizeof(int) + Y))
```
* 參考資訊: [Memory access and alignment](https://lwn.net/Articles/260832/)
`X` = `(-1)`
`Y` = `(-1)`
### 延伸題目: 解釋運作原理並在 GitHub 找出實際應用的程式碼 - device tree compiler
* `dtc/dtc.c:59`:
```c
#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))
```
* Why: To generate formatted .dtb file
* In the process of generating a dtb file, Device Tree Compiler basically aims to write a linked list of structs into a file
* I think in order to make the dtb formatted in memory, dtc thus uses `ALIGN(x,a)` for different properties of different struct
* Such as when it prepares blob containing fdt header to be written as file, it will make it aligned by 8 before adding more information
```c
blob = data_append_data(blob, &fdt, vi->hdr_size);
blob = data_append_align(blob, 8);
```
* usage:
* `data_append_align()` from `dtc/data.c`:
```c
int newlen = ALIGN(d.len, align);
return data_append_zeroes(d, newlen - d.len);
```
* `make_fdt_header()` from `dtc/flattree.c`:
```c
/* Reserve map should be doubleword aligned */
reserve_off = ALIGN(vi->hdr_size, 8);
```
* `dt_to_blob()` from `dtc/flattree.c`:
```c
if (alignsize > 0)
padlen = ALIGN(fdt32_to_cpu(fdt.totalsize) + padlen, alignsize)
- fdt32_to_cpu(fdt.totalsize);
```
* Reference:
* [Device Tree(四):文件结构解析](http://www.wowotech.net/device_model/dt-code-file-struct-parse.html)
* [Device Tree Usage](https://elinux.org/Device_Tree_Usage)
---
## 課程說明 ppt slide 10
```c
/* IEEE 754 */
#include <stdio.h>
int main() {
float sum = 0.0f, corr = 0.0f;/* corrective value for rounding error */
printf("1+..+t\tsum \tcorrection\tgold\n");
for (int i = 0; i < 10000 ; i++) {
float y = (i + 1) - corr; /* add the correction to specific item */
float t = sum + y; /* bits might be lost */
corr = (t - sum) - y; /* recover lost bits */
sum = t;
if(i > 9990 ) {
printf("%d\t%f\t%f\t%d\n", i+1, sum, corr, (i+2)*(i+1)/2);
}
}
printf("Sum: %f\n", sum);
return 0;
}
```
* Original code in slide 9 does not work because there will be [round-off error](https://en.wikipedia.org/wiki/Floating-point_arithmetic#Floating-point_arithmetic_operations) under the IEEE 754 representation of floating point.
* The code in slide 10 works for 10000.
* Out of curiosity I plot out the `corr`

* Although the above code superficially fixes the problem for the summation to **10000** with error correction, the problem is still there:
* The above C output:
```
1+..+t sum correction gold
9992 49925028.000000 0.000000 49925028
9993 49935020.000000 -1.000000 49935021
9994 49945016.000000 1.000000 49945015
9995 49955008.000000 -2.000000 49955010
9996 49965008.000000 2.000000 49965006
9997 49975004.000000 1.000000 49975003
9998 49985000.000000 -1.000000 49985001
9999 49995000.000000 0.000000 49995000
10000 50005000.000000 0.000000 50005000
Sum: 50005000.000000
```
* it does not work for other numbers, but I think this is because of the resolution of the fraction part of IEEE 754.
* why does it works ?
```
assume corr = 0, then y = i+1
sum 1.f1 x 2^e1
i+1 +) 1.f2 x 2^e2 , assume: e1 >> e2
----------------------------------------------------------------
sum 1.f1 x 2^e1
i+1 +) 1.f2 x 2^(e2-e1) x 2^e1 , 1.f2 = 1.f21 + f22 and f22 will be lost after shifting
----------------------------------------------------------------
t (1.f1 + 1.f2x2^(e2-e1)) x 2^e1
t-sum = 1.f2 x 2^(e2-e1)) x 2^e1 = 1.f2 x 2^e2 ---(5)
t-sum-y = f22 x 2^e2 , the lost bits !
```
* Not yet find a universal version ....
:::danger
Check [Pairwise summation](https://en.wikipedia.org/wiki/Pairwise_summation)
:notes: jserv
:::
* Reference:
* [IEEE 754](https://en.wikipedia.org/wiki/Floating-point_arithmetic#IEEE_754:_floating_point_in_modern_computers)