# 2019q3 quiz7
contributed by < `kksweet8845` >

## Quiz 1

### The principle of bitwise.c

```cpp
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MSG_LEN 8

static inline bool is_divisible(uint32_t n, uint64_t M)
{
    return n * M <= M - 1;
}

static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;

int main(int argc, char **argv)
{
    for (size_t i = 1; i <= 100; i++) {
        uint8_t div3 = is_divisible(i, M3);
        uint8_t div5 = is_divisible(i, M5);
        unsigned int length = (2 << div3) << div5;

        char fmt[MSG_LEN + 1];
        /* KK1, KK2 and KK3 are the blanks of the quiz; the answer is derived below */
        strncpy(fmt, &"FizzBuzz%u"[(MSG_LEN >> KK1) >> (KK2 << KK3)], length);
        fmt[length] = '\0';

        printf(fmt, i);
        printf("\n");
    }
    return 0;
}
```

The main idea here is to avoid branches entirely. `Fizz` and `Buzz` are each four characters long: if the number is divisible by 3 we output `Fizz`, if it is divisible by 5 we output `Buzz`, and if it is divisible by both we output `FizzBuzz`. This means we only ever print the first four bytes, the four bytes starting at offset 4, or all eight bytes of the string `"FizzBuzz"` (or just the trailing `"%u"` when neither divisor applies). That is,

```cpp
unsigned int length = (2 << div3) << div5;
```

However, we also have to pick the right starting position, i.e. whether we print from the beginning or from the middle of the string. That is,

```cpp
&"FizzBuzz%u"[(MSG_LEN >> div5) >> (div3 << 2)]
```

- Only divisible by 3
    - div3 = 1, div5 = 0
    - length = 4
    - the start offset is (8 >> 0) >> (1 << 2) = 0
    - so we get `"Fizz"`
- Only divisible by 5
    - div3 = 0, div5 = 1
    - length = 4
    - the start offset is (8 >> 1) >> (0 << 2) = 4
    - so we get `"Buzz"`
- Divisible by both 3 and 5
    - div3 = 1, div5 = 1
    - length = 8
    - the start offset is (8 >> 1) >> (1 << 2) = 0
    - so we get `"FizzBuzz"`
- Divisible by neither
    - div3 = 0, div5 = 0
    - length = 2
    - the start offset is (8 >> 0) >> (0 << 2) = 8
    - so we get `"%u"`, which prints the number itself

### Analysis of performance

At first, I thought the branchless version of FizzBuzz would be faster than the naive one, so I measured both with `perf stat`:

```shell
perf stat --repeat 100 \
    -e cache-misses,cache-references,instructions,cycles \
    ./naive

 Performance counter stats for './naive' (100 runs):

        46,679      cache-misses        #  43.589 % of all cache refs   ( +- 0.91% )
       107,088      cache-references                                    ( +- 0.84% )
   429,029,517      instructions        #   3.02  insn per cycle        ( +- 0.00% )
   142,295,173      cycles                                              ( +- 0.74% )

      0.042437 +- 0.000351 seconds time elapsed  ( +- 0.83% )

perf stat --repeat 100 \
    -e cache-misses,cache-references,instructions,cycles \
    ./bitwise

 Performance counter stats for './bitwise' (100 runs):

        31,267      cache-misses        #  36.621 % of all cache refs   ( +- 1.62% )
        85,380      cache-references                                    ( +- 2.35% )
   621,245,521      instructions        #   2.88  insn per cycle        ( +- 0.05% )
   215,632,285      cycles                                              ( +- 0.52% )

      0.063427 +- 0.000412 seconds time elapsed  ( +- 0.65% )
```

However, the result is not what I expected: although the bitwise version has fewer cache misses, it executes far more instructions and cycles, so it ends up slower than the naive version.

## Quiz 3

According to the declaration of `union ele`, we can draw the following diagram to describe its layout:

```shell
  union ele
------------
+          +
+  Part 1  +   # eight-byte slot, offset 0
+          +
------------
+          +
+  Part 2  +   # eight-byte slot, offset 8
+          +
------------
```

This is because a pointer takes 8 bytes on a 64-bit system, and a `long` also takes 8 bytes, so each arm of the union consists of two eight-byte fields.

```cpp
void proc(union ele *ptr)
{
    ptr->AAA.x = *(ptr->BBB.CCC->DDD.p) - (ptr->EEE.FFF->GGG.HHH);
}
```

Let's take a look at the assembly code:

```shell
proc:
    movq    8(%rdi), %rdx
    movq    (%rdx), %rax
    movq    (%rax), %rax
    subq    8(%rdx), %rax
    movq    %rax, (%rdi)
    ret
```

- `%rdi` : the 1st argument, `ptr`
- `%rdx` : used here only as a temporary register
- `%rax` : normally the return-value register; here it just holds intermediate results
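Before walking through the instructions, it helps to have the type declarations in front of us; the quiz provides them, but they are not reproduced above. Below is a minimal sketch of what they presumably look like, reconstructed from the member names and the eight-byte layout discussed in the rest of this section (the original quiz may declare the structs anonymously inside the union):

```cpp
union ele; /* forward declaration so that struct e2 can point to it */

struct e1 {
    long *p;         /* first eight-byte slot, offset 0  */
    long y;          /* second eight-byte slot, offset 8 */
};

struct e2 {
    long x;          /* first eight-byte slot, offset 0  */
    union ele *next; /* second eight-byte slot, offset 8 */
};

union ele {
    struct e1 e1;    /* both structs overlay the same 16 bytes */
    struct e2 e2;
};
```

With this layout, offset 0 of a `union ele` holds either `e1.p` or `e2.x`, and offset 8 holds either `e1.y` or `e2.next`, which is exactly what the deduction below relies on.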
The first instruction loads the eight-byte word at offset 8 of the first argument `ptr` into `%rdx`, so we can infer that it is accessing `Part 2` of the `union ele`. The corresponding C code is

```cpp
ptr->BBB.CCC
```

However, at this point we do not know which struct `BBB` is. Since the expression `CCC->DDD` treats `CCC` as a pointer to `union ele`, and only `struct e2` contains such a pointer, we can conclude that

- `CCC` is `next`
- `BBB` is `e2`
- `DDD` is `e1`, because only `struct e1` has the member `p` referenced by `DDD.p`

So how does the code use `ptr->e2.next`? As the annotated listing below shows, the first instruction has already placed it in `%rdx`; the second instruction then loads the eight bytes at that address, i.e. the pointer `ptr->e2.next->e1.p`, into `%rax`.

```shell
proc:
    movq    8(%rdi), %rdx    # %rdx = ptr->e2.next
    movq    (%rdx), %rax     # %rax = ptr->e2.next->e1.p
    movq    (%rax), %rax
    subq    8(%rdx), %rax
    movq    %rax, (%rdi)
    ret
```

In the next line, `movq (%rax), %rax`, the value `Mem[%rax]` is loaded back into `%rax`, again at offset zero; this corresponds to dereferencing that pointer, i.e. `*(ptr->e2.next->e1.p)`.

```shell
              ptr                           another union ele (*next)
     ---------------------              ---------------------
     +  e2: long x       +              +  e1: long *p      +   offset 0
     ---------------------              ---------------------
     +  e2: union ele    +  --------->  +  e1: long y       +   offset 8
     +      *next        +              +                   +
     ---------------------              ---------------------
```

In the next instruction, we can see that a subtraction is performed on that value:

```shell
proc:
    movq    8(%rdi), %rdx
    movq    (%rdx), %rax
    movq    (%rax), %rax
    subq    8(%rdx), %rax    # %rax = %rax - Mem[%rdx + 8]
    movq    %rax, (%rdi)
    ret
```

The interesting part is `Mem[%rdx + 8]`. As noted above, `%rdx` holds `ptr->e2.next`, i.e. the address of another `union ele`, so `8(%rdx)` reads the second eight-byte slot of that union, with no further indirection afterwards. Accordingly, unlike the first operand, this sub-expression carries no leading dereference:

```cpp
(ptr->EEE.FFF->GGG.HHH)
```

`EEE.FFF` must again be `e2.next`, and `GGG.HHH` is whatever lives at offset 8 of the union that `next` points to. Since `x` sits at offset 0 and the offset-8 slot of `struct e2` is occupied by `next` itself, the only member that fits a subtraction between `long` values is `y` of `struct e1`:

- `EEE` is `e2`
- `FFF` is `next`
- `GGG` is `e1`
- `HHH` is `y`

Finally, `movq %rax, (%rdi)` stores the result at offset 0 of `ptr`, so `AAA` is `e2`, whose member `x` lives at offset 0.

## `.cfi_startproc` & `.cfi_endproc`

These directives emit Call Frame Information (CFI), which debuggers and exception handlers use to unwind the stack and reconstruct the call frame of each function.
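To see this stack unwinding from the C side, here is a small illustrative sketch (not part of the quiz) using glibc's `backtrace()` and `backtrace_symbols()`; on typical x86-64 Linux toolchains the unwinder walks the `.eh_frame` data generated from these CFI directives.

```cpp
#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

static void inner(void)
{
    void *frames[16];
    /* Collect the return addresses of the currently active call frames. */
    int n = backtrace(frames, 16);
    /* Translate the addresses into human-readable strings. */
    char **names = backtrace_symbols(frames, n);
    if (!names)
        return;
    for (int i = 0; i < n; i++)
        printf("%s\n", names[i]);
    free(names);
}

static void outer(void)
{
    inner();
}

int main(void)
{
    outer();
    return 0;
}
```

Compiling with `gcc -g -rdynamic` lets `backtrace_symbols()` resolve the function names, and the output lists the frames from `inner` back through `outer` and `main`, recovered by unwinding the stack.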