2020q3 Homework2 (quiz2)

# 2020q3 Homework2 (quiz2) contributed by < `guaneec` > ###### tags: `sysprog2020` Quick links: * [Problem description](https://hackmd.io/@sysprog/2020-quiz2) The following solutions assumes a little endian system. ## Q1 - ASCII test, 8 bytes at a time :::danger Hint: always use `cpp` for syntax highlighting. Don't use capitial `C`. :notes: jserv ::: ```cpp #include <stdbool.h> #include <stddef.h> #include <stdint.h> #include <string.h> bool is_ascii(const char str[], size_t size) { if (size == 0) return false; int i = 0; while ((i + 8) <= size) { uint64_t payload; memcpy(&payload, str + i, 8); if (payload & 0x8080808080808080) return false; i += 8; } while (i < size) { if (str[i] & 0x80) return false; i++; } return true; } ``` To test one character, test the 7th bit: ``` 0xc1 & 0x80 ``` To test 8 characters packed, test the 7th, 15th, 23th, ..., 55th, 63th bits: ``` 0xc1c2c3c4c5c6c7c8 & 0x8080808080808080 ``` ### Q1.1 - Why `memcpy` ? Is the copy necessary? Can we do the operation on the input data directly? I wrote a snippet to test this: ```cpp #include <stdio.h> #include <stdlib.h> #include <string.h> int main() { const size_t n = 8; const size_t m = 4; char a[n * m]; for (size_t i = 0; i < n * m; ++i) a[i] = i; for (size_t i = 0; i < m; ++i) { char *p = a + i * n; u_int64_t b; memcpy(&b, p, n); u_int64_t *c = (u_int64_t *)p; printf("%016lx\n%016lx\n\n", b, *c); } } ``` This outputs: ``` 0706050403020100 0706050403020100 0f0e0d0c0b0a0908 0f0e0d0c0b0a0908 1716151413121110 1716151413121110 1f1e1d1c1b1a1918 1f1e1d1c1b1a1918 ``` It seems to work. However, I have not tested this with weird alignments. :::warning You can use qemu user emulation with Arm V5 ISA, which lacks of unaligned memory access. :notes: jserv ::: #### Testing on an armv5 vm I created a vm with qemu ``` root@debian-armel:~# uname -a Linux debian-armel 3.2.0-4-versatile #1 Debian 3.2.51-1 armv5tejl GNU/Linux ``` Tested programs ```cpp // aligned.c #include <stdio.h> #include <stdint.h> int main() { uint64_t a = 0x0102030405060708; printf("%x\n", *(uint32_t *)((void *)&a)); } ``` ```shell root@debian-armel:~# ./aligned.out 5060708 ``` ```cpp // unaligned.c #include <stdio.h> #include <stdint.h> int main() { uint64_t a = 0x0102030405060708; printf("%x\n", *(uint32_t *)((void *)&a + 1)); } ``` ```shell root@debian-armel:~# ./unaligned.out 4050607 ``` :::warning :confused: No error? ::: :::info References * [Unaligned Memory Accesses (kernel.org)](https://www.kernel.org/doc/Documentation/unaligned-memory-access.txt) * [Unaligned accesses in C/C++: what, why and solutions to do it properly](https://blog.quarkslab.com/unaligned-accesses-in-cc-what-why-and-solutions-to-do-it-properly.html) * [Debian Squeeze and Wheezy armel images for QEMU ](https://people.debian.org/~aurel32/qemu/armel/) ::: ### Q1.2 - `hasalpha` The alphabets in binary are: ``` 01000001 A ... 01011010 Z 01100001 a ... 01111010 z ``` There's no simple mask that tests these patterns. Here we will use the technique found in Q5 ```cpp #include <ctype.h> #include <stdbool.h> #include <stdint.h> #include <stdio.h> #include <string.h> #define PACKED_BYTE(b) (((uint64_t)(b) & (0xff)) * 0x0101010101010101u) // simply check if in range int isalpha1(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); } // convert to uppercase before comparing int isalpha2(char c) { c = c & ~('a' ^ 'A'); return c >= 'A' && c <= 'Z'; } // use bitwise operations instead of comparisons int isalpha3(char c) { unsigned char u = c & ~('a' ^ 'A'); // is ascii >= 'A' < 'Z' + 1 return ~(u & 0x80) & (0x80 + u - 'A') & ~(0x80 + u - 'Z' - 1) & 0x80; } // spread isalpha3 int hasalpha_u64(uint64_t chars) { uint64_t u = chars & PACKED_BYTE(~('a' ^ 'A')); uint64_t p80 = PACKED_BYTE(0x80); return ((~(u & p80)) & (p80 + u - PACKED_BYTE('A')) & (~(p80 + u - PACKED_BYTE('Z' + 1))) & p80) > 0; } bool hasalpha(const char str[], size_t size) { if (size == 0) return false; int i = 0; while ((i + 8) <= size) { uint64_t payload; memcpy(&payload, str + i, 8); if (hasalpha_u64(payload)) return true; i += 8; } while (i < size) { if (isalpha3(str[i])) return true; i++; } return false; } int main() { char *ss[] = {"12345678", " ", "A ", " c ", " X ", " q ", " X ", " X ", " X ", " z", " a d ", " @ ", " [ ", "abcdh", "123", "1234567878987a8"}; const size_t n = sizeof(ss) / sizeof(ss[0]); for (int i = 0; i < n; ++i) { printf("%s %d\n", ss[i], hasalpha(ss[i], strlen(ss[i]))); } } ``` ### Q1.3 - Validating strings If the goal is to check if a string consists of printable characters only, then implementation would be very similar to `hasalpha`, only that the ranges are changed. ## Q2 - Hex char to value ### Q2.1 - Explanation ```cpp uint8_t hexchar2val(uint8_t in) { const uint8_t letter = in & 0x40; const uint8_t shift = (letter >> 3) | (letter >> 6); return (in + shift) & 0xf; } ``` Observe the 4 smallest bits of the input: | `(char) in` | `in & 0xf` | -------- | -------- | |0|0 |1, A, a|1 |2, B, b|2 |3, C, c|3 |4, D, d|4 |5, E, e|5 |6, F, f|6 |7|7 |8|8 |9|9 For `0-9`, these bits represents the actual value, whereas for `a-f, A-F` there's a offset of 9. Therefore all we need to do is check if the input is a letter and add 9 if that's the case. The expression `in & 0x40` evaluates to `0x40` if `in` is a letter, `0` otherwise. To convert `0x40` to 9, we can use that fact that ``` 9 = 0b1001 = 1 + (1 << 3) = (0x40 >> 6) + (0x40 >> 6 << 3) = (0x40 >> 6) + (0x40 >> 3) ``` The last addition can be optimized into an OR operation. ### Q2.2 - Evaluating hex strings The implementation is simple, given `hexchar2val`: ```C uint64_t hexstr2val(char *s) { uint64_t val = 0u; // skip "0x" for (size_t i = 2; i < strlen(s); ++i) { val = 16 * val + hexchar2val(s[i]); } return val; } ``` ## Q3 - Fast mod ### Q3.1 - Explanation ```C const uint32_t D = 3; #define M ((uint64_t)((0ULL-1) / (D) + 1)) /* compute (n mod d) given precomputed M */ uint32_t quickmod(uint32_t n) { uint64_t quotient = ((__uint128_t) M * n) >> 64; return n - quotient * D; } ``` See the paper [Division by Invariant Integers using Multiplication (1991)](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwiW-Za4hPXrAhUGq5QKHb03DToQFjAAegQIAxAB&url=https%3A%2F%2Fgmplib.org%2F~tege%2Fdivcnst-pldi94.pdf&usg=AOvVaw1P04lLbX2_sBwfWscngyOp). The `+ 1` in the calculation is crucial to meeting the conditions in Theorem 4.2: $$ m = \left\lfloor \frac{2^{64} - 1}d \right\rfloor+1 \\ 2^{64} \le d\left\lfloor \frac{2^{64} - 1}d \right\rfloor + d \le 2^{64}+ 2^{32} $$ Full derivation for our case: $$ \begin{aligned} 2^{64} - 1 &\equiv a &\mod d \\ n+1&\equiv -b &\mod d \end{aligned}\\ \begin{aligned} \frac{m\cdot n}{2^{64}} =\frac nd+\frac{n(d-1-a)}{2^{64} d} \end{aligned}\\ \begin{aligned} 0 \le \frac{n(d-1-a)}{2^{64}d} < \frac{b+1}d \implies \left\lfloor \frac{m\cdot n}{2^{64}} \right\rfloor = \left\lfloor \frac nd \right\rfloor \end{aligned}\\ $$ This does not work for `D = 1` since `M` overflows to 0. ### Q3.2 - Fast division in jemalloc The sizes of the parameters in jemalloc are different to those seen above: | size (bits) | n | d | m | | -------- | -------- | -------- |-|-| | quickmod | 32 | 32 | 64 | jemalloc | 32 | 32 or 64 (`size_t`) | 32 The calculation of m is also different: $$ m = \left\lceil \frac{2^{32}}{d} \right\rceil $$ Note that this does not satisfy the condition in Theorem 4.2 (see Q3.1). But it need not, since this implementation is only meant for divisions known to have no remainder. A description of why this works in this special case can be found in the [source](https://github.com/jemalloc/jemalloc/blob/259c5e3e8f4731f2e32ceac71c66f4bc7d078145/src/div.c#L7). #### Performance comparison I ran dudect on `quickdiv` and jemalloc's `div_compute` with input values pre-computed. ```cpp= int c = !classes[i]; uint64_t m0 = *(uint64_t *) (input_data + i * chunk_size); uint32_t m1 = *(uint32_t *) (input_data + i * chunk_size + 8); uint32_t n = *(uint32_t *) (input_data + i * chunk_size + 12); uint32_t d = *(uint32_t *) (input_data + i * chunk_size + 16); uint32_t qq = *(uint32_t *) (input_data + i * chunk_size + 20); uint32_t q = 0; if (c) { before_ticks[i] = cpucycles(); q = ((__uint128_t) m0 * n) >> 64; after_ticks[i] = cpucycles(); } else { before_ticks[i] = cpucycles(); q = ((uint64_t) m1 * (uint64_t) n) >> 32; after_ticks[i] = cpucycles(); } assert(q == qq); assert(n / d == qq); ``` ``` meas: 10.00 M, max t [1]: +11.98, max tau: 3.79e-03, (5/tau)^2: 1.74e+06, mu0: 2.17e+01, mu1: 2.17e+01, dmu: -2.60e-02, s0: 3.44e+00, s1: 3.44e+00, m20: 5.91e+07, m21: 5.87e+07. ERROR: Probably not constant time ``` I also fiddled around with line 2, 11, 15 to see what differences it will make. ```cpp= int c = !classes[i]; uint64_t m0 = *(uint64_t *) (input_data + i * chunk_size); uint32_t m1 = *(uint32_t *) (input_data + i * chunk_size + 8); uint32_t n = *(uint32_t *) (input_data + i * chunk_size + 12); uint32_t d = *(uint32_t *) (input_data + i * chunk_size + 16); uint32_t qq = *(uint32_t *) (input_data + i * chunk_size + 20); uint32_t q = 0; if (c) { before_ticks[i] = cpucycles(); q = ((uint64_t) m1 * (uint64_t) n) >> 32; after_ticks[i] = cpucycles(); } else { before_ticks[i] = cpucycles(); q = ((__uint128_t) m0 * n) >> 64; after_ticks[i] = cpucycles(); } ``` ``` $ ./qtest -f i.txt cmd> # option old 1 cmd> # option write 1 cmd> option measures 100000000 cmd> option simulation 1 cmd> size Testing size... meas: 99.99 M, max t [63]: +11.88, max tau: 1.19e-03, (5/tau)^2: 1.77e+07, mu0: 2.17e+01, mu1: 2.17e+01, dmu: 8.89e-03, s0: 3.74e+00, s1: 3.74e+00, m20: 6.99e+08, m21: 7.00e+08. ERROR: Probably not constant time ``` ```cpp= int c = classes[i]; uint64_t m0 = *(uint64_t *) (input_data + i * chunk_size); uint32_t m1 = *(uint32_t *) (input_data + i * chunk_size + 8); uint32_t n = *(uint32_t *) (input_data + i * chunk_size + 12); uint32_t d = *(uint32_t *) (input_data + i * chunk_size + 16); uint32_t qq = *(uint32_t *) (input_data + i * chunk_size + 20); uint32_t q = 0; if (c) { before_ticks[i] = cpucycles(); q = ((__uint128_t) m0 * n) >> 64; after_ticks[i] = cpucycles(); } else { before_ticks[i] = cpucycles(); q = ((uint64_t) m1 * (uint64_t) n) >> 32; after_ticks[i] = cpucycles(); } assert(q == qq); assert(n / d == qq); ``` ``` $ ./qtest -f i.txt cmd> # option old 1 cmd> # option write 1 cmd> option measures 10000000 cmd> option simulation 1 cmd> size Testing size... meas: 9.91 M, max t [1]: +14.06, max tau: 4.47e-03, (5/tau)^2: 1.25e+06, mu0: 2.15e+01, mu1: 2.15e+01, dmu: 2.72e-02, s0: 3.04e+00, s1: 3.04e+00, m20: 4.57e+07, m21: 4.62e+07. ERROR: Probably not constant time ``` ```cpp= int c = classes[i]; uint64_t m0 = *(uint64_t *) (input_data + i * chunk_size); uint32_t m1 = *(uint32_t *) (input_data + i * chunk_size + 8); uint32_t n = *(uint32_t *) (input_data + i * chunk_size + 12); uint32_t d = *(uint32_t *) (input_data + i * chunk_size + 16); uint32_t qq = *(uint32_t *) (input_data + i * chunk_size + 20); uint32_t q = 0; if (c) { before_ticks[i] = cpucycles(); q = ((uint64_t) m1 * (uint64_t) n) >> 32; after_ticks[i] = cpucycles(); } else { before_ticks[i] = cpucycles(); q = ((__uint128_t) m0 * n) >> 64; after_ticks[i] = cpucycles(); } ``` ``` $ ./qtest -f i.txt cmd> # option old 1 cmd> # option write 1 cmd> option measures 10000000 cmd> option simulation 1 cmd> size Testing size... meas: 4.29 M, max t [6]: +24.63, max tau: 1.19e-02, (5/tau)^2: 1.77e+05, mu0: 1.80e+01, mu1: 1.80e+01, dmu: -6.94e-03, s0: 3.09e-01, s1: 3.09e-01, m20: 2.05e+05, m21: 1.61e+05. ERROR: Probably not constant time ``` A difference is definitely there, but our tool is also sensitive to some hidden factors other than the program logic. The difference of mean is at about the order of 0.01 cycles with jemalloc being slightly faster. [Tested on this commit](https://github.com/guaneec/sysprog-2020q3-lab0-c/commit/ec883087ed3d1a6972f05af5e924a3b5b0b93a18) #### Generated assembly ``` # dudect/constant.c:134: q = ((uint64_t) m1 * (uint64_t) n) >> 32; .loc 1 134 0 movl %ecx, %ecx # m1, m1 movl %r8d, %eax # n, n imulq %rax, %rcx # n, tmp177 shrq $32, %rcx #, q ``` ``` # dudect/constant.c:138: q = ((__uint128_t) m0 * n) >> 64; .loc 1 138 0 movl %r8d, %eax # n, n mulq %rcx # m0 movq %rdx, %rcx #, tmp198 ``` Right shift 64 is not needed because the result of the multiplication is split into to two 64 bit registers. ## Q4 - Divisibility ``` bool divisible(uint32_t n) { return n * M <= M - 1; } ``` From [Faster Remainder by Direct Computation (2019)](https://arxiv.org/pdf/1902.01961.pdf): > [W]e can still quickly determine whether a number is divisible [...] by checking whether the fractional portion is less than the approximate reciprocal $$ n = qd+r\\ \begin{aligned} m n \equiv (qd+r)m &\equiv q(2^{64}-1-a+d)+rm &\mod 2^{64} \\ &\equiv q(d-1-a)+rm &\mod 2^{64} \\ \end{aligned}\\ q(d-1-a)<2^{32} < m \\ rm \le mn \mathrm{\;mod\:} 2^{64}< (r+1) m $$ ## Q5 - Vectorized `strlower` ### Q5.1 - Explanation ```C= #include <ctype.h> #include <stddef.h> #include <stdint.h> #define PACKED_BYTE(b) (((uint64_t)(b) & (0xff)) * 0x0101010101010101u) /* vectorized implementation for in-place strlower */ void strlower_vector(char *s, size_t n) { size_t k = n / 8; for (size_t i = 0; i < k; i++, s += 8) { uint64_t *chunk = (uint64_t *) s; if ((*chunk & PACKED_BYTE(VV1)) == 0) { /* is ASCII? */ uint64_t A = *chunk + PACKED_BYTE(128 - 'A' + VV2); uint64_t Z = *chunk + PACKED_BYTE(128 - 'Z' + VV3); uint64_t mask = ((A ^ Z) & PACKED_BYTE(VV4)) >> 2; *chunk ^= mask; } else strlower(s, 8); } k = n % 8; if (k) strlower(s, k); } ``` The main idea is to achieve comparison with bitwise operators. For each character `c`, if `c` is ASCII then `0 <= c < 0x80`. `(0x80 - 'A' + c >= 0x80) == (c >= 'A')`. This operation does not overflow. `(0x80 & (0x80 - 'A' + c)) == (c >= 'A' ? 0x80 : 0)` `(0x80 & (0x80 - ('Z' + 1) + c)) == (c > 'Z' ? 0x80 : 0)` ## Q6 - Single Number II ```C int singleNumber(int *nums, int numsSize) { int lower = 0, higher = 0; for (int i = 0; i < numsSize; i++) { lower ^= nums[i]; lower &= ~higher; higher ^= nums[i]; higher &= ~lower; } return lower; } ``` ### Q6.1 - Explanation It isn't immediately obvious what this piece of code does. Since in the loop body, all bits are independent, let's draw a state diagram for a single bit. ```graphviz digraph states { rankdir=LR; { node [shape=circle style=invis] Start } node [shape = circle]; Start -> S00 S00 -> S00 [ label = "0" ]; S00 -> S01 [ label = "1" ]; S10 -> S10 [ label = "0" ]; S10 -> S00 [ label = "1" ]; S01 -> S01 [ label = "0" ]; S01 -> S10 [ label = "1" ]; S11 -> S10 [ label = "0" ]; S11 -> S00 [ label = "1" ]; } ``` This looks like a binary counter with a max value of 2. ### Q6.2 - An alternative implementation ```python from collections import Counter class Solution: def singleNumber(self, nums: List[int]) -> int: c = Counter(nums) for k, v in c.items(): if v == 1: return k ``` ![](https://i.imgur.com/0CKswKZ.png) (sorry) ```cpp int cmp(const int *x, const int *y) { return (*x > *y) - (*x < *y); } int singleNumber(int* nums, int numsSize){ qsort(nums, numsSize, sizeof(int), cmp); for (int i = 0;; i += 3) { if (i + 2 >= numsSize || nums[i] != nums[i + 2]) return nums[i]; } } ``` ![](https://i.imgur.com/QaxyTNo.png) ### Q6.3 - Generalizing to n repetitions Here's a binary counter with 4 states: ```python def f4(h, l, i): l ^= i h ^= (i & ~l) return h, l ``` ```graphviz digraph states { rankdir=LR; { node [shape=circle style=invis] Start } node [shape = circle]; Start -> S00 S00 -> S00 [ label = "0" ]; S00 -> S01 [ label = "1" ]; S10 -> S10 [ label = "0" ]; S10 -> S11 [ label = "1" ]; S01 -> S01 [ label = "0" ]; S01 -> S10 [ label = "1" ]; S11 -> S11 [ label = "0" ]; S11 -> S00 [ label = "1" ]; } ``` With 5 states: ```python def f5(h, m, l, i): l ^= i m ^= (i & ~l) h ^= (i & ~l & ~m) r = ~(i & h & ~m & l) l &= r m &= r h &= r return h, m, l ``` ```graphviz digraph states { rankdir=LR; { node [shape=circle style=invis] Start } node [shape = circle]; Start -> S000 S000 -> S000 [ label = "0" ] S000 -> S001 [ label = "1" ] S100 -> S100 [ label = "0" ] S100 -> S000 [ label = "1" ] S010 -> S010 [ label = "0" ] S010 -> S011 [ label = "1" ] S110 -> S110 [ label = "0" ] S110 -> S111 [ label = "1" ] S001 -> S001 [ label = "0" ] S001 -> S010 [ label = "1" ] S101 -> S101 [ label = "0" ] S101 -> S110 [ label = "1" ] S011 -> S011 [ label = "0" ] S011 -> S100 [ label = "1" ] S111 -> S111 [ label = "0" ] S111 -> S000 [ label = "1" ] } ``` In general, it suffices to build a binary counter with a custom reset condition. Actually, we only need to build new counters when n is a prime. When is n is a composite, we could simply use the counter of n's least non-trivial factor.