--- title: '2020q3 Homework2 (quiz2)' tags: linux2020 --- # 2020q3 Homework2 (quiz2) contributed by < `ChongMingWei` > ## Outline [TOC] ## 環境 ```shell $ uname -a Linux cmw-System-Product-Name 5.4.0-47-generic #51~18.04.1-Ubuntu SMP Sat Sep 5 14:35:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux $ gcc --version gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 Copyright (C) 2017 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. ``` ## 測驗 1 ```c= bool is_ascii(const char str[], size_t size) { if (size == 0) return false; int i = 0; while ((i + 8) <= size) { uint64_t payload; memcpy(&payload, str + i, 8); if (payload & 0x8080808080808080)//MMM return false; i += 8; } while (i < size) { if (str[i] & 0x80) return false; i++; } return true; } ``` - ASCII code table :::info Total 128 characters are inside the table. When we use a `char` to store it, all ASCII characters share one attribute: `bit 8 whith the value '0'` So, if we want to test whether a character is ASCII code or not, we can directly use `&` operator with value `0x80`. The result will be 0 if it's an ASCII code and 0x80 for others. ::: ![](https://i.imgur.com/kKzmb45.png) **In this code, we have to check if there exists a character which doesn't belong to ASCII code:** - In the while loop, we have to check if threr are enough characters for each check. - If so, we use a variable `payload` with datatype `uint64_t` to store the bit value of 8 characters (64 bits) at a time. - Then, we use `memcpy` to copy `8` bytes from the pointer `*(str + i)` to the address of `paylod`. - Finally, we use `0x8080808080808080` to check if the content of each byte is a ASCII code characters. (As mentioned before, for a character we use `0x80`. Here we take 8 times of that.) - If there is any non ASCII code character, the function immediately return `false`. ```cpp if (size == 0) return false; int i = 0; while ((i + 8) <= size) { uint64_t payload; memcpy(&payload, str + i, 8); if (payload & 0x8080808080808080)//MMM return false; i += 8; } ``` - If the size of remaining string is less than `8`, we use a while loop to deal with the rest. (One byte one time) ```cpp while (i < size) { if (str[i] & 0x80) return false; i++; } return true; ``` :::success 延伸問題: 1.解釋上述程式的運作原理,特別是為何用到 memcpy 呢?提示: 與 data alignment 有關,詳閱 [C 語言:記憶體管理、對齊及硬體特性](https://hackmd.io/@sysprog/c-memory) ::: - Test (Not using `memcpy` but directly cast `str` to `uint64_t) >Experiment referenced from [RinHizakura](https://hackmd.io/@RinHizakura/SJ5NuIANP) > Change from ```c= while ((i + 8) <= size) { uint64_t payload; memcpy(&payload, str + i, 8); if (payload & 0x8080808080808080) return false; i += 8; } ``` to ```c= while ((i + 8) <= size) { uint64_t *payload = (uint64_t *)str; if (*payload & 0x8080808080808080) return false; i += 8; } ``` We still get the same result. But if we add `-fsanitize=alignment` on the command line when we compile... ![](https://i.imgur.com/uVtepJ6.png) The runtime error happens due to the **load of misaligned address 0x7ffd54c2cdef for type 'uint64_t', which requires 8 byte alignment.** Why this error occurs? :::info >c99 standard 6.3.2.3 pagragh 7 > A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. **If the resulting pointer is not correctly aligned) for the pointed-to type, the behavior is undefined**. Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object. ::: That is, the size of `char` is 1 byte (however the size of `uint64_t` is 8), and due to data alignment the access of string may be an unaligned acesses. ![](https://i.imgur.com/v38eEVi.png) - So, how `memcpy` prevents this condition??([memcpy.c](https://github.com/lattera/glibc/blob/master/string/memcpy.c)) - memcpy中使用` OP_T_THRES(設為16,在64-bit的系統中)`來作為對齊的長度,若要複製的長度小於此,那麼就使用`BYTE_COPY_FWD`直接複製 - 如果大於此預設長度,會先將小於`OPSIZ`以內長度先使用`BYTE_COPY_FWD`複製,扣掉以上長度之後再調用`PAGE_COPY_FWD_MAYBE`以及`WORD_COPY_FWD`調整對齊問題 (做上圖後半部的操作) ```c= void * memcpy (void *dstpp, const void *srcpp, size_t len) { unsigned long int dstp = (long int) dstpp; unsigned long int srcp = (long int) srcpp; /* Copy from the beginning to the end. */ /* If there not too few bytes to copy, use word copy. */ if (len >= OP_T_THRES) { /* Copy just a few bytes to make DSTP aligned. */ len -= (-dstp) % OPSIZ; BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ); /* Copy whole pages from SRCP to DSTP by virtual address manipulation, as much as possible. */ PAGE_COPY_FWD_MAYBE (dstp, srcp, len, len); /* Copy from SRCP to DSTP taking advantage of the known alignment of DSTP. Number of bytes remaining is put in the third argument, i.e. in LEN. This number may vary from machine to machine. */ WORD_COPY_FWD (dstp, srcp, len, len); /* Fall out and copy the tail. */ } /* There are just a few bytes to copy. Use byte memory operations. */ BYTE_COPY_FWD (dstp, srcp, len); return dstpp; } ``` :::success 延伸問題: 2.將上述技巧應用於「給予一個已知長度的字串,檢測裡頭是否包含有效的英文大小寫字母」 ::: 使用測驗5中,檢測大寫英文字母的方式來判斷: 如果落在 `a` 和 `z` 之間, `lower_mask` 的 MSB 會是1; 如果落在 `A` 和 `Z` 之間, `higher_mask` 的 MSB 會是1,接著要對 `lower_mask` 和 `higher_mask` 分別做 `AND` 的操作以清空除了 MSB 以外的 bit ,最後判斷 `lower_mask` 和 `higher_mask` 其中有一個 MSB 是1就可以檢測是否有英文字母存在。 ```c= #define PACKED_BYTE(b) (((uint64_t)(b) & (0xff)) * 0x0101010101010101u) bool is_alphabet(const char str[], size_t size) { if (size == 0) return false; int i = 0; while ((i + 8) <= size) { uint64_t payload; memcpy(&payload, str + i, 8); uint64_t A = payload + PACKED_BYTE(128 - 'A'); uint64_t Z = payload + PACKED_BYTE(128 - 'Z' - 1); uint64_t a = payload + PACKED_BYTE(128 - 'a'); uint64_t z = payload + PACKED_BYTE(128 - 'z' - 1); uint64_t upper_mask = A ^ Z; uint64_t lower_mask = a ^ z; upper_mask &= PACKED_BYTE(0x80); lower_mask &= PACKED_BYTE(0x80); if(upper_mask | lower_mask) return true; i += 8; } while (i < size) { uint8_t A = str[i] + 128 - 'A'; uint8_t Z = str[i] + 128 - 'Z' - 1; uint8_t a = str[i] + 128 - 'a'; uint8_t z = str[i] + 128 - 'z' - 1; uint8_t upper_mask = A ^ Z; uint8_t lower_mask = a ^ z; upper_mask &= 0x80; lower_mask &= 0x80; if(upper_mask | lower_mask) return true; i++; } return false; } ``` :::success 延伸問題: 3. 承 (2),考慮常見的英文標點符號,判斷輸入字串是否為有效的字元集,避免逐一字元比對 >提示: Intel SSE 4.2 加入 STTNI (String and Text New Instructions) 指令,可加速字串比對,詳見: [Schema Validation with Intel® Streaming SIMD Extensions 4](https://software.intel.com/content/www/us/en/develop/articles/schema-validation-with-intel-streaming-simd-extensions-4-intel-sse4.html) ::: ## 測驗 2 ```c= uint8_t hexchar2val(uint8_t in) { const uint8_t letter = in & 0x40;//MASK const uint8_t shift = (letter >> 3) | (letter >> 6);//AAA, BBB return (in + shift) & 0xf; } ``` 以下摘自 ASCII 表格 - `'0'`, `'1'`, `'2'`, …, `'9'` 對應到 `0x30`, `0x31`, `0x32`, … `0x39` - `'a'`, `'b'`, `'c'`, …, `'f'` (小寫) 對應到 `0x61`, `0x62`, `0x63`, …, `0x66` - `'A'`, `'B'`, `'C'`, …, `'F'` 對應到 `0x41`, `0x42`, `0x43`, …, `0x46` 這題如果輸入是數值的話,需要能直接輸出;如果輸入是英文字母的話,需要透過處理來得到對應的數值。 其中`letter`只有在輸入是文字時會得到0x40,因為英文字母的ASCII code有一個共通點就是第6個bit一定是1。 這時如果輸入是數值,`letter`會得到0,就不會影響到`shift`以及最後輸出。 因為輸入的英文字母bit 0-3的數值加上9,才會是原本所代表的數值,所以在shift的部分,要將剛剛`letter`在bit 6的數值1分別兩次移動到bit 0以及bit 3, :::success 延伸問題: 2.將 `hexchar2val` 擴充為允許 `"0xFF"`, `"0xCAFEBABE"`, `"0x8BADF00D"` 等字串輸入,且輸出對應的十進位數值 ::: 以輸入最大10個char來看,扣除`0x`,需要使用`uint32_t`來儲存回傳數字。 ```c= uint32_t hexchar2val(const char str[]) { if (str[0] == '0' && (str[1] == 'x' || str[1] == 'X')) str += 2;//一開始先將`str`的位址+2以略過`0`以及`x` uint32_t sum = 0;//用來累加每一個16進位符號所代表的真實數值 for (int i = 0;str[i]!='\0';++i){ sum = sum << 4;//針對上一次的結果先左移4個bits uint8_t letter = str[i] & 0x40;//下三行的部分為舊版進行數值轉換的方式 uint8_t shift = letter >> 3 | letter >> 6; uint32_t transform = (str[i] + shift)&0xf; sum +=transform; } return sum; } ``` ## 測驗 3 ### [Faster Remainder by Direct Computation](https://arxiv.org/pdf/1902.01961.pdf) #### 原理 - 當我們做整數除法的時候,我們會有商:$n / d = q$,以及餘數:$n \% d = r$ 也可以寫成以下式子: $n = q * d + r$ $E.g.\ 23 = 5 * 4 + 3$ 除法也可以改寫成$n*\dfrac{1}{d}$,先行計算$\dfrac{1}{d}$再來和$n$相乘。 $E.g.\ 23 * 0.25 = 5.75,\ \ q =\lfloor 5.75 \rfloor=5, r = 0.75*4=3$ - 接著,電腦裡面的二進位表示如下(以8-bit舉例, 小數會使用低位的3個bit來置放,如範例中的0.25) $E.g.\ 23=00010111_2, 4=00000100_2, 0.25=00000.010_2$ 此時我們要求商,將$n*\dfrac{1}{d}$,這邊即使不放上小數點做運算,結果也是相同 $n*\dfrac{1}{d}=00010111_2*00000010_2 = 00101110_2$ 最後把小數點放到正確位置,結果相同 $00101.110_2 = 5.75_{10}$ - 即使除數不是2的整數次方,概念也相同 $E.g.\ \ d=6時,\dfrac{1}{6}近似於1.666666...= 0.0010101. . ._2 \\ 此時用c來代表0.0010101. .,並且小數部分用8個bits(這裡點上小數點來輔助觀看):\\計算商:\\n ∗ c = 00010111 ∗ 0.00101011 = 11.11011101\\右移8個bits後我們得到商的部分為11_2,也就是3\\計算餘數:\\f ∗ d = 0.11011101 ∗ 00000110 = 101.00101110\\右移8個bits一樣可以得到101_2,也就是5$ - 由上面的範例可以知道,由於小數點存在與否並不影響運算(`移去小數點的同時等於做了左移`),當我們給予$\dfrac{1}{d}$在二進位表示法中$x$個bits的空間,就是對其做x個bits的左移(使得其變成$\dfrac{2^x}{d}$),而做後續運算要取得商以及餘數時,要右移的bits數也會是x($(n*\dfrac{2^x}{d})>>x$)。 - 使用除數的倒數$(c)$以及小數部分$(f)$來判斷$n$是否能被$d$整除 - $if\ n = q ∗ d + r,\ then\ f = \dfrac{r}{d},\ c=\dfrac{1}{d}$ - 整除情況:$\ f<c$ --- 在第6行的部分,得知要右移64bits,所以第2行XXX就會是2^64^ ```c= const uint32_t D = 3; #define M ((uint64_t)(UINT64_C(0xFFFFFFFFFFFFFFFF) / (D) + 1))//XXX /* compute (n mod d) given precomputed M */ uint32_t quickmod(uint32_t n) { uint64_t quotient = ((__uint128_t) M * n) >> 64; return n - quotient * D; } ``` :::success 延伸問題: 2.由 Facebook 公司所維護的 [jemalloc](https://github.com/jemalloc/jemalloc) 是個高效能的記憶體配置器 (memory allocator,即 `malloc` 和 `free` 函式對應的實作),特別在多核多執行緒的環境有優異表現,在其原始程式碼 [include/jemalloc/internal/div.h](https://github.com/jemalloc/jemalloc/blob/dev/include/jemalloc/internal/div.h) 和 [src/div.c](https://github.com/jemalloc/jemalloc/blob/dev/src/div.c) 也運用類似的技巧,請閱讀原始程式碼,並抽離快速除法實作為獨立的程式碼,說明運作原理,也該撰寫測試程式,比較透過處理器的整數除法指令及本技巧的效能差距; ::: ### $Why\ \ M=\dfrac{2^N-1}{d}+1\ ?$ $M$ 理論上要是 $\dfrac{2^N}{d}$,但$d$不一定都可以整除$2^N$,這樣在程式裡面會取得$\lfloor\dfrac{2^N}{d}\rfloor$的結果,所以我們改寫成$M=\dfrac{2^N-1}{d}+1$,此時不論整除與否都會取得$\lceil\dfrac{2^N}{d}\rceil$的結果。 至於為什麼要取得$\lceil\dfrac{2^N}{d}\rceil$呢? 從以下做快速除法程式碼註解中的證明我們會發現到,$\lceil\dfrac{2^N}{d}\rceil$是必要的一項計算結果: #### jemalloc/src/div.c ```c= #include "jemalloc/internal/jemalloc_preamble.h" #include "jemalloc/internal/div.h" #include "jemalloc/internal/assert.h" /* * Suppose we have n = q * d, all integers. We know n and d, and want q = n / d. * * For any k, we have (here, all division is exact; not C-style rounding): * floor(ceil(2^k / d) * n / 2^k) = floor((2^k + r) / d * n / 2^k), where * r = (-2^k) mod d. * * Expanding this out: * ... = floor(2^k / d * n / 2^k + r / d * n / 2^k) * = floor(n / d + (r / d) * (n / 2^k)). * * The fractional part of n / d is 0 (because of the assumption that d divides n * exactly), so we have: * ... = n / d + floor((r / d) * (n / 2^k)) * * So that our initial expression is equal to the quantity we seek, so long as * (r / d) * (n / 2^k) < 1. * * r is a remainder mod d, so r < d and r / d < 1 always. We can make * n / 2 ^ k < 1 by setting k = 32. This gets us a value of magic that works. */ void div_init(div_info_t *div_info, size_t d) { /* Nonsensical. */ assert(d != 0); /* * This would make the value of magic too high to fit into a uint32_t * (we would want magic = 2^32 exactly). This would mess with code gen * on 32-bit machines. */ assert(d != 1); uint64_t two_to_k = ((uint64_t)1 << 32); uint32_t magic = (uint32_t)(two_to_k / d); /* * We want magic = ceil(2^k / d), but C gives us floor. We have to * increment it unless the result was exact (i.e. unless d is a power of * two). */ if (two_to_k % d != 0) { magic++; } div_info->magic = magic; #ifdef JEMALLOC_DEBUG div_info->d = d; #endif } ``` 註解中解釋道, 我們有事先假設$n=q*d$,$n,\ d$已知,要求$q$ 為了不要使用C-style rounding,而是去計算**正確**的除法 (何謂"**正確**"?數學定義如下註釋) >[Division Algorithm](https://math.stackexchange.com/questions/2686378/modulus-operation-with-negative-numbers?fbclid=IwAR1QhhXL67Y6CNoVOoOu87wevYay2hgn4xbswWbsvmkNzb3ktTriWOgE7vM): $Let\ a\ and\ b\ (≠0)\ be\ any\ integers.\\ Then\ there\ exist\ unique\ integers\ q\ and\ r\ with\ the\ property\ that\\ a=bq+r,\ where\ 0≤r<|b|\\ mod\ Operator:\\ For\ the\ same\ a,b,q\ and\ r\ as\ stated\ in\ the\ definition\ above,\ we\ say\\ a\ mod\ b=r$ >E.g. >-16 **mod** 3 = **2** >The division: -16 = (-6) * 3 + 2 >What will it be like in c ? >-16 **%** 3 = **-1** >**So, it's exactly different**. > 我們把$q=\dfrac{n}{d}$表示成$\lfloor{\lceil{\dfrac{2^k}{d}}\rceil}*\dfrac{n}{2^k}\rfloor$,對其做運算(要證明這樣的表示法是正確的): $\begin{split} \lfloor{\lceil{\dfrac{2^k}{d}}\rceil}*\dfrac{n}{2^k}\rfloor\ &=\lfloor{\dfrac{2^k+r}{d}*\dfrac{n}{2^k}}\rfloor\ \ (r = (-2^k)mod\ d)\\ &= \lfloor{\dfrac{2^k}{d}*\dfrac{n}{2^k}+\dfrac{r}{d}*\dfrac{n}{2^k}}\rfloor\\ &= \lfloor{\dfrac{n}{d}+\dfrac{r}{d}*\dfrac{n}{2^k}}\rfloor\\ &= \dfrac{n}{d}+\lfloor{\dfrac{r}{d}*\dfrac{n}{2^k}}\rfloor\ (根據前面假設n可以被d整除)\\ &= \dfrac{n}{d}\ (1.\ r<d,\ 2.\ set\ k=32此二項皆為小數,向下取整結果為0) \end{split}$ >$(r = (-2^k)mod\ d)$ >這樣計算取得的 r 可以讓 $2^k + r$ 被 $d$ 整除 >E.g. >$15\div4 =3...3\\ >-15\div4=-4...1\\ >(15+1)\div4=4...0\\ >\lceil{\dfrac{15}{4}}\rceil\ =\ \dfrac{15}{4}$ >理應有數學證明,待補 >為何可以有n=q*d的假設?? Line 40, 41做的事情就是取得$\lfloor\dfrac{2^{32}}{d}\rfloor$ ```cpp #40 uint64_t two_to_k = ((uint64_t)1 << 32); #41 uint32_t magic = (uint32_t)(two_to_k / d); ``` Line 48-50 則是去得到$\lceil\dfrac{2^{32}}{d}\rceil$的值 (如果d不是2的整數次方的話才需要做這一步) ```cpp #48 if (two_to_k % d != 0) { #49 magic++; #50 } ``` #### jemalloc/include/jemalloc/internal/div.h ```c= #ifndef JEMALLOC_INTERNAL_DIV_H #define JEMALLOC_INTERNAL_DIV_H #include "jemalloc/internal/assert.h" /* * This module does the division that computes the index of a region in a slab, * given its offset relative to the base. * That is, given a divisor d, an n = i * d (all integers), we'll return i. * We do some pre-computation to do this more quickly than a CPU division * instruction. * We bound n < 2^32, and don't support dividing by one. */ typedef struct div_info_s div_info_t; struct div_info_s { uint32_t magic; #ifdef JEMALLOC_DEBUG size_t d; #endif }; void div_init(div_info_t *div_info, size_t divisor); static inline size_t div_compute(div_info_t *div_info, size_t n) { assert(n <= (uint32_t)-1); /* * This generates, e.g. mov; imul; shr on x86-64. On a 32-bit machine, * the compilers I tried were all smart enough to turn this into the * appropriate "get the high 32 bits of the result of a multiply" (e.g. * mul; mov edx eax; on x86, umull on arm, etc.). */ size_t i = ((uint64_t)n * (uint64_t)div_info->magic) >> 32; #ifdef JEMALLOC_DEBUG assert(i * div_info->d == n); #endif return i; } #endif /* JEMALLOC_INTERNAL_DIV_H */ ``` Line 34拿剛剛計算好的$\lceil\dfrac{2^{32}}{d}\rceil$來和$n$相乘,並且右移32位元,得到最後的結果 assign 給 i。 ```cpp #34 size_t i = ((uint64_t)n * (uint64_t)div_info->magic) >> 32; ``` ### Faster Division - version1 將以上 function 合併成一個,且不使用到他的 `struct div_info_s` :::warning 在後面要做效能分析時發現快速除法速度一定比一般除法慢,後來才發現是沒有想清楚快速除法的優勢處 這樣做的問題是,我把快速除法中可以預先算的部份也合併進來,這樣就喪失了他的優勢。:cry: ::: ```c= size_t div_compute(size_t n, size_t d) { /* Nonsensical. */ assert(d != 0); /* * This would make the value of magic too high to fit into a uint32_t * (we would want magic = 2^32 exactly). This would mess with code gen * on 32-bit machines. */ assert(d != 1); assert(n <= (uint32_t)-1); /* Prepare 2^32/d */ uint64_t two_to_k = ((uint64_t)1 << 32); uint32_t magic = (uint32_t)(two_to_k / d); if (two_to_k % d != 0) { magic++; } /* Compute quotient: n*(2^32/d)>>32 */ size_t i = ((uint64_t)n * (uint64_t)magic) >> 32; return i; } ``` - version2 :::success 以下就是快速除法的部份: ::: ```c= typedef struct div_info_s div_info_t; struct div_info_s { uint32_t magic; }; void div_init(div_info_t *div_info, size_t d) { /* Nonsensical. */ assert(d != 0); /* * This would make the value of magic too high to fit into a uint32_t * (we would want magic = 2^32 exactly). This would mess with code gen * on 32-bit machines. */ assert(d != 1); uint64_t two_to_k = ((uint64_t)1 << 32); uint32_t magic = (uint32_t)(two_to_k / d); /* * We want magic = ceil(2^k / d), but C gives us floor. We have to * increment it unless the result was exact (i.e. unless d is a power of * two). */ if (two_to_k % d != 0) { magic++; } div_info->magic = magic; } size_t div_compute(div_info_t *div_info, size_t n) { assert(n <= (uint32_t)-1); /* * This generates, e.g. mov; imul; shr on x86-64. On a 32-bit machine, * the compilers I tried were all smart enough to turn this into the * appropriate "get the high 32 bits of the result of a multiply" (e.g. * mul; mov edx eax; on x86, umull on arm, etc.). */ size_t i = ((uint64_t)n * (uint64_t)div_info->magic) >> 32; return i; } ``` ### 效能比較 #### 計時器 - 使用 `<time.h>` 中提供的 `clock_gettime` funciton ,可以達到奈秒級的精準度: ```c= struct timespec t; clock_getres(CLOCK_MONOTONIC, &t); printf("Resolution: %ld nanosecond\n", t.tv_nsec); ``` Output ![](https://i.imgur.com/393xVp8.png) - [clock_gettime](https://man7.org/linux/man-pages/man2/clock_gettime.2.html) ```c= #include <time.h> int clock_gettime(clockid_t clockid, struct timespec *tp); struct timespec { time_t tv_sec;/* seconds */ long tv_nsec;/* nanoseconds */ }; ``` - 計算時間 (單位: ns) ```c= struct timespec time1 = {0, 0}; struct timespec time2 = {0, 0}; clock_gettime(CLOCK_REALTIME, &time1); /* test code */ clock_gettime(CLOCK_REALTIME, &time2); long timedif = (time2.tv_sec - time1.tv_sec)*1000000000 + time2.tv_nsec-time1.tv_nsec; printf("time:%ld ns\n",timedif); ``` #### 實驗設計 (參考[RinHizakura](https://hackmd.io/@RinHizakura/SJ5NuIANP)) - 給予以下條件 - 除數: - 2~5000 - 被除數 - 取亂數,範圍:0~4294967296 - 每個除數的迴圈中,取5次的被除數,累加5次的時間後取平均 ```c= int main(){ // Write results into file FILE *f = fopen("result.csv", "w"); if (f == NULL) return -1; fprintf(f,"%s,%s\n","fast","normal"); // Timer struct timespec time1 = {0, 0}; struct timespec time2 = {0, 0}; // Division uint32_t q,d,n; srand(time(NULL)); for (d = 2;d < 5000;++d){ long fast=0; long normal=0; for (int i=0; i<5;++i){ n = rand() % 4294967296; clock_gettime(CLOCK_REALTIME, &time1); q = n / d; clock_gettime(CLOCK_REALTIME, &time2); long diff = time2.tv_nsec - time1.tv_nsec; normal += diff; // fast division part uint64_t two_to_k = ((uint64_t)1 << 32); uint32_t magic = (uint32_t)(two_to_k / d); if (two_to_k % d != 0) { magic++; } clock_gettime(CLOCK_REALTIME, &time1); q = ((uint64_t)n * (uint64_t)magic) >> 32; clock_gettime(CLOCK_REALTIME, &time2); diff = time2.tv_nsec - time1.tv_nsec; fast += diff; } printf("fast = %ld ns normal = %ld ns\n",fast/5,normal/5); // Write in the result fprintf(f, "%ld,%ld\n", fast/5, normal/5); } fclose(f); return 0; } ``` #### 實驗結果 ![](https://i.imgur.com/C1v4nYq.png) ## 測驗 4 - $M$就是剛剛提到,將除數做倒數後忽略小數點的值(做左移N位,乘上$2^N$的部分); 而$n*M$則是會得到商(包括整數以及小數部分,且左移過後的值,一樣做左移N位,乘上$2^N$的部分)$: $M = \dfrac{1}{d}*2^N$ $n*M = n*\dfrac{1}{d}*2^N$ 兩者在低位(**這邊是64個bits,根據測驗3**)都是小數部分: | n | n*M < M-1 | | -------- | -------- | | 0 | True | | 1 | False | | 2 | False | | 3 | False | | ... | | | M | True | - 當整除發生時 ($\dfrac{n}{d}$的小數部分逼近0): $n*M\ (\approx 0) < M-1$ > Q1: 為什麼逼近0? > A1: 因為你給的位數不足以正確表達 $\dfrac{1}{d}$ > Q2: 那會不會這個誤差因為除數的增大,最後整除的誤差會影響結果的正確性? > A2: 根據實驗,不會,結果如下,當除數是4294967295(uint_32所能表達的最大無號整數)整除發生時 > $M-1=0000000100000001$ $n*M\ =00000000fffffffe$ 可以見得,誤差還是會小於 M-1 ,情況還是會如我們預料 - 當不能整除時 ($\dfrac{n}{d}$的小數部分為 $M(\dfrac{1}{d})$ 的 n 倍): $n*M\ > M-1$ >$\ M-1$的用處:是因為當我們的被除數是1的時候,$\ n*M$ 和$\ M$ 會得到相同的值 ```c= bool divisible(uint32_t n) { return n * M <= M - 1;//YYY } ``` ## 測驗 5 為了避免測驗 1 提到的記憶體對齊問題,這邊改寫了使用 `memcpy` 來處理大小寫轉換。 ```c= #include <ctype.h> #include <stddef.h> #include <stdint.h> #define PACKED_BYTE(b) (((uint64_t)(b) & (0xff)) * 0x0101010101010101u) /* vectorized implementation for in-place strlower */ void strlower_vector(char *s, size_t n) { size_t k = n / 8; for (size_t i = 0; i < k; i++, s += 8) { uint64_t chunk; memcpy(&chunk, s, 8); if ((chunk & PACKED_BYTE(0x80)) == 0) { /* is ASCII? */ //VV1 uint64_t A = chunk + PACKED_BYTE(128 - 'A'); //VV2 uint64_t Z = chunk + PACKED_BYTE(128 - 'Z' - 1); //VV3 uint64_t mask = ((A ^ Z) & PACKED_BYTE(0x80)) >> 2; //VV4 chunk ^= mask; memcpy(s, &chunk, 8); } else strlower(s, 8); } k = n % 8; if (k) strlower(s, k); } ``` Line 5: Copy the intput byte 8 times. ```cpp #5 #define PACKED_BYTE(b) (((uint64_t)(b) & (0xff)) * 0x0101010101010101u) ``` Line 14: Check if input string contains ASCII code. ```cpp #14 if ((*chunk & PACKED_BYTE(VV1)) == 0) { /* is ASCII? */ //VV1 ``` >- 以下說明為了簡便,都以 1 個字元解說,實際上要做到一次檢查 8 個字元,只要將檢測會用到的 1-byte 資料放進 `PACKED_BYTE(b)` 內擴充成 64 bits 長度就好。 >- 要做的事情是將大寫字母改成小寫字母,以 bit 的觀點來看,就是把 bit 5 從 0 改成 1 即可,**但要先判斷待測字元是否為 A-Z**。 > Line 15-16: 確認待測字元是否落在 `'A'` 和 `'Z'` 之間 `uint64_t A` 的部份: 順序是先計算0x80和 `'A'` 的差,再把待測字元加上去,實際上就是在計算待測字元和 `A` 的差; `uint64_t Z` 的部份: 順序是先計算0x80和 `'Z'-1` 的差,再把待測字元加上去,實際上就是在計算待測字元和 `Z-1` 的差; 這樣計算的目的是為了: **如果待測字元是落在 `A` 和 `Z` 之間的話,那麼 `A` 的MSB會等於1,`Z` 的 MSB 會等於0** Line 17: 有了剛剛得到的 `A` 和 `Z` ,將其先做 `XOR` 的運算,若待測字元落在 `A` 和 `Z` 之間,那我們會得到 MSB 是 1 的結果,經過後面的 `&` 運算以及右移 2 bits 後,會得到 **`0x20`** 的結果。 Line 18: 將原本的待測字元和剛剛得到的 `mask` 做 `XOR` 的運算,將其由大寫改成小寫。 ```cpp #15 uint64_t A = *chunk + PACKED_BYTE(128 - 'A'); //VV2 #16 uint64_t Z = *chunk + PACKED_BYTE(128 - 'Z' - 1); //VV3 #17 uint64_t mask = ((A ^ Z) & PACKED_BYTE(0x80)) >> 2; //VV4 #18 *chunk ^= mask; ``` :::success 延伸問題: 2. 將前述測試程式 main 函式的 char str[] 更換為 char *str,會發生什麼事?請參閱 C 語言規格書,說明 string literal 的行為 ::: 會出現 Segmentation fault,使用 gdb來檢視,結果如下: ![](https://i.imgur.com/K5B0XvW.png) 錯誤的地方是第28行,也就是準備要把大小寫轉換結果寫進 `s` 的時候 ```c= memcpy(s, &chunk, 8); ``` >C99 6.7.8 Example 8 >EXAMPLE 8 The declaration char s[] = "abc", t[3] = "abc"; defines "plain" **char** array objects s and t whose elements are initialized with character string literals. This declaration is identical to char s[] = { 'a', 'b', 'c', '\0' }, t[] = { 'a', 'b', 'c' }; ***The contents of the arrays are modifiable***. On the other hand, the declaration char *p = "abc"; defines **p** with type "pointer to **char**" and initializes it to point to an object with type "array of char" with length 4 whose elements are initialized with a character string literal. ***If an attempt is made to use p to modify the contents of the array, the behavior is undefined.*** ## 測驗 6 - XOR的特性: - 真值表 - 下面的真值表中,若兩者相同則得到 0 ;兩者相異則得到 1。 - 或者,我們可以說,遇到 0 的時候,保持原本數值;遇到 1 的時候,對原本數值做反運算 | A | B | A ^ B| | -------- | -------- | -------- | | 0 | 0 | 0 | | 0 | 1 | 1 | | 1 | 0 | 1 | | 1 | 1 | 0 | - 滿足結合律 - 也就是說,做 XOR 的順序,並不會影響最後的結果 例如. 有數字 A, B, C, D,那麼 `A XOR B XOR C XOR D` 會等於 `B XOR C XOR D XOR A` 以及其他順序的結果皆會相同 > 解題想法參考: https://leetcode.com/problems/single-number-ii/discuss/43294/Challenge-me-thx/176039 - `lower` 和 `higher` 是用來統計每個bit的累積次數,`lower` 的每個 bit 如果是1,那麼就是這個位置出現了1次1,而 `higher` 的每個 bit 如果是1,則是表示這個位置已經出現過2次1了。 - Line6 是用來檢查,如果該個 bit 在 `lower` 和 `higher`裡面都出現,那表示是第3次了,必須要被清除 (設成0)。 - Line8 是用來檢查是否出現第2次,因為如果出現第2次,那麼在 Line5 `lower` 中的那個 bit 會被清除 (設成0),那麼這個 bit 確定出現第2次,必須被留下來。 - 整個流程就是一直不斷的去拿新的 input ,然後檢查每一個 bit 出現次數 (Line 6, 8),正確地把它放在 `lower` 或是 `higher` 裡面。 ```c= int singleNumber(int *nums, int numsSize) { int lower = 0, higher = 0; for (int i = 0; i < numsSize; i++) { lower ^= nums[i]; lower &= ~higher;//KKK higher ^= nums[i]; higher &= ~lower;//JJJ } return lower; } ``` :::success 延伸問題: 2. 請撰寫不同上述的程式碼,並確保在 [LeetCode 137. Single Number II](https://leetcode.com/problems/single-number-ii/) 執行結果正確且不超時; 3. 延伸題意,將重複 3 次改為改為重複 4 次或 5 次,並找出可容易推廣到多次的程式碼; ::: ### 有限狀態機解題: 使用真值表以及卡諾圖 > 解題方法參考: https://leetcode.com/problems/single-number-ii/discuss/43294/Challenge-me-thx/42086 - 題目提到所有數字都會重複3次,只有一個會出現1次,可以將此題視為 `finite state machine` 來解, `int` 總共有32個 bits ,我們只需要探討1個 bit 的情況,便能將其套用到所有 bits 上面 我們會有出現0次、出現1次、出現2次和出現3次的狀態,出現3次和出現0次視為相同狀態,所以總共3種狀態,使用2個 bits 就可以表示 (bit 0 表示出現1次,變數名稱使用 `lower`; bit 1 表示出現2次,變數名稱使用 `higher`): - `00`: 出現0次和出現3次 - `01`: 出現1次 - `10`: 出現2次 - 接著我們使用真值表來表示,當不同狀態的 `lower` 以及 `higher`,在遇到不同的 input 時,會產出的 `lower'` 以及 `higher'` 分別為何: | input | higher | lower| higher' | lower'| | -------- | -------- | -------- | -------- | -------- | | 0 | 0 | 0 | 0 | 0 | | 0 | 0 | 1 | 0 | 1 | | 0 | 1 | 0 | 1 | 0 | | 1 | 0 | 0 | 0 | 1 | | 1 | 0 | 1 | 1 | 0 | | 1 | 1 | 0 | 0 | 0 | - 使用上面提到的**真值表**,我們使用**卡諾圖**來得到 `lower'` 和 `higher'` 分別為: - $\begin{split}lower' =& (\sim input\ \&\ \sim higher\ \&\ lower)\ |\ (input\ \&\ \sim higher\ \&\ \sim lower)\\=& (lower\ \&\ \sim input\ \&\ \sim higher)\ |\ (\sim lower\ \&\ input\ \&\ \sim higher)\\=&((lower\ \&\ \sim input\ |\sim lower\ \&\ input)\&\ \sim higher)\\=&(lower \oplus\ input)\ \&\ \sim higher\end{split}$ - $higher' = (\sim input\ \&\ higher\ \&\ \sim lower)\ |\ (input\ \&\ \sim higher\ \&\ lower)$ - 將其轉換成程式碼,我們需要一個額外的變數 `last_lower` 來儲存上一輪的 `lower` ,並將其用來計算新一輪的 `higher` ```c= int singleNumber(int *nums, int numsSize) {//repeat 3 times, 2 bits used int lower = 0, higher = 0; for (int i = 0; i < numsSize; i++) { int last_lower = lower; lower = (lower ^ nums[i]) & ~higher; higher = (~nums[i] & higher & ~last_lower) | (nums[i] & ~higher & last_lower); } return lower; } ``` ![](https://i.imgur.com/W4D4Qe9.png) ### 重複n次問題 - 這樣子的演算法使用到了**真值表**以及**卡諾圖**,若要擴充到 n 次重複的問題,只要使用 $\lceil log_2n\rceil$ 個 bits ,就能得到結果,例如,現在我們要4次重複 (需要出現1次、出現2次、出現3次以及出現4次 (和出現0次為同一個狀態)共4個 state ): | input | higher | lower| higher' | lower'| | -------- | -------- | -------- | -------- | -------- | | 0 | 0 | 0 | 0 | 0 | | 0 | 0 | 1 | 0 | 1 | | 0 | 1 | 0 | 1 | 0 | | 0 | 1 | 1 | 1 | 1 | | 1 | 0 | 0 | 0 | 1 | | 1 | 0 | 1 | 1 | 0 | | 1 | 1 | 0 | 1 | 1 | | 1 | 1 | 1 | 0 | 0 | 一樣我們使用**卡諾圖**來得到 `lower'` 和 `higher'` 分別為: - $\begin{split}lower' =& (\sim input\ \&\ lower)\ |\ (input\ \&\ \sim lower)\\=&\ lower\ \oplus\ input \end{split}$ - $higher' = (input\ \&\ \sim higher\ \&\ lower)\ |\ (higher\ \&\ (\sim input\ \&\ \sim lower))$ - 將其轉換成程式碼,一樣需要一個額外的變數 `last_lower` 來儲存上一輪的 `lower` ,並將其用來計算新一輪的 `higher` ```c= int singleNumber(int *nums, int numsSize) {//repeat 4 times, 2 bits used int lower = 0, higher = 0; for (int i = 0; i < numsSize; i++) { int last_lower = lower; lower = lower ^ nums[i]; higher = (nums[i]&~higher&last_lower) | (higher&(~nums[i]&~last_lower)); } return lower; } ``` ```c= int singleNumber(int *nums, int numsSize) {//repeat 5 times, 3 bits used int one = 0, two = 0, three = 0;//which means repeat 1, 2, 3 times respectively for (int i = 0; i < numsSize; i++) { int lastone = one; int lasttwo = two; one = ~three & (one ^ nums[i]); two = (~nums[i]&~three&two) | (~three&two&~lastone) | (nums[i]&~three&~two&lastone); three = (~nums[i]&three&~lasttwo&~lastone)|(nums[i]&~three&lasttwo&lastone); } return one; } ```