2020q3 Homework2 (quiz2)

contributed by <hsiehong>
github

tags: `進階電腦系統理論與實作`

quiz2
已提交的作業共筆

測驗1

判斷指定的記憶體範圍內是否全是有效的 ASCII 字元























#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
bool is_ascii(const char str[], size_t size)
{
    if (size == 0)
        return false;
    int i = 0;
    while ((i + 8) <= size) {
        uint64_t payload;
        memcpy(&payload, str + i, 8);
        if (payload & MMM)
            return false;
        i += 8;
    }
    while (i < size) {
        if (str[i] & 0x80)
            return false;
        i++;
    }
    return true;
}

MMM = 0x8080808080808080

題目原理解釋

因為有效的 ASCII 範圍是 0 - 127，二進制表示是 0x0000 0000 - 0x0111 1111 ，所以可以透過 alpha & 1000 0000 來檢測字母的 MSB 是否為 1，為 1 表示不為有效的 ASCII 字元 (ASCII > 127)，在 64 位元觀點此 mask 即為 0x8080808080808080

為何用到 `memcpy` ?

參閱 memcpy - Linux manual page 可以得知若系統沒有 memory alignment，在 memcpy 會做 memory alignment，可以避免 Unaligned Memory Accesses，因為並不是所有的硬體都有支援 unaligned memory acces，memcpy 就是透過軟體來達到 memory alignment，雖然執行速度會較硬體慢，但不在這邊的討論範圍。
其中 OPSIZ 會因不同的架構而異，可能是 8 bytes 或 16 bytes，而把多出來不足 8( or 16) bytes 的部分先行複製


len -= (-dstp) % OPSIZ;
BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ);

Question

why (-dstp) ?

reference:

將上述技巧應用於「給予一個已知長度的字串，檢測裡頭是否包含有效的英文大小寫字母」

判斷是否為字母的部分使用第 5 題的技巧，利用 A Z 的 MSB 判斷是否是大寫字母，利用 a z 的 MSB 判斷是否是小寫字母，並利用 PACK_BYTE 來增加可讀性
detectEffectiveChar.c

考慮常見的英文標點符號，判斷輸入字串是否為有效的字元集，避免逐一字元比對。Intel SSE 4.2 加入 STTNI (String and Text New Instructions) 指令，可加速字串比對，閱讀 Schema Validation with Intel® Streaming SIMD Extensions 4

TODO
reference: eecheng

測驗2

將十六進位表示的字串 (hexadecimal string) 轉為數值






uint8_t hexchar2val(uint8_t in)
{
    const uint8_t letter = in & MASK;
    const uint8_t shift = (letter >> AAA) | (letter >> BBB);
    return (in + shift) & 0xf;
}

MASK = 0x40
AAA = 3
BBB = 6

題目原理解釋

觀察一個 byte 的狀況

字元	ASCII(in)	output(dec)	output(bin)
'0'	0011 0000	0	0000 0000
'6'	0011 0110	6	0000 0110
'9'	0011 1001	9	0000 1001
'A'	0100 0001	10	0000 1010
'a'	0110 0001	10	0000 1010
'F'	0100 0110	15	0000 1111
'f'	0110 0110	15	0000 1111

MASK 為 0100 0000，推測是判斷字元是數字還是字母，若是字母則 letter 為0100 0000，數字的話 letter 為0000 0000
從 line5 回傳值 (in + shift) & 0xf 推測，0xf 二進制為 0000 1111，即保留 (in + shift)末四位，所以只須探討這末四位是何方神聖
若字元是字母，A 和 a 的末四位都是 0001，要變成正確輸出的 1010 需要加上 1001，可以推斷正確的 offset 為 1001 。這裡利用 letter 0100 0000 來產生 shift，letter >> 3 | letter >> 6 可以得到正確的 offset 0000 1001。
若字元是數字，letter 就是 0，line 5 就沒有作用，in 末四碼就是正確的結果

將 hexchar2val 擴充為允許 "0xFF", "0xCAFEBABE", "0x8BADF00D" 等字串輸入，且輸出對應的十進位數值

將參數改成一次處理一個字串，且能處理的範圍不大於 64 bits (即最多只能處理八個字元，不包含 0x)，並檢查是否為 Hexspeak
hexchar2val.c













uint64_t hexchar2val(char *in)
{
    uint64_t result = 0;
    int len = strlen(in);
    if (in[0] == '0' && in[1] == 'x') {
        for (int i = 2; i < len; i++) {
            const uint8_t letter = *(in + i) & 0x40;
            uint8_t shift = (letter >> 3) | (letter >> 6);
            result = (result << 4) | (in[i] + shift) & 0xf;
        }
    }
    return result;
}

測試結果







int main(void)
{
    printf("%llu\n", hexchar2val("0xFF"));        
    printf("%llu\n", hexchar2val("0xCAFEBABE"));
    printf("%llu\n", hexchar2val("0x8BADF00D"));
    return 0;
}

255
3405691582
2343432205

測驗3

將除法改為乘法運算








const uint32_t D = 3;
#define M ((uint64_t)(UINT64_C(XXX) / (D) + 1))

/* compute (n mod d) given precomputed M */
uint32_t quickmod(uint32_t n)
{   uint64_t quotient = ((__uint128_t) M * n) >> 64;
    return n - quotient * D;
}

XXX = 0xFFFFFFFFFFFFFFFF

題目原理解釋

根據數學式

\frac{n}{d} = n \times (\frac{\frac{2^{N}}{d}}{2^{N}}) = (n \times \frac{2^{N}}{D}) \times \frac{1}{2^{N}}

line 6 可以推得是除以
$2^{64}$ ，所以 N 應為64，((__uint128_t) M * n) >> 64 即為
$n / d$ 的整數部份( quotient )
精確值 XXX 應為
$2^{64}$ ，但 XXX 最大只能儲存
$2^{64} - 1$ ，就是 0xFFFFFFFFFFFFFFFF，最後 +1 代表無條件進位，確保
$\frac{2^{64}}{D}$ 與
$\frac{2^{64} - 1}{D} + 1$ 的整數部份是會相同的
從上述可以得知， M 是對應到
$⌈ \frac{2^{32}}{d} ⌉$
line7 算出商之後，n - q * D 即可得到餘數

測試程式

fastmod 實驗

對 fast_mod 與 normal_mod 分別跑 10 萬次
繪圖 script
比較不做最佳化與做 O1, O2, O3 最佳化的結果
可以看出有做最佳化的話其實結果不會差太多，但 normal 版本還是會稍微較好
O0
O1
O2
O3

Question

fast_mod 在全部的情況下表現都沒有比較優異？
如何驗證實驗結果是正確的？

關於 jemalloc 快速除法

由 Facebook 公司所維護的 jemalloc 是個高效能的記憶體配置器 (memory allocator，即 malloc 和 free 函式對應的實作)，特別在多核多執行緒的環境有優異表現，在其原始程式碼 include/jemalloc/internal/div.h 和 src/div.c 也運用類似的技巧，請閱讀原始程式碼，並抽離快速除法實作為獨立的程式碼，說明運作原理，也該撰寫測試程式，比較透過處理器的整數除法指令及本技巧的效能差距
假設
$n = q \times d$ ，
$n, q, d$ 都是整數，已知
$n, d$ 求
$q$ ，
$q = \frac{n}{d}$ ，可以轉換成下面的形式

q =

⌊ ⌈ \frac{2^{k}}{d} ⌉ \times \frac{n}{2^{k}} ⌋ = ⌊ \frac{2^{k} + r}{d} \times \frac{n}{2^{k}} ⌋

，其中

r

(- 2)^{k}

mod

d

= ⌊ \frac{2^{k}}{d} \times \frac{n}{2^{k}} + \frac{r}{d} \times \frac{n}{2^{k}} ⌋ = ⌊ \frac{n}{d} + \frac{r}{d} \times \frac{n}{2^{k}} ⌋

，因為

n = q \times d

，所以

\frac{n}{d}

為整數

= \frac{n}{d} + ⌊ \frac{r}{d} \times \frac{n}{2^{k}} ⌋

，推導到這邊，只要能確保

⌊ \frac{r}{d} \times \frac{n}{2^{k}} ⌋ < 1

，就代表

\frac{n}{d}

可以被改寫成

⌊ ⌈ \frac{2^{k}}{d} ⌉ \times \frac{n}{2^{k}} ⌋

其中
$r$ =
$(- 2)^{k}$ mod
$d$ ，可以確定
$r < d$ ，而
$\frac{n}{2^{k}}$ 可以令
$k = 32$ 來確保其
$< 1$
推導完再來看快速除法的資料結構 div.h









































#ifndef JEMALLOC_INTERNAL_DIV_H
#define JEMALLOC_INTERNAL_DIV_H

#include "jemalloc/internal/assert.h"

/*
 * This module does the division that computes the index of a region in a slab,
 * given its offset relative to the base.
 * That is, given a divisor d, an n = i * d (all integers), we'll return i.
 * We do some pre-computation to do this more quickly than a CPU division
 * instruction.
 * We bound n < 2^32, and don't support dividing by one.
 */

typedef struct div_info_s div_info_t;
struct div_info_s {
	uint32_t magic;
#ifdef JEMALLOC_DEBUG
	size_t d;
#endif
};

void div_init(div_info_t *div_info, size_t divisor);

static inline size_t
div_compute(div_info_t *div_info, size_t n) {
	assert(n <= (uint32_t)-1);
	/*
	 * This generates, e.g. mov; imul; shr on x86-64. On a 32-bit machine,
	 * the compilers I tried were all smart enough to turn this into the
	 * appropriate "get the high 32 bits of the result of a multiply" (e.g.
	 * mul; mov edx eax; on x86, umull on arm, etc.).
	 */
	size_t i = ((uint64_t)n * (uint64_t)div_info->magic) >> 32;
#ifdef JEMALLOC_DEBUG
	assert(i * div_info->d == n);
#endif
	return i;
}

#endif /* JEMALLOC_INTERNAL_DIV_H */

在 structure div_info_s 中有一個 uint32_t magic，magic 就是上述推導中的
$⌈ \frac{2^{k}}{d} ⌉$ ，在 div.c 中實做
div_compute 回傳的是
$⌊ ⌈ \frac{2^{k}}{d} ⌉ \times \frac{n}{2^{k}} ⌋$
再來實做部份 div.c (省略推導過程)































#include "jemalloc/internal/jemalloc_preamble.h"
#include "jemalloc/internal/div.h"
#include "jemalloc/internal/assert.h"

void
div_init(div_info_t *div_info, size_t d) {
	/* Nonsensical. */
	assert(d != 0);
	/*
	 * This would make the value of magic too high to fit into a uint32_t
	 * (we would want magic = 2^32 exactly). This would mess with code gen
	 * on 32-bit machines.
	 */
	assert(d != 1);

	uint64_t two_to_k = ((uint64_t)1 << 32);
	uint32_t magic = (uint32_t)(two_to_k / d);

	/*
	 * We want magic = ceil(2^k / d), but C gives us floor. We have to
	 * increment it unless the result was exact (i.e. unless d is a power of
	 * two).
	 */
	if (two_to_k % d != 0) {
		magic++;
	}
	div_info->magic = magic;
#ifdef JEMALLOC_DEBUG
	div_info->d = d;
#endif
}

這邊有避免除數為 0 的狀況，在 ISO/IEC 9899 中

6.5.5-5
The result of the / operator is the quotient from the division of the first operand by the second; the result of the % operator is the remainder. In both operations, if the value of the second operand is zero, the behavior is undefined.

two_to_k 即
$2^{32}$
magic 為
$⌈ \frac{2^{k}}{d} ⌉$ ，但 / 預設是取 floor，所以在 line 24 中還要做取 ceiling 的動作

Questions

在這裡 .h file 與 .c file 的區分？
在 div.c 中使用到 %，影響？

jemalloc 實驗

測試程式

























int main(void)
{
    div_info_t DIV;
    size_t temp;
    struct timespec start, end;
    srand(time(NULL));
    printf("divisor  fast_time   norm_time\n");
    for (size_t i = 2; i <= 10000; i++) {
        div_init(&DIV, i);
        uint64_t dividend = rand() % UINT_MAX;
        
        clock_gettime(CLOCK_MONOTONIC, &start);
        temp = div_compute(&DIV, dividend);
        clock_gettime(CLOCK_MONOTONIC, &end);
        long long fast_cost = (long long)(end.tv_sec * 1e9 + end.tv_nsec) -     \
                              (long long)(start.tv_sec * 1e9 + start.tv_nsec);
        clock_gettime(CLOCK_MONOTONIC, &start);
        temp = dividend / i;
        clock_gettime(CLOCK_MONOTONIC, &end);
        long long norm_cost = (long long)(end.tv_sec * 1e9 + end.tv_nsec) -     \
                              (long long)(start.tv_sec * 1e9 + start.tv_nsec);
        printf("%-7ld  %-7lld %-7lld\n", i, fast_cost, norm_cost);
    }
    return 0;
}

下面分別比較編譯時分別使用不同最佳化的結果

Question

與同學的實驗結果有落差

參考同學的實驗：
WeiCheng14159
RinHizakura

實驗結果的震盪探討

測驗4




bool divisible(uint32_t n)
{
    return n * M <= YYY;
}

const uint32_t D = 3;
#define M ((uint64_t)(UINT64_C(XXX) / (D) + 1))

YYY = M-1

題目原理解釋

已知 0xFFFFFFFFFFFFFFFF / 3 = 0x5555555555555556 = M，又 d = 3，n 只可能有 3 種情況: 3k + 1, 3k + 2, 3k，k 為正整數，分別討論 n 可能的 3 種狀況
n = 3k:
$n * M = 2 k$
n = 3k + 1:
$n * M = (3 k + 1) * M = 3 K M + M = 2 k + M$
n = 3k + 2:
$n * M = (3 k + 2) * M = 3 K M + 2 M = 2 k + 2 M$
可以得知只有 n = 3k 時 n 才能被整除，所以 n * M = 2k 才有可能，也可以表示成 n * M <= M - 1 或 n * M < M (k 有可能為 0)

閱讀 Faster remainders when the divisor is a constant: beating compilers and libdivide

note

測驗5

將指定的字串 (或部分字串) 的英文字母向量化地全改為小寫(一次改變8個 byte) ，使用 in-place 作法

























#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#define PACKED_BYTE(b) (((uint64_t)(b) & (0xff)) * 0x0101010101010101u)

/* vectorized implementation for in-place strlower */
void strlower_vector(char *s, size_t n)
{
    size_t k = n / 8;
    for (size_t i = 0; i < k; i++, s += 8) {
        uint64_t *chunk = (uint64_t *) s;
        if ((*chunk & PACKED_BYTE(VV1)) == 0) { /* is ASCII? */
            uint64_t A = *chunk + PACKED_BYTE(128 - 'A' + VV2); 
            uint64_t Z = *chunk + PACKED_BYTE(128 - 'Z' + VV3);
            uint64_t mask = ((A ^ Z) & PACKED_BYTE(VV4)) >> 2;
            *chunk ^= mask;
        } else
            strlower(s, 8);
    }
    
    k = n % 8;
    if (k)
        strlower(s, k);
}

VV1 = 0x80
VV2 = 0
VV3 = -1
VV4 = 0x80

題目原理解釋

define PACKED_BYTE(b) PACKED_BYTE(b) 的作用是取出 b 的最後 8 個位元並複製 8 次形成 8 個 byte，比如說 b 的後 8 bits 是 0xab，則 PACKED_BYTE(b) = 0xabababababababab
line13: 要確認是否為有效的 ASCII 字元，使用技巧同第一題，又這裡要一次處理 8 個 char (8 bytes)，故將 0x80 -> 0x8080808080808080
line17，推測 mask 的作用，若大寫字母才需要做轉換，小寫字母不須轉換，字母為大寫的話 mask 是 0b0010 0000 (原本是0x80，在 line16 做了 >> 2 的動作)，反之是小寫字母的話 mask 為 0b0000 0000
我們要判斷一個字母 c 是否在大寫字母範圍簡單的方法如下，有兩個部分， c > 'A' 且 c < 'Z'

c > 'A' && c < 'Z'

128 二進制表示是 1000 0000，這裡設計兩個變數 A Z 檢查字母是否在 A-Z 之間，變數 A 判斷 c 是否大於等於 'A'，若 c 大於 'A' 的話 128 - 'A' + VV2 的 MSB 會是 1 (>127)，故 vv2 = 0 。變數 Z 同理，紀錄字母 c 是否大於 'Z'，但因為 'Z' 是允許的(計算127 - 'Z'的距離就好，故 vv3 = -1)，大於 'Z' 的話變數 Z 的 MSB 才要為 1，故要再減 1 。
這裡只需要 A^Z MSB 的資訊來判斷是否在大寫字母的範圍，所以此處 mask 取 0x1000 0000 ，取出 MSB 的資訊，來判斷要做大寫還是小寫的動作。

延伸問題














#include <stdio.h>
#include <string.h>
int main()
{
    /* quote from https://www.gutenberg.org/files/74/74-h/74-h.htm */
    char str[] =
        "This eBook is for the use of anyone anywhere at no cost and with \
almost no restrictions whatsoever.  You may copy it, give it away or \
re-use it under the terms of the Project Gutenberg License included \
with this eBook or online at www.gutenberg.net";
    int n = strlen(str);
    strlower_vector(str, n);
    puts(str);
}

若將 char str[] ... 改成 char *str ... 執行時會產生錯誤

參閱 ISO/IEC 9899 (即 C99 規格)

6.4.5 String literals
第 2 點

6.4.5
A character string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in "xyz".

5.1.1.2-6
Adjacent string literal tokens are concatenated.

第 5 點

In translation phase 7, a byte or code of value zero is appended to each multibyte character sequence that results from a string literal or literals. The multibyte character sequence is then used to initialize an array of static storage duration and length just sufficient to contain the sequence. For character string literals, the array elements have type char, and are initialized with the individual bytes of the multibyte character sequence; for wide string literals, the array elements have type wchar_t, and are initialized with the sequence of wide characters corresponding to the multibyte character sequence, as defined by the mbstowcs function with an implementation-defined current locale. The value of a string literal containing a multibyte character or escape sequence not represented in the execution character set is implementation-defined.

第 6 點

It is unspecified whether these arrays are distinct provided their elements have the
appropriate values. If the program attempts to modify such an array, the behavior is undefined.

string literal 的型態是 array of char
char array 會在 static storage 做初始化的動作，且被分配到的空間大小是剛好的 (大小是有加入 null byte 的)
試圖修改其內容是 undefine behavior

supplement

Description of mbstowcs function in 7.20.8.1

The mbstowcs function converts a sequence of multibyte characters that begins in the initial shift state from the array pointed to by s into a sequence of corresponding wide characters and stores not more than n wide characters into the array pointed to by pwcs. No multibyte characters that followanull character (which is converted into a null wide character) will be examined or converted. Each multibyte character is converted as if by a call to the mbtowc function, except that the conversion state of the mbtowc function is not affected.

Description of wchar_t in 7.17

which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales; the null character shall have the code value zero. Each member of the basic character set shall have a code value equal to its value when used as the lone character in an integer character constant if an implementation does not define

`char str[]` 與 `char *str` 的差別參考:

6.7.8 Initialization
第 32 點

char s[] = "abc"

defines plain char array objects s whose elements are initialized with character string literals.
The contents of the arrays are modifiable.

定義一個字元陣列，其內容是 string literal 所初始的，且陣列的內容是可以更改的

char *p = "abc";

defines p with type pointer to char and initializes it to point to an object with type array of char with length 4 whose elements are initialized with a character string literal. If an attempt is made to use p to modify the contents of the array, the behavior is undefined.

首先會宣告一個指向 char 的 pointer，並將其初始化，初始化會指向一個
字元陣列，這個陣列是 string literal 的，如上所述，嘗試對其內容修改是 undefined behavior 的

some other introduction of string literal

測驗6

LeetCode 題目敘述，完成以下程式碼











int singleNumber(int *nums, int numsSize)
{
    int lower = 0, higher = 0;
    for (int i = 0; i < numsSize; i++) {
        lower ^= nums[i];
        lower &= KKK;
        higher ^= nums[i];
        higher &= JJJ;
    }
    return lower;
}

KKK = ~higher
JJJ = ~lower

題目原理解釋

這題是運用 bitwise 操作中，每個 bits 都是獨立的個體的特性，透過 {higher,lower} 紀錄每個 bit 出現的次數，因為每個數只會出現3次或1次，所以有三種狀態要記錄，分別是出現1次，出現2次和出現3次的，{higher,lower} 有 2 bits，其實最多可以記錄4種狀態。
可以利用 01 表示該 bit 出現了1次，10 表示該 bit 出現2次，00 表示出現3次，其實就是一個 Mealy 有現狀態機，當前的狀態會與輸入還有目前的狀態有關，狀態轉換如下
current state 和 after state 分別是紀錄每個位元在該 iteration 的狀態
{higher, lower}

current state	input	after state
00	0	00
00	1	01
01	0	01
01	1	10
10	0	10
10	1	00

lower ^= nums[i];

將 nums[i] 加入 lower (LSB)，用 xor operation。若此 bit 是 1 (代表前面已經出現過1次了)，那會被消成 0，代表這個位置出現兩次了，後面要再加入 higher

lower &= ~higher;

檢查 higher 的狀態，若 higher 是 1 (表示前面已經出現過 2 次了)，那最終的狀態應該要是 0 0。若 higher 是 0 ，表示還沒出現過，所以做 ~higher 運算可以達到這個效果

higher ^= nums[i];

將 nums[i] 加入 higher (MSB)，用 xor operation，若此 bit 是 1 (代表前面已經出現過 2 次了，現在是第 3 次出現，這樣狀態應該要是 0)，那會被消成 0

higher &= ~lower;

檢查 lower 的狀態，若 lower 是 1 (表示這個 bit 是第 1 次出現，所以並不用加到 higher)，那狀態應該要是 0 1。若 lower 是 0 ，表示不是第 1 次出現，可能是第 2 或 3 次，此時上一個動作就已經幫我們做好處理了，所以做 ~higher 運算可以達到這個效果

另解

 * 思路：利用一個陣列 `bitCnt[32]`，記錄每個位置的 bit 出現的次數，利用本題只會有出現3次或1次的狀況，所以可以用 `bitCnt[i] % 3` 來判斷該位置的 bit 是否是只出現一次的數字

* 缺點：另外宣告了一個 int 陣列，問題是這個陣列並不會被有效的使用，而且下面的迴圈內用到了 mod 運算，所以還有蠻多地方可以加強的

















int singleNumber(vector<int>& nums) {
    int numSize = nums.size();
    int bitCnt[32] = {0};
    for (int i = 0; i < numSize; ++i) {
        int num = nums[i];
        for (int j = 0; j < 32; ++j) {
            if (num & (1 << j))
                bitCnt[j]++;
        }
    }
    int res = 0;
    for (int i = 0; i < 32; ++i) {
        if (bitCnt[i] % 3)
            res |= (1 << i);
    }
    return res;
}

執行結果

推廣

推廣到出現 n 次，可以先分成兩種狀況，出現偶數次跟奇數次
偶數次比較簡單，根據 xor operation 的特性: A ^ A = 0，出現偶數次的 bit 在運算時可以直接被消去，所以只需記錄兩種狀態，出現 1 次或出現偶數次，這樣一個 bit 就可以代表了，實作方式跟 Single Nnumber 一樣







int evenNumber(int *nums, int numsSize)
{
    int one = 0;
    for (int i = 0; i < numsSize; i++)
        one ^= nums[i];
    return one;
}

討論奇數次的狀況，若出現 n 次，理論上需要
$⌈$ log₂n
$⌉$ 個變數來記錄狀態
比較直覺的方法同上面，利用 1 個大小為 32 個 int 的 int 陣列 bitCnt 記錄每個 bit 出現的次數，最後改成 bitCnt[i] % n 即可，但問題也相同，可能大多數並不需要用到 32 個 int，只需要
$⌈$ log₂n
$⌉$ 個 int 即可，在 n < 2³¹ 的情況下都能較有效的利用到 memory，同時也盡量避免 module operation
參考這篇的解說我覺得講的非常仔細，圖示也把核心概念解釋得很清楚如下：

其中的 m 就是
$⌈$ log₂n
$⌉$ ，也是所需 counter 的數量
這裡考慮出現 5 次的要求，因為最多會出現 5 次，所以需要
$⌈$ log₂5
$⌉$ = 3 個 counter x1, x2, x3， bit 的轉換狀態如下：

000 -> 001 -> 010 -> 011 -> 100 -> 101 -> 000
       一次    兩次   三次    四次    五次

注意第 X_i 個 bit 從 0 -> 1 的過程中，X_i-1, X_i-2, … , X₁ 這些 bits 必須全為 1 才能成立，要注意實做時要 X_m 計算到 X₁ ，不然會 data loss 導致進位錯誤，到這邊我們已經可以正常統計每個位置的 bit 出現的次數，但接下來會有一個問題，m bits counter 理論上會統計到 2^m - 1 而不是我們要的 n，根據上面的範例解釋就是該如何讓 100 -> 000 而不是 100 -> 101，所以還需要設立一個機制能統計到 m 時歸零而不是繼續累計。
這裡很巧妙的利用到了 cutting 的機制，當 counter 累積到 n 的時時候會歸零，可以透過使用一個 mask 對所有的 counter 做檢查，mask 是個人覺得這題最 tricky 的地方，mask 的建制如下：

mask = ~ (y₁ & y₂ & … & y_m)
y_i = x_i if k_i is 1
y_i = ~ x_i if k_i is 0
(k_i 是數字出現次數二進制表示的第 i 個 bit)

說明：假設各 counter 的狀態如下 MSB 是 101，即將要備歸零，而 LSB 的狀態是 110，各 counter 表示如下：

x1: 1xx...x1
x2: 0xx...x1
x3: 1xx...x0

做取 mask 的動作後 MSB 會得到 1 (需要歸零的 signal) (1 & ~0 & 1)，LSB 會得到 0 (不需要歸零，所以並沒有效果)(1 & ~1 & 0)，mask 再取 ~ 即可達到將 counter 歸零的效果

mask 建立完成之後再依序對每個 counter 做檢查即可

程式碼






















#include <stdio.h>

int main(void)
{
    int arr[] = {2,5,9,7,7,6,9,2,5,9,5,2,7,2,9,5,5,7,7,9,2};
    int x1, x2, x3;
    x1 = x2 = x3 = 0;
    int size = sizeof(arr) / sizeof(int);

    for (int i = 0; i < size; i++) {
        x3 ^= (x2 & x1 & arr[i]);
        x2 ^= (x1 & arr[i]);
        x1 ^= arr[i];
            /* in this situation k = 5, which is 101 */
        int mask = ~(x1 & ~x2 & x3);
        x3 &= mask;
        x2 &= mask;
        x1 &= mask;
    }
    printf("The single number is %d\n", x1);
    return 0;
}

執行結果

The single number is 6

推廣到 r 次也是以此類推，並利用陣列的形式紀錄 counter 就可以了

2020q3 Homework2 (quiz2)

tags: 進階電腦系統理論與實作

測驗1

判斷指定的記憶體範圍內是否全是有效的 ASCII 字元

題目原理解釋

為何用到 memcpy ?

將上述技巧應用於「給予一個已知長度的字串，檢測裡頭是否包含有效的英文大小寫字母」

考慮常見的英文標點符號，判斷輸入字串是否為有效的字元集，避免逐一字元比對。Intel SSE 4.2 加入 STTNI (String and Text New Instructions) 指令，可加速字串比對，閱讀 Schema Validation with Intel® Streaming SIMD Extensions 4

測驗2

將十六進位表示的字串 (hexadecimal string) 轉為數值

題目原理解釋

將 hexchar2val 擴充為允許 "0xFF", "0xCAFEBABE", "0x8BADF00D" 等字串輸入，且輸出對應的十進位數值

測驗3

將除法改為乘法運算

題目原理解釋

fastmod 實驗

關於 jemalloc 快速除法

jemalloc 實驗

測驗4

題目原理解釋

閱讀 Faster remainders when the divisor is a constant: beating compilers and libdivide

測驗5

將指定的字串 (或部分字串) 的英文字母向量化地全改為小寫(一次改變8個 byte) ，使用 in-place 作法

題目原理解釋

延伸問題

supplement

char str[] 與 char *str 的差別參考:

測驗6

題目原理解釋

另解

執行結果

推廣

程式碼

tags: `進階電腦系統理論與實作`

為何用到 `memcpy` ?

`char str[]` 與 `char *str` 的差別參考: