# 2020q3 Homework3 (quiz3)
contributed by < `Rsysz` >
###### tags: `sysprog`
> [測驗題](https://hackmd.io/@sysprog/2020-quiz3#%E6%B8%AC%E9%A9%97-3)
## 1.
根據C99標準的6.5.7節
> “The result of `E1` >> `E2` is E1 right-shifted E2 bit positions. … If E1 has a signed type and a negative value, the resulting value is ==implementation-defined==.
對有號型態的物件進行算術右移(Arithmetic Shift Right)屬於 `Unspecified Behavior`
因此想實作一套可跨平台的ASR,代表當輸入為負數時右移必須補 1 ,反之則必須補 0
```cpp
int asr_i(signed int m, unsigned int n)
{
const int logical = (((int) -1) OP1) > 0;
unsigned int fixu = -(logical & (OP2));
int fix = *(int *) &fixu;
return (m >> n) | (fix ^ (fix >> n));
}
```
根據題意,已知m為輸入數值,n為右移次數,由 `return` 回推
* 假設 `m` 為正數若 `m >> n` 為補0,則 `fix ^ (fix >> n)` = 0
若 `fix = 0` 相當於 `fixu = 0` 代表 `logical` 必須等於 0
因此得知 `logical` 為測試此平台上負數右移補 0 或 1 , OP1 = `(b) >> 1`
* 假設 `m` 為負數若 `m >> n` 為補0,則 `fix ^ (fix >> n)` 必須補上對應 `n` 位的 1
若 `logical` 為 1 代表右移補 0 , `fix = -OP2` 又 `m < 0` 時 `fix` 必須等於 -1
因此 OP2 = `(c) m < 0`
```graphviz
digraph ASR{
node [shape="box", fontcolor=black, fontsize=13, style=filled, fillcolor=aquamarine]
start[shape=ellipse, fontcolor=black,style=filled, fillcolor=white,label="start"]
init[label="input value = m, shift value = n", fillcolor=white]
shift[label="(((int) -1) >> 1) > 0",shape="diamond", fontcolor=black, fillcolor=white]
fix[label="fix = fixu = -(logical & (m < 0))", fillcolor=white]
m[label="m < 0", shape="diamond", fillcolor=white]
m_true[label="m > 0, logical = 0 or 1
m < 0, logical = 0
fix = 0", fillcolor=white]
m_false[label="m < 0, logical = 1
fix = -1", fillcolor=white]
return[label=" (m >> n) | (fix ^ (fix >> n)) ", fillcolor=white]
end[shape=ellipse, fontcolor=black,style=filled, fillcolor=white,label="end"]
start->init->shift->fix:w
init->m->fix:e
fix->m_true->return->end
fix->m_false->return
}
```
以$-5≫^{arith}1$為例, `m = -5` , `n = 1` 若假設右移為 ==邏輯右移==
```graphviz
digraph SR{
node[shape=record]
struct1 [label="<f0>1|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>1|<f7>1"]
int [label = "((int) -1)", shape = plaintext]
struct2 [label="<f0>0|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>1|<f7>1"]
int_s [label = "((int) -1) >> 1", shape = plaintext]
{rank = same int_s struct2}
logica [label = "logical = 1"]
struct1:f0->struct2:f1
struct1:f1->struct2:f2
struct1:f2->struct2:f3
struct1:f3->struct2:f4
struct1:f4->struct2:f5
struct1:f5->struct2:f6
struct1:f6->struct2:f7
struct2:f7:e->logica:w
}
```
因 `m` 為負數, `fix = fixu = -1`
```graphviz
digraph SR{
node[shape=record]
m [label="<f0>1|<f1>1|<f2>1|<f3>1|<f4>1|<f5>0|<f6>1|<f7>1"]
m_text_1 [label = "m(-5)", shape = plaintext]
{rank = same m m_text_1}
n [label="<f0>0|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>0|<f7>1"]
m_text_2 [label = "m >> n(-5 >> 1)", shape = plaintext]
{rank = same n m_text_2}
result [label="<f0>1|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>0|<f7>1"]
result_text [label = "(m >> n) | (fix ^ (fix >> 1)) (-3)", shape = plaintext]
{rank = same result result_text}
struct1 [label="<f0>1|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>1|<f7>1"]
fix [label = "fix", shape = plaintext]
struct2 [label="<f0>0|<f1>1|<f2>1|<f3>1|<f4>1|<f5>1|<f6>1|<f7>1"]
fix_s [label = "fix >> 1", shape = plaintext]
{rank = same fix_s struct2}
struct3 [label="<f0>1|<f1>0|<f2>0|<f3>0|<f4>0|<f5>0|<f6>0|<f7>0"]
fix_xor [label = "fix ^ (fix >> 1)", shape = plaintext]
{rank = same fix_xor struct3}
n->result
m:f0->n:f1
m:f1->n:f2
m:f2->n:f3
m:f3->n:f4
m:f4->n:f5
m:f5->n:f6
m:f6->n:f7
struct1:f0->struct2:f1
struct1:f1->struct2:f2
struct1:f2->struct2:f3
struct1:f3->struct2:f4
struct1:f4->struct2:f5
struct1:f5->struct2:f6
struct1:f6->struct2:f7
struct2:f0->struct3:f0
struct2:f1->struct3:f1
struct2:f2->struct3:f2
struct2:f3->struct3:f3
struct2:f4->struct3:f4
struct2:f5->struct3:f5
struct2:f6->struct3:f6
struct2:f7->struct3:f7
}
```
## 2.
```cpp
bool isPowerOfFour(int num)
{
return num > 0 && (num & (num - 1))==0 &&
!(__builtin_ctzl(num) OPQ);
}
```
$4^n(n為整數)必然大於0$,因此 `num > 0`
$4^n可化為2^{2n},若num不為2^n必定不為4^n$, 因此 `(num & (num - 1)) == 0`
> Built-in Function: `int __builtin_ctz (unsigned int x)`
> * Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.
>
> Built-in Function: `int __builtin_clz (unsigned int x)`
> * Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.
| $4^n$ | Binary |
| -------- | -------- |
| $4^0$| 0000 0001|
| $4^1$| 0000 0100|
| $4^2$| 0001 0000|
已知$4^n = {\overbrace{0000....0100}^{尾端數來共2n個0}}$
以 `!(__builtin_ctz(num) OPQ)` 確保尾端數來皆是偶數個 0 排除 2 的次方數
因此 OPQ = `(f) & 0x1`
### 延伸問題
```cpp
bool isPowerOfFour(int num)
{
return (num & 0x55555555) && (!(num & (num - 1)));
}
```
![](https://i.imgur.com/moPAh7T.png)
透過 `0x55555555` 確保 `num` 的二進位只有奇數位有值,又 4 的次方數轉為二進位只有一個 1 於是再排除有複數位的情況。
```cpp
int bitwiseComplement(int N){
if(N)
return ~N & ((1 << (32 - __builtin_clz(N))) - 1);
return 1;
}
```
![](https://i.imgur.com/w7WveKO.png)
找出 MSB 後有效位並對 1 求補數,利用 `1 << (32 - __builtin_clz(N)) - 1` 製造有效位 Mask 並求剩下的數。
```cpp
#define SWAP(a, b) ((a) ^= (b), (b) ^= (a), (a) ^= (b))
int firstMissingPositive(int* nums, int numsSize){
for(int i = 0; i < numsSize; i++){
if(nums[i] > 0 && nums[i] < numsSize && nums[i] != nums[nums[i] - 1]){
int j = nums[i] - 1;
SWAP(nums[i], nums[j]);
i--;
}
}
for(int i = 0; i < numsSize; i++){
if(nums[i] != i + 1)
return i + 1;
}
return numsSize + 1;
}
```
![](https://i.imgur.com/5SL8iIl.png)
參考[SwappingValuesXOR](http://graphics.stanford.edu/~seander/bithacks.html#SwappingValuesXOR)
## 3.
針對 [LeetCode 1342. Number of Steps to Reduce a Number to Zero](https://leetcode.com/problems/number-of-steps-to-reduce-a-number-to-zero/),我們可運用 gcc 內建函式實作,假設 int 寬度為 32 位元,程式碼列表如下:
```cpp
int numberOfSteps (int num)
{
return num ? AAA + 31 - BBB : 0;
}
```
> `__builtin_popcount(x)`: 計算 x 的二進位表示中,總共有幾個 `1`
參考題二得知對應的`__builtin_clz(num)`為計算 MSB 開始到第一個 1 之間 0 的個數
透過`32 - __builtin_clz(num)`求得 MSB 與 0 的距離,再以 `__builtin_popcount(x) - 1` 求得 MSB 以外的 1 與 0 的距離,因此 AAA = `__builtin_popcount(num)`,BBB = `__builtin_clz(num)`
### 延伸問題
```cpp
int numberOfSteps (int num)
{
return __builtin_popcount(num) + 31 - __builtin_clz(num | 1);
}
```
![](https://i.imgur.com/qkdTj9Y.png)
於 `__builtin_clz(num | 1)` 針對 LSB 為 0 的情況加 1 ,而避免了傳入 0
```cpp
int popcount(int num){
/*__builtin_popcount(num)*/
num = (num & 0x55555555) + ((num >> 1) & 0x55555555);
num = (num & 0x33333333) + ((num >> 2) & 0x33333333);
num = (num & 0x0F0F0F0F) + ((num >> 4) & 0x0F0F0F0F);
num = (num & 0x00FF00FF) + ((num >> 8) & 0x00FF00FF);
num = (num & 0x0000FFFF) + ((num >> 16) & 0x0000FFFF);
return num;
}
int clz(int v){
/*31 - __builtin_clz(num | 1)*/
unsigned int r;
unsigned int shift;
r = (v > 0xFFFF) << 4; v >>= r;
shift = (v > 0xFF ) << 3; v >>= shift; r |= shift;
shift = (v > 0xF ) << 2; v >>= shift; r |= shift;
shift = (v > 0x3 ) << 1; v >>= shift; r |= shift;
r |= (v >> 1);
return r;
}
int numberOfSteps (int num)
{
return popcount(num) + clz(num | 1);
}
```
![](https://i.imgur.com/g5Czm3q.png)
每 `byte` 切割輸入值最終加總為 `popcount` ,而下方 `clz` 則可以計算 `__builtin_clz` 的相反數值
參考[Find the log base 2 of an N-bit integer in O(lg(N)) operations](http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog)
## 4.
考慮以下 64-bit GCD (greatest common divisor, 最大公因數) 求值函式:
```cpp
#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v) {
if (!u || !v) return u | v;
while (v) {
uint64_t t = v;
v = u % v;
u = t;
}
return u;
}
```
改寫為以下等價實作:
```cpp
#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v) {
if (!u || !v) return u | v;
int shift;
for (shift = 0; !((u | v) & 1); shift++) {
u /= 2, v /= 2;
}
while (!(u & 1))
u /= 2;
do {
while (!(v & 1))
v /= 2;
if (u < v) {
v -= u;
} else {
uint64_t t = u - v;
u = v;
v = t;
}
} while (XXX);
return YYY;
}
```
上述程式就是在實作==輾轉相除法==,首先透過 `shift` 判斷當 `u` 和 `v` 都是偶數時,不斷除以 2 直到一方變成奇數,再對 `u` 單獨除到變為奇數,接著以不斷將兩數中較小的數置於 `u` 不斷以 `v -= u` 對 `v` 取餘,因此最終XXX = `(b) v` YYY = `(e) u << shift`
以上 GCD 加速方法可以參考[Binary GCD](https://hackmd.io/@sysprog/gcd-impl#Binary-GCD)利用
> binary gcd 的精神就是當兩數為偶數時,必有一 $2$ 因子。
$$\text{gcd}(x, y) = 2·\text{gcd}(\frac{x}{2}, \frac{y}{2})$$
且一數為奇數另一數為偶數,則無 $2$ 因子。
$$\text{gcd}(x, y) = \text{gcd}(x, \frac{y}{2})$$
於是我們可以改良為:
$$ \text{even_factor} = \text{min}(\text{ctz}(x), \text{ctz}(y))$$
$$\text{gcd}(x, y) = \text{even_factor}·\text{gcd}((x\gg \text{even_factor}), (y\gg \text{even_factor}))$$
透過 `shift` 加速 GCD 運算,最終於 `return u << shift` 補回先前右移的數值
### 延伸問題
```cpp
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v) {
unsigned t = u | v;
if (!u || !v)
return t;
while (u) {
u >>= __builtin_ctz(u);
v >>= __builtin_ctz(v);
if (u >= v)
u = (u - v) / 2;
else
v = (v - u) / 2;
}
return (v << __builtin_ctz(t));
}
```
透過 `__builtin_ctz` 改寫除 2 部分加速整體運算速度,然而透過 `perf stat` 指令查看兩者在處理==極大的值==時消耗的時間相差甚少,因此於 `main` 新增
```cpp
for (int i = 0; i < 0xFFF; i+=2) {
for (int j = 0; j < 0xFFF; j+=2)
gcd64(i, j);
}
```
以大量==偶數==數據帶入運算針對 `__builtin_ctz` 與題目所提供的 `shift` 方法比較計算速度
```bash
Performance counter stats for './ex_builtin_ctz':
218.58 msec task-clock # 0.999 CPUs utilized
1 context-switches # 0.005 K/sec
0 cpu-migrations # 0.000 K/sec
49 page-faults # 0.224 K/sec
9,0579,4689 cycles # 4.144 GHz
6,9513,1948 instructions # 0.77 insn per cycle
1,2077,3811 branches # 552.541 M/sec
1246,1088 branch-misses # 10.32% of all branches
0.218806228 seconds time elapsed
0.218822000 seconds user
0.000000000 seconds sys
```
```bash
Performance counter stats for './ex_original':
415.57 msec task-clock # 0.999 CPUs utilized
1 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
48 page-faults # 0.116 K/sec
17,2396,2751 cycles # 4.148 GHz
11,1716,9482 instructions # 0.65 insn per cycle
2,8125,5819 branches # 676.788 M/sec
3811,1653 branch-misses # 13.55% of all branches
0.415865646 seconds time elapsed
0.415875000 seconds user
0.000000000 seconds sys
```
才能得到較為顯著的測試結果,採用 `__builtin_ctz` 明顯減少了 `CPU cycles` , `fetch instructions` 和 `branch` 的次數,得到效能的提升。這裡值得注意的是如果使用 `printf`將得到的數值印出,將會消耗大量的時間,因此強烈不推薦使用。
## 5.
在影像處理中,[bit array](https://en.wikipedia.org/wiki/Bit_array) (也稱 `bitset`) 廣泛使用,考慮以下程式碼:
```cpp
#include <stddef.h>
size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
for (size_t k = 0; k < bitmapsize; ++k) {
uint64_t bitset = bitmap[k];
size_t p = k * 64;
for (int i = 0; i < 64; i++) {
if ((bitset >> i) & 0x1)
out[pos++] = p + i;
}
}
return pos;
}
```
可用 clz 改寫為效率更高且等價的程式碼:
```cpp
#include <stddef.h>
size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
size_t pos = 0;
uint64_t bitset;
for (size_t k = 0; k < bitmapsize; ++k) {
bitset = bitmap[k];
while (bitset != 0) {
uint64_t t = KKK;
int r = __builtin_ctzll(bitset);
out[pos++] = k * 64 + r;
bitset ^= t;
}
}
return pos;
}
```
上述程式碼相互對應後 `int r = __builtin_ctzll` 取代 `if ((bitset >> i) & 0x1)` 的功能,但改寫後的程式碼對 `out[pos++]` 賦值後還必須 clear bitset 的 LSB 以確保下個 Loop 的正確運行
> 根據題三提到 $-n=∼n+1$
利用 `bitset & -bitset` 留下 LSB 再於 `bitset ^= t` 清除 LSB
因此 KKK = `(e) bitset & -bitset`