2022q1 Homework2 (quiz2)

contributed by < KJay221 >

實驗環境

$ gcc --version
gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz
    CPU family:          6
    Model:               140
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            1
    CPU max MHz:         4800.0000
    CPU min MHz:         400.0000
    BogoMIPS:            6604.80
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse ss
                         e2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopol
                         ogy nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sd
                         bg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdran
                         d lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhan
                         ced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rd
                         t_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512
                         vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_windo
                         w hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bit
                         alg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear flush_l1d arch_capabiliti
                         es
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   192 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    5 MiB (4 instances)
  L3:                    12 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
  Srbds:                 Not affected
  Tsx async abort:       Not affected

目前解不了的問題(文中黃色區塊標記)

第一大題： >30問題
第一大題：延伸例子程式解讀
第五大題：相關例子尋找
7-3 7-4 不太會寫

測驗 `1`

考慮以下對二個無號整數取平均值的程式碼：
A

#include <stdint.h>
uint32_t average(uint32_t a, uint32_t b)
{
    return (a + b) / 2;
}

解決 overflow 的問題：
B

#include <stdint.h>
uint32_t average(uint32_t low, uint32_t high)
{
    return low + (high - low) / 2;
}

改寫為以下等價的實作：
C

#include <stdint.h>
uint32_t average(uint32_t a, uint32_t b)
{
    return (a >> 1) + (b >> 1) + (EXP1);
}

再次改寫：
D

uint32_t average(uint32_t a, uint32_t b)
{
    return (EXP2) + ((EXP3) >> 1);
}

EXP1 = a & b & 1

EXP2 = a & b

EXP3 = a ^ b

1.解釋上述程式碼運作的原理

B段程式：

$l o w + \frac{h i g h - l o w}{2} = \frac{l o w}{2} + \frac{h i g h}{2} = \frac{l o w + h i g h}{2}$
上述為取平均之公式
C段程式：因為當兩數都為奇數，且都右移一位時，總共會少二，所以在取平均時就會少一，所以就要先測試兩數是否為奇數(a & 1 和 b & 1)，若是則a & b & 1 = 1，再加回去剛好就是答案
例：3 和 7 皆為奇數所以
$(3 + 7) / 2 = ⌊ 3 / 2 ⌋ + ⌊ 7 / 2 ⌋ + 1$
D段程式：參考你所不知道的 C 語言：數值系統篇
將加法器的邏輯轉換成程式，x & y為進位，x ^ y為低位數，又取平均為
(x + y) >> 1
= (((x & y) << 1) + (x ^ y)) >> 1
= (x & y) + ((x ^ y) >> 1)

2.比較上述實作在編譯器最佳化開啟的狀況，對應的組合語言輸出，並嘗試解讀

執行: gcc -O3 -S q1.c

#include <stdint.h>

uint32_t average_1(uint32_t a, uint32_t b)
{
    return (a >> 1) + (b >> 1) + (a & b &1);
}

uint32_t average_2(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}

int main(){}

可得assembly code

C段程式：

movl	%edi, %eax    //參數移至暫存器
movl	%esi, %edx    //參數移至暫存器
andl	%esi, %edi    //a & b
shrl	%eax          //a >> 1  
shrl	%edx          //b >> 1
andl	$1, %edi      //a & b & 1
addl	%edx, %eax    //(a >> 1) + (b >> 1)
addl	%edi, %eax    //(a >> 1) + (b >> 1) + (a & b & 1)

D段程式：

movl	%edi, %eax    //參數移至暫存器
andl	%esi, %edi    //a & b
xorl	%esi, %eax    //a ^ b
shrl	%eax          //(a ^ b) >> 1
addl	%edi, %eax    //(a & b) + ((x ^ y) >> 1)

D段程式碼省了3行指令，而且只需要用到一個暫存器

3.研讀 Linux 核心原始程式碼 include/linux/average.h，探討其 Exponentially weighted moving average (EWMA) 實作，以及在 Linux 核心的應用

首先探討 Exponentially weighted moving average 是什麼，根據 wiki 上的解釋， EWMA 是以指數式遞減加權的移動平均，越舊的資料加權越小並以指數程度遞減，加權程度以常數 α 決定， α 介於 0 到 1，計算公式如下：

E W M A_{t} = α \times p_{t} + (1 - α) \times E W M A_{t - 1}

然後將遞迴展開

E W M A_{t} = α \times [p_{t} + (1 - α) p_{t - 1} + (1 - α)^{2} p_{t - 2} + (1 - α)^{3} p_{t - 3} + . . .]

因為 1 - α < 0，可以看出越舊的資料權重越小

接著再來看 linux 程式碼

/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_AVERAGE_H
#define _LINUX_AVERAGE_H

#include <linux/bug.h>
#include <linux/compiler.h>
#include <linux/log2.h>

/*
 * Exponentially weighted moving average (EWMA)
 *
 * This implements a fixed-precision EWMA algorithm, with both the
 * precision and fall-off coefficient determined at compile-time
 * and built into the generated helper funtions.
 *
 * The first argument to the macro is the name that will be used
 * for the struct and helper functions.
 *
 * The second argument, the precision, expresses how many bits are
 * used for the fractional part of the fixed-precision values.
 *
 * The third argument, the weight reciprocal, determines how the
 * new values will be weighed vs. the old state, new values will
 * get weight 1/weight_rcp and old values 1-1/weight_rcp. Note
 * that this parameter must be a power of two for efficiency.
 */

#define DECLARE_EWMA(name, _precision, _weight_rcp)			\
	struct ewma_##name {						\
		unsigned long internal;					\
	};								\
	static inline void ewma_##name##_init(struct ewma_##name *e)	\
	{								\
		BUILD_BUG_ON(!__builtin_constant_p(_precision));	\
		BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp));	\
		/*							\
		 * Even if you want to feed it just 0/1 you should have	\
		 * some bits for the non-fractional part...		\
		 */							\
		BUILD_BUG_ON((_precision) > 30);			\
		BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp);		\
		e->internal = 0;					\
	}								\
	static inline unsigned long					\
	ewma_##name##_read(struct ewma_##name *e)			\
	{								\
		BUILD_BUG_ON(!__builtin_constant_p(_precision));	\
		BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp));	\
		BUILD_BUG_ON((_precision) > 30);			\
		BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp);		\
		return e->internal >> (_precision);			\
	}								\
	static inline void ewma_##name##_add(struct ewma_##name *e,	\
					     unsigned long val)		\
	{								\
		unsigned long internal = READ_ONCE(e->internal);	\
		unsigned long weight_rcp = ilog2(_weight_rcp);		\
		unsigned long precision = _precision;			\
									\
		BUILD_BUG_ON(!__builtin_constant_p(_precision));	\
		BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp));	\
		BUILD_BUG_ON((_precision) > 30);			\
		BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp);		\
									\
		WRITE_ONCE(e->internal, internal ?			\
			(((internal << weight_rcp) - internal) +	\
				(val << precision)) >> weight_rcp :	\
			(val << precision));				\
	}

#endif /* _LINUX_AVERAGE_H */

可以看出整段程式都是用 macro 撰寫，另外一共有三個 argument 功用分別為

name：主要影響 struct 及 function 的名子，像是當 name 為 price 時 ewma_##name 會變成 ewma_price，## 的用法參考你所不知道的 C 語言：前置處理器應用篇
_precision ：控制精度的一個參數，像是在 ewma_##name##_read 的return 值 e->internal >> (_precision) 利用 right shift _precision 個 bits 來控制回傳值的精度
_weight_rcp：加權程度常數，也就是上述數學式中的 α 取倒數，另外因為效能因素，此參數只能用 2 的冪，在程式中則是用 BUILD_BUG_ON_NOT_POWER_OF_2 來檢查參數是否為 2 的冪

接著一段一段看

struct ewma_##name {						\
    unsigned long internal;					\
};								\

此段程式依照 name 的內容來宣告 struct

	static inline void ewma_##name##_init(struct ewma_##name *e)	\
	{								\
		BUILD_BUG_ON(!__builtin_constant_p(_precision));	\
		BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp));	\
		/*							\
		 * Even if you want to feed it just 0/1 you should have	\
		 * some bits for the non-fractional part...		\
		 */							\
		BUILD_BUG_ON((_precision) > 30);			\
		BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp);		\
		e->internal = 0;					\
	}

此段程式用於初始化上述 struct 結構變數，只要傳入指標就能進行初始化，另外此處用到許多 BUILD_BUG_ON() ，參照 linux/include/linux/build_bug.h 第 41~50 行中的註解 break compile if a condition is true，在這裡檢查了 _precision 和 _weight_rcp 兩個參數是否為常數，並確保 _weight_rcp 為 2 的冪，若不是則會在編譯時出錯，最後再將 struct 中的 internal 初始化為 0

	static inline unsigned long					\
	ewma_##name##_read(struct ewma_##name *e)			\
	{								\
		BUILD_BUG_ON(!__builtin_constant_p(_precision));	\
		BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp));	\
		BUILD_BUG_ON((_precision) > 30);			\
		BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp);		\
		return e->internal >> (_precision);			\
	}								\

此段則是輸入 struct 指標後回傳其 internal 值，首先檢查各項條件，若通過則利用 >> (_precision) 來控制回傳精度並回傳 internal 值

	static inline void ewma_##name##_add(struct ewma_##name *e,	\
					     unsigned long val)		\
	{								\
		unsigned long internal = READ_ONCE(e->internal);	\
		unsigned long weight_rcp = ilog2(_weight_rcp);		\
		unsigned long precision = _precision;			\
									\
		BUILD_BUG_ON(!__builtin_constant_p(_precision));	\
		BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp));	\
		BUILD_BUG_ON((_precision) > 30);			\
		BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp);		\
									\
		WRITE_ONCE(e->internal, internal ?			\
			(((internal << weight_rcp) - internal) +	\
				(val << precision)) >> weight_rcp :	\
			(val << precision));				\
	}

最後這段的功能則是加入一個新的數 val 並將 EWMA 更新，其中 READ_ONCE() 和 WRITE_ONCE() 都是要確保 Cache coherence，前三行是將算式所需要的變數處理好，接著再檢查條件，最後在 WRITE_ONCE() 將值更新

若 internal 為 0 則將 internal 更新為 val 的值
若 internal 為非 0 ，則利用公式
$E W M A_{t} = α \times p_{t} + (1 - α) \times E W M A_{t - 1}$ 來更新值，先將程式拆解來看
(((internal << weight_rcp) - internal) + (val << precision)) >> weight_rcp = ((internal << weight_rcp) - internal) >> weight_rcp + (val << precision) >> weight_rcp
(val << precision) >> weight_rcp 對照
$α \times p_{t}$
((internal << weight_rcp) - internal) >> weight_rcp = internal - (internal >> weight_rcp) 對照
$(1 - α) \times E W M A_{t - 1} = E W M A_{t - 1} - α \times E W M A_{t - 1}$

上述程式中

BUILD_BUG_ON((_precision) > 30);

為什麼要用 30???

linux 核心的應用

先將 linux 核心整個 clone 下來，然後用 grep -nr 'ewma_' . 找有用到的地方

grep 用法參考：
- How can I use grep to find a word inside a folder?
- Linux grep 命令

或是利用 ag

ag 用法參考：
- Linux ag 搜尋字串用法與範例(比 grep 還快)
- Search for files & file names using silver searcher
- 搜尋例子 ag 'ewma_'
- 特定字串特定檔案、目錄名稱搜尋例子 ag -G wireless 'ewma_'

mediatek wifi chip driver

無線網路因為訊號強度會不停改變，所以可以用 ewma 來計算評估訊號強度，將過去一段時間訊號強度乘上不同的權值來代表訊號強度，另外在電信領域 RSSI 是一種量測接收訊號強度的指標，可以用 ewma 計算後的結果當作此指標的依據，以下在是 wifi chip driver 運用 ewma 和 rssi 的例子

linux/drivers/net/wireless/mediatek/mt7601u/phy.c

static void mt7601u_agc_tune(struct mt7601u_dev *dev)
{
	u8 val = mt7601u_agc_default(dev);
	long avg_rssi;

	if (test_bit(MT7601U_STATE_SCANNING, &dev->state))
		return;

	/* Note: only in STA mode and not dozing; perhaps do this only if
	 *	 there is enough rssi updates since last run?
	 *	 Rssi updates are only on beacons and U2M so should work...
	 */
	spin_lock_bh(&dev->con_mon_lock);
	avg_rssi = ewma_rssi_read(&dev->avg_rssi);
	spin_unlock_bh(&dev->con_mon_lock);
	if (avg_rssi == 0)
		return;
    
        //依照不同信號大小對 agc 進行調整
	avg_rssi = -avg_rssi;
	if (avg_rssi <= -70)
		val -= 0x20;
	else if (avg_rssi <= -60)
		val -= 0x10;

	if (val != mt7601u_bbp_rr(dev, 66))
		mt7601u_bbp_wr(dev, 66, val);

	/* TODO: also if lost a lot of beacons try resetting
	 *       (see RTMPSetAGCInitValue() call in mlme.c).
	 */
}

以上程式依照 ewma_rssi 的大小修改 val 的值
再看到前面宣告 u8 val = mt7601u_agc_default(dev); 可以知道 val 是一個與 agc 有關的值，參考 Automatic gain control 和 AGC的作用（自动增益控制），簡單來說 agc 是針對不同信號強度使用不同程度的放大，達到最終信號輸出維持一定範圍內大小，所以上面程式就依照 rssi 大小來調整 agc 的值

        /* Start/stop collecting beacon data */
	spin_lock_bh(&dev->con_mon_lock);
	ether_addr_copy(dev->ap_bssid, info->bssid);
	ewma_rssi_init(&dev->avg_rssi);
	dev->bcn_freq_off = MT_FREQ_OFFSET_INVALID;
	spin_unlock_bh(&dev->con_mon_lock);

在 Start/stop collecting beacon data 時將 ewma 值初始化

linux/drivers/net/wireless/mediatek/mt7601u/mac.c

u32 mt76_mac_process_rx(struct mt7601u_dev *dev, struct sk_buff *skb,
			u8 *data, void *rxi)
{
	struct ieee80211_rx_status *status = IEEE80211_SKB_RXCB(skb);
	struct mt7601u_rxwi *rxwi = rxi;
	u32 len, ctl = le32_to_cpu(rxwi->ctl);
	u16 rate = le16_to_cpu(rxwi->rate);
	int rssi;

	len = FIELD_GET(MT_RXWI_CTL_MPDU_LEN, ctl);
	if (len < 10)
		return 0;

	if (rxwi->rxinfo & cpu_to_le32(MT_RXINFO_DECRYPT)) {
		status->flag |= RX_FLAG_DECRYPTED;
		status->flag |= RX_FLAG_MMIC_STRIPPED;
		status->flag |= RX_FLAG_MIC_STRIPPED;
		status->flag |= RX_FLAG_ICV_STRIPPED;
		status->flag |= RX_FLAG_IV_STRIPPED;
	}
	/* let mac80211 take care of PN validation since apparently
	 * the hardware does not support it
	 */
	if (rxwi->rxinfo & cpu_to_le32(MT_RXINFO_PN_LEN))
		status->flag &= ~RX_FLAG_IV_STRIPPED;

	status->chains = BIT(0);
	rssi = mt7601u_phy_get_rssi(dev, rxwi, rate);
	status->chain_signal[0] = status->signal = rssi;
	status->freq = dev->chandef.chan->center_freq;
	status->band = dev->chandef.chan->band;

	mt76_mac_process_rate(status, rate);

	spin_lock_bh(&dev->con_mon_lock);
	if (mt7601u_rx_is_our_beacon(dev, data))
		mt7601u_rx_monitor_beacon(dev, rxwi, rate, rssi);
	else if (rxwi->rxinfo & cpu_to_le32(MT_RXINFO_U2M))
		ewma_rssi_add(&dev->avg_rssi, -rssi);
	spin_unlock_bh(&dev->con_mon_lock);

	return len;
}

mt7601u_rx_monitor_beacon(struct mt7601u_dev *dev, struct mt7601u_rxwi *rxwi,
			  u16 rate, int rssi)
{
	dev->bcn_freq_off = rxwi->freq_off;
	dev->bcn_phy_mode = FIELD_GET(MT_RXWI_RATE_PHY, rate);
	ewma_rssi_add(&dev->avg_rssi, -rssi);
}

rx tx 的意思

某些情況下更新 ewma_rssi 值，目前能力不足無法解讀前後程式意思

mt76_mac_process_rx 函式的功用？
為什麼要在 mt76_mac_process_rx 函式改 ewma_rssi 的值
rxwi 變數意思

Linux DRM

DRM 是 linux 的圖形顯示框架，相關內容參考

drm_self_refresh_helper.c 中大量用到 ewma，主要是用於 psr_time

psr 技術主要是用於省電，一般來說螢幕需要靠 GPU 不斷運算來刷新影像，所以像是在玩 3D 遊戲時電量消耗較大，反之在瀏覽網頁或閱讀電子書時不需要一直刷新螢幕，這時候就要靠 psr 技術將之前儲存的數據顯示於螢幕上，這樣就能減少 GPU 運算，從而達到省電目的，以上參考Tcon的省电功能PSR(panel self refresh)简介

DECLARE_EWMA(psr_time, 4, 4)

struct drm_self_refresh_data {
	struct drm_crtc *crtc;
	struct delayed_work entry_work;

	struct mutex avg_mutex;
	struct ewma_psr_time entry_avg_ms;
	struct ewma_psr_time exit_avg_ms;
};

void
drm_self_refresh_helper_update_avg_times(struct drm_atomic_state *state,
					 unsigned int commit_time_ms,
					 unsigned int new_self_refresh_mask)
{
	struct drm_crtc *crtc;
	struct drm_crtc_state *old_crtc_state;
	int i;

	for_each_old_crtc_in_state(state, crtc, old_crtc_state, i) {
		bool new_self_refresh_active = new_self_refresh_mask & BIT(i);
		struct drm_self_refresh_data *sr_data = crtc->self_refresh_data;
		struct ewma_psr_time *time;

		if (old_crtc_state->self_refresh_active ==
		    new_self_refresh_active)
			continue;

		if (new_self_refresh_active)
			time = &sr_data->entry_avg_ms;
		else
			time = &sr_data->exit_avg_ms;

		mutex_lock(&sr_data->avg_mutex);
		ewma_psr_time_add(time, commit_time_ms);
		mutex_unlock(&sr_data->avg_mutex);
	}
}

void drm_self_refresh_helper_alter_state(struct drm_atomic_state *state)
{
	struct drm_crtc *crtc;
	struct drm_crtc_state *crtc_state;
	int i;

	if (state->async_update || !state->allow_modeset) {
		for_each_old_crtc_in_state(state, crtc, crtc_state, i) {
			if (crtc_state->self_refresh_active) {
				state->async_update = false;
				state->allow_modeset = true;
				break;
			}
		}
	}

	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
		struct drm_self_refresh_data *sr_data;
		unsigned int delay;

		/* Don't trigger the entry timer when we're already in SR */
		if (crtc_state->self_refresh_active)
			continue;

		sr_data = crtc->self_refresh_data;
		if (!sr_data)
			continue;

		mutex_lock(&sr_data->avg_mutex);
		delay = (ewma_psr_time_read(&sr_data->entry_avg_ms) +
			 ewma_psr_time_read(&sr_data->exit_avg_ms)) * 2;
		mutex_unlock(&sr_data->avg_mutex);

		mod_delayed_work(system_wq, &sr_data->entry_work,
				 msecs_to_jiffies(delay));
	}
}

int drm_self_refresh_helper_init(struct drm_crtc *crtc)
{
	struct drm_self_refresh_data *sr_data = crtc->self_refresh_data;

	/* Helper is already initialized */
	if (WARN_ON(sr_data))
		return -EINVAL;

	sr_data = kzalloc(sizeof(*sr_data), GFP_KERNEL);
	if (!sr_data)
		return -ENOMEM;

	INIT_DELAYED_WORK(&sr_data->entry_work,
			  drm_self_refresh_helper_entry_work);
	sr_data->crtc = crtc;
	mutex_init(&sr_data->avg_mutex);
	ewma_psr_time_init(&sr_data->entry_avg_ms);
	ewma_psr_time_init(&sr_data->exit_avg_ms);

	/*
	 * Seed the averages so they're non-zero (and sufficiently large
	 * for even poorly performing panels). As time goes on, this will be
	 * averaged out and the values will trend to their true value.
	 */
	ewma_psr_time_add(&sr_data->entry_avg_ms, SELF_REFRESH_AVG_SEED_MS);
	ewma_psr_time_add(&sr_data->exit_avg_ms, SELF_REFRESH_AVG_SEED_MS);

	crtc->self_refresh_data = sr_data;
	return 0;
}

entry_avg_ms 和 exit_avg_ms 變數意義？
delay 算法意義？

測驗 `2`

Branchless Max

#include <stdint.h>
uint32_t max(uint32_t a, uint32_t b)
{
    return a ^ ((EXP4) & -(EXP5));
}

EXP4 = a ^ b

EXP5 = a < b

1.解釋上述程式碼運作的原理

若 a > b 則 -(a < b) 的值為0
a ^ ((a ^ b) & -(a < b)) = a ^ 0 = a
若 a < b 則 -(a < b) 的值為-1 (0xffffffff)
a ^ ((a ^ b) & -(a < b)) = a ^ (a ^ b) = a ^ a ^ b = 0 ^ b = b

2.針對 32 位元無號/有號整數，撰寫同樣 branchless 的實作

int max(int a, int b)
{
    return a ^ ((a ^ b) & -(a < b));
}

無號數的問題即為上述的問題
有號數的作法與無號數相同，在 -(a < b) 這個運算中， a 與 b 的正負值並不會影響結果，結果只與大小有關，又觀察上面程式結果
a > b => a ^ ((a ^ b) & 0) = a ^ 0 = a
a < b => a ^ ((a ^ b) & -1) = a ^ (a ^ b) = a ^ a ^ b = 0 ^ b = b
可以發現 bitwise 的運算結果與正負號無關，因此有號數/無號數用相同方法實做即可

3.Linux 核心也有若干 branchless / branch-free 的實作，例如 lib/sort.c，請在 Linux 核心原始程式碼中，找出更多類似的實作手法，請善用 git log 檢索

/*
 * Logically, we're doing "if (i & lsbit) i -= size;", but since the
 * branch is unpredictable, it's done with a bit of clever branch-free
 * code instead.
 */
__attribute_const__ __always_inline
static size_t parent(size_t i, unsigned int lsbit, size_t size)
{
    i -= size;
    i -= size & -(i & lsbit);
    return i / 2;
}

if (i & lsbit) i -= size; 改寫成 i -= size & -(i & lsbit);
當 (i & lsbit) 為 true 的時候 i -= size & -(i & lsbit) = i -= size & -1 = i -= size
當 (i & lsbit) 為 false 的時候 i -= size & -(i & lsbit) = i -= size & 0 = i -= 0

找其它案例

先用 git log --oneline --grep branchless 和 git log --oneline --grep branch-free 找到符合的 commit hash 再用 git show hash 來查看紀錄，或是在 github 上查看比較清楚

這邊探討 Replace int2float() with an optimized version. 案例

此 function 的功能是將 unsigned integer 轉換成 32-bit IEEE floating point representation

原版：

uint32_t int2float(uint32_t input)
{
	u32 result, i, exponent, fraction;

	if ((input & 0x3fff) == 0)
		result = 0; /* 0 is a special case */
	else {
		exponent = 140; /* exponent biased by 127; */
		fraction = (input & 0x3fff) << 10; 
                /* cheat and only handle numbers below 2^^15 */
		for (i = 0; i < 14; i++) {
			if (fraction & 0x800000)
				break;
			else {
				fraction = fraction << 1; 
                                /* keep hifting left until top bit = 1 */
				exponent = exponent - 1;
			}
		}
		result = exponent << 23 | (fraction & 0x7fffff);
                /* mask off top bit; assumed 1 */
	}
	return result;
}

需要用 for 迴圈來確認 exponent 大小，並同時修改 fraction 值，因為 fraction 部份剛好 23 bits ，所以用 fraction & 0x800000 來確認 fraction 部份是否有超出範圍

改進版：

/* 23 bits of float fractional data */
#define I2F_FRAC_BITS  23
#define I2F_MASK ((1 << I2F_FRAC_BITS) - 1)

/*
 * Converts unsigned integer into 32-bit IEEE floating point representation.
 * Will be exact from 0 to 2^24.  Above that, we round towards zero
 * as the fractional bits will not fit in a float.  (It would be better to
 * round towards even as the fpu does, but that is slower.)
 */
uint32_t int2float(uint32_t x)
{
	uint32_t msb, exponent, fraction;

	/* Zero is special */
	if (!x) return 0;

	/* Get location of the most significant bit */
	msb = __fls(x);

	/*
	 * Use a rotate instead of a shift because that works both leftwards
	 * and rightwards due to the mod(32) behaviour.  This means we don't
	 * need to check to see if we are above 2^24 or not.
	 */
	fraction = ror32(x, (msb - I2F_FRAC_BITS) & 0x1f) & I2F_MASK;
	exponent = (127 + msb) << I2F_FRAC_BITS;

	return fraction + exponent;
}

__fls 的功能是找到 most-significant set bit ，這裡用 __fls 就可以省去用 for 迴圈確認 exponent 的部份，只要 127 + __fls(x) ，然後再 shift left 23 就符合浮點數 exponent 的表達

測驗 `3`

64 位元 GCD 求值函式

#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v)
{
    if (!u || !v) return u | v;
    while (v) {                               
        uint64_t t = v;
        v = u % v;
        u = t;
    }
    return u;
}

改寫為以下等價實作:

#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v)
{
    if (!u || !v) return u | v;
    int shift;
    for (shift = 0; !((u | v) & 1); shift++) {
        u /= 2, v /= 2;
    }
    while (!(u & 1))
        u /= 2;
    do {
        while (!(v & 1))
            v /= 2;
        if (u < v) {
            v -= u;
        } else {
            uint64_t t = u - v;
            u = v;
            v = t;
        }
    } while (COND);
    return RET;
}

COND = v
RET = u << shift

1.解釋上述程式運作原理

if (!u || !v) return u | v;

處理

g c d (0, 0) = 0

g c d (x, 0) = x

g c d (0, y) = y

的情況

for (shift = 0; !((u | v) & 1); shift++) {
    u /= 2, v /= 2;
}

將二的冪公因數先除掉以減少後續計算量，並將二的冪公因數資訊存到 shift 中

while (!(u & 1))
    u /= 2;

處理 u 為偶數，v 為奇數的情況，2必定不為公因數

do {
    while (!(v & 1))
        v /= 2;
    if (u < v) {
        v -= u;
    } else {
        uint64_t t = u - v;
        u = v;
        v = t;
    }
} while (v);
return u << shift;

先處理 v 為偶數，u 為奇數的情況，接下來就是做輾轉相除法，若 v 為零則輾轉相除法結束，若為非零則繼續做，最後將結果 u 左移 shift 次用以乘回之前多除的二，然後再將結果 return 即為所求

2.在 x86_64 上透過 __builtin_ctz 改寫 GCD，分析對效能的提升

uint64_t gcd64(uint64_t u, uint64_t v)
{
    if (!u || !v) return u | v;
    int shift = __builtin_ctz(u) ^ ((__builtin_ctz(u) ^ __builtin_ctz(v)) & -(__builtin_ctz(u) > __builtin_ctz(v)));
    u >>= __builtin_ctz(u);
    do {
        v >>= __builtin_ctz(v);
        if (u < v) {
            v -= u;
        } else {
            uint64_t t = u - v;
            u = v;
            v = t;
        }
    } while (v);
    return u << shift;
}

__builtin_ctz(x) 回傳為 x 最低位元有連續幾位是0，像是 10100000 會回傳 5

在第 4 行的地方利用第二題的技巧來 find_min ，找出兩數誰可以被 2 整除最多次，然後再將資訊存在 shift 中，接著用 right shift 將 u 除成奇數，對應原版的 code

    int shift;
    for (shift = 0; !((u | v) & 1); shift++) {
        u /= 2, v /= 2;
    }
    while (!(u & 1))
        u /= 2;

在第 7 行的部份則是用 v >>= __builtin_ctz(v); 來改寫原本除法的部份

效能分析

先測試三個 function 的執行時間，測試程式連結，測試方法為產生 100000 組亂數，測試三個 function 執行完全部資料所需的時間，執行結果如下

time: 0.012528 s     (1)最初版本
time: 0.031859 s     (2)改寫版本
time: 0.019089 s     (3)__builtin_ctz 版本

多執行幾次也得到差不多的結果，我原本以為速度快到慢的排序應該是(3)(2)(1)，沒想到卻是(1)(3)(2)，接著用 perf 分析，將每個函式分開執行測試

perf record ./q3_2_1.out && perf annotate
perf record ./q3_2_2.out && perf annotate
perf record ./q3_2_3.out && perf annotate

(1) 花最多時間在除法上。取餘數在組合語言中也是用除法指令來做，雖然有用到除法，但在其它運算上省下的時間可以彌補除法所花費的時間
(2) 花最多時間在 mov 上，可能是有太多賦值運算的關係
(3) 花最多時間在 mov 上，另外 tzcnt 為 Count the Number of Trailing Zero Bits

但如果開啟編譯器最佳化 gcc -O3 -o q3.out q3.c 則會得到下面結果

time: 0.009081 s     (1)最初版本
time: 0.019232 s     (2)改寫版本
time: 0.008992 s     (3)__builtin_ctz 版本

經過最佳化後，版本(3)的速度會是最快的，接著用 perf 分析

(1) 一樣是花最多時間在除法上
(2) 花在 jump 和 mov 的時間較多，根本問題還是 branch 太多及太多賦值運算
(3) 可以看出花最多時間在 sub 和 tzcnt 和 shr 但這三項都是必要運算

3.Linux 核心中也內建 GCD (而且還不只一種實作)，例如 lib/math/gcd.c，請解釋其實作手法和探討在核心內的應用場景

主要有兩種實做，一種是有 __ffs，一種是沒有的，__ffs 的功能與 __builtin_ctz 是一樣的，首先解釋有 __ffs 的版本：

unsigned long gcd(unsigned long a, unsigned long b)
{
    /* 處理 a 或 b 其中一個為零的情況，邏輯與前面的程式相同
     * r & -r 會是兩數最大二的冪公因數
     */
    unsigned long r = a | b;

	if (!a || !b)
		return r;
    /* 先將 b 除二的冪使其變成奇數，然後再檢查 b 是否為 1
     * 若是代表 b 為二的冪，於是 r & -r 就是答案
     */ 
	b >>= __ffs(b);
	if (b == 1)
		return r & -r;
    //輾轉相除法的部份
	for (;;) {
        //將 a 除二的冪使其變為奇數
		a >>= __ffs(a);
        //a == 1 與前面 b == 1 邏輯相同
		if (a == 1)
			return r & -r;
        //a == b 代表找到兩者最大奇數公因數，將他乘 r 就是兩者最大公因數
		if (a == b)
			return a << __ffs(r);
        //a < b 則兩者交換以保持 a > b 這樣 a - b 才不會是負數
		if (a < b)
			swap(a, b);
		a -= b;
	}
}

a | b 舉例 a = 11010000 , b = 10000100 做 | 的話必定能將兩者最低位 1 保留， a | b = xxxxx100
r & -r 這個操做可以將 r 的最低位 1 保留其它 bits 設為 0
舉例 r = 11011000 -r = 00101000 可以先不看最低位 1 以後的 bit 會比較好理解 r = xxxx1000 -r = xxxx0111 + 1 = xxxx1000 x 的部份會剛好都相反， & 完後都會變成 0

沒有 __ffs 的版本：

unsigned long gcd(unsigned long a, unsigned long b)
{
	unsigned long r = a | b;

	if (!a || !b)
		return r;
    //將最大二的冪公因數存在 r 中
	r &= -r;
    /* 先將 b 一直除二，直到剩下二的冪公因數與共同最大二的冪公因數相同
     * 若是此時 b == r 則 r 即為答案，也代表 b 為偶數
     */ 
	while (!(b & r))
		b >>= 1;
	if (b == r)
		return r;

	for (;;) {
        //將 a 一直除二，直到剩下二的冪公因數與共同最大二的冪公因數相同
		while (!(a & r))
			a >>= 1;
        /* a == b 情況的回傳值不用再 shift 了
         * 因為這裡沒有將最大二的冪公因數除掉，所以不用 left shift 將它補回
         */ 
		if (a == r)
			return r;
		if (a == b)
			return a;
            //保持 a > b 才不會減成負數
		if (a < b)
			swap(a, b);
		a -= b;
		a >>= 1;
		if (a & r)
			a += b;
		a >>= 1;
	}
}

a -= b; 後程式碼參考 Nomad1230 解釋， a -= b; 時 a 與 b 必同為奇數或偶數，所以相減結果必為偶數，用 a >>= 1; 可將相減所產生的 2 除掉，而 if (a & r) 則是防止當除過頭時，則將 b 加回來，最後再執行 a >>= 1;，上述這些步驟的目的是為了減少迴圈數量，實際測試程式，跑 100000 組亂數測試有沒有加上述程式對時間的影響，測試時用 -O3 最佳化編譯結果如下

time: 0.014942 s (有加32~35行的程式)
time: 0.016552 s (沒加32~35行的程式)

linux __ffs 問題

之前看到 vacantron 的測驗三有討論 __ffs 的功能，後來去看原始碼發現 ffs 和 __ffs 是不一樣的

ffs 功能和 compiler 的 __builtin_ffs(x) 是一樣的，都是回傳從低位 bit 數來，第一個 bit 1 是在第幾個 bit ，其值與 __builtin_ctz(x) + 1 相同，相關程式碼在 linux/include/asm-generic/bitops/ffs.h
__ffs 的功能則和 compiler 的 __builtin_ctz(x) 是一樣的，相關程式碼在 linux/include/asm-generic/bitops/__ffs.h

核心內的應用場景

linux/drivers/gpu/drm/radeon/radeon_display.c

static void avivo_reduce_ratio(unsigned *nom, unsigned *den,
			       unsigned nom_min, unsigned den_min)
{
	unsigned tmp;

	/* reduce the numbers to a simpler ratio */
	tmp = gcd(*nom, *den);
	*nom /= tmp;
	*den /= tmp;

	/* make sure nominator is large enough */
	if (*nom < nom_min) {
		tmp = DIV_ROUND_UP(nom_min, *nom);
		*nom *= tmp;
		*den *= tmp;
	}

	/* make sure the denominator is large enough */
	if (*den < den_min) {
		tmp = DIV_ROUND_UP(den_min, *den);
		*nom *= tmp;
		*den *= tmp;
	}
}

上面程式中 gcd 是用來約分分數，在 gpu driver 裡出現

測驗 `4`

在影像處理中，bit array (也稱 bitset) 廣泛使用，考慮以下程式碼:

#include <stddef.h>
// bitmap : input bitmap data 的起始位置
// bitmapsize : 總共有幾個 64 bits 的 bitmap
// out : 拿來存 bitmap set bit position 的 array
size_t naive(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
    size_t pos = 0;
    for (size_t k = 0; k < bitmapsize; ++k) {
        /* 將 bitmap 中當前元素存到 bitset 中 
         * 因為 bitmap 每個元素都為 64bits 所以每個元素起始位置都是 k*64
         */ 
        uint64_t bitset = bitmap[k];
        size_t p = k * 64;
        /* 64bits 都測試一遍，若為 1 則 pos++
         * 另外將每個 1 的位置紀錄到 out array 中
         */ 
        for (int i = 0; i < 64; i++) {
            if ((bitset >> i) & 0x1)
                out[pos++] = p + i;
        }
    }
    return pos;
}

用 __builtin_ctzll 改寫為下列程式碼

#include <stddef.h>
size_t improved(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
    size_t pos = 0;
    uint64_t bitset;
    for (size_t k = 0; k < bitmapsize; ++k) {
        bitset = bitmap[k];
        while (bitset != 0) {
            /* -bitset & bitset 可保留從低位數來第一個 1，其餘 set to zero
             * 上一題 Linux GCD 中也有相同的手法(r &= -r)
             * __builtin_ctzll(bitset) 可得知 1 位置資訊
             * bitset ^= t 則是將剛測到的 1 變成 0 以免重複測到
             */
            uint64_t t = EXP6;
            int r = __builtin_ctzll(bitset);
            out[pos++] = k * 64 + r;
            bitset ^= t;                           
        }
    }
    return pos;
}

EXP6 = -bitset & bitset

1.解釋上述程式運作原理，並舉出這樣的程式碼用在哪些真實案例中

主要的功能是將 bitmap 中 1 的總數記在 pos 中，並將每個 1 的位子分別記在 out array 中，其餘解釋打在程式的 comment 裡，實際跑一次會比較清楚到底這個 function 在做什麼

真實案例

目前找不到關鍵字去核心尋找到相關程式

2.設計實驗，檢驗 ctz/clz 改寫的程式碼相較原本的實作有多少改進？應考慮到不同的 bitmap density

實驗程式碼，測試不同資料量及不同 bitmap density 兩函式所需要的執行時間，實驗結果如下

bitmapsize = 1000

bitmapsize = 10000

bitmapsize = 100000

圖片繪製，先將 code clone 下來然後執行：

make plot size=10000 code=q4_2

從上面的圖可發現改進版的整體效能比原版的還高，但有兩個特別的點

原版函式當 bitmap density 靠近中間值時所需的時間也越多，反而 bitmap density 越高所需的時間也越少，初步推測是因為 branch prediction 的問題，後來實際去用 perf 去測試相同資料集跑 100 次得到下列結果

bitmap density = 64

bitmap density = 32

可以看出在 bitmap density = 32 時 branch-miss 是較高的，推測是因為每個 bit 是 set bit 的機率是 50% 所以 CPU 很難猜下個 bit 是不是 set bit ， branch-miss 的比例自然也高，但在 bitmap density = 64 的狀況下，每個 bit 都是 set bit，所以 CPU 自然會猜下一個 bit 為 set bit ，所以 branch-miss 的比例就會很低，而 branch-miss 越多的情況自然會降低效能的表現
改進版的隨 bitmap density 上升所需的時間也跟著成長，觀察 code 可發現當 density 越高每次呼叫函式所需要跑的迴圈次數越多，且 __builtin_ctzll 的 latency 較高，每次迴圈所需的時間較長，所以當 density 越高與原版的時間差距也越小，反之 bitmap density 越低，使用 __builtin_ctzll 的效益越高

3.思考進一步的改進空間

在上面的測試中可發現 bitmap density 越低，使用 __builtin_ctzll 的效益越高，當 bitmap density 越高用 __builtin_ctzll 反而優勢就消失了，所以可以從高 bitmap density 的情況下去改進，參考 qwe661234 的想法和程式，改進 naive 版本一次迴圈只測一個 bit 的缺點，變成一次迴圈測兩個 bits ，並用 bitwise 來改寫一些算式

size_t my_improve_1(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
    size_t pos = 0;
    uint64_t bitset;
    for (size_t k = 0; k < bitmapsize; ++k) {
        bitset = bitmap[k];
        size_t p = k << 6;
        for (int i = 0; i < 64; i+=2) {
            switch ((bitset >> i) & 0x3){
                case 1:
                    out[pos++] = p | i;
                    break;
                case 2:
                    out[pos++] = p | (i + 1);
                    break;
                case 3:
                    out[pos++] = p | i;
                    out[pos++] = p | (i + 1);
                    break;
            }              
        }
    }
    return pos;
}

另外也有寫一次迴圈測四個 bits 的程式

完整程式

size_t my_improve_2(uint64_t *bitmap, size_t bitmapsize, uint32_t *out)
{
    size_t pos = 0;
    uint64_t bitset;
    for (size_t k = 0; k < bitmapsize; ++k) {
        bitset = bitmap[k];
        size_t p = k << 6;
        for (int i = 0; i < 64; i+=4) {
            switch ((bitset >> i) & 0xf){
                case 1:
                    out[pos++] = p | i;
                    break;
                case 2:
                    out[pos++] = p | (i + 1);
                    break;
                case 3:
                    out[pos++] = p | i;
                    out[pos++] = p | (i + 1);
                    break;
                case 4:
                    out[pos++] = p | (i + 2);
                    break;
                case 5:
                    out[pos++] = p | i;
                    out[pos++] = p | (i + 2);
                    break;
                case 6:
                    out[pos++] = p | (i + 1);
                    out[pos++] = p | (i + 2);
                    break;
                case 7:
                    out[pos++] = p | i;
                    out[pos++] = p | (i + 1);
                    out[pos++] = p | (i + 2);
                    break;
                case 8:
                    out[pos++] = p | (i + 3);
                    break;
                case 9:
                    out[pos++] = p | i;
                    out[pos++] = p | (i + 3);
                    break;
                case 10:
                    out[pos++] = p | (i + 1);
                    out[pos++] = p | (i + 3);
                    break;
                case 11:
                    out[pos++] = p | i;
                    out[pos++] = p | (i + 1);
                    out[pos++] = p | (i + 3);
                    break;
                case 12:
                    out[pos++] = p | (i + 2);
                    out[pos++] = p | (i + 3);
                    break;
                case 13:
                    out[pos++] = p | i;
                    out[pos++] = p | (i + 2);
                    out[pos++] = p | (i + 3);
                    break;
                case 14:
                    out[pos++] = p | (i + 1);
                    out[pos++] = p | (i + 2);
                    out[pos++] = p | (i + 3);
                    break;
                case 15:
                    out[pos++] = p | i;
                    out[pos++] = p | (i + 1);
                    out[pos++] = p | (i + 2);
                    out[pos++] = p | (i + 3);
                    break;
            }                
        }
    }
    return pos;
}

將兩者與 __builtin_ctzll 版本放在一起比較，測試跑 bitmapsize = 10000 的情況，結果如下

可以發現一次測四個 bits 所需要的時間比一次測兩個 bits 所需的時間還要少，且兩者在 bitmap density 較高時都比 __builtin_ctzll 版本好，而 __builtin_ctzll 版本在 bitmap density = 0 ~ 40 左右有絕對的優勢

4.閱讀 Data Structures in the Linux Kernel 並舉出 Linux 核心使用 bitmap 的案例;

在 linux/include/linux/cpumask.h 中用到

首先這個檔案在做的事就如註解所講的

Cpumasks provide a bitmap suitable for representing the set of CPU's in a system, one bit position per CPU number.

在程式的開頭先定義結構

typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

DECLARE_BITMAP 則自於 linux/include/linux/types.h

#define DECLARE_BITMAP(name,bits) \
	unsigned long name[BITS_TO_LONGS(bits)]

name 為 bitmap 的名子，而 bits 則是需要的 bit 數量。在這裡 bits 參數為 NR_CPUS ，代表系統所支持最大 CPU 數

定義完 bitmap 結構後下面則有許多應用例如

extern struct cpumask __cpu_possible_mask;
extern struct cpumask __cpu_online_mask;
extern struct cpumask __cpu_present_mask;
extern struct cpumask __cpu_active_mask;

在這裡有四種 CPU 狀態的 bitmap
possible : 標示可移至內核的 CPU
online : 標示已經移至內核的 CPU
present : 標示已經初始化後可被調用的 CPU
active : 標示哪些 CPU 可能已經被某些任務佔據

static inline void
set_cpu_possible(unsigned int cpu, bool possible)
{
	if (possible)
		cpumask_set_cpu(cpu, &__cpu_possible_mask);
	else
		cpumask_clear_cpu(cpu, &__cpu_possible_mask);
}

上面是 set_cpu_possible 的 function ，用於更改 __cpu_possible_mask 中的 bit ，其中 bool possible 決定將 bitmap 特定 bit set to 1 or set to 0 ，至於是哪個 bit 要 set 則是由 unsigned int cpu 來決定，這裡又會用到另外兩個 function

static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
	set_bit(cpumask_check(cpu), cpumask_bits(dstp));
}

static inline void cpumask_clear_cpu(int cpu, struct cpumask *dstp)
{
	clear_bit(cpumask_check(cpu), cpumask_bits(dstp));
}

在這裡 set_bit 和 clear_bit 的第一個參數都為 cpumask_check(cpu) ， cpu 代表是第幾個 bit 要被 set ， cpumask_check() 則是確認是否有超出範圍，而第二個參數為 cpumask_bits(dstp) ， dstp 為 bitmap struct 的位置 cpumask_bits() 則是取 struct 中 bitmap 的值

另外還有許多其它應用像是
cpumask_first 對應 find_first_bit
cpumask_next 對應 find_next_bit
cpumask_next_zero 對應 find_next_zero_bit
等就不一一列舉介紹了

參考：
Linux进程管理之CPU启动相关的四大状态
 Linux 内核 | CPU 状态管理
 CPU masks

測驗 `5`

以下是 LeetCode 166. Fraction to Recurring Decimal 的可能實作:



















































































































#include <stdbool.h>                       
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "list.h"
    
struct rem_node {
    int key;
    int index;
    struct list_head link;
};  

//找是否有餘數重複，若重複傳回其位子，沒有重複則傳回 -1
static int find(struct list_head *heads, int size, int key)
{
    struct rem_node *node;
    int hash = key % size;
    //list_for_each_entry : Iterate over list of given type
    list_for_each_entry (node, &heads[hash], link) {
        if (key == node->key)
            return node->index;
    }
    return -1;
}

//numerator:被除數    denominator:除數
char *fractionToDecimal(int numerator, int denominator)
{
    int size = 1024;
    char *result = malloc(size);
    char *p = result;
    
    //除數為 0 無法除
    if (denominator == 0) {
        result[0] = '\0';
        return result;
    }
    
    //被除數為 0 商數為 0
    if (numerator == 0) {
        result[0] = '0';
        result[1] = '\0';
        return result;
    }

    /* using long long type make sure there has no integer overflow */
    long long n = numerator;
    long long d = denominator;

    /* deal with negtive cases */
    if (n < 0)
        n = -n;
    if (d < 0)
        d = -d;
    
    //若為負數結果加負號
    bool sign = (float) numerator / denominator >= 0;
    if (!sign)
        *p++ = '-';
    
    //做整數除法，因為 n 和 d 都大於 0，商與餘數必大於零
    long long remainder = n % d;
    long long division = n / d;
    
    //餘數為 0 直接輸出商就是答案
    sprintf(p, "%ld", division > 0 ? (long) division : (long) -division);
    if (remainder == 0)
        return result;
    
    //有餘數先加小數點
    p = result + strlen(result);
    *p++ = '.';

    /* Using a map to record all of reminders and their position.
     * if the reminder appeared before, which means the repeated loop begin,
     */
    char *decimal = malloc(size);
    memset(decimal, 0, size);
    char *q = decimal;

    size = 1333;
    struct list_head *heads = malloc(size * sizeof(*heads));
    for (int i = 0; i < size; i++)
        INIT_LIST_HEAD(&heads[i]);

    for (int i = 0; remainder; i++) {
        //確認餘數是否重複，若重複則為循環小數
        int pos = find(heads, size, remainder);
        //重複的話 pos >= 0 ，處理循環小數的輸出
        if (pos >= 0) {
            while (PPP > 0)
                *p++ = *decimal++;
            *p++ = '(';
            while (*decimal != '\0')
                *p++ = *decimal++;
            *p++ = ')';
            *p = '\0';
            return result;
        }
        //remainder 不重複，將餘數存進hash table內
        struct rem_node *node = malloc(sizeof(*node));
        node->key = remainder;
        node->index = i;
        
        //list_add : Insert a new entry after the specified head.
        MMM(&node->link, EEE);
        
        //更新餘數
        *q++ = (remainder * 10) / d + '0';
        remainder = (remainder * 10) % d;
    }

    strcpy(p, decimal);
    return result;
}

PPP = pos--
MMM = list_add
EEE = &heads[remainder % size]

1.解釋上述程式碼運作原理，指出其中不足，並予以改進

例如，判斷負號只要寫作 bool isNegative = numerator < 0 ^ denominator < 0;
搭配研讀 The simple math behind decimal-binary conversion algorithms

第 34~37 行
可以將這段刪掉，因為在題目的敘述中已經標明 denominator != 0

第 57~58 行

bool sign = (float) numerator / denominator >= 0;
if (!sign)
    *p++ = '-';

負數判斷改成

bool isNegative = numerator < 0 ^ denominator < 0;
if (isNegative)
    *p++ = '-';

第 66 行
在 51~54 行已經處理過負數了，所以這裡 division 必定是大於零，所以改寫成以下
```
sprintf(p, "%ld", (long)division);
```
第 81 行
不懂為何 hash table 的 bucket size 設為 1333 ，參考 cantfindagoodname 的想法可將 size 設為
$2^{k}$ ，然後第 17 行 key % size; 就可以改為 key & (1 << k - 1) ，這樣就可避免使用 % 以提高效率

2.在 Linux 核心原始程式碼的 mm/ 目錄 (即記憶體管理) 中，找到特定整數比值 (即 fraction) 和 bitwise 操作的程式碼案例，予以探討和解說其應用場景

目前用關鍵字 fraction 去搜尋只找到下列相關檔案，但裡面 fraction 的意思和題目所要探討的不同且與 bitwise 操作也無關，目前想不到如何尋找題目所要求探討的程式碼部份
linux/mm/kasan/quarantine.c
linux/mm/slub.c
linux/mm/page_alloc.c
linux/mm/vmscan.c

測驗 `6`

alignof 是 GNU extension，以下是其可能的實作方式:

/*
 * ALIGNOF - get the alignment of a type
 * @t: the type to test
 *
 * This returns a safe alignment for the given type.
 */
#define ALIGNOF(t) \
    ((char *)(&((struct { char c; t _h; } *)0)->M) - (char *)X)

M = _h
X = 0

1.解釋上述程式碼運作原理

參考：

#define ALIGNOF(t) \
    ((char *)(&((struct { char c; t _h; } *)0)->_h) - (char *)0)

//分段來看
A = (struct { char c; t _h; } *)0
B = A -> _h
C = &B
D = (char *)C
E = D - (char *)0

A：將 0 強制轉形成 (struct { char c; t _h; } *)
B：Dereference and access struct member _h
C：取 _h 的 address
D：強制轉形成成 (char *)
E：減去 (char *)0

這裡最後用 (char *) 轉型是因為一個 char 的空間為一個 byte，以下將 char 改成 int 測試 t = uint64_t ，結果會是 2

# include <stdio.h>
# include <stdint.h>

int main(){
	printf("%ld\n", ((int *)(&((struct { int c; uint64_t _h; } *)0)->_h) - (int *)0));
}

另外因為 data alignment ，所以 compiler 會將 struct 的大小對齊 struct 裡最大的元素，所以 char c 後會 padding some bytes，這樣相減結果就會知道 alignment requirement

2.在 Linux 核心原始程式碼中找出 alignof 的使用案例 2 則，並針對其場景進行解說

linux/kernel/sched/sched.h

#define DEFINE_SCHED_CLASS(name) \
const struct sched_class name##_sched_class \
	__aligned(__alignof__(struct sched_class)) \
	__section("__" #name "_sched_class")

linux/net/netfilter/ipset/ip_set_bitmap_port.c

struct bitmap_port {
	unsigned long *members;	/* the set members */
	u16 first_port;		/* host byte order, included in range */
	u16 last_port;		/* host byte order, included in range */
	u32 elements;		/* number of max elements in the set */
	size_t memsize;		/* members size */
	struct timer_list gc;	/* garbage collection */
	struct ip_set *set;	/* attached to this ip_set */
	unsigned char extensions[]	/* data extensions */
		__aligned(__alignof__(u64));
};

在上面這兩個例子中 __alignof__ 都和 __aligned 並用，並且在 linux 核心中許多地方也都有他們兩個並用的情形，這樣的用法是為了對某些變數做特定大小的 alignment ，像是第一個例子就是先用 __alignof__(struct sched_class) 取得要對齊的大小，接著再用 __aligned() 來對齊 const struct sched_class name##_sched_class

3.在 Linux 核心源使程式碼找出 `ALIGN`, `ALIGN_DOWN`, `ALIGN_UP` 等巨集，探討其實作機制和用途，並舉例探討 (可和上述第二點的案例重複)。思索能否對 Linux 核心提交貢獻，儘量共用相同的巨集

linux/include/linux/align.h

#define ALIGN(x, a)		__ALIGN_KERNEL((x), (a))
#define ALIGN_DOWN(x, a)	__ALIGN_KERNEL((x) - ((a) - 1), (a))

若上述程式 x = 6 a = 4 會得到結果 ALIGN: 8 ALIGN_DOWN: 4 ，可以發現 ALIGN 是向上對齊， ALIGN_UP 可能因為功能與 ALIGN 相同所以才沒有，而不管是 ALIGN 或是 ALIGN_DOWN 都有用到 __ALIGN_KERNEL

linux/include/uapi/linux/const.h

#define __ALIGN_KERNEL(x, a)		__ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask)	(((x) + (mask)) & ~(mask))

舉個例子比較好理解程式的運作， x = 6 a = 4 ， __ALIGN_KERNEL(6, 4) = __ALIGN_KERNEL_MASK(6, 3) ，然後兩者相加會變成 9 ，而 9 的二進制為 0x1001 ，然後再 0x1001 & ~(0x0011) = 0x1000 ，可以發現若沒有剛好對齊， x + mask 就會變成下個 4 的倍數再加某數，像是 6 + 3 = 8 + 1 7 + 3 = 8 + 2，然後再用 mask 遮掉後面的數讓它剛好為 4 的倍數就完成了向上對齊，而向下對齊就是先減，讓它 4 的倍數部份減少再向上對齊而已

在許多地方都有用到，像是記憶體管理 linux/mm/page_alloc.c

void set_zone_contiguous(struct zone *zone)
{
	unsigned long block_start_pfn = zone->zone_start_pfn;
	unsigned long block_end_pfn;

	block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
	for (; block_start_pfn < zone_end_pfn(zone);
			block_start_pfn = block_end_pfn,
			 block_end_pfn += pageblock_nr_pages) {

		block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));

		if (!__pageblock_pfn_to_page(block_start_pfn,
					     block_end_pfn, zone))
			return;
		cond_resched();
	}

	/* We confirm that there is no hole */
	zone->contiguous = true;
}

上面用 ALIGN 來使 page frame number(pfn) 對齊 pageblock_nr_pages ，程式主要功能是確認某一個區塊的記憶體是連續的

static void __init init_unavailable_range(unsigned long spfn,
					  unsigned long epfn,
					  int zone, int node)
{
	unsigned long pfn;
	u64 pgcnt = 0;

	for (pfn = spfn; pfn < epfn; pfn++) {
		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
				+ pageblock_nr_pages - 1;
			continue;
		}
		__init_single_page(pfn_to_page(pfn), pfn, zone, node);
		__SetPageReserved(pfn_to_page(pfn));
		pgcnt++;
	}

	if (pgcnt)
		pr_info("On node %d, zone %s: %lld pages in unavailable ranges",
			node, zone_names[zone], pgcnt);
}

這裡 ALIGN_DOWN 也是用來使 page frame number(pfn) 對齊 pageblock_nr_pages

測驗 `7`

FizzBuzz 題目:

從 1 數到 n，如果是 3的倍數，印出 “Fizz”
如果是 5 的倍數，印出 “Buzz”
如果是 15 的倍數，印出 “FizzBuzz”
如果都不是，就印出數字本身

直覺的實作程式碼如下: (naive.c)

#include <stdio.h>
int main() {
    for (unsigned int i = 1; i < 100; i++) {
        if (i % 3 == 0) printf("Fizz");
        if (i % 5 == 0) printf("Buzz");
        if ((i % 3) && (i % 5)) printf("%u", i);
        printf("\n");
    }
    return 0;
}

觀察 printf 的(格式)字串，可分類為以下三種:

整數格式字串 "%d" : 長度為 2 B
“Fizz” 或 “Buzz” : 長度為 4 B
“FizzBuzz” : 長度為 8 B

考慮下方程式碼:

char fmt[M9];
strncpy(fmt, &"FizzBuzz%u"[start], length);
fmt[length] = '\0';
printf(fmt, i);
printf("\n");

我們若能精準從給定輸入 i 的規律去控制 start 及 length，即可符合 FizzBuzz 題意:

 string literal: "fizzbuzz%u"
         offset:  0   4   8

以下是利用 bitwise 和上述技巧實作的 FizzBuzz 程式碼: (bitwise.c)

static inline bool is_divisible(uint32_t n, uint64_t M)
{
    return n * M <= M - 1;
}

static uint64_t M3 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 3 + 1;
static uint64_t M5 = UINT64_C(0xFFFFFFFFFFFFFFFF) / 5 + 1;

int main(int argc, char **argv)
{
    for (size_t i = 1; i <= 100; i++) {
        uint8_t div3 = is_divisible(i, M3);
        uint8_t div5 = is_divisible(i, M5);
        unsigned int length = (2 << KK1) << KK2;

        char fmt[9];
        strncpy(fmt, &"FizzBuzz%u"[(8 >> div5) >> (KK3)], length);
        fmt[length] = '\0';

        printf(fmt, i);
        printf("\n");
    }
    return 0;
}

KK1 = div3
KK2 = div5
KK3 = div3 << 2

1.解釋上述程式運作原理並評估 `naive.c` 和 `bitwise.c` 效能落差，避免 stream I/O 帶來的影響，可將 `printf` 更換為 `sprintf`

程式解釋

首先 is_divisible 的原理用例子比較好解釋

12 可被 3 整除：

(3 \times 4) \times (\frac{0 X F F . . .}{3} + 1) <= \frac{0 X F F . . .}{3}

4 \times (0 X F F . . . + 3) <= \frac{0 X F F . . .}{3}

4 \times 0 X 0000000000000010 <= \frac{0 X F F . . .}{3}

若是整除， (0xff… + 3) 的部份會 overflow 後變成很小的數，回傳值也會變成是 1

20 不可被 3 整除：

(5 \times 4) \times (\frac{0 X F F . . .}{3} + 1) <= \frac{0 X F F . . .}{3}

(5 \times (3 + 1)) \times (\frac{0 X F F . . .}{3} + 1) <= \frac{0 X F F . . .}{3}

5 \times ((0 X F F . . . + 3) + (\frac{0 X F F . . .}{3} + 1)) <= \frac{0 X F F . . .}{3}

若不是整除必可拆成三加上某數，多出的部份乘起來必定比右式大，所以回傳值也會變成是 0

接著 div3 div5 有四種組合，對應到 length 和 start

(div5, div3)	(0,0)	(0,1)	(1,0)	(1,1)
length	2	4	4	8
strat	8	0	4	0

再看以下，發現剛好可以對應到正確的輸出

 string literal: "fizzbuzz%u"
         offset:  0   4   8

length 只要多整除一個數，輸出便會多一倍，所以用 left shift 剛好，而 start 只要遇到可以整除 3 便會變成 0 ，所以必須 8 right shift 4 ，而若是只有整除 5 只需要 8 right shift 1 就好

效能評估

測試跑 1000000 筆資料時間。
naive:0.037916832 s
bitwise:0.042828712 s
結果發現 bitwise 版本竟然比較慢，試了幾次都是相同的結果，所以想說看什麼指令花比較多時間，於是就先將 sprintf 先改成 count++ 來測試，不然若是用 sprintf 則會佔用太多時間而無法觀測其它指令時間差別

perf record ./q7_1_1.out && perf annotate
perf record ./q7_1_2.out && perf annotate

naive

發現花最多時間在 shl 和 imul ，然而卻沒有任何除法指令，初步推估是因為編譯器優化掉除法指令了，這可能是造成原版竟然比 bitwise 版還要快的原因

bitwise

而 bitwise 版則是花最多時間在 call function 及 mov 上

2.分析 Faster remainders when the divisor is a constant: beating compilers and libdivide 一文的想法 (可參照同一篇網誌下方的評論)，並設計另一種 bitmask，如「可被 3 整除則末位設為 1」「可被 5 整除則倒數第二位設定為 1」，然後再改寫 `bitwise.c` 程式碼，試圖運用更少的指令來實作出 branchless，參照 fastmod: A header file for fast 32-bit division remainders on 64-bit hardware

看了上述資料後還是想不到如何改進，後來偶然間看到另一種作法，參考奇葩编程之FizzBuzz，作者的想法主要是利用模同餘的性質配上 bitwise 操作，將所有數字映射到 0~7 之間

#include <stdio.h>
#include <stdint.h>

#define SIZE 1000000

static const char* s[] = {"", "Buzz", "Fizz", "FizzBuzz"};

int main() {
    char str[9];
    
    for(int i = 1; i <= SIZE; i++){
        uint64_t number = i;
        //0~15
        number = (number >> 32) + (number & 0xffffffff);
        number = (number >> 32) + ((number >> 16) & 0xffff) + (number & 0xffff);
        number = (number >> 16) + ((number >> 8) & 0xff) + (number & 0xff);
        number = (number >> 8) + ((number >> 4) & 0xf) + (number & 0xf);
        number = (number >> 4) + (number & 0xf);
        //0~7
        number ^= (number & 8) | ((number & 8) >> 1) | ((number & 8) >> 2) | ((number & 8) >> 3);
        if (number == 0) sprintf(str, "FizzBuzz");
        else if (number == 3 || number == 6) sprintf(str, "Fizz");
        else if (number == 5) sprintf(str, "Buzz");
        else sprintf(str, "%u", i);
    }
    
    return 0;
}

簡單來講分成兩個部份，第一個部份是將所有數字映射到 0~15 ，作法則是利用模同餘的性質，

16 k + m = 15 k + (k + m) \equiv k + m (m o d 15)

，所以只要不停的做模同餘變換，每次都用 16 的倍數下去切再做加減，最後就會映射到 0~15 之間，接著就是如何將 0~15 映射到 0~7 ，可以發現 15➜0 12➜3 9➜6 10➜5 ，上述的整除性質都一樣又 15 - k 和 15 ^ k 的結果是一樣的，然後 8~15 最高位都是 1 所以利用最高位位移加上 xor 即可做出映射變換

效能分析

一樣是測試跑 1000000 筆資料時間，最近發現可以用 -r 參數來取多次實驗平均結果

perf stat -r 100 ./q7_1_1.out
perf stat -r 100 ./q7_1_2.out
perf stat -r 100 ./q7_2_1.out

時間：
naive : 0.0339814 ± 0.0000960 seconds time elapsed ( ± 0.28% )
bitwise : 0.040525 ± 0.000393 seconds time elapsed ( ± 0.97% )
improve : 0.0364116 ± 0.0000975 seconds time elapsed ( ± 0.27% )

instructions:
naive : 632,119,366
bitwise : 689,331,934
improve : 657,274,531

可以看的出 improve 版不管是在時間上或者是 instruction 數上都比 bitwise 版還要再更好

3.研讀 The Fastest FizzBuzz Implementation 及其 Hacker News 討論，提出 throughput (吞吐量) 更高的 Fizzbuzz 實作

上述文章內許多組語看不懂，然後有想說試試看用平行程式下去優化，但文章內又提到

Alex did attempt to build a multi-threaded version but was unable to find any performance improvement over the single-threaded version. The reason stated was that the cost of communication between CPU cores is sufficiently high enough that any gains from parallel computation are ultimately negated.

所以目前不太知道下一步如何做

4.解析 Linux 核心原始程式碼 kernel/time/timekeeping.c 裡頭涉及到除法運算的機制，探討其快速除法的實作 (注意: 你可能要對照研讀 `kernel/time/` 目錄的標頭檔和程式碼)

看不太懂程式碼，不確定快速除法的實做在哪裡

Jim Huang

2022/05/11 03:33:10

## 測驗 `4` 在影像處

用 GDB 單步執行來確認 (Edited)

2022/05/11 03:34:05

### 3.

留意到 ctz/clz 一類指令的 latency 較大 (Edited)

2022/05/11 03:35:14

需要更詳盡的探討，可搭配 Demystifying the Linux CPU Scheduler (Edited)

2022/05/16 02:44:05

```c static inline void ewma_##name##_init(struct ewma_##name *e) \ { \ BUILD_BUG_ON(!__builtin_constant_p(_precision)); \ BUILD_BUG_ON(!__builtin_constant_p(_weight_rcp)); \ /* \ * Even if you want to feed it just 0/1 you should have \ * some bits for the non-fractional part... \ */ \ BUILD_BUG_ON((_precision) > 30); \ BUILD_BUG_ON_NOT_POWER_OF_2(_weight_rcp); \ e->internal = 0; \ } ```

> 30 怎麼來的？ (Edited)

2022/05/16 02:47:38

思考進一步的改進空間

TODO: 準備不同的 data set (Edited)

2022/05/16 13:57:17

特定

directory 的翻譯是「目錄」，而非「資料夾」 (Edited)

2022/05/18 08:39:54

請先紀錄問題，稍候我找聯發科技的無線網路軟體工程師來回答你的疑惑 (Edited)

2022q1 Homework2 (quiz2)

實驗環境

測驗 1

1.解釋上述程式碼運作的原理

2.比較上述實作在編譯器最佳化開啟的狀況，對應的組合語言輸出，並嘗試解讀

3.研讀 Linux 核心原始程式碼 include/linux/average.h，探討其 Exponentially weighted moving average (EWMA) 實作，以及在 Linux 核心的應用

linux 核心的應用

mediatek wifi chip driver

Linux DRM

測驗 2

1.解釋上述程式碼運作的原理

2.針對 32 位元無號/有號整數，撰寫同樣 branchless 的實作

3.Linux 核心也有若干 branchless / branch-free 的實作，例如 lib/sort.c，請在 Linux 核心原始程式碼中，找出更多類似的實作手法，請善用 git log 檢索

找其它案例

測驗 3

1.解釋上述程式運作原理

2.在 x86_64 上透過 __builtin_ctz 改寫 GCD，分析對效能的提升

效能分析

3.Linux 核心中也內建 GCD (而且還不只一種實作)，例如 lib/math/gcd.c，請解釋其實作手法和探討在核心內的應用場景

linux __ffs 問題

核心內的應用場景

測驗 4

1.解釋上述程式運作原理，並舉出這樣的程式碼用在哪些真實案例中

真實案例

2.設計實驗，檢驗 ctz/clz 改寫的程式碼相較原本的實作有多少改進？應考慮到不同的 bitmap density

bitmapsize = 1000

bitmapsize = 10000

bitmapsize = 100000

3.思考進一步的改進空間

4.閱讀 Data Structures in the Linux Kernel 並舉出 Linux 核心使用 bitmap 的案例;

在 linux/include/linux/cpumask.h 中用到

測驗 5

1.解釋上述程式碼運作原理，指出其中不足，並予以改進

2.在 Linux 核心原始程式碼的 mm/ 目錄 (即記憶體管理) 中，找到特定整數比值 (即 fraction) 和 bitwise 操作的程式碼案例，予以探討和解說其應用場景

測驗 6

1.解釋上述程式碼運作原理

2.在 Linux 核心原始程式碼中找出 alignof 的使用案例 2 則，並針對其場景進行解說

3.在 Linux 核心源使程式碼找出 ALIGN, ALIGN_DOWN, ALIGN_UP 等巨集，探討其實作機制和用途，並舉例探討 (可和上述第二點的案例重複)。思索能否對 Linux 核心提交貢獻，儘量共用相同的巨集

測驗 7

1.解釋上述程式運作原理並評估 naive.c 和 bitwise.c 效能落差，避免 stream I/O 帶來的影響，可將 printf 更換為 sprintf

程式解釋

效能評估

效能分析

3.研讀 The Fastest FizzBuzz Implementation 及其 Hacker News 討論，提出 throughput (吞吐量) 更高的 Fizzbuzz 實作

4.解析 Linux 核心原始程式碼 kernel/time/timekeeping.c 裡頭涉及到除法運算的機制，探討其快速除法的實作 (注意: 你可能要對照研讀 kernel/time/ 目錄的標頭檔和程式碼)

Read more

股票小幫手

測驗 `1`

測驗 `2`

測驗 `3`

測驗 `4`

測驗 `5`

測驗 `6`

3.在 Linux 核心源使程式碼找出 `ALIGN`, `ALIGN_DOWN`, `ALIGN_UP` 等巨集，探討其實作機制和用途，並舉例探討 (可和上述第二點的案例重複)。思索能否對 Linux 核心提交貢獻，儘量共用相同的巨集

測驗 `7`

1.解釋上述程式運作原理並評估 `naive.c` 和 `bitwise.c` 效能落差，避免 stream I/O 帶來的影響，可將 `printf` 更換為 `sprintf`

4.解析 Linux 核心原始程式碼 kernel/time/timekeeping.c 裡頭涉及到除法運算的機制，探討其快速除法的實作 (注意: 你可能要對照研讀 `kernel/time/` 目錄的標頭檔和程式碼)