位元旋轉實作和 Linux 核心案例

針對 2021 年暑期「Linux 核心」課程先決測驗題的測驗

α

共筆貢獻者: jserv, ekangmonyet, ChinYikMing, henrybear327, dmefs, JingWangTW, mickey30606, tanchihpin0517, ccs100203, RinHizakura, st9540808, yian02, 2011eric, alan23273850

位元旋轉

位元旋轉 (bitwise rotation 或 bit rotation) 的操作展示如下圖:

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

位元旋轉也稱為循環移位運算 (circular shift operation)，其作用是移動位元，但不捨棄位元或增加位元。分二種操作:

循環右移運算 (向右旋轉): 向右移動每個位元一個位置，最右邊的位元被循環成為最左邊的位元
循環左移運算 (向左旋轉): 向左移動每個位元一個位置，最左邊的位元被循環成為最右邊的位元

以下程式碼可建立針對不同位元的向左/向右位元旋轉的實作:

#include <stdint.h>
  
#define __DECLARE_ROTATE(bits, type)                   \
    static inline type rotl##bits(const type v, int c) \
    {                                                  \
        const int mask = (bits) - (1);                 \
        c &= mask;                                     \
                                                       \
        return (v << c) | (v >> (-c & mask));          \
    }                                                  \
                                                       \
    static inline type rotr##bits(const type v, int c) \
    {                                                  \
        const int mask = (bits) - (1);                 \
        c &= mask;                                     \
                                                       \
        return (v >> c) | (v << (-c & mask));          \
    }

#define DECLARE_ROTATE(bits) __DECLARE_ROTATE(bits, uint##bits##_t)

使用案例:

DECLARE_ROTATE(64);
DECLARE_ROTATE(32);
DECLARE_ROTATE(16);
DECLARE_ROTATE(8);

上述程式碼運用到 C 語言前置處理器的 ## (concatenation，有符號和識別「連結」的意思)，一旦 DECLARE_ROTATE 展開後，就可定義對應二個 static inline 函式，例如 rotl8 和 rotr8 分別是針對 8 位元整數型態的位元向左和向右旋轉: (為了便於討論，我們將此段程式碼稱為 ROT0)

static inline uint8_t rotl8(const uint8_t v, int c) {
    const int mask = (8) - (1);
    c &= mask;
    return (v << c) | (v >> (-c & mask));
}
static inline uint8_t rotr8(const uint8_t v, int c) {
    const int mask = (8) - (1);
    c &= mask;
    return (v >> c) | (v << (-c & mask));
};

可用命令 gcc -E 觀察巨集展開

在 Linux 核心原始程式碼 include/linux/bitops.h 類似的實作是 rol8 (針對 8 位元資料型態的位元向左旋轉): (ROT1)

/* rol8 - rotate an 8-bit value left
 * @word: value to rotate
 * @shift: bits to roll
 */
static inline __u8 rol8(__u8 word, unsigned int shift)
{
    return (word << (shift & 7)) | (word >> ((-shift) & 7));
}

針對 8 位元數值，若 rotl/rotr 的旋轉量是 8 的倍數，相當於所有位元都回到和原來相同的位置，所以只要取 8 的餘數，而「取 8 的餘數」的操作等同於 & 7 (7₁₀ = 0111₂)，相當於取 8 (= 1000₂) 的餘數，因為更高的位數皆可被 8 整除。這技巧也適用於其他 2 的冪 (power of 2)。

為了便於討論，我們將數值的二進位表示的 MSB (most significant bit) 稱為「左側」，LSB (least significant bit) 相較來說就是「右側」。在 ROT0，mask 相當於 ROT1 的 7，c 相當於 ROT1 程式碼的 (shift & 7)，最後，再合併「數值左移 c 個位元」和「將數值因左移截掉的位元，右移到最右邊」，使用位元 OR 運算子 |。

要如何確認 ROT0 和 ROT1 作用等價呢？在 ROT1，word 最高位起算 (shift) 個位元進行右移 (8 - (shift & 7)) 的距離，何以會等於 >> ((-shift) & 7) 呢？也就是說，為什麼 (8 - (shift & 7)) 等同於 ((-shift) & 7) 呢？這可用二補數來推導。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

觀察上圖，根據二補數的特性，+6 (0110₂) 跟 -2 (1110₂) 的差別只有最高位元 (+6

\to

0___, -2

\to

1___)，因此將 +2 取負號後再利用 mask 移去 signed bit ，即可得到 +6。

延伸閱讀: 解讀計算機編碼

已知

s h i f t < 2^{7}

，令 shift 的二補數為 shift'，依據二補數:

s h i f t + s h i f t^{'} = 2^{8}

因為 -shift 就是 shift'，所以

(- s h i f t) & 7 = (2^{8} - s h i f t) & 7

因為

(2^{8} - s h i f t) > 0

，所以

(2^{8} - s h i f t) & 7

邏輯上等價於

\begin{aligned} (2^{8} - s h i f t) m o d 8 & = (0 - s h i f t) m o d 8 \\ = (8 - s h i f t) m o d 8 \\ = 8 - (s h i f t & 7) \end{aligned}

藉由上面的推導，我們可確定 (8 - (shift & 7)) 跟 ((-shift) & 7) 最後結果一致。

直觀的實作有何問題？

看了 ROT0 和 ROT1，你或許會納悶：「難道不能用更簡單的實作嗎？」

針對 ROT0，以最簡單的 8 位元操作、向左旋轉 c 個位元來說，可拆解為:

先向左位移 c 個位元
高位元的 c 個位元需要回到低位，即向右位移 8 - c 之結果

對應的程式碼如下: (ROT2)

static inline uint8_t rotl8(const uint8_t v, int c) {
    return (v << c) | (v >> (8 - c));
}

前面提到 Linux 核心 include/linux/bitops.h 也有相似的實作，值得關注的是，並非一開始就寫為 ROT1 的樣貌。

Linux 核心開發者針對位元旋轉，曾有一系列變革。在 d7e35df 修改納入 Linux 核心之前，採用以下程式碼: (ROT3)

static inline __u32 rol32(__u32 word, unsigned int shift) {
    return (word << shift) | (word >> (32 - shift));
}

乍看之下，和 ROT2 極為相似，不過為何後來變成 ROT1 的形式呢？我們可發現 (32 - shift) 和 (-shift) & 31 實際的行為不同：當 shift 為 0 時，前者 (即 ROT3) 的 >> 32 對於 32 位元整數來說，會遇到 C 語言中規範的 undefined behavior:

The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

反之，ROT1 則有明確行為的 >> 0。因此 d7e35df 進行了修改，使 shift 參數的適用範圍變成了

[0, 31]

，程式碼如下: (ROT4)

static inline __u32 rol32(__u32 word, unsigned int shift)
{
    return (word << shift)  | (word >> ((-shift) & 31));
}

相較於 ROT3，ROT4 雖然更正確，但 ROT4 對於較舊的編譯器 (即舊版 GCC) 卻不友善，會遭遇到產生的組合語言效率低於 ROT3，也就是無法產生 x86/x86_64 處理器的 rotation 指令。在 GCC Bug 57157: Poor optimization of portable rotate idiom 針對 ROT4 進行編譯器最佳化的討論，最終 GCC 於 2013 年改進指令生成的最佳化實作。

後續 Linux 核心修改 ef4d6f6 引入更完善的改進，包含其它不同寬度整數的位元旋轉中，未被修改的 (bits - shift) 樣式，也對 32 位元的 shift 超出 31 時的行為予以調整: (ROT5)

static inline __u32 rol32(__u32 word, unsigned int shift)
{
    return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

需要注意的是，並非所有資料寬度的位元旋轉都要完全遵守上述模式，才可滿足想要的正確性，需要考慮 C 語言標準的 integer promotion 所對應的程式行為。例如 8 bits 的 rol 實作: (ROT6)

static inline __u8 rol8(__u8 word, unsigned int shift)
{
    return (word << shift) | (word >> (8 - shift));
}

當 shift == 0 時，對 8 位元無號整數的 word 進行 >> 8 行為並非 undefined behavior，因為要注意到 integer promotion 的影響。根據 C 語言規格書 6.3.1.1:

If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.

當 word 被提升成 int 型態後，右移 8 位元的語言行為就是明確規範 (well-defined)，於是 ROT6 可視為 ROT1 的改進。

Linux 核心的位元旋轉應用案例

ror 和 rol 系列的函式大多用於密碼學的操作，例如 SHA-256。Linux 核心提供 Crypto API 封裝介面給開發者。

SHA-256

lib/crypto/sha256.c




#define e0(x)       (ror32(x, 2) ^ ror32(x, 13) ^ ror32(x, 22))
#define e1(x)       (ror32(x, 6) ^ ror32(x, 11) ^ ror32(x, 25))
#define s0(x)       (ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3))
#define s1(x)       (ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10))

SHA256_ROUND 裡面就有使用到以上用 circular shift 定義的巨集，SHA256_ROUND 總共會做 64 次，完成後便是用 SHA-256 演算法計算出的雜湊值。







#define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do {		\
	u32 t1, t2;						\
	t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];	\
	t2 = e0(a) + Maj(a, b, c);				\
	d += t1;						\
	h = t1 + t2;						\
} while (0)

在 Linux 核心何時會用到 SHA-256 呢？其中一個應用是搭配 RSA 產生數位簽章，當 Linux 核心模組 (LKM) 載入到核心時，會檢查雜湊值，並在核心計算雜湊值是否相等，如此一來便能避免惡意的 LKM 破壞系統，而在沒有 RSA key 或是不匹配的情況下，Linux 核心會將試圖載入的 LKM 標示為 tainted。

延伸閱讀: Kernel module signing facility

除了 crypto/sha3_generic.c，lib/crypto/chacha.c 也用到位元旋轉。至於 lib/xxhash.c 源自 Cyan4973/xxHash
也定義位元旋轉，使用到 bits - c 的手法。

搜尋 rol32，可發現在 fs/ext4/hash.c 見到位元旋轉的使用，摘錄程式：

/*
 * The generic round function.  The application is so specific that
 * we don't bother protecting all the arguments with parens, as is generally
 * good macro practice, in favor of extra legibility.
 * Rotation is separate from addition to prevent recomputation
 */
#define ROUND(f, a, b, c, d, x, s)	\
	(a += f(b, c, d) + x, a = rol32(a, s))

SHA-1 實作可見 lib/sha1.c:

#define SHA_MIX(t) rol32(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)

BLAKE2

BLAKE2 是第一代 BLAKE 的改良版，主要克服 MD5 和 SHA-1 等一類加密雜湊函數運算速度過慢的問題。BLAKE2 有三種，分別是 BLAKE2b, BLAKE2s, BLAKE2x。Linux 核心原始程式碼實作 BLAKE2b 和 BLAKE2s。這三種加密雜湊函數的差別在於，最後輸出的位元數不同，BLAKE2b 是 512 位元，BLAKE2s 是 256 位元，BLAKE2x 則支援任何位元。以下只討論 BLAKE2s。

Linux 核心原始程式碼 crypto 目錄下的 BLAKE2 實作，用到 Mixing G 函數，即有 ror32 的蹤跡。接下來將分析 BLAKE2 大致運算流程和 ror32 被用多少次。

可參考 lib/crypto/blake2s-generic.c
詳細實作可參考 RFC7693

一開始 BLAKE2s 的輸入是 8 個 word size 的資料，用這 8 個 word size 的資料搭配一些 constant(f) 和 counter(t) 組合出 16 個 word size的vector。接著，再把這個 vector 傳遞給 Compression F function, Compression F function 總共作 10 round, 每個 round 會作 8 次 Mixing G function 運算 (前 4 次是 column step，後 4 次是 diagonal step)，所以總共使用 80 次Mixing G function，而每次 Mixing G function 會使用 4 次ror32，因此 ror32 在 BLAKE2s 總共被使用 320 次。Compression F function 結束後，會再進行 Finalization 運算，才會得出 256 位元的雜湊值。

延伸閱讀: BLAKE2 解說影片

SipHash

SipHash: a fast short-input PRF 論文提到，現在的 message-authentication code (MAC) 都圍繞在如何使長字串的加密處理更加的快速，進而忽略掉了短字串的執行效率，而且為了避免 hash flooding 的狀況，於是出現 SipHash，其關鍵特徵如下:

High Security
High Speed
Key Agility: 這邊提到密鑰不論在 setting up new key 或 hashing a message 的時候都不會 extension 。
Simplicity: SipHash 的 simple round function 只由 +, <<<, ^ 構成。
Autonomy: No external primitive is required.
Small State: 只會使用到 4 個 64-bit variables 。
No state between messages: Hashing 是確定的，而且不會用到隨機亂數。
No software side channels
Minimal overhead: Authenticated messages 只比8原檔案多 8-byte 。

SipHash-c-d 當中的 c 代表在每次的 Compression 中要做幾次的 SipRound，d 代表在全部的 word 都做完後，要做幾次的 SipRound 來完成 finalization，常用的是 SipHash-2-4。

Initialization: 首先 4 個 64-bit words 會被
$k_{0}, k_{1}$ (64-bit word 代表 128-bit 的 key) 初始化：
$\begin{array}{r} v_{0} = k_{0} \oplus 736 f 6 d 6570736575 \\ v_{1} = k_{1} \oplus 646 f 72616 e 646 f 6 d \\ v_{2} = k_{0} \oplus 6 c 7967656 e 657261 \\ v_{3} = k_{1} \oplus 7465646279746573 \end{array}$
Compression: SipHash 會把要操作
$b$ 個 byte 長度的字串，轉換成
$w = (b + 1) / 8$ 個 64-bit little-endian words 記為
$m_{0}, m_{1}, . . ., m_{m - 1}$ ，並對每個
$m$ 的最高位的 byte 與
$b m o d 256$ 進行 encode 完成轉換。依序將每個轉換完的
$m_{i}$ 進行
$v_{3} \oplus m_{i}$ ，並進行
$c$ 次的 SipRound ，再進行
$v_{0} \oplus m_{i}$ ，依序跑完每一個
$m$ 。
Finialization: 在全部的
$m$ 都跑完後，會進行
$v_{2} \oplus f f$ ，再跑
$d$ 次的 SipRound ，最後 return
$v_{0} \oplus v_{1} \oplus v_{2} \oplus v_{3}$ 。

SipHash Round 定義於 include/linux/prandom.h






#define PRND_SIPROUND(v0, v1, v2, v3) ( \
	v0 += v1, v1 = rol64(v1, 13),  v2 += v3, v3 = rol64(v3, 16), \
	v1 ^= v0, v0 = rol64(v0, 32),  v3 ^= v2,                     \
	v0 += v3, v3 = rol64(v3, 21),  v2 += v1, v1 = rol64(v1, 17), \
	v3 ^= v0,                      v1 ^= v2, v2 = rol64(v2, 32)  \
)

AES

針對 Arm 架構的AES (進階加密標準)，內部實作用到位元旋轉。

此加密演算法的四個動作：SubBytes, ShiftRows, MixColumns, AddRoundKey 中的三個操作 (SubBytes, ShiftRows, MixColumns) 可藉由查表同時完成，這樣的操作會需要 16 個 lookup tables，但這 16 個 lookup table 中，只有 4 個需要被儲存起來，其餘的 12 個 lookup tables 都可透過儲存起來的 4 個 tables 進行位元旋轉得到，這樣可節省 75% 的 D-Cache。

程式碼可見 arch/arm64/crypto/aes-cipher-core.S，其中使用到 Arm 架構中用來向右旋轉指令 ror。

延伸閱讀: Accelerated AES for the Arm64 Linux kernel

編譯器最佳化議題

上述 C 程式碼經過編譯器最佳化 (例如使用 gcc) 後，能否運用到這硬體位元旋轉指令呢？

建立檔案名稱為rot8.c 的測試程式，使用 ROT0:

#include <stdint.h>
#include <stdio.h>

DECLARE_ROTATE(8);

void foo(int i) {
    printf("rotl8:  0x%x\n", rotl8(i, 1));
    printf("rotr8:  0x%x\n", rotr8(i, 1));
}

int main(int argc, char const* argv[]) {
    uint8_t i = 0x88;
    foo(i);
    return 0;
}

執行以下命令，可得抑制最佳化的組合語言輸出，檔名為rot8.s

$ gcc -S -fverbose-asm -O0 rot8.c

摘錄rot8.s 中關於 rotl8 的操作





































	.text
	.type	rotl8, @function
rotl8:
.LFB0:
	.cfi_startproc
	pushq	%rbp	#
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp	#,
	.cfi_def_cfa_register 6
	movl	%edi, %eax	# v, tmp93
	movl	%esi, -24(%rbp)	# c, c
	movb	%al, -20(%rbp)	# tmp94, v
# rotate.c:23: DECLARE_ROTATE(8);
	movl	$7, -4(%rbp)	#, mask
	movl	-4(%rbp), %eax	# mask, tmp95
	andl	%eax, -24(%rbp)	# tmp95, c
	movzbl	-20(%rbp), %edx	# v, _1
	movl	-24(%rbp), %eax	# c, tmp96
	movl	%eax, %ecx	# tmp96, tmp100
	sall	%cl, %edx	# tmp100, _1
	movl	%edx, %eax	# _1, _2
	movl	%eax, %esi	# _2, _3
	movzbl	-20(%rbp), %edx	# v, _4
	movl	-24(%rbp), %eax	# c, tmp97
	negl	%eax	# _5
	andl	-4(%rbp), %eax	# mask, _6
	movl	%eax, %ecx	# _6, tmp102
	sarl	%cl, %edx	# tmp102, _4
	movl	%edx, %eax	# _4, _7
	orl	%esi, %eax	# _3, _9
	popq	%rbp	#
	.cfi_def_cfa 7, 8
	ret	
	.cfi_endproc

從第 16 行起，是 rotl8 的函式本體，並未見到 rotl 指令的蹤跡。補上對應的 C 程式碼做對照:










































	.text
	.type	rotl8, @function
rotl8:
.LFB0:
	.cfi_startproc
	pushq	%rbp	#
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp	#,
	.cfi_def_cfa_register 6
	movl	%edi, %eax	# v, tmp93
	movl	%esi, -24(%rbp)	# c, c
	movb	%al, -20(%rbp)	# tmp94, v
# rotate.c:23: DECLARE_ROTATE(8);
# const int mask = (bits) - (1)
	movl	$7, -4(%rbp)	#, mask
	movl	-4(%rbp), %eax	# mask, tmp95
# c &= mask;
	andl	%eax, -24(%rbp)	# tmp95, c
	movzbl	-20(%rbp), %edx	# v, _1
	movl	-24(%rbp), %eax	# c, tmp96
	movl	%eax, %ecx	# tmp96, tmp100
# v << c
	sall	%cl, %edx	# tmp100, _1
	movl	%edx, %eax	# _1, _2
	movl	%eax, %esi	# _2, _3
	movzbl	-20(%rbp), %edx	# v, _4
	movl	-24(%rbp), %eax	# c, tmp97
# -c
	negl	%eax	# _5
# -c & mask
	andl	-4(%rbp), %eax	# mask, _6
	movl	%eax, %ecx	# _6, tmp102
# v >> (-c & mask)
	sarl	%cl, %edx	# tmp102, _4
	movl	%edx, %eax	# _4, _7
# (v << c) || (v >> (-c & mask))
	orl	%esi, %eax	# _3, _9
	popq	%rbp	#
	.cfi_def_cfa 7, 8
	ret	
	.cfi_endproc

接著執行以下命令，產生最佳化等級 -O1 的組合語言輸出:
$ gcc -S -fverbose-asm -O1 rot8.c

可發現 rotl8 及rotr8 的函式主體已經不見，也就是 static inline 發揮作用，觀察 foo 函式的主體，則可看見編譯器產生 rolb 和 rorb 指令。

總結上述觀察，抑制編譯器最佳化時，不會產生 rol 和 ror 指令，開啟最佳化 -O1 時，就會產生rol 和 ror 指令。

也可先產生 ELF 執行檔，再用以下命令反組譯:

$ gcc -o rot8 rot8.c
$ objdump -M intel -d rot8

猶他大學電腦科學系教授 John Regehr 在 Safe, Efficient, and Portable Rotate in C/C++ 一文探討如何進行正確且有效率的位元旋轉。

針對 32 位元和 64 位元整數的向左旋轉: (檔名: rotl.c)

#include <stdint.h>

uint32_t rotl32a(uint32_t x, uint32_t n)
{
    return (x<<n) | (x>>(32-n));
}

uint64_t rotl64a(uint64_t x, uint64_t n)
{
    return (x << n) | (x >> (64-n));
}

使用 gcc 9.3.0 編譯並查看輸出:

$ gcc -o rotl.o rotl.c
$ objdump -M intel -d rotl.o

參考輸出:

0000000000001129 <rotl32a>:
    1129:       f3 0f 1e fa             endbr64 
    112d:       55                      push   rbp
    112e:       48 89 e5                mov    rbp,rsp
    1131:       89 7d fc                mov    DWORD PTR [rbp-0x4],edi
    1134:       89 75 f8                mov    DWORD PTR [rbp-0x8],esi
    1137:       8b 45 f8                mov    eax,DWORD PTR [rbp-0x8]
    113a:       8b 55 fc                mov    edx,DWORD PTR [rbp-0x4]
    113d:       89 c1                   mov    ecx,eax
    113f:       d3 c2                   rol    edx,cl
    1141:       89 d0                   mov    eax,edx
    1143:       5d                      pop    rbp
    1144:       c3                      ret    

0000000000001145 <rotl64a>:
    1145:       f3 0f 1e fa             endbr64 
    1149:       55                      push   rbp
    114a:       48 89 e5                mov    rbp,rsp
    114d:       48 89 7d f8             mov    QWORD PTR [rbp-0x8],rdi
    1151:       48 89 75 f0             mov    QWORD PTR [rbp-0x10],rsi
    1155:       48 8b 45 f0             mov    rax,QWORD PTR [rbp-0x10]
    1159:       89 c2                   mov    edx,eax
    115b:       48 8b 45 f8             mov    rax,QWORD PTR [rbp-0x8]
    115f:       89 d1                   mov    ecx,edx
    1161:       48 d3 c0                rol    rax,cl
    1164:       5d                      pop    rbp
    1165:       c3                      ret

這裡可見 rol 指令，向右旋轉也使用 ror 指令。

John Regehr 在文中提及:

Although this x86-64 rotate instruction will have the expected behavior (acting as a nop) when asked to rotate by 0 or 32 places, this C/C++ function must only be called with n in the range 1-31. Outside of that range, the code has undefined behavior due to shifting a 32-bit value by 32 places. In general, crypto codes are designed to avoid rotating 32-bit words by 32 places. However, as seen in my last post, the current versions of five of the 10 open source crypto libraries that I examined perform rotations by 0 places — a potentially serious bug because C/C++ compilers have unpredictable results when shifting past bitwidth. Bear in mind that the well-definedness of the eventual instruction does not rescue the situation: it is the source code that places (or in this case, fails to place) obligations upon the compiler.

可知當給定的位元旋轉量是 0 或 32 時，可能會導致 undefined behavior，因此限制位移量在

[1, 31]

uint32_t rotl32b(uint32_t x, uint32_t n) {
    assert(n < 32);
    if (!n) return x;
    return (x << n) | (x >> (32 - n));
}

再來，藉由 bitwise 操作，可達到 branchless，並利用 mask 取代assert，程式碼如下:

uint32_t rotl32c(uint32_t x, uint32_t n) {
    uint32_t mask = 32 - 1;
    n &= mask;
    return (x << n) | (x >> (-n & mask));
}

延伸閱讀: 你所不知道的 C 語言: bitwise 操作

開啟最佳化等級 -O3，GCC 輸出以下指令:

00000000000011c0 <rotl32c>:
    11c0:       f3 0f 1e fa             endbr64 
    11c4:       89 f8                   mov    eax,edi
    11c6:       89 f1                   mov    ecx,esi
    11c8:       d3 c0                   rol    eax,cl
    11ca:       c3                      ret

符合預期，出現 rol 指令。

在 ROT0 中，針對不同資料寬度的位元旋轉實作定義為 static inline 函式，這允許編譯器進行更多最佳化，當給定的二個參數都是 constant expression 時。

以下測試程式，使用 ROT0:









void func(unsigned char i) {
    printf("%d\n", rotr8(150, 1));
    printf("%d\n", rotr8(i, 1));
}

int main() {
    func(150);
    return 0;
}

在上方程式碼的第 2 行，函式 rotr8 的二個參數都是 constant expression，以下簡稱「敘述 A」，而在第 3 行，第一個參數為變數，另一者為 constant expression，以下簡稱「敘述 B」。

以下命令開啟 GCC 編譯器最佳化 (等級 -O1)，並列印出中間表示式 (IR):

$ gcc rotation.c -fdump-tree-gimple -S -O1

我們可觀察函式 func 裡頭，A 和 B 這二個敘述對應的 GIMPLE IR，其中針對「敘述 A」:

func (unsigned char i) {
    _1 = rotr8 (150, 1);
    _2 = (int) _1;
    printf ("%d\n", _2);
}

在 IR 內部暫存器 _1，rotr8(150, 1) 在編譯時期即可確定，我們在對應的組合語言輸出可印證:

func:
    .cfi_startproc
    endbr64
    subq	$8, %rsp
   .cfi_def_cfa_offset 16
    movl	$75, %edx ; 常數 150 / 2 = 75 並存入 %edx 暫存器
    leaq	.LC0(%rip), %rsi
    movl	$1, %edi
    movl	$0, %eax
    call	__printf_chk@PLT
    addq	$8, %rsp
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc

反之，「敘述 B」對應的 IR 如下:

func (unsigned char i) {
    _1 = (int) i;
    _2 = rotr8 (_1, 1);
    _3 = (int) _2;
    printf ("%d\n", _3);
}

在 IR 內部暫存器 _2，GCC 無法編譯時期得知 rotr(i, 1) 的數值，次之的最佳化策略就是使用 ror 一類的指令來替換，如果編譯器後端支援的話。對應的組合語言輸出:

func:
    .cfi_startproc
    endbr64
    subq	$8, %rsp
    .cfi_def_cfa_offset 16
    rorb	%dil
    movzbl	%dil, %edx
    leaq	.LC0(%rip), %rsi
    movl	$1, %edi
    movl	$0, %eax
    call	__printf_chk@PLT
    addq	$8, %rsp
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc

我們可見 rorb 指令出現在 GCC 的輸出，隨後結果置入 %edx 暫存器。換言之，使用變數傳遞的方式來進行函式呼叫，才會產生 ror 指令。

位元旋轉實作和 Linux 核心案例

位元旋轉

直觀的實作有何問題？

Linux 核心的位元旋轉應用案例

SHA-256

BLAKE2

SipHash

AES

編譯器最佳化議題

Read more

軟體缺失導致的危害

2025 年 Linux 核心設計課程作業 —— lab0 (A)

建構 User-Mode Linux 的實驗環境

並行程式設計: Atomics 操作