最大公因數特性和實作考量

貢獻者: jserv, TibaChang, Tcc0403

簡介

本文探討最大公因數 (Greatest Common Divisor, GCD) 特性，考慮微處理器架構實作帶來的效能衝擊，並探討如何運用 Linux 核心提供的 perf 工具進行效能分析並改進。

Greatest Common Divisor 中譯「最大公因數」，簡稱 GCD。
給定整數

a, b

，二者各自有自己的因數^[1]，取相同的因數中最大的數，即最大公因數

gcd (a, b)

GCD 性質

$gcd (a, b) = gcd (b, a)$
$gcd (a, 0) = | a |$
$gcd (a, b) = gcd (a, b % a)$

$%$ 是取餘數運算，表示存在
$k$ 滿足
$0 \leq b % a = b - k \cdot a < a$

Euclidean algorithm

歐幾里得法，也就是「輾轉相除法」

給定整數

a, b

求出

gcd (a, b)

令

a_{0} = a, b_{0} = b

，根據上述 GCD 性質：

\begin{aligned} gcd (a_{0}, b_{0}) & = gcd (b_{0} % a_{0} = \underset{―}{b_{1}}, a_{0}) \\ gcd (\underset{―}{b_{1}}, a_{0}) & = gcd (a_{0} % b_{1} = \underset{―}{a_{1}}, b_{1}) \\ gcd (\underset{―}{a_{1}}, b_{1}) & = gcd (b_{1} % a_{1} = b_{2}, a_{1}) \\ ⋮ \\ gcd (a_{i}, b_{i}) & = gcd (b_{i} % a_{i} = a_{i + 1}, b_{i}) \\ ⋮ \\ gcd (0, b_{n}) & = | b_{n} | \end{aligned}

^[2]

展示動畫

綜合上面過程，可得程式碼實作：

int gcd(int a, int b)
{
    return a ? gcd(b % a, a) : b;
}

Bezout's Theorem

對於所有整數

a, b

，存在整數

x, y

使得

a x + b y = gcd (a, b)

Extended Euclidean algorithm

給定整數

a, b

求出整數

x, y

滿足

a x + b y = gcd (a, b)

令

g = gcd (a, b)

，配合上述 Bezout's Theorem 可得：

0 \cdot x_{n} + g \cdot y_{n} = g

此處

x_{n} \in Z, y_{n} = 1

即
$x_{n}$ 為任意整數

而根據 GCD 性質

gcd (a_{i}, b_{i}) = gcd (b_{i} % a_{i}, a_{i})

以及

b_{i} % a_{i} = b_{i} - ⌊ \frac{b_{i}}{a_{i}} ⌋ \cdot a_{i}

得到

\begin{aligned} g & = x_{j} \cdot (b_{i} % a_{i}) & + y_{j} & \cdot a_{i} \\ = x_{j} \cdot (b_{i} - ⌊ \frac{b_{i}}{a_{i}} ⌋ \cdot a_{i}) & + y_{j} & \cdot a_{i} \\ = x_{j} \cdot b_{i} & + (y_{j} - ⌊ \frac{b_{i}}{a_{i}} ⌋ \cdot x_{j}) & \cdot a_{i} \end{aligned}

也就是說

g = a_{i} \cdot x_{j - 1} + b_{i} \cdot y_{j - 1} for {\begin{cases} x_{j - 1} = y_{j} - ⌊ \frac{b_{i}}{a_{i}} ⌋ \cdot x_{j} \\ y_{j - 1} = x_{j} \end{cases}

綜合上述過程：

/* It uses pointers to return multiple values */
int extended_gcd(int a, int b, int *x, int *y)
{
    if (a == 0) {
        *x = 0;
        *y = 1;
        return b;
    }

    int _x, _y;
    int gcd = extended_gcd(b % a, a, &_x, &_y);

    *x = _y - (b/a) * _x;
    *y = _x;

    return gcd;
}

GCD 實作和微處理器指令

回顧歐幾里得法:

大數 := 大數 - 小數
若「最大」和「最小」兩數相同，那麼它就是最大公因數，否則，回到第 1 步

示範 GCD(48, 40) 的計算過程

48 - 40, 40
$\to$ 8, 40
8 != 40, 於是重複上述動作: 40 - 8, 8
$\to$ 32, 8
8 != 32, 於是重複上述動作: 32 - 8, 8
$\to$ 24, 8
8 != 24, 於是重複上述動作: 24 - 8, 8
$\to$ 16, 8
8 != 16, 於是重複上述動作: 16 - 8, 8
$\to$ 8, 8
8 == 8 成立，於是 8 為 GCD

用 C 語言實現歐幾里得法: (v1)

int findGCD(int a, int b)
{
    while (1) {
        if (a > b) a -= b;
        else if (a < b) b -= a;
        return a;
    }
}

針對 Pentium 4 硬體的實驗

以 Intel C compiler 產生的機械碼來說: (Pentium 4)

資料來源: The Software Optimization Cookbook: High Performance Recipes for IA-32 Platforms, 2nd Edition
針對 2004 年的個人電腦

指令	數量	延遲	cycle count
減法	5	1	5
比較	5	1	5
分支	14	1	14
其他	0	1	0
			24

改成 mod 操作的 C 語言程式如下： (v2)

int findGCD(int a, int b)
{
    while (1) {
        a %= b;
        if (a == 0) return b;
        if (a == 1) return 1;
        b %= a;
        if (b == 0) return a;
        if (b == 1) return 1;
    }
}

以 Intel C compiler 產生的機械碼來說:

指令	數量	延遲	cycle count
mod (整數除法)	2	68	136
比較	3	1	3
分支	3	1	3
其他	6	1	6
			148

若可排除整數除法和減法的負擔，得到以下實作： (v3)

int findGCD(int a, int b)
{
    while (1) {
        if (a > (b * 4)) {
            a %= b;
            if (a == 0) return b;
            if (a == 1) return 1;
        } else if (a >= b) {
            a -= b;
            if (a == 0) return b;
            if (a == 1) return 1;
        }
        if (b > (a * 4)) {
            b %= a;
            if (b == 0) return a;
            if (b == 1) return 1;
        } else if (b >= a) {
            b -= a;
            if (b == 0) return a;
            if (b == 1) return 1;
        }
    }
}

可得到更快的實作:

    v1: 14.56s
    v2: 18.55s
    v3: 12.14s

上述都是針對 2004 年的個人電腦所得的實驗結果。

針對 Arm Cortex-A8 進行實驗

平台：BeagleBone Black (ARMv7-A, 1 GHz)
實驗程式碼：GCD_Efficiency_Compare

採用組合語言撰寫 (GCD 的部份)，由於 Cortex-A8 不支援 idiv (要在 ARM Cortex-A15 或 A7 以上才有)，因此改用以 Arm 組合語言實作

發現 gcc 的除法會針對 2 的倍數進行判斷，移位（除法）
實作的版本尚未考慮 exception
- A Fast Hi Precision Fixed Point Divide 列出 exception 重點

實驗方法

圖上一個點，代表 GCD (big, small) 中的 big，會尋找比它小的所有正整數（>1）的GCD所需的時間「總和」
若存在 big=9995，那麼會找 9993 組 GCD，small 的範圍從 2 到 9994
- 如: (9995,2), (9995,3), (9995,4), …, (9995,9994)
所以 big 越大，找尋的組數越多，整體趨勢微微上升（從左到右）是正常的
重點是比較各演算法的趨勢

ARM vs. Thumb instruction

ARM 指令執行較快
直線為 curve fitting 的結果 (時間越低，效能越高)

Assembly

MOD:

@ Entry  r0: low (remainder low) must be postive<input:Dividend>
@        r1: high (remainder high) any number
@        r2: Divisor (divisor) must be non-zero and
@            positive<input:Divisor>

    low .req r0;
    high .req r1;
    divisor .req r2;

    mov high,#0
    .rept 33    @ repeat 33 times
        subs high, high, divisor    @ remainder = remainder - divisor
        it lo
        addlo high, high, divisor
        @ low << 1; ((high-divisor) > 0) ? (new bit = 1) : (new bit = 0)
        adcs low, low, low
        @ divisor >> 1 == high(64bit for high & low) << 1
        adc high, high, high
    .endr

    @ shift remainder to correct position
    @ because last divide don't know this is the last divide,
    @ it still prepare for the next divide<shift>)
    lsr high, #1

@ Exit   r0: Quotient (remainder low)
@        r1: remainder (remainder high)
    mov pc, lr

Euclidean(thumb) V.S. mod(thumb) V.S. mod_minus(thumb)

不符合預期的原因推估

v3 版本的 branch 過多（if,else），thumb 的 branch 效能不佳，導致拉低平均
- if 數量比較（一個 loop）
  - v1 : 3 個
  - v3 : 7 個

Euclidean(arm) V.S. mod(arm) V.S. mod_minus(arm)

比較

與 thumb 版本
- arm 的 branch 效能較佳，所以 v3 版本速度改善最多，這裡符合預期
v1 輸給 v2
- arm 的 v1 與 thumb 的 v1 相比，只有些許提昇 (對照第一張圖），但是 mod 卻提昇較多（幾乎是半張圖片的範圍），導致這個結果
curve fitting

C 語言

arm-angstrom-linux-gnueabi-gcc 內的除法經過最佳化後，效能比減法更佳

arm-linux-gnueabihf-gcc(linaro) (without vfpv3)
- 上下圖比較，可看到整體在 arm-angstrom-linux-gnueabi-gcc 效能比較好
  - 即使在這裡，單純除法比單純減法（的演算法）還快…
  - 可是綜合起來看 v3 更快(符合預測)

arm-linux-gnueabihf-gcc(linaro) (with vfpv3)
- mod 加快，減法變慢
- 開了導致反效果

Binary GCD

Binary GCD Algorithm

binary GCD 的精神就是當兩數為偶數時，必有一

2

因子。

gcd (x, y) = 2 \cdot gcd (\frac{x}{2}, \frac{y}{2})

且一數為奇數另一數為偶數，則無

2

因子。

gcd (x, y) = gcd (x, \frac{y}{2})

於是我們可以改良為：

even_factor = min (ctz (x), ctz (y)) gcd (x, y) = even_factor \cdot gcd ((x ≫ even_factor), (y ≫ even_factor))

其中符號

≫

是 right shift。
剩下的過程就直接採用 Euclidean algorithm 的作法即可。

注意，僅在微處理器具備有效的 ctz 指令時，Binary GCD 才會展現相較於 Euclidean 的優勢，這也是為何 Linux 核心提供二種 gcd 實作。

unsigned long gcd(unsigned long a, unsigned long b)
{
    unsigned long r = a | b;
    if (!a || !b) return r;
    b >>= __builtin_ctzl(b);
    if (b == 1) return r & -r;
    for (;; a -= b) {
        a >>= __builtin_ctzl(a);
        if (a == 1) return r & -r;
        if (a == b) return a << __builtin_ctzl(r);
        if (a < b) { unsigned long t = a; a = b; b = t; }
    }
}

改進實作效能

bingcd

it requires GCC or Clang, a 64-bit x86 architecture, and a target CPU that supports the BMI2 opcodes (i.e. Haswell or later, in the line of Intel CPU).

考慮以下實作：

#include <stdint.h>
uint64_t gcd64(uint64_t u, uint64_t v)
{
    if (!u || !v) return u | v;
    int shift;
    for (shift = 0; !((u | v) & 1); shift++) {
        u /= 2, v /= 2;
    }
    while (!(u & 1))
        u /= 2;
    do {
        while (!(v & 1))
            v /= 2;
        if (u < v) {
            v -= u;
        } else {
            uint64_t t = u - v;
            u = v;
            v = t;
        }
    } while (v);
    return u<<shift;
}

GCD 演算法

If both x and y are 0, gcd is zero
$g c d (0, 0) = 0$ .
$g c d (x, 0) = x$ and
$g c d (0, y) = y$ because everything divides 0.
If x and y are both even,
$g c d (x, y) = 2 \times g c d (\frac{x}{2}, \frac{y}{2})$ because
$2$ is a common divisor. Mulitplication with
$2$ can be done with betwise shift operator.
If x is even and y is odd,
$g c d (x, y) = g c d (\frac{x}{2}, y)$ .
On similar lines, if x is odd and y is even, then
$g c d (x, y) = g c d (x, \frac{y}{2})$ . It is because
$2$ is not a common divisor.

以下分段說明：

if (!u || !v) return u | v;

\to

對應至 GCD 演算法第 2 點

$g c d (x, 0) = x$ and
$g c d (0, y) = y$ because everything divides 0.

若 u 和 v 中其中一數為 0 ，則直接回傳另一數

x | 0 會得到 x

int shift;
for (shift = 0; !((u | v) & 1); shift++) {
    u /= 2, v /= 2;
}

\to

對應至 GCD 演算法第 3 點

If x and y are both even,
$g c d (x, y) = 2 \times g c d (\frac{x}{2}, \frac{y}{2})$ because
$2$ is a common divisor. Mulitplication with
$2$ can be done with betwise shift operator.

先取出u 和 v 之間屬於

2^{n}

的公因數，shift 即為最大的

n

while (!(u & 1))
    u /= 2;

\to

對應至 GCD 演算法第4 點

If x is even and y is odd,
$g c d (x, y) = g c d (\frac{x}{2}, y)$ .

當 u 為偶數時，可以除以

2

do {
    while (!(v & 1))
        v /= 2;
    if (u < v) {
        v -= u;
    } else {
        uint64_t t = u - v;
        u = v;
        v = t;
    }
} while (v);

\to

其中 while(!(v &1)) v /= 2; 對應至 GCD 演算法第 5 點

On similar lines, if x is odd and y is even, then
$g c d (x, y) = g c d (x, \frac{y}{2})$ . It is because
$2$ is not a common divisor.

可以透過迴圈不變性 (loop invariant) 來幫助解讀程式

u 為被除數
v 為除數

當除數 v為 0 時結束迴圈，被除數 u 為進入迴圈前 u 和 v 的最大公因數

事實上 u 在迴圈前不一定是被除數，因為先前透過 GCD 演算法第 4 點去除不必要的計算

return u << shift;

最後回傳最大公因數時，乘上原先取出屬於

2^{n}

的公因數，可以透過左移達成

透過 `__builtin_ctz` 改寫程式

已知除法運算會耗費較多時間，我們可以利用 __builtin_ctz (編譯器會產生對應到現代微處理器的專門指令) 搭配上述針對二進位特性調整的演算法，減少除法的使用

將一百萬組由 rand 函式生成的亂數傳入 gcd64 函式執行，並藉由 Linux 核心提供的 perf 工具來分析：

$ gcc -g3 -O0 gcd.c -o gcd
$ perf record ./gcd
$ perf annotate

由上圖可以發現，花費 CPU 週期數最多的段落為迴圈當中的 v /= 2 運算，佔據約 28%

以下為改寫過後的程式

uint64_t gcd64_ctz(uint64_t u, uint64_t v)
{
    if (!u || !v) return u | v;
    int shift = __builtin_ctzll(u | v);
    u = u >> __builtin_ctzll(u);
    
    do {
        v = v >> __builtin_ctzll(v);
        if (u < v) {
            v -= u;
        } else {
            uint64_t t = u - v;
            u = v;
            v = t;
        }
    } while (v);
    return u << shift;
}

透過 perf 分析前後差異

$ gcc -g -O0 gcdcmp.c -o gcdcmp
$ perf record ./gcdcmp
$ perf report

將同樣一百萬筆由 rand 函式所產生的亂數傳入 gcd64 和 gcd64_ctz 函式比較，發現改寫過後的版本花費時間幾乎是原來的一半。

透過 perf 分析改寫程式

$ gcc -g -O0 gcd_ctz.c -o gcd_ctz
$ perf record ./gcd_ctz
$ perf annotate

佔據花費時間最多的為分支指令 (如 jne)，其次是迴圈內部的減法運算

整數
$n$ 的因數為可整除
$n$ 的數 ↩︎
Wikipedia/ 輾轉相除法的展示動畫 ↩︎