2025q1 Homework5 (assessment)

contributed by < Hlunlun >

一對一討論

2025/05/09: bitwise, IEEE 754

實作 ldexpf : https://man7.org/linux/man-pages/man3/ldexp.3.html

在 Linux kernel 中可能無法使用 FPU 做運算，所以需要將浮點數轉成二進位制做運算

The function returns the result of multiplying the floating-point number x by 2 raised to the power exp.

float my_ldexpf(float x, int exp);

一開始想法
先將 float 轉成 unsigned 然後再去取 sign、exponent、mantissa

float my_ldexpf(float x, int n) // x * 2^{n} ==相當於: x <<= n
{		
	int sign = (unsigned)x >>31;
	int exponent = ((unsigned)x>>23)& 0xff;
	int mantissa = (unsigned)x & 0x7fffff;
}

根據 https://www.h-schmidt.net/FloatConverter/IEEE754.html 表示法去嘗試當 x 是 -1.2345 印出 sign、exponent、mantissa 但除了 sign 其他都不對，如下

     sign  exponent   mantissa
正確: 1      127       1967129
我的: 1      255       2097151

檢討
關於浮點數和整數的轉換可以看到以下 C99 的說明，如果浮點數轉換成整數時會捨棄小數點的部分

C99 [6.3.1.4] Real floating and integer
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero).

所以，當我用 cast operator 直接將 x 轉為 unsigned 時，x 就會只剩整數位的

- 1

，雖然可以取到正確的 sign ，但根本取不到正確的 exponent 和 mantissa

expression	十進位表示	二進位表示
`x`	$- 1.2345 f$	$10111111 10011110 00000110 00010000$
`(unsigned)x`	$- 1$	$11111111 11111111 11111111 11111111$

分析
了解浮點數和整數轉換的問題後，我們知道不行直接將浮點數用 cast operator 轉換成整數，而是用 union 對同一塊記憶體取不同型態的數值，這種 type punning 類別重疊的說明如下

C99 [6.5.2.3] Structure and union members 底下的補充說明
If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.

此處提及的 trap representation 在 C99 [6.2.6.1] 有解釋到：

問題

為什麼這跟 type punning 有關?
「由沒有字元類型的左值表達式讀取」甚麼意思?
「如果這種表示是由副作用產生的」是指甚麼副作用?
side effect: 有 an unexpected result of a situation 未曾預料的結果的意思

出自 C99 [6.2.6.1]
Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined.41) Such a representation is called a trap representation

讀到這裡有個問題：那剛剛直接用 unsigned 去 cast 浮點數後輸出的
$11111111 11111111 11111111 11111111$ 是代表
$- 1$ 還是未定義 NaN ？

TODO: 所以是怎麼取值的

所以將浮點數轉為二進位數的方式如下

union{
    float f;
    uint32_t u32;
}val;

val.f = x;

取出 sign, exp, frac;

uint32_t  sign = val.u32 >> 31;
uint32_t  exp = (val.u32 >> 23) & 0xff;
uint32_t  frac = val.u32 & 0x7fffff;

轉換為二進位置後，就可以來看要怎麼進行相乘運算

浮點數的二進位制運算
更多浮點數的運算可以看到教材 CS:APP 第 2 章重點提示和練習，所以取得 sign、exp、frac 就可以進行單精度浮點數和

2^{n}

的乘法運算了，以下使用

x = - 1.2345

為例

首先浮點數編碼成二進位的方式為

v a l u e = (- 1)^{s} \times M \times 2^{E}

也就是我們只要找到

s

(sign) 、

M

、

E

就可以運算浮點數和

2^{n}

的乘法，

s

已經知道是

1

，將

1.2345

轉換成二進位制會是無限循環，由於是單精度浮點數，所以就取到

23

位即可

- 1.2345 = (- 1)^{1} \times {1.00111100000010000011001 . . .}_{2} \times 2^{0}

Significand:

$\begin{aligned} M & = 1. {\underset{―}{00111100000010000011001}}_{2} \\ f r a c & = 00111100000010000011001 \end{aligned}$
Exponent

$\begin{aligned} E & = 0 \\ B i a s & = 127 \\ e x p & = 127 = 01111111 \end{aligned}$

Result

現在回到題目，我們這個時候會取得需要相乘的 2 的

n

次方，而 ~~2 的次方~~ 2 的冪之間的相乘就是將次方數直接相加，如下

2^{E} \times 2^{n} = 2^{E + n}

而此時的

e x p

會變成

e x p = E + B i a s

e x p^{'} = (E + n) + B i a s

那其實剛剛已經取出 sign、exp、frac，所以可以直接將 n 加到 exp

int new_exp = exp + n;

那最後的答案就會是這三個值的 logical or

val.u = sign << 31 | new_exp << 23 | frac

完全不會動到 frac 和 sign，只要處理 exp 的部份，其實只要理解浮點數在電腦中的編碼模式就能解這題了

最後回傳 val.f 就是答案

測試
CS:APP 第二章有提到浮點數編碼後會有三種狀態

denormalized :
$e x p = 00. . .00$
normalized :
$e x p \neq 00. . .00$ and
$e x p \neq 11. . .11$
special :
$e x p = 11. . .11$

在運算 new_exp 之前可以先檢查 x 是否為有效值

x 是 NaN，其實就是在檢查 exp 的部份

 if (val.u32 > 0x7f800000)...

x 是無限大或無限小

if(val.u32 == 0x7f800000)...

x 是 0
```
if(val.u32 == 0)...
```

再來需要檢查 new_exp 是否為有效值

new_exp overflow
因為
$e x p$ 只有 8 個位元，所以需要檢查 new_exp 是否超過
$128$ 也就是二進位制的
$11111111$
```
if(new_exp >= 0xff)...
```
new_exp underflow
```
if(new_exp <= 0)...
```
x 是 normal 但輸出為 denormal
```
if(new_exp < 1)...
```

完整程式碼

float my_ldexpf(float x, int n){
    union{
        float f;
        uint32_t u32;
    }val;
    
    val.f = x;
    uint32_t sign = val.u32 & 0x80000000;
    uint32_t exp = (val.u32 >> 23) & 0xff;
    uint32_t frac = val.u32 & 0x7fffff;
    
    // x = +-inf or 0 or n=0
    if (exp >= 0xff || x == 0 || n ==0)
        return x;
        
    uint32_t new_exp = exp + n;
    
    if(new_exp >= 0xff){ // overflow or nan
        val.u32 = sign | 0x7f800000;
        return val.f;
    }
    
    if(new_exp <=0){ // underflow
        val.u32 = sign;
        return val.f;
    }
    
    val.u32 = sign | new_exp << 23 | frac;
    
    return val.f;
}

參考

2025/05/14 `my_ldexpf` 改進

驗證不夠，且需要更精簡

因為我們可以取得的有意義的值就是 normalized的部分，在單經度浮點數的世界的話就是

e x p \in (0 x 00, 0 x f f)

所以可以在最後算完 exp 後直接和 0xff 用 logical and 取出有效範圍，然後再

float my_ldexpf(float x, int n){
    union{
        float f;
        uint32_t u32;
    }val;
    
    val.f = x;
    uint32_t sign = val.u32 & 0x80000000;
    uint32_t exp = (val.u32 >> 23) & 0xff;
    uint32_t frac = val.u32 & 0x7fffff;
            
    exp += n;
    exp = exp >= 0xff ? 0 : exp;
    
    val.u32 = sign | exp << 23 | frac;
    
    return val.f;
}

2025/05/15 幾何平均數 Geometric Mean

幾何平均數

定點數

在維基百科介紹中描述了定點數表示法中，整數和小數的部分會用相同底數

b

來表示(就是會是一樣是十進位或是二進位)，如果此有禮數的小數的部分到第

n

位，此有理數會是

b^{- n}

的整數倍

In the fixed-point representation, the fraction is often expressed in the same number base as the integer part, but using negative powers of the base b. …Thus, if n fraction digits are stored, the value will always be an integer multiple of
$b^{- n}$ .

乍看之下，其實跟浮點數很相似，都是表示有理數的一種方式，但是可以看到，定點數會有 scaling factors 可以讓數值表示成整數在運算，如下:

假設我們現在要將

0.123

和

2.5

相乘，我們可以想成這兩個定點數相乘

(1230 \times 10^{- 3}) \times (25 \times 10^{- 1})

縮放因子在此次運算中即為

10^{- 3} \times 10^{- 1} = 10^{- 4}

此時就可以不用用到 FPU 的將兩個整數相乘

123 \times 25 = 3075

最後再去乘上縮放因子即可得到結果

3075 \times 10^{- 4} = 0.3075

回到 linux kernel 中的浮點數相乘，我們如果可以用定點數的概念，那麼就可以用以下步驟得到結果

$M$ 變成定點數
$M$ 整數部分相乘
用
$E$ 和縮放因子的指數做簡單的加法運算

浮點數相乘

CMU CS:APP 課程中 Floating Point 講義第 27 頁提到有關於浮點數編碼成二進位的乘法運算

開根號
因為浮點數是轉換成二進位制，所以需要轉換成底數為 2 的冪，算出指數的部分

x

，就可以直接和

E

做運算

\begin{aligned} M^{\frac{1}{n}} = 2^{x} \\ \Rightarrow x = \frac{1}{n} l o g_{2} M \end{aligned}

運算

l o g_{2} M

的部分則可以透過以下時做完成

int log = -1; 
while(M) { 
    log++; 
    M >>= 1; 
}

參考

你所不知道的 C 語言: 浮點數運算

自動販賣機而延畢的那一年：閱讀啟發

作者 Opass 的在開發自動手搖飲料機的過程就如同 Linux Torvalds 的那句話：「我是看著地面的人，我想修補就在我面前的坑洞，以免跌進去。」

2025q1 Homework5 (assessment)

一對一討論

2025/05/09: bitwise, IEEE 754

2025/05/14 `my_ldexpf` 改進

2025/05/15 幾何平均數 Geometric Mean

自動販賣機而延畢的那一年：閱讀啟發

研讀第 1 到第 6 週「課程教材」和 CS:APP 3/e (至少到第二章)

第一章

第二章

學習狀況

簡述想投入的專案

2025q1 Homework5 (assessment)

一對一討論

2025/05/09: bitwise, IEEE 754

2025/05/14 my_ldexpf 改進

2025/05/15 幾何平均數 Geometric Mean

自動販賣機而延畢的那一年：閱讀啟發

研讀第 1 到第 6 週「課程教材」和 CS:APP 3/e (至少到第二章)

第一章

第二章

學習狀況

簡述想投入的專案

Read more

Linux 期末自我評量

`.venv` 虛擬環境

2025q1 Homework3 (kxo)

2025q1 Homework1 (ideas)

2025/05/14 `my_ldexpf` 改進