浮點數乘法探討

## float_mul2 探討 IEEE 754, float is single precision. Assume 32-bit ```c float float_mul2(float x) { // using bitwise operation, no mul int a = *((int*)&x); int b = *((int*)&x); a = (a & 0x7F800000) >> 23; a++; b = (b & 0x807FFFFF) | (a << 23); return *((float*)&b); } ``` 考慮 overflow 後的做法 : ```c float float_mul2(float x) { // using bitwise operation, no mul int a = *((int*)&x); int b = *((int*)&x); a = (a & 0x7F800000) >> 23; a++; if(a == 0xFFFFFFFF) return NAN; else b = (b & 0x807FFFFF) | (a << 23); return *((float*)&b); } ``` > https://godbolt.org/ * Instruction latency 從老師提供的 instruction tables 找到對應的處理器架構，並且將上方程式碼轉換成組合語言去逐行計算，得到的結果為： ```c float_mul2: # @float_mul2 push rbp mov rbp, rsp movss dword ptr [rbp - 4], xmm0 mov eax, dword ptr [rbp - 4] mov dword ptr [rbp - 8], eax mov eax, dword ptr [rbp - 4] mov dword ptr [rbp - 12], eax mov eax, dword ptr [rbp - 8] and eax, 2139095040 sar eax, 23 mov dword ptr [rbp - 8], eax mov eax, dword ptr [rbp - 8] add eax, 1 mov dword ptr [rbp - 8], eax mov eax, dword ptr [rbp - 12] and eax, -2139095041 mov ecx, dword ptr [rbp - 8] shl ecx, 23 or eax, ecx mov dword ptr [rbp - 12], eax movss xmm0, dword ptr [rbp - 12] # xmm0 = mem[0],zero,zero,zero pop rbp ret ``` |Move instructions|Arithmetic instructions|Logic|Control transfer instructions| |-|-|-|-| |push * 1, mov * 12, movss * 2, pop * 1|add * 1|and * 2, sar * 1, shl * 1, or * 1|ret * 1| 計算過後可知該程式總共需要 29 (2+12+2+1+1+2+7+1+1)個 clock cycle 。以下為稍微改進過後的版本，嘗試只用一個變數來節省記憶體空間 : ```c float float_mul2(float x) { // using bitwise operation, no mul int a = *((int*)&x); a = ((a & 0x7F800000) + 0x00800000) | (a & 0x807FFFFF); return *((float*)&a); } ``` 而這麼做還是太複雜，因為要先得到 exponent 的那 8 個 bits 加 1 後再把其他部分補回，因此更簡單的想法為直接在 exponent 那項加 1 ，修改過後的程式碼如下 : ```c float float_mul2(float x) { // using bitwise operation, no mul *(int*)&x += 0x00800000; return x; } ``` 轉換為組合語言 : ```c float_mul2: # @float_mul2 push rbp mov rbp, rsp movss dword ptr [rbp - 4], xmm0 mov eax, dword ptr [rbp - 4] add eax, 8388608 mov dword ptr [rbp - 4], eax movss xmm0, dword ptr [rbp - 4] # xmm0 = mem[0],zero,zero,zero pop rbp ret ``` |Move instructions|Arithmetic instructions|Logic|Control transfer instructions| |-|-|-|-| |push * 1, mov * 3, movss * 2, pop * 1|add * 1|none|ret * 1| 計算過後可知該程式總共需要 9 (2+3+2+1+1)個 clock cycle ，可以發現確實比第一個版本少了許多。參考 [類神經網路的 ReLU 及其常數時間複雜度實作](https://hackmd.io/@sysprog/constant-time-relu?fbclid=IwZXh0bgNhZW0CMTAAAR0rA_gp9TtuqOBuqAJE2uutnvVZcFOSFlBPO9HspdhAA27KtGx6_rdC1aQ_aem_AbKPshOIUtxP5VUsyxPO6IG4EvaWPm2n-8Idh7E0hXijU7Y45cxOfZIWXO1McstmzjxNWThJiOkVZLV_kqQ-1Fby) 教材，裡面提到了由於對浮點數的操作成本太高，因此可以使用 union 的技巧來優化， union 可以讓浮點數和整數共用一段記憶體空間，並且在任一時刻只有一個成員有效，宣告的方法如下： ```c union{ float f; int i; } out = {.f=x}; ``` 宣告一個名為 out 的 union ，裡面輸入需要操作的資料型態，而最後的 `{.f=x}` 是在做資料得初始化，將 input `x` 的值指派到 `f` ，後續則可以根據需求使用 `out.f` 及 `out.i` 來獲得兩種不同的資料型態。以下給出一個簡單的例子： ```c //code float example(float x){ union{ float f; int i; } out = {.f=x}; printf("out.i = %d\n", out.i); printf("out.f = %f\n", out.f); } int main() { float x = 2; example(x); return 0; } //output out.i = 1073741824 out.f = 2.000000 ``` 原因為在這段記憶體裡存著 `0x40000000` ，以浮點數表示為 2 ，以整數表示則為 $2^{30}$ = 1073741824 。最後，將原本的程式碼改寫： ```c float float_mul2(float x) { // using bitwise operation, no mul union{ float f; int i; } out = {.f=x}; out.i += 0x00800000; return out.f; } ``` 轉換為組合語言 : ```c float_mul2: # @float_mul2 push rbp mov rbp, rsp movss dword ptr [rbp - 4], xmm0 movss xmm0, dword ptr [rbp - 4] # xmm0 = mem[0],zero,zero,zero movss dword ptr [rbp - 8], xmm0 mov eax, dword ptr [rbp - 8] add eax, 8388608 mov dword ptr [rbp - 8], eax movss xmm0, dword ptr [rbp - 8] # xmm0 = mem[0],zero,zero,zero pop rbp ret ``` |Move instructions|Arithmetic instructions|Logic|Control transfer instructions| |-|-|-|-| |push * 1, mov * 3, movss * 4, pop * 1|add * 1|none|ret * 1| 計算過後可知該程式總共需要 11 (2+3+4+1+1)個 clock cycle ，雖然相較於第二個版本多了 2 個 clock cycle ，但用 union 的寫法會比較安全，因為強制將資料型別轉換可能會造成 strict aliasing ，在 C 語言中會發生為定義的行為。 ## float_mul_power_of_2 探討 ```c float float_mul_power_of_2(float x, int e) { // using bitwise operation, no mul union{ float f; int i; } out = {.f=x}; out.i += (e << 23); return out.f; } ``` 利用和上述 `float_mul2` 相同的概念，若要乘上 2 的冪，則把指數部分的 `+1` 改為 `+e` 即可，但若是直接用 `(e << 23)` 這樣的寫法會有問題 : 若該整數 `e` 夠大或夠小，則位移後會影響到最左方的 sign bit，因此我們需想辦法將其大小控制在 8 個 bits 以內，以符合 IEEE 754 的範圍。解決方法 : 首先，由於浮點數的指數部分的範圍為 0 ~ 255 ，因此我們可以把指數部分單獨取出來看是否有在範圍內，對應的程式碼如下 : ```c int exp = (out.i >> 23) & 0xFF; exp += e; if(exp < 0 || exp > 255) return 0; ``` 該程式碼的作用為先將 fraction bits 23 位過濾掉，然後再和 8 個 1 做 `&` 運算，以取得原數的指數部分，接著用整數型態的變數和 input `e` 相加，最後再判斷是否有在範圍內，完整的程式碼如下 : ```c float float_mul_power_of_2(float x, int e) { // using bitwise operation, no mul union{ float f; int i; } out = {.f=x}; int exp = (out.i >> 23) & 0xFF; exp += e; if(exp < 0 || exp > 255) return 0; else out.i = (out.i & 0x807FFFFF) | (exp << 23); return out.f; } ``` 其中 `return 0` 的部分若是引進 `math.h` 函式庫則可以再細分為不同的情況來表達，以下為修改過後的程式碼 : ```c float float_mul_power_of_2(float x, int e) { // using bitwise operation, no mul union{ float f; int i; } out = {.f=x}; int exp = (out.i >> 23) & 0xFF; exp += e; if(exp < 0) return 0; else if(exp >= 255) return INFINITY; else out.i = (out.i & 0x807FFFFF) | (exp << 23); return out.f; } ```