2024q1 Homework5 (assessment)

# 2024q1 Homework5 (assessment) contributed by < [Petakuo](https://github.com/Petakuo) > ## 閱讀[〈因為自動飲料機而延畢的那一年〉](https://blog.opasschang.com/the-story-of-auto-beverage-machine-1/)的啟發在閱讀完這篇文章後，我第一個想法是：14個月的努力就這樣被拆解然後裝進倉庫裡！？雖然對我一個外人來說可能會覺得浪費時間，但他們在這段時間所得到的一定超乎我的想像，不僅是專業知識、技巧，甚至是在解決問題和合作方面也都獲得許多成長，這何嘗不是一個寶貴的經驗？而在整篇文章中，最吸引我注意的地方在解決冰塊問題那段，他們想了許多辦法都無法有效解決問題，但在準備放棄之時 Jserv 出現了，我非常好奇在這種情況下 Jserv 會怎麼做，畢竟倒冰塊也不是他的專業，他能給出有用的想法嗎？答案是可以， Jserv 給出了我認為非常有用的一句話：「你最大的問題在太害怕失敗了。」，這是大多數人的通病，也是因為這樣才導致許多的想法被埋沒或是中途放棄，而面對害怕失敗的自己，應該要有這樣的想法：「你該學習的不是看到事情要完蛋了就去避免失敗，而是應該學習如何處理與承受失敗，你才能變得比以前更強大。」，就像 Jserv 上課時說過的，誠實的面對自己，知道自己哪裡有不足就去補起來，這也讓我想到了我害怕作業的進度跟不上而去參考別的同學的程式碼，但這樣我原本的問題始終沒有解決，只是將其隱藏而已，我會找時間重新審視當初寫作業所遇到的問題，並且想出解決辦法。最後，文章中有一句話我非常認同，那就是「這個世界比任何人都殘酷，也比任何人都公平，犧牲了多少就會得到多少。」，在這個殘酷且公平的世界，要想得到會回報本該就要有相對的付出，不管最終的結果是以有形還是無形的方式呈現，就像修這門課一樣，雖然考試成績總是不理想，但若是願意繼續花時間去學習，相信一定能帶走許多東西。 --- ## float_mul2 探討 IEEE 754, float is single precision. Assume 32-bit ```c float float_mul2(float x) { // using bitwise operation, no mul int a = *((int*)&x); int b = *((int*)&x); a = (a & 0x7F800000) >> 23; a++; b = (b & 0x807FFFFF) | (a << 23); return *((float*)&b); } ``` ```c float float_mul_power_of_2(float x, int e) { int a = *((int*)&x); int b = *((int*)&x); a = (a & 0x7F800000) >> 23; a += e; b = (b & 0x807FFFFF) | (a << 23); return *((float*)&b); } ``` 考慮 overflow 後的做法 : ```c float float_mul2(float x) { // using bitwise operation, no mul int a = *((int*)&x); int b = *((int*)&x); a = (a & 0x7F800000) >> 23; a++; if(a == 0xFFFFFFFF) return NAN; else b = (b & 0x807FFFFF) | (a << 23); return *((float*)&b); } ``` > https://godbolt.org/ * Instruction latency 從老師提供的 instruction tables 找到對應的處理器架構，並且將上方程式碼轉換成組合語言去逐行計算，得到的結果為： ```c float_mul2: # @float_mul2 push rbp mov rbp, rsp movss dword ptr [rbp - 4], xmm0 mov eax, dword ptr [rbp - 4] mov dword ptr [rbp - 8], eax mov eax, dword ptr [rbp - 4] mov dword ptr [rbp - 12], eax mov eax, dword ptr [rbp - 8] and eax, 2139095040 sar eax, 23 mov dword ptr [rbp - 8], eax mov eax, dword ptr [rbp - 8] add eax, 1 mov dword ptr [rbp - 8], eax mov eax, dword ptr [rbp - 12] and eax, -2139095041 mov ecx, dword ptr [rbp - 8] shl ecx, 23 or eax, ecx mov dword ptr [rbp - 12], eax movss xmm0, dword ptr [rbp - 12] # xmm0 = mem[0],zero,zero,zero pop rbp ret ``` |Move instructions|Arithmetic instructions|Logic|Control transfer instructions| |-|-|-|-| |push * 1, mov * 12, movss * 2, pop * 1|add * 1|and * 2, sar * 1, shl * 1, or * 1|ret * 1| 計算過後可知該程式總共需要 29 (2+12+2+1+1+2+7+1+1)個 clock cycle 。稍微改進過後的版本，嘗試只用一個變數來節省記憶體空間 : ```c float float_mul2(float x) { // using bitwise operation, no mul int a = *((int*)&x); a = ((a & 0x7F800000) + 0x00800000) | (a & 0x807FFFFF); return *((float*)&a); } ``` 而這麼做還是太複雜，因為要先得到 exponent 的那 8 個 bits 加 1 後再把其他部分補回，因此更簡單的想法為直接在 exponent 那項加 1 ，修改過後的程式碼如下 : ```c float float_mul2(float x) { // using bitwise operation, no mul *(int*)&x += 0x00800000; return x; } ``` 轉換為組合語言 : ```c float_mul2: # @float_mul2 push rbp mov rbp, rsp movss dword ptr [rbp - 4], xmm0 mov eax, dword ptr [rbp - 4] add eax, 8388608 mov dword ptr [rbp - 4], eax movss xmm0, dword ptr [rbp - 4] # xmm0 = mem[0],zero,zero,zero pop rbp ret ``` |Move instructions|Arithmetic instructions|Logic|Control transfer instructions| |-|-|-|-| |push * 1, mov * 3, movss * 2, pop * 1|add * 1|none|ret * 1| 計算過後可知該程式總共需要 9 (2+3+2+1+1)個 clock cycle ，可以發現確實比第一個版本少了許多。參考 [類神經網路的 ReLU 及其常數時間複雜度實作](https://hackmd.io/@sysprog/constant-time-relu?fbclid=IwZXh0bgNhZW0CMTAAAR0rA_gp9TtuqOBuqAJE2uutnvVZcFOSFlBPO9HspdhAA27KtGx6_rdC1aQ_aem_AbKPshOIUtxP5VUsyxPO6IG4EvaWPm2n-8Idh7E0hXijU7Y45cxOfZIWXO1McstmzjxNWThJiOkVZLV_kqQ-1Fby) 教材，裡面提到了由於對浮點數的操作成本太高，因此可以使用 union 的技巧來優化， union 可以讓浮點數和整數共用一段記憶體空間，並且在任一時刻只有一個成員有效，宣告的方法如下： ```c union{ float f; int i; } out = {.f=x}; ``` 宣告一個名為 out 的 union ，裡面輸入需要操作的資料型態，而最後的 `{.f=x}` 是在做資料得初始化，將 input `x` 的值指派到 `f` ，後續則可以根據需求使用 `out.f` 及 `out.i` 來獲得兩種不同的資料型態。以下給出一個簡單的例子： ```c //code float example(float x){ union{ float f; int i; } out = {.f=x}; printf("out.i = %d\n", out.i); printf("out.f = %f\n", out.f); } int main() { float x = 2; example(x); return 0; } //output out.i = 1073741824 out.f = 2.000000 ``` 原因為在這段記憶體裡存著 `0x400000000` ，以浮點數表示為 2 ，以整數表示則為 $2^{30}$ = 1073741824 。最後，將原本的程式碼改寫： ```c float float_mul2(float x) { // using bitwise operation, no mul union{ float f; int i; } out = {.f=x}; out.i += 0x00800000; return out.f; } ``` 轉換為組合語言 : ```c float_mul2: # @float_mul2 push rbp mov rbp, rsp movss dword ptr [rbp - 4], xmm0 movss xmm0, dword ptr [rbp - 4] # xmm0 = mem[0],zero,zero,zero movss dword ptr [rbp - 8], xmm0 mov eax, dword ptr [rbp - 8] add eax, 8388608 mov dword ptr [rbp - 8], eax movss xmm0, dword ptr [rbp - 8] # xmm0 = mem[0],zero,zero,zero pop rbp ret ``` |Move instructions|Arithmetic instructions|Logic|Control transfer instructions| |-|-|-|-| |push * 1, mov * 3, movss * 4, pop * 1|add * 1|none|ret * 1| 計算過後可知該程式總共需要 11 (2+3+4+1+1)個 clock cycle ，雖然相較於第二個版本多了 2 個 clock cycle ，但用 union 的寫法會比較安全，因為強制將資料型別轉換可能會造成 strict aliasing ，在 C 語言中會發生為定義的行為。 ## float_mul_power_of_2 探討 ```c float float_mul_power_of_2(float x, int e) { // using bitwise operation, no mul union{ float f; int i; } out = {.f=x}; out.i += (e << 23); return out.f; } ``` 利用和上述 `float_mul2` 相同的概念，若要乘上 2 的冪，則把指數部分的 `+1` 改為 `+e` 即可，但若是直接用 `(e << 23)` 這樣的寫法會有問題 : 若該整數 `e` 夠大或夠小，則位移後會影響到最左方的 sign bit，因此我們需想辦法將其大小控制在 8 個 bits 以內，以符合 IEEE 754 的範圍。解決方法 : 首先，由於浮點數的指數部分的範圍為 0 ~ 255 ，因此我們可以把指數部分單獨取出來看是否有在範圍內，對應的程式碼如下 : ```c int exp = (out.i >> 23) & 0xFF; exp += e; if(exp < 0 || exp > 255) return 0; ``` 該程式碼的作用為先將 fraction bits 23 位過濾掉，然後再和 8 個 1 做 `&` 運算，以取得原數的指數部分，接著用整數型態的變數和 input `e` 相加，最後再判斷是否有在範圍內，完整的程式碼如下 : ```c float float_mul_power_of_2(float x, int e) { // using bitwise operation, no mul union{ float f; int i; } out = {.f=x}; int exp = (out.i >> 23) & 0xFF; exp += e; if(exp < 0 || exp > 255) return 0; else out.i = (out.i & 0x807FFFFF) | (exp << 23); return out.f; } ``` 其中 `return 0` 的部分若是引進 `math.h` 函式庫則可以再細分為不同的情況來表達，以下為修改過後的程式碼 : ```c float float_mul_power_of_2(float x, int e) { // using bitwise operation, no mul union{ float f; int i; } out = {.f=x}; int exp = (out.i >> 23) & 0xFF; exp += e; if(exp < 0) return 0; else if(exp >= 255) return INFINITY; else out.i = (out.i & 0x807FFFFF) | (exp << 23); return out.f; } ``` --- TODO: 補完程式碼 TODO: generalized as `float_mul_power_of_2(float x, int e)` 列出實作的考量 (overflow?) (term project) TODO: 重做 Homework4 (+ ??)