2019q3 Homework3 (quiz3)

# 2019q3 Homework3 (quiz3) contributed by < `Ting199708` > :::danger 注意細節！第一行的格式有明確規範 :notes: jserv ::: ## 第一題 :::info 考慮深度學習領域中，使用激勵函數 ReLU: $$ ReLU(x) = \begin{cases} x & \text{if } x \geq 0 \newline 0 & \text{if } x \lt 0 \end{cases} $$ RelU 計算量小，只要判斷輸入是否大於 `0`，沒有指數運算。下方程式 (`ReLU.c`) 是個常數時間的實作: ::: ### 程式碼 ```c float ReLU(float x) { union { float f; int32_t i; } out = {.f = x}; out.i = out.i OP1 ~(out.i >> V2); return out.f; } ``` ==分析與想法== #### union 首先，先來理解 union 是在定義什麼東西，根據 [openhome.cc](https://openhome.cc/Gossip/CGossip/union.html)， union 中所有資料成員共用一個空間，且同時間只能儲存其中一個成員的資料。 :::danger 儘量參照第一手材料，去翻閱 C 語言規格書！列出原文描述並解讀。 :notes: jserv ::: 另外，一個 union 配置的空間為最大長度的資料成員。 #### ReLU 根據 ReLU function 的定義，當 x<0 時， `ReLU(x) = 0` ， x>=0 時， `ReLU(x) = x` ，因此在程式當中我們必須判斷 x 的正負號我們利用電腦儲存有號數時的 sign bit ，透過將 `out.i >> 31` ，取得其 sign bit ，再反向即可得到以下結果 ```c if out.i<0 then ~(out.i >> 31) = 1111 1111 ... 1111 if out.i>=0 then ~(out.i >> 31) = 0000 0000 ... 0000 ``` ### 延伸問題 :::success 1. 解釋上述 ReLU 程式碼運作原理和應用場域 (歡迎寫篇深度學習的科普文章)，並寫出另一個常數時間的實作 2. 研讀 [CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs](https://arxiv.org/pdf/1801.06601.pdf)，在 Arm Cortex-M 微處理器架構中，透過 SWAR (SIMD within a register) 的手法，可帶來 4 倍效能提升。嘗試解釋其原理 3. 透過 Intel SSE / AVX 等 SIMD 指令集，比照 [CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs](https://arxiv.org/pdf/1801.06601.pdf)，進行加速 ReLU 的實驗 ::: 1. ReLU 函數在 Machine Learning 中經常被用在 activation function ，因為其可將負數變為 0 的特性讓 AI 在 training 的過程中可以減少其線性的特性，增加 accuracy。 :::danger 這樣的描述對得起 AI 領域數十年的發展嗎？ :notes: jserv ::: --- ## 第二題 :::info 在 8 位元 [Motorola 6809 處理器](https://en.wikipedia.org/wiki/Motorola_6809)上，有道指令叫做 [SEX](https://en.wikipedia.org/wiki/SEX_(computing))，寓意是 "Sign EXtend" `SEX 123` 應該輸出 `0`, 而 `SEX -3` 要輸出 `0xffffffff` (取決於有效位數) 考慮一個 32 位元版本的 `SEX` 實作如下，假設執行環境是 little-endian: ::: ### 程式碼 ```cpp #include <stdint.h> static inline uint32_t sex32(int32_t x) { union { TYPE w; struct { uint32_t lo, hi; }; /* FIXME: little-endian */ } z = {.w = x}; return z.hi; } ``` ==分析與想法== 這邊其實只要了解 union 的應用就可以知道答案，在題目中 union 定義為一 `type` 及 `struct {uint32_t lo, hi}` ，因此 type 型別的長度必須為 64 ，故應選擇 ==uint64_t== ### 延伸問題 :::success 1. 解釋程式運作原理，並改寫為適用於 big/little-endian 的程式碼 2. 找出 Sign Extend 的應用案例，特別是密碼學的應用 ::: 1. 當我們將一 32 bit 有號數放進 struct {uint32_t lo, hi} 中後，因為其符號延伸的特性，若該數小於零，則 `hi = 1111 1111 ... 1111` ，反之，則 `hi = 0000 0000 ... 0000`，因此程式只要 `return z.hi` 即可得到結果若要將此程式改為 big/little-endian 適用的話，可以改為以下程式碼 ```c static inline uint32_t sex32(int32_t x) { int32_t a; union { uint64_t w; struct { uint32_t lo, hi; }; } z z.w = 1; if(z.hi == 0) { a = x; } else { a = bswap_32(x); } z.w = a; return z.hi; } ``` 其中 bswap_32 可將位元反轉，因此可將 big-endian 轉為 little-endian :::warning 改為不需要執行時期判斷的實作，也就是善用 C preprocessor :notes: jserv ::: --- ## 第三題 ### 題目 :::info 延伸測驗 `2` 的 `sex32`，用以改寫 [解讀計算機編碼](https://hackmd.io/@sysprog/rylUqXLsm) 提及的常數時間 `abs` 實作 (輸入是 32 位元整數)，程式碼如下: ::: ### 程式碼 ```c static inline int abs_with_sex(int x) { return (x ^ sex32(x)) OP2 sex32(x); } ``` ==分析與想法== 這邊要透過 sex 來實作 abs ，首先，先看到題目已知的部分 ```c (x ^ sex32(x)) ``` 當 ==x < 0== 時， (x ^ sex32(x)) = ~x 當 ==x >= 0== 時， (x ^ sex32(x)) = x 此函式的輸出結果為 abs(x)，所以 ==x < 0== 時 ```c (x ^ sex32(x)) OP2 sex32(x) = ~x OP2 (-1) = abs(x) ``` 因此，根據二補數的定義， OP2 為 - 號 ==x >= 0== 時 ```c (x ^ sex32(x)) OP2 sex32(x) = x OP2 0 = abs(x) ``` 再次驗證了， OP2 為 - 號 --- ## 第四題 :::info 延伸 C 語言: goto 和流程控制，考慮以下 coroutine 實作程式碼: (coroutine.c) ::: ### 程式碼 ```c typedef enum ASYNC_EVT { ASYNC_INIT = 0, ASYNC_DONE = -1 } async; /* Declare the async state */ #define async_state unsigned int _async_kcont struct async { async_state; }; /* Mark the start of an async subroutine */ #define async_begin(k) \ switch ((k)->_async_kcont) { \ XXX1 case 0: XXX2 /* Mark the end of a generator thread */ #define async_end \ default: \ return ASYNC_DONE; \ } /* Wait until the condition succeeds */ #define await(cond) await_while(!(cond)) /* Wait while the condition succeeds */ #define await_while(cond) \ case __LINE__: \ if (cond) return __LINE__ \ /* Initialize a new async computation */ #define async_init(state) (state)->_async_kcont = ASYNC_INIT /* Check if async subroutine is done */ #define async_done(state) (state)->_async_kcont == ASYNC_DONE /* Resume a running async computation and check for completion */ #define async_call(f, state) \ (async_done(state) || ASYNC_DONE == ((state)->_async_kcont = (f)(state))) struct async_sem { unsigned int count; }; /** * Initialize a semaphore * * This macro initializes a semaphore with a value for the * counter. Internally, the semaphores use an "unsigned int" to * represent the counter, and therefore the "count" argument should be * within range of an unsigned int. * * \param s A pointer to the object representing the semaphore. * \param c (unsigned int) The initial count of the semaphore. */ #define init_sem(s, c) (s)->count = c /** * Wait for a semaphore * * This macro carries out the "wait" operation on the semaphore. The * wait operation causes the protothread to block while the counter is * zero. When the counter reaches a value larger than zero, the * protothread will continue. * * \param s A pointer to the object representing the semaphore. */ #define await_sem(s) \ do { \ await((s)->count > 0); \ --(s)->count; \ XXX3 } while (0) /** * Signal a semaphore * * This macro carries out the "signal" operation on the semaphore. The * signal operation increments the counter inside the semaphore, which * eventually will cause waiting protothreads to continue executing. * * \param s A pointer to the object representing the semaphore. */ #define signal_sem(s) ++(s)->count XXX4 /* Use case */ #include <stdio.h> #include <unistd.h> #define NUM_ITEMS 32 #define BUFSIZE 8 static int buffer[BUFSIZE]; static int bufptr; static void add_to_buffer(int item) { printf("Item %d added to buffer at place %d\n", item, bufptr); buffer[bufptr] = item; bufptr = (bufptr + 1) % BUFSIZE; } static int get_from_buffer(void) { int item = buffer[bufptr]; printf("Item %d retrieved from buffer at place %d\n", item, bufptr); bufptr = (bufptr + 1) % BUFSIZE; return item; } static int produce_item(void) { static int item = 0; printf("Item %d produced\n", item); return item++; } static void consume_item(int item) { printf("Item %d consumed\n", item); } static struct async_sem full, empty; static async producer(struct async *pt) { static int produced; async_begin(pt); for (produced = 0; produced < NUM_ITEMS; ++produced) { await_sem(&full); add_to_buffer(produce_item()); signal_sem(&empty); } async_end; } static async consumer(struct async *pt) { static int consumed; async_begin(pt); for (consumed = 0; consumed < NUM_ITEMS; ++consumed) { await_sem(&empty); consume_item(get_from_buffer()); signal_sem(&full); } async_end; } static async driver_thread(struct async *pt) { static struct async cr_producer, cr_consumer; async_begin(pt); init_sem(&empty, 0); init_sem(&full, BUFSIZE); async_init(&cr_producer); async_init(&cr_consumer); await(producer(&cr_producer) & consumer(&cr_consumer)); async_end; } int main(void) { struct async driver_pt; async_init(&driver_pt); while (!async_call(driver_thread, &driver_pt)) { usleep(10); /* why? */ } return 0; } ``` ### 延伸問題 :::success 1. 解釋上述程式運作原理，特別是和 Duff’s device 的關聯 2. 參照 ProtoThread 程式碼，比較上述差異，需要設計實驗，分析兩者在切換 coroutine 的執行時間和記憶體開銷成本 3. 整合 timeout 機制，允許 callback function 在指定的時間執行，如 TimerCallback callback function :::