2019q3 Homework3 (quiz3)

# 2019q3 Homework3 (quiz3) contributed by < `93i7xo2` > ###### tags: `sysprog2019` --- ### 測驗 `1` 考慮深度學習領域中，使用[激勵函數 ReLU](https://mropengate.blogspot.com/2017/02/deep-learning-role-of-activation.html): $$ ReLU(x) = \begin{cases} x & \text{if } x \geq 0 \newline 0 & \text{if } x \lt 0 \end{cases} $$ RelU 計算量小，只要判斷輸入是否大於 `0`，沒有指數運算。下方程式 (`ReLU.c`) 是個常數時間的實作: ```cpp float ReLU(float x) { union { float f; int32_t i; } out = {.f = x}; out.i = out.i OP1 ~(out.i >> V2); return out.f; } ``` #### 實作分析 ```cpp float ReLU(float x) { union { float f; int32_t i; } out = {.f = x}; out.i = out.i & ~(out.i >> 31); return out.f; } ``` 此函式的目的是從 float 的 sign bit 判斷數值的正負，由 [C99 standard - 6.5.7](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf)得知 [shift operation](https://en.cppreference.com/w/c/language/operator_arithmetic) 的 operand 只能是integer。 > Constraints > > Each of the operands shall have integer type. 因此宣告一個 [union](https://en.cppreference.com/w/c/language/union) ，利用 [C99 standard - 6.7.2.1](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf) 提到 union 裡面的 member 共享記憶體特性對 `int32_t` 型態的 `out.i` 進行 shift operation > As discussed in 6.2.5, a structure is a type consisting of a sequence of members, whose storage is allocated in an ordered sequence, and a union is a type consisting of a sequence of members whose storage overlap. 而 negative signed integer 的`>>`是 implementation-defined > The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 / 2 E2 . **If E1 has a signed type and a negative value, the resulting value is implementation-defined.** [Arithmetic operators: Shift operators](https://en.cppreference.com/w/c/language/operator_arithmetic) 進一步說明編譯器實作是 arithmetic right shift > Thus in most implementations, **right shifting a signed LHS fills the new higher-order bits with the original sign bit** (i.e. with 0 if it was non-negative and 1 if it was negative). 所以進行運算後是這樣: - `x>=0` $\rightarrow$ `~(out.i >> 31) = 111111...111` $\rightarrow$ `out.f = x` - `x<0` $\rightarrow$ `~(out.i >> 31) = 000000...000` $\rightarrow$ `out.f = 0` #### 1. 解釋上述 ReLU 程式碼運作原理和應用場域 (歡迎寫篇深度學習的科普文章)，並寫出另一個常數時間的實作 **常數時間實作** [kaeteyaruyo的實作](https://hackmd.io/@kaeteyaruyo/2019q3_homework3)，雖是$O(1)$，但非branchless ```clike float ReLU(float x) { return x < 0 ? 0 : x; } ``` > 待補，沒修過AI༼;´༎ຶ ۝ ༎ຶ༽ #### 2. 研讀 [CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs](https://arxiv.org/pdf/1801.06601.pdf)，在 Arm Cortex-M 微處理器架構中，透過 SWAR (SIMD within a register) 的手法，可帶來 4 倍效能提升。嘗試解釋其原理 #### 3. 透過 Intel SSE / AVX 等 SIMD 指令集，比照 [CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs](https://arxiv.org/pdf/1801.06601.pdf)，進行加速 ReLU 的實驗; --- ### 測驗 `2` 在 8 位元 [Motorola 6809 處理器](https://en.wikipedia.org/wiki/Motorola_6809)上，有道指令叫做 [SEX](https://en.wikipedia.org/wiki/SEX_(computing))，寓意是 "Sign EXtend" > [Motorola 68000](https://en.wikipedia.org/wiki/Motorola_68000) (68K) 系列處理器，還有道指令名為 [BRA](http://68k.hax.com/BRA)，寓意是 "branch"，以前的工程師都很有創意 (?) `SEX 123` 應該輸出 `0`, 而 `SEX -3` 要輸出 `0xffffffff` (取決於有效位數) 考慮一個 32 位元版本的 `SEX` 實作如下，假設執行環境是 little-endian: ```cpp #include <stdint.h> static inline uint32_t sex32(int32_t x) { union { TYPE w; struct { uint32_t lo, hi; }; /* FIXME: little-endian */ } z = {.w = x}; return z.hi; } ``` #### 實作分析 ```cpp= #include <stdint.h> static inline uint32_t sex32(int32_t x) { union { uint64_t w; struct { uint32_t lo, hi; }; /* FIXME: little-endian */ } z = {.w = x}; return z.hi; } ``` ~~同測驗1，w的大小應與 `struct { uint32_t lo, hi; }` 一致，因此 `TYPE=uint64_t`~~ 要先了解 `.w = x` 的行為。用 integer promotion 和 implicit conversion 在規格書內找到的片段極其分散，所幸 [signed-casting.c](https://www.cs.cmu.edu/~rjsimmon/15122-s13/rec/22/signed-casting.c) 有範例。 ```cpp= int8_t c; /* Finally, note that the sign extension behavior is only determined by * the type we are casting from. If we are casting from a signed type, then * we will always sign extend with the value of the sign bit; if we are * casting from unsigned, we will always sign extend with 0. The destination * type (specifically, whether or not is is signed) does not matter. */ /* Signed to unsigned: still uninteresting. */ c = 0x75; // 117 ui = (uint32_t)c; // cast 1-byte to 4-byte, signed ``` 所以cast有號數，無關目變數是有號還是無號，high bits 由有號數的 sign bit 所填滿。因此 `TYPE` 可為 `int64_t` 或是 `uint64_t` 。 - `x>=0` $\rightarrow$ `z.hi = 0....000 (0)` - `x<0` $\rightarrow$` z.hi = 1....111 (-1)` #### 1. 解釋程式運作原理，並改寫為適用於 big/little-endian 的程式碼; 閱讀 [uccuser159](https://hackmd.io/@uccuser159/2019q3Homework3review) 提到的參考資料[Big-Endian 與 Little-Endian 的差異判斷程式碼](https://blog.gtwang.org/programming/difference-between-big-endian-and-little-endian-implementation-in-c/) Big-Endian和Little-Endian是指CPU存取記憶體的順序，簡短的描述兩者差別: - Little-Endian 高位元組(byte)優先放入記憶體低位 - Big-Endian 低位元組(byte)優先放入記憶體低位改寫成適用big-endian/little-endian的程式碼 ```cpp= #include <stdint.h> static inline uint32_t sex32(int32_t x) { union { uint64_t w; struct { uint32_t lo, hi; }; /* FIXME: little-endian */ } z = {.w = x}; #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ return z.hi; #endif #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ return z.lo; #endif } ``` wiki 提到現今的 Intel x86 處理器使用 little-endian 和使用 big-endian 格式的處理器架構。 > The Intel x86 and AMD64 / x86-64 series of processors use the little-endian format. Other instruction set architectures that follow this convention, allowing only little-endian mode, include RISC-V, Nios II, Andes Technology NDS32, and Qualcomm Hexagon. > > Some architectures (including ARM versions 3 and above, PowerPC, Alpha, SPARC V9, MIPS, PA-RISC, SuperH SH-4 and IA-64) feature a setting which allows for switchable endianness in data fetches and stores, instruction fetches, or both. 在 Intel X86 處理器上，[Davinais的共筆](https://hackmd.io/@Davinais/rJm3Hm4tS)和 [Imitate/emulate a big-endian behavior in C? [duplicate]](https://stackoverflow.com/questions/3337896/imitate-emulate-a-big-endian-behavior-in-c) 皆以 qemu 驗證程式碼 > 待補 #### 2. 找出 Sign EXtend 的應用案例，特別是密碼學的應用 --- ### 測驗 `3` 延伸測驗 `2` 的 `sex32`，用以改寫 [解讀計算機編碼](https://hackmd.io/@sysprog/rylUqXLsm) 提及的常數時間 `abs` 實作 (輸入是 32 位元整數)，程式碼如下: ```cpp static inline int abs_with_sex(int x) { return (x ^ sex32(x)) OP2 sex32(x); } ``` #### 實作分析 --- ### 測驗 `4` 延伸 [C 語言: goto 和流程控制](https://hackmd.io/s/B1e2AUZeM)，考慮以下 [coroutine](https://en.wikipedia.org/wiki/Coroutine) 實作程式碼: (`coroutine.c`) ```cpp /* * Stackless coroutines. * This is an async/await implementation for C based on Duff's device. * * Features: * 1. the async state is caller-saved rather than callee-saved. * 2. Subroutines can have persistent state that is not just static state, * because each async subroutine accepts its own struct it uses as a * parameter, and the async state is stored there. * 3. Because of the more flexible state, async subroutines can be nested * in tree-like fashion which permits fork/join concurrency patterns. */ /* event status */ typedef enum ASYNC_EVT { ASYNC_INIT = 0, ASYNC_DONE = -1 } async; /* Declare the async state */ #define async_state unsigned int _async_kcont struct async { async_state; }; /* Mark the start of an async subroutine */ #define async_begin(k) \ XXX1 \ XXX2 /* Mark the end of a generator thread */ #define async_end \ default: \ return ASYNC_DONE; \ } /* Wait until the condition succeeds */ #define await(cond) await_while(!(cond)) /* Wait while the condition succeeds */ #define await_while(cond) \ case __LINE__: \ if (cond) return __LINE__ \ /* Initialize a new async computation */ #define async_init(state) (state)->_async_kcont = ASYNC_INIT /* Check if async subroutine is done */ #define async_done(state) (state)->_async_kcont == ASYNC_DONE /* Resume a running async computation and check for completion */ #define async_call(f, state) \ (async_done(state) || ASYNC_DONE == ((state)->_async_kcont = (f)(state))) struct async_sem { unsigned int count; }; /** * Initialize a semaphore * * This macro initializes a semaphore with a value for the * counter. Internally, the semaphores use an "unsigned int" to * represent the counter, and therefore the "count" argument should be * within range of an unsigned int. * * \param s A pointer to the object representing the semaphore. * \param c (unsigned int) The initial count of the semaphore. */ #define init_sem(s, c) (s)->count = c /** * Wait for a semaphore * * This macro carries out the "wait" operation on the semaphore. The * wait operation causes the protothread to block while the counter is * zero. When the counter reaches a value larger than zero, the * protothread will continue. * * \param s A pointer to the object representing the semaphore. */ #define await_sem(s) \ do { \ await((s)->count > 0); \ XXX3; \ } while (0) /** * Signal a semaphore * * This macro carries out the "signal" operation on the semaphore. The * signal operation increments the counter inside the semaphore, which * eventually will cause waiting protothreads to continue executing. * * \param s A pointer to the object representing the semaphore. */ #define signal_sem(s) XXX4 /* Use case */ #include <stdio.h> #include <unistd.h> #define NUM_ITEMS 32 #define BUFSIZE 8 static int buffer[BUFSIZE]; static int bufptr; static void add_to_buffer(int item) { printf("Item %d added to buffer at place %d\n", item, bufptr); buffer[bufptr] = item; bufptr = (bufptr + 1) % BUFSIZE; } static int get_from_buffer(void) { int item = buffer[bufptr]; printf("Item %d retrieved from buffer at place %d\n", item, bufptr); bufptr = (bufptr + 1) % BUFSIZE; return item; } static int produce_item(void) { static int item = 0; printf("Item %d produced\n", item); return item++; } static void consume_item(int item) { printf("Item %d consumed\n", item); } static struct async_sem full, empty; static async producer(struct async *pt) { static int produced; async_begin(pt); for (produced = 0; produced < NUM_ITEMS; ++produced) { await_sem(&full); add_to_buffer(produce_item()); signal_sem(&empty); } async_end; } static async consumer(struct async *pt) { static int consumed; async_begin(pt); for (consumed = 0; consumed < NUM_ITEMS; ++consumed) { await_sem(&empty); consume_item(get_from_buffer()); signal_sem(&full); } async_end; } static async driver_thread(struct async *pt) { static struct async cr_producer, cr_consumer; async_begin(pt); init_sem(&empty, 0); init_sem(&full, BUFSIZE); async_init(&cr_producer); async_init(&cr_consumer); await(producer(&cr_producer) & consumer(&cr_consumer)); async_end; } int main(void) { struct async driver_pt; async_init(&driver_pt); while (!async_call(driver_thread, &driver_pt)) { usleep(10); /* why? */ } return 0; } ``` 這段程式碼是經典的[生產者——消費者問題](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem)，預期執行輸出為: ```shell Item 0 produced Item 0 added to buffer at place 0 Item 1 produced Item 1 added to buffer at place 1 Item 2 produced Item 2 added to buffer at place 2 Item 3 produced Item 3 added to buffer at place 3 Item 4 produced Item 4 added to buffer at place 4 Item 5 produced Item 5 added to buffer at place 5 Item 6 produced Item 6 added to buffer at place 6 Item 7 produced Item 7 added to buffer at place 7 Item 0 retrieved from buffer at place 0 Item 0 consumed Item 1 retrieved from buffer at place 1 Item 1 consumed Item 2 retrieved from buffer at place 2 Item 2 consumed Item 3 retrieved from buffer at place 3 Item 3 consumed Item 4 retrieved from buffer at place 4 Item 4 consumed Item 5 retrieved from buffer at place 5 Item 5 consumed Item 6 retrieved from buffer at place 6 Item 6 consumed Item 7 retrieved from buffer at place 7 Item 7 consumed ``` #### 1. 解釋上述程式運作原理，特別是和 [Duff's device](https://en.wikipedia.org/wiki/Duff%27s_device) 的關聯; #### 2. 參照 [ProtoThread](http://dunkels.com/adam/pt/) 程式碼，比較上述差異，需要設計實驗，分析兩者在切換 coroutine 的執行時間和記憶體開銷成本 #### 3. 整合 timeout 機制，允許 callback function 在指定的時間執行，如 [TimerCallback callback function](https://docs.microsoft.com/en-us/previous-versions/windows/desktop/legacy/ms686790(v=vs.85)) :::warning 說好的進度呢？ :notes: jserv ::: > 慘了，不夠努力 ### Reference * [kaeteyaru 共筆](https://hackmd.io/@kaeteyaruyo/2019q3_homework3) * [Lecture 21 NotesTypes in C](https://www.cs.cmu.edu/~fp/courses/15122-f15/lectures/21-types.pdf)