# 2019q3 Homework3 (quiz3)
contributed by < `93i7xo2` >
###### tags: `sysprog2019`
---
### 測驗 `1`
考慮深度學習領域中,使用[激勵函數 ReLU](https://mropengate.blogspot.com/2017/02/deep-learning-role-of-activation.html):
$$
ReLU(x) =
\begin{cases}
x & \text{if } x \geq 0 \newline
0 & \text{if } x \lt 0
\end{cases}
$$
RelU 計算量小,只要判斷輸入是否大於 `0`,沒有指數運算。下方程式 (`ReLU.c`) 是個常數時間的實作:
```cpp
float ReLU(float x) {
union {
float f;
int32_t i;
} out = {.f = x};
out.i = out.i OP1 ~(out.i >> V2);
return out.f;
}
```
#### 實作分析
```cpp
float ReLU(float x) {
union {
float f;
int32_t i;
} out = {.f = x};
out.i = out.i & ~(out.i >> 31);
return out.f;
}
```
此函式的目的是從 float 的 sign bit 判斷數值的正負,由 [C99 standard - 6.5.7](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf)得知 [shift operation](https://en.cppreference.com/w/c/language/operator_arithmetic) 的 operand 只能是integer。
> Constraints
> > Each of the operands shall have integer type.
因此宣告一個 [union](https://en.cppreference.com/w/c/language/union) ,利用 [C99 standard - 6.7.2.1](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf) 提到 union 裡面的 member 共享記憶體特性對 `int32_t` 型態的 `out.i` 進行 shift operation
> As discussed in 6.2.5, a structure is a type consisting of a sequence of members, whose storage is allocated in an ordered sequence, and a union is a type consisting of a sequence of members whose storage overlap.
而 negative signed integer 的`>>`是 implementation-defined
> The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 / 2 E2 . **If E1 has a signed type and a negative value, the resulting value is implementation-defined.**
[Arithmetic operators: Shift operators](https://en.cppreference.com/w/c/language/operator_arithmetic) 進一步說明編譯器實作是 arithmetic right shift
> Thus in most implementations, **right shifting a signed LHS fills the new higher-order bits with the original sign bit** (i.e. with 0 if it was non-negative and 1 if it was negative).
所以進行運算後是這樣:
- `x>=0` $\rightarrow$ `~(out.i >> 31) = 111111...111` $\rightarrow$ `out.f = x`
- `x<0` $\rightarrow$ `~(out.i >> 31) = 000000...000` $\rightarrow$ `out.f = 0`
#### 1. 解釋上述 ReLU 程式碼運作原理和應用場域 (歡迎寫篇深度學習的科普文章),並寫出另一個常數時間的實作
**常數時間實作**
[kaeteyaruyo的實作](https://hackmd.io/@kaeteyaruyo/2019q3_homework3),雖是$O(1)$,但非branchless
```clike
float ReLU(float x) {
return x < 0 ? 0 : x;
}
```
> 待補,沒修過AI༼;´༎ຶ ༎ຶ༽
#### 2. 研讀 [CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs](https://arxiv.org/pdf/1801.06601.pdf),在 Arm Cortex-M 微處理器架構中,透過 SWAR (SIMD within a register) 的手法,可帶來 4 倍效能提升。嘗試解釋其原理
#### 3. 透過 Intel SSE / AVX 等 SIMD 指令集,比照 [CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs](https://arxiv.org/pdf/1801.06601.pdf),進行加速 ReLU 的實驗;
---
### 測驗 `2`
在 8 位元 [Motorola 6809 處理器](https://en.wikipedia.org/wiki/Motorola_6809)上,有道指令叫做 [SEX](https://en.wikipedia.org/wiki/SEX_(computing)),寓意是 "Sign EXtend"
> [Motorola 68000](https://en.wikipedia.org/wiki/Motorola_68000) (68K) 系列處理器,還有道指令名為 [BRA](http://68k.hax.com/BRA),寓意是 "branch",以前的工程師都很有創意 (?)
`SEX 123` 應該輸出 `0`, 而 `SEX -3` 要輸出 `0xffffffff` (取決於有效位數)
考慮一個 32 位元版本的 `SEX` 實作如下,假設執行環境是 little-endian:
```cpp
#include <stdint.h>
static inline uint32_t sex32(int32_t x) {
union {
TYPE w;
struct { uint32_t lo, hi; }; /* FIXME: little-endian */
} z = {.w = x};
return z.hi;
}
```
#### 實作分析
```cpp=
#include <stdint.h>
static inline uint32_t sex32(int32_t x) {
union {
uint64_t w;
struct { uint32_t lo, hi; }; /* FIXME: little-endian */
} z = {.w = x};
return z.hi;
}
```
~~同測驗1,w的大小應與 `struct { uint32_t lo, hi; }` 一致,因此 `TYPE=uint64_t`~~
要先了解 `.w = x` 的行為。用 integer promotion 和 implicit conversion 在規格書內找到的片段極其分散,所幸 [signed-casting.c](https://www.cs.cmu.edu/~rjsimmon/15122-s13/rec/22/signed-casting.c) 有範例。
```cpp=
int8_t c;
/* Finally, note that the sign extension behavior is only determined by
* the type we are casting from. If we are casting from a signed type, then
* we will always sign extend with the value of the sign bit; if we are
* casting from unsigned, we will always sign extend with 0. The destination
* type (specifically, whether or not is is signed) does not matter.
*/
/* Signed to unsigned: still uninteresting. */
c = 0x75; // 117
ui = (uint32_t)c; // cast 1-byte to 4-byte, signed
```
所以cast有號數,無關目變數是有號還是無號,high bits 由有號數的 sign bit 所填滿。因此 `TYPE` 可為 `int64_t` 或是 `uint64_t` 。
- `x>=0` $\rightarrow$ `z.hi = 0....000 (0)`
- `x<0` $\rightarrow$` z.hi = 1....111 (-1)`
#### 1. 解釋程式運作原理,並改寫為適用於 big/little-endian 的程式碼;
閱讀 [uccuser159](https://hackmd.io/@uccuser159/2019q3Homework3review) 提到的參考資料[Big-Endian 與 Little-Endian 的差異判斷程式碼](https://blog.gtwang.org/programming/difference-between-big-endian-and-little-endian-implementation-in-c/)
Big-Endian和Little-Endian是指CPU存取記憶體的順序,簡短的描述兩者差別:
- Little-Endian
高位元組(byte)優先放入記憶體低位
- Big-Endian
低位元組(byte)優先放入記憶體低位
改寫成適用big-endian/little-endian的程式碼
```cpp=
#include <stdint.h>
static inline uint32_t sex32(int32_t x) {
union {
uint64_t w;
struct {
uint32_t lo, hi;
}; /* FIXME: little-endian */
} z = {.w = x};
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
return z.hi;
#endif
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
return z.lo;
#endif
}
```
wiki 提到現今的 Intel x86 處理器使用 little-endian 和使用 big-endian 格式的處理器架構。
> The Intel x86 and AMD64 / x86-64 series of processors use the little-endian format. Other instruction set architectures that follow this convention, allowing only little-endian mode, include RISC-V, Nios II, Andes Technology NDS32, and Qualcomm Hexagon.
>
> Some architectures (including ARM versions 3 and above, PowerPC, Alpha, SPARC V9, MIPS, PA-RISC, SuperH SH-4 and IA-64) feature a setting which allows for switchable endianness in data fetches and stores, instruction fetches, or both.
在 Intel X86 處理器上,[Davinais的共筆](https://hackmd.io/@Davinais/rJm3Hm4tS)和 [Imitate/emulate a big-endian behavior in C? [duplicate]](https://stackoverflow.com/questions/3337896/imitate-emulate-a-big-endian-behavior-in-c) 皆以 qemu 驗證程式碼
> 待補
#### 2. 找出 Sign EXtend 的應用案例,特別是密碼學的應用
---
### 測驗 `3`
延伸測驗 `2` 的 `sex32`,用以改寫 [解讀計算機編碼](https://hackmd.io/@sysprog/rylUqXLsm) 提及的常數時間 `abs` 實作 (輸入是 32 位元整數),程式碼如下:
```cpp
static inline int abs_with_sex(int x) {
return (x ^ sex32(x)) OP2 sex32(x);
}
```
#### 實作分析
---
### 測驗 `4`
延伸 [C 語言: goto 和流程控制](https://hackmd.io/s/B1e2AUZeM),考慮以下 [coroutine](https://en.wikipedia.org/wiki/Coroutine) 實作程式碼: (`coroutine.c`)
```cpp
/*
* Stackless coroutines.
* This is an async/await implementation for C based on Duff's device.
*
* Features:
* 1. the async state is caller-saved rather than callee-saved.
* 2. Subroutines can have persistent state that is not just static state,
* because each async subroutine accepts its own struct it uses as a
* parameter, and the async state is stored there.
* 3. Because of the more flexible state, async subroutines can be nested
* in tree-like fashion which permits fork/join concurrency patterns.
*/
/* event status */
typedef enum ASYNC_EVT { ASYNC_INIT = 0, ASYNC_DONE = -1 } async;
/* Declare the async state */
#define async_state unsigned int _async_kcont
struct async { async_state; };
/* Mark the start of an async subroutine */
#define async_begin(k) \
XXX1 \
XXX2
/* Mark the end of a generator thread */
#define async_end \
default: \
return ASYNC_DONE; \
}
/* Wait until the condition succeeds */
#define await(cond) await_while(!(cond))
/* Wait while the condition succeeds */
#define await_while(cond) \
case __LINE__: \
if (cond) return __LINE__ \
/* Initialize a new async computation */
#define async_init(state) (state)->_async_kcont = ASYNC_INIT
/* Check if async subroutine is done */
#define async_done(state) (state)->_async_kcont == ASYNC_DONE
/* Resume a running async computation and check for completion */
#define async_call(f, state) \
(async_done(state) || ASYNC_DONE == ((state)->_async_kcont = (f)(state)))
struct async_sem { unsigned int count; };
/**
* Initialize a semaphore
*
* This macro initializes a semaphore with a value for the
* counter. Internally, the semaphores use an "unsigned int" to
* represent the counter, and therefore the "count" argument should be
* within range of an unsigned int.
*
* \param s A pointer to the object representing the semaphore.
* \param c (unsigned int) The initial count of the semaphore.
*/
#define init_sem(s, c) (s)->count = c
/**
* Wait for a semaphore
*
* This macro carries out the "wait" operation on the semaphore. The
* wait operation causes the protothread to block while the counter is
* zero. When the counter reaches a value larger than zero, the
* protothread will continue.
*
* \param s A pointer to the object representing the semaphore.
*/
#define await_sem(s) \
do { \
await((s)->count > 0); \
XXX3; \
} while (0)
/**
* Signal a semaphore
*
* This macro carries out the "signal" operation on the semaphore. The
* signal operation increments the counter inside the semaphore, which
* eventually will cause waiting protothreads to continue executing.
*
* \param s A pointer to the object representing the semaphore.
*/
#define signal_sem(s) XXX4
/* Use case */
#include <stdio.h>
#include <unistd.h>
#define NUM_ITEMS 32
#define BUFSIZE 8
static int buffer[BUFSIZE];
static int bufptr;
static void add_to_buffer(int item) {
printf("Item %d added to buffer at place %d\n", item, bufptr);
buffer[bufptr] = item;
bufptr = (bufptr + 1) % BUFSIZE;
}
static int get_from_buffer(void) {
int item = buffer[bufptr];
printf("Item %d retrieved from buffer at place %d\n", item, bufptr);
bufptr = (bufptr + 1) % BUFSIZE;
return item;
}
static int produce_item(void) {
static int item = 0;
printf("Item %d produced\n", item);
return item++;
}
static void consume_item(int item) {
printf("Item %d consumed\n", item);
}
static struct async_sem full, empty;
static async producer(struct async *pt) {
static int produced;
async_begin(pt);
for (produced = 0; produced < NUM_ITEMS; ++produced) {
await_sem(&full);
add_to_buffer(produce_item());
signal_sem(&empty);
}
async_end;
}
static async consumer(struct async *pt) {
static int consumed;
async_begin(pt);
for (consumed = 0; consumed < NUM_ITEMS; ++consumed) {
await_sem(&empty);
consume_item(get_from_buffer());
signal_sem(&full);
}
async_end;
}
static async driver_thread(struct async *pt) {
static struct async cr_producer, cr_consumer;
async_begin(pt);
init_sem(&empty, 0);
init_sem(&full, BUFSIZE);
async_init(&cr_producer);
async_init(&cr_consumer);
await(producer(&cr_producer) & consumer(&cr_consumer));
async_end;
}
int main(void) {
struct async driver_pt;
async_init(&driver_pt);
while (!async_call(driver_thread, &driver_pt)) {
usleep(10); /* why? */
}
return 0;
}
```
這段程式碼是經典的[生產者——消費者問題](https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem),預期執行輸出為:
```shell
Item 0 produced
Item 0 added to buffer at place 0
Item 1 produced
Item 1 added to buffer at place 1
Item 2 produced
Item 2 added to buffer at place 2
Item 3 produced
Item 3 added to buffer at place 3
Item 4 produced
Item 4 added to buffer at place 4
Item 5 produced
Item 5 added to buffer at place 5
Item 6 produced
Item 6 added to buffer at place 6
Item 7 produced
Item 7 added to buffer at place 7
Item 0 retrieved from buffer at place 0
Item 0 consumed
Item 1 retrieved from buffer at place 1
Item 1 consumed
Item 2 retrieved from buffer at place 2
Item 2 consumed
Item 3 retrieved from buffer at place 3
Item 3 consumed
Item 4 retrieved from buffer at place 4
Item 4 consumed
Item 5 retrieved from buffer at place 5
Item 5 consumed
Item 6 retrieved from buffer at place 6
Item 6 consumed
Item 7 retrieved from buffer at place 7
Item 7 consumed
```
#### 1. 解釋上述程式運作原理,特別是和 [Duff's device](https://en.wikipedia.org/wiki/Duff%27s_device) 的關聯;
#### 2. 參照 [ProtoThread](http://dunkels.com/adam/pt/) 程式碼,比較上述差異,需要設計實驗,分析兩者在切換 coroutine 的執行時間和記憶體開銷成本
#### 3. 整合 timeout 機制,允許 callback function 在指定的時間執行,如 [TimerCallback callback function](https://docs.microsoft.com/en-us/previous-versions/windows/desktop/legacy/ms686790(v=vs.85))
:::warning
說好的進度呢?
:notes: jserv
:::
> 慘了,不夠努力
### Reference
* [kaeteyaru 共筆](https://hackmd.io/@kaeteyaruyo/2019q3_homework3)
* [Lecture 21 NotesTypes in C](https://www.cs.cmu.edu/~fp/courses/15122-f15/lectures/21-types.pdf)