kxo notes - HackMD

研讀預期目標和定點數計算

Brian

定點數

概念：把精度為 n 的浮點數向右位移 n 個位元

Q Notation

Q23.8: 表示該數有 23 個整數位、8 個小數位，是 32 位二補數。

乘法除法校正理由

假設兩個精度為 n 的 2 進位浮點數 a, b
其定點數表達為

a_{f} = a \times 2^{n}

b_{f} = b \times 2^{n}

a_{f} \times b_{f} = a \times b \times 2^{n} \times 2^{n}

在轉換過程多乘了一個

2^{n}

故要校正回來
同理發生在除法運算

\frac{a_{f}}{b_{f}} = \frac{a}{b} \times (\frac{2^{n}}{2^{n}})

2^{n}

被抵消所以要往左位移 n 補回來

`fixed_power_int` 解析

理解為在定點數的框架下做計算, 這也是為什麼需要在乘法後做校正
值得一提的是這行

x += 1UL << (frac_bits - 1)

個人認為是做四捨五入讓右移時不會造成誤差大於單位的一半

load average

意義：表示系統的負載量而非 cpu 的負載

L_{t} = C_{r} + C_{u}

L_{t}

: 在時間點 t 的 load

C_{r}

: 狀態為 TASK_RUNNABLE的task總數

C_{u}

: 狀態為 TASK_UNINTERRUPTIBLE的task總數

系統在顯示的load average則是移動平滑後的結果

S_{t} = α \times X_{t - 1} + (1 - α) \times S_{t - 1}, w h e r e 0 < α < 1

在load avg 中使用exponential moving average來作為load的計算
所以上述公式的

α

為

e^{- t_{s} / T}

t_{s}

:kernel取樣的秒數,這裡為5秒

T

:多少時間尺度的load avg (1min, 5min, 15min)

理解常數定義

#define FSHIFT          11              /* nr of bits of precision */
#define FIXED_1         (1<<FSHIFT)     /* 1.0 as fixed-point */
#define LOAD_FREQ       (5*HZ+1)        /* 5 sec intervals */
#define EXP_1           1884            /* 1/exp(5sec/1min) as fixed-point */
#define EXP_5           2014            /* 1/exp(5sec/5min) */
#define EXP_15          2037            /* 1/exp(5sec/15min) */

FSHIFT: 精度11的定點數
FIXED_1：1在定點數的表示法, 右移11位
LOAD_FREQ: 計算load avg的間隔, HZ在kernel的意義是1秒進行多少次timer interrput, 所以可以當作1單位秒
EXP_1：
$α$ 在1分鐘的定點數表示法 =
$e^{- 5 / 60}$
EXP_5：
$α$ 在5分鐘的定點數表示法 =
$e^{- 5 / 180}$
EXP_15：
$α$ 在15分鐘的定點數表示法 =
$e^{- 5 / 900}$

calc_global_load 函數就是根據指數移動平滑平均和定點數運算在5秒一個間隔去更新load avg 1,5,15的值

強化學習（Reinforcement Learning）

馬可夫決策過程（Markov Decision Process)

每列總和為1的轉移矩陣就是右隨機矩陣（right stochastic matrix)

折扣總和（discount sum）定義如下：

E [\sum_{t = 0}^{\infty} γ^{t} R_{a_{t}} (s_{t}, s_{t + 1})], 0 \leq γ \leq 1

為什麼需要一個

γ

去做折扣的理由是假設有一個獎勵為正的環,在沒有discount的情況下,最佳選擇就會是一直繞著環的動作,而不會採取其他考慮

自己的對這個discount的見解是下個動作之外的未來是不確定的,所需要對在這個時間點所知的報酬打折扣

理解遞迴式

V (s) = \sum_{s^{'}} P_{π (s)} (s, s^{'}) (R_{π (s)} (s, s^{'}) + γ V (s^{'}))

在狀態s的折扣回報我們可以分成兩個部分來計算
第一個部分是選擇下一個步驟的回報s'
另一個是s'之後的每個步驟s'', s''', …,

s^{n}

的折扣回報
其實這個回報就是

γ V (s^{'})

π (s) = \arg max_{a} {\sum_{s^{'}} P_{a} (s, s^{'}) (R_{a} (s, s^{'}) + γ V (s^{'}))}

π (s)

: 對應的最佳動作選擇

這樣還是有點難懂所以去找了一些解說
Bellman Equation

V (s) = R (s) + γ m a x_{a \in A (s)} \sum_{s^{'}} P (s^{'} | s, a) V (s^{'})

Value Iteration

V_{0} (s) < - 0

V_{i + 1} (s) < - R (s) + γ m a x_{a \in A (s)} \sum_{s^{'}} P (s^{'} | s, a) V (s^{'})

利用每次迭代選擇的最佳步驟至V(s)的值收斂

例子

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

左邊狀態的回報計算
第一輪:

3 + 0.5 * m a x (1.0 * 0.0, 0.5 * 0.0 + 0.5 * 0.0) = 3

第二輪:

3 + 0.5 * m a x (1.0 * - 1.0, 0.5 * 3 + 0.5 * (- 1)) = 3.5

右邊狀態回報計算
第一輪:

- 1 + 0.5 * m a x (1.0 * 0.0, 0.0 * 0.0) = - 1

第二輪:

- 1 + 0.5 * m a x (1.0 * - 1, 1.0 * 3) = 0.5

以此類推,終止條件是當兩輪的結果收斂
然後選出得到折扣報酬最大的動作

這個方法的必須先知道選擇動作的機率,在實務上可能是未知的
所以在專案ttt裡面的強化學習實作是使用時間差分學習

強化學習演算法（RL Algorithm）

總結就是如同筆記的這句話
強化學習的核心目標是學習最佳的 𝜋，以最大化𝑉𝜋(𝑠)，並透過探索與利用（exploration vs. exploitation）策略來達成最優解。

時間差分學習

學習的目標: 價值函式

V (s) = E [G_{t} | S_{t} = s]

以ttt為例, 我們學習的是在某個棋盤狀態下剩餘可以放子的格子的價值

V (S_{t}) \leftarrow V (S_{t}) + α [R_{t + 1} + γ V (S_{t + 1}) - V (S_{t})]

這個更新公式的關鍵想法是：目前狀態的價值應該向「即時獎勵 + 折扣後的下一狀態價值」靠攏

往下看括號項

[R_{t + 1} + γ V (S_{t + 1}) - V (S_{t})]

意義是實際觀察到的狀態期望扣除當下預期期望,
換句話說,我們看到的比預期的好,遞迴狀態價值就加上學習率＊

[R_{t + 1} + γ V (S_{t + 1}) - V (S_{t})]

研讀ttt程式碼

主要關注兩個檔案

train.c:訓練的主要邏輯
agents/reinforcement_learning.c: 強化學習的實作

agents/reinforcement_learning.c

**int get_action_exploit(char table, rl_agent_t agent)

找出在某個棋盤狀態下, 下一步 o/x 最大的報酬
回傳值是棋盤的位置

int get_action_exploit(char *table, rl_agent_t *agent)
{
    int max_act = -1;
    float max_q = -FLT_MAX;
    float *state_value = agent->state_value;
    int candidate_count = 1;
#ifdef VERBOSE
    printf("[ ");
#endif
    // 列舉每一個可放棋子的狀態
    for_each_empty_grid (i, table) {
        table[i] = agent->player; //棋手是誰
        float new_q = state_value[table_to_hash(table)];
#ifdef VERBOSE
        printf("%f ", new_q);
#endif
        if (new_q == max_q) {
            ++candidate_count;
            if (rand() % candidate_count == 0) {
                max_act = i;
            }
        } else if (new_q > max_q) {
            candidate_count = 1;
            max_q = new_q;
            max_act = i;
        }
        table[i] = ' ';
    }
#ifdef VERBOSE
    printf(" ]\n");
    printf("exploit %d\n", max_act);
#endif
    return max_act;
}

*int table_to_hash(char table)

這個函式是把現在的棋盤狀態編碼成一個int
方法就是用
$\sum_{i = 0} S_{i} * 3^{16 - i}$ ,最高位是棋盤i=0的位置

char *hash_to_table(int hash)

table_to_hash的相反, 把湊雜值變回棋盤

train.c

static float update_state_value
這個函式就是TD learning的遞迴函式

static float update_state_value(int after_state_hash,
                                float reward,
                                float next,
                                rl_agent_t *agent)
{
    float curr = reward - GAMMA * next; // 不懂為啥是減
    agent->state_value[after_state_hash] =
        (1 - LEARNING_RATE) * agent->state_value[after_state_hash] + // 提項V(S_t)
        LEARNING_RATE * curr;
#if MONTE_CARLO
    return curr;
#else
    return agent->state_value[after_state_hash];
}

static void train
訓練的主函式
這裡埰用兩種方法去實作下一步該往哪走

EPSILON_GREEDY
這裡利用exploration vs exploitation的概念,當取樣隨機小於
$ϵ$ 下一步就是可以走的格子隨機挑一個,也就是exploration的概念,反之則是採取NON-GREEDY的選擇方式
NON-GREEDY
選擇state_value裡面最大的格子,詳見上面的敘述

static void train(int iter)
{
    int episode_moves[N_GRIDS];  // from 0 moves to N_GRIDS moves.
    float reward[N_GRIDS];
    int episode_len = 0;
    char table[N_GRIDS];
    memset(table, ' ', N_GRIDS);
    int turn = (iter & 1) ? 0 : 1;  // 0 for 'O', 1 for 'X'
    char win = ' ';
    while (1) {
        if (win == 'D') {
#ifdef VERBOSE
            draw_board(table);
            printf("It is a draw!\n");
#endif
            break;
        } else if (win != ' ') {
#ifdef VERBOSE
            draw_board(table);
            printf("%c won!\n", win);
#endif
            break;
        }
#if EPSILON_GREEDY
        int move = get_action_epsilon_greedy(table, &agent[turn]);
#else
        int move = get_action_exploit(table, &agent[turn]);
#endif
        table[move] = "OX"[turn]; // update move to the table array
        win = check_win(table);
        episode_moves[episode_len] = table_to_hash(table);
        reward[episode_len] =
            (1 - REWARD_TRADEOFF) * get_score(table, agent[turn].player) +
            REWARD_TRADEOFF * calculate_win_value(win, agent[turn].player); // update reward
        ++episode_len;
#ifdef VERBOSE
        draw_board(table);
#endif
        turn = !turn;
    }
    // 
    turn = !turn;  // the player who makes the last move.
    float next = 0;
    for (int i_move = episode_len - 1; i_move >= 0; --i_move) {
        next = update_state_value(episode_moves[i_move], reward[i_move], next,
                                  &agent[turn]);
    }
}

研讀 預期目標 和 定點數計算

定點數