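# Study Group

---

## 8/19

---

### References

[① つくりながら学ぶ！深層強化学習](https://www.amazon.co.jp/dp/4839965625/ref=cm_sw_em_r_mt_dp_zknjFbHFAH17G) (Learn by Building! Deep Reinforcement Learning)

![](https://i.imgur.com/w1UNtOH.png =230x)

---

#### Understanding the basics of reinforcement learning through code

[Source code (GitHub)](https://github.com/YutaroOgawa/Deep-Reinforcement-Learning-Book)

* Uses a maze-exploration task to teach reinforcement learning methods (policy gradient, Sarsa, Q-learning) by implementing them
* As an application, implements DQN on the CartPole task
* I used this book as the main reference and followed its code to understand Q-learning

---

#### ② 現場で使える！ deep learning series

[現場で使える！深層学習入門](https://www.amazon.co.jp/dp/4798150975/ref=cm_sw_em_r_mt_dp_0fnjFbBJAFXCH) (introduction to deep learning)

[現場で使える！深層強化学習入門](https://www.amazon.co.jp/dp/4798159921/ref=cm_sw_em_r_mt_dp_BgnjFbHHVHHKD) (introduction to deep reinforcement learning)

![](https://i.imgur.com/x5u30VL.jpg =230x)

---

#### Understanding the basics of deep reinforcement learning through documentation and code

[Supplementary data](https://drive.google.com/drive/folders/1SisAdD3LohF-ZcooyIc090HeYnWsNuwI?usp=sharing)

* Explains the concepts and algorithms of deep reinforcement learning
* Lists source code focused on building and training the network, and explains its behavior in detail
* Also covers the basics of Python and NumPy; I consulted it to check basic knowledge

---

#### ③ Qiita article

[DQNで自作迷路を解く](https://qiita.com/cvusk/items/e4f5862574c25649377a) (Solving a self-made maze with DQN)

[Source code](https://github.com/shibuiwilliam/maze_solver)

I read the source code to understand how DQN differs from the tabular reinforcement learning (Q-learning) in reference ①.

---

#### ④ Materials from the Advanced Topics in Machine Learning (機械学習特論, 2019) course

[Reference materials](https://drive.google.com/drive/folders/14RaimTEGLBiyvoZkG0gklUOkOtz5JTas?usp=sharing)

---

#### Understanding reinforcement learning through the maze-exploration task

[Source code (GitHub)](https://github.com/YutaroOgawa/)

Building the neural network

``` python
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import RMSprop

def build_model(self):
    # Input is one (state, action) pair stacked into shape (2, 2).
    model = Sequential()
    model.add(Dense(128, input_shape=(2, 2), activation='tanh'))
    model.add(Flatten())
    model.add(Dense(128, activation='tanh'))
    model.add(Dense(128, activation='tanh'))
    # Single linear output: the estimated value of the (state, action) pair.
    model.add(Dense(1, activation='linear'))
    # Older Keras releases spell this argument `lr`.
    model.compile(loss="mse", optimizer=RMSprop(learning_rate=self.learning_rate))
    return model
```

---

Record the agent's state, action, reward, next state, the moves available from the next state, and the episode-end flag

``` python
def remember_memory(self, state, action, reward, next_state, next_movables, done):
    # Store one transition for later experience replay.
    self.memory.append((state, action, reward, next_state, next_movables, done))
```

---

Select an action with the epsilon-greedy method

``` python
import random

def choose_action(self, state, movables):
    # With probability epsilon, explore: pick a random legal move.
    if self.epsilon >= random.random():
        return random.choice(movables)
    # Otherwise exploit: pick the move with the highest predicted value.
    else:
        return self.choose_best_action(state, movables)
```

---

Select the action with the highest predicted action value (ties are broken at random)

``` python
import numpy as np

def choose_best_action(self, state, movables):
    best_actions = []
    max_act_value = float('-inf')  # was -100, which breaks if all values fall below it
    for a in movables:
        np_action = np.array([[state, a]])
        # predict returns an array of shape (1, 1); take the scalar
        act_value = float(self.model.predict(np_action)[0][0])
        if act_value > max_act_value:
            best_actions = [a]
            max_act_value = act_value
        elif act_value == max_act_value:
            best_actions.append(a)
    return random.choice(best_actions)
```

---

Experience replay: train the network on a random minibatch of stored transitions

``` python
def replay_experience(self, batch_size):
    batch_size = min(batch_size, len(self.memory))
    minibatch = random.sample(self.memory, batch_size)
    X = []
    Y = []
    for state, action, reward, next_state, next_movables, done in minibatch:
        input_action = [state, action]
        if done:
            target_f = reward
        else:
            # Q-learning target: reward + gamma * max over next moves of Q(next_state, move)
            next_values = []
            for next_action in next_movables:
                np_next_s_a = np.array([[next_state, next_action]])
                next_values.append(self.model.predict(np_next_s_a))
            np_n_r_max = np.amax(np.array(next_values))
            target_f = reward + self.gamma * np_n_r_max
        X.append(input_action)
        Y.append(target_f)
    np_X = np.array(X)
    np_Y = np.array([Y]).T
    self.model.fit(np_X, np_Y, epochs=1, verbose=0)  # fit predicted values to the targets
    if self.epsilon > self.e_min:
        self.epsilon *= self.e_decay  # decay epsilon
```

---

## 8/26

### Reinforcement learning

---

#### Review of state transitions and MDPs

* Re-read the textbook for Advanced Topics in Machine Learning

---

#### Purpose of Q-learning

* The state that follows an action is not known in advance
* When a reward arrives, and how large it is, is not known in advance
* Q-learning copes with this by learning action values from experienced transitions and rewards, so the state-transition probabilities and reward function do not have to be given beforehand

---

#### State transitions in UberEats

![](https://i.imgur.com/noIuSnw.png)

---

#### Rewards (under consideration)

![](https://i.imgur.com/ZDkqaKi.png =430x)

---

#### Action selection (under consideration)

* The action is decided from the internal state and from information obtained from the environment (control parameters and fluctuating parameters)

![](https://i.imgur.com/y7VFde1.png =600x)

---

#### Visualizing state transitions

[UberEats-MDP.ipynb](https://colab.research.google.com/drive/1Ic0NIQAVlssl5OckwQqpQZ0lEnHAaS-A?usp=sharing)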
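
#### Sketch: tabular Q-learning on a toy delivery MDP

To make the 8/26 notes on MDPs and the purpose of Q-learning concrete, here is a minimal sketch of tabular Q-learning on a small MDP. The state names, action names, rewards, and the 30% pickup delay are all hypothetical and chosen only for illustration; they are not taken from the UberEats diagrams or the linked notebook. The agent never reads the transition rule in `step`; it only observes sampled next states and rewards, and updates Q(s, a) toward `reward + gamma * max Q(next_state, a')`, the same target form used in `replay_experience`.

```python
import random

# Hypothetical states and actions for a toy delivery MDP (illustrative only).
STATES = ["waiting", "to_restaurant", "delivering", "done"]
ACTIONS = ["accept", "decline", "move"]


def step(state, action, rng):
    """Environment dynamics, hidden from the agent. Returns (next_state, reward)."""
    if state == "waiting":
        if action == "accept":
            return "to_restaurant", 0.0
        return "waiting", -1.0  # any other action just wastes time
    if state == "to_restaurant":
        if action == "move" and rng.random() < 0.7:
            return "delivering", -1.0  # pickup succeeded
        return "to_restaurant", -1.0  # delayed pickup, or a useless action
    if state == "delivering":
        if action == "move":
            return "done", 10.0  # delivery completed
        return "delivering", -1.0
    raise ValueError(f"no transitions defined for state {state!r}")


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2,
               max_steps=100, seed=0):
    """Learn a Q-table from sampled transitions only."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        state = "waiting"
        for _ in range(max_steps):
            if state == "done":
                break
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward = step(state, action, rng)
            # The terminal state has no future value.
            if next_state == "done":
                best_next = 0.0
            else:
                best_next = max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update.
            td_target = reward + gamma * best_next
            q[(state, action)] += alpha * (td_target - q[(state, action)])
            state = next_state
    return q


def greedy_policy(q):
    """Best-known action for each non-terminal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)])
            for s in STATES if s != "done"}
```

Calling `greedy_policy(q_learning())` returns the learned action for each non-terminal state. A table is enough here because there are only four states; the DQN code from 8/19 replaces the table with a network for the same target.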