# RL-Pytorch Mario

## 1. Setup Mario using OpenAI Gym

- **Module installation**
>* Mario game (use version 7.4.0): https://pypi.org/project/gym-super-mario-bros/
>* gym (use version 0.25.1): https://pypi.org/project/gym/
>* Virtual joypad: https://pypi.org/project/nes-py/

- **Load the environment**
```python=
# Game
import gym_super_mario_bros as mario
# Joypad wrapper
from nes_py.wrappers import JoypadSpace
# Simplified controls
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
```

- **Set up the base environment**
```python=
# Set the Mario version (v0 = the default first-generation game)
env = mario.make("SuperMarioBros-v0")
# Simplify the action space
env = JoypadSpace(env, SIMPLE_MOVEMENT)
```

- **Create the game loop**
```python=
# Create a flag - restart
done = True
# Loop over frames
for step in range(100000):
    if done:
        # Start (or restart) the game
        env.reset()
    # Take random actions
    state, reward, done, info = env.step(env.action_space.sample())
    # Show the game on the screen
    env.render()
env.close()
```

## 2. Preprocess Environment

- **Game capture**
```python=
# FrameStack : captures what happens in each frame of the game
# GrayScaleObservation : compresses the color game frames to grayscale to reduce resource usage
from gym.wrappers import FrameStack, GrayScaleObservation
# Vectorized wrappers (used below for the dummy env and frame stacking)
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack
# Import Matplotlib to show the impact of frame stacking
from matplotlib import pyplot as plt
```

- **Grayscale compression (grayscale observation)**
>By converting the captured color frames to grayscale, resource consumption is reduced.
>![](https://i.imgur.com/WawoLgE.jpg =50%x)![](https://i.imgur.com/llMle1u.jpg =50%x)
>Compressing the image from depth 3 (RGB) to depth 1 cuts the resource consumption to 1/3.

- **Initial environment processing**
```python=
# 1. Create the base env
env = mario.make("SuperMarioBros-v0")
# 2. Simplify the controls
env = JoypadSpace(env, SIMPLE_MOVEMENT)
# 3. Grayscale the observations
env = GrayScaleObservation(env, keep_dim=True)
# 4. Wrap inside the dummy (vectorized) env
env = DummyVecEnv([lambda: env])
```

- **Frame stacking**
```python=
# 5. Stack the frames
env = VecFrameStack(env, 4, channels_order="last")
```
>Keep the last 4 (grayscale-compressed) game frames so the model can learn from them later.
![](https://i.imgur.com/PgHf56o.png =25%x)![](https://i.imgur.com/5qSm94u.png =25%x)![](https://i.imgur.com/PHSSWWU.png =25%x)![](https://i.imgur.com/evvuIeV.png =25%x)
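
- **Visualizing the stacked frames (sketch)**
>Matplotlib was imported above to show the impact of frame stacking. The block below is a minimal sketch of one way to plot the four stacked grayscale frames after a few random steps; the step count and subplot layout are assumptions rather than part of the original pipeline, and `env`/`plt` are the stacked environment and the Matplotlib import from the steps above.
```python=
# Sketch (assumed, not part of the original pipeline): plot the 4 stacked frames
state = env.reset()
# Take a few random steps so the stack holds four different frames
for _ in range(10):
    state, reward, done, info = env.step([env.action_space.sample()])

# state is a (1, height, width, 4) array: 1 env, 4 stacked grayscale frames
plt.figure(figsize=(16, 4))
for idx in range(state.shape[3]):
    plt.subplot(1, 4, idx + 1)
    plt.imshow(state[0][:, :, idx], cmap="gray")
plt.show()
```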
## 3. Train the RL model

- **Learning mode**
>**=> Reinforcement Learning**
>>**Agent:** *Mario*
>>**Environment:** *Mario world*
>>**Action:** *right, left, A, B*
>>**Reward:** *distance, coins, time, etc.*
>>![](https://i.imgur.com/KpeQAjq.png)

- **Set up a save system**
>To avoid losing the previous training results on every run, set up a save-and-restore system.
```python=
# Set up a callback system
import os
from stable_baselines3.common.callbacks import BaseCallback

class TrainAndLoggingCallback(BaseCallback):
    def __init__(self, check_freq, save_path, verbose=1):
        super(TrainAndLoggingCallback, self).__init__(verbose)
        self.check_freq = check_freq
        self.save_path = save_path

    def _init_callback(self):
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self):
        if self.n_calls % self.check_freq == 0:
            model_path = os.path.join(self.save_path, "best_model_{}".format(self.n_calls))
            self.model.save(model_path)
        return True

# Set up the model-saving callback
CHECKPOINT_DIR = "./train/"
LOG_DIR = "./logs/"
callback = TrainAndLoggingCallback(check_freq=100000, save_path=CHECKPOINT_DIR)
```
>*Note: check_freq is the number of training steps between each save.*

- **Algorithm**
>**=> PPO algorithm**
>>Full name: Proximal Policy Optimization.
>>Designed by OpenAI in recent years, it is widely used in reinforcement learning.
>>Its advantage is that it compares the policy's results before and after each update and keeps the update size stable, so learning proceeds at a steady pace, neither too fast nor too slow.
>>
>>**OpenAI documentation**
>>>https://openai.com/research/openai-baselines-ppo
```python=
from stable_baselines3 import PPO
```

- **AI model setup**
```python=
model = PPO(policy="CnnPolicy", env=env, verbose=1, tensorboard_log=LOG_DIR,
            learning_rate=0.0001, n_steps=512)
```
>*Note: CnnPolicy uses a convolutional neural network for the policy and value networks, which suits image observations like the stacked game frames.*

- **Start learning**
```python=
model.learn(total_timesteps=1000000, callback=callback)
```

## 4. Test it out

- **Simulation result demo**
```python=
# Let the AI model play
last_model = input("last model:")
model = PPO.load("./train/best_model_" + last_model)

state = env.reset()
while True:
    action, _state = model.predict(state)
    state, reward, done, info = env.step(action)
    env.render()
```
>*Live demo*
![](https://i.imgur.com/FiGuic7.gif)

## 5. User-friendly interface
>To make later runs easier to carry out without constantly editing the code, the following features were added.
>
>**1. Separate training and simulation code**
>>Although the code that runs the trained model and the code that trains it are largely the same, they differ slightly at the end; splitting them also makes it possible to watch the learning results while training continues, so the code was split into two files.
>>
>>*Training code*
>>![](https://i.imgur.com/4EHtLW7.png =75%x)
>>*Simulation code*
>>![](https://i.imgur.com/UCYzwYg.png =75%x)
>
>**2. Branch selection**
>>Because the model's parameters often need to be adjusted during training to reach different learning outcomes, a branch feature was added.
>>
>>*Branch selection*
```python=
branch = input("choose branch:")
CHECKPOINT_DIR = f"E:/Mario/{branch}/train/"
LOG_DIR = f"E:/Mario/{branch}/logs/"
```
>>*Branch parameter settings*
```python=
if branch == "branch1":
    lr = 0.00001
elif branch == "branch2":
    lr = 0.001
print(f"set the learning_rate = {lr}")
```
>**3. Model selection**
>>To observe what the AI has learned at different numbers of training steps, or to resume from the previous training checkpoint, a model-selection feature was added.
```python=
choose_model = input("choose model:")
model = PPO.load(f"E:/Mario/{branch}/train/best_model_{choose_model}.zip")
```

## 6. Final results showcase
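
>For reference, a full simulation run that combines the environment setup from section 2, the playback loop from section 4, and the branch/model prompts from section 5 could look roughly like the sketch below. This is an assumed consolidation, not the project's actual simulation file; the `E:/Mario/...` paths follow the examples above.
```python=
# Sketch (assumed consolidation of the snippets above, not the original file)
import gym_super_mario_bros as mario
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from gym.wrappers import GrayScaleObservation
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack
from stable_baselines3 import PPO

# Rebuild the same preprocessed environment used for training (section 2)
env = mario.make("SuperMarioBros-v0")
env = JoypadSpace(env, SIMPLE_MOVEMENT)
env = GrayScaleObservation(env, keep_dim=True)
env = DummyVecEnv([lambda: env])
env = VecFrameStack(env, 4, channels_order="last")

# Branch and model selection prompts (section 5)
branch = input("choose branch:")
choose_model = input("choose model:")
model = PPO.load(f"E:/Mario/{branch}/train/best_model_{choose_model}.zip")

# Let the trained agent play, as in section 4
state = env.reset()
while True:
    action, _state = model.predict(state)
    state, reward, done, info = env.step(action)
    env.render()
```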