# RL-PyTorch Mario
## 1. Setup Mario
using OpenAI Gym
- **Module installation** (a quick version check follows the list)
>* Mario game (use version 7.4.0)
https://pypi.org/project/gym-super-mario-bros/
>* gym (use version 0.25.1)
https://pypi.org/project/gym/
>* Virtual joypad (nes-py)
https://pypi.org/project/nes-py/
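>A quick way to confirm the installed versions match the ones above (a minimal sketch; the distribution names are assumed to match the PyPI project names):
```python=
# Print the installed version of each package listed above.
from importlib.metadata import version

for pkg in ("gym-super-mario-bros", "gym", "nes-py"):
    print(pkg, version(pkg))
```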
- **Load the environment**
```python=
# Game
import gym_super_mario_bros as mario
# Joypad wrapper
from nes_py.wrappers import JoypadSpace
# Simplified controls
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
```
- **Set up the base environment** (see the quick action check below)
```python=
# Set the Mario version (v0 = the default first-generation version)
env = mario.make("SuperMarioBros-v0")
# Simplify the action set
env = JoypadSpace(env, SIMPLE_MOVEMENT)
```
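>A quick check (using the env built above) of what the simplified actions actually are:
```python=
# SIMPLE_MOVEMENT is a short list of button combinations (e.g. ['right'], ['right', 'A']),
# and JoypadSpace turns it into a small discrete action space.
print(SIMPLE_MOVEMENT)
print(env.action_space)
```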
- **Create the game loop**
```python=
# Create a flag - restart when the level is over
done = True
# The frame loop
for step in range(100000):
    if done:
        # Start the game
        env.reset()
    # Take random actions
    state, reward, done, info = env.step(env.action_space.sample())
    # Show the game on the screen
    env.render()
env.close()
```
## 2. Preprocess Environment
- **Game capture**
```python=
# FrameStack: capture what happens in each game frame
# GrayScaleObservation: compress the colored game frames to grayscale to reduce resource usage
from gym.wrappers import FrameStack, GrayScaleObservation
# Vectorized-env wrappers used in the preprocessing steps below
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack
# Import Matplotlib to show the impact of frame stacking
from matplotlib import pyplot as plt
```
- **Grayscale compression (grayscale observation)**
>Convert the captured color frames to grayscale to reduce resource usage
>
>The image is compressed from depth 3 (RGB) to depth 1 ==> resource usage drops to about 1/3 (see the sketch below)
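>A small sketch of the depth reduction, building a throwaway env just to compare the observation shapes before and after the grayscale wrapper:
```python=
# Color NES frame: height x width x 3 (RGB)
raw_env = mario.make("SuperMarioBros-v0")
print(raw_env.observation_space.shape)    # e.g. (240, 256, 3)

# Grayscale frame with the channel dimension kept: height x width x 1
gray_env = GrayScaleObservation(JoypadSpace(raw_env, SIMPLE_MOVEMENT), keep_dim=True)
print(gray_env.observation_space.shape)   # e.g. (240, 256, 1)
```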
- **Initial environment processing**
```python=
# 1. Create the base env
env = mario.make("SuperMarioBros-v0")
# 2. Simplify the controls
env = JoypadSpace(env, SIMPLE_MOVEMENT)
# 3. GrayScale
env = GrayScaleObservation(env, keep_dim=True)
# 4. Wrap inside the Dummy env
env = DummyVecEnv([lambda : env])
```
- **Frame stacking**
```python=
# 5. Stack the frames
env = VecFrameStack(env, 4, channels_order="last")
```
>Keep the last 4 (grayscale-compressed) game frames so the model can learn from them later; the sketch below visualizes the stacked frames
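>Matplotlib was imported above to show the impact of frame stacking; a minimal sketch (using the stacked env built in steps 1-5) that plots the 4 stacked grayscale frames:
```python=
# Step a few times so the 4-frame buffer holds different frames,
# then show the 4 stacked grayscale frames side by side.
state = env.reset()
for _ in range(4):
    state, reward, done, info = env.step([env.action_space.sample()])

plt.figure(figsize=(16, 4))
for idx in range(4):
    plt.subplot(1, 4, idx + 1)
    plt.imshow(state[0][:, :, idx], cmap="gray")
plt.show()
```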

## 3. Train the RL model
- **Learning paradigm**
>**=> Reinforcement Learning**
>>**Agent:** *Mario*
>>**Environment:** *Mario world*
>>**Action:** *right, left, A, B*
>>**Reward:** *distance, coins, time...etc (see the sketch below)*
>>
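>Where these show up in code (a sketch using the preprocessed env from section 2; the exact info keys come from gym-super-mario-bros and may vary by version):
```python=
# One random step: the reward reflects progress, and info carries per-frame stats.
state = env.reset()
state, reward, done, info = env.step([env.action_space.sample()])
print(reward)    # one reward value per (vectorized) env
print(info[0])   # per-frame stats such as x_pos, coins and time (keys may vary)
```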
- **Build a saving system**
>To avoid losing the previous training results on every new run, build a save-and-restore system
```python=
# set up a callback system
import os
from stable_baselines3.common.callbacks import BaseCallback

class TrainAndLoggingCallback(BaseCallback):
    def __init__(self, check_freq, save_path, verbose=1):
        super(TrainAndLoggingCallback, self).__init__(verbose)
        self.check_freq = check_freq
        self.save_path = save_path

    def _init_callback(self):
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self):
        if self.n_calls % self.check_freq == 0:
            model_path = os.path.join(self.save_path, 'best_model_{}'.format(self.n_calls))
            self.model.save(model_path)
        return True

# setup model saving callback
CHECKPOINT_DIR = "./train/"
LOG_DIR = "./logs/"
callback = TrainAndLoggingCallback(check_freq=100000, save_path=CHECKPOINT_DIR)
```
>*Note: check_freq is the number of training steps between saves*
- **Algorithm**
>**=> PPO algorithm**
>>Full name: Proximal Policy Optimization
>>Designed by OpenAI in recent years,
>>it is widely used in reinforcement learning.
>>Its strength is that it compares the learning results before and after each update,
>>keeping the learning pace stable, neither too fast nor too slow
>>
>>**OpenAI reference**
>>>https://openai.com/research/openai-baselines-ppo
```python=
from stable_baselines3 import PPO
```
- **AI model setup**
```python=
model = PPO(policy="CnnPolicy", env=env, verbose=1,
            tensorboard_log=LOG_DIR,
            learning_rate=0.0001, n_steps=512)
```
>*Note: CnnPolicy processes the image observations with a convolutional neural network, which speeds up image processing; the sketch below prints the network it builds*
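>To see the convolutional network the policy actually builds, the model object can simply be printed (a quick sketch):
```python=
# Printing the policy shows the CNN feature extractor and the actor/critic heads.
print(model.policy)
```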
- **Start learning**
```python=
model.learn(total_timesteps=1000000, callback=callback)
```
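>Optionally, the final weights can also be saved explicitly once learn() returns (a sketch; the filename is just an example, since the callback above already writes periodic checkpoints):
```python=
# Save the final model next to the periodic checkpoints (example filename).
model.save(os.path.join(CHECKPOINT_DIR, "final_model"))
```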
## 4. Test it out
- **Demo of the simulation results**
```python=
# AI model play
last_model = input("last model:")
model = PPO.load("./train/best_model_" + last_model)
state = env.reset()
while True:
    action, _state = model.predict(state)
    state, reward, done, info = env.step(action)
    env.render()
```
> *Live simulation demo*

## 5. User-friendly interface
>To make later runs easier without having to keep modifying the code,
>the following features were added
>
> **1. Separate training / playback code**
> > Although the code that plays back the learned results and the code that trains are largely the same,
> > they still differ slightly at the end.
> > In addition, to be able to watch the results while training continues,
> > the code was split into two scripts
> >
> > *Training script*
> >
> > *Playback script*
> >
>
> **2. Branch selection**
> > Because the model parameters often need to be adjusted during training to reach different learning results,
> > a branch feature was added
> >
> > *Branch selection*
```python=
branch = input("choose branch:")
CHECKPOINT_DIR = f"E:/Mario/{branch}/train/"
LOG_DIR = f"E:/Mario/{branch}/logs/"
```
> > *Branch parameter settings*
```python=
if branch == "branch1":
    lr = 0.00001
elif branch == "branch2":
    lr = 0.001
print(f"set the learning_rate = {lr}")
```
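> > The chosen learning rate then replaces the hard-coded value when the model is created (a sketch reusing the PPO setup from section 3):
```python=
model = PPO(policy="CnnPolicy", env=env, verbose=1,
            tensorboard_log=LOG_DIR,
            learning_rate=lr, n_steps=512)
```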
> **3. Model selection**
> > To observe what the AI produces after different amounts of training, or to continue from the last training session,
> > a model-selection feature was added
```python=
choose_model = input("choose model:")
model = PPO.load(f"E:/Mario/{branch}/train/best_model_{choose_model}.zip")
```
## 6. Final results showcase