# Taxi-v2
[TOC]
 
 
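## How the Q-table differs with the number of training episodes
- After about 2500 training episodes, the [Q-table](https://hackmd.io/QOHobDNpSWOR264loGjv3g) no longer changes.
- A Q-table obtained from fewer than 900 training episodes may cause the agent to enter an endless loop during testing.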
 
 
A case encountered during testing after 800 training episodes:
 

```python=
state = env.encode(4,0,1,2) # (taxi row,taxi column,passenger index, destination index)
print("State : ",state)
```
State : 406
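
The state number 406 follows from how Taxi packs the four components into one integer (5 rows, 5 columns, 5 passenger locations, 4 destinations). The sketch below is a stand-alone re-implementation for illustration only; the `encode` and `decode` functions here are local helpers, not the environment's own methods.

```python
# Stand-alone illustration of the Taxi state encoding (not the gym source).
def encode(taxi_row, taxi_col, pass_idx, dest_idx):
    # 5 rows x 5 columns x 5 passenger locations x 4 destinations = 500 states
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + pass_idx
    state = state * 4 + dest_idx
    return state


def decode(state):
    # Reverse the packing, peeling off the last component first.
    dest_idx = state % 4
    state //= 4
    pass_idx = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, pass_idx, dest_idx


print(encode(4, 0, 1, 2))  # 406
print(decode(406))         # (4, 0, 1, 2)
```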
```python=
q_table[406]
```
array([ -6.11187636, -6.15386317, -6.1232991 , -6.04170499,-14.82243346, -15.14841173])
- 0: south
- 1: north
- 2: east
- 3: west
- 4: pickup
- 5: dropoff
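
A likely reading of this row: the greedy action is the one with the largest Q-value, which here is `west`. In state 406 the taxi is already in column 0, so moving west is blocked, the state does not change, and a purely greedy test policy would pick the same action again. The snippet below only reproduces the argmax over the values shown above; the variable names are illustrative.

```python
ACTIONS = ["south", "north", "east", "west", "pickup", "dropoff"]

# Q-values of state 406 after 800 training episodes, copied from above.
row_406 = [-6.11187636, -6.15386317, -6.1232991,
           -6.04170499, -14.82243346, -15.14841173]

# Greedy choice: index of the largest Q-value.
best_action = max(range(len(row_406)), key=row_406.__getitem__)
print(best_action, ACTIONS[best_action])  # 3 west
```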
 
 
## Changing the reward
 
Original Q-table_2500 (the reward for an illegal pickup/dropoff is -10):
array([[ 0. , 0. , 0. , 0. , 0. , 0. ],
[ 1.62261467, 2.9140163 , 1.62261467, 2.9140163 , 4.348907 , -6.0859837 ],
[ 4.348907 , 5.94323 , 4.348907 , 5.94323 , 7.7147 ,-3.05677 ],
...,
[ 7.7147 , 9.683 , 7.71469995, 5.94322993, -1.28530002,-1.28530008],
[ 1.62261467, 2.9140163 , 1.62261467, 2.9140163 , -7.37738533, -7.37738533],
[14.3 , 11.87 , 14.3 , 17. , 5.3 ,5.3 ]])
 
 
Changing the reward for an illegal pickup/dropoff to -50
Q-table_2500:
array([[ 0. , 0. , 0. , 0. ,0. , 0. ],
[ 1.62261467, 2.9140163 , 1.62261467, 2.9140163 ,4.348907 , -46.0859837 ],
[ 4.348907 , 5.94323 , 4.348907 , 5.94323 ,7.7147 , -43.05677 ],
...,
[ 7.7147 , 9.683 , 7.7147 , 5.94323 ,-41.2853 , -41.2853 ],
[ 1.62261467, 2.9140163 , 1.62261467, 2.9140163 ,-47.37738533, -47.37738533],
[ 14.3 , 11.87 , 14.3 , 17. , -34.7 , -34.7 ]])
 
 
Changing the reward for an illegal pickup/dropoff to -1 (the same as the reward for every other action):
array([[ 0. , 0. , 0. , 0. , 0. ,0. ],
[ 1.62261467, 2.9140163 , 1.62261467, 2.9140163 , 4.348907 ,2.9140163 ],
[ 4.348907 , 5.94323 , 4.348907 , 5.94323 , 7.7147 , 5.94323 ],
...,
[ 7.7147 , 9.683 , 7.7147 , 5.94323 , 7.71469999, 7.71469997],
[ 1.62261467, 2.9140163 , 1.62261467, 2.9140163 , 1.62261467, 1.62261467],
[14.3 , 11.87 , 14.3 , 17. , 14.3 ,14.3 ]])
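
The three tables differ only in the pickup/dropoff columns, and the differences can be checked by hand. An illegal pickup or dropoff leaves the state unchanged, so its converged value should be the penalty plus the discounted best value of the same row. The sketch below assumes a discount factor of 0.9, which is not stated in these notes but is inferred from the table (for example 14.3 = -1 + 0.9 * 17).

```python
GAMMA = 0.9  # assumption: inferred from the table, e.g. 14.3 = -1 + 0.9 * 17


def illegal_action_value(penalty, best_value_in_row, gamma=GAMMA):
    # The state does not change after an illegal pickup/dropoff, so the
    # "next state" value is the best value of the same row.
    return penalty + gamma * best_value_in_row


# Best values of the second row (4.348907) and the last row (17.0) above.
for penalty in (-10, -50, -1):
    second_row = illegal_action_value(penalty, 4.348907)
    last_row = illegal_action_value(penalty, 17.0)
    print(penalty, round(second_row, 7), round(last_row, 4))
```

The printed values can be compared against the dropoff entries of the second and last rows in each of the three tables.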
 
 
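When the reward for completing the task is changed to -20:
Even after 100000 training episodes (each episode also takes longer to train), every test run enters an endless loop.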
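
One plausible explanation is that, with a completion reward of -20, finishing the task is worth less than driving around forever. The sketch below compares the two discounted returns; it assumes a discount factor of 0.9 (inferred from the Q-tables above, not stated in these notes) and the -1 reward per ordinary step.

```python
GAMMA = 0.9          # assumption: inferred from the Q-tables, not stated
STEP_REWARD = -1     # reward for an ordinary move
FINISH_REWARD = -20  # modified reward for completing the task


def finish_return(actions_to_goal):
    # Discounted return when the final dropoff is the last of
    # `actions_to_goal` actions and every earlier action is a move.
    moves = sum(STEP_REWARD * GAMMA ** t for t in range(actions_to_goal - 1))
    return moves + FINISH_REWARD * GAMMA ** (actions_to_goal - 1)


# Discounted return of moving forever without ever finishing (about -10).
loop_forever = STEP_REWARD / (1 - GAMMA)

print(round(loop_forever, 6))
for k in (1, 5, 12):
    print(k, round(finish_return(k), 6))
```

Under these assumptions, every finishing plan scores below the endless loop, so a greedy policy learned from such rewards would have no incentive to drop the passenger off.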
 
 
## Changing the map
Original map

Results after 2500 training episodes and 100 test episodes:
Results After 100 episodes
Average timestep per Episode :12.21
Average Penalties per Episode : 0.0
- Average timestep per Episode generally falls between 12 and 13.
 
 
Map with the walls removed

Results after 2500 training episodes and 100 test episodes:
Results After 100 episodes
Average timestep per Episode :10.73
Average Penalties per Episode : 0.0
- Average timestep per Episode generally falls between 10 and 11.
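
The edited map itself is not reproduced in these notes. As a rough sketch of what removing the walls could look like, the snippet below starts from the standard Taxi map layout, where `|` between cells is a wall and `:` is a free passage, and replaces every interior wall with a passage. The helper name `remove_inner_walls` is hypothetical, and the exact map used for the results above may differ.

```python
# Standard Taxi map layout: '|' between cells is a wall, ':' is passable.
MAP = [
    "+---------+",
    "|R: | : :G|",
    "| : | : : |",
    "| : : : : |",
    "| | : | : |",
    "|Y| : |B: |",
    "+---------+",
]


def remove_inner_walls(rows):
    # Hypothetical helper: keep the outer border, open every interior wall.
    new_rows = []
    for row in rows:
        if row.startswith("+"):
            new_rows.append(row)  # top/bottom border stays as-is
        else:
            inner = row[1:-1].replace("|", ":")
            new_rows.append(row[0] + inner + row[-1])
    return new_rows


NO_WALL_MAP = remove_inner_walls(MAP)
print("\n".join(NO_WALL_MAP))
```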
 
 
More walls, v1

Results after 2500 training episodes and 100 test episodes:
Results After 100 episodes
Average timestep per Episode :13.25
Average Penalties per Episode : 0.0
- Average timestep per Episode generally falls between 13 and 14.
 
 
More walls, v2

Results after 2500 training episodes and 100 test episodes:
Results After 100 episodes
Average timestep per Episode :18.49
Average Penalties per Episode : 0.0
- Average timestep per Episode generally falls between 18 and 19.
- Testing may enter an endless loop.
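
Because a greedy test run can loop forever, one option is to cap the number of steps per test episode and count how many episodes hit the cap. The sketch below is a minimal version of such an evaluation loop, not the script that produced the numbers above. It assumes the older Gym interface, where `reset()` returns a state and `step()` returns `(state, reward, done, info)`, and it counts a reward of -10 as a penalty; `ToyCorridor` is a hypothetical stand-in environment used only to make the example runnable.

```python
def evaluate(env, q_table, episodes=100, max_steps=200):
    """Run greedy test episodes; return (avg steps, avg penalties, unfinished)."""
    total_steps = total_penalties = unfinished = 0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            row = q_table[state]
            action = max(range(len(row)), key=row.__getitem__)  # greedy action
            state, reward, done, _info = env.step(action)
            total_steps += 1
            if reward == -10:  # assumption: -10 marks an illegal pickup/dropoff
                total_penalties += 1
            if done:
                break
        else:
            unfinished += 1  # step cap reached: the policy was probably looping
    return total_steps / episodes, total_penalties / episodes, unfinished


class ToyCorridor:
    """Hypothetical 3-cell corridor; action 0 = stay, action 1 = move right."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos = min(self.pos + action, 2)
        done = self.pos == 2
        return self.pos, (20 if done else -1), done, {}


good_q = [[0, 1], [0, 1], [0, 0]]   # always prefers "move right"
stuck_q = [[1, 0], [1, 0], [0, 0]]  # always prefers "stay", so it never finishes

print(evaluate(ToyCorridor(), good_q, episodes=5, max_steps=10))
print(evaluate(ToyCorridor(), stuck_q, episodes=5, max_steps=10))
```

With a real Taxi environment and a trained `q_table`, the same function could be called as `evaluate(env, q_table)`; a non-zero third value would indicate test episodes that never reached the destination.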