# Taxi-v2

[TOC]

## How the Q-table differs with the number of training episodes

- After about 2,500 training episodes, the [Q-table](https://hackmd.io/QOHobDNpSWOR264loGjv3g) no longer changes.
- A Q-table obtained from fewer than 900 training episodes may enter an endless loop during testing.

Situation encountered during testing after 800 training episodes:

![](https://i.imgur.com/Wy46OXB.png)

```python=
state = env.encode(4, 0, 1, 2)  # (taxi row, taxi column, passenger index, destination index)
print("State : ", state)
```

    State :  406

```python=
q_table[406]
```

    array([ -6.11187636,  -6.15386317,  -6.1232991 ,  -6.04170499,
           -14.82243346, -15.14841173])

- 0: south
- 1: north
- 2: east
- 3: west
- 4: pickup
- 5: dropoff

The largest value here belongs to action 3 (west). The taxi is already in column 0, so it cannot move any further west; the greedy action would leave the state unchanged, which is consistent with the endless loop observed.

## Changing the reward

Original Q-table_2500 (the reward for an incorrect pickup or dropoff is -10):

    array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
           [ 1.62261467,  2.9140163 ,  1.62261467,  2.9140163 ,  4.348907  , -6.0859837 ],
           [ 4.348907  ,  5.94323   ,  4.348907  ,  5.94323   ,  7.7147    , -3.05677   ],
           ...,
           [ 7.7147    ,  9.683     ,  7.71469995,  5.94322993, -1.28530002, -1.28530008],
           [ 1.62261467,  2.9140163 ,  1.62261467,  2.9140163 , -7.37738533, -7.37738533],
           [14.3       , 11.87      , 14.3       , 17.        ,  5.3       ,  5.3       ]])

Q-table_2500 with the reward for an incorrect pickup or dropoff changed to -50:

    array([[  0.        ,   0.        ,   0.        ,   0.        ,   0.        ,   0.        ],
           [  1.62261467,   2.9140163 ,   1.62261467,   2.9140163 ,   4.348907  , -46.0859837 ],
           [  4.348907  ,   5.94323   ,   4.348907  ,   5.94323   ,   7.7147    , -43.05677   ],
           ...,
           [  7.7147    ,   9.683     ,   7.7147    ,   5.94323   , -41.2853    , -41.2853    ],
           [  1.62261467,   2.9140163 ,   1.62261467,   2.9140163 , -47.37738533, -47.37738533],
           [ 14.3       ,  11.87      ,  14.3       ,  17.        , -34.7       , -34.7       ]])

Q-table with the reward for an incorrect pickup or dropoff changed to -1 (the same as the reward for every other action):

    array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
           [ 1.62261467,  2.9140163 ,  1.62261467,  2.9140163 ,  4.348907  ,  2.9140163 ],
           [ 4.348907  ,  5.94323   ,  4.348907  ,  5.94323   ,  7.7147    ,  5.94323   ],
           ...,
           [ 7.7147    ,  9.683     ,  7.7147    ,  5.94323   ,  7.71469999,  7.71469997],
           [ 1.62261467,  2.9140163 ,  1.62261467,  2.9140163 ,  1.62261467,  1.62261467],
           [14.3       , 11.87      , 14.3       , 17.        , 14.3       , 14.3       ]])

When the reward for completing the task is changed to -20, every test run enters an endless loop, even after 100,000 training episodes (each training episode also takes longer).

## Changing the map

Original map

![](https://i.imgur.com/03dbrVK.png)

Result of 2,500 training episodes and 100 test episodes:

    Results After 100 episodes
    Average timestep per Episode : 12.21
    Average Penalties per Episode : 0.0

- Average timestep per Episode generally falls between 12 and 13.

Walls removed from the map

![](https://i.imgur.com/sSYLA9o.png)

Result of 2,500 training episodes and 100 test episodes:

    Results After 100 episodes
    Average timestep per Episode : 10.73
    Average Penalties per Episode : 0.0

- Average timestep per Episode generally falls between 10 and 11.

More walls, v1

![](https://i.imgur.com/c3q7HNv.png)

Result of 2,500 training episodes and 100 test episodes:

    Results After 100 episodes
    Average timestep per Episode : 13.25
    Average Penalties per Episode : 0.0

- Average timestep per Episode generally falls between 13 and 14.

More walls, v2

![](https://i.imgur.com/KOrA8qt.png)

Result of 2,500 training episodes and 100 test episodes:

    Results After 100 episodes
    Average timestep per Episode : 18.49
    Average Penalties per Episode : 0.0

- Average timestep per Episode generally falls between 18 and 19.
- Testing may still enter an endless loop.
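
Because several of the experiments above can end in an endless loop during testing, it may help to guard the test run. The following is a minimal sketch, not the code used for these notes: `run_greedy_episode`, `corridor_step`, and the three-state corridor are hypothetical stand-ins so the example runs without the Taxi environment. It assumes the environment is deterministic, so that revisiting a state under a greedy policy means the episode can never finish.

```python
import numpy as np


def run_greedy_episode(q_table, step_fn, start_state, max_steps=200):
    """Follow the greedy policy until the episode ends, a state repeats, or the cap is hit.

    step_fn(state, action) -> (next_state, done) is a hypothetical stand-in
    for the environment's step function.
    Returns (steps_taken, status) with status "done", "loop", or "timeout".
    """
    visited = {start_state}
    state = start_state
    for step in range(1, max_steps + 1):
        action = int(np.argmax(q_table[state]))  # greedy action for this state
        state, done = step_fn(state, action)
        if done:
            return step, "done"
        if state in visited:
            # Deterministic policy + deterministic environment: a repeated
            # state means the same actions will be replayed forever.
            return step, "loop"
        visited.add(state)
    return max_steps, "timeout"


def corridor_step(state, action):
    """Toy 3-state corridor (assumption): action 1 moves right, anything else left."""
    next_state = max(0, min(2, state + (1 if action == 1 else -1)))
    return next_state, next_state == 2  # state 2 is the goal


# A Q-table that always prefers "right", and one that bounces between states 0 and 1.
good_q = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
bad_q = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])

good_result = run_greedy_episode(good_q, corridor_step, 0)
bad_result = run_greedy_episode(bad_q, corridor_step, 0)
```

In this toy case the first table reaches the goal in two steps, while the second is flagged as a loop after two steps instead of running forever. With a real environment, `step_fn` would wrap the environment's own step call, and looping episodes could be counted separately rather than included in the average timestep.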