# An introduction to Q-Learning: reinforcement learning (Part 2)
[toc]
 
 
## [Q-Learning to solve the Taxi-v2 Environment](https://github.com/ADLsourceCode/Reinforcement-Learning/blob/master/Q_learning/Implemetation.ipynb)
[Taxi-v2](https://gym.openai.com/envs/Taxi-v2/)
 
 

 
 
1. The task is to pick up the passenger at the right location and drop them off at the right destination.
2. The taxi moves one grid cell at a time and cannot drive through walls.
3. Every move costs 1 point (reward of -1), and an illegal "pickup" or "dropoff" costs 10 points (reward of -10); see the probe sketched just after this list.
4. Successfully completing the task earns 20 points (reward of +20).
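
Here is a minimal probe of that reward scheme (assuming only that `gym` with the Taxi-v2 environment is installed): reset the environment and immediately attempt a `pickup`, which is illegal unless the taxi happens to spawn on the passenger's square.

```python=
import gym

env = gym.make('Taxi-v2').env
env.reset()

# Action 4 is "pickup"; from a random start the taxi is almost never
# on the passenger's square, so this usually yields a reward of -10.
state, reward, done, info = env.step(4)
print(reward)
```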
 
 
### Implementing Q-learning in Python
 
```python=
import gym
import numpy as np
import random
from IPython.display import clear_output
env = gym.make('Taxi-v2').env # .env unwraps the TimeLimit wrapper so episodes are not cut off at 200 steps
env.render()
```
 

 
```python=
env.reset() # reset environment to a new, random state
env.render()
# Number of possible actions
print('Action Space {}'.format(env.action_space))
# Number of possible states
print('State Space {}'.format(env.observation_space))
```
 

 
 
Action: south, north, east, west, pickup, dropoff
State: 5 × 5 × 5 × 4 = 500 (the taxi can occupy any of the 5 × 5 grid cells, the passenger can be in one of 5 places: the four depots or inside the taxi, and the destination is one of 4 depots)
 
```python=
state = env.encode(4, 1, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)
env.s = state
env.render()
```
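
The value 428 is not arbitrary: the environment's `encode` method (see the taxi.py source in the references) packs the four components into a single integer in row-major order, which we can reproduce by hand:

```python=
# Mirrors encode() in gym's taxi.py:
# ((taxi_row * 5 + taxi_col) * 5 + passenger_index) * 4 + destination_index
state = ((4 * 5 + 1) * 5 + 2) * 4 + 0
print(state)  # 428
```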
 

The blue letter (Y) marks the passenger's pickup location.
The magenta letter (R) marks the destination.
 
```python=
env.P[428]
```
 

`env.P[state]` maps each action to a list of transition tuples `(probability, next_state, reward, done)`:
- 0: south
- 1: north
- 2: east
- 3: west
- 4: pickup
- 5: dropoff
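
To read this table more easily, the small loop below (a sketch, assuming the `env` created earlier) prints each action's name next to its transition; Taxi transitions are deterministic, so each list holds exactly one tuple:

```python=
action_names = ['south', 'north', 'east', 'west', 'pickup', 'dropoff']

for action, transitions in env.P[428].items():
    # transitions is a list with one (probability, next_state, reward, done) tuple
    print(action_names[action], transitions[0])
```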
 
#### Training the Agent
 
```python=
# Initialize the Q-table with zeros: one row per state, one column per action
q_table = np.zeros([env.observation_space.n, env.action_space.n])
q_table
```
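
Since Taxi-v2 has 500 states and 6 actions, the table should come out as a 500 × 6 array:

```python=
print(q_table.shape)  # (500, 6)
```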
 

 
```python=
# Hyperparameters
gamma = 0.9    # discount rate
alpha = 0.1    # learning rate
epsilon = 0.1  # exploration rate
max_epsilon = 1.0   # exploration probability at the start
min_epsilon = 0.01  # minimum exploration probability
decay_rate = 0.01   # exponential decay rate for the exploration probability

# For plotting metrics
all_epochs = []
all_penalties = []
```
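
The training loop below repeatedly applies the Q-learning update rule: for each observed transition $(s, a, r, s')$, the table entry is nudged toward the bootstrapped target,

$$Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]$$

so with `alpha = 0.1` only a tenth of the temporal-difference error is absorbed per step, and `gamma = 0.9` discounts future rewards.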
```python=
for i in range(1, 100001):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore the action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        # Take the action and observe the outcome state and reward
        next_state, reward, done, info = env.step(action)
        next_max = np.max(q_table[next_state])
        q_table[state, action] = q_table[state, action] + alpha * (reward + gamma * next_max - q_table[state, action])

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    # Reduce epsilon (we need less and less exploration as training progresses)
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * i)

    all_epochs.append(epochs)
    all_penalties.append(penalties)

    if i % 100 == 0:
        clear_output(wait=True)
        print('Episode: {}'.format(i))

print('Training Finished..')
```
Episode: 100000
Training Finished..
 
 
Now that the Q-table has been built over 100,000 training episodes, let's inspect the Q-values at the state from our illustration (state 428):
 
```python=
# Preview the learned Q-values for state 428
q_table[428]
```
 
array([-0.58568212,  0.4603532 , -1.52711391, -0.58568212, -9.58568212, -9.58568212])
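
The largest value, about 0.46, sits at index 1, so the greedy policy from state 428 is to drive north:

```python=
print(np.argmax(q_table[428]))  # 1, i.e. north
```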
 

 
 
#### Evaluating the agent
 
```python=
total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        env.render()
        action = np.argmax(q_table[state])  # always act greedily, no exploration
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1
        epochs += 1

    total_epochs += epochs
    total_penalties += penalties

print('Results after {} episodes'.format(episodes))
print('Average timesteps per episode: {}'.format(total_epochs / episodes))
print('Average penalties per episode: {}'.format(total_penalties / episodes))
```
 
Results after 100 episodes
Average timesteps per episode: 12.74
Average penalties per episode: 0.0
 
 
 
References
https://github.com/ADLsourceCode/Reinforcement-Learning/blob/master/Q_learning/Implemetation.ipynb
https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/
https://saksham-jain.com/p6.html
https://github.com/openai/gym/blob/master/gym/envs/toy_text/taxi.py
https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/