# An introduction to Q-Learning: reinforcement learning (Part 2)
[toc]
 
 
## [Q-Learning to solve the Taxi-v2 Environment](https://github.com/ADLsourceCode/Reinforcement-Learning/blob/master/Q_learning/Implemetation.ipynb)
[Taxi-v2](https://gym.openai.com/envs/Taxi-v2/)
 
 

 
 
1. The task is to pick up the passenger at the right location and drop them off at the right destination.
2. The taxi moves one grid cell at a time and cannot drive through walls.
3. Every move costs 1 point (reward of -1), and an illegal "pickup" or "dropoff" costs 10 points (reward of -10); see the probe sketched just after this list.
4. Successfully completing the task earns 20 points (reward of +20).
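
Here is a minimal probe of that reward scheme (assuming only that `gym` with the Taxi-v2 environment is installed): reset the environment and immediately attempt a `pickup`, which is illegal unless the taxi happens to spawn on the passenger's square.

```python=
import gym

env = gym.make('Taxi-v2').env
env.reset()

# Action 4 is "pickup"; from a random start the taxi is almost never
# on the passenger's square, so this usually yields a reward of -10.
state, reward, done, info = env.step(4)
print(reward)
```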
 
 
### Implementing Q-learning in Python
 
```python=
import gym
import numpy as np
import random
from IPython.display import clear_output
env = gym.make('Taxi-v2').env # .env unwraps the TimeLimit wrapper so episodes are not cut off at 200 steps
env.render()
```
 

 
```python=
env.reset() # reset environment to a new, random state
env.render()
# Number of possible actions
print('Action Space {}'.format(env.action_space))
# Number of possible states
print('State Space {}'.format(env.observation_space))
```
 

 
 
Action: south, north, east, west, pickup, dropoff
State: 5 × 5 × 5 × 4 = 500 (the taxi can occupy any of the 5 × 5 grid cells, the passenger can be in one of 5 places: the four depots or inside the taxi, and the destination is one of 4 depots)
 
```python=
state = env.encode(4, 1, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)
env.s = state
env.render()
```
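
The value 428 is not arbitrary: the environment's `encode` method (see the taxi.py source in the references) packs the four components into a single integer in row-major order, which we can reproduce by hand:

```python=
# Mirrors encode() in gym's taxi.py:
# ((taxi_row * 5 + taxi_col) * 5 + passenger_index) * 4 + destination_index
state = ((4 * 5 + 1) * 5 + 2) * 4 + 0
print(state)  # 428
```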
 

The blue letter (Y) marks the passenger's pickup location.
The magenta letter (R) marks the destination.
 
```python=
env.P[428]
```
 

`env.P[state]` maps each action to a list of transition tuples `(probability, next_state, reward, done)`:
- 0: south
- 1: north
- 2: east
- 3: west
- 4: pickup
- 5: dropoff
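
To read this table more easily, the small loop below (a sketch, assuming the `env` created earlier) prints each action's name next to its transition; Taxi transitions are deterministic, so each list holds exactly one tuple:

```python=
action_names = ['south', 'north', 'east', 'west', 'pickup', 'dropoff']

for action, transitions in env.P[428].items():
    # transitions is a list with one (probability, next_state, reward, done) tuple
    print(action_names[action], transitions[0])
```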
 
#### Training the Agent
 
```python=
# Initialize the Q-table with zeros: one row per state, one column per action
q_table = np.zeros([env.observation_space.n, env.action_space.n])
q_table
```
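
Since Taxi-v2 has 500 states and 6 actions, the table should come out as a 500 × 6 array:

```python=
print(q_table.shape)  # (500, 6)
```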
 

 
```python=
# Hyperparameters
gamma = 0.9    # discount rate
alpha = 0.1    # learning rate
epsilon = 0.1  # exploration rate
max_epsilon = 1.0   # exploration probability at the start
min_epsilon = 0.01  # minimum exploration probability
decay_rate = 0.01   # exponential decay rate for the exploration probability

# For plotting metrics
all_epochs = []
all_penalties = []
```
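
The training loop below repeatedly applies the Q-learning update rule: for each observed transition $(s, a, r, s')$, the table entry is nudged toward the bootstrapped target,

$$Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]$$

so with `alpha = 0.1` only a tenth of the temporal-difference error is absorbed per step, and `gamma = 0.9` discounts future rewards.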
```python=
for i in range(1, 100001):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore the action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        # Take the action and observe the outcome state and reward
        next_state, reward, done, info = env.step(action)
        next_max = np.max(q_table[next_state])
        q_table[state, action] = q_table[state, action] + alpha * (reward + gamma * next_max - q_table[state, action])

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    # Reduce epsilon (we need less and less exploration as training progresses)
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * i)

    all_epochs.append(epochs)
    all_penalties.append(penalties)

    if i % 100 == 0:
        clear_output(wait=True)
        print('Episode: {}'.format(i))

print('Training Finished..')
```
Episode: 100000
Training Finished..
 
 
Now that the Q-table has been built over 100,000 training episodes, let's inspect the Q-values at the state from our illustration (state 428):
 
```python=
# Preview the learned Q-values for state 428
q_table[428]
```
 
array([-0.58568212,  0.4603532 , -1.52711391, -0.58568212, -9.58568212, -9.58568212])
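
The largest value, about 0.46, sits at index 1, so the greedy policy from state 428 is to drive north:

```python=
print(np.argmax(q_table[428]))  # 1, i.e. north
```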
 

 
 
#### Evaluating the agent
 
```python=
total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        env.render()
        action = np.argmax(q_table[state])  # always act greedily, no exploration
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1
        epochs += 1

    total_epochs += epochs
    total_penalties += penalties

print('Results after {} episodes'.format(episodes))
print('Average timesteps per episode: {}'.format(total_epochs / episodes))
print('Average penalties per episode: {}'.format(total_penalties / episodes))
```
 
Results after 100 episodes
Average timesteps per episode: 12.74
Average penalties per episode: 0.0
 
 
 
References
https://github.com/ADLsourceCode/Reinforcement-Learning/blob/master/Q_learning/Implemetation.ipynb
https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/
https://saksham-jain.com/p6.html
https://github.com/openai/gym/blob/master/gym/envs/toy_text/taxi.py
https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/