Reinforcement Learning
===
https://colab.research.google.com/drive/1GkOWHv4kDySuxWQcEHsEUkk_lsd6IWM5?usp=sharing
Type:
---
- **Model-free** or **Model-based**
- **Value-Based** or **Policy-Based** (outputs action probabilities)
- **Monte-Carlo update** or **Temporal-Difference update**
- **On-Policy** or **Off-Policy**
> **Deep-Q-Learning**:
> - Model-free
> - Value-Based
> - Off-Policy
Grouped by training approach:
---
- Model-free:
  - Value-Based:
    - Q-Learning
    - Sarsa
    - Deep-Q Network
  - Policy-Based:
    - Policy Gradients
- Model-based:
  - Virtual Environment:
    - Model-based RL
- Monte-Carlo update (see the update-rule sketch after this list):
  - Monte-Carlo learning
  - Policy Gradients (basic)
- Temporal-Difference update:
  - Q-Learning
  - Sarsa
  - Policy Gradients (advanced)
- On-Policy:
  - Sarsa (& Sarsa-Lambda)
- Off-Policy:
  - Q-Learning
  - Deep-Q Network
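A minimal sketch of the Monte-Carlo vs. Temporal-Difference update styles listed above, for a plain state-value table; the `V`, `alpha`, and `gamma` names here are illustrative assumptions, not code from the linked tutorials.

```python
import numpy as np

alpha, gamma = 0.1, 0.9   # learning rate and reward-decay rate (illustrative values)
V = np.zeros(10)          # value table for 10 states

# Monte-Carlo update: wait until the episode ends, then use the full return G.
def mc_update(episode):
    """episode: list of (state, reward) pairs from start to terminal state."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G               # accumulate the discounted return
        V[state] += alpha * (G - V[state])   # move V(s) toward the actual return

# Temporal-Difference update: bootstrap from V(s') after every single step.
def td_update(state, reward, next_state):
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
```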
**Algorithms**
---
References:
- https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/intro-q-learning/
- KT's lecture
[**Q Learning**](/5umvBvHiSqqwcFS3Ts9G_w) (HackMD Link)
[**Sarsa** (& Sarsa-Lambda)](/KEzTGrblRcC2htol3yB_jw) (HackMD Link)
[**Deep-Q Learning**](/oPdQtEeSQYutWiNQxnVaCg) (HackMD Link)
[**Policy Gradients**](/3GLu_EcQTZKYTdfPyYmfhA) (HackMD Link)
[**Actor Critic**](/TOqnE2I6TGGX8IpAi_esYQ) (HackMD Link)
[**DDPG**](/RaZ_DwMSR2GTl-O86U3gHQ) (Deep Deterministic Policy Gradient) (HackMD Link)
A3C (Asynchronous Advantage Actor-Critic)
DPPO (Distributed Proximal Policy Optimization)
Cheat Sheet:
---
- Q-Learning
  - initialize with (see the sketch after this list):
    - actions list
    - learning_rate
    - reward_decay rate
    - epsilon_greedy rate
    - q_table
- Sarsa
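A minimal sketch of that initialization (plus an epsilon-greedy action choice), loosely following the parameter names from the mofanpy tutorial linked above; the `QLearningTable` class and `choose_action` method here are illustrative, not a specific library's API.

```python
import numpy as np
import pandas as pd

class QLearningTable:
    """Tabular Q-Learning agent; parameter names follow the cheat sheet above."""
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions                    # list of available action indices
        self.lr = learning_rate                   # learning rate (alpha)
        self.gamma = reward_decay                 # reward-decay rate (gamma)
        self.epsilon = e_greedy                   # epsilon-greedy exploitation rate
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def choose_action(self, state):
        self._ensure_state_exists(state)
        if np.random.uniform() < self.epsilon:    # exploit: pick a best-known action
            state_q = self.q_table.loc[state, :]
            return np.random.choice(state_q[state_q == state_q.max()].index)
        return np.random.choice(self.actions)     # explore: pick a random action

    def _ensure_state_exists(self, state):
        if state not in self.q_table.index:
            self.q_table.loc[state] = [0.0] * len(self.actions)
```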
> **Q-Learning** vs **Sarsa**:
> Q-Learning is **off-policy**: its update bootstraps from the greedy (max) Q-value of the next state, regardless of which action is actually taken next. Sarsa is **on-policy**: its update bootstraps from the Q-value of the next action the current policy actually chooses, which makes it more conservative while exploring (see the sketch below).
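The actual difference is a single line in how the TD target is formed, sketched below with the same illustrative `q_table` / `lr` / `gamma` naming as the cheat-sheet sketch above.

```python
def q_learning_target(q_table, reward, next_state, gamma=0.9):
    # Off-policy: bootstrap from the best action available in the next state,
    # regardless of which action the behaviour policy will actually take.
    return reward + gamma * q_table.loc[next_state, :].max()

def sarsa_target(q_table, reward, next_state, next_action, gamma=0.9):
    # On-policy: bootstrap from the action the current policy really chose.
    return reward + gamma * q_table.loc[next_state, next_action]

def td_step(q_table, state, action, target, lr=0.01):
    # Both algorithms then share the same TD update once the target is formed.
    q_table.loc[state, action] += lr * (target - q_table.loc[state, action])
```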
Reddit
---
If the environment is expensive to sample from, use **DDPG** or **SAC**, since they're more sample-efficient. If it's cheap to sample from, use **PPO** or a **REINFORCE-based** algorithm, since they're straightforward to implement, robust to hyperparameters, and easy to get working. You'll spend less wall-clock time training a PPO-like algorithm in a cheap environment.
If you need to decide between DDPG and SAC, choose **TD3**. The performance of SAC and DDPG is nearly identical once you account for whether or not a twin delayed update is used. SAC can be troublesome to get working: its temperature parameter controls the stochasticity of the final policy, so your reward scale can leave you with a policy that is too random to be useful, and picking a good temperature isn't necessarily straightforward. TD3 is almost the same as SAC, but its noise injection is often easier to visualize and tune than setting the right temperature.
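Since the paragraph above leans on TD3's twin delayed update and noise injection, here is a minimal sketch of how TD3 builds its Bellman target; the function signature and the `noise_std` / `noise_clip` / `act_limit` defaults are generic assumptions, not any particular library's API.

```python
import numpy as np

def td3_target(reward, next_state, done,
               target_actor, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Compute the TD3 Bellman target for one transition (illustrative sketch)."""
    # Noise injection: perturb the target policy's action with clipped Gaussian noise.
    noise = np.clip(np.random.normal(0.0, noise_std), -noise_clip, noise_clip)
    next_action = np.clip(target_actor(next_state) + noise, -act_limit, act_limit)

    # Twin critics: take the minimum of the two target Q-values to curb overestimation.
    q_next = min(target_q1(next_state, next_action), target_q2(next_state, next_action))

    # Standard one-step Bellman target; no bootstrapping past terminal states.
    return reward + gamma * (1.0 - float(done)) * q_next
```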