Reinforcement Learning
===
https://colab.research.google.com/drive/1GkOWHv4kDySuxWQcEHsEUkk_lsd6IWM5?usp=sharing

Type:
---
- **Model-free** or **Model-based**
- **Value-based** or **Policy-based** (probability)
- **Monte-Carlo update** or **Temporal-Difference update**
- **On-policy** or **Off-policy**

> **Deep Q-Learning**:
> - Model-free
> - Value-based
> - Off-policy

Trained by:
---
- Model-free:
    - Value-based:
        - Q-Learning
        - Sarsa
        - Deep Q-Network
    - Policy-based:
        - Policy Gradients
- Model-based:
    - Virtual environment:
        - Model-based RL
- Monte-Carlo update:
    - Monte-Carlo learning
    - Policy Gradients (basic)
- Temporal-Difference update:
    - Q-Learning
    - Sarsa
    - Policy Gradients (advanced)
- On-policy:
    - Sarsa-Lambda
- Off-policy:
    - Q-Learning
    - Deep Q-Network

**Algorithms**
---
References:
- https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/intro-q-learning/
- Lecture of KT

[**Q-Learning**](/5umvBvHiSqqwcFS3Ts9G_w) (HackMD link)
[**Sarsa** (& Sarsa-Lambda)](/KEzTGrblRcC2htol3yB_jw) (HackMD link)
[**Deep Q-Learning**](/oPdQtEeSQYutWiNQxnVaCg) (HackMD link)
[Policy Gradients](/3GLu_EcQTZKYTdfPyYmfhA)
[Actor-Critic](/TOqnE2I6TGGX8IpAi_esYQ)
[**DDPG**](/RaZ_DwMSR2GTl-O86U3gHQ) (Deep Deterministic Policy Gradient)
A3C (Asynchronous Advantage Actor-Critic)
DPPO (Distributed Proximal Policy Optimization)

Cheat Sheet:
---
- Q-Learning
    - initial:
        - actions list
        - learning_rate
        - reward_decay rate
        - epsilon_greedy rate
        - q_table
- Sarsa

> **Q-Learning** vs **Sarsa** (see the code sketch at the end of this note):
> ![](https://i.imgur.com/TGjsZP9.png)

Reddit
---
If the environment is expensive to sample from, use **DDPG** or **SAC**, since they're more sample-efficient. If it's cheap to sample from, use **PPO** or a **REINFORCE-based** algorithm, since they're straightforward to implement, robust to hyperparameters, and easy to get working. You'll spend less wall-clock time training a PPO-like algorithm in a cheap environment.

If you need to decide between DDPG and SAC, choose **TD3**. The performance of SAC and DDPG is nearly identical when you compare on the basis of whether or not a twin delayed update is used. SAC can be troublesome to get working, and its temperature parameter controls the stochasticity of the final policy -- effectively, your reward scaling can give you a policy that is too random to be useful, and picking a temperature parameter isn't necessarily straightforward. TD3 is almost the same as SAC, but noise injection is often easier to visualize and tune than setting the right temperature parameter.
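
Below is a minimal sketch of a tabular Q-Learning agent built around the initialization listed in the cheat sheet (actions list, learning_rate, reward_decay rate, epsilon_greedy rate, q_table). The class and method names are illustrative assumptions, not taken from the linked notes or the Colab.

```python
# Minimal tabular Q-Learning sketch (assumed class/method names, not from the linked notes).
import numpy as np
import pandas as pd


class QLearningAgent:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions            # list of available action labels
        self.lr = learning_rate           # step size of the TD update
        self.gamma = reward_decay         # discount factor for future reward
        self.epsilon = e_greedy           # probability of acting greedily
        # Q-table: rows are states, columns are actions, values are Q(s, a)
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def check_state_exists(self, state):
        # Lazily add unseen states as zero-valued rows.
        if state not in self.q_table.index:
            self.q_table.loc[state] = [0.0] * len(self.actions)

    def choose_action(self, state):
        # Epsilon-greedy: exploit with probability epsilon, otherwise explore.
        self.check_state_exists(state)
        if np.random.uniform() < self.epsilon:
            q_values = self.q_table.loc[state]
            # Break ties between equally good actions at random.
            return np.random.choice(q_values[q_values == q_values.max()].index)
        return np.random.choice(self.actions)

    def learn(self, s, a, r, s_, done):
        # Off-policy TD update:
        # Q(s, a) <- Q(s, a) + lr * (r + gamma * max_a' Q(s', a') - Q(s, a))
        self.check_state_exists(s)
        self.check_state_exists(s_)
        q_predict = self.q_table.loc[s, a]
        q_target = r if done else r + self.gamma * self.q_table.loc[s_].max()
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)
```

This also illustrates the Q-Learning vs Sarsa comparison above: Sarsa's update differs only in the target, using Q(s', a') for the next action actually chosen by the same epsilon-greedy policy instead of max over a', which is what makes Sarsa on-policy and Q-Learning off-policy.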