# LunarLander V2 🌑
> Can we create and train a model to be successful at **LunarLander V2**?
## Background Research 📚
### A brief look at the history 📜
Reinforcement Learning (RL) is a type of machine learning that focuses on training agents to make decisions in an environment by maximizing a reward signal. The roots of RL actually stem all the way back to the 1930s and 40s, when Skinner presented his experimental research on the behaviour of animals. He described the concept of "operant conditioning", which involved manipulating the consequences of an animal's behaviour in order to change the likelihood that the behaviour would occur in the future. (Skinner, 1991)

<em>Skinner in his lab, 1974 (Image Credits: https://braintour.harvard.edu/)</em>
The next big step was Markov Decision Processes (MDPs), first introduced by Richard Bellman in the 1950s (Bellman, 1957). The key idea behind MDPs is to have **an agent** (the learner and decision maker) act in **an environment** (the space within which the agent operates). These two entities interact continually: the environment provides some form of reinforcement, and the agent performs actions and learns from the environment's feedback. More details are discussed under the `The core of reinforcement learning 🦄` section below.
From this, the field grew larger, with **one of the key early algorithms** in RL being **Q-Learning**, first proposed by Christopher Watkins in 1989. **Q-Learning** is a model-free algorithm that estimates the optimal action-value function using experience. Essentially, there exists a lookup table that determines the next best move. For instance, imagine one is in a maze and the goal is to get out of it. A pre-computed lookup table would contain, for each position in the maze, the action to take. A minimal sketch of this idea is given after the figure below.
<img width=300 src="https://i.imgur.com/ULAtuJM.png" />
*Maze (Image Credit: Mid Journey)*
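As a minimal sketch of the tabular Q-learning update (the state/action counts, learning rate, and discount factor below are illustrative placeholders, e.g. the cells of a small maze and four compass moves, not values from any particular implementation):

```python
import numpy as np

num_states, num_actions = 25, 4    # e.g. a 5x5 maze with four compass moves
alpha, gamma = 0.1, 0.99           # learning rate and discount factor

# The "lookup table" of action-value estimates Q(s, a).
Q = np.zeros((num_states, num_actions))

def q_update(state, action, reward, next_state):
    """One Q-learning step: move Q(s, a) towards the bootstrapped target."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

def greedy_action(state):
    """Read the table: pick the action with the highest estimated value."""
    return int(np.argmax(Q[state]))
```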
### The core of reinforcement learning 🦄
> Any goal can be formalized as the outcome of maximizing a cumulative reward - Hado van Hasselt, DeepMind.com
Here we talk about the different types of RL algorithms, such as Q-learning, policy-based, and actor-critic methods, and the advantages and disadvantages of each.
Each of the algorithms revolve around an agent that plays in an environment. There are a few different types of components that an agent can contain. These are:
- Agent state
- Policy
- Value function estimate (optional)
- Model (optional)

<em>Transitions of components over time. State refers to 'agent state'. (Image Credits: deepmind.com)</em>
<u>Agent state</u>. As we see in the figure above, the agent state is essentially what gets carried over to time step $t + 1$ if we are currently at time $t$. An analogy could be that of a basket of knowledge: for example, we may want to store some information, such as the past few observations, in this agent state. Using the agent state as a kind of knowledge base, the agent's policy then decides which action to perform. By convention, the state of an agent is denoted $S_t$ where $t$ is the time step.
Here are a few possible examples that the agent state can take on:
$$S_t = O_t \\ S_t = H_t \\ S_t = u(S_{t - 1}, A_{t - 1}, R_t, O_t) $$
(Sutton and Barto, 2018)
The first equation means that solely the current observation $O_t$ at time $t$ is used as the agent state. However, it is very likely that this is insufficient information, as the observations that are received only give a partial truth. This will then very likely result in a **non-Markovian agent state**, where the agent **finds itself in seemingly identical** situations that are, in truth, different. A formal definition of the Markov property is:
> A decision process is Markov if $$p(r, s \mid S_t, A_t) = p(r, s \mid H_t, A_t) $$
> (DeepMind, 2021)
Essentially, this means that the probability of a reward $r$ and the corresponding agent state $s$ does not change if more information, specifically the <i>history $H_t$</i>, is added. In essence, adding any more information on top of $S_t$ results in no improvement.
The second equation means that the agent state **captures the entire history**, where $H_t$ represents the historical sequence of observations, actions and rewards from $t = 0$ up to $t$. Specifically, $$H_t = O_0, A_0, R_1, O_1, ..., O_{t-1}, A_{t-1}, R_{t}, O_{t} $$
where $R_t$ represents the reward attained at time step $t$. The problem with this approach is that, although it is ideal to have as much information as possible inside the agent state, the full history is clearly **too much** to store.
Lastly, the third equation is a more generalized approach, where $u$ acts as a sort of **compression function** that takes the previous state, the previous action, the current reward and the current observation, and compresses them into a new agent state.
In practical cases, it is **not necessary to enforce a perfectly Markovian agent state**, as in real-world situations this would be **too difficult and complex**. Rather, it is much easier, and more common in practice, to settle for a state representation that is good enough.
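Purely as an illustration, one very simple choice of the update function $u$ is to keep only the $K$ most recent observations; the sketch below accepts the previous action and reward to match the general signature $u(S_{t-1}, A_{t-1}, R_t, O_t)$ but does not use them:

```python
K = 4  # how many recent observations to keep in the agent state

def u(prev_state, prev_action, reward, observation):
    """Illustrative state update: append the new observation, keep the last K."""
    return (tuple(prev_state) + (observation,))[-K:]

# Example: starting from an empty state and observing 1..5 in turn.
state = ()
for obs in [1, 2, 3, 4, 5]:
    state = u(state, prev_action=None, reward=0.0, observation=obs)
print(state)   # (2, 3, 4, 5) -- only the K most recent observations survive
```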
<u>Policy</u>. A policy is the mapping from states to actions, which ultimately **defines how the agent acts**. There are two types of policies, **deterministic** and **stochastic**. A deterministic policy is similar to supervised learning in that each state maps to exactly one action. Suppose our agent is in some environment; we say that the policy is deterministic if the action $A$ that the agent will take is defined as
$$A = \pi_{\text{det}}(S)$$ where $\pi$ denotes the policy function and $S$ is the current agent state. On the contrary, a **stochastic process** is defined as
$$\pi_{\text{sto}}(A \mid S) = p(A \mid S) $$ which tells us that **given the state**, there is a probability distribution describing how likely each action $A$ is to be chosen. The stochastic form is the more general case; the **deterministic policy** can be recovered from it by taking an $\text{argmax}$ over the probability distribution across all possible actions:
$$\pi_{\text{det}}(S) = \text{argmax}_A\ \pi_{\text{sto}}(A \mid S)$$
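As a small illustration (the states, actions, and probabilities below are made up), a stochastic policy can be represented as a table of per-state action probabilities, with the deterministic policy recovered as the argmax over each row:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stochastic policy: rows are states, columns are actions,
# and each row is a probability distribution over the actions.
policy_probs = np.array([
    [0.7, 0.2, 0.1],   # state 0
    [0.1, 0.1, 0.8],   # state 1
])

def stochastic_policy(state):
    """Sample an action A ~ pi_sto(A | S)."""
    return int(rng.choice(policy_probs.shape[1], p=policy_probs[state]))

def deterministic_policy(state):
    """Deterministic special case: argmax_A pi_sto(A | S)."""
    return int(np.argmax(policy_probs[state]))

print(stochastic_policy(0), deterministic_policy(0))   # sampled action vs argmax action
```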
<u>Value function estimate</u>. To understand the concept and reasoning behind the value function, we first need to go over the topic of **return**. **Return** is defined as the **cumulative reward**, or **the sum of rewards into the future**. By convention, return is assigned the notation $G_t$ where $$G_t = R_{t + 1} + R_{t + 2} + R_{t + 3} +...$$
Note that we are looking **forward into time**, as the past reward is not our concern due to the fact that the agent is not able to change the past.
The true value function is in fact just the expected return itself. However, in an unfamiliar environment we do not have access to this true value function. Thus, we often use estimators to give us a gauge of how good our current state is. Why might we do this?
For example, in a long-winded game such as chess, it is only at the end of the game that the agent receives some form of feedback. It is therefore quite useful for the agent to have some ability to gauge whether the action it is about to perform will lead it to a more favourable state. This is where value function estimates come in: they provide an educated estimate of what the expected return will be at the end of the game.
However, it is not so practical to define the return $G_t$ as in the equation above, because future rewards $R_{l}$ for $l > t + 1$ are **given the same weight** as the immediate reward at $l = t + 1$. Although it is important to consider delayed rewards, there may be scenarios where immediate rewards are important as well. To enable this form of adjustment, the return is actually defined as
$$G_t = \sum_{k = t + 1}^{\infty} \gamma^{k - t - 1} R_k \tag{1}$$
$$G_t = \sum_{k = t + 1}^{T} \gamma^{k - t - 1} R_k \tag{2}$$
(Sutton and Barto, 2018)
The above equations can be expanded to $G_t = R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + \dots$. This allows us to adjust how much importance we want to place on future rewards. For instance, if we choose $\gamma = 0$, the terms for $k \geq t + 2$ immediately cancel off, and thus **immediate reward would be prioritized**. On the other hand, we can set $\gamma = 1$, which puts equal weight on all future rewards. Equation $(1)$ refers to a continuing process, whereby there is **no terminating state** and thus the number of future possible rewards is infinite. Although theoretically possible, **a more practical use case would be** the episodic form of the return, which is equation $(2)$.
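As a small worked example (the reward values and $\gamma$ choices here are arbitrary), the episodic return of equation $(2)$ can be computed directly from a list of rewards:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum over k of gamma^(k - t - 1) * R_k for one episode.

    rewards[0] corresponds to R_{t+1}, rewards[1] to R_{t+2}, and so on.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.0))   # 1.0  -> only the immediate reward counts
print(discounted_return(rewards, gamma=1.0))   # 3.0  -> all rewards weighted equally
print(discounted_return(rewards, gamma=0.9))   # 2.71 -> 1 + 0.9 + 0.81
```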
### Function Approximators 🎯
Why use function approximators?
#### Neural Networks
#### Decision Trees
### Key Challenges in RL ⚔️
#### Exploration-Exploitation
When an agent is initialized and put into a new environment, the actions it takes are essentially random, in that the agent does not possess any knowledge of what to do, or even of what the task is. Only when it interacts with the environment, gains knowledge from data, and learns the optimal actions does it improve. However, this "reliance on data" can lead to two different scenarios. (Wang, Zariphopoulou and Zhou, 2019)
1. 🧠 **Exploitation**: The agent learns that a certain action returns some reward. Because the goal is to maximize the total reward, the agent then continues to exploit this specific piece of knowledge by repeatedly performing this move. As one can imagine, if the agent has not visited a large enough portion of the action space, this may lead to a suboptimal policy (Wiering, 1999).
2. 🗺️ **Exploration**: The agent takes actions that do not currently have the maximum expected reward, in order to learn more about the environment and discover better options for the future. However, an agent that focuses solely on acquiring new knowledge will potentially waste resources, time and opportunities.

*An illustration of explore vs exploit (Image Credits: https://ai-ml-analytics.com/)*
Therefore the agent must learn to **balance the trade-off** between exploring and exploiting, to learn the actions that ultimately lead to the optimal policy.
What are some approaches to tackle this issue? The simplest is to choose randomly: on every move there is a 50% chance to explore and a 50% chance to exploit. One may then realize that, in fact, a much smarter move would be to have some parameter **epsilon** $\epsilon$ that controls the **probability to exploit**, with the probability to explore being $1 - \epsilon$. By doing this, $\epsilon$ can **now be tuned** to maximize the policy, which empirically works much better. (Bather, 1990)
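A minimal sketch of this $\epsilon$-greedy rule follows. Note that, to match the wording above, $\epsilon$ is treated here as the probability of exploiting (many references instead use it as the probability of exploring), and `q_values` is an assumed array of action-value estimates for the current state:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Pick an action: exploit with probability epsilon, otherwise explore.

    q_values: 1-D array of estimated action values for the current state.
    Note: epsilon here is the probability of *exploiting*, matching the text
    above (many references use it as the probability of exploring instead).
    """
    if rng.random() < epsilon:
        return int(np.argmax(q_values))          # exploit: best known action
    return int(rng.integers(len(q_values)))      # explore: uniformly at random
```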
#### Delayed Reward
Unlike in supervised learning, agents usually do not get immediate feedback on a per-action basis. Rather, the reward is attributed to a sequence of actions. This means that agents must account for the possibility that taking greedy approaches (essentially trying to grab immediate rewards) may result in less future reward.
### Uses of RL 🧰
So how does one **actually use RL**? It can be used to optimize decision making in systems where the decision maker does not have complete information about the system or about the consequences of its actions. Additionally, it may be used to control systems that are difficult to model completely with mathematical equations, such as robots that must operate in uncertain environments.
#### Applications
RL can be used in control problems such as:
- Robotics
- Games
- Autonomous systems
In robotics, reinforcement learning algorithms can be used to train robots to perform tasks in real-world environments, such as grasping objects or walking.
In gaming, reinforcement learning algorithms have been used to develop AI agents that can play games at a superhuman level, such as chess or Go.

<em>AlphaGo vs Korean Grandmaster Lee Se-dol (Image Credits: cnet.com)</em>
For instance, AlphaGo, a reinforcement learning agent developed by Google’s DeepMind, was able to <strong>defeat Korean Grandmaster Lee Se-dol</strong> at the game of Go, which he plays professionally.
It's also used in various ways by Boston Dynamics in **the control and training of their robots**. Boston Dynamics develops robots that are designed to operate in challenging environments and perform a wide range of tasks, from walking and jumping to carrying and manipulating objects. (Raibert and Tello, 1986)

<em>Boston Dynamics robot (Image Credits: bostondynamics.com)</em>
For example, Boston Dynamics has used reinforcement learning to train its robots to balance and walk on rough terrain, such as rocks or uneven surfaces. The robots **receive rewards for maintaining balance and penalties for falling over**, allowing them to learn to walk more stably and efficiently over time.
RL has proven to be a powerful tool for Boston Dynamics in their development of advanced robots, allowing them to perform complex and dynamic tasks in real-world environments with greater stability and robustness. (Pineda-Villavicencio, Ugon and Yost, 2018)
### Recent Developments 🔬
One of the recent and interesting developments in RL is Real-Time Reinforcement Learning.
As the field of RL continues to evolve, new applications of real-time RL are likely to emerge, making it a promising area of research and development (Ramstedt and Pal, 2019).
### LunarLander V2 🌑
Lunar Lander v2 is a reinforcement learning environment developed by OpenAI. It is based on the classic Atari game Lunar Lander, in which the player must control a lunar lander and land it safely on a landing pad while avoiding obstacles and managing fuel consumption. The goal of the game is to land the lander as safely and efficiently as possible.
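As a quick sketch of the interaction loop (not a trained agent), the snippet below runs one episode with a purely random policy. It assumes the Gymnasium package, the maintained fork of OpenAI Gym, with its Box2D extras installed; depending on the installed version the environment id may be `LunarLander-v2` or `LunarLander-v3`, and the older `gym` package returns slightly different values from `reset` and `step`.

```python
import gymnasium as gym

# One episode of LunarLander with a purely random policy, just to show the
# observation / action / reward cycle. No learning happens here.
env = gym.make("LunarLander-v2")          # may be "LunarLander-v3" on newer versions
observation, info = env.reset(seed=42)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()    # random action from the discrete action space
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return with a random policy: {total_reward:.1f}")
env.close()
```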