# Deep Reinforcement Learning and applications to Robotics
## Abstract
Deep Reinforcement Learning, the combination of Deep Learning and Reinforcement Learning, is currently one of the most active topics in Machine Learning applied to robotics. As a field with fast and constant advances, it keeps improving at creating autonomous systems with a higher level of understanding of their environment, capable of learning from scratch to perform tasks previously reserved for humans. Deep Reinforcement Learning has recently emerged as a fundamental step towards this goal, scaling Reinforcement Learning to problems that computing systems could not solve in the past. This learning method has proved to be an essential boost for robotics and, as such, has become an important research subject in the area. For that reason, in this document we provide a brief overview of the topic and some of the interesting advances involved. We start by introducing basic concepts of the methods involved, namely Deep Learning and Reinforcement Learning. Then, we present interesting research by OpenAI, which developed a new strategy that seems very promising for improving the learning process. Finally, we introduce Inverse Reinforcement Learning, a trending direction for future work on the subject.
## Introduction
It is well known that one of the greatest ambitions of Artificial Intelligence is to create a fully autonomous agent, able to start with no knowledge and learn by itself to perform complex tasks by interacting with the environment that surrounds it. By complex tasks, we mean tasks that previously only humans could achieve.
Human learning happens through continuous interaction with the environment. Starting with no knowledge, a person acts on the environment through trial and error, evaluates the environment's responses to those actions, and builds knowledge from there in order to make better decisions in the future.
This learning through interaction and trial and error is the main foundation of Reinforcement Learning, a mathematical framework that mimics this human way of "thinking" and one of the first steps towards experience-driven autonomous learning.
Despite its early successes, Reinforcement Learning was limited by complexity issues: it lacked scalability and could only learn in low-dimensional environments. Most human tasks, however, take place in environments influenced by a large number of variables, which put them out of reach of these algorithms.
So, how can Reinforcement Learning be applied to complex problems? The answer is Deep Reinforcement Learning. By leveraging the ability of deep neural networks to extract compact, low-dimensional representations from high-dimensional data, Deep Reinforcement Learning allowed Reinforcement Learning to scale to far more complex environments.
From its first successes with algorithms that learned to play Atari games directly from image pixels, to AlphaGo, a system that defeated a human world champion of Go, Deep Reinforcement Learning revolutionized Machine Learning by showing that its agents could be trained on raw, high-dimensional observations. Its application had a strong impact on a wide range of areas, especially robotics, allowing robots to learn directly from the raw visual inputs captured by their sensors.
## Brief overview of Deep Reinforcement Learning
Deep Reinforcement Learning is a field of Machine Learning and currently one of its most trending topics, since it brings a human way of thinking and acting into computer systems, allowing the resolution of a large set of complex decision-making problems that were impossible to solve before. [1] Its way of operating is very similar to that of a human who is born into the world without any knowledge of it and therefore has to interact through trial and error until it learns how to act properly in each circumstance. In this context, acting properly means behaving like humans, or even better.
However, it is not possible to explain Deep Reinforcement Learning without first going through its fundamental parts individually: Reinforcement Learning and Deep Learning.
### Reinforcement Learning
Reinforcement Learning, like Supervised and Unsupervised Learning, is a branch of Machine Learning, and currently the most trending of the three. It was the first approach to bring a **human way of thinking and acting to systems**, allowing its agents to **learn from previous experiences** in search of the optimal way to solve a specific problem.
It can be described as a Markov Decision Process (MDP), consisting of the following elements (a minimal code sketch follows the list):
* S: since a state is a set of variables that represents the environment, S defines all combinations of values those variables can take, also known as the **state space**. This includes special states, such as the start and terminal states.
* A: the set of **actions** the agent can perform on the environment. In each state, the environment makes a set of actions available and the agent chooses one of them to apply to the environment.
* T: a **transition function**, which defines the probability of the environment transitioning to a state S(t+1) after the agent performs an action a(t) in state S(t) at instant t.
* R: a **reward function**, which quantifies how well the agent did by performing action a(t) in state S(t) at instant t and reaching state S(t+1).
* γ: a **discount factor** controlling the importance of rewards over time. Lower values emphasize immediate rewards, while higher values emphasize future rewards.
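To make these elements concrete, the following is a minimal sketch of the MDP tuple in Python; the two-state "weather" example and its numbers are made up purely for illustration.

```python
# Minimal sketch of the MDP tuple (S, A, T, R, gamma) described above,
# using plain Python dictionaries. The example is hypothetical.
states = ["sunny", "rainy"]                 # S: state space
actions = ["walk", "drive"]                 # A: action space

# T[s][a] -> {next_state: probability}: transition function
T = {
    "sunny": {"walk":  {"sunny": 0.9, "rainy": 0.1},
              "drive": {"sunny": 0.8, "rainy": 0.2}},
    "rainy": {"walk":  {"sunny": 0.3, "rainy": 0.7},
              "drive": {"sunny": 0.4, "rainy": 0.6}},
}

# R[s][a] -> immediate reward for taking action a in state s
R = {
    "sunny": {"walk": 1.0, "drive": 0.5},
    "rainy": {"walk": -1.0, "drive": 0.2},
}

gamma = 0.9  # discount factor: lower values favour immediate rewards
```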
In Reinforcement Learning, an autonomous agent is placed in an environment without any previous model of its state transition dynamics. The only thing the agent can observe is the state of the environment, a representation that comprises all the information the agent needs to analyze the consequences of its actions.
At each time step t, the agent receives a state **S(t)** from its environment and interacts with it by taking an action **a(t)**, which can be random in an early stage. In response to that action, the environment transitions to a new state **S(t+1)** and returns a reward **r(t+1)** associated with that transition. The reward of each transition depends on the task to be performed and allows the agent to learn about the consequences of its actions in the environment. [2]
Through trial and error, the agent tries different actions in several states of the environment. Each of these interactions lets the agent collect information about state transitions and the respective rewards, which it uses to update its knowledge. In this way, the agent aims to gradually build an **optimal policy**: a function that returns, for each state, the best action to take, where the best action is the one that maximizes the **expected return** (the accumulation of discounted rewards).
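As an illustration of this loop, the sketch below runs tabular Q-learning on a hypothetical 5-cell corridor environment defined inline for the example: the agent acts at random, observes transitions and rewards, and gradually updates its value estimates towards the discounted return.

```python
import random

# Hypothetical corridor: the agent starts in cell 0 and is rewarded
# only when it reaches cell 4 (the terminal state).
N_STATES, N_ACTIONS = 5, 2          # actions: 0 = left, 1 = right
GAMMA, ALPHA = 0.9, 0.1             # discount factor and learning rate

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # value of each (state, action)

def step(state, action):
    """Environment response: the next state and the reward of the transition."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        a = random.randrange(N_ACTIONS)          # early stage: act at random
        s_next, r = step(s, a)
        # move the estimate towards the reward plus the discounted value of the next state
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next

# The greedy policy derived from Q now picks "right" in every non-terminal cell.
policy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
```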

One of the main challenges in Reinforcement Learning is managing the trade-off between **exploration and exploitation** efficiently. In order to accumulate the greatest rewards, the agent tends to choose actions it has already tried and knows to return the highest rewards. However, always choosing these actions greedily can hide alternatives that, although less rewarding immediately, would lead to greater rewards in the future. As such, the agent must know when to ignore the best actions it already knows, in order to explore and discover new alternatives that may turn out to be more advantageous.
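One widely used heuristic for this trade-off, shown here only as a sketch, is ε-greedy action selection: with a small probability the agent explores a random action, otherwise it exploits the best action it currently knows.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon (exploration),
    otherwise the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: with epsilon = 0.1 the greedy action is chosen ~90% of the time.
action = epsilon_greedy([0.2, 0.8, 0.5], epsilon=0.1)
```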
### Deep Learning
Deep Learning is also a subset of Machine Learning, orthogonal to the three branches mentioned before, meaning that the concepts of Supervised, Unsupervised and Reinforcement Learning can all be applied to Deep Learning algorithms. Its implementation is inspired by the structure of the human brain and is therefore composed of multiple artificial neurons that process information.
The Artificial Neural Network, as it is called, is organized into three kinds of processing layers, the input, inner (hidden) and output layers, each aggregating a variable number of neurons that depends on the problem being analysed. There is always exactly one input and one output layer, but the number of inner layers is variable and must be adjusted to obtain the best decision boundary. As an example, to process an image and classify it, the input layer would need one neuron per pixel (the variables being fed to the neural network) and the output layer would need one neuron per possible output class. The inner layers are harder to define and usually require a prior analysis of the data.
In simple terms, the information fed to the input layer is forwarded to the following layer through connecting channels that link pairs of neurons. Each channel has an associated weight that multiplies the value being transported. At the receiving neuron, the incoming values are summed together with a bias value associated with that neuron, and an activation function then determines whether the computed result is passed on to the following layer; this repeats until the output layer is reached. Training the Artificial Neural Network consists of continuously adjusting the weights and biases with the help of a training data set and the backpropagation of errors.
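The sketch below illustrates this forward pass with NumPy, assuming arbitrary layer sizes and randomly initialized weights; it is not a full training loop, which would additionally adjust the weights and biases through backpropagation.

```python
import numpy as np

# Minimal forward pass of a small network: inputs flow through weighted
# connections, a bias is added at each neuron, and an activation function
# decides what is passed on. Layer sizes here are arbitrary.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # input layer (4) -> hidden layer (3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)   # hidden layer (3) -> output layer (2)

def relu(x):
    return np.maximum(0.0, x)                    # activation: only positive signals pass

def forward(x):
    hidden = relu(x @ W1 + b1)                   # weighted sum + bias, then activation
    return hidden @ W2 + b2                      # raw outputs (e.g. class scores)

output = forward(np.array([0.5, -1.2, 0.3, 0.9]))
# Training would repeatedly adjust W1, b1, W2, b2 via backpropagation
# to reduce the error between `output` and the desired target.
```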
### Deep Reinforcement Learning
Deep neural networks are powerful function approximators, able to receive high-dimensional states as input and compress them into low-dimensional outputs. Therefore, integrating neural networks as components of Reinforcement Learning algorithms allowed the latter to scale to a set of more complex problems with raw, high-dimensional visual inputs.
The state of the environment is encoded as a vector and passed as input to the neural network. The network then tries to predict which action should be played, returning as output a Q-value for each of the possible actions. The action to play is chosen either by taking the one with the highest Q-value or by sampling from a softmax distribution over the Q-values. The main advantage is that far more complex environments can be modelled this way.
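The following sketch shows only the action-selection step described above, assuming the Q-values have already been produced by the network for the current state vector.

```python
import numpy as np

def select_action(q_values, mode="greedy", temperature=1.0):
    """Choose an action from the Q-values produced by the network:
    either the highest-valued one, or a sample from a softmax distribution."""
    if mode == "greedy":
        return int(np.argmax(q_values))
    probs = np.exp((q_values - np.max(q_values)) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))

# Example: the state vector would be fed to the network to obtain these Q-values.
q_values = np.array([1.2, 3.4, 0.7])
greedy_choice = select_action(q_values, mode="greedy")     # -> 1
softmax_choice = select_action(q_values, mode="softmax")   # stochastic, biased towards 1
```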
As a result, intelligent and autonomous robots benefit from the development of such algorithms, since they become able to read their surroundings and act correctly based on a well-trained evaluation. Consequently, a vast range of applications has appeared, for example:
- Autonomous driving, with trajectory optimization, path planning, and scenario-based learning policies
- Industry automation, with robots performing tasks, sometimes dangerous for humans, in a more efficient way
- Robot manipulation, providing the ability to handle objects, sometimes ones not even seen during training in simulation.
One project we found very interesting was developed by OpenAI, an Artificial Intelligence research and deployment company backed by substantial funding from other large companies and with a positive impact on the industry. They introduced a method that helps transfer a model fully trained in a simulation environment to the real world.
#### [OpenAI - Solving Rubik’s Cube with a Robot Hand](https://openai.com/blog/solving-rubiks-cube/)
The majority of the breakthrough accomplishments of Deep Reinforcement Learning before 2019 (e.g., AlphaGo/AlphaZero, the Atari DQNs) were produced in domains with fully observable state spaces, limited action spaces and moderate credit-assignment time-scales. Partial observability, vast action spaces and long time-scales remained elusive. However, 2019 confirmed that there is still ample room for progress in combining function approximation with reward-based target optimization, and the robotic hand manipulation challenge highlights just one of the exciting new domains that modern Deep Reinforcement Learning is capable of tackling.
The topic was chosen for its scientific contribution, rather than for merely relying on the massive scaling of already existing algorithms.
It is well known that Deep Learning is designed to solve problems that require the extraction and manipulation of high-level features. Dexterity is a skill that comes naturally to humans, yet it still poses a major challenge for current computer systems. OpenAI's work on dexterity through Automatic Domain Randomization seeks to bridge this gap.
The developed solution aimed to solve a Rubik's Cube with a single robotic hand. The main difficulty, however, did not lie in solving the cube itself but in implementing a system capable of responding properly when confronted with unpredicted environments. In the following video, it is possible to see the results of the implementation, where perturbations are introduced to test how robust the system is: tied fingers, visual occlusion and other objects approaching the hand, among others.
[video here](https://www.youtube.com/watch?v=QyJGXc9WeNo&ab_channel=OpenAI)
One key challenge in training Deep Reinforcement Learning agents for robotic tasks is transferring what was learned in simulation to the actual physical robot. Simulators capture only a subset of real-world mechanics, and simulating phenomena such as friction with a high level of accuracy demands computation time, time that is costly and could otherwise be spent productively generating more transitions within the agent's environment.
Automatic Domain Randomization was proposed to obtain a robust policy. Instead of training the agent on a single environment with a single set of environment hyperparameters, the agent is trained on a batch of different environments, each with a different configuration, in order to maximize learning progress. In addition, the environment configurations can be varied automatically based on the agent's learning progress at each timestep: during training, the entropy of the environment is increased each time the agent's performance approaches a defined threshold. That way, the learning curve never converges to a fixed value, making the task progressively harder and forcing the neural network to generalize to ever more randomized events.
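The sketch below conveys the general idea in a simplified form; the parameter names, bounds and threshold are illustrative placeholders, not the values used by OpenAI.

```python
import random

# Simplified sketch of the Automatic Domain Randomization idea: each physics
# parameter is sampled from a range, and the range is widened whenever the
# agent's recent performance crosses a threshold.
ranges = {"friction": [0.9, 1.1], "cube_size": [0.049, 0.051]}  # illustrative values
WIDEN_STEP, SUCCESS_THRESHOLD = 0.05, 0.8

def sample_environment():
    """Draw one environment configuration from the current ranges."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

def update_ranges(success_rate):
    """Increase the entropy of the environment once the agent performs well."""
    if success_rate >= SUCCESS_THRESHOLD:
        for name, (lo, hi) in ranges.items():
            ranges[name] = [lo - WIDEN_STEP * lo, hi + WIDEN_STEP * hi]

# Training loop outline: train on a batch of sampled environments, measure
# the success rate, and widen the randomization ranges when it is high enough.
for iteration in range(3):
    envs = [sample_environment() for _ in range(4)]
    success_rate = 0.85          # placeholder for the agent's measured performance
    update_ranges(success_rate)
```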
Automatic Domain Randomization, together with a PPO-LSTM-GAE-based policy, gives rise to a form of meta-learning that has not yet reached its full potential.
The algorithm did not "entirely" learn end-to-end both the right sequence of moves to solve the cube and the dexterous manipulation required to execute it; in other words, it could not learn the in-hand manipulation under such reward sparsity. On the bright side, it managed to learn a fairly short sequence of symbolic transformations. Indeed, Woj Zaremba mentioned at the 'Learning Transferable Skills' workshop at NeurIPS 2019 that it took them one day to "solve the cube" with Deep Reinforcement Learning and that it is possible to do the whole process fully end-to-end. [3]
### Important research trend
Although Deep Reinforcement Learning has already produced remarkable achievements with important impacts on the development of robotics, it remains a very active research area, since much is still expected from its applications. One influential current research direction involves Imitation Learning and Inverse Reinforcement Learning.
Given a sequence of "optimal" actions from expert demonstrations, it is possible to learn directly from them. This process is called behavioral cloning and has already had successes (ALVINN, an autonomous car). However, behavioral cloning cannot adapt to new situations: small deviations can accumulate and lead the policy into scenarios from which it is unable to recover.
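As a minimal sketch of behavioral cloning, the example below fits a linear policy to synthetic expert (state, action) pairs by plain supervised learning; the data and the expert rule are invented for illustration.

```python
import numpy as np

# Behavioral cloning sketch: fit a policy to expert (state, action) pairs
# with supervised learning. The expert data here is synthetic.
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 4))                 # 200 demonstrations, 4 features each
actions = (states[:, 0] > 0).astype(int)           # expert rule used to label the data

W = np.zeros((4, 2))                               # linear policy: state -> action scores

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(500):                               # gradient descent on cross-entropy
    probs = softmax(states @ W)
    probs[np.arange(len(actions)), actions] -= 1.0 # gradient of the loss w.r.t. the scores
    W -= 0.01 * states.T @ probs / len(actions)

cloned_action = np.argmax(states[0] @ W)           # the policy imitates the expert's choice
```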
A more generalizable solution is proposed by Inverse Reinforcement Learning. Its goal is to estimate an unknown reward function from observed demonstrations, and then use that reward with Reinforcement Learning algorithms to optimize the agent's behaviour. [2]
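In the spirit of this idea, the following heavily simplified sketch adjusts the weights of a linear reward estimate so that the expert's demonstrated behaviour scores higher than the current policy's rollouts (a feature-expectation-matching flavour); all quantities are illustrative, and a real method would re-optimize the policy between updates.

```python
import numpy as np

# Heavily simplified Inverse RL sketch: assume the reward is linear in state
# features and nudge its weights towards the difference between the expert's
# and the current policy's average feature vectors. All data is synthetic.
rng = np.random.default_rng(0)

def feature_expectations(trajectories):
    """Average feature vector over a set of trajectories (discounting omitted)."""
    return np.mean([np.mean(t, axis=0) for t in trajectories], axis=0)

expert_trajs = [rng.normal(loc=1.0, size=(20, 3)) for _ in range(5)]   # demonstrations
policy_trajs = [rng.normal(loc=0.0, size=(20, 3)) for _ in range(5)]   # current policy rollouts

w = np.zeros(3)                                   # weights of the estimated reward
for _ in range(10):
    # push the reward towards explaining why the expert's features are preferable
    w += 0.1 * (feature_expectations(expert_trajs) - feature_expectations(policy_trajs))
    # a full method would now re-optimize the policy under reward(s) = w @ phi(s)
    # with an RL algorithm and regenerate policy_trajs before the next update

def reward(state_features):
    return w @ state_features                     # learned reward estimate
```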
#### Interesting implications
- **Demonstrations substitute the manual specification of the reward**
Pre-specifying the reward function is a concern for the applicability of Deep Reinforcement Learning, restricting it to cases where the reward function is easy to specify or simulate. Inverse Reinforcement Learning offers a broader approach, reducing the need to manually design task specifications. [4]
- **Transferability of the reward function**
The learned reward function provides a useful basis when the agent's specifications differ mildly, and it is inherently more transferable than the observed agent's policy. Significant state changes are likely to render a learned policy useless, whereas the learned reward function simply needs to be extended. [4]
As such, Inverse Reinforcement Learning is rapidly expanding to a range of applications, such as learning from experts to create an agent with the experts' preferences, and learning from another agent to predict its behaviour. [4]
## Conclusions
Today's machines are capable of teaching themselves based on the results of their actions, a process that can be achieved through Reinforcement Learning, whose success in various fields is unquestionable. Combining it with state-of-the-art deep neural networks is not always a straightforward task, yet it results in cutting-edge Deep Reinforcement Learning, an approach that can elevate Reinforcement Learning's already huge potential to new, unforeseen heights. DRL can be used to solve real-world tasks end-to-end, including hot topics such as autonomous driving, industrial automation and robot manipulation. The latter has been and continues to be comprehensively researched, as seen in OpenAI's paper "Solving Rubik's Cube with a Robot Hand". Nonetheless, despite all the achievements already attained, innovation in this field is undoubtedly far from over. Inverse Reinforcement Learning (IRL) is a relevant current research trend with numerous plausible applications, such as modelling relations between agents and sharing knowledge with their counterparts.
To sum up, this technology is advancing by leaps and bounds and is set to accomplish great things soon.
## Bibliography
- [1] Jordi TORRES.AI, A gentle introduction to Deep Reinforcement Learning, May 2020. Accessed on: Oct. 19, 2020. [Online]. Available: https://towardsdatascience.com/drl-01-a-gentle-introduction-to-deep-reinforcement-learning-405b79866bf4
- [2] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep reinforcement learning: A brief survey," IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, Nov. 2017. [Online]. Available: http://dx.doi.org/10.1109/MSP.2017.2743240
- [3] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang, "Solving Rubik's cube with a robot hand," 2019
- [4] S. Arora and P. Doshi, "A survey of inverse reinforcement learning: Challenges, methods and progress," 2019
- [5] OpenAI, Solving Rubik’s Cube with a Robot Hand, Oct. 15, 2019. Accessed on: Oct. 19, 2020. [Online]. Available: https://openai.com/blog/solving-rubiks-cube/
- [6] Y. Li, “Deep reinforcement learning,” 2018