# Reinforcement Learning Survey

###### tags: honours, reinforcement learning

## Reinforcement learning based recommender systems (RLRS)

- DRL: deep reinforcement learning
- DL: deep learning
- RL: reinforcement learning
- RS: recommender systems

Recommendation was traditionally treated as a classification problem, but it is now treated as a [Markov decision process](https://www.notion.so/Markow-decision-process-8f0e74e08f1d4991bfac1837ad84ec38).

- Main idea: help users find their interests, based on the data we know about them

Combining DL with RS has problems as well.

- RL: a semi-supervised machine learning field in which the agent learns from the environment

## Section 2

Real-life case: we have to make a decision, but we don't have *a priori* information. Thus, we ask for recommendations from others who have experience with the subject.

- Content-based filtering: recommend items similar to the user profile
  - Problems: what to do when a user is new, and providing a diverse range of recommendations ⇒ hybrid methods were developed
- These are traditional RSs, unable to handle today's needs

# Reinforcement Learning and Deep Reinforcement Learning

## Reinforcement Learning

RL: agents learn what to do in order to maximize a numerical reward.

- closed loop
- the learner does not have a tutor to teach it what to do, but figures it out based on trial and error
- actions influence short *and* long term results

### Definitions/Dictionary

Agent: the learner or decision maker

Environment: everything outside the agent

State: the information the agent sees about the environment at time *t*

Action: based on the state, the learner does something. On taking this action, it receives a numerical reward and transitions into a new state.

A Markov decision process is formulated as a tuple $(S, A, R, P, \gamma)$:

- $S$: set of all possible states
- $A$: set of available actions
- $R$: reward function
- $P$: transition probability
- $\gamma$: discount factor

Goal: find a policy $\pi(a \vert s)$ that takes action $a$ in state $s$ in order to maximize the expected, discounted cumulative reward. $\pi$ is the policy, which gives the probability of taking action $a$ in state $s$.

- two kinds of algorithms: on-policy vs off-policy
  - on-policy: improve the policy that is used to generate the data
  - off-policy: improve a policy different from the one used to generate the data

Reward signal: used to select actions; the environment provides a numerical reward signal to inform the agent

Value function: tells us what is good in the long run, not just what is good now (as the reward signal does)

Model: an inference about the behaviour of the environment in different states

Slate: a tuple (ordered list) of items recommended together

## Tabular vs approximate

Two different families of methods for solving RL problems.

### Tabular

Value functions are represented as tables. Includes:

- dynamic programming (DP)
  - assumes a perfect model of the environment, uses a value function to search for good policies
  - policy iteration, value iteration
- Monte Carlo (MC)
  - no complete knowledge of the environment, and no assumptions
  - needs samples
  - MCTS: Monte Carlo Tree Search
- temporal difference (TD)
  - combination of DP and MC
  - no environment model
  - bootstrapping
  - Q-learning, SARSA
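To make the tabular TD idea concrete, here is a minimal, illustrative Q-learning sketch. It is not from the survey; the `env` object (with `reset()` returning a state index and `step(action)` returning `(next_state, reward, done)`) and all hyperparameters are assumptions made for this example.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular, off-policy TD control (Q-learning) -- illustrative sketch."""
    Q = np.zeros((n_states, n_actions))  # one value per (state, action) pair
    for _ in range(episodes):
        state = env.reset()              # hypothetical env: reset() -> state index
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            # hypothetical env: step(action) -> (next state index, reward, done flag)
            next_state, reward, done = env.step(action)
            # bootstrapped TD target: off-policy, uses the greedy value of the next state
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```

SARSA, the on-policy counterpart, would replace the `max` over next actions with the Q-value of the action actually selected in `next_state` by the same epsilon-greedy policy.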
### Approximate

Find a good approximate solution with limited computational resources.

- Generalize from seen states to unseen states
- Function approximation
- Many techniques can be used, like neural nets
- REINFORCE and actor-critic are important methods

## Deep learning

- based on neural nets
- good performance
- used as a function approximator in RL
- Deep Q-network (DQN): approximate method for Q-learning

# Reinforcement learning for Recommendation

User interaction with an RS is sequential: it is not only a prediction problem, but a sequential decision problem → the recommendation problem can be modelled as an MDP and solved by RL methods.

The RS algorithm tries to recommend the best items to the user and to maximize their satisfaction → the RS algorithm is the RL agent, and everything outside it (like the users of the system and the items) is the environment.

Traditional tabular RL algorithms can't be applied to today's RSs with huge action and state spaces → instead, DRL.

Sliding window: instead of keeping track of all songs a person listened to, keep track of the last *m* songs.

Gridworld game using biclustering: biclusters are formed from the user-item matrix, then every bicluster is mapped to one of the states in the gridworld. Any state in the gridworld can be the starting state; the one most similar to the user is chosen.

# 3: Algorithms

Two groups: RL and DRL.

## SARSA

- RL, TD (bootstrap), tabular

Has two units, global (tracking the most popular items, for example) and local (tracking the user). Combine the two to get the next recommended item.

Problem: scalability, keeping track of all users.

## DP

- tabular

The model parameters of the MDP recommender are unknown and deploying it is costly, so *they* suggest a predictive model, which provides initial parameters for the MDP. This model is a Markov chain, in which the state and transition function are modelled based on the observations in the dataset. To tackle the dimensionality problem, the last *k* items are used to encode state information.

## MC

- tabular

Each song is modelled as a vector of song descriptors: spectral fingerprint, rhythmic characteristics, loudness, and their change over time.

Reward: the listener's preference over individual songs, and the song transition pattern → listener parameters.

Two components: learning the listener parameters, and planning a sequence of songs.

- learning: divided into two parts, initialization and learning on the fly
  - init: the user is asked about the parameters
  - on the fly: requesting feedback while playing songs

## Fitted Q Methods

# DRL-based methods

## Q-learning: DQN

DQN for slate recommendation:

* Learn the value of full slates using a sequential greedy method.
* Sequential presentation property: recommended items are presented to the user one by one.

Key modifications in DQN:

1. It uses experience replay, a method that keeps the agent's experiences over various time steps in a replay memory and uses them to update the weights in the training phase.
2. In order to reduce the complexity of updating the weights, the current updated weights are kept fixed and fed into a second (duplicate) network whose outputs are used as the Q-learning targets.
3. To limit the scale of the error derivatives, the reward function is clipped so that it is 1 for positive rewards, -1 for negative rewards, and rewards of 0 are left unchanged.

All these modifications turned out to improve the stability of DQN.
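The three modifications are easiest to see in code. Below is a generic, minimal DQN training-step sketch, not the slate-recommendation agent from the survey: the network built by `make_q_net`, the buffer capacity, batch size, and learning rate are arbitrary illustrative choices, and the environment interaction loop is omitted.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA, BATCH = 8, 4, 0.99, 32  # illustrative values

def make_q_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_net = make_q_net()                      # online network, updated every step
target_net = make_q_net()                 # (2) duplicate network with frozen weights
target_net.load_state_dict(q_net.state_dict())

buffer = deque(maxlen=10_000)             # (1) experience replay memory
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def store(state, action, reward, next_state, done):
    # (3) reward clipping: 1 for positive, -1 for negative, 0 stays 0
    clipped = 1.0 if reward > 0 else (-1.0 if reward < 0 else 0.0)
    buffer.append((state, action, clipped, next_state, done))

def train_step():
    if len(buffer) < BATCH:
        return
    batch = random.sample(buffer, BATCH)  # replay: sample past experiences uniformly
    states, actions, rewards, next_states, dones = map(
        lambda xs: torch.as_tensor(xs, dtype=torch.float32), zip(*batch))
    actions = actions.long()

    # (2) Q-learning targets come from the frozen duplicate (target) network
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * next_q * (1.0 - dones)

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # periodically copy the online weights into the target network
    target_net.load_state_dict(q_net.state_dict())
```

In an RLRS setting, each state would encode the user's interaction history and each action would correspond to a candidate item (or, for slate recommendation, the next position in the slate).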
In contrast with action-value methods like DQN, policy gradient methods learn a parameterized policy without needing a value function. 1) Policy approximation methods can approach determinism, 2) policy approximation can be easier than value-function approximation, and 3) policy approximation methods can find stochastic optimal policies, while value-based methods cannot.

## Discussion and open research directions

* Most popular RL algorithm for RLRSs: Q-learning
  * Main reason: simplicity
* Formulating the RS problem as an MDP
* Different sources of information are used to represent the environment states (about users, items, user-system interaction, and context).
* Almost all algorithms recommend a single item in every episode.
* Offline approach for evaluation, using public datasets or simulation

### Open research questions

* First, RL algorithms were basically developed to select one action from the available actions. However, in the RS field, it is quite common to recommend a list of items. This is usually called slate, top-K, or list-wise recommendation.
  * The problem of recommending a list of items should be studied more in the future, as the RL agent faces a huge combinatorial action space.
* Although it makes sense not to reinvent the wheel, sometimes thinking outside the box could make a substantial improvement in the field. $\rightarrow$ Said in connection with RLRSs using available DRL architectures that were developed and tested for other domains. These are typically designed based on physics or Gaussian processes, not based on the dynamic nature of the user.
* Perhaps wisely combining these models (DL models developed for RSs) with traditional RL algorithms could outperform existing DRL models.
* A valuable research direction for the future: possibly finding a relationship between the RL algorithm and the RS application.
* Explainable recommendation is the ability of an RS to not only provide a recommendation, but also to explain why a specific recommendation has been made.
* There is an essential need for a powerful simulator, preferably a strong unified evaluation framework, for RLRSs and the RS field in general.