# RL
- Why NAS
- NAS is a technique for automatically searching for the best-performing architecture for a neural network. It reduces human bias and finds an architecture suited to our needs.
- What NAS
- Search Space
- Types of layers we can experiment with, the number of layers, and how they are connected. Often biased, since we are the ones who include these choices
- Search Algorithm
- Samples a population of architecture candidates
- Then receives the child model's performance as a metric: accuracy in our case
- Our approach
- We have limited the size of our CNN due to computational resources
- We use an RNN as the controller and train it as an RL task using REINFORCE (see the sketch after this list)
- Use the controller to generate an encoded sequence that represents a valid child architecture
- This RNN predicts with some restrictions
- Other options for the controller:
- Convert the encoded sequence into an actual CNN model.
- Train said CNN model and make a note of its validation accuracy
- Utilize this validation accuracy and the encoded model architecture to train the controller itself
- Use REINFORCE
- Reward function is not differentiable
- Controller predictions are taken as actions
- The controller's current state is the state
- Validation accuracy is the reward
- Use Monte-Carlo methods (i.e. sampling) to estimate the expectations
- Repeat
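Below is a minimal sketch of this loop, assuming a PyTorch implementation (the notes do not name a framework). `NUM_TOKENS`, `SEQ_LEN`, and `build_and_train_child` are illustrative placeholders: in a real run, `build_and_train_child` would decode the token sequence into a CNN, train it, and return its validation accuracy.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative sizes (assumptions, not from the notes)
NUM_TOKENS = 6   # number of discrete choices per step (e.g. filter sizes / counts)
SEQ_LEN = 8      # length of the encoded architecture sequence
HIDDEN = 64      # controller hidden size


class Controller(nn.Module):
    """RNN controller: emits one architecture token per step."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_TOKENS, HIDDEN)
        self.cell = nn.LSTMCell(HIDDEN, HIDDEN)
        self.head = nn.Linear(HIDDEN, NUM_TOKENS)

    def sample(self):
        h = torch.zeros(1, HIDDEN)
        c = torch.zeros(1, HIDDEN)
        token = torch.zeros(1, dtype=torch.long)      # start token
        tokens, log_probs = [], []
        for _ in range(SEQ_LEN):
            h, c = self.cell(self.embed(token), (h, c))
            dist = Categorical(logits=self.head(h))   # policy pi(a_t | controller state)
            token = dist.sample()                     # action = next token of the encoding
            tokens.append(token.item())
            log_probs.append(dist.log_prob(token))
        return tokens, torch.stack(log_probs).sum()


def build_and_train_child(tokens):
    """Hypothetical stand-in: decode `tokens` into a CNN, train it, and
    return its validation accuracy in [0, 1]. A dummy value keeps the
    sketch runnable."""
    return sum(tokens) / (len(tokens) * (NUM_TOKENS - 1))


controller = Controller()
opt = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline = 0.0                                        # moving-average reward baseline

for step in range(1000):                              # "Repeat"
    tokens, log_prob = controller.sample()            # controller predicts a child encoding
    reward = build_and_train_child(tokens)            # validation accuracy is the reward
    baseline = 0.95 * baseline + 0.05 * reward
    loss = -(reward - baseline) * log_prob            # REINFORCE score-function update
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The `-(reward - baseline) * log_prob` term is the REINFORCE score-function update; the moving-average baseline is a common variance-reduction trick and is an addition not mentioned in the notes above.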
## How do we train
There are various ways to move towards better results, e.g. random sampling of candidate architectures. We use reinforcement learning as a policy-optimization method.
In our setup:
- State: the state of the NASCell controller
- Action: the next element in the sequence, as predicted by the controller
- Policy: the probability distribution that determines the action, given the state (made concrete below)
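Concretely (a sketch of this mapping, assuming the controller's hidden state $h_t$ summarizes the tokens sampled so far and $W$ is an output projection):

$$
\pi_\theta(a_t \mid s_t) = \operatorname{softmax}(W h_t), \qquad s_t \equiv h_t = f_\theta(h_{t-1}, a_{t-1})
$$

The action $a_t$ (the next architecture token) is sampled from this distribution, and feeding it back into the controller produces the next state.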
## Objective Function
- The objective is to learn a policy that maximizes the cumulative future reward received from any given time t until the terminal time T.
REINFORCE is a Monte-Carlo variant of policy gradients (Monte-Carlo: estimating expectations from random samples). The agent collects a trajectory τ of one episode using its current policy and uses it to update the policy parameters. Since one full trajectory must be completed to construct a sample, REINFORCE only updates once the episode has finished; it is an on-policy method.
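Written out for our case, where the reward $R$ is the child's validation accuracy and $a_{1:T}$ is the controller's token sequence (the baseline $b$ is a common variance-reduction term, an assumption not stated above):

$$
J(\theta) = \mathbb{E}_{a_{1:T}\sim \pi_\theta}[R]
$$

$$
\nabla_\theta J(\theta) \approx \frac{1}{m}\sum_{k=1}^{m}\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta\!\left(a_t^{(k)} \mid a_{1:t-1}^{(k)}\right)\left(R_k - b\right)
$$

Averaging over the $m$ sampled architectures is exactly the Monte-Carlo estimate of the expectation mentioned above.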



- Another successful effort called [MetaQNN](https://arxiv.org/abs/1611.02167)[7](https://theaisummer.com/neural-architecture-search/#fn-7) uses Q Learning with an [e-greedy exploration](https://paperswithcode.com/method/epsilon-greedy-exploration) mechanism and [experience replay](https://paperswithcode.com/method/experience-replay)
- Backprop
![[Pasted image 20221130122108.png]]
- RL Basics
- State: the information used to determine what happens next; the state is a function of the history
- Environment state: whatever data the environment uses to pick the next observation/reward
- Agent state: whatever information the agent uses to pick the next action
- This is the information used by RL algorithms
- History: the sequence of observations, actions, and rewards
- Markov state:
- Once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future
- $P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$
- Action: the decision our RL agent makes, given the state and rewards
- Reward: the feedback signal that the environment gives to the RL agent
- Value:
- ![[Pasted image 20221130145749.png]]
- Policy:
- ![[Pasted image 20221130145723.png]]
- Model:
- ![[Pasted image 20221130145806.png]]
https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf
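For reference alongside the screenshots, the standard definitions from the linked slides (policy, state-value function, and model):

$$
\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]
$$

$$
v_\pi(s) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\right]
$$

$$
\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]
$$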
- Q-value: the quality of a (state, action) pair, i.e. the expected return if we take this action in this state
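In symbols (the standard action-value definition, with discount factor $\gamma$):

$$
Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]
$$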
