# RL
- Why NAS
- NAS is a technique for automatically searching for the best-performing architecture for a neural network. It reduces human bias and finds an architecture suited to our needs.
- What NAS
- Search Space
- Types of layers we can experiment with, the number of layers, and how they are connected. Often biased, since we are the ones who include these choices
- Search Algorithm
- Samples a population of architecture candidates
- Then receives the child model's performance as a metric: accuracy in our case
- Our approach
- We have limited the size of our CNN due to computational resources
- We use an RNN as the controller and train it as an RL task using REINFORCE (see the sketch after this list)
- Use the controller to generate an encoded sequence that represents a valid child architecture
- This RNN predicts with some restrictions
- Other options for the controller:
- Convert the encoded sequence into an actual CNN model.
- Train said CNN model and make a note of its validation accuracy
- Utilize this validation accuracy and the encoded model architecture to train the controller itself
- Use REINFORCE
- Reward function is not differentiable
- Controller predictions are taken as actions
- The controller's current state is the state
- Validation accuracy is the reward
- Use Monte-Carlo methods (i.e. sampling) to estimate the expectations
- Repeat
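Below is a minimal sketch of this loop, assuming a PyTorch implementation (the notes do not name a framework). `NUM_TOKENS`, `SEQ_LEN`, and `build_and_train_child` are illustrative placeholders: in a real run, `build_and_train_child` would decode the token sequence into a CNN, train it, and return its validation accuracy.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative sizes (assumptions, not from the notes)
NUM_TOKENS = 6   # number of discrete choices per step (e.g. filter sizes / counts)
SEQ_LEN = 8      # length of the encoded architecture sequence
HIDDEN = 64      # controller hidden size


class Controller(nn.Module):
    """RNN controller: emits one architecture token per step."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_TOKENS, HIDDEN)
        self.cell = nn.LSTMCell(HIDDEN, HIDDEN)
        self.head = nn.Linear(HIDDEN, NUM_TOKENS)

    def sample(self):
        h = torch.zeros(1, HIDDEN)
        c = torch.zeros(1, HIDDEN)
        token = torch.zeros(1, dtype=torch.long)      # start token
        tokens, log_probs = [], []
        for _ in range(SEQ_LEN):
            h, c = self.cell(self.embed(token), (h, c))
            dist = Categorical(logits=self.head(h))   # policy pi(a_t | controller state)
            token = dist.sample()                     # action = next token of the encoding
            tokens.append(token.item())
            log_probs.append(dist.log_prob(token))
        return tokens, torch.stack(log_probs).sum()


def build_and_train_child(tokens):
    """Hypothetical stand-in: decode `tokens` into a CNN, train it, and
    return its validation accuracy in [0, 1]. A dummy value keeps the
    sketch runnable."""
    return sum(tokens) / (len(tokens) * (NUM_TOKENS - 1))


controller = Controller()
opt = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline = 0.0                                        # moving-average reward baseline

for step in range(1000):                              # "Repeat"
    tokens, log_prob = controller.sample()            # controller predicts a child encoding
    reward = build_and_train_child(tokens)            # validation accuracy is the reward
    baseline = 0.95 * baseline + 0.05 * reward
    loss = -(reward - baseline) * log_prob            # REINFORCE score-function update
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The `-(reward - baseline) * log_prob` term is the REINFORCE score-function update; the moving-average baseline is a common variance-reduction trick and is an addition not mentioned in the notes above.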
## How do we train
There are various ways to move towards better results, e.g. random sampling of candidate architectures. We use reinforcement learning as a policy-optimization method.
In our setup:
- State: the state of the NASCell controller
- Action: the next element in the sequence, as predicted by the controller
- Policy: the probability distribution that determines the action, given the state (made concrete below)
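Concretely (a sketch of this mapping, assuming the controller's hidden state $h_t$ summarizes the tokens sampled so far and $W$ is an output projection):

$$
\pi_\theta(a_t \mid s_t) = \operatorname{softmax}(W h_t), \qquad s_t \equiv h_t = f_\theta(h_{t-1}, a_{t-1})
$$

The action $a_t$ (the next architecture token) is sampled from this distribution, and feeding it back into the controller produces the next state.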
## Objective Function
- The objective is to learn a policy that maximizes the cumulative future reward received from any given time t until the terminal time T.
REINFORCE is a Monte-Carlo variant of policy gradients (Monte-Carlo: estimating expectations from random samples). The agent collects a trajectory τ of one episode using its current policy and uses it to update the policy parameters. Since one full trajectory must be completed to construct a sample, REINFORCE only updates once the episode has finished; it is an on-policy method.
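Written out for our case, where the reward $R$ is the child's validation accuracy and $a_{1:T}$ is the controller's token sequence (the baseline $b$ is a common variance-reduction term, an assumption not stated above):

$$
J(\theta) = \mathbb{E}_{a_{1:T}\sim \pi_\theta}[R]
$$

$$
\nabla_\theta J(\theta) \approx \frac{1}{m}\sum_{k=1}^{m}\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta\!\left(a_t^{(k)} \mid a_{1:t-1}^{(k)}\right)\left(R_k - b\right)
$$

Averaging over the $m$ sampled architectures is exactly the Monte-Carlo estimate of the expectation mentioned above.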



- Another successful effort called [MetaQNN](https://arxiv.org/abs/1611.02167)[7](https://theaisummer.com/neural-architecture-search/#fn-7) uses Q Learning with an [e-greedy exploration](https://paperswithcode.com/method/epsilon-greedy-exploration) mechanism and [experience replay](https://paperswithcode.com/method/experience-replay)
- Backprop
![[Pasted image 20221130122108.png]]
- RL Basics
- State: the information used to determine what happens next; the state is a function of the history
- Environment state: whatever data the environment uses to pick the next observation/reward
- Agent state: whatever information the agent uses to pick the next action
- This is the information used by RL algorithms
- History: the sequence of observations, actions, and rewards
- Markov state:
- Once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future
- $P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$
- Action: the decision our RL agent makes, given the state and rewards
- Reward: the feedback signal that the environment gives to the RL agent
- Value:
- ![[Pasted image 20221130145749.png]]
- Policy:
- ![[Pasted image 20221130145723.png]]
- Model:
- ![[Pasted image 20221130145806.png]]
https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf
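For reference alongside the screenshots, the standard definitions from the linked slides (policy, state-value function, and model):

$$
\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]
$$

$$
v_\pi(s) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\right]
$$

$$
\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]
$$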
- Q-value: the quality of a (state, action) pair, i.e. the expected return if we take this action in this state
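In symbols (the standard action-value definition, with discount factor $\gamma$):

$$
Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]
$$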
