# Conversational Planner - Literature Review
### A Data-Driven Approach for Learning to Control Computers
- key takeaways
- World of Bits
- basically OpenAI's idea to have a virtual agent control a computer to complete language-specified goals
- in this work, they make the environment more general, collect good human demonstrations, and show that combining RL and behavioural cloning solves it
- GAIA
- a takeaway for GAIA is that we might need to find demonstrations that we can label, i.e. assign them a goal descriptor in natural language
- methods
- MiniWob++
- in this paper, agents operate in a more general env in which they can only use mouse- and keyboard-based actions
- agent has access to a task input field (as well as a task descriptor) and it can copy-paste the fields to fill in DOM elements
- realtime env
- agent architecture (rough sketch after this list)
- a = ResNet(visual input)
- b = language transformer(text input)
- c = extra embeddings
- d = multimodal_transformer(a, b, c)
- action = LSTM(d)
- action type
- cursor coord
- keyboard-key index
- task-field index
- for copy-pasting into the DOM
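- A minimal PyTorch-style sketch of such an agent; module sizes, layer counts, pooling, and head dimensions are my assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ComputerControlAgent(nn.Module):
    """Sketch of the architecture above: visual encoder, language transformer,
    extra embeddings, multimodal fusion transformer, LSTM policy with action heads."""
    def __init__(self, d_model=512, n_keys=128, n_task_fields=8, n_action_types=4):
        super().__init__()
        self.visual_encoder = nn.Sequential(           # stand-in for the ResNet
            nn.Conv2d(3, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.text_encoder = nn.TransformerEncoder(     # stand-in for the language transformer
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.extra_emb = nn.Embedding(16, d_model)     # "extra embeddings" (e.g. step index)
        self.fusion = nn.TransformerEncoder(           # multimodal transformer
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.policy_core = nn.LSTM(d_model, d_model, batch_first=True)
        # one output head per action component
        self.action_type = nn.Linear(d_model, n_action_types)     # e.g. move / click / key / paste
        self.cursor_coord = nn.Linear(d_model, 2)                  # (x, y) cursor position
        self.key_index = nn.Linear(d_model, n_keys)                # keyboard-key index
        self.field_index = nn.Linear(d_model, n_task_fields)       # task field to copy-paste into the DOM

    def forward(self, screen, text_emb, extra_ids, lstm_state=None):
        # screen: (B, 3, H, W); text_emb: pre-embedded task text (B, T, d); extra_ids: (B, E)
        a = self.visual_encoder(screen).unsqueeze(1)               # (B, 1, d)
        b = self.text_encoder(text_emb)                            # (B, T, d)
        c = self.extra_emb(extra_ids)                              # (B, E, d)
        d = self.fusion(torch.cat([a, b, c], dim=1)).mean(dim=1)   # pooled multimodal features
        h, lstm_state = self.policy_core(d.unsqueeze(1), lstm_state)
        h = h.squeeze(1)
        return {"action_type": self.action_type(h),
                "cursor_coord": self.cursor_coord(h),
                "key_index": self.key_index(h),
                "field_index": self.field_index(h)}, lstm_state
```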
- human data collection
- 2.4M trajectories
- training
- co-training of RL and BC
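- The exact objective isn't in my notes; a hedged sketch of what co-training RL with BC could look like is simply a weighted sum of a policy-gradient loss on the agent's own rollouts and a cross-entropy BC loss on the human demonstrations (`bc_weight` is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def co_training_loss(env_logits, env_actions, env_advantages,
                     demo_logits, demo_actions, bc_weight=1.0):
    """Illustrative combined objective: RL (policy gradient) + BC (cross-entropy)."""
    # RL term: REINFORCE-style loss on actions the agent sampled in the environment
    logp = torch.distributions.Categorical(logits=env_logits).log_prob(env_actions)
    rl_loss = -(env_advantages.detach() * logp).mean()
    # BC term: imitate the human demonstration actions
    bc_loss = F.cross_entropy(demo_logits, demo_actions)
    return rl_loss + bc_weight * bc_loss
```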
- Results
- multi-task training is more sample (and compute) efficient
- the data is super important
- NOTE: we could still make progress on the sample efficiency front
- the DOM obs and DOM actions are (still) quite important
- this could be a limitation in more general cases where DOMs are not accessible
- compared to humans, the agent struggles in tasks where the real-time aspect is an issue
- maybe modulo this, the benchmark is solved (assuming a large pretraining corpus)
- Discussion
- in MiniWob++, the difficulty of dealing with human intent is put aside
### Introspective Action Advising for Interpretable Transfer Learning (under review)
- key takeaways:
- action advising: a teacher trained in a source task actively guides a student's exploration in a target task
- NOTE: this might be an interesting framework to cast our project in
- doesn't assume access to the policy weights, similarly to us
- at a high level, the teacher recommends actions (advice) to a student; the advice is transferable in a given state if the teacher's and student's value functions are close in that state
- problem: difficult to determine whether advice is not transferable due to a mismatch in state-values or due to the student still being undertrained
- solution: introspection, i.e. the teacher directly estimates the state-value function in the target task, w/ the assumption that refining the teacher's existing estimate of the state-values in the source task will lead to quicker convergence than estimating it from scratch
- 4) Introspective Action Advising (IAA)
- a policy is transferable in a given state if the teacher's and student's state-value estimates are close in that state
- there is a temporal cut-off $\tau$ after which we no longer assume the teacher's actions will lead to better rewards than the student's
- technicalities:
- burn-in period $\gamma$ for fine-tuning the teacher
- off-policy correction when training the student on data collected w/ the teacher's policy (and vice versa when fine-tuning the teacher)
- ultimately, 3 new hparams are introduced: $\epsilon, \tau, \gamma$
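- A minimal sketch of how I read the advising decision; treating $\epsilon$ as a closeness threshold on the value gap is my guess, and the object names are hypothetical:

```python
def choose_action(state, step, student, teacher, epsilon, tau):
    """Introspective action advising, roughly: follow the teacher's advice when its
    (fine-tuned) target-task value estimate is close to the student's, and stop
    advising entirely after the temporal cut-off tau."""
    if step >= tau:                         # past the cut-off, the teacher is no longer trusted
        return student.act(state), "student"
    v_teacher = teacher.value(state)        # teacher's estimate, fine-tuned in the target task
    v_student = student.value(state)
    if abs(v_teacher - v_student) <= epsilon:    # advice considered transferable in this state
        return teacher.act(state), "teacher"     # log the source for off-policy correction
    return student.act(state), "student"
```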
### Yoshua Bengio: large language models, higher cognition, causality, working memory, responsible AI
- key takeaways:
- LLMs are incredibly sample inefficient
- one hypothesis to improve them is by decoupling the (currently implicit) world model from the model performing inference on it
- accordingly, he believes in the model-based promise to increase sample efficiency
- NOTE: there seems to already be some overlap with us here, as we also want to learn an explicit world model
- probably we should look into his paper [Inductive Biases for Deep Learning of Higher-Level Cognition](https://arxiv.org/abs/2011.15091)
- might need GFlowNets to wow him
- why do we have such a constrained working memory?
- e.g. animals have a bigger one.
- Hypothesis: pushes us to build abstract and compact models of the world!
- an MBRL motivating example:
- learning how to drive on the left.
- of course the MBRL approach is favored over the model-free one.
- you want to update your world model and then learn a new policy within it, i.e. not by actually interacting with the env
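- A toy sketch of that contrast (all names here are hypothetical; the point is only that the few real interactions update the model, while the policy is re-learned inside the model):

```python
def adapt_to_left_hand_driving(world_model, policy, env,
                               n_real_steps=100, n_imagined_rollouts=10_000):
    """Model-based adaptation: a few real interactions update the world model,
    then the new policy is learned by acting inside the updated model."""
    # 1) small amount of real experience in the new setting (driving on the left)
    real_transitions = [env.step(policy.act(env.state())) for _ in range(n_real_steps)]
    world_model.update(real_transitions)          # update the model, not the policy
    # 2) learn the new policy purely in imagination, without touching the real env
    for _ in range(n_imagined_rollouts):
        imagined_trajectory = world_model.rollout(policy)
        policy.improve(imagined_trajectory)       # any RL update on imagined data
    return policy
```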
- consciousness is a **hard** attention mechanism
- might be a good inductive bias to bake into our algos
- actually, a follow-up of his attention paper showed that soft and hard attention performed comparably
- to his surprise, as soft attention mechanisms backpropagate much better
- GFlowNets:
- think of it as a learned MCMC for sampling parameters
- potential tool to decouple the world model from the inference
- tool to learn posteriors over data structures like graphs
- so potentially useful to learn a posterior over the causal structure of the env (a causal world model)
- your model of the world should be Bayesian: you should always entertain multiple theories about it
- planning at an abstract level (HRL)
- Bengio wishes he had the solution to this!
- high risk, high reward!
- abstract plans can be translated into sentences
- NOTE: project alignment
- BONUS:
- Bengio has 2 productive states:
- walks
- right when he wakes up, 30min in bed, eyes closed, waiting for solutions to pop up.
### Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
- key takeaways:
- addresses the problem of long-term planning and planning efficiency
- interactive planning via an LLM
- high-level idea:
- a Planner (LLM) makes a high-level plan, that is, a partially-ordered set of subgoals;
- a Selector chooses the next subgoal;
- a low-level controller attempts the next subgoal;
- whenever the low-level controller fails, the Descriptor takes in the state and communicates it to the Explainer, which passes an explanation to the Planner (LLM) in language so the plan can be revised.
- algorithm (a rough control-flow sketch follows the component list):
- Descriptor: invoked when the low-level controller fails at the plan
- input = feedback generated by the agent during the execution of the task
- output $d_t$ = description of the current state in NL
- Explainer: LLM
- input: $d_t$ as well as the previous plan $P_{t-1}$
- output: an explanation $p_t$ of why the plan failed
- Planner: LLM
- input: the explanation $p_t$
- output: revised plan $P_t$
- Selector:
- input: plan $P_t$
- output: next goal to attempt $g_t$
- Low-level controller
- $\pi(a_t| s_{t-1}, g_t)$
- trained w/ behaviour cloning
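- Putting the pieces together, a hedged sketch of the interactive loop (prompting details and all object interfaces are assumptions; only the control flow follows the description above):

```python
def deps_episode(env, planner, explainer, descriptor, selector, controller, max_replans=5):
    """Describe-Explain-Plan-Select loop: plan, pick a subgoal, execute;
    on failure, describe the state, explain the failure, and revise the plan."""
    task = env.task_description()
    plan = planner.plan(task)                            # P_0: partially-ordered set of subgoals
    replans = 0
    while not plan.complete():
        goal = selector.next_goal(plan, env.state())     # g_t: next subgoal to attempt
        success, feedback = controller.execute(goal, env)  # goal-conditioned low-level policy, trained w/ BC
        if success:
            plan.mark_done(goal)
        else:
            if replans == max_replans:                   # allowed plan iterations matter (see results)
                return False
            d_t = descriptor.describe(env.state(), feedback)  # state description in NL
            why = explainer.explain(d_t, plan)                # p_t: why the plan failed
            plan = planner.revise(task, plan, why)            # P_t: revised plan
            replans += 1
    return True
```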
- Selector (is your world model):
- trained on offline logs to predict the next states
- you can use it to predict the time of completion of goals (proximity)
- then, it can choose which goals to attempt next (rough sketch below)
- NOTE: might be related to the SNOW use case of training a world model on some (dirty) logs
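- A hedged sketch of that selector idea: the notes say it predicts next states, which I collapse here into directly regressing the remaining steps to a goal (an assumption), then picking the open subgoal with the smallest predicted horizon; the architecture and dimensions are made up.

```python
import torch
import torch.nn as nn

class HorizonSelector(nn.Module):
    """Predicts a goal's proximity (steps to completion) from the current state;
    trainable by regression on offline logs of (state, goal, remaining steps)."""
    def __init__(self, state_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                      # predicted steps-to-completion

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1)).squeeze(-1)

    @torch.no_grad()
    def next_goal(self, state, candidate_goals):
        # attempt the open subgoal predicted to be completed the soonest
        horizons = torch.stack([self(state, g) for g in candidate_goals])
        return candidate_goals[int(horizons.argmin())]
```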
- results:
- previous work < DEP < DEPS
- the number of allowed plan iterations matters, esp. for longer tasks
- DEPS is able to solve the hardest task (MineDiamond) in a **zero-shot** fashion, i.e. it was never trained to do this specific task
### Sparks of Artificial General Intelligence: Early experiments with GPT-4 (unfinished)
* key takeaways:
* (non-multimodal early version of) GPT-4 is really powerful:
* really good at coding in particular
* e.g. really solid at TikZ, can code a unicorn
- still struggling in tasks requiring some sort of planning, especially "discontinuous" tasks, as opposed to "incremental" tasks
- or similarly, good at tasks with local constraints, but poor under global constraints
- probably because it can only think linearly, due to its autoregressive next-word-prediction paradigm
- which manifests as the model's lack of planning, working memory, ability to backtrack, and reasoning
- to me this example requires some sort of planning:
- you have to plan **ahead**
- unless of course this example was in the training data
- GPT-4 can solve more complex tasks if it's allowed to reiterate/revise its answer
- potentially, working memory was the bottleneck and each iteration increases it.
- GPT-4 can reason about high-level maths
- GPT-4 can use tools like calculators, search engines, and subroutines
- important sections for us:
- 3) coding
- 4) maths
- 5) interacting with the world
- 5.1) Tool Use
- 5.2) Embodied interactions
- 8) Limitations of autoregressive architecture
- 3) Coding
- able to develop a 3D game:
- it would be hard to succeed at this task w/o planning ahead?
- Question: incremental task, or discontinuous task?
- quite fascinating that the plan emerges only from the transformer's internal representations
- GPT-4 is able to reverse-engineer assembly code
- I think this example highlights that when GPT-4 is operating in a sandbox/simulator, i.e. sampling is cheap, maybe not much planning is required: brute force will do.
- GPT-4 can reason about code execution
- i.e. it can come up with a model of the codebase from in-context learning
- 4) Mathematical abilities
- 4.3) Math Modeling in various domains
- ability to come up w/ complex models:
- some hint that LLMs could guide your strategy to learn a world model
- ability to decompose a problem
- 4.4) higher level maths:
- can solve queries that require discontinuous jumps
- 5) interacting with the world
- 5.1) Tool use
- GPT-4 can use tools:
- potentially, you could prompt it with your skills/subroutines, and it could come up with complex plans to solve complex queries
- i.e. use it for high-level abstract planning
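- For intuition, a tiny sketch of what that kind of prompt could look like (the skill names and prompt format are entirely made up):

```python
SKILLS = [                      # hypothetical subroutines exposed to the model
    "open_url(url)",
    "click(element)",
    "type_text(field, text)",
    "read_table(css_selector)",
    "send_email(to, subject, body)",
]

def planning_prompt(task: str) -> str:
    """Ask the model to compose the known subroutines into a high-level plan."""
    skill_list = "\n".join(f"- {s}" for s in SKILLS)
    return ("You can only act through the following subroutines:\n"
            f"{skill_list}\n\n"
            f"Task: {task}\n"
            "Write a numbered plan where each step is exactly one subroutine call.")

print(planning_prompt("Find today's EUR/USD rate and email it to the team."))
```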
- 5.2) embodied interactions
- can build maps on-the-fly:
- seems to be able to help another agent w/ causal discovery (here we are trying to find the cause of the leak):
* 8) Limitations of autoregressive architecture highlighted by GPT-4
* successful example which requires planning ahead
* NOTE: the only working memory the model can use for planning is in the internal representations...
* i.e. no scratchpad, no inner dialogue
* GPT-4 struggles to lay out a plan on its own, but can execute plans
* GPT-4 still bad at simple arithmetic tasks
* probably because it doesn't have access to a scratchpad/inner dialogue
* NOTE: do we care if GPT is able to use tools like calculators?
* 8.2 lack of planning in text generation:
* really good with local constraints
* not so good at global constraints
* incremental vs discontinuous tasks:
* analogy w/ fast and slow thinking (Kahneman)