# Conversational Planner - Literature Review
### A Data-Driven Approach for Learning to Control Computers
- key takeaways
- World of Bits
- basically OpenAI's idea to have a virtual agent control a computer to complete language-specified goals
- in this work, they make the environment more general, collect good human demonstrations, and show that combining RL and behavioural cloning solves it
- GAIA
- a takeaway for GAIA is that we might need to find demonstrations that we can label, i.e. assign them a goal descriptor in natural language
- methods
- MiniWob++
- in this paper, agents operate in a more general env in which they can only use mouse- and keyboard-based actions
- agent has access to a task input field (as well as a task descriptor) and it can copy-paste the fields to fill in DOM elements
- realtime env
- agent architecture (rough sketch after this list)
- a = ResNet(visual input)
- b = language transformer(text input)
- c = extra embeddings
- d = multimodal_transformer(a, b, c)
- action = LSTM(d)
- action type
- cursor coord
- keyboard-key index
- task-field index
- for copy-pasting into the DOM
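- A minimal PyTorch-style sketch of such an agent; module sizes, layer counts, pooling, and head dimensions are my assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ComputerControlAgent(nn.Module):
    """Sketch of the architecture above: visual encoder, language transformer,
    extra embeddings, multimodal fusion transformer, LSTM policy with action heads."""
    def __init__(self, d_model=512, n_keys=128, n_task_fields=8, n_action_types=4):
        super().__init__()
        self.visual_encoder = nn.Sequential(           # stand-in for the ResNet
            nn.Conv2d(3, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.text_encoder = nn.TransformerEncoder(     # stand-in for the language transformer
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.extra_emb = nn.Embedding(16, d_model)     # "extra embeddings" (e.g. step index)
        self.fusion = nn.TransformerEncoder(           # multimodal transformer
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.policy_core = nn.LSTM(d_model, d_model, batch_first=True)
        # one output head per action component
        self.action_type = nn.Linear(d_model, n_action_types)     # e.g. move / click / key / paste
        self.cursor_coord = nn.Linear(d_model, 2)                  # (x, y) cursor position
        self.key_index = nn.Linear(d_model, n_keys)                # keyboard-key index
        self.field_index = nn.Linear(d_model, n_task_fields)       # task field to copy-paste into the DOM

    def forward(self, screen, text_emb, extra_ids, lstm_state=None):
        # screen: (B, 3, H, W); text_emb: pre-embedded task text (B, T, d); extra_ids: (B, E)
        a = self.visual_encoder(screen).unsqueeze(1)               # (B, 1, d)
        b = self.text_encoder(text_emb)                            # (B, T, d)
        c = self.extra_emb(extra_ids)                              # (B, E, d)
        d = self.fusion(torch.cat([a, b, c], dim=1)).mean(dim=1)   # pooled multimodal features
        h, lstm_state = self.policy_core(d.unsqueeze(1), lstm_state)
        h = h.squeeze(1)
        return {"action_type": self.action_type(h),
                "cursor_coord": self.cursor_coord(h),
                "key_index": self.key_index(h),
                "field_index": self.field_index(h)}, lstm_state
```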
- human data collection
- 2.4M trajectories
- training
- co-training of RL and BC
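- The exact objective isn't in my notes; a hedged sketch of what co-training RL with BC could look like is simply a weighted sum of a policy-gradient loss on the agent's own rollouts and a cross-entropy BC loss on the human demonstrations (`bc_weight` is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def co_training_loss(env_logits, env_actions, env_advantages,
                     demo_logits, demo_actions, bc_weight=1.0):
    """Illustrative combined objective: RL (policy gradient) + BC (cross-entropy)."""
    # RL term: REINFORCE-style loss on actions the agent sampled in the environment
    logp = torch.distributions.Categorical(logits=env_logits).log_prob(env_actions)
    rl_loss = -(env_advantages.detach() * logp).mean()
    # BC term: imitate the human demonstration actions
    bc_loss = F.cross_entropy(demo_logits, demo_actions)
    return rl_loss + bc_weight * bc_loss
```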
- Results
- multi-task training is more sample (and compute) efficient
- the data is super important
- NOTE: we could still make progress on the sample efficiency front
- the DOM obs and DOM actions are (still) quite important
- this could be a limitation in more general cases where DOMs are not accessible
- compared to humans, the agent struggles in tasks where the real-time aspect is an issue
- maybe modulo this, the benchmark is solved (assuming a large pretraining corpus)
- Discussion
- in MiniWob++, the difficulty of dealing with human intent is put aside
### Introspective Action Advising for Interpretable Transfer Learning (under review)
- key takeaways:
- action advising: a teacher trained in a source task actively guides a student's exploration in a target task
- NOTE: this might be an interesting framework to cast our project in
- doesn't assume access to the policy weights, similarly to us
- at a high level, the teacher recommends actions (advice) to a student; the advice is transferable in a given state if the teacher's and student's value functions are close in that state
- problem: difficult to determine whether advice is not transferable due to a mismatch in state-values or due to the student still being undertrained
- solution: introspection, i.e. the teacher directly estimates the state-value function in the target task, w/ the assumption that refining the teacher's existing estimate of the state-values in the source task will lead to quicker convergence than estimating it from scratch
- 4) Introspective Action Advising (IAA)
- a policy is transferable in a given state if the teacher's and student's state-value estimates are close in that state
- there is a temporal cut-off $\tau$ after which we no longer assume the teacher's actions will lead to better rewards than the student's
- technicalities:
- burn-in period $\gamma$ for fine-tuning the teacher
- off-policy correction when training the student on data collected w/ the teacher's policy (and vice versa when fine-tuning the teacher)
- ultimately, 3 new hparams are introduced: $\epsilon, \tau, \gamma$
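- A minimal sketch of how I read the advising decision; treating $\epsilon$ as a closeness threshold on the value gap is my guess, and the object names are hypothetical:

```python
def choose_action(state, step, student, teacher, epsilon, tau):
    """Introspective action advising, roughly: follow the teacher's advice when its
    (fine-tuned) target-task value estimate is close to the student's, and stop
    advising entirely after the temporal cut-off tau."""
    if step >= tau:                         # past the cut-off, the teacher is no longer trusted
        return student.act(state), "student"
    v_teacher = teacher.value(state)        # teacher's estimate, fine-tuned in the target task
    v_student = student.value(state)
    if abs(v_teacher - v_student) <= epsilon:    # advice considered transferable in this state
        return teacher.act(state), "teacher"     # log the source for off-policy correction
    return student.act(state), "student"
```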
### Yoshua Bengio: large language models, higher cognition, causality, working memory, responsible AI
- key takeaways:
- LLMs are incredibly sample inefficient
- one hypothesis to improve them is by decoupling the (currently implicit) world model from the model performing inference on it
- accordingly, he believes in the model-based promise to increase sample efficiency
- NOTE: there seems to already be some overlap with us here, as we also want to learn an explicit world model
- probably we should look into his paper [Inductive Biases for Deep Learning of Higher-Level Cognition](https://arxiv.org/abs/2011.15091)
- might need GFlowNets to wow him
- why do we have such a constrained working memory?
- e.g. animals have a bigger one.
- Hypothesis: pushes us to build abstract and compact models of the world!
- an MBRL motivating example:
- learning how to drive on the left.
- of course the MBRL approach is favored over the model-free one.
- you want to update your world model and then learn a new policy within it, i.e. not by actually interacting with the env
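- A toy sketch of that contrast (all names here are hypothetical; the point is only that the few real interactions update the model, while the policy is re-learned inside the model):

```python
def adapt_to_left_hand_driving(world_model, policy, env,
                               n_real_steps=100, n_imagined_rollouts=10_000):
    """Model-based adaptation: a few real interactions update the world model,
    then the new policy is learned by acting inside the updated model."""
    # 1) small amount of real experience in the new setting (driving on the left)
    real_transitions = [env.step(policy.act(env.state())) for _ in range(n_real_steps)]
    world_model.update(real_transitions)          # update the model, not the policy
    # 2) learn the new policy purely in imagination, without touching the real env
    for _ in range(n_imagined_rollouts):
        imagined_trajectory = world_model.rollout(policy)
        policy.improve(imagined_trajectory)       # any RL update on imagined data
    return policy
```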
- consciousness is a **hard** attention mechanism
- might be a good inductive bias to bake into our algos
- actually, a follow-up of his attention paper showed that soft and hard attention performed comparably
- to his surprise, as soft attention mechanisms backpropagate much better
- GFlowNets:
- think of it as a learned MCMC for sampling parameters
- potential tool to decouple the world model from the inference
- tool to learn posteriors over data structures like graphs
- so potentially useful to learn a posterior over the causal structure of the env (a causal world model)
- your model of the world should be Bayesian: you should always entertain multiple theories about it
- planning at an abstract level (HRL)
- Bengio wishes he had the solution to this!
- high risk, high reward!
- abstract plans can be translated into sentences
- NOTE: project alignment
- BONUS:
- Bengio has 2 productive states:
- walks
- right when he wakes up, 30min in bed, eyes closed, waiting for solutions to pop up.
### Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
- key takeaways:
- addresses the problem of long-term planning and planning efficiency
- interactive planning via an LLM
- high-level idea:
- a Planner (LLM) makes a high-level plan, that is, a partially-ordered set of subgoals;
- a Selector chooses the next subgoal;
- a low-level controller attempts the next subgoal;
- whenever the low-level controller fails, the Descriptor takes in the state and communicates it to the Explainer, which passes an explanation to the Planner (LLM) in language so the plan can be revised.
- algorithm (a rough control-flow sketch follows the component list):
- Descriptor: invoked when the low-level controller fails at the plan
- input = feedback generated by the agent during the execution of the task
- output $d_t$ = description of the current state in NL
- Explainer: LLM
- input: $d_t$ as well as the previous plan $P_{t-1}$
- output: an explanation $p_t$ of why the plan failed
- Planner: LLM
- input: the explanation $p_t$
- output: revised plan $P_t$
- Selector:
- input: plan $P_t$
- output: next goal to attempt $g_t$
- Low-level controller
- $\pi(a_t| s_{t-1}, g_t)$
- trained w/ behaviour cloning
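- Putting the pieces together, a hedged sketch of the interactive loop (prompting details and all object interfaces are assumptions; only the control flow follows the description above):

```python
def deps_episode(env, planner, explainer, descriptor, selector, controller, max_replans=5):
    """Describe-Explain-Plan-Select loop: plan, pick a subgoal, execute;
    on failure, describe the state, explain the failure, and revise the plan."""
    task = env.task_description()
    plan = planner.plan(task)                            # P_0: partially-ordered set of subgoals
    replans = 0
    while not plan.complete():
        goal = selector.next_goal(plan, env.state())     # g_t: next subgoal to attempt
        success, feedback = controller.execute(goal, env)  # goal-conditioned low-level policy, trained w/ BC
        if success:
            plan.mark_done(goal)
        else:
            if replans == max_replans:                   # allowed plan iterations matter (see results)
                return False
            d_t = descriptor.describe(env.state(), feedback)  # state description in NL
            why = explainer.explain(d_t, plan)                # p_t: why the plan failed
            plan = planner.revise(task, plan, why)            # P_t: revised plan
            replans += 1
    return True
```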
- Selector (is your world model):
- trained on offline logs to predict the next states
- you can use it to predict the time of completion of goals (proximity)
- then, it can choose which goals to attempt next (rough sketch below)
- NOTE: might be related to the SNOW use case of training a world model on some (dirty) logs
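- A hedged sketch of that selector idea: the notes say it predicts next states, which I collapse here into directly regressing the remaining steps to a goal (an assumption), then picking the open subgoal with the smallest predicted horizon; the architecture and dimensions are made up.

```python
import torch
import torch.nn as nn

class HorizonSelector(nn.Module):
    """Predicts a goal's proximity (steps to completion) from the current state;
    trainable by regression on offline logs of (state, goal, remaining steps)."""
    def __init__(self, state_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                      # predicted steps-to-completion

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1)).squeeze(-1)

    @torch.no_grad()
    def next_goal(self, state, candidate_goals):
        # attempt the open subgoal predicted to be completed the soonest
        horizons = torch.stack([self(state, g) for g in candidate_goals])
        return candidate_goals[int(horizons.argmin())]
```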
- results:
- previous work < DEP < DEPS
- the number of allowed plan iterations matters, esp. for longer tasks
- DEPS is able to solve the hardest task (MineDiamond) in a **zero-shot** fashion, i.e. it was never trained to do this specific task
### Sparks of Artificial General Intelligence: Early experiments with GPT-4 (unfinished)
* key takeaways:
* (non-multimodal early version of) GPT-4 is really powerful:
* really good at coding in particular
* e.g. really solid at TikZ, can code a unicorn
- still struggling in tasks requiring some sort of planning, especially "discontinuous" tasks, as opposed to "incremental" tasks
- or similarly, good at tasks with local constraints, but poor under global constraints
- probably because it can only think linearly, due to its autoregressive next-word-prediction paradigm
- which manifests as the model's lack of planning, working memory, ability to backtrack, and reasoning
- to me this example requires some sort of planning:
- you have to plan **ahead**
- unless of course this example was in the training data
- GPT-4 can solve more complex tasks if it's allowed to reiterate/revise its answer
- potentially, working memory was the bottleneck and each iteration increases it.
- GPT-4 can reason about high-level maths
- GPT-4 can use tools like calculators, search engines, and subroutines
- important sections for us:
- 3) coding
- 4) maths
- 5) interacting with the world
- 5.1) Tool Use
- 5.2) Embodied interactions
- 8) Limitations of autoregressive architecture
- 3) Coding
- able to develop a 3D game:
- it would be hard to succeed at this task w/o planning ahead?
- Question: incremental task, or discontinuous task?
- quite fascinating that the plan emerges only from the transformer's internal representations
- GPT-4 is able to reverse-engineer assembly code
- I think this example highlights that when GPT-4 is operating in a sandbox/simulator, i.e. sampling is cheap, maybe not much planning is required: brute force will do.
- GPT-4 can reason about code execution
- i.e. it can come up with a model of the codebase from in-context learning
- 4) Mathematical abilities
- 4.3) Math Modeling in various domains
- ability to come up w/ complex models:
- some hint that LLMs could guide your strategy to learn a world model
- ability to decompose a problem
- 4.4) higher level maths:
- can solve queries that require discontinuous jumps
- 5) interacting with the world
- 5.1) Tool use
- GPT-4 can use tools:
- potentially, you could prompt it with your skills/subroutines, and it could come up with complex plans to solve complex queries
- i.e. use it for high-level abstract planning
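- For intuition, a tiny sketch of what that kind of prompt could look like (the skill names and prompt format are entirely made up):

```python
SKILLS = [                      # hypothetical subroutines exposed to the model
    "open_url(url)",
    "click(element)",
    "type_text(field, text)",
    "read_table(css_selector)",
    "send_email(to, subject, body)",
]

def planning_prompt(task: str) -> str:
    """Ask the model to compose the known subroutines into a high-level plan."""
    skill_list = "\n".join(f"- {s}" for s in SKILLS)
    return ("You can only act through the following subroutines:\n"
            f"{skill_list}\n\n"
            f"Task: {task}\n"
            "Write a numbered plan where each step is exactly one subroutine call.")

print(planning_prompt("Find today's EUR/USD rate and email it to the team."))
```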
- 5.2) embodied interactions
- can build maps on-the-fly:
- seems to be able to help another agent w/ causal discovery (here we are trying to find the cause of the leak):
* 8) Limitations of autoregressive architecture highlighted by GPT-4
* successful example which requires planning ahead
* NOTE: the only working memory the model can use for planning is in the internal representations...
* i.e. no scratchpad, no inner dialogue
* GPT-4 struggles to lay out a plan on its own, but can execute plans
* GPT-4 still bad at simple arithmetic tasks
* probably because it doesn't have access to a scratchpad/inner dialogue
* NOTE: do we care if GPT is able to use tools like calculators?
* 8.2 lack of planning in text generation:
* really good with local constraints
* not so good at global constraints
* incremental vs discontinuous tasks:
* analogy w/ fast and slow thinking (Kahneman)