# Language as a Cognitive Tool: Dall-E, Humans and Vygotskian RL Agents

Have you heard about DALL-E? Beyond the funny mashup between Pixar's robot and the surrealist painter, Dall-E is OpenAI's new transformer trained to compose images from text descriptions [[DALL-E](https://openai.com/blog/dall-e/)]. Most people agree that DALL-E's ability to blend concepts into natural images is remarkable; it often composes just like you would. Look at these avocado-chairs:

<p><center>
<img src="https://i.imgur.com/A76x4tV.png" alt style="margin-top:25px">
<em>Output from OpenAI's Dall-E model when prompted by: "an armchair in the shape of an avocado" <a href="https://openai.com/blog/dall-e/">[Dall-E]</a>.</em>
</center></p>

If it's so impressive, it's because this type of composition feels like a very special human ability. The ability to compose ideas is indeed at the source of a whole bunch of uniquely-human capacities: it is key for abstract planning, efficient generalization, few-shot learning, imagination and creativity. If we were ever to design truly autonomous learning agents, they would certainly need to demonstrate such compositional imagination skills.

In this blog post, we argue that aligning language and the physical world helps transfer the structure of our compositional language to the messy, continuous physical world. As we will see, language is much more than a communication tool. Humans extensively use it as a cognitive tool, a cornerstone of their development. We'll go over developmental psychology studies supporting this idea and will draw parallels with recent advances in machine learning. This will lead us to introduce _language-augmented learning agents_: a family of embodied learning agents that leverage language as a cognitive tool.

## Language as a Cognitive Tool in Humans

Ask anyone what language is about. Chances are, they will answer something along these lines: "language is for people to communicate their thoughts to each other". They're right of course, but research in developmental psychology shows that language is a lot more. Vygotsky, a Soviet psychologist of the 1930s, pioneered the idea of the importance of language in the cognitive development of humans [[Vygotsky, 1934](https://mitpress.mit.edu/books/thought-and-language)]. This idea was then developed by a body of research ranging from developmental psychology, through linguistics, to philosophy [[Whorf, 1956](https://mitpress.mit.edu/books/language-thought-and-reality); [Rumelhart et al., 1986](https://www.cs.toronto.edu/~fritz/absps/pdp14.pdf); [Dennett, 1991](https://en.wikipedia.org/wiki/Consciousness_Explained); [Berk, 1994](https://www.scientificamerican.com/article/why-children-talk-to-themselves/); [Clark, 1998](https://era.ed.ac.uk/bitstream/handle/1842/1311/magic.pdf?sequence=1&isAllowed=y) and [Carruthers, 2002](https://drum.lib.umd.edu/bitstream/handle/1903/4339/Cognitive.Functions.of.Language.pdf?sequence=3)].

<p><center>
<img src="https://i.imgur.com/MlOll5z.jpg" alt width=250 style="margin-top:25px">
</center>
<em><b>Lev Vygotsky and his socio-cultural theory of cognitive development.</b> Vygotsky is mainly known for two ideas. First, although language is initially learned through and for communication with others, it is later internalized and used as a cognitive tool to organize thoughts. As a consequence, important aspects of high-level cognitive functions have social origins.
Second, caretakers often set up a _Zone of Proximal Development_ to facilitate children's learning. This can take the form of a scaffolding of the environment, demonstrations or linguistic aids adapted to the current level of the child---as the child progresses, caretakers propose new challenges just beyond their current capabilities.</em></p>

Let us start with words. Words are invitations to form categories [[Waxman and Markow, 1995](https://www.sciencedirect.com/science/article/abs/pii/S001002858571016X)]. Hearing the same word in a variety of contexts invites humans to compare situations, find similarities and differences, and build symbolic representations of objects (dogs, cats, bottles) and their attributes (colors, shapes, materials). With words, the continuous world can be structured into mental entities, symbols which, when composed and manipulated, enable reasoning and give rise to the incredible expressiveness and flexibility of human thoughts [[Whitehead, 1927](https://archive.org/details/in.ernet.dli.2015.166010)]. In the same way, relational language seems to offer _invitations to compare_ and is thought to be the engine of analogical and abstract reasoning [[Gentner and Hoyos, 2017](https://onlinelibrary.wiley.com/doi/10.1111/tops.12278)].

<p><center>
<img src="https://i.imgur.com/jJ3jlUN.png" alt width=600 style="margin-top:25px">
<em>The Language Swiss Army Knife -- key cognitive functions enhanced by language.</em>
</center></p>

More generally, the language we speak seems to strongly shape the way we think. [Lera Boroditsky's Ted Talk](https://www.youtube.com/watch?v=RKK7wGAYP6k) presents some examples of these effects. The perception of colors, for instance, is directly affected by the color-related words our language contains [[Winawer et al., 2007](https://www.pnas.org/content/pnas/104/19/7780.full.pdf)]. Whether your language uses one word for each number or simply categorizes one, two and many will impact your ability to reason with numbers abstractly and thus to develop abilities for math and science [[Frank et al., 2008](https://www.sciencedirect.com/science/article/pii/S0010027708001042?casa_token=SCkMeHibf_YAAAAA:BonfbLRNN9-eymEOohJK_ijP6MbOWawddS2uxsSDDsjXMidUN1OyfNuDAJIo1-qbTfj0pTEGVQ)]. Whether you use a geocentric or egocentric location system (west/east vs left/right) affects how you represent time abstractly [[Boroditsky and Gaby, 2010](https://www.researchgate.net/profile/Alice-Gaby-2/publication/47500080_Remembrances_of_Times_East/links/02e7e53c5dc63e2b80000000/Remembrances-of-Times-East.pdf)].

> *Vygotsky, Berk and others showed that private speech was instrumental to the ability of children to reason and solve tasks*

Language can also be used as a tool to solve problems. Piaget first described that two- to seven-year-old children often use private speech or self-talk to describe their ongoing activities and organize themselves, but thought this was a sign of cognitive immaturity [[Piaget, 1923](https://newlearningonline.com/literacies/chapter-14/piaget-on-the-language-and-thought-of-the-child)]. Vygotsky, Berk and others showed that private speech was instrumental to the ability of children to reason and solve tasks: the harder the task, the more intensively children used it for planning [[Vygotsky, 1934](https://mitpress.mit.edu/books/thought-and-language), [Berk, 1994](https://www.scientificamerican.com/article/why-children-talk-to-themselves/)].
Far from being left behind as children grow up, private speech, Vygotsky showed, is internalized to become inner speech, the little voice in your head [[Vygotsky, 1934](https://mitpress.mit.edu/books/thought-and-language) and [Kohlberg, 1968](https://www.jstor.org/stable/1126979?origin=crossref)]. Children who cannot yet formulate sentences like "at the left of the blue wall" show decreased spatial orientation capacities in such contexts compared to children who can. Interfering with adults' inner speech by asking them to repeat sentences also hinders their ability to orient spatially [[Hermer-Vazquez et al., 2001](https://www.researchgate.net/profile/Linda-Hermer-2/publication/12165254_Language_space_and_the_development_of_cognitive_flexibility_in_humans_The_case_of_two_spatial_memory_tasks/links/5c2e1ecba6fdccd6b58f7d23/Language-space-and-the-development-of-cognitive-flexibility-in-humans-The-case-of-two-spatial-memory-tasks.pdf)].

Because language is---at least partially---compositional, we can immediately generalize and understand sentences that we have never heard before. This is called *systematic generalization*: the ability to automatically transfer the meaning of a few thoughts to a myriad of other thoughts [[Fodor and Pylyshyn, 1988](https://uh.edu/~garson/F&P1.PDF)]. Compositionality also underlies the reverse process: *language productivity* [[Chomsky, 1957](http://217.64.17.124:8080/xmlui/bitstream/handle/123456789/557/syntactic_structures%20(1).pdf?sequence=1)]. If words and ideas are like Lego blocks, we can combine them recursively in infinite ways to form an infinite space of sentences and thoughts. This mechanism powers the imagination of new ideas and concepts that underlies many of the high-level cognitive abilities of humans. While our language productivity helps us generate new concepts like *"an elephant skiing on a lava flow"*, systematic generalization lets us understand them.

Once we have composed a new idea via language, it seems we can effortlessly picture what it would look like. Not convinced? Try to imagine what a cat-bus looks like and check [here](https://www.google.com/search?q=miyazaki+catbus&tbm=isch&) whether what you imagine matches Miyazaki's creature. While most humans take this Dall-E-like ability for granted, studies have shown that early contact with a compositional, recursive language is necessary to develop it. In neurobiology, this conscious, purposeful process of synthesizing novel mental images from two or more objects stored in memory is called prefrontal synthesis (PFS) [[Vyshedskiy, 2019](https://riojournal.com/article/38546/)]. Children born deaf with no access to recursive sign language and Romanian children left on their own in Ceausescu's orphanages---among others---were shown to lack PFS abilities and failed to acquire abstract compositional thinking even after intensive language therapy [[Vyshedskiy, 2019](https://riojournal.com/article/38546/)].

When PFS works, it seems to be easily triggered by language. The embodied simulation hypothesis indeed argues that humans build rich, multi-sensory representations prompted by language. If you read the sentence "he saw a pink elephant in the garden", chances are you're visualizing a pink elephant. More generally, understanding language seems to involve parts of the brain that would be active if you were in the situation described by the sentence. Reading about pink elephants? Your visual cortex lights up. Reading about someone cutting a tree? Your motor cortex lights up.
This was even shown to work for metaphorical uses of words [[Bergen, 2012](https://www.basicbooks.com/titles/benjamin-k-bergen/louder-than-words/9780465033331/)]. In a nutshell, humans use language as a tool for many of their high-level cognitive skills, including abstraction, analogies, planning, expressiveness, imagination, creativity and mental simulation.

## Language-World Alignment

Language can only help if it is grounded in the physical world. Before drawing an avocado-chair, you need to know what avocados and chairs are; you need to know what drawing means. Our source of information comes from __aligned data__: as children, we observe and experience the physical world and hear corresponding linguistic descriptions. The mother could say: "This is a chair" when the infant is looking at it. The father could say: "Let's put you on the chair" while the infant experiences being transported and sat on a chair. As infants hear about chairs while experiencing chairs (seeing them, sitting on them, bumping their feet against them), they strengthen the association and build their knowledge of chairs.

> Aligning language and physical data might just be about transferring the discrete structure of language onto the continuous, messy real world.

It is now time to turn to artificial learning systems: AI systems also use aligned data! A basic example is the datasets of image and label pairs used by image classification algorithms. In language-conditioned reinforcement learning (LC-RL), engineers train learning agents to perform behaviors satisfying linguistic instructions: they reward agents when their state matches---is aligned with---the instruction. In real life, however, children are rarely provided with object-label pairs or explicit instructions. Caretakers mostly provide descriptive feedback: they describe events that are deemed novel or relevant for the child [[Tomasello, 2005](https://www.hup.harvard.edu/catalog.php?isbn=9780674017641) and [Bornstein et al., 1992](https://www.jstor.org/stable/1131235?seq=1)].

The IMAGINE agent receives such descriptive feedback [[Colas et al., 2020](https://arxiv.org/pdf/2002.09253.pdf)]. In IMAGINE, we look at how language can be used as a tool to imagine creative goals that power an autonomous exploration of the environment---this will be discussed below. In a controlled [Playground](https://github.com/flowersteam/playground_env) environment, the IMAGINE agent freely explores and receives simple linguistic descriptions of interesting behaviors from a simulated caretaker. If the agent hears "you grasped a red rose", it turns this description into a potential goal and will try to grasp red roses again. To do so, it needs to 1) understand what that means, and 2) learn to replicate the interaction. The description it just received is an example of aligned data: a trajectory and a corresponding linguistic description. This data can be used to learn a reward function (1): a function that helps the agent recognize when the current state matches---is aligned with---the linguistic goal description. When the two match, the reward function generates a positive reward: the goal is reached. Given a few examples, the agent correctly recognizes when goals are achieved and can learn a policy to perform the required interaction via standard reinforcement learning using self-generated goals and rewards (2).
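To make this concrete, here is a minimal sketch of such a learned reward function. The architecture and names below are illustrative assumptions (IMAGINE's actual reward function is a more elaborate, object-factored network), but the training signal is the same: aligned state-description pairs labeled as matching or not.

```python
import torch
import torch.nn as nn

class DescriptionRewardFunction(nn.Module):
    """Learned reward function R(state, description) -> P(description holds in state).

    Hypothetical architecture for illustration: a bag-of-words sentence encoder
    concatenated with a flat state vector, followed by a small classifier.
    """
    def __init__(self, state_dim: int, vocab_size: int, embed_dim: int = 64):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)  # sentence encoder
        self.classifier = nn.Sequential(
            nn.Linear(state_dim + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, token_ids):
        goal = self.embedding(token_ids)                      # (batch, embed_dim)
        logits = self.classifier(torch.cat([state, goal], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)              # P(description is satisfied)

def reward_loss(model, states, token_ids, labels):
    """Train on aligned data: states paired with descriptions that hold (label 1)
    and descriptions that do not hold (label 0, e.g. sampled from other trajectories)."""
    return nn.functional.binary_cross_entropy(model(states, token_ids), labels)

# At exploration time, the same model turns a remembered (or imagined) sentence
# into an internally generated reward, e.g. with a hypothetical `encode` tokenizer:
# reward = float(model(current_state, encode("grasp red rose")) > 0.5)
```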
<p><center>
<img src="https://i.imgur.com/6aMRo6P.jpg" alt width=600 style="margin-top:25px">
<em>Example of the descriptive feedback used to ground language in an agent's behavior. The robot initially targets grasping the yellow tennis ball but, as it reaches another goal, the social partner provides a description of that outcome (image from [this post](https://www.aber.ac.uk/en/news/archive/2019/06/title-224324-en.html)).</em>
</center></p>

Descriptive feedback is interesting because it facilitates hindsight learning. While instruction-based feedback is limited to telling the agent whether its original goal was achieved, descriptions can provide feedback on any interaction, including ones the agent did not know existed. As a result, information collected while aiming at a particular goal can be reused to learn about others, a phenomenon known as hindsight learning [[HER](https://arxiv.org/pdf/1707.01495.pdf?source=post_page---------------------------)]. In existing implementations of LC-RL agents, state descriptions can be generated by scanning possible descriptions with a learned reward function [[IMAGINE](https://arxiv.org/pdf/2002.09253.pdf)], by a hard-coded module [[ACTRCE](https://arxiv.org/abs/1902.04546) and [Jiang et al., 2019](https://arxiv.org/abs/1906.07343)] or by a learned captioning module trained on instruction-based feedback [[HIGhER](https://arxiv.org/pdf/1910.09451.pdf); [Zhou et al., 2020](https://arxiv.org/pdf/2008.06924.pdf) and [ILIAD](https://arxiv.org/pdf/2102.07024.pdf)]. Given the same aligned data, one can indeed learn the reverse mapping, from the physical world to the language space. This captioning system can be used as private or inner speech to label new chunks of the physical world and generate more aligned data autonomously.

## From Language Structure to World Structure

In the physical world, everything is continuous. Perceptual inputs are just a flow of images, sounds and odors; behavior is a flow of motor commands and proprioceptive feedback. It is very hard to segment this flow into meaningful bits for abstraction, generalization, composition and creation. Language, on the other hand, is discrete. Words are bits that can easily be swapped and composed into novel constructions. Aligning language and physical data might just be about transferring the discrete structure of language onto the continuous, messy real world.

Dall-E seems to be a very good example of this. Everything starts with aligned data---pairs of images and compositional descriptions---and that's all there is. Dall-E is not creative per se, but the descriptions and images---both constructed by humans---are. Dall-E simply---but impressively---learns to project the structure of language onto the image space, so that, when a human inputs a new creative description, Dall-E can project it onto images. As we argued before, language facilitates some forms of creativity via simple mechanisms to compose concepts, form analogies and abstractions. Swapping words, composing new sentences and generating metaphors are all easier in language space than directly in image space.

## Language-Augmented Reinforcement Learning Agents

Readers should now be convinced that autonomous embodied learning agents should be language-augmented; they should leverage language as a cognitive tool to structure their sensorimotor experience, to compose, generalize, plan, etc. Let us go over some first steps in that direction.
LC-RL methods often investigate the abilities of their agents in terms of systematic generalization: the ability to interpret never-seen-before sentences by leveraging knowledge about their constituents. These sentences are known constructions with new associations. Let's consider an agent that learned to grasp plants and to feed animals by bringing them food. As it learned these skills, the agent acquired representations of the concepts of animals, plants, food and grasping. Can these concepts be leveraged to understand a new combination such as "feed the plant"? Here again, agents use aligned data (e.g. instruction-state or description-state pairs) and are tested on their ability to project the compositional aspect of language ("feed the animal" + "grasp the plant" → "feed the plant") onto the behavioral space. Although it is not perfectly _systematic_, this type of generalization often works quite well, especially in settings where agents are exposed to a wide distribution of objects/words and perceive the world through an egocentric point of view that helps them isolate individual objects [[Hill et al., 2019](https://deepmind.com/research/publications/Emergent-Systematic-Generalization-in-a-Situated-Agent)].

<p><center>
<img src="https://i.imgur.com/0honuuW.png" alt width=500 style="margin-top:25px">
<em>The Policy Sketches approach: long-horizon behaviors are aligned with policy sketches, sequences of symbolic tokens segmenting the long-horizon task into shorter steps [[Policy Sketches](https://arxiv.org/pdf/1611.01796.pdf)].</em>
</center></p>

> If we want to achieve _creative exploration_, we need agents to generate _out-of-distribution_ goals, to imagine new possible interactions with their world.

Building on these generalization properties, language was also found to be a good way to represent abstractions in hierarchical reinforcement learning setups [[Jiang et al., 2019](https://arxiv.org/pdf/1906.07343.pdf)]. While the high-level policy acts in an abstract and human-interpretable representation space towards the resolution of long-horizon control problems, the low-level policy benefits from the generalization induced by language-behavior alignment. In Modular Multi-task RL with Policy Sketches, long-horizon behaviors are aligned with policy sketches, sequences of symbolic tokens segmenting the long-horizon task into shorter steps [[Andreas et al., 2017](https://arxiv.org/pdf/1611.01796.pdf)]. This simple alignment, along with a bias to encode each step-specific behavior with a different policy, is enough to significantly speed up learning without any explicit pre-training of the sub-policies: the sequential structure of the task---its segmentation into separate steps---is projected onto the behavioral space, i.e. onto sub-policies.

Planning and reasoning in language space is much easier for humans and allows us to handle long-horizon tasks. Similarly, the TextWorld approach defines artificial learning environments where agents observe and interact with text only [[TextWorld](https://arxiv.org/pdf/1806.11532.pdf)]. This can be seen as a high-level world model in language space. Agents can efficiently explore and plan in the language space, then transfer this knowledge to an aligned sensorimotor space [[AlfWorld](https://arxiv.org/pdf/2010.03768.pdf)], as sketched below.
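As a rough illustration of this plan-in-language-then-act idea, here is a minimal sketch. The planner, the gym-style environment interface and the function names are hypothetical stand-ins, not TextWorld's or AlfWorld's actual APIs.

```python
from typing import Callable, List

def plan_in_language(task: str, text_planner: Callable[[str], List[str]]) -> List[str]:
    """Plan entirely in language space, e.g. over a TextWorld-style abstract model.
    For 'heat some water' a planner might return:
    ['go to the kitchen', 'grasp the kettle', 'fill the kettle', 'turn on the stove']."""
    return text_planner(task)

def execute_in_world(env, sub_goals: List[str], low_level_policy, reward_fn,
                     max_steps: int = 200) -> bool:
    """Ground each language sub-goal with a goal-conditioned sensorimotor policy.
    `reward_fn(state, sub_goal)` plays the role of the learned reward function
    sketched earlier; `env` is assumed to follow the gym step/reset convention."""
    state = env.reset()
    for goal in sub_goals:
        for _ in range(max_steps):
            state, _, done, _ = env.step(low_level_policy(state, goal))
            if reward_fn(state, goal):   # sub-goal satisfied, move to the next one
                break
            if done:
                return False             # episode ended before the plan completed
        else:
            return False                 # sub-goal timed out
    return True
```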
<p><center>
<img src="https://i.imgur.com/ZcKJiGd.jpg" alt width=700 style="margin-top:25px">
<em>Out-of-distribution goal generation powers creative exploration.</em>
</center></p>

Creative exploration is about finding new ways to interact with the environment. Agents can efficiently organize their exploration by generating and pursuing their own goals [[Forestier et al., 2017](https://arxiv.org/pdf/1708.02190.pdf)]. Goal generation, however, is often limited to the generation of target representations that are _within_ the distribution of previously-experienced representations. If we want to achieve _creative exploration_, we need agents to generate _out-of-distribution_ goals, to imagine new possible interactions with their world. This is reminiscent of the way children generate their own creative goals during pretend play [[Vygotsky, 1930](https://files.eric.ed.gov/fulltext/EJ1138861.pdf)], a type of behavior that is argued to benefit children's ability to solve problems later on [[Chu and Schulz, 2020](https://www.annualreviews.org/doi/pdf/10.1146/annurev-devpsych-070120-014806)].

<p><center>
<img src="https://drive.google.com/uc?id=1Xre-pH-r5lXaT9kv33zTIOLrnpUelgNy" alt style="margin-top:25px">
</center>
<em><b>The IMAGINE architecture as a Vygotskian deep RL system.</b> Social interaction with a social peer enables the agent to learn to represent and understand language as a pre-existing social structure (left). Then, language is internalized and used as a cognitive tool to imagine novel goals by re-combining known sentences: the agent aims to achieve these goals autonomously, enabling creative free exploration (right). Another Vygotskian dimension of IMAGINE is that the social peer scaffolds the environment: when the agent imagines and formulates a goal through private speech, the social peer sets up the environment so that it is neither too hard (the right objects are present) nor too easy (procedurally-generated objects and additional distracting objects) <a href="https://arxiv.org/pdf/2002.09253.pdf">[IMAGINE]</a>.</em></p>

The IMAGINE agent we discussed above leverages the productivity of language to generate such creative, out-of-distribution goals---it composes known linguistic goals to form new ones, to imagine new possible interactions [[IMAGINE](https://arxiv.org/pdf/2002.09253.pdf)]. The mechanism is crudely inspired by usage-based linguistic theories. It detects recurring linguistic patterns, labels words used in similar patterns as _equivalent_, and uses language productively by swapping equivalent words within the discovered templates. This simple mechanism generates truly creative goals that are both novel and appropriate---see the discussion in [Runco and Jaeger, 2012](http://emotrab.ufba.br/wp-content/uploads/2019/06/RUNCO-Mark-The-Standard-Definition-of-Creativity.pdf). The novel-and-appropriate definition of creativity shares similarities with the _intermediate novelty_ principle: not novel enough is boring (known sentences), but too novel is overwhelming (sentences with random words). The sweet spot is in the middle---novel instances of known constructions.

We show that this simple mechanism boosts the agent's exploration, as it interacts with more objects, driven by its imagined goals. Sometimes these goals make sense---the agent knows it can grasp plants and animals, and feed animals, so it will imagine it can feed plants as well. Sometimes they don't, and the agent might try to feed the lamp just like a child might feed their doll in _pretend play_ [[Vygotsky, 1933](http://yuoiea.com/uoiea/assets/files/pdfs/vygotsky-play.pdf)]. In any case, the agent is driven to interact with its world and the objects around it in a directed and committed way.
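A toy version of this imagination mechanism could look like the sketch below. The equivalence heuristic used in the paper is richer, so treat this as an illustrative approximation rather than the actual IMAGINE implementation.

```python
from itertools import product

def word_equivalences(goals):
    """Words are deemed equivalent when they fill the same slot of a shared template,
    i.e. two known sentences differ by exactly one word."""
    equiv = {}                                   # word -> set of equivalent words
    split_goals = [g.split() for g in goals]
    for w1, w2 in product(split_goals, repeat=2):
        if len(w1) == len(w2):
            diff = [i for i in range(len(w1)) if w1[i] != w2[i]]
            if len(diff) == 1:
                a, b = w1[diff[0]], w2[diff[0]]
                equiv.setdefault(a, set()).add(b)
                equiv.setdefault(b, set()).add(a)
    return equiv

def imagine_goals(goals):
    """Swap equivalent words inside known sentences to produce new candidate goals."""
    equiv = word_equivalences(goals)
    imagined = set()
    for goal in goals:
        words = goal.split()
        for i, w in enumerate(words):
            for alt in equiv.get(w, ()):
                candidate = ' '.join(words[:i] + [alt] + words[i + 1:])
                if candidate not in goals:
                    imagined.add(candidate)
    return imagined

known = {"grasp the plant", "grasp the animal", "feed the animal"}
print(imagine_goals(known))   # {'feed the plant'}
```

From "grasp the plant" / "grasp the animal" the sketch infers plant ≡ animal, from "grasp the animal" / "feed the animal" it infers grasp ≡ feed, and swapping those words yields the out-of-distribution goal "feed the plant".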
Goal imagination also boosts generalization, for two reasons. First, imaginative agents train on a larger set of goals, thus extending the support of their training distribution and generalizing better. Indeed, goal imagination enables agents to extend the set of known goals to a larger one made of all combinations of goals from the original set. Second, goal imagination can help correct over-generalizations. In Playground, animals can be fed with food or water, but plants only receive water. When asked to feed plants for the first time (for evaluation purposes, i.e. zero-shot generalization), the agent over-generalizes and brings them water or food with equal probability---how could it know? When allowed to imagine goals, agents imagine that plants could be fed just like animals and try to do so---with water or food. Although the policy over-generalizes, the reward function can still identify whether or not the plants have grown. Agents detect that plants only grow when provided with water, never with food. As a result, the policy can be updated based on internally-generated rewards to correct the prior over-generalization. Language does not need to correspond to a perfectly compositional world; the agent can correct for inconsistencies.

In another contribution, we introduce a language-conditioned goal generator to execute mental simulations of possible sensorimotor goals matching linguistic descriptions [[LGB](https://arxiv.org/pdf/2006.07185.pdf)]. This Language-Goal-Behavior (LGB) approach decouples skill learning from language grounding. In the _skill learning phase_, LGB relies on an innate semantic representation that characterizes spatial relations between objects in the scene using predicates known to be used by pre-verbal infants [[Mandler, 2012](https://cogsci.ucsd.edu/~jean/abstract/SpatialEnrichment.pdf)]. LGB explores its semantic space, discovers new configurations and learns to achieve them reliably. In the _language grounding phase_, LGB interacts with a descriptive caretaker that provides aligned data: linguistic descriptions of LGB's trajectories. This data is used to train the language-conditioned goal generator. When instructed with language, LGB can simulate/imagine several semantic configurations that could result from executing the language instruction. Instead of following the instruction directly as a standard LC-RL agent does, LGB samples a possible matching configuration and targets it directly. This type of mental visualization of the instructed goal is known to be performed by humans [[Wood et al., 1976](https://acamh.onlinelibrary.wiley.com/doi/pdfdirect/10.1111/j.1469-7610.1976.tb00381.x)] and allows agents to demonstrate a diversity of behaviors for a single instruction. This approach is reminiscent of the Dall-E algorithm and could integrate it to generate visual targets from linguistic instructions. A minimal sketch of this two-stage pipeline follows the figure below.

<p><center>
<img src="https://i.imgur.com/vCVs36T.jpg" alt style="margin-top:25px">
<em>The Language-Goal-Behavior architecture [[LGB](https://arxiv.org/pdf/2006.07185.pdf)]. The agent first generates a set of possible future configurations matching a linguistic description, then samples one of them as a concrete target.</em>
</center></p>
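To summarize the decoupling, here is a minimal sketch of the two-phase pipeline; all names and data structures are hypothetical placeholders for the learned components described above, not the paper's code.

```python
import random

def generate_matching_goals(instruction, config_descriptions):
    """Stand-in for the learned language-conditioned goal generator: return all
    semantic configurations whose aligned descriptions contain the instruction."""
    return [cfg for cfg, descriptions in config_descriptions.items()
            if instruction in descriptions]

def act_on_instruction(instruction, config_descriptions, goal_policy, env, max_steps=200):
    """`goal_policy` comes from the skill learning phase; `config_descriptions`
    comes from the language grounding phase (configurations paired with captions).
    `env` is assumed to follow the gym step/reset convention."""
    candidates = generate_matching_goals(instruction, config_descriptions)
    target = random.choice(candidates)   # behavioral diversity: any matching configuration will do
    state = env.reset()
    for _ in range(max_steps):
        state, _, done, _ = env.step(goal_policy(state, target))
        if done:
            break
    return target

# Example: two semantic configurations (tuples of symbolic predicates) that were
# both described as "put the blue block on the red block" during grounding.
config_descriptions = {
    ("on(blue, red)", "close(blue, green)"): {"put the blue block on the red block"},
    ("on(blue, red)", "far(blue, green)"): {"put the blue block on the red block"},
}
```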
## The Big Picture

It is now time to take a step back. In this post, we've seen that language, more than a communication tool, is also a cognitive tool. Humans use it all the time to represent abstract knowledge, plan, invent new ideas, etc. If it's so helpful, it's because we align it with the real world and, by doing so, project its structure and compositional properties onto the continuous, messy physical world.

> These approaches are only the first steps towards a more ambitious goal---artificial agents that demonstrate a rich linguistic mental life.

The embodied simulation hypothesis supports this idea. Humans seem to use language to generate structured representations and simulations of what it refers to. This view is compatible with theories viewing humans as maintaining collections of world models [e.g. [Forrester, 1971](http://www.virtualadjacency.com/wp-content/uploads/2008/01/42c-MIT-Prof-Forrester-Counterintuitive-Behavior-of-Social-Systems-TechRvw-Jan-1971.pdf); [Schmidhuber, 2010](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.467.5494&rep=rep1&type=pdf); [Nortmann et al., 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4428292/pdf/bht318.pdf); [Mous, 2013](https://reader.elsevier.com/reader/sd/pii/S0896627313002572?token=4396251A8D1E11B05D6464A79137FF32FB5A5BE41D1860EA40FCE1F870AB1931059F900D2AEF309E276B5049841CD186) and [Clark, 2015](https://books.google.fr/books?hl=fr&lr=&id=TnqECgAAQBAJ&oi=fnd&pg=PP1&dq=clark+surfing+uncertainty&ots=ausj5hB5PR&sig=p8bKMxE4Mgi7QvkI8sd_r0BHx4A#v=onepage&q=clark%20surfing%20uncertainty&f=false)].

Now, if we look at the algorithms discussed above under that lens, we'll find that many of them conduct mental simulations triggered by language. Dall-E certainly does: it generates visual simulations of (i.e. visualizes) language inputs. That's also what LGB does: it visualizes specific semantic representations that might result from executing an instruction. In a weaker sense, all LC-RL algorithms do this too: given a language input, they generate the next action to take to execute the instruction. Model-based versions of these algorithms would do so in a stronger sense---picturing whole trajectories matching language descriptions. Finally, learned reward functions offer verification systems for mental simulation: checking whether descriptions and states---imagined or real---match.

These approaches are only the first steps towards a more ambitious goal---*artificial agents that demonstrate a rich linguistic mental life*. Just like humans, autonomous agents should be able to describe what's going on in the world with a form of inner speech. The other way around, these agents should be able to leverage the productivity of language, generate new sentences and ideas from known ones, and project these linguistic representations into visual, auditory and behavioral simulations. Linguistic productivity can also drive pretend play, the imagination of creative made-up goals for the agent to practice its problem-solving skills. Only agents that conduct such an intensive alignment between language and the physical world can project linguistic structures onto their sensorimotor experience and learn to recognize the building blocks that will help them plan, compose, generalize and create.

This blog post covered works from developmental psychology and showed the importance of aligning language and physical experience in humans. Inspired by these studies, we argued for the importance of augmenting learning agents with language-based cognitive tools and reviewed first steps in that direction.
Whereas standard language-conditioned RL approaches only use language to communicate instructions or state representations, language-augmented RL agents align language and sensorimotor interactions to build structured world models. Language-Augmented Reinforcement Learning (LARL) builds on the history of research in developmental psychology pioneered by the Russian school [[Vygotsky, 1934](https://mitpress.mit.edu/books/thought-and-language)] and on the recent movement to transpose these ideas to cognitive robotics [[Mirolli et al., 2011](https://core.ac.uk/download/pdf/37835593.pdf)]. In LARL, language is used as the main cognitive tool to guide agents' development. Artificial agents, just like humans, build language-structured world models that underlie high-level cognitive abilities such as planning, abstract representations, analogies and creativity.
