# The Rise and Potential of Large Language Model Based Agents: A Survey

## 1 Introduction

- In the 18th century, the philosopher Denis Diderot introduced the idea that if a parrot could respond to every question, it could be considered intelligent
- In the 1950s, Alan Turing extended this notion to artificial entities and proposed the renowned Turing Test

> 💡 An agent is an artificial entity capable of perceiving its surroundings using sensors, making decisions, and then taking actions in response using actuators

> 💡 The agent definition began in philosophy: entities possessing desires, beliefs, intentions, and the ability to take actions

> 💡 As AI advanced, the term "agent" found its place in AI research to describe entities showcasing intelligent behavior and possessing qualities like autonomy, reactivity, pro-activeness, and social ability -> a pivotal stride towards AGI

- most superhuman/human-level AI systems so far have been narrow (Go or chess)
- World Scope (WS): 5 stages of NLP toward general AI: Corpus, Internet, Perception, Embodiment, Social

> 💡 General framework: brain, perception, action

## 2 Background

### 2.1 Origin of AI Agent

- an "agent" is an entity with the capacity to act; "agency" denotes the exercise or manifestation of this capacity
- the concept of an agent involves individual autonomy: the ability to exercise volition, make choices, and take actions, rather than passively reacting to external stimuli
- because consciousness and desires are metaphysical in nature, philosophy avoids ascribing autonomy to an agent directly; instead, observe whether an entity replicates the behaviors of a known agent, such as autonomy, reactivity, and social ability

### 2.2 Technological Trends in Agent Research

- 3 types of agents in early AI research:
  - symbolic agents: symbolic logic and symbolic representations; limited on large-scale, real-world problems
  - reactive agents: sense and react to the environment; lack higher-level decision making
  - RL-based agents: limitations include long training times, sample efficiency, and stability
- later paradigms:
  - transfer-learning/meta-learning agents: meta-learning is "learning how to learn"; limitations include large sample requirements, reliance on pretraining, and the difficulty of establishing a universal learning policy for meta-learning
  - LLM-based agents: use LLMs as the core component

### 2.3 Why is LLM suitable as the primary component of an Agent's brain?

- Autonomy: able to act to a certain degree without direct human intervention; adaptive autonomy (to input); creativity
- Reactivity: perceives alterations in the environment and takes action; embodiment; multimodality
- Pro-activeness: displays goal-oriented actions (not just reactions to the environment); planning; CoT
- Social ability: the ability to communicate with other agents

## 3 The Birth of An Agent: Construction of LLM-based Agents

![image](https://hackmd.io/_uploads/HkadFqcwa.png)

- brain for memory and decision making
- perception for extending the input modality
- action for taking actions: embodiment, tools, etc.

### 3.1 Brain

![image](https://hackmd.io/_uploads/SkoFc5cDp.png)

#### 3.1.1 Natural Language Interaction

- natural language generation is paramount; multi-turn, high-quality dialogue

#### 3.1.2 Knowledge

- linguistic knowledge (grammar, punctuation)
- commonsense knowledge
- professional, domain-specific knowledge
- hallucinations remain a risk

#### 3.1.3 Memory

- compress memory
- extend the context limit of transformers
- summarize memory

#### 3.1.4 Reasoning and Planning

- reasoning (deductive, inductive, abductive)
- some argue reasoning is acquired during pretraining or emerges at a certain scale
- prompt engineering
- planning is composed of 2 steps: plan formulation and plan reflection
- plan formulation can be sequential, hierarchical, or causal

#### 3.1.5 Transferability and Generalization

- unseen-task generalization: LLMs have remarkable zero-shot ability
- in-context learning: learning by analogy from examples in the prompt
- continual learning: CI/CD for the model; subject to catastrophic forgetting

### 3.2 Perception

![image](https://hackmd.io/_uploads/B1Rd2a5w6.png)
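One common way to extend an LLM's input beyond text (as the visual-input notes below describe) is to keep a frozen vision encoder and train only a small adapter that projects its features into the LLM's embedding space. A minimal sketch of that idea, with all dimensions and names hypothetical, using a plain linear projection as the adapter:

```python
import numpy as np

# Hypothetical dimensions: a frozen vision encoder emits 512-d patch
# features; the LLM expects 768-d token embeddings.
VISION_DIM, LLM_DIM = 512, 768

rng = np.random.default_rng(0)

# The trainable piece is just a linear projection ("adapter") mapping
# visual features into the LLM's embedding space. In practice these
# weights would be learned; here they are random placeholders.
W = rng.normal(scale=0.02, size=(VISION_DIM, LLM_DIM))

def encode_image(num_patches: int = 16) -> np.ndarray:
    """Stand-in for a frozen vision encoder (e.g. a ViT) that turns an
    image into a sequence of patch features."""
    return rng.normal(size=(num_patches, VISION_DIM))

def project_to_llm(patch_features: np.ndarray) -> np.ndarray:
    """Map visual patch features to pseudo-token embeddings the LLM
    can consume alongside ordinary text token embeddings."""
    return patch_features @ W

visual_tokens = project_to_llm(encode_image())
print(visual_tokens.shape)  # (16, 768)
```

The LLM then receives the projected visual tokens prepended to the text token embeddings; the same pattern generalizes to auditory and other modalities by swapping the encoder.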
#### 3.2.1 Textual Input

#### 3.2.2 Visual Input

- adapt a visual encoder to an LLM

#### 3.2.3 Auditory Input

#### 3.2.4 Other Input

- touch, smell, temperature, humidity, brightness, LIDAR, GPS

### 3.3 Action

![image](https://hackmd.io/_uploads/SkrzZA5Pp.png)

#### 3.3.1 Textual Output

#### 3.3.2 Tool Using

- tool use can have side effects such as hallucinations
- agents can acquire tools via zero/few-shot prompting or by learning from feedback
- curriculum learning -> start by learning simple tools, then move on to harder ones
- agents can create their own tools
- tools expand the action space, but the output space is still text

#### 3.3.3 Embodied Action

> 💡 The embodiment hypothesis draws inspiration from the human intelligence development process, positing that an agent's intelligence arises from continuous interaction and feedback with the environment rather than relying solely on well-curated textbooks.

- RL has traditionally been used for embodied actions, but it has issues with data efficiency, generalization, and complex reasoning
- LLMs seem to alleviate some of these problems:
  - cost efficiency: LLMs jointly trained on embodied data can transfer what they learn
  - LLMs exhibit strong cross-task generalization
  - LLMs have the ability to plan
- fundamental embodied actions for LLM-based agents:
  - observation: the primary way an agent acquires environmental information; can be through vision, text, audio, etc.
  - manipulation: object rearrangement, mobile manipulation, tabletop manipulation
  - navigation: dynamically altering position in the environment
    - navigation is usually a long-horizon task, where the agent's upcoming states are influenced by its past actions; a memory buffer and summary mechanism are needed to serve as a reference for historical information
- evaluation criteria and embodied task paradigms are still needed to effectively deploy embodied agents

## 4 Agents in Practice: Harnessing AI for Good

![image](https://hackmd.io/_uploads/B13s9ujwa.png)

- harness AI to assist users with repetitive work and daily tasks, improving task-solving efficiency
- no need for explicit low-level instructions -> the agent can analyze and plan on its own
- lets the user pursue innovative work instead of monotonous work

![image](https://hackmd.io/_uploads/rkiNi_jP6.png)

### 4.1 General Ability of Single Agent

![image](https://hackmd.io/_uploads/rkFhidiwT.png)

#### 4.1.1 Task-oriented Deployment

- **task-oriented deployment**: the agent follows high-level instructions from users -> goal decomposition, sequencing of subgoals, interactive exploration of the environment
- simulated test environments are used to evaluate deployment:
  - web scenarios: manipulating the web
  - life scenarios: real-life situations

#### 4.1.2 Innovation-oriented Deployment

- the inherent complexity of science is difficult to represent in text, and there is a lack of sufficient training data

#### 4.1.3 Lifecycle-oriented Deployment

- an agent with a long-term lifecycle would be a pivotal milestone toward AGI
- Minecraft is used as a proxy -> low-level control and high-level planning -> previously approached with RL and imitation learning

### 4.2 Coordinating Potential of Multiple Agents

![image](https://hackmd.io/_uploads/B1p67Ysvp.png)

#### 4.2.1 Cooperative Interaction for Complementarity

- cooperative multi-agent research can be categorized into:
  - disordered cooperation (uncontrolled discussion) -> resolved via majority voting
  - ordered cooperation: agents in the system adhere to specific rules (CAMEL, MetaGPT)

#### 4.2.2 Adversarial Interaction for Advancement

- forcing debate, argumentation, and reflection, in the spirit of AlphaGo Zero's self-play
- limitations: limited context windows cannot hold an entire debate, computational complexity explodes, and agents may converge to an incorrect consensus

### 4.3 Interactive Engagement between Human and Agent

![image](https://hackmd.io/_uploads/H1PeUKsv6.png)

- 2 paradigms in human-agent interaction:
  - instructor-executor
  - equal partnership

#### 4.3.1 Instructor-Executor Paradigm

- quantitative (numerical) and qualitative (text) feedback

#### 4.3.2 Equal Partnership Paradigm

- empathetic communicator
- human-level participation

## 5 Agent Society: From Individuality to Sociality

![image](https://hackmd.io/_uploads/Hke7Otswa.png)

### 5.1 Behavior and Personality of LLM-based Agents

![image](https://hackmd.io/_uploads/BJqP85sPT.png)

#### 5.1.1 Social Behavior

- individual behaviors:
  - input behavior (absorption of surrounding information)
  - internalizing behavior (inward cognitive processing)
  - output behavior (outward action)
- dynamic group behavior:
  - positive
  - neutral
  - negative

#### 5.1.2 Personality

- cognitive ability
- emotional intelligence
- character portrayal

### 5.2 Environment for Agent Society

#### 5.2.1 Text-based Environment

#### 5.2.2 Virtual Sandbox Environment

- extensibility and visualization

#### 5.2.3 Physical Environment

- sensor perception and processing
- motion control constraints

### 5.3 Society Simulation with LLM-based Agents

#### 5.3.1 Key Properties and Mechanisms of Agent Society

- open: the environment is open for agents to explore
- persistent: the environment persists, and agents affect it
- situated: the environment sits within a certain context
- organized: a meticulously organized framework mirroring the real world

#### 5.3.2 Insights from Agent Society

- organized productive cooperation
- propagation in social networks
  - understanding social dissemination phenomena
- ethical decision-making and game theory
  - through the modeling of diverse scenarios, researchers acquire valuable insights into how agents prioritize values like honesty, cooperation, and fairness in their actions
  - in addition, agent simulations not only provide an understanding of existing moral values but also contribute to the development of philosophy by serving as a basis for understanding how these values evolve over time
- policy formulation and improvement

#### 5.3.3 Ethical and Social Risks in Agent Society

- unexpected social harm
- stereotypes and prejudice
- privacy and security
- over-reliance and addictiveness: growing overly fond of agents

## 6 Discussion

### 6.1 Mutual Benefits between LLM Research and Agent Research

- LLMs excel at comprehension, planning, memory, and reasoning -> this aids agent construction and research

### 6.2 Evaluation for LLM-based Agents

- utility
  - success rate of task completion
  - foundational capabilities (out of the box)
  - efficiency
- sociability
  - language communication proficiency
  - cooperation/negotiation
  - role-playing
- values
  - honesty
  - harmlessness
  - adapting to specific demographics/cultures
- ability to evolve continually
  - continual learning
  - autotelic learning (creating one's own goals)
  - adaptability and generalization

### 6.3 Security, Trustworthiness and Other Potential Risks of LLM-based Agents

#### 6.3.1 Adversarial Robustness

- attacks: backdoor, prompt-specific, data poisoning
- defenses: adversarial training, data augmentation, adversarial sample detection

#### 6.3.2 Trustworthiness

#### 6.3.3 Other Risks

- misuse
- unemployment
- threat to the human race

### 6.4 Scaling up the Number of Agents

- pre-determined scaling: the user fixes the number of agents
- dynamic scaling: the number of agents scales autonomously at runtime
- limitations: communication and message propagation; computational burden

### 6.5 Open Problems

- whether LLM-based agents are a potential path to AGI
  - some argue LLMs exhibit early signs of AGI; others say the agents are merely reactionary next-token predictors
- moving from virtual/simulated environments to the real world
  - adaptability of hardware
  - enhanced environmental generalization capabilities
  - the tolerance for errors is high in virtual environments but much lower in the real world
- collective intelligence in AI agents is not guaranteed by multi-agent setups alone
- agents as a service (LLMAaaS)
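The "disordered cooperation" pattern from section 4.2.1 — and why collective intelligence is not guaranteed — can be illustrated with a toy majority-voting sketch (all data below is hypothetical):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the plurality answer among independent agent responses.

    A minimal sketch of disordered cooperation: each agent answers
    freely, and the group commits to the most common answer. Note that
    this can still converge to an incorrect consensus if most agents
    share the same bias, which is one reason multi-agent setups do not
    guarantee collective intelligence.
    """
    # Normalize superficial differences before counting votes.
    counts = Counter(a.strip().lower() for a in answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical answers from five independent agents:
print(majority_vote(["Paris", "paris", "Lyon", "Paris ", "Marseille"]))  # paris
```

Ordered cooperation (as in CAMEL or MetaGPT) replaces this free-for-all vote with role assignments and communication rules among the agents.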