## Breaking Down Devin AI LLM Orchestration: High Level

This is a high-level summary of how I think Devin AI actually works under the hood. Given that footage of the product was released less than 48 hours ago, this could very likely be wrong. This is merely speculation until we see something official.

## What makes Devin Special

Let's start here. I think most developers are over-indexed on Devin's ability to write code, when we have already seen coding ability in Cursor, and even longer ago in GitHub Copilot. What lets Devin capture the imagination of the average developer is three factors:

1) the UX for keeping a human in the loop
2) live orchestration of a web browser, code editor, and command shell
3) aggregate outcomes that are considered semi-reliable

If you have been in LLM engineering for a while, this is the problem of how to do **LLM Orchestration** reliably, which is a hard problem.

## From 10K Feet

![High level diagram](https://hackmd.io/_uploads/B16RqyGC6.png)

Devin is comprised of two large systems:

1) A user interface agent, given that it does more than chatting. Let's call this the *foreground agent*.
2) An action agent, which performs the actual actions on tools like the editor, command line, and browser. Given that its scope is likely larger than this, let's call this the *background agent*.

How do we know this for sure? Well, we at least know that there is some notion of concurrency: the UI input box explicitly says that messages you send will `not interrupt Devin`.

> TLDR; I believe LLM orchestration in Devin is done via two finite-state-machine LLM programs communicating over a channel with message queues.

### What's going on in the foreground

**Planning**: I observe that the plan changes completely, not just from a task receiving a :heavy_check_mark:, but in how tasks are worded. This leads me to speculate that the planning state is in fact very simple: the state is just a text output. No fancy data structure needed. The LLM likely takes as input the old plan, a combination of tasks accomplished/stuck, and user input, and outputs the most current rendition of the plan (see the sketch below).

More observations:

**Buffered messages and reply logic**: The Devin message UI does not answer the user's messages immediately; incoming messages are buffered in a user-facing message queue. We can deduce that certain conditions must be met before outbound messages are sent to the user:

1) there is no ongoing task and the user submits a message
2) a task is done and the user submits a message
3) the background agent is stuck and requires user input

**Message types**: The Devin UI also supports multiple message types: status of the most current work, task accomplished ✅, task stuck and awaiting the user's response ⭕️.

**Replaying old state**: You see time-series messages and screenshots, plus which tool is being worked on. As you move the slider, you can see both the focus shift of Devin's workspace and status messages such as `devin is doing ...`.
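If the plan state really is just a blob of text, the update step could be a single LLM call that rewrites the plan from the old plan, the latest task outcomes, and the user's message. A minimal sketch under that assumption; the prompt wording and the `llm` callable are purely my own inventions:

```python
def update_plan(old_plan: str, task_updates: list[str], user_message: str, llm) -> str:
    """Plan state is just text: feed the old plan plus new signals back into the LLM."""
    prompt = (
        f"Current plan:\n{old_plan}\n\n"
        "Task updates:\n" + "\n".join(task_updates) + "\n\n"
        f"User message:\n{user_message}\n\n"
        "Rewrite the plan so it reflects the latest state."
    )
    return llm(prompt)  # the new plan, again just a text blob
```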
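And if the replay slider is driven by nothing more than a timestamped event log, that log could be as simple as the sketch below. Class and field names here are my own guesses, not anything observed in the product:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Status(Enum):
    DOING = "doing"   # "devin is doing ..."
    DONE = "done"     # task accomplished ✅
    STUCK = "stuck"   # awaiting the user's response ⭕️


@dataclass
class WorkspaceEvent:
    """One point on the replay slider: what Devin was doing at that moment."""
    timestamp: datetime
    status: Status
    message: str                            # the status text shown in the chat UI
    tool_in_focus: Optional[str] = None     # "shell" / "editor" / "browser"
    screenshot_path: Optional[str] = None   # workspace snapshot, assuming plain images


@dataclass
class SessionLog:
    """Append-only log; the UI replays it by scrubbing over timestamps."""
    events: List[WorkspaceEvent] = field(default_factory=list)

    def at(self, t: datetime) -> Optional[WorkspaceEvent]:
        # The slider shows the latest event at or before time t.
        past = [e for e in self.events if e.timestamp <= t]
        return past[-1] if past else None
```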
From these replay observations we can deduce that the UI can load the entire state of all tasks as a timestamped message log, both for the chat UI and for the web browser, code editor, and shell, assuming the latter are simply stored as images.

We can say that the foreground agent is a higher-level cognitive function, responsible for taking in and translating user input, planning and modifying the plan, and playing and replaying past events.

What does it communicate to the background agent? I believe this is the current task, the objective, the plan, and the user's message inputs.

Now, given that the foreground agent has to communicate these requirements to the background agent, we need to figure out the transport.

### What does the transport layer look like?

It is very possible that there is a global state somewhere. However, given that we know the two systems are concurrent, it is reasonable to assume they may share global state (which could get tricky, as there can be deadlocks or race conditions), but more likely they use some kind of message passing to communicate the most recent task, with a way to accept new inputs when the task is done. In other words, the foreground and background agents are each simply a single worker consuming from a message queue.

### What's going on in the background

The background agent is given higher-level instructions from the foreground agent. It should keep the context of tasks done, the current task status, action logs, and possibly also high-level reflections and trains of thought across all the tools and output available.

What matters at this level, and what distinguishes Devin from AutoGPT-type agents, is the **termination signal**, which is where agents typically fall down: terminating the current process with some kind of 'success' or 'stuck' status and ensuring there is no infinite loop. This can be implemented as some kind of upper-bound retry mechanism or even with a classifier. Notably, Devin is shown to be able to ask the user for intervention.

To simplify, we can say that the background agent is always in a state of **done**, **doing**, or **stuck**. Each state requires the foreground agent to react differently, with the background agent ready to accept new foreground messages thereafter.

Each tool-using agent likely keeps a long-running log and a reflection on its current state. This minimizes context clutter at the high level and achieves a higher degree of context isolation, sending only what the foreground needs to know about the current task.

There are some occasions that show the *foreground has access to memory* (what is shown in the UI). Either the foreground gets it from the message being passed and answers immediately, or it can query the background agent at will to answer the question (yet another task).

*Does the background have a high-level plan like the foreground, but one the user can't see?* i.e. a lower-level hierarchy of tasks. Possibly, though the agent at this level likely doesn't have a notion of subtasks; it is simply a stateful ReAct loop with multiple tools available. The reason is that there are a lot of unknowns, and reflecting on current progress matters more.

That said, how it juggles three different tools and their contexts, while knowing how to respond to the foreground agent, is an interesting question. There is likely a router or controller of some kind that determines the next tool to use. This can be a classifier, but possibly also an explicit input choice from the foreground. It could also be a ReAct-esque reflection, i.e. **observation:** I see a custom error code from tool x; **reflection:** I don't know what this is, I should search online for tool x's error code.
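Put together, a single background task could then be a bounded ReAct-style loop in which one LLM call reflects on a scratchpad, routes to the next tool, and decides whether to terminate. This is only a sketch under those assumptions; the decision fields (`status`, `tool`, `input`, `thought`) are invented for illustration:

```python
from enum import Enum
from typing import Callable, Dict, List


class State(Enum):
    DOING = "doing"
    DONE = "done"
    STUCK = "stuck"


def run_task(
    task: str,
    llm: Callable[[List[str]], dict],         # hypothetical LLM call: scratchpad -> decision
    tools: Dict[str, Callable[[str], str]],   # e.g. {"shell": ..., "editor": ..., "browser": ...}
    max_steps: int = 20,
) -> State:
    """One background task as a stateful ReAct-style loop with a tool router."""
    scratchpad = [f"task: {task}"]            # long-running log / reflection context
    for _ in range(max_steps):
        # The LLM both reflects on progress and routes to the next tool.
        decision = llm(scratchpad)
        scratchpad.append(f"reflection: {decision.get('thought', '')}")
        if decision["status"] == "done":
            return State.DONE                 # deliver outcome and logs to the foreground
        if decision["status"] == "stuck":
            return State.STUCK                # hand back to the foreground to ask the user
        observation = tools[decision["tool"]](decision["input"])
        scratchpad.append(f"observation: {observation}")
    # The upper bound on steps is the guard rail against infinite loops.
    return State.STUCK
```

Note that only one tool call is in flight at a time, which matches what the demos show.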
There is likely one agent per tool, and another agent to coordinate and route which tool/task should be used at any one time. For example, *a fix-the-compiler-error task* would need to **read documentation**, check **shell output**, and then go **work with the editor**. After the task is accomplished, the output and logs are delivered as the outcome to the high-level foreground. There is no concurrency or parallelism here: Devin either codes, uses the shell, or browses, but never more than one at once.

### What we don't know

- This high-level diagram does not cover the implementations of the LLM agents that manipulate the command line, code editor, or web browser, nor how they accomplish this level of orchestration. We can't simply hand-wave these tool-specific agents away, but one must admit plenty of research has already been done on them. For the architecture, what matters is the guard rails and the message passing from the higher-level controller that distributes tasks. It is possible these agents can be implemented in isolation as long as the interface contract between them is known.
- How it wins coding benchmarks: as impressive as beating benchmarks is to researchers, I don't think it is what makes the product stand out. I think delivering an end-to-end development workflow like a human does is what matters.
- What is Devin's maximum context length before it starts to forget? i.e. the memory-management strategy for a really long-running session. We know a session can go on for about an hour from the fine-tuning job video https://www.youtube.com/watch?v=V_J-xOeCklQ. So a coding session can run for a while, but Devin needs to explicitly check the status; it doesn't have it in context when asked.
- Systems operations and security issues: there is a huge array of vulnerabilities involved in keeping this system secure, which requires a sandboxed Ubuntu server running an IDE, browser, and shell commands with root access, plus a way for this environment to be provisioned, torn down, and resumed by the user.

## Implications

1. It should be possible to implement the foreground agent, the background agent, and the transport layer between them asynchronously, as long as the scope and contract between each are clear.
2. It would be helpful to see what an example call stack between the two concurrent systems looks like on a mocked-out task; a minimal sketch follows below.
3. UI and UX still represent a huge issue. The Devin UI is very dynamic and makes heavy use of streaming output, i.e. via websockets.
4. Lastly, there is a whole host of challenges in implementing a background agent that can code well alone, let alone manipulate a browser and shell. This is a big area of research.
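As a starting point for that mocked-out call stack, here is a tiny sketch of the two workers exchanging messages over a pair of queues. Every field name (`type`, `task`, `plan`, `state`, `logs`) is invented; nothing about Devin's real contract is known:

```python
import queue
import threading

# Hypothetical message shapes; the real contract between the two agents is unknown.
to_background: "queue.Queue[dict]" = queue.Queue()   # tasks, plan, user messages
to_foreground: "queue.Queue[dict]" = queue.Queue()   # status updates: doing / done / stuck


def background_worker() -> None:
    """Single worker consuming tasks; the actual work is mocked out here."""
    while True:
        msg = to_background.get()
        if msg["type"] == "shutdown":
            break
        to_foreground.put({"type": "status", "state": "doing", "task": msg["task"]})
        # ... the ReAct loop from the earlier sketch would run here ...
        to_foreground.put({"type": "status", "state": "done", "task": msg["task"],
                           "logs": ["mocked shell output", "mocked editor diff"]})


worker = threading.Thread(target=background_worker, daemon=True)
worker.start()

# Foreground side: send the current task and plan, then react to status messages.
to_background.put({"type": "task", "task": "fix compiler error",
                   "plan": "1. read error 2. search docs 3. edit file"})
while True:
    status = to_foreground.get()
    print(status)              # the UI would render these as "doing" / ✅ / ⭕️ updates
    if status["state"] in ("done", "stuck"):
        break

to_background.put({"type": "shutdown"})
worker.join()
```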