# context-glossary-20251112
### Context Window
The **context window** is the total number of tokens a model can consider and generate within a single inference — **including both input and output**. It defines the model’s *active field of attention* at any given moment.
Everything inside that window—system instructions, tool metadata, retrieved documents, user prompts, and the model’s own generated text—constitutes what the model “knows” right now. Anything beyond that limit is invisible unless deliberately re-introduced.
---
#### How it works
During a single generation, the model must fit *all* input **and** output within this window.
If the window is 200K tokens and your input already occupies 60K, then the model has ≈140K tokens left for its reply before reaching the hard limit. There’s no overflow; the transformer cannot generate beyond its capacity.
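As a quick sanity check, the arithmetic is plain subtraction. A minimal sketch (the figures mirror the example above; nothing here is model-specific):

```python
def remaining_output_budget(window: int, input_tokens: int) -> int:
    """Tokens left for generation once the input is counted."""
    if input_tokens > window:
        raise ValueError("input alone exceeds the context window")
    return window - input_tokens

# The example above: a 200K window already holding 60K tokens of input
print(remaining_output_budget(200_000, 60_000))   # -> 140000
```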
Between turns, however, most chat interfaces and AI tools **reconstruct** the context rather than persisting the model’s internal state.
They resend selected portions of the conversation history—sometimes summarized, filtered, or rewritten—along with new user input. This is where **context engineering** comes in: the designers of AI interfaces, and advanced users, decide *what subset of history or external material* should be visible to the model for the next reasoning step.
---
#### Context-engineering implications
* **Selective recall:** Interfaces act as curators, not stenographers. They may keep only the last few exchanges, summaries, or key facts instead of the entire transcript.
* **Dynamic summarization:** Systems can compress long dialogues into shorter synopses or structured memories to preserve continuity within token limits.
* **Retrieval augmentation:** External stores (notes, databases, embeddings) allow the interface to re-inject relevant information on demand instead of retaining every token.
* **Design choice, not architecture:** The model itself doesn’t remember between calls; *the application layer engineers the illusion of memory* by managing what gets re-sent each turn (see the sketch after this list).
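A minimal sketch of that application layer, assuming a hypothetical `call_model` function and a naive keep-the-last-N recall policy (real interfaces layer summarization and retrieval on top of this):

```python
# Hypothetical application layer: the model never sees the full transcript,
# only what this code chooses to resend each turn.
history: list[dict] = []    # full transcript, stored outside the model
KEEP_LAST = 6               # naive selective-recall policy

def build_context(user_input: str) -> list[dict]:
    recent = history[-KEEP_LAST:]                       # selective recall
    return [{"role": "system", "content": "You are a helpful assistant."},
            *recent,
            {"role": "user", "content": user_input}]

def take_turn(user_input: str) -> str:
    reply = call_model(build_context(user_input))       # hypothetical API call
    history.append({"role": "user", "content": user_input})
    history.append({"role": "assistant", "content": reply})
    return reply
```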
---
#### Key takeaways
* **Context window = total capacity per call (input + output).**
* **No persistence by default:** Each turn is a fresh call; history is re-included by design, not by model memory.
* **Context engineering** is the art of deciding what to resend, summarize, or retrieve so the model always has *the right 200K tokens* in view—no more, no less.
### Prompt Engineering
**Prompt engineering** is the craft of designing and structuring inputs to a model to elicit desired behaviors or outputs. It involves choosing phrasing, examples, and constraints that align the model’s generative process with human intent.
While it often focuses on *wording and syntax*, prompt engineering operates within a larger **context engineering** framework that also manages what information surrounds the prompt.
---
### Context Engineering
**Context engineering** extends prompt engineering beyond wording — it involves curating, structuring, and dynamically managing *everything* that enters the model’s context window.
It treats the prompt as one layer in a broader system of information flows: persistent memory, embeddings, retrieval mechanisms, and context hierarchies.
In practice, context engineering governs *what the model knows at runtime*, ensuring the right data is visible at the right time for coherent reasoning.
---
### Retrieval-Augmented Generation (RAG)
**Retrieval-Augmented Generation (RAG)** is a technique that supplements the model’s internal knowledge with relevant external information at query time.
RAG systems retrieve passages, notes, or embeddings from a database based on semantic similarity, then feed them into the model’s context window as supporting material for generation.
It exemplifies context engineering at scale: structuring retrieval so that models reason over curated, just-in-time knowledge.
---
### Context Pruning
**Context pruning** is the deliberate removal of redundant, irrelevant, or outdated material from the context window to preserve token space and maintain coherence.
Since large language models process a finite number of tokens, pruning improves efficiency and reasoning quality.
Effective pruning strategies may involve recency weighting, relevance scoring, or semantic deduplication — keeping only what matters most for the current task.
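A toy sketch of score-based pruning, assuming each candidate item already carries a relevance score (e.g., from an embedding comparison) and a token count:

```python
def prune(items: list[dict], budget: int) -> list[dict]:
    """Keep the highest-scoring items that still fit the token budget.

    Each item: {"text": str, "tokens": int, "relevance": float, "age": int}.
    The score blends relevance with recency (newer items have lower age).
    """
    def score(item: dict) -> float:
        return item["relevance"] - 0.01 * item["age"]   # recency weighting

    kept, used = [], 0
    for item in sorted(items, key=score, reverse=True):
        if used + item["tokens"] <= budget:
            kept.append(item)
            used += item["tokens"]
    return kept
```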
---
### Context Persistence
**Context persistence** refers to maintaining or reconstructing relevant context across multiple interactions with a model.
Because most models have no long-term memory, persistence is often achieved through external systems — e.g., databases, vector stores, or custom memory managers — that store and reinject context into new sessions.
It allows an AI system to behave as though it “remembers” prior interactions, key facts, or evolving projects.
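A minimal sketch of the pattern with a JSON file standing in for the external store (a vector database or memory manager plays the same role at scale):

```python
import json
from pathlib import Path

STORE = Path("memory.json")   # stand-in for a database or vector store

def save_fact(key: str, value: str) -> None:
    memory = json.loads(STORE.read_text()) if STORE.exists() else {}
    memory[key] = value
    STORE.write_text(json.dumps(memory))

def build_session_preamble() -> str:
    """Reinject stored facts at the start of a fresh session."""
    if not STORE.exists():
        return ""
    memory = json.loads(STORE.read_text())
    facts = "\n".join(f"- {k}: {v}" for k, v in memory.items())
    return f"Known facts from prior sessions:\n{facts}"
```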
---
### Context Rot
**Context rot** occurs when stored or reintroduced contextual material becomes stale, inaccurate, or misaligned with the current state of a project or conversation.
Over time, accumulated memory (e.g., cached summaries or embeddings) may mislead the model, degrading output quality.
Preventing context rot is a major maintenance task in long-lived AI systems, akin to refactoring or data hygiene in software engineering.
---
### Needle in a Haystack
The **needle-in-a-haystack** problem describes the challenge of locating a small but crucial piece of information within a very large context.
As context windows grow (e.g., 200K+ tokens), models can technically “see” more text — but attention mechanisms may still struggle to focus on the right part.
This concept underscores the need for **context structuring and retrieval strategies** that surface relevant signals amid noise.
---
### Chain-of-Thought (with reference to Context)
**Chain-of-thought (CoT)** refers to the model’s ability to generate intermediate reasoning steps between a question and an answer.
When prompted explicitly (“Let’s reason step-by-step”) or implicitly via examples, the model lays out its internal logic in natural language.
In context engineering, CoT depends on well-managed context — the reasoning chain only works if prior steps remain visible and coherent within the window.
Extending CoT over multiple turns often requires **persistent or reconstructed context**, bridging the gap between in-window reasoning and long-term dialogue continuity.
---
### Retrieval-Augmented Generation (RAG) — *extended definition*
**Retrieval-Augmented Generation (RAG)** integrates external information retrieval with a model’s generative process. Instead of relying solely on the model’s internal parameters (its “frozen” pretraining knowledge), a RAG system **searches a knowledge base**—vector store, document index, or database—at query time to fetch semantically relevant passages or notes. These retrieved chunks are then inserted into the model’s **context window** before generation, grounding its output in up-to-date or domain-specific evidence.
At its best, RAG exemplifies **context engineering at scale**: it dynamically curates *just-in-time knowledge* so that the model reasons within the right slice of information. Effective RAG pipelines manage retrieval quality (similarity metrics, re-ranking), context formatting (chunk size, citation scaffolding), and token budgeting (how much retrieved text to include). Current frontiers include **multi-hop retrieval**, **graph-based context expansion**, and **hybrid systems** that blend symbolic search with embeddings to reduce hallucination and improve interpretability.
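A minimal end-to-end sketch of the classic retrieve-then-generate shape. The `embed` and `call_model` functions are hypothetical stand-ins for an embedding model and an LLM call:

```python
import math

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def rag_answer(question: str, docs: list[str], k: int = 3) -> str:
    q_vec = embed(question)                       # hypothetical embedding model
    ranked = sorted(docs, key=lambda d: cosine(embed(d), q_vec), reverse=True)
    context = "\n\n".join(ranked[:k])             # top-k chunks into the window
    prompt = (f"Answer using ONLY the sources below.\n\n"
              f"Sources:\n{context}\n\nQuestion: {question}")
    return call_model(prompt)                     # hypothetical LLM call
```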
---
### Vector Embeddings and Vector Stores
A **vector embedding** converts text (or images, code, audio) into a high-dimensional numeric representation capturing semantic similarity: nearby vectors correspond to conceptually related content. **Vector stores** (e.g., Pinecone, Weaviate, Chroma, FAISS) index these embeddings for rapid nearest-neighbor search.
In a RAG pipeline, embeddings allow context retrieval based on *meaning* rather than exact keywords. However, they also introduce subtle biases—similarity in embedding space does not always equal conceptual relevance—making **embedding strategy** and **re-ranking logic** key parts of context engineering.
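The core operation a vector store performs is nearest-neighbor search. A self-contained toy version with hand-written three-dimensional "embeddings" (real embeddings run to hundreds or thousands of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy "vector store": text chunks paired with hand-made embeddings
store = [
    ("cats are small felines",    [0.9, 0.1, 0.0]),
    ("dogs are loyal canines",    [0.8, 0.3, 0.1]),
    ("bonds are debt securities", [0.0, 0.1, 0.9]),
]

query_vec = [0.85, 0.2, 0.05]   # pretend embedding of "tell me about pets"
best = max(store, key=lambda item: cosine(item[1], query_vec))
print(best[0])   # prints the chunk whose meaning sits closest to the query
```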
---
### Retrieval Chain
The **retrieval chain** is the orchestrated sequence of steps that transform a user query into a context-aware prompt:
`query → retrieval → filtering → ranking → formatting → generation.`
Each stage encodes design choices: Which embeddings? How many results? How are they chunked, cited, or summarized?
A well-engineered retrieval chain balances precision (surface only what’s relevant) and coverage (don’t omit needed context). Emerging frameworks like **LangChain**, **LlamaIndex**, and **Semantic Kernel** provide composable abstractions for these pipelines, but performance still depends on *context orchestration*—how results are arranged, delimited, and framed before entering the model window.
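A sketch of the chain as explicit, swappable stages. The `index.search` API and `model.generate` call are hypothetical; each stage is the obvious place to swap in a real implementation:

```python
def retrieval_chain(query: str, index, model) -> str:
    candidates = index.search(query, limit=50)                # retrieval
    filtered = [c for c in candidates if c.score > 0.5]       # filtering
    top = sorted(filtered, key=lambda c: c.score, reverse=True)[:5]   # ranking
    context = "\n---\n".join(f"[{c.source}] {c.text}" for c in top)   # formatting
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return model.generate(prompt)                             # generation
```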
---
### Context-Aware Tool Use
Modern models increasingly integrate with **external tools**—search engines, calculators, APIs, or custom functions—called during inference to extend the model’s abilities. Each tool call injects *live* context (fresh data, computed results) into the reasoning process. This transforms prompting into **context choreography**: deciding when to rely on the model’s internal reasoning versus when to delegate to a tool.
From a context-engineering perspective, tool use expands the model’s “window” beyond text: every function call or retrieval event becomes a contextual act that reshapes what the model *knows* mid-conversation.
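A minimal sketch of the dispatch loop, assuming the model replies either with plain text or with a structured tool request (the `call_model` function and the JSON request format are illustrative assumptions, not any particular vendor's API):

```python
import datetime
import json

# Tool registry: each entry takes a dict of arguments and returns a string
TOOLS = {
    "add": lambda args: str(args["a"] + args["b"]),
    "now": lambda args: datetime.datetime.now().isoformat(),
}

def run_with_tools(prompt: str) -> str:
    context = prompt
    while True:
        reply = call_model(context)          # hypothetical model call
        try:
            request = json.loads(reply)      # e.g. {"tool": "add", "args": {...}}
        except json.JSONDecodeError:
            return reply                     # plain text: reasoning is done
        result = TOOLS[request["tool"]](request["args"])
        # Inject the live result back into the context and keep reasoning
        context += f"\n[tool {request['tool']} returned: {result}]"
```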
---
### Model Context Protocol (MCP)
The **Model Context Protocol (MCP)** is an emerging open standard for connecting AI models to external data sources, tools, and applications in a secure, interoperable way. Developed initially by Anthropic and now adopted across ecosystems (Claude Desktop, Cursor, VS Code, Obsidian, ChatGPT’s tool system, and others), MCP defines a consistent message-passing layer through which models can **list resources**, **invoke tools**, and **exchange context** with local or remote systems.
MCP’s significance for context engineering is profound: it formalizes how an LLM **extends its context beyond text**, treating APIs, databases, or files as live contextual surfaces. With MCP, context engineering becomes distributed—an orchestration problem spanning many sources of truth. This enables *agentic workflows* where models browse files, run code, or query documents on demand, while keeping clear boundaries for safety and interpretability. In short, MCP is the plumbing that turns context engineering into a standardized **runtime ecology** for intelligent systems.
---
### MCP Client / Server Architecture
An MCP setup usually involves a **client** (the model runtime or chat interface) and one or more **servers** that expose resources or tools. Servers describe available endpoints (“resources”) and the schemas for interacting with them; clients can request lists of tools, fetch resource contents, or call a function directly. Because the protocol is language-agnostic and transport-agnostic (JSON-RPC 2.0, typically over stdio or HTTP), developers can integrate heterogeneous systems—local databases, cloud APIs, command-line tools—under a unified context layer.
For users, this means that apps like Claude Desktop or Cursor can “see” your local files, notes, or project directories as contextual extensions. For educators and researchers, it opens new avenues for **controlled transparency**: students or faculty can inspect exactly *what context* an AI accessed when forming an answer.
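As a rough illustration of the message-passing layer: the `tools/list` and `tools/call` method names below follow the published MCP specification, while the `send` transport and the `read_file` tool are hypothetical stand-ins:

```python
import json

def rpc(method: str, params: dict, id_: int) -> str:
    """Frame an MCP request as a JSON-RPC 2.0 message."""
    return json.dumps({"jsonrpc": "2.0", "id": id_,
                       "method": method, "params": params})

# Ask a server what tools it exposes...
send(rpc("tools/list", {}, 1))                    # `send` = hypothetical transport
# ...then invoke one with arguments matching its declared schema.
# "read_file" is only an example of a tool a server might expose.
send(rpc("tools/call",
         {"name": "read_file", "arguments": {"path": "notes.md"}}, 2))
```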
---
### MCP vs. Traditional Plugins
Where older plugin systems (e.g., ChatGPT Plugins, VS Code Extensions) bound tools to specific hosts or UIs, **MCP** provides a *protocol-level abstraction*. Any MCP-compliant model or app can interoperate with any MCP-compliant tool without bespoke integration. This moves context engineering from ad-hoc glue code toward **modular, inspectable architectures**—a foundational shift from “prompt engineering” to **context-as-infrastructure**.
---
### RAG 2.0 (Retrieval-Augmented Generation 2.0)
**RAG 2.0** marks the evolution from simple “retrieve-then-generate” systems toward **integrated retrieval-reasoning loops**.
Where classical RAG appended top-K search snippets to a prompt, RAG 2.0 architectures **interleave retrieval and generation dynamically**, letting the model *decide when, what, and how* to retrieve during reasoning.
In practice this often means a **controller model** that issues structured retrieval requests mid-generation (via tool calls or MCP functions), incorporates results into its chain-of-thought, and continues reasoning with updated evidence.
RAG 2.0 thus transforms static context injection into **active context negotiation**—a fluid, multi-turn process that fuses semantic search, memory management, and epistemic awareness.
Key research threads include *self-reflective retrieval*, *contextual caching*, and *multi-agent collaboration* for evidence gathering.
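A schematic of the interleaved loop, assuming a hypothetical controller model that signals retrieval with a `SEARCH:` directive (real systems use structured tool calls, but the control flow is the same):

```python
def rag2_loop(question: str, max_hops: int = 4) -> str:
    context = f"Question: {question}"
    for _ in range(max_hops):
        step = call_model(context)               # hypothetical controller model
        if step.startswith("SEARCH:"):           # model decides to retrieve
            query = step.removeprefix("SEARCH:").strip()
            evidence = search(query)             # hypothetical retrieval call
            context += f"\nSearched '{query}':\n{evidence}"
        else:
            return step                          # model has enough evidence
    return call_model(context + "\nAnswer now with what you have.")
```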
---
### Agentic RAG
**Agentic RAG** combines retrieval with **autonomous planning and tool use**. Instead of one fixed retrieval step, an agentic system decomposes a user query into subtasks—searching, filtering, verifying, synthesizing—and executes them iteratively.
Each sub-agent or skill contributes specialized context: one retrieves documents, another summarizes, a third verifies citations.
Through orchestrated reasoning (often using frameworks like LangGraph, Semantic Kernel, or MCP skills), the model builds a **context graph** rather than a flat text dump.
Agentic RAG represents the frontier of context engineering: retrieval becomes an *interactive dialogue* between reasoning processes.
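A skeletal sketch of the decomposition, with each sub-agent as a role-specialized model call contributing to a shared context graph (`call_model` and its `role` parameter are hypothetical):

```python
def agentic_rag(question: str) -> str:
    """Decompose a query across role-specialized sub-agents."""
    graph = {"question": question}                 # shared context graph
    graph["docs"] = call_model(                    # hypothetical model call
        f"Search and return sources for: {question}", role="searcher")
    graph["summary"] = call_model(
        f"Summarize these sources:\n{graph['docs']}", role="summarizer")
    graph["verified"] = call_model(
        f"Check the summary against its sources.\nSummary:\n{graph['summary']}"
        f"\nSources:\n{graph['docs']}", role="verifier")
    return call_model(
        f"Answer the question from verified material:\n{graph['verified']}",
        role="synthesizer")
```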
---
### Structured Retrieval
**Structured retrieval** moves beyond raw text chunks to **schema-aware data access**—pulling from tables, JSON APIs, knowledge graphs, or relational databases.
Instead of feeding verbatim paragraphs into the model, structured retrieval injects *symbolic abstractions*: key-value pairs, ontologies, or code-generated summaries.
This yields higher precision and interpretability, reducing noise in the context window.
In RAG 2.0 pipelines, structured retrieval acts as the “semantic spine,” aligning unstructured text with formal data representations.
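A small illustration of the difference: instead of pasting a paragraph, inject a compact, schema-aware record (the record and field names are invented for the example):

```python
import json

# An invented record pulled from a structured source (database row, API result)
row = {"compound": "aspirin", "half_life_hours": 0.25,
       "class": "NSAID", "source": "pharmacopeia_v12"}

# Symbolic abstraction: a labeled, typed record instead of verbatim prose
context_block = "STRUCTURED FACT:\n" + json.dumps(row, indent=2)
prompt = f"{context_block}\n\nQuestion: How quickly is aspirin cleared?"
print(prompt)
```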
---
### Retrieval Memory Graphs
A **retrieval memory graph** links retrieved items not just by similarity but by **semantic relationships and temporal dependencies**.
Nodes represent documents or entities; edges capture citations, chronology, or co-occurrence.
When a model queries the graph, it can traverse these relationships—pulling clusters of context that mirror human associative reasoning.
Such graphs underpin **context persistence** across sessions, enabling long-term RAG systems that evolve as knowledge grows.
They also allow for *context pruning by topology*: removing stale or weakly connected nodes to fight **context rot**.
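A toy graph with typed edges, one-hop traversal, and topological pruning (the data is illustrative; a production system would sit on a graph database):

```python
nodes = {"paperA": "transformer survey", "paperB": "attention analysis",
         "noteC": "stale meeting note"}
edges = [("paperA", "paperB", "cites"), ("paperB", "paperA", "extends")]

def neighbors(node: str) -> list[str]:
    """One-hop associative traversal."""
    return [dst for src, dst, _ in edges if src == node]

def prune_by_topology() -> None:
    """Drop nodes with no edges at all (one defense against context rot)."""
    connected = {n for e in edges for n in e[:2]}
    for node in list(nodes):
        if node not in connected:
            del nodes[node]

prune_by_topology()
print(nodes)   # noteC is gone; paperA and paperB survive
```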
---
### Multi-Hop Retrieval
**Multi-hop retrieval** is the ability to chain retrieval steps—using the results of one query to form the next—to answer complex or compositional questions.
Instead of one vector search, the model performs *progressive exploration* (“find papers on X; from those, extract mentions of Y; retrieve Y’s datasets”).
In RAG 2.0, this process can be mediated by an **agentic planner** or **reasoning graph**, often with explicit citation tracking to preserve transparency.
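A schematic of progressive exploration, where each hop's results seed the next query and the whole trace is kept for citation tracking (`search` is a hypothetical retrieval function returning text):

```python
def multi_hop(topic: str) -> dict:
    trace = {}                                    # keep hops for citation tracking
    trace["papers"] = search(f"papers on {topic}")                           # hop 1
    trace["mentions"] = search(f"datasets mentioned in: {trace['papers']}")  # hop 2
    trace["datasets"] = search(f"download info for: {trace['mentions']}")    # hop 3
    return trace   # every hop is inspectable: explicit provenance
```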
---
### Hybrid Retrieval (Symbolic + Vector)
**Hybrid retrieval** combines **symbolic search** (e.g., keyword, metadata, SQL filters) with **vector retrieval** (semantic similarity).
This dual approach balances precision and recall—symbolic filters narrow scope, while vectors surface conceptual relevance.
Hybrid retrieval exemplifies *context budgeting*: each retrieval modality contributes complementary evidence to the context window.
Modern systems (OpenSearch, Vespa, Weaviate) increasingly integrate hybrid search natively, blurring the line between database query and context construction.
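A compact sketch: a symbolic filter narrows the pool, then vector similarity ranks what survives (embeddings are stubbed as precomputed two-dimensional lists):

```python
import math

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

docs = [
    {"text": "2024 fiscal report", "year": 2024, "vec": [0.9, 0.1]},
    {"text": "2019 fiscal report", "year": 2019, "vec": [0.9, 0.1]},
    {"text": "2024 holiday memo",  "year": 2024, "vec": [0.1, 0.9]},
]

query_vec, min_year = [0.8, 0.2], 2023
pool = [d for d in docs if d["year"] >= min_year]            # symbolic filter
ranked = sorted(pool, key=lambda d: cosine(d["vec"], query_vec), reverse=True)
print(ranked[0]["text"])   # -> "2024 fiscal report"
```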
---
### Context Cache / Adaptive Caching
In RAG 2.0 systems, a **context cache** stores previously retrieved and validated information so that similar future queries can reuse results without repeating full retrieval cycles.
These caches can be semantic (based on embeddings) or procedural (based on exact query signatures).
When combined with **context-quality scoring** or **recency weighting**, adaptive caches function as *short-term memory layers*—a critical ingredient in scalable, low-latency RAG systems.
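A sketch showing the two cache flavors side by side: an exact (procedural) lookup keyed by query signature, with a semantic fallback over stored embeddings (`embed` is a hypothetical embedding call):

```python
import hashlib
import math

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

exact_cache: dict[str, str] = {}              # query signature -> cached context
semantic_cache: list[tuple[list, str]] = []   # (embedding, cached context)

def signature(query: str) -> str:
    return hashlib.sha256(query.lower().strip().encode()).hexdigest()

def lookup(query: str, threshold: float = 0.92):
    sig = signature(query)
    if sig in exact_cache:                    # procedural hit: same query verbatim
        return exact_cache[sig]
    q_vec = embed(query)                      # hypothetical embedding call
    for vec, ctx in semantic_cache:
        if cosine(vec, q_vec) >= threshold:   # semantic hit: close enough
            return ctx
    return None                               # miss: run the full retrieval cycle
```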
---
### Evaluation of RAG 2.0 Systems
Evaluating RAG 2.0 requires metrics beyond text similarity or factual accuracy.
Researchers now measure:
* **Retrieval precision & coverage** (did the system find what it needed?)
* **Context utilization** (did the model actually use retrieved info?)
* **Faithfulness** (is reasoning grounded in sources?)
* **Latency & cost trade-offs** (is retrieval efficient?)
Context engineering thus extends into **instrumentation**: logging which retrieved passages influenced which parts of output, creating transparent, auditable reasoning chains.
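A minimal instrumentation sketch along these lines; the string-overlap test is a deliberately crude proxy for "did the model use this passage", where real evaluations use attribution methods or LLM judges:

```python
def evaluate_turn(retrieved: list[str], relevant: set[str], answer: str) -> dict:
    """Score one RAG turn from its logs. `relevant` is a labeled gold set."""
    hits = [p for p in retrieved if p in relevant]
    used = [p for p in retrieved if p[:40].lower() in answer.lower()]
    return {
        "retrieval_precision": len(hits) / len(retrieved) if retrieved else 0.0,
        "coverage": len(hits) / len(relevant) if relevant else 1.0,
        "context_utilization": len(used) / len(retrieved) if retrieved else 0.0,
        "audit_log": used,          # which passages plausibly shaped the answer
    }
```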
---
### MCP-Enabled RAG
Within the **Model Context Protocol**, RAG 2.0 systems gain a standardized way to **query and integrate external knowledge**.
Each MCP “server” (e.g., a Notion workspace, a local folder, a research database) exposes resources that act as *live retrieval endpoints*.
A model connected via MCP can list these sources, call them as tools, and dynamically weave their outputs into its context window.
This creates **distributed retrieval ecosystems** where context lives across local and cloud environments.
MCP effectively turns RAG pipelines into *federated context networks*—portable, inspectable, and user-controllable.
---
### Toward Context Operating Systems
The convergence of **RAG 2.0**, **agentic workflows**, and **MCP interconnects** points toward a new abstraction: the **Context Operating System**.
In such systems, context is not an incidental by-product of prompting but a **first-class computational resource**—queried, versioned, synchronized, and visualized like code.
This vision reframes AI interaction as *context management*: deciding what should be in focus, what can fade, and how human oversight governs that flow.
Context engineering, in this view, becomes the architecture of thought for machine-assisted reasoning.
---
### Context Budgeting
**Context budgeting** is the deliberate allocation of a model’s finite token space across different informational layers — system prompts, conversation history, retrieved material, user input, and expected output.
Because every transformer has a hard limit (its **context window**), effective context engineering requires treating that window like a budget: you must decide which portions of context deserve more “attention real estate” and which can be summarized, pruned, or omitted.
A well-balanced context budget reflects design priorities. For instance, in a 200K-token window, a system might reserve 10K for system and role instructions, 20K for persistent memory, 100K for retrieved documents, 50K for user input, and 20K for output generation. Overloading any one layer risks incoherence or hallucination.
The key insight is that *context is a scarce cognitive resource*—and budgeting is the art of optimizing it for the model’s interpretive performance.
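The worked example above, expressed as a checked allocation (the figures are taken directly from the text; the assertion guards against over-budget layouts):

```python
WINDOW = 200_000

budget = {
    "system_and_role": 10_000,
    "persistent_memory": 20_000,
    "retrieved_documents": 100_000,
    "user_input": 50_000,
    "output_reserve": 20_000,
}

assert sum(budget.values()) <= WINDOW, "context budget exceeds the window"

def headroom(layer: str, used: int) -> int:
    """How many tokens a layer has left before it must summarize or prune."""
    return budget[layer] - used

print(headroom("retrieved_documents", 87_500))   # -> 12500
```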
---
### Retrieval Chain — *extended definition*
The **retrieval chain** is the full pipeline that transforms a user query into an informed, context-rich model prompt.
It typically follows a multi-step process:
`user query → preprocessing → retrieval → filtering → ranking → formatting → prompt construction → generation.`
Each stage is a potential site of context engineering. Preprocessing defines how the query is represented (keywords, embeddings, structured attributes). Retrieval and filtering determine the raw candidate pool. Ranking optimizes for relevance or diversity. Formatting decides how results are presented (summaries, citations, metadata).
At the final stage, all curated material is assembled into the prompt that enters the model’s context window.
In advanced RAG 2.0 systems, this chain is no longer linear but **iterative and reflexive**: the model may call back into retrieval mid-generation, re-ranking or expanding context dynamically as its reasoning evolves.
---
### Context Drift
**Context drift** describes the gradual semantic shift that occurs when a model’s apparent understanding of a topic, goal, or identity diverges from the user’s intent or the external truth conditions of a project.
Drift can arise from accumulated summarization errors, evolving instructions, or subtle changes in wording that compound across multiple turns or sessions. Over time, these shifts can lead to factual inconsistency, goal misalignment, or loss of coherence.
From a systems-design perspective, context drift mirrors the problem of **concept drift** in machine learning—data distributions change, and the model’s internal representation no longer matches reality.
Mitigation strategies include regular **context grounding** (refreshing with verified data), **context pruning** (removing outdated memory), and **semantic anchoring** (reasserting key facts or constraints periodically).
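A sketch of the simplest of these mitigations, semantic anchoring: reassert verified facts every N turns so accumulated summarization errors cannot silently overwrite them (the facts and cadence are illustrative):

```python
ANCHOR_FACTS = [
    "The project targets Python 3.12.",
    "The client is Acme Corp; the deadline is Q3.",
]
REANCHOR_EVERY = 5   # turns

def with_anchoring(messages: list[dict], turn: int) -> list[dict]:
    """Periodically prepend ground-truth facts to counteract drift."""
    if turn % REANCHOR_EVERY == 0:
        anchor = {"role": "system",
                  "content": "Ground truth (do not contradict):\n"
                             + "\n".join(f"- {f}" for f in ANCHOR_FACTS)}
        return [anchor, *messages]
    return messages
```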
In humanistic terms, context drift is a hermeneutic danger: a slow re-interpretation of meaning through iteration.
---
### Context Grounding
**Context grounding** ensures that the information a model uses and reuses reflects the *actual external state* of the world, not just prior textual echoes.
In practice, grounding is achieved by linking the model’s reasoning to verifiable sources—retrieved documents, structured data, or tool outputs—so that each statement can be traced back to evidence.
Grounding guards against **hallucinated continuity**, the illusion that the model “remembers” correctly when its memory is only reconstructed context.
At the architectural level, grounding connects symbolic reasoning (the model’s text-based logic) with empirical data streams (APIs, sensors, updated databases).
For scholars and educators, it also has epistemic significance: grounded context allows machine reasoning to remain accountable, auditable, and situated within a shared reality.
Without grounding, persistence devolves into fiction; with it, context becomes a living epistemology.
---
### Context Framing
**Context framing** refers to the rhetorical and structural design choices that shape *how* a model interprets the information it receives.
Two systems might process the same text but yield very different outputs depending on the framing — the role prompt, tone, instructions, or even punctuation.
Framing determines the lens through which the model perceives context: is it reading as a critic, summarizer, teacher, or analyst?
From a design standpoint, framing operates at the boundary between **prompt engineering** and **context engineering**.
It doesn’t change the content of the context but its *semantic posture* — like mise-en-scène in film or voice in writing.
Good framing stabilizes meaning; poor framing amplifies ambiguity or drift.
Advanced interfaces use **dynamic framing**—modifying tone or perspective based on detected user intent or prior turns—to maintain coherence and persona consistency.
---
### Context Orchestration
**Context orchestration** is the active coordination of multiple context sources—conversation history, user profiles, retrieved documents, memory stores, and tool outputs—into a coherent whole for each model invocation.
It’s the process of *composing a symphony of contexts* where timing, order, and emphasis matter as much as content.
An orchestrator (human or software) decides which sources to invoke, in what sequence, and how to merge their outputs: Should recent messages outweigh long-term memory? Should a retrieval step precede summarization or follow it?
In complex agentic systems, orchestration often involves **pipelines of models** or **context layers** communicating asynchronously.
The success of orchestration determines whether the model acts like a scatterbrained assistant or a disciplined researcher.
In this sense, context orchestration is both a technical and narrative art—aligning the rhythm of reasoning across components.
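A skeletal orchestrator that fixes order and emphasis across sources before each invocation (the four source functions are hypothetical placeholders):

```python
def orchestrate(user_input: str) -> str:
    """Compose one model invocation from multiple context sources, in order."""
    layers = [
        ("profile",   load_user_profile()),          # hypothetical sources
        ("memory",    recall_long_term(user_input)),
        ("retrieved", retrieve_documents(user_input)),
        ("recent",    last_n_messages(4)),
    ]
    # Order and labeling are deliberate: later sections sit closer to the
    # user's message and tend to carry more weight in practice.
    body = "\n\n".join(f"## {name}\n{content}"
                       for name, content in layers if content)
    return f"{body}\n\n## user\n{user_input}"
```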
---
### Context Compression
**Context compression** is any method for shrinking large or redundant context into a more compact, semantically faithful representation that fits within a limited window.
Compression can be achieved through summarization, embedding-based abstraction, clustering, or structured encoding (e.g., replacing raw text with JSON metadata).
The goal is to **preserve meaning density while reducing token volume**—allowing models to sustain long-term coherence without exceeding their context budgets.
Compression is central to persistent systems: as memory grows, raw accumulation leads to noise and drift.
Strategic compression transforms long histories into distilled knowledge—turning dialogue logs into summaries, documents into embeddings, or sessions into “episodic memories.”
The trade-off is interpretive: compression gains efficiency at the cost of nuance.
Thus, context engineers must decide when fidelity matters more than continuity, and when to let the model remember *in outline* rather than in full detail.
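A sketch of rolling compression: once the transcript outgrows its budget, the oldest turns are distilled into a synopsis that replaces them (`summarize` stands in for a model-generated summary; the token counter is a rough 4-characters-per-token heuristic):

```python
def compress_history(history: list[str], budget_tokens: int,
                     tokens=lambda s: len(s) // 4) -> list[str]:
    """Fold the oldest turns into one synopsis until the history fits.

    `tokens` is a crude heuristic; real systems use a tokenizer.
    `summarize` is a hypothetical model call.
    """
    while sum(tokens(t) for t in history) > budget_tokens and len(history) > 2:
        oldest = history[:2]
        synopsis = summarize("\n".join(oldest))   # lossy but meaning-dense
        history = [f"[summary] {synopsis}", *history[2:]]
    return history
```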