What is Context Engineering
Large language models are text completion engines at their core. They continue from whatever context you give them, predicting the most likely next tokens. Context is everything the model sees before it answers: system prompts, chat history, tool results, documents, UI metadata.
Context engineering is designing everything the model sees before it generates a response. In other words, effective AI implementations focus on building high-quality context that leads to high-quality outputs.
Text in, text out
The first GPT models were simple text-in, text-out systems. You gave them text and the model generated a continuation of that text.
In the early days, people built clever prompt templates to guide the model's output. But fundamentally, the work was about building a chain of text that led the model to produce the desired result. This principle still holds true with modern LLMs.
LLMs are probabilistic text generators at their core: they predict the next token based on the preceding context. This means every token or word you provide, whether in the user input or the system prompt, can affect what comes next.
Note: in real life, tokens are pieces of words, punctuation, or spaces, but for clarity the examples here use whole words. For a more in-depth dive into transformers, see the videos linked at the end of this article.
How chat models work
Most LLMs are trained on chat templates that use special tokens like <|system|>, <|user|>, <|assistant|>, <|tool|>, and </s>, which give structure to multi-turn chats, tool use, reasoning, and so on.
The key point: under the hood it's still text-in text-out. We still want to build a chain of context that leads the model to produce the desired output.
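To make that concrete, here is a minimal sketch of how a chat might be flattened into a single text sequence before the model sees it. The exact template and special tokens vary by model family; the ones below simply reuse the placeholder tokens mentioned above.

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Flatten a chat into one text sequence using placeholder special tokens.
// Real chat templates differ per model; this only illustrates the idea.
function renderChat(messages: Message[]): string {
  const rendered = messages
    .map((m) => `<|${m.role}|>\n${m.content}</s>`)
    .join("\n");
  // The trailing assistant tag is where the model starts completing.
  return `${rendered}\n<|assistant|>\n`;
}

console.log(
  renderChat([
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "What is context engineering?" },
  ])
);
```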
Tools, MCP, and the application layer
Tool use (function calling) is the big shift that made LLMs far more useful beyond raw text generation. When an LLM uses tools like web search, code execution, or external APIs, it is not directly browsing the web or calling those APIs itself.
Instead, the model generates text commands that the application layer interprets, executes, and feeds back into the model as more context.
1. The model generates a text command that matches a schema you define.
2. Your application parses that command and runs the real tool or API request.
3. You feed the result back into the model as more context.
4. The model continues the completion with this new information.
Tools: These are structured definitions of what external capabilities the model can invoke. Each tool has a name, description, input schema, and output schema. An LLM might have multiple tools available, like "web_search", "get_current_temperature", or "execute_code" and choose which one to call based on the user's request.
Application Layer: This is the app where you call the LLM API, manage context, parse model outputs, execute tool calls, and handle the overall workflow. The application layer interprets the model's text commands, runs the actual tools or APIs, and feeds results back into the model. This is the part you are most likely building.
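A minimal sketch of that application-layer loop, assuming a callModel function that stands in for whatever LLM API you use (the names and message shapes here are illustrative, not a specific SDK's API):

```typescript
type ToolCall = { name: string; arguments: Record<string, unknown> };
type ModelReply =
  | { type: "text"; text: string }
  | { type: "tool_call"; call: ToolCall };
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Real tool implementations, keyed by the names the model knows about.
const tools: Record<string, (args: Record<string, unknown>) => Promise<string>> = {
  get_current_temperature: async (args) => `21°C in ${String(args.city)}`,
};

// callModel receives the whole context and returns either final text
// or a requested tool call.
async function runAgent(
  messages: Message[],
  callModel: (messages: Message[]) => Promise<ModelReply>
): Promise<string> {
  while (true) {
    const reply = await callModel(messages);
    if (reply.type === "text") return reply.text; // no tool needed, we're done

    // The model only *requested* a tool call; the application executes it...
    const result = await tools[reply.call.name](reply.call.arguments);

    // ...and feeds the result back as more context for the next completion.
    messages.push({ role: "assistant", content: JSON.stringify(reply.call) });
    messages.push({ role: "tool", content: result });
  }
}
```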
Model Context Protocol (MCP) turns application APIs into tool definitions. You can think of MCP as the UI of apps for LLMs: instructions that let models reliably talk to external systems. For example, a project management app might expose tools like "create_task", "list_tasks", and "update_task_status" with clear schemas the LLM can use. This is also how you build the upcoming ChatGPT Apps, which let ChatGPT use third-party apps directly from the chat interface.
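For illustration, a tool definition for that hypothetical project management app might look roughly like this. The shape mirrors the name / description / input schema pattern described above; the exact wiring depends on your MCP SDK.

```typescript
// Illustrative tool definition: a name, a human-readable description the model
// reads, and a JSON Schema describing the accepted input.
const createTaskTool = {
  name: "create_task",
  description: "Create a new task in the project management app.",
  inputSchema: {
    type: "object",
    properties: {
      title: { type: "string", description: "Short title of the task" },
      assignee: { type: "string", description: "Username of the assignee" },
      due_date: { type: "string", description: "Due date in YYYY-MM-DD format" },
    },
    required: ["title"],
  },
} as const;
```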
How a tool call works:
- The system message provides context and instructions to the model
- The user asks a natural language question
- The assistant generates a <tool_call> with structured JSON parameters
- Your application executes the tool and returns the result as a tool message
- The assistant reads the tool result and formats it into a natural response for the user
Note: The model doesn't execute tools itself. It requests tool calls, and your application handles the execution and returns results.
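As a rough sketch, the context for such an exchange might look like the following sequence of messages. The <tool_call> formatting and role names vary by model and API; this is purely illustrative.

```typescript
// Illustrative message sequence for a single tool-using exchange.
const conversation = [
  { role: "system", content: "You are a weather assistant. Use tools when needed." },
  { role: "user", content: "How warm is it in Helsinki right now?" },
  {
    role: "assistant",
    content:
      '<tool_call>{"name":"get_current_temperature","arguments":{"city":"Helsinki"}}</tool_call>',
  },
  { role: "tool", content: '{"temperature_c": 21, "condition": "sunny"}' },
  { role: "assistant", content: "It's currently about 21°C and sunny in Helsinki." },
];
```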
So again, what is Context Engineering?
It's about assembling the right information, tools, and constraints at the right moment. Every time the model runs, you're building a temporary workspace that includes:
- System prompt: role, rules, examples
- User message: the current request
- Chat history: recent conversation
- Retrieved information: relevant documents or data
- Available tools: what actions the model can take
- Output format: schemas guiding the response structure
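Put together, a single model call's workspace might assemble roughly these pieces. This is a sketch with made-up content, not a specific provider's request format.

```typescript
// Everything below is context: it all ends up in front of the model as text.
const request = {
  system: "You are a support assistant for Acme. Answer only from the provided docs.",
  messages: [
    // recent chat history
    { role: "user", content: "How do I reset my password?" },
  ],
  retrieved: [
    // just-in-time retrieved documents, trimmed to what's relevant
    { id: "docs/account-security.md", excerpt: "Password resets are done via..." },
  ],
  tools: ["search_docs", "create_support_ticket"], // actions the model may request
  outputSchema: {
    // constrain the shape of the response
    type: "object",
    properties: {
      answer: { type: "string" },
      sources: { type: "array", items: { type: "string" } },
    },
    required: ["answer", "sources"],
  },
};
```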
Building powerful and reliable AI agents is becoming less about finding a magic prompt or waiting for model updates. It is about engineering the context: providing the right information and tools, in the right format, at the right time. It's a cross-functional challenge that involves understanding your business use case, defining your outputs, and structuring all the necessary information so that an LLM can "accomplish the task."
Most AI failures aren't model failures. They're context failures. The model didn't have the right information, constraints, or tools available when it needed them.
Context windows are limited: a context window is the maximum number of tokens a model can process at once. Each time you send a message to a chat model, the entire context (system prompt, chat history, tool results, documents) is included and counts against that limit.
Even large context windows suffer from "attention dilution", meaning that if you pack in too much information, the model loses focus. Good context engineering means including only what's needed, using compressed summaries over raw data, and retrieving information just-in-time rather than loading everything upfront.
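One common tactic is to keep only what fits a token budget, for example keeping the most recent messages and summarizing or dropping the rest. A rough sketch, using a crude characters-to-tokens estimate (real systems use the model's tokenizer):

```typescript
type Message = { role: string; content: string };

// Crude token estimate: roughly 4 characters per token for English text.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// Keep the most recent messages that fit the budget; older ones are dropped
// here, though in practice you would often summarize them instead.
function trimToBudget(history: Message[], budgetTokens: number): Message[] {
  const kept: Message[] = [];
  let used = 0;
  for (const message of [...history].reverse()) {
    const cost = estimateTokens(message.content);
    if (used + cost > budgetTokens) break;
    kept.unshift(message);
    used += cost;
  }
  return kept;
}
```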
High-quality context leads to high-quality output
LLMs don't remember, don't learn on the fly, and don't run code by themselves. Every time you send data to an LLM, it's a fresh temporary workspace that requires all necessary information to be included in the context. Effective context engineering involves:
- Selecting what goes into context: Prompts, history, documents, tool results—everything the model needs to generate accurate responses.
- Structuring it clearly: Organize context so the model can follow the logic and produce coherent outputs.
- Constraining outputs: Use schemas and tools to guide the model toward producing reliable, structured results (see the example after this list).
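For example, the Vercel AI SDK (mentioned later in this article) lets you constrain output with a zod schema via generateObject. Exact imports and the model name below are assumptions that depend on your setup.

```typescript
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// The schema constrains what the model is allowed to produce,
// so downstream code can rely on the shape of the result.
const invoiceSchema = z.object({
  vendor: z.string(),
  total: z.number(),
  currency: z.string(),
  lineItems: z.array(z.object({ description: z.string(), amount: z.number() })),
});

const { object } = await generateObject({
  model: openai("gpt-4o"), // model name is just an example
  schema: invoiceSchema,
  prompt: "Extract the invoice details from the following email: ...",
});

console.log(object.total); // typed as number
```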
On Retrieval Augmented Generation (RAG)
When you're working with large information sources like a knowledge base or documentation set, you can't include all the information in each request (because of the context window and, well, costs). RAG became relevant as a way to augment the model's original training data and ground its answers in retrieved data.
Early RAG implementations retrieved context for every user query, but these days most implementations are agentic: the model uses search as a tool and decides when it needs more information and what to search for.
For example, an assistant grounding its answer in retrieved documents might respond with inline citations like this:
TypeScript offers several key benefits:
- Static Type Checking: Catches errors at compile time instead of runtime, making your code more reliable [1]
- Better Developer Experience: Enhanced IDE support with autocomplete and refactoring tools [2]
- Improved Maintainability: Easier to understand and modify code, especially in large codebases [2]
- Team Collaboration: Type definitions serve as documentation, making it easier for teams to work together [2][3]
How RAG works (simplified):
- The user asks a question
- The assistant decides to search the knowledge base and generates a <tool_call> with a search query
- Your application searches the vector database/knowledge base and returns relevant documents with relevance scores
- The retrieved documents are added to the context as a tool message
- The assistant reads the retrieved information and "decides" if it needs to search again or not.
- The assistant synthesizes an answer based on the actual documents.
The challenge is building an effective search system that finds relevant information across different query types. If your project relies heavily on knowledge retrieval, start by building a solid search infrastructure with AI agents as a primary consumer. AI agents can query data in multiple ways: keyword search, semantic search, filtered queries, or even executing code to compute results.
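A sketch of what exposing search as a tool might look like, assuming a hypothetical searchIndex function in front of your search infrastructure (how the tool is wired to the model depends on your SDK):

```typescript
// Hypothetical search backend: could be keyword, semantic (vector), or hybrid.
async function searchIndex(query: string, topK: number) {
  // ...call your actual search infrastructure here...
  return [
    { id: "docs/typescript.md", score: 0.87, excerpt: "TypeScript adds static types..." },
  ];
}

// Tool definition the model sees; the model decides when and what to search.
const searchTool = {
  name: "search_knowledge_base",
  description: "Search the internal documentation. Use specific keywords.",
  inputSchema: {
    type: "object",
    properties: {
      query: { type: "string" },
      top_k: { type: "number", description: "How many results to return" },
    },
    required: ["query"],
  },
  // Returns documents with relevance scores as a tool message for the model.
  execute: async (args: { query: string; top_k?: number }) =>
    JSON.stringify(await searchIndex(args.query, args.top_k ?? 5)),
};
```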
AI implementations need good user experience (UX)
Consider a scenario: you are using an AI assistant to get medical advice. The model provides an answer that seems plausible, but you have no idea where that information came from. Is it based on current medical guidelines, outdated information, or fabricated text?
Good AI UX builds trust through transparency and verifiability. LLMs can produce incorrect or misleading information even when functioning as intended, so design interfaces that help users verify outputs and recover from errors.
- Be transparent: show the process/context that led to an answer.
- Make it easy to verify citations: ideally in-line, not just tiny footnote links that require reading a 20 page document.
- Use friction wisely: for example on critical actions, require user confirmation before execution.
- Use streaming outputs to let users see what is happening in real time and interrupt if something seems off (see the sketch after this list).
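A small streaming sketch, again assuming the Vercel AI SDK's streamText; the model name and prompt are placeholders.

```typescript
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

// Stream tokens to the user as they are generated instead of waiting
// for the full response; users can read along and interrupt early.
const result = streamText({
  model: openai("gpt-4o"), // model name is just an example
  prompt: "Summarize the following meeting notes: ...",
});

for await (const delta of result.textStream) {
  process.stdout.write(delta); // render incrementally in your UI
}
```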
Not everything needs to be a chat interface either. There is much to be explored in intentfaces and other interaction patterns that suit different use cases better than chat.
How to build intuition about LLMs and context engineering
Getting hands-on experience is the best way to understand how LLMs work and how to design effective context. Here are some steps to get started:
- Learn the basics: Understand what structured outputs and tool calling are. They're the backbone of robust AI products.
- If you have any development skills, you can try our basic tool calling template to build a chatbot using the Vercel AI SDK. It's very bare-bones but shows the core principles. We use it as an assignment in our recruitment process.
- Learn context engineering: Learn how to shape prompts, chain tools, and design flows that produce stable results. Start from guides like Anthropic's context engineering article or OpenAI's prompt engineering guide. There's also Google's prompt engineering guide.
- Nothing beats direct experience, so apply prompt engineering techniques in real projects.
- You can also use LLMs to help you learn prompt engineering. For example, ask an LLM to critique and improve your prompts.
- Try as many different GenAI products as you can: Spot patterns in how they structure context and workflows. Direct experience builds intuition.
- Try different chat apps like ChatGPT, Claude and Gemini. Try to solve something for yourself and see how they manage web search, memory, deep research, and tool use in general.
- AI app builders like Lovable, V0, Figma Make.
- If you are code-curious or already have development chops: look at GitHub Copilot, Claude Code, and ChatGPT Codex.
For more in-depth LLM insights see Andrej Karpathy's lectures or watch 3Blue1Brown's Transformers, the tech behind LLMs. You don't need to understand all the math behind LLMs, but having a mental model of how they work under the hood helps you design better context and prompts.
Over time you'll see that everything is context building: your prompt, the app's hidden prompts, the tools it calls, and the data you feed it. Use that understanding to form your own opinion, rather than echoing hype or skepticism. For more practical insights on building AI products, Simon Willison's writing offers excellent real-world perspectives.
To sum it up
Context makes or breaks AI implementations, and you only really understand how context works (prompts, structures, tools) by actually using and building with them.
The gap between reading about AI and working with it directly is enormous. Theory gives you vocabulary, but practice gives you intuition about what works, what fails, and why. You also get a sense of different models' strengths and weaknesses, and how to design around those. So start experimenting today. Build small things. Break them. Fix them. Build for yourself. Solve a problem you have.
Build your own context.