Context Engineering: What the Term Actually Means and What It Doesn't

Sometime in 2025, “context engineering” started displacing “prompt engineering” as the term people used. Like most renamings in this field, it was half marketing and half a genuine shift in what the work actually is. The marketing half is noise. The genuine half is worth understanding, because it names a real engineering problem that I spend a meaningful fraction of my time on.

This post is about the real part: what context engineering is when you treat it as engineering rather than as a LinkedIn phrase.

The Renaming Is Pointing at Something Real

The reason “prompt engineering” stopped being adequate is that single-shot prompting stopped being the dominant pattern. The interesting systems in 2024–2025 are not “write a clever prompt and get an answer.” They are loops: an agent that calls tools, reads results, calls more tools; a RAG pipeline that retrieves, reranks, and assembles; a long-running session that accumulates state. In all of these, the prompt is not a thing you write — it is a thing your code assembles, dynamically, on every single model call.

Once the prompt is assembled by code from many sources, the engineering question is no longer “what words do I use.” It is: given a fixed token budget, what goes into the context window on this call, in what order, and what gets dropped?

That is the actual discipline. Everything else labelled “context engineering” is either a subset of this or a rebrand of normal prompting.

The Context Window Is a Resource Budget

The mental model that makes this tractable: treat the context window exactly like you’d treat a fixed memory allocation or a cache of fixed size.

It has a hard capacity. Every token you spend on one thing is a token you can’t spend on another. Every token also costs money and latency: per-token pricing scales linearly with context size, and processing time grows at least as fast. And, the part people miss, more context is not monotonically better. Past a certain fill level, model performance on the actual task degrades even though you’re technically “giving it more information.”

That last point is the whole game. If context were free and quality rose monotonically with it, there would be no engineering problem — you’d stuff everything in. The reason context engineering is a discipline is that the relationship between “amount of context” and “quality of output” is not monotonic.
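To make the budget framing concrete, here is a minimal sketch of fitting prompt sections into a fixed token budget by priority. Everything in it is an assumption for illustration: the `Section` type, the `fit_to_budget` function, and the 4-characters-per-token estimate, which stands in for a real tokenizer.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


@dataclass
class Section:
    name: str
    text: str
    priority: int  # lower number = more important, dropped last


def fit_to_budget(sections: list[Section], budget: int) -> tuple[list[Section], list[Section]]:
    """Admit sections in priority order while they fit; return (kept, dropped).

    Both lists preserve the original assembly order of the sections.
    """
    kept_ids: set[int] = set()
    remaining = budget
    # sorted() is stable, so equal priorities are considered in assembly order.
    for index, section in sorted(enumerate(sections), key=lambda p: p[1].priority):
        cost = estimate_tokens(section.text)
        if cost <= remaining:
            kept_ids.add(index)
            remaining -= cost
    kept = [s for i, s in enumerate(sections) if i in kept_ids]
    dropped = [s for i, s in enumerate(sections) if i not in kept_ids]
    return kept, dropped


sections = [
    Section("system", "You are a support assistant.", 0),
    Section("old_tool_output", "x" * 4000, 3),
    Section("retrieved", "y" * 800, 1),
    Section("query", "Why was my card declined?", 0),
]
kept, dropped = fit_to_budget(sections, budget=500)
print([s.name for s in kept], [s.name for s in dropped])
```

A greedy priority pass like this is only one possible policy. The point is that what gets dropped is an explicit decision in code rather than an accident of truncation.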

Why More Context Hurts: The Failure Modes

I group the ways a poorly managed context window degrades output into four failure modes. All four show up in production, not just in benchmarks.

Lost in the middle. Models attend most reliably to the beginning and end of their context. Information buried in the middle of a long context is recalled less reliably. This is well documented, and although newer models reduce the effect, it has not disappeared as windows get larger. The practical consequence: position matters. The most important instructions and the most relevant retrieved facts should not be in the middle of a 100K-token blob.

Distraction and dilution. Irrelevant-but-plausible content in the context pulls the model’s output toward it. If you retrieve ten documents and only two are relevant, the eight irrelevant ones are not neutral — they actively degrade the answer by giving the model material to anchor on. A smaller, cleaner context routinely beats a larger, noisier one.

Contradiction. When the context contains conflicting information — an outdated tool result and a fresh one, two retrieved documents that disagree — the model has no reliable way to adjudicate. It will often pick the wrong one, or average them into something false.

Budget exhaustion in loops. In an agent loop, every tool call appends its result to the context. Left unmanaged, a ten-step agent run accumulates ten tool results, and by step eight the window is mostly full of stale step-one and step-two output that no longer matters. The agent’s effective working memory for the current decision shrinks as the run goes on.

Each of these is an engineering problem with engineering solutions. None of them is solved by “writing a better prompt.”
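The budget-exhaustion failure mode is easy to see with arithmetic. The sketch below assumes a hypothetical 32,000-token window, a 2,000-token fixed prompt, and ten tool results of 3,000 tokens each; none of these numbers come from a real system.

```python
def fill_levels(window: int, base: int, result_sizes: list[int]) -> list[float]:
    """Fraction of the window in use after each step when every result is appended."""
    used = base
    levels = []
    for size in result_sizes:
        used += size
        levels.append(min(1.0, used / window))
    return levels


def stale_share(result_sizes: list[int], step: int, keep_last: int = 2) -> float:
    """Share of tool-result tokens at `step` (1-indexed) older than the last `keep_last` results."""
    seen = result_sizes[:step]
    cut = max(0, len(seen) - keep_last)
    total = sum(seen)
    return sum(seen[:cut]) / total if total else 0.0


sizes = [3000] * 10  # hypothetical tool-result sizes, in tokens
levels = fill_levels(window=32_000, base=2_000, result_sizes=sizes)
print([round(level, 2) for level in levels])
print(stale_share(sizes, step=8))
```

Under these assumptions the window is over 80% full at step eight, and three quarters of the tool-result tokens come from steps the agent has already moved past.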

The Four Levers

Concretely, context engineering is the management of four things. I think of them as the four levers you have on every model call.

1. Selection — what goes in

This is retrieval, broadly. In a RAG system it’s the retrieve-and-rerank pipeline. In an agent it’s deciding which prior tool results, which parts of the conversation history, and which system instructions are relevant to this call.

The key insight: selection is a precision/recall tradeoff, and for context windows precision usually matters more than recall. Retrieving the top-3 genuinely relevant chunks beats retrieving the top-20 where 17 are noise — because of the dilution failure mode above. The instinct from search (“cast a wide net, let the model sort it out”) is wrong here. The model does not sort it out; it gets distracted.

Bad:  retrieve top-20 by vector similarity → dump all 20 into context
Good: retrieve top-20 → rerank with a cross-encoder → keep top-3–5
      → optionally dedupe near-identical chunks → assemble

The rerank step is the single highest-leverage addition most RAG systems are missing. Bi-encoder vector similarity is cheap and gets you candidates; a cross-encoder reranker is more expensive but far more accurate at judging actual relevance, and you only run it on the candidate set.
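The rerank, dedupe, and keep-top-k steps can be sketched as follows. This is an illustration under stated assumptions, not a production pipeline: `overlap_score` is a toy word-overlap scorer standing in for a real cross-encoder, Jaccard word overlap stands in for a real near-duplicate check, and the function names and the 0.9 threshold are made up for the example.

```python
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two strings, used here as a cheap near-duplicate test."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words present in the chunk.

    Assumption: in a real system this would be a cross-encoder reranker.
    """
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0


def select_chunks(
    query: str,
    candidates: list[str],
    rerank_score: Callable[[str, str], float],
    keep: int = 3,
    dedupe_threshold: float = 0.9,
) -> list[str]:
    """Rerank the candidates, skip near-duplicates, and keep the best `keep` chunks."""
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    selected: list[str] = []
    for chunk in ranked:
        if any(jaccard(chunk, s) >= dedupe_threshold for s in selected):
            continue  # near-identical to something already kept
        selected.append(chunk)
        if len(selected) == keep:
            break
    return selected


candidates = [
    "refund policy for annual plans",
    "refund policy for annual plans",
    "office holiday schedule",
    "annual plans renew every twelve months",
    "how to request a refund for annual plans",
]
print(select_chunks("refund annual plans", candidates, overlap_score))
```

Deduplicating while filling the top-k, rather than after truncating, means a duplicate never takes a slot that a distinct chunk could have used.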

2. Ordering — where it goes

Given the lost-in-the-middle effect, ordering is a free lever. The standard layout that works:

[ system instructions / role / constraints ]      ← top: high attention
[ retrieved context, least→most relevant ]         ← most relevant nearest the query
[ conversation history / prior turns ]
[ the current user query / task ]                  ← bottom: high attention

Put the things the model must not ignore at the top and bottom. Put the most relevant retrieved chunk closest to the query, not first in the list. This costs nothing and measurably improves recall of the information you went to the trouble of retrieving.
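The layout above can be expressed as a small assembly function. This is a sketch: `assemble_context` and its parameters are hypothetical names, and the relevance scores are assumed to come from whatever reranker the system already uses.

```python
def assemble_context(
    system: str,
    scored_chunks: list[tuple[float, str]],
    history: list[str],
    query: str,
) -> str:
    """Assemble one prompt: system first, chunks least to most relevant, history, query last."""
    # Ascending sort puts the highest-scoring chunk last, nearest the query.
    chunks = [text for _, text in sorted(scored_chunks, key=lambda pair: pair[0])]
    parts = [system, *chunks, *history, query]
    return "\n\n".join(part for part in parts if part)


prompt = assemble_context(
    system="SYSTEM: answer only from the provided documents.",
    scored_chunks=[(0.9, "DOC: best match"), (0.2, "DOC: weak match"), (0.5, "DOC: partial match")],
    history=["user: hi", "assistant: hello"],
    query="USER: what is the refund window?",
)
print(prompt)
```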

3. Compression — making it smaller

When you have more relevant information than fits, or when a loop is accumulating history, you compress rather than truncate.

Summarisation of history. In a long agent run or chat session, periodically replace the oldest N turns with a model-generated summary. You lose detail but keep the gist, and you reclaim budget for current work.
Tool-result distillation. A tool that returns a 5,000-token JSON blob is usually returning 200 tokens of information. Post-process tool results — extract the fields that matter before appending to context — rather than dumping raw output.
Structured over prose. A table or a few key-value lines convey the same facts as a paragraph in a fraction of the tokens.

The discipline here is the same as caching: you’re deciding what to keep at full fidelity, what to keep compressed, and what to evict.
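Two of these compression moves, tool-result distillation and history summarisation, can be sketched in a few lines. The function names are hypothetical, and `summarize` is a caller-supplied stand-in for a model call; the lambda in the usage lines is a placeholder, not a real summariser.

```python
import json
from typing import Callable


def distill_tool_result(raw_json: str, fields: list[str]) -> str:
    """Keep only the named top-level fields of a JSON tool result, as key: value lines."""
    data = json.loads(raw_json)
    return "\n".join(f"{field}: {data[field]}" for field in fields if field in data)


def compact_history(
    turns: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace all but the most recent `keep_recent` turns with a single summary line."""
    cut = max(0, len(turns) - keep_recent)
    if cut == 0:
        return list(turns)  # nothing old enough to compress
    old, recent = turns[:cut], turns[cut:]
    return [f"[summary of {len(old)} earlier turns] {summarize(old)}", *recent]


raw = json.dumps({"order_id": 42, "status": "shipped", "debug": {"trace": "x" * 100}})
print(distill_tool_result(raw, ["order_id", "status"]))
print(compact_history(["t1", "t2", "t3", "t4"], keep_recent=2, summarize=lambda old: "; ".join(old)))
```

Which fields survive distillation is a per-tool decision, so in practice this tends to live next to the tool definition rather than in the generic loop.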

4. Isolation — keeping contexts separate

The lever people discover last and value most. Not everything needs to share one context window. Sub-agents, separate tool-calling contexts, and scratchpad-style external memory let you keep the main context clean.

Example: instead of having one agent read a 50-page document into its context to answer a question, spawn a sub-agent whose entire job is “read this document, return the answer to this question.” The sub-agent burns its own context window on the document; the parent agent gets back only the distilled answer. The 50 pages never touch the parent’s context.

This is the same principle as process isolation. A failure or a token explosion in the sub-context doesn’t pollute the parent. It’s how you build agent systems that don’t degrade over long runs.
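A minimal sketch of the sub-agent pattern, assuming a hypothetical `Model` callable that takes a prompt and returns a completion. No real client library is implied, and `fake_model` exists only so the example runs.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def ask_document(model: Model, document: str, question: str) -> str:
    """Sub-agent: spends its own context on the document and returns only the answer."""
    prompt = f"Document:\n{document}\n\nQuestion: {question}\nAnswer briefly."
    return model(prompt)


def parent_step(model: Model, parent_context: list[str], document: str, question: str) -> list[str]:
    """Parent agent: delegates the reading and appends only the distilled finding."""
    answer = ask_document(model, document, question)
    return [*parent_context, f"Finding: {answer}"]


def fake_model(prompt: str) -> str:
    """Placeholder model so the sketch is runnable; always returns the same answer."""
    return "Net 30"


document = "clause " * 5000  # stands in for the 50-page document
context = parent_step(fake_model, ["Task: review contract"], document, "What are the payment terms?")
print(context)
```

The parent context grows by one short line regardless of how large the document is, which is the property that keeps long runs from degrading.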

Where This Sits Relative to RAG and Prompting

To be precise about the taxonomy, because the terms get muddled:

Term	What it actually is
Prompt engineering	Crafting the fixed instructional parts — role, format, constraints, examples. Still real, now a subset.
Retrieval (RAG)	The selection lever specifically — finding relevant external knowledge to inject.
Context engineering	The superset: managing the whole window as a budget — selection + ordering + compression + isolation, on every call.

Context engineering is not a replacement for prompt engineering or RAG. It’s the layer that sits above both and treats them as components. A good system has all three: well-crafted instructions (prompting), good retrieval (RAG), and disciplined assembly of the whole window under a budget (context engineering).

What It Doesn’t Mean

Equally important, because the term gets stretched to cover everything:

It does not mean “use a model with a bigger context window.” A 1M-token window doesn’t remove the budget problem; it raises the ceiling and makes the dilution and lost-in-the-middle problems worse if you naively fill it. Bigger windows are a tool, not a solution.
It is not a synonym for “doing AI stuff well.” Plenty of production LLM quality problems are eval problems, model-choice problems, or plain bugs. Not everything is context.
It is not magic that compensates for a bad underlying system design. If your retrieval corpus is garbage, no amount of clever assembly saves you.

How I Actually Work On It

The thing that turns this from hand-waving into engineering is measurement. You cannot tune the four levers by intuition — the non-monotonicity guarantees your intuition will be wrong sometimes. The loop:

Instrument the context. Log the actual assembled context for a sample of production calls. You will be surprised how much junk is in there. The first time I logged real assembled contexts, roughly 40% of the tokens were stale tool results and duplicate retrieved chunks.
Build an eval set. A few dozen representative tasks with known-good outcomes. (This is the subject of its own discipline — see Evaluating LLM-Integrated Systems.)
Change one lever, measure. Add reranking. Re-measure. Reorder. Re-measure. Compress tool results. Re-measure. Treat it like performance tuning, because that’s what it is.
Watch the budget over the run, not just per call. For agents, plot context fill level across the steps of a run. The degradation is usually visible as a curve.
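The instrumentation step can start very small. The sketch below assumes the assembly code can log each context as a list of `(source, text)` segments; the `context_report` name, the source labels, and the 4-characters-per-token estimate are all assumptions for illustration.

```python
from collections import defaultdict


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def context_report(segments: list[tuple[str, str]]) -> dict:
    """Summarise one assembled context: total tokens, share per source, exact-duplicate share."""
    by_source: dict[str, int] = defaultdict(int)
    seen: set[str] = set()
    duplicate_tokens = 0
    for source, text in segments:
        tokens = estimate_tokens(text)
        by_source[source] += tokens
        key = " ".join(text.split()).lower()  # normalise whitespace and case
        if key in seen:
            duplicate_tokens += tokens  # a repeat of a segment already in the context
        seen.add(key)
    total = sum(by_source.values())
    if total == 0:
        return {"total_tokens": 0, "share_by_source": {}, "duplicate_share": 0.0}
    return {
        "total_tokens": total,
        "share_by_source": {source: n / total for source, n in by_source.items()},
        "duplicate_share": duplicate_tokens / total,
    }


segments = [
    ("system", "a" * 40),
    ("retrieval", "b" * 80),
    ("retrieval", "b" * 80),
    ("tool", "c" * 200),
]
print(context_report(segments))
```

This only catches exact duplicates after normalisation. Near-duplicate chunks and stale tool results need their own checks, but even this crude report is usually enough to show where the budget is going.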

This is the same empirical loop I’d use tuning a GC or a cache. The substrate is different; the discipline is identical — a finite resource, a non-obvious relationship between how you use it and the outcome, and no substitute for measurement.

Context engineering is a real discipline wearing a slightly silly name. Strip the marketing and it’s this: the context window is a fixed budget, output quality is non-monotonic in how you fill it, and you have four levers — selection, ordering, compression, isolation — to manage it. The teams getting good results from LLM systems in 2025 are the ones treating those levers as an engineering problem with measurement attached, not as a prompt they’ll eventually get right by feel.
