Context Management

15%

Context Management is the smallest CCA-F domain at 15%, but it is the area most often responsible for an agent that works in demo and quietly degrades in production after turn 30.

Context Management is 15% of the CCA-F blueprint — roughly 9 of the 60 official-simulation questions — and the smallest single domain. It is tempting to under-prepare it for that reason. Do not. This is the domain that most directly determines whether a Claude-powered agent stays reliable past the first few minutes of use, and the exam writes its distractors accordingly: most of the wrong answers in this section are things that look efficient but quietly destroy state.

The domain covers the input budget (the context window), what prompt caching actually does, how to compress a long history without throwing away the parts you will later need, when a scratchpad or structured summary is the right answer, and — critically — what to do when the agent runs out of policy guidance mid-conversation. The last point is more reliability than context, but the exam blueprint groups them together because both are about an agent staying truthful and useful across long sessions.

If you only remember three things walking into the exam: the context window is the input budget and max_tokens is the output cap; structured summarisation beats truncation; and when policy is silent, escalate to a human rather than letting Claude infer. Most of the 9 questions in this domain reduce to one of those three rules.

What the exam tests in this domain

  • ·Context window vs. max_tokens — input budget vs. output cap (commonly confused)
  • ·Prompt caching: what gets cached, when it helps, when it doesn't
  • ·Structured summarisation / scratchpad patterns for long conversations
  • ·Token optimisation without losing early critical context
  • ·When to escalate to a human because policy guidance is missing
  • ·Reliability and behaviour drift across long agent sessions

Key concepts

Context window vs. max_tokens

These two parameters are confused in roughly half the wrong answers in this domain. The context window is the total input budget — the maximum number of tokens the model can read in a single request, including the system prompt, conversation history, tool definitions, tool results, and the current user message. max_tokens is a separate, much smaller cap on the model's output for that one response. Raising max_tokens does not extend the context window; it only lets Claude write a longer reply. If a conversation is bumping up against the input limit, max_tokens is irrelevant. The fix lives on the input side: prune, summarise, or extract structured state. Any exam option that proposes 'increase max_tokens to fit more history' is wrong by construction.

Prompt caching — what it actually does

Prompt caching reduces cost and latency by reusing previously computed prefix tokens. It is most effective when a large, stable prefix — a system prompt, a policy document, a tool schema, a long set of examples — is sent on every request and only the tail varies. It does not extend the context window, it does not summarise anything, and it does not help when the prefix changes every turn. A long, churning conversation history is exactly the case where caching does not save you. If the exam asks 'how do I handle a 200-turn customer-support session?', prompt caching is not the answer; structured summarisation is. Use caching to cheapen repeated stable prefixes, not to rescue a context-window problem.

Structured summarisation and the scratchpad pattern

The correct way to extend a long conversation is not to truncate the last N turns — that throws away the original problem statement, the customer's name, the decisions you have already committed to, and anything else that lives early in the history. The correct pattern is to extract the durable facts (who the customer is, what they want, what you have tried, what has been promised, what is still open) into a compact structured summary, and use that summary as the standing context for subsequent turns. This is often called a scratchpad: a small, authoritative state object that survives turn-by-turn pruning. It costs you a summarisation step but it preserves the early context that naive recency-based truncation destroys.

Escalation when policy guidance is missing

Reliability across long sessions is partly about token budgets and partly about the agent knowing the edge of its authority. When the agent hits a scenario the policy does not cover — a fraud-flagged account, an undefined refund category, a regulated decision — the correct behaviour is to escalate to a human and hand over the context it has already gathered. It is not to invent a policy, infer one from similar cases, lie about the system being down, or proceed on vibes. The exam tests this directly: any option that has Claude 'reason about what the policy probably is' for a sensitive decision is wrong. Escalation is the safe default whenever the agent cannot cite an authoritative rule.

Reliability across long sessions

Long agent sessions degrade in predictable ways: the model loses track of early instructions, contradicts itself, repeats already-completed tool calls, or hallucinates facts that were never in the input. Most of these failures trace back to context hygiene — a bloated history, a summary that dropped a key fact, a missing scratchpad, or no termination condition on a loop that keeps adding turns. The fix is upstream: a deliberate summarisation strategy, a structured state object the agent re-reads each turn, and explicit policy for what to do when the agent is uncertain (almost always: stop and escalate). Reliability is a context-management problem before it is a model-quality problem.

Common traps and distractor patterns

Trap 1 — 'Increase max_tokens to extend the context window'

This is the single most common distractor in this domain. max_tokens caps the output of a single response. It has nothing to do with how much history the model can read. If an option offers 'raise max_tokens' as a fix for a long conversation, it is wrong. The input side is fixed by the model's context window; the only levers you have are pruning, structured summarisation, and external memory.

Trap 2 — 'Summarise only the most recent N turns and discard the rest'

Recency-based truncation loses the early context that is usually the most important: the original problem statement, the customer's identity, the commitments already made. The correct pattern extracts durable facts from the entire history into a structured summary, not just the tail. Any option that keeps only the last few turns is throwing away the bits the agent needs most.

Trap 3 — 'Ask Claude to infer the missing policy'

When the agent encounters a sensitive decision with no policy guidance — fraud flags, regulated transactions, anything irreversible — the wrong move is to have Claude reason out a policy from similar examples. The right move is to escalate to a human with the context attached. Distractors that frame inference as 'using Claude's judgement' are wrong by design.

Trap 4 — 'Start the conversation over and ask the user to repeat'

Restarting a long support session is sometimes proposed as a clean way to reset context. On the exam it is almost always wrong: it discards everything the agent learned, frustrates the user, and signals an architectural failure. Structured summarisation gives you the reset benefit without the data loss.

Sample questions with explanations

  1. Question 1

    A customer support agent has been running for 50 turns and is approaching the context window limit. Which strategy best preserves the ability to continue the conversation?

    • A.Increase the max_tokens parameter to extend the context window
    • B.Extract key facts from the conversation into a structured summary and use it as the context for the next turn✓ Correct
    • C.Start a new conversation from scratch and ask the customer to repeat their issue
    • D.Summarize only the most recent 5 turns and discard the rest
    Why B is correct: Extracting structured facts from the conversation history (e.g., customer name, issue type, steps already taken, decisions made) into a compact summary preserves the essential context while dramatically reducing token usage. This scratchpad pattern allows the agent to continue working without losing important state. Starting over loses context and frustrates customers. max_tokens controls output length, not context window size. Summarizing only recent turns risks losing critical early context like the original issue description.
  2. Question 2

    Your customer support agent is unable to process a refund because the customer's account is flagged for fraud review. The agent has no policy guidance for this scenario. What is the correct behavior?

    • A.Tell the customer the system is unavailable and end the session
    • B.Attempt the refund anyway since the customer seems legitimate based on conversation context
    • C.Escalate to a human agent, explaining the flag and the customer's request✓ Correct
    • D.Ask Claude to infer the correct policy based on similar scenarios it has seen
    Why C is correct: When an agent encounters a policy gap (no guidance for fraud-flagged accounts), the correct behavior is to escalate to a human. The escalation should include context: what the customer wants, what the agent found, and why it cannot proceed autonomously. Attempting the refund ignores the fraud flag and could cause harm. Claiming the system is unavailable is deceptive. Asking Claude to infer policy for sensitive financial decisions is a reliability anti-pattern — escalation exists precisely for situations where autonomous action is inappropriate.
  3. Question 3

    What is 'context management' in the context of working with language models?

    • A.The practice of strategically organizing, prioritizing, and controlling the information included in a model's input to maximize relevance and minimize noise✓ Correct
    • B.A database management technique for storing conversation history
    • C.Managing the historical context of a conversation for legal compliance
    • D.The process of translating content between different languages
    Why A is correct: Context management is the practice of strategically organizing, prioritizing, and controlling what information enters a model's input window. It involves selecting the most relevant context, structuring it effectively, managing context window limits, and ensuring the model has the right information at the right time without being overwhelmed by irrelevant data.

How to study this domain

Treat every Context Management question as either an input-budget question, a state-preservation question, or an escalation question. Read the scenario, decide which of the three it is, and then evaluate options against that lens. Input-budget questions: anything mentioning max_tokens as a fix for history length is wrong. State-preservation questions: anything that drops early turns is wrong; structured summary is the answer. Escalation questions: any option that has Claude invent or infer policy for a sensitive decision is wrong.

Memorise the exact distinction between context window (input budget, fixed by the model) and max_tokens (output cap, set per request). Memorise what prompt caching does (cheapens repeated stable prefixes) and what it does not do (extend the window, summarise history). Practice 15–20 questions in this domain; that is enough given the weight. Spend the time you save on Agentic Architecture, which carries nearly twice as many questions.

Frequently asked questions

How many questions on the CCA-F exam are from Context Management?
Context Management is 15% of the blueprint — the smallest single domain. On the 60-question Official Simulation that maps to roughly 9 questions.
What is the difference between the context window and max_tokens?
The context window is the input budget — the total number of tokens the model can read in one request, including system prompt, history, tools, and the new message. max_tokens is the cap on the model's output for that one response. Raising max_tokens does not give you more room for history; it only lets Claude write a longer reply.
Does prompt caching extend the context window?
No. Prompt caching reuses computation for a stable prefix you send on every request, which lowers cost and latency. It does not increase how much history the model can read, and it does not help if your prefix changes every turn. It is a cost lever, not a capacity lever.
Why is summarising only the most recent turns the wrong answer?
Because the most important context usually lives in the early turns — the customer's identity, the original problem, commitments already made. Recency truncation throws that away. Extract durable facts from the whole conversation into a structured summary instead.
When should an agent escalate to a human?
Whenever it hits a decision the policy does not cover, especially for anything sensitive, irreversible, or regulated — fraud flags, undefined refund categories, compliance edge cases. Hand over the context you have gathered. Do not let Claude invent a policy from similar examples.
How much study time should I spend on this domain?
Proportionally less than Agentic Architecture or Prompt Engineering. 15–20 focused practice questions are usually enough to lock in the input-budget, state-preservation, and escalation patterns. Spend the saved time on the larger domains.

Practice Context Management questions

$24.99 lifetime access. 1,000+ scenario-based questions across all 5 domains, adaptive difficulty, written explanations, 7-day refund.

Get Lifetime Access — $24.99

Or try 15 free questions first, or see 10 free sample questions.

Other CCA-F domains