Claude API Context Windows and Tokens: Limits and Management

What counts against the context window, what happens past the limit, and how to manage tokens with counting, max_tokens, and compaction.

In the Claude API, the context window is the total capacity a model can handle in one request. Your input — system prompt, conversation history, attached documents, tool results — and the model’s generated output all count against the same limit. This guide covers what goes into the context window, what happens when you exceed it, and how to plan and manage tokens, based on the official docs. (As of June 2026. Per-model limits and behavior may change — see the official context windows docs.)

Context window — input and output share one capacity Input tokens system prompt · history · documents · tool results Output tokens the generated response Total = context window (200K standard, up to 1M tokens) - As a conversation grows, every prior turn stacks up as input, shrinking what remains. - With extended thinking, previous thinking blocks are auto-stripped and do not consume later turns. - Exact per-model limits: see the official model comparison table (may change).

What counts against the window

Tokens in a request are not just the reply text. The system prompt, the accumulated conversation history (every prior user and assistant turn), attached documents and images, tool definitions, and tool results all count as input tokens — and the output the model generates must fit inside the same limit. The standard context window is 200K tokens, with support for up to 1M tokens (varies by model and conditions — check the official model comparison table). In multi-turn conversations and agentic workflows, tool results accumulate at every step, so the real thing to manage is usually cumulative growth rather than a single request.

One exception: with extended thinking, thinking tokens are billed as output within max_tokens, but previous turns’ thinking blocks are automatically stripped from the context calculation by the API. You do not remove them yourself, and they do not eat your conversation capacity.

What happens when you exceed the limit

Per the official docs, behavior differs by model generation. On Claude 4.5 models and newer, the API accepts a request even if input + max_tokens exceeds the window; if generation actually reaches the limit, it stops with stop_reason: "model_context_window_exceeded". Earlier models returned a validation error instead (a beta header opts into the newer behavior). If the response is cut off by max_tokens, you get stop_reason: "max_tokens". In both cases, check stop_reason in your code and handle it — warn the user, continue generation, and so on. For error handling in general, see the errors and rate limits guide.

Three tools for planning and managing tokens 1. Token counting API Count input tokens before sending the request Same format incl. system prompt, tools, images, and PDFs Prevents overflows and surprise costs 2. max_tokens Caps the output length When reached, response ends with stop_reason: "max_tokens" Output rate limits (OTPM) count actual generation — no penalty for a high cap 3. Compaction & editing The primary strategy for long conversations and agents Server-side compaction summarizes history; context editing clears tool results and thinking blocks Count before sending -> cap the output -> manage accumulation

Token counting API — count before you send

The starting point for limit management is the token counting API. It accepts the same structured input as message creation (including system prompts, tools, images, and PDFs) and returns the total input token count. Call it before the real request to avoid overflows and unintended long-context costs. Since tokens are cost, pair this with the cost optimization guide (prompt caching and batches).

max_tokens and rate limits

max_tokens caps output length. Per the official rate limits docs, output-tokens-per-minute (OTPM) limits are evaluated in real time on actually generated tokens; the max_tokens value itself does not factor into OTPM. So setting it generously to avoid truncation carries no rate-limit downside.

Long conversations: use compaction

As a conversation approaches the limit, the official docs recommend server-side compaction (summarizing the history) as the primary strategy, with context editing for finer control such as clearing tool results and thinking blocks. For app-side tips on long chats in claude.ai, see keeping context in long conversations. For choosing and pinning model IDs, see model IDs and versioning.

The limits here (200K, 1M) and behaviors reflect the official documentation as of June 2026; they vary by model and plan and may change. For exact per-model context sizes, see the official docs and the model comparison table. This site is not an official Anthropic site.

Keep reading

Have a question or want to share how you use Claude?

Join the community to share tips with other users, or explore more guides.