Guide
Stop guessing max_tokens. Calculate it from your actual input size.
Claude rejects requests when input tokens + max_tokens > context limit. Measure input first, then set max_tokens to what actually fits.
Anthropic count_tokens endpoint · Native message format · Free preflight calls
The problem
The error is deterministic — retrying will not help
You set max_tokens to 8192 because you want a long response. Your conversation history grew across twelve agent steps. Claude returns invalid_request_error: input tokens + max_tokens exceeds the context window.
This is not a rate limit. The same request fails every time until you reduce either the input or max_tokens. Developers using Vercel AI SDK reported this exact failure when inputs plus maxTokens exceeded 200K.
- Retrying without changing the payload adds latency and log noise
- Blindly lowering max_tokens may leave too little room for useful output
- Extended thinking adds budget_tokens that must stay below max_tokens
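Because the failure is deterministic, retry logic can tell it apart from transient errors. The sketch below is a simplified policy, not the behavior of any SDK. `shouldRetry` is a hypothetical helper, and it relies only on the HTTP status codes Anthropic documents: 400 for invalid_request_error, 429 for rate_limit_error, and 5xx (including 529 overloaded_error) for server-side problems.

```typescript
// Hypothetical helper: can resending the SAME payload ever succeed?
function shouldRetry(status: number): boolean {
  // 429: rate limited, a later attempt with the same payload can succeed.
  if (status === 429) return true
  // 5xx (including 529 overloaded): transient server-side conditions.
  if (status >= 500) return true
  // 400 and other 4xx: the payload itself is the problem, so fix it instead.
  return false
}

// A context overflow arrives as a 400, so it is never retried as-is.
const retryOverflow = shouldRetry(400) // false
const retryRateLimit = shouldRetry(429) // true
```

A 400 from this policy is the signal to shrink the input or lower max_tokens before sending again.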
The solution
Measure input, then derive max_tokens
The Anthropic count_tokens endpoint accepts messages, system, and tools. With mode: 'endpoint', tokens-usage forwards messages and system to it. Tool blocks already in your message history are counted when countAssistantTools is true.
tokens-usage wraps this endpoint with a unified API. Count your input, subtract a safety margin, and set max_tokens to the remainder, capped at the model's maximum output tokens.
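import { countTokens } from 'tokens-usage'

const CONTEXT = 200_000
const SAFETY = 5_000
const MIN_OUTPUT = 1_024
const DESIRED_OUTPUT = 8_192 // ceiling; keep it at or below the model's output limit

const { tokens: inputTokens } = await countTokens({
  provider: 'anthropic',
  model: 'claude-sonnet-4-20250514',
  content: messages,
  system: 'Be concise.',
})

// Room left for output after reserving the input and the safety margin.
const remaining = CONTEXT - inputTokens - SAFETY
if (remaining < MIN_OUTPUT) {
  // Forcing max_tokens up to MIN_OUTPUT here would overflow the window again.
  throw new Error('Not enough room for output; compact the history first')
}
const maxTokens = Math.min(DESIRED_OUTPUT, remaining)
How it works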
Three steps to a safe preflight
1. Count input tokens with count_tokens
Pass your messages array as content and system as a top-level field. The Anthropic endpoint returns input_tokens without generating a response.
2. Calculate safe max_tokens
Subtract input tokens and a safety margin from the model context window. Never set max_tokens to the model default without checking input size.
3. Send messages.create with derived max_tokens
Use the calculated value in your API call or pass it to AI SDK streamText as maxTokens.
Implementation
Production-ready code
Dynamic max_tokens guard for Claude agent loops
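import { countTokens } from 'tokens-usage'
import Anthropic from '@anthropic-ai/sdk'

const CONTEXT_BY_MODEL: Record<string, number> = {
  'claude-sonnet-4-20250514': 200_000,
  'claude-opus-4-20250514': 200_000,
}

// Create the client once and reuse it across agent steps.
const client = new Anthropic()

async function createWithDynamicMaxTokens(
  model: string,
  messages: Anthropic.MessageParam[],
  system?: string,
  desiredOutput = 8_192, // ceiling; keep it at or below the model's output limit
) {
  const context = CONTEXT_BY_MODEL[model] ?? 200_000
  const safety = 5_000
  const minOutput = 1_024

  const { tokens: inputTokens, price } = await countTokens({
    provider: 'anthropic',
    model,
    content: messages,
    system,
    mode: 'endpoint',
  })

  // Room left for output after reserving the input and the safety margin.
  const remaining = context - inputTokens - safety
  if (remaining < minOutput) {
    // Flooring at minOutput here would overflow the context window again.
    throw new Error(`Only ${remaining} tokens left for output; compact the history first`)
  }
  const maxTokens = Math.min(desiredOutput, remaining)

  console.log(`Input: ${inputTokens}, max_tokens: ${maxTokens}`)
  if (price) console.log(`Estimated input cost: $${price.usd}`)

  return client.messages.create({
    model,
    max_tokens: maxTokens,
    system,
    messages,
  })
}
Deep dive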
The formula
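safeMaxTokens = contextWindow - inputTokens - safetyMargin. Cap the result at the model's maximum output tokens. If it falls below your minimum useful output (1,024 tokens is a reasonable floor), reduce the input instead of forcing max_tokens back up, because a forced floor can push the request over the context window again. For extended thinking, ensure max_tokens > thinking.budget_tokens.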
| Parameter | Recommended value | Why |
|---|---|---|
| Safety margin | 5,000–10,000 tokens | Buffer for count_tokens estimate drift |
| Minimum max_tokens | 1,024 | Avoid truncated one-line responses |
| Extended thinking | budget_tokens < max_tokens | Required by Anthropic API |
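The formula and the table values can be worked through as a small pure function. This is a sketch under stated assumptions. `safeMaxTokens` and its option names are hypothetical, and the 64,000 output cap is an illustrative figure, so check the documented output limit for the model you use.

```typescript
type Budget =
  | { ok: true; maxTokens: number }
  | { ok: false; shortfall: number }

interface BudgetOptions {
  contextWindow: number
  inputTokens: number
  safetyMargin?: number // table recommends 5,000 to 10,000
  minOutput?: number // table recommends 1,024
  outputCap?: number // assumption: the model's maximum output tokens
}

// Hypothetical helper implementing contextWindow - inputTokens - safetyMargin.
function safeMaxTokens(opts: BudgetOptions): Budget {
  const { contextWindow, inputTokens } = opts
  const safetyMargin = opts.safetyMargin ?? 5_000
  const minOutput = opts.minOutput ?? 1_024
  const outputCap = opts.outputCap ?? 64_000

  const remaining = contextWindow - inputTokens - safetyMargin
  if (remaining < minOutput) {
    // Not enough room: report how many input tokens must be removed.
    return { ok: false, shortfall: minOutput - remaining }
  }
  return { ok: true, maxTokens: Math.min(remaining, outputCap) }
}

// 200,000 - 150,000 - 5,000 = 45,000 tokens available for output.
const roomy = safeMaxTokens({ contextWindow: 200_000, inputTokens: 150_000 })
// 200,000 - 196,000 - 5,000 = -1,000, so 2,024 input tokens must go.
const tight = safeMaxTokens({ contextWindow: 200_000, inputTokens: 196_000 })
```

Returning a shortfall instead of a forced minimum keeps the decision to compact in the caller's hands.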
Extended thinking and max_tokens
When thinking is enabled, budget_tokens is carved out of max_tokens. If you set them equal, the API rejects the request, and a budget that sits just below max_tokens leaves almost no room for the user-visible answer. Count input tokens first, then allocate max_tokens with room for both thinking budget and final output.
- max_tokens must be strictly greater than thinking.budget_tokens
- Thinking tokens bill as output tokens on Sonnet and Opus
- Extended thinking support depends on the model: Claude 3.5 Haiku does not support it, so check the model documentation before enabling it
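One way to check these constraints before sending is sketched below. `planThinking` is a hypothetical helper, and `available` is assumed to be the output room already computed from the input count, the safety margin, and the model's output cap. The 1,024 minimum thinking budget is the value Anthropic documents.

```typescript
type ThinkingPlan =
  | { ok: true; maxTokens: number; budgetTokens: number; visibleTokens: number }
  | { ok: false; reason: string }

// Hypothetical helper: split the available output room between thinking
// and the visible answer, refusing plans the API would reject.
function planThinking(
  available: number,
  budgetTokens: number,
  minVisible = 1_024,
): ThinkingPlan {
  if (budgetTokens < 1_024) {
    return { ok: false, reason: 'budget_tokens is below the 1,024 minimum' }
  }
  const needed = budgetTokens + minVisible
  if (available < needed) {
    return { ok: false, reason: `need ${needed} output tokens, only ${available} fit` }
  }
  // max_tokens stays strictly greater than budget_tokens because minVisible > 0.
  return {
    ok: true,
    maxTokens: available,
    budgetTokens,
    visibleTokens: available - budgetTokens,
  }
}

// 45,000 available with a 10,000 thinking budget leaves 35,000 for the answer.
const plan = planThinking(45_000, 10_000)
```

When the plan fails, lowering the thinking budget or compacting the input are both reasonable next steps.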
When to compact instead of shrinking max_tokens
If safeMaxTokens drops below your minimum useful output (e.g. 2,048 for code generation), do not just send a tiny max_tokens. Compact conversation history — summarize older turns, drop tool results, or split the task across multiple calls.
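A minimal compaction sketch follows. It uses the simplest of the options above, dropping the oldest turns, rather than summarizing them. `compactToFit`, `Turn`, and the injected `count` function are all hypothetical. In a real loop `count` would be backed by a count_tokens call, and tool_use and tool_result pairs would need to be dropped together.

```typescript
type Turn = { role: 'user' | 'assistant'; content: string }

// Hypothetical helper: drop the oldest turns until the input fits the budget.
function compactToFit(
  turns: Turn[],
  inputBudget: number,
  count: (turns: Turn[]) => number,
  keepLast = 2,
): Turn[] {
  let kept = turns.slice()
  while (kept.length > keepLast && count(kept) > inputBudget) {
    kept = kept.slice(1)
    // Keep the conversation starting on a user turn.
    while (kept.length > 1 && kept[0]?.role !== 'user') {
      kept = kept.slice(1)
    }
  }
  return kept
}

// Stand-in counter for illustration only: one "token" per character.
const charCount = (turns: Turn[]) =>
  turns.reduce((sum, t) => sum + t.content.length, 0)

const history: Turn[] = [
  { role: 'user', content: 'a'.repeat(100) },
  { role: 'assistant', content: 'b'.repeat(100) },
  { role: 'user', content: 'c'.repeat(10) },
  { role: 'assistant', content: 'd'.repeat(10) },
  { role: 'user', content: 'e'.repeat(10) },
]
// The two oldest turns are dropped; the remaining three total 30 characters.
const compacted = compactToFit(history, 50, charCount)
```

If the result still exceeds the budget after compaction, splitting the task across calls is the remaining option.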
FAQ
Common questions
Does count_tokens cost money?
Anthropic count_tokens requests are free. Their rate limits are separate from those for message creation, so preflight checks do not consume your message quota.
Is the count_tokens result exact?
Anthropic documents it as an estimate. Actual input tokens may differ slightly. Use a safety margin of 5,000–10,000 tokens when deriving max_tokens.
What error message indicates this problem?
Look for invalid_request_error mentioning input tokens and max_tokens exceeding the context window, or "Input is too long for requested model".
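To route this error to compaction instead of a retry, the error body can be matched as sketched here. `isContextOverflow` is a hypothetical helper, and its patterns are taken from the messages quoted above. The exact wording returned by the API may vary, so treat the string checks as an assumption to verify against real responses.

```typescript
type ApiError = { type: string; message: string }

// Hypothetical matcher for the context-overflow failure.
function isContextOverflow(err: ApiError): boolean {
  if (err.type !== 'invalid_request_error') return false
  const msg = err.message.toLowerCase()
  // Pattern 1: message names both max_tokens and the context window.
  const overBudget = msg.includes('max_tokens') && msg.includes('context')
  // Pattern 2: the input alone is already too large.
  const inputTooLong = msg.includes('input is too long')
  return overBudget || inputTooLong
}

const overflow = isContextOverflow({
  type: 'invalid_request_error',
  message: 'input tokens + max_tokens exceeds the context window',
})
```

Other invalid_request_error cases, such as a malformed messages array, fall through and can be handled separately.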
Does this work with tool definitions in the payload?
tokens-usage counts tool_use and tool_result blocks in your messages array. Tool definitions passed separately to messages.create are supported by Anthropic count_tokens natively but are not yet forwarded by tokens-usage — budget them via Anthropic SDK count_tokens or response.usage.
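One way to budget tool definitions is to count the same request with and without them and take the difference. The sketch below keeps the counter injectable so it runs without network access. `toolOverhead`, `CountRequest`, and `CountFn` are hypothetical names, and the assumption is that a thin wrapper around the Anthropic SDK's `client.messages.countTokens`, which returns `input_tokens`, is passed in as `count`.

```typescript
// Minimal request shape for illustration; the real endpoint accepts more fields.
type CountRequest = {
  model: string
  messages: unknown[]
  system?: string
  tools?: unknown[]
}
type CountFn = (req: CountRequest) => Promise<{ input_tokens: number }>

// Hypothetical helper: input tokens added by the tool definitions alone.
async function toolOverhead(
  count: CountFn,
  base: CountRequest,
  tools: unknown[],
): Promise<number> {
  const [withTools, bare] = await Promise.all([
    count({ ...base, tools }),
    count(base),
  ])
  return withTools.input_tokens - bare.input_tokens
}

// Stand-in counter for illustration only: 10 per message, 100 per tool.
const fakeCount: CountFn = async (req) => ({
  input_tokens: 10 * req.messages.length + 100 * (req.tools?.length ?? 0),
})
```

The overhead can then be added to the tokens-usage input count before deriving max_tokens.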
Can I use this with Vercel AI SDK maxTokens?
Yes. Calculate maxTokens with tokens-usage preflight, then pass it to streamText or generateText. See our AI SDK guide for the full agent loop pattern.
Related guides
AI SDK streamText preflight
Agent loops grow fast. Count messages and tool history before streamText — not after onFinish.
tiktoken vs OpenAI API
tiktoken.encode(prompt) ignores message framing. That gap causes context_length_exceeded errors in production.
Tool schema token overhead
Function schemas add 800–1,500 input tokens per turn. Measure tool overhead before your agent loop scales.
AI SDK compaction triggers
Your AI SDK agent loop grows every step. This guide tells you when to compact — before safeMaxTokens hits zero and streamText returns a 400.
Start counting before you send
Add tokens-usage to your stack today. Source-available license — see LICENSE.md for terms.
npm install tokens-usage