tokens-usage

Guide

Stop guessing max_tokens. Calculate it from your actual input size.

Claude rejects requests when input tokens + max_tokens > context limit. Measure input first, then set max_tokens to what actually fits.

Anthropic count_tokens endpoint · Native message format · Free preflight calls

The problem

The error is deterministic — retrying will not help

You set max_tokens to 8192 because you want a long response. Your conversation history grew across twelve agent steps. Claude returns invalid_request_error: input tokens + max_tokens exceeds the context window.

This is not a rate limit. The same request fails every time until you reduce either the input or max_tokens. Developers using Vercel AI SDK reported this exact failure when inputs plus maxTokens exceeded 200K.

  • Retrying without changing the payload wastes latency and logs noise
  • Blindly lowering max_tokens may leave too little room for useful output
  • Extended thinking adds budget_tokens that must stay below max_tokens
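One way to act on this is to classify the failure before any retry logic runs. The sketch below is an illustration, not part of tokens-usage: `isContextOverflow` and `shouldRetry` are hypothetical helpers, and the message patterns are assumptions based on the wording quoted in this guide, so adjust them to the error text you actually see in your logs.

```typescript
// Hypothetical shape of an API error; real SDK error classes carry more fields.
interface ApiErrorLike {
  status?: number
  message?: string
}

// Assumed message fragments; match them against your own logs before relying on them.
const OVERFLOW_PATTERNS = [/max_tokens/i, /context (window|limit)/i, /too long/i]

// True when the payload must change (smaller input or max_tokens) before resending.
function isContextOverflow(err: ApiErrorLike): boolean {
  if (err.status !== 400) return false // invalid_request_error is returned as HTTP 400
  const message = err.message ?? ''
  return OVERFLOW_PATTERNS.some((pattern) => pattern.test(message))
}

// Rate limits (429) and server-side errors (5xx) may succeed on retry; overflow never does.
function shouldRetry(err: ApiErrorLike): boolean {
  if (isContextOverflow(err)) return false
  const status = err.status ?? 0
  return status === 429 || status >= 500
}
```

Routing overflow errors to a compaction or budgeting step, rather than to a backoff loop, keeps the retry path reserved for failures that can actually clear on their own.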

The solution

Measure input, then derive max_tokens

Anthropic's count_tokens endpoint accepts messages, system, and tools, and returns the input token count without generating a response. tokens-usage wraps this endpoint with a unified API: with mode: 'endpoint' it forwards messages and system, and it counts tool blocks already in your message history when countAssistantTools is true.

Count your input, subtract a safety margin, and set max_tokens to the remainder.

typescript
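import { countTokens } from 'tokens-usage'

const CONTEXT = 200_000   // context window for this model
const MAX_OUTPUT = 64_000 // assumed output cap for this model; check the model docs
const SAFETY = 5_000      // buffer for count_tokens estimate drift
const MIN_OUTPUT = 1_024  // below this, compact the history instead

// Preflight: count input tokens without generating a response
const { tokens: inputTokens } = await countTokens({
  provider: 'anthropic',
  model: 'claude-sonnet-4-20250514',
  content: messages,
  system: 'Be concise.',
})

// Room left for output, capped at the model's own output limit
const maxTokens = Math.min(MAX_OUTPUT, CONTEXT - inputTokens - SAFETY)

if (maxTokens < MIN_OUTPUT) {
  // Forcing max_tokens up to MIN_OUTPUT would eat the safety margin or overflow the window
  throw new Error(`Only ${maxTokens} output tokens fit; compact the history first`)
}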

How it works

Three steps to a safe preflight

  1. Count input tokens with count_tokens

    Pass your messages array as content and system as a top-level field. The Anthropic endpoint returns input_tokens without generating a response.

  2. Calculate safe max_tokens

    Subtract input tokens and a safety margin from the model context window. Never set max_tokens to the model default without checking input size.

  3. Send messages.create with derived max_tokens

    Use the calculated value in your API call or pass it to AI SDK streamText as maxTokens.

Implementation

Production-ready code

Dynamic max_tokens guard for Claude agent loops

typescript
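import { countTokens } from 'tokens-usage'
import Anthropic from '@anthropic-ai/sdk'

// Context window and assumed output cap per model; verify both against the model docs
const LIMITS_BY_MODEL: Record<string, { context: number; maxOutput: number }> = {
  'claude-sonnet-4-20250514': { context: 200_000, maxOutput: 64_000 },
  'claude-opus-4-20250514': { context: 200_000, maxOutput: 32_000 },
}
const DEFAULT_LIMITS = { context: 200_000, maxOutput: 32_000 }
const SAFETY = 5_000
const MIN_OUTPUT = 1_024

// Reuse one client across agent steps
const client = new Anthropic()

async function createWithDynamicMaxTokens(
  model: string,
  messages: Anthropic.MessageParam[],
  system?: string,
) {
  const { context, maxOutput } = LIMITS_BY_MODEL[model] ?? DEFAULT_LIMITS

  // Preflight: count input tokens via the Anthropic endpoint
  const { tokens: inputTokens, price } = await countTokens({
    provider: 'anthropic',
    model,
    content: messages,
    system,
    mode: 'endpoint',
  })

  // Room left after input and safety margin, capped at the model's output limit
  const maxTokens = Math.min(maxOutput, context - inputTokens - SAFETY)
  if (maxTokens < MIN_OUTPUT) {
    // Forcing max_tokens up here risks overflowing the window; the caller should compact
    throw new Error(`Only ${maxTokens} output tokens fit for ${model}; compact the history`)
  }

  console.log(`Input: ${inputTokens}, max_tokens: ${maxTokens}`)
  if (price) console.log(`Estimated input cost: $${price.usd}`)

  // Note: very large max_tokens values may require a streaming request; see the SDK docs
  return client.messages.create({
    model,
    max_tokens: maxTokens,
    system,
    messages,
  })
}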

Deep dive

The formula

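safeMaxTokens = min(modelOutputCap, contextWindow - inputTokens - safetyMargin). The API also rejects a max_tokens above the model's own output limit, so cap the result there. Responses need at least about 1,024 tokens to be meaningful; if the result falls below that, reduce the input rather than forcing max_tokens up. For extended thinking, ensure max_tokens > thinking.budget_tokens.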

Parameter            Recommended value             Why
Safety margin        5,000–10,000 tokens           Buffer for count_tokens estimate drift
Minimum max_tokens   1,024                         Avoid truncated one-line responses
Extended thinking    budget_tokens < max_tokens    Required by Anthropic API
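To make the arithmetic concrete, here is a small sketch of the formula as a pure function. The name `deriveMaxTokens`, the option names, and the optional `outputCap` parameter are illustrative assumptions, not part of tokens-usage. It returns `null` when the remainder falls below the minimum, so the caller decides what to do next instead of sending a request that cannot fit.

```typescript
interface BudgetOptions {
  contextWindow: number // model context window, e.g. 200_000
  inputTokens: number   // result of the preflight count
  safetyMargin: number  // buffer for estimate drift
  minOutput: number     // smallest max_tokens still worth sending
  outputCap?: number    // assumed per-model output limit, if known
}

function deriveMaxTokens(opts: BudgetOptions): number | null {
  const remaining = opts.contextWindow - opts.inputTokens - opts.safetyMargin
  const capped = Math.min(remaining, opts.outputCap ?? Infinity)
  return capped >= opts.minOutput ? capped : null
}

// 200,000 - 150,000 - 5,000 = 45,000 tokens left for output
const roomy = deriveMaxTokens({
  contextWindow: 200_000,
  inputTokens: 150_000,
  safetyMargin: 5_000,
  minOutput: 1_024,
})

// 200,000 - 194,500 - 5,000 = 500, which is below the 1,024 minimum
const tight = deriveMaxTokens({
  contextWindow: 200_000,
  inputTokens: 194_500,
  safetyMargin: 5_000,
  minOutput: 1_024,
})

console.log(roomy, tight) // 45000 null
```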

Extended thinking and max_tokens

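When thinking is enabled, budget_tokens is carved out of max_tokens. The API rejects requests where budget_tokens is equal to or greater than max_tokens, and if the two are close, thinking can use up most of the allowance and leave little room for the user-visible answer. Count input tokens first, then allocate max_tokens with room for both the thinking budget and the final output.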

  • max_tokens must be strictly greater than thinking.budget_tokens
  • Thinking tokens bill as output tokens on Sonnet and Opus
  • Older Haiku models such as Claude 3.5 Haiku do not support extended thinking; check the model's documentation before enabling it
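The constraint between the two values can be checked before sending. The following is a minimal sketch under stated assumptions: `planThinking` and `minVisibleOutput` are hypothetical names, and the 1,024-token floor on budget_tokens is assumed from Anthropic's extended thinking documentation, so verify it for your model.

```typescript
interface ThinkingPlan {
  maxTokens: number
  budgetTokens: number
}

const MIN_THINKING_BUDGET = 1_024 // assumed minimum budget_tokens; check the docs for your model

// Splits the available output room between thinking and the visible answer.
// Returns null when both cannot fit, so the caller can disable thinking or compact.
function planThinking(
  availableOutput: number,  // derived max_tokens from the preflight step
  desiredBudget: number,    // thinking budget you would like to grant
  minVisibleOutput: number, // tokens reserved for the final answer
): ThinkingPlan | null {
  // Reserve at least one token so max_tokens stays strictly above budget_tokens
  const reserve = Math.max(1, minVisibleOutput)
  const budgetTokens = Math.min(desiredBudget, availableOutput - reserve)
  if (budgetTokens < MIN_THINKING_BUDGET) return null
  return { maxTokens: availableOutput, budgetTokens }
}
```

With 45,000 tokens available, a desired budget of 16,000, and 2,048 reserved for the answer, the plan keeps the full 16,000-token budget. With only 4,000 available, the budget shrinks to 1,952, and with 2,500 available the function returns `null` because thinking and a useful answer no longer fit together.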

When to compact instead of shrinking max_tokens

If safeMaxTokens drops below your minimum useful output (e.g. 2,048 for code generation), do not just send a tiny max_tokens. Compact conversation history — summarize older turns, drop tool results, or split the task across multiple calls.
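A rough sketch of the "drop older turns" option follows. It is not the tokens-usage API: `compactHistory` is a hypothetical helper, and the `count` callback stands in for whatever preflight counter you use (it is async because a real counter calls the endpoint). The sketch also assumes the conversation should start with a user turn, and it uses plain string content; histories with tool_use and tool_result blocks need those pairs kept together.

```typescript
interface Turn {
  role: 'user' | 'assistant'
  content: string
}

// Stand-in for a preflight counter such as a count_tokens call
type Counter = (turns: Turn[]) => Promise<number>

// Returns the longest suffix of the history that starts with a user turn and
// fits the input budget, or null when even the last user turn does not fit.
async function compactHistory(
  turns: Turn[],
  inputBudget: number, // context window minus safety margin minus desired output
  count: Counter,
): Promise<Turn[] | null> {
  for (let start = 0; start < turns.length; start++) {
    // Only user turns are valid starting points for the trimmed conversation
    if (turns[start]?.role !== 'user') continue
    const candidate = turns.slice(start)
    // Each candidate costs one counter call; cache or estimate if that is too slow
    if ((await count(candidate)) <= inputBudget) return candidate
  }
  return null // nothing fits: summarize or split the task instead
}
```

Dropping turns loses information, so in an agent loop it is usually worth replacing the dropped prefix with a short summary message rather than discarding it outright.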

FAQ

Common questions

Does count_tokens cost money?

Anthropic count_tokens requests are free. They have independent rate limits from message creation, so preflight checks do not consume your message quota.

Is the count_tokens result exact?

Anthropic documents it as an estimate. Actual input tokens may differ slightly. Use a safety margin of 5,000–10,000 tokens when deriving max_tokens.

What error message indicates this problem?

Look for invalid_request_error mentioning input tokens and max_tokens exceeding the context window, or "Input is too long for requested model".

Does this work with tool definitions in the payload?

tokens-usage counts tool_use and tool_result blocks in your messages array. Tool definitions passed separately to messages.create are supported by Anthropic count_tokens natively but are not yet forwarded by tokens-usage — budget them via Anthropic SDK count_tokens or response.usage.
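For tool definitions, one option is to call Anthropic's count_tokens endpoint directly. The sketch below assumes the public REST endpoint `POST /v1/messages/count_tokens`, the `x-api-key` and `anthropic-version` headers, and an `input_tokens` field in the response. `countWithTools` and the `FetchLike` type are hypothetical names, and the fetch function is injected so the helper has no dependencies.

```typescript
// Minimal fetch signature; the global fetch in Node 18+ should satisfy it
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>

interface CountRequest {
  model: string
  messages: unknown[]
  system?: string
  tools?: unknown[] // tool definitions, same shape as passed to messages.create
}

async function countWithTools(
  request: CountRequest,
  apiKey: string,
  fetchImpl: FetchLike,
): Promise<number> {
  const response = await fetchImpl('https://api.anthropic.com/v1/messages/count_tokens', {
    method: 'POST',
    headers: {
      'x-api-key': apiKey,
      'anthropic-version': '2023-06-01',
      'content-type': 'application/json',
    },
    body: JSON.stringify(request), // undefined fields such as system are omitted
  })
  if (!response.ok) throw new Error(`count_tokens failed with status ${response.status}`)
  const data = (await response.json()) as { input_tokens?: number }
  if (typeof data.input_tokens !== 'number') throw new Error('unexpected count_tokens response')
  return data.input_tokens
}
```

The returned count includes the tool definitions, so it can be used as the input size when deriving max_tokens for requests that pass tools separately.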

Can I use this with Vercel AI SDK maxTokens?

Yes. Calculate maxTokens with tokens-usage preflight, then pass it to streamText or generateText. See our AI SDK guide for the full agent loop pattern.

Start counting before you send

Add tokens-usage to your stack today. Source-available license — see LICENSE.md for terms.

npm install tokens-usage