Guide
Your tiktoken count is wrong — and it is costing you 400 errors
Raw text tokenization ignores message framing, role tags, tool schemas, and reply priming. Here is what actually gets billed — and how to count it before you send.
Provider endpoint + tiktoken fallback · Chat message overhead handled · Multi-provider support
The problem
The prompt fits — until the API rejects it
You ran tiktoken.encode on your prompt text. The count looked safe. You sent the request and got context_length_exceeded. Even on a tiny test prompt the gap is visible: the usage log shows 14 tokens where tiktoken said 8.
This is not a tiktoken bug. Raw encoding counts text tokens only. OpenAI chat completions add structural overhead: role labels and delimiters on every message, plus tool schemas and a reply primer once per request.
- tiktoken.encode(prompt_text) ignores per-message framing — the real count is almost always higher
- Calculating max_tokens from a raw count overestimates the room left for the completion, so input + max_tokens can exceed the context window
- Tool definitions injected into the request are invisible to simple encode()
The solution
Count the request shape, not just the text
OpenAI publishes a num_tokens_from_messages formula in their cookbook: add tokens_per_message for each message, encode each field, adjust by tokens_per_name when a name field is present, then add the reply priming tokens once per request.
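The formula can be sketched in a few lines. This is an illustrative version, not the tokens-usage implementation: the constants are the values the cookbook reports for recent chat models (3 per message, +1 for a name, 3 for reply priming; one older gpt-3.5-turbo snapshot used 4 and -1), and `encodeLength` and `whitespaceLength` are hypothetical stand-ins for a real tokenizer such as tiktoken.

```typescript
// Sketch of the cookbook-style counting formula for chat messages.
type ChatMessage = { role: string; content: string; name?: string };

// Values reported in the OpenAI cookbook for recent chat models (assumed here).
const TOKENS_PER_MESSAGE = 3;
const TOKENS_PER_NAME = 1;
const REPLY_PRIMING_TOKENS = 3;

function numTokensFromMessages(
  messages: ChatMessage[],
  encodeLength: (text: string) => number, // hypothetical: returns the token count of a string
): number {
  let total = 0;
  for (const message of messages) {
    total += TOKENS_PER_MESSAGE;            // framing around every message
    total += encodeLength(message.role);    // role label
    total += encodeLength(message.content); // the text itself
    if (message.name !== undefined) {
      total += encodeLength(message.name) + TOKENS_PER_NAME;
    }
  }
  return total + REPLY_PRIMING_TOKENS;      // added once per request
}

// Crude stand-in so the sketch runs without a tokenizer dependency.
// It counts whitespace-separated words, which is NOT a real token count.
const whitespaceLength = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;
```

In real use, pass a function that wraps your tokenizer's encode call and returns the length of the result. The structure of the sum is the point: the framing terms do not depend on how long the text is.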
tokens-usage handles this for you. With mode auto, it tries the provider endpoint when you have an API key, then falls back to tiktoken with the correct message structure for OpenAI models.
How it works
Three steps to a safe preflight
1. Stop counting raw prompt strings
Never pass a concatenated string to tiktoken.encode when using chat completions. Always count the messages array structure.
2. Budget tool definitions separately
Function schemas are billed as input tokens but are not part of the messages array. Compare response.usage.prompt_tokens with and without tools on identical messages.
3. Apply a safety margin
Even with correct counting, reserve 10–15% of the context window for output and tokenizer drift.
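Steps 2 and 3 come down to simple arithmetic once you have the numbers. The sketch below assumes you have already captured two `usage` objects from otherwise identical requests, one sent with tools and one without. The helper names `toolOverhead` and `safeMaxTokens` are made up for this example and are not part of tokens-usage.

```typescript
// Only the field this sketch needs from an OpenAI usage object.
type Usage = { prompt_tokens: number };

// Step 2: tool overhead is the difference between two otherwise identical requests.
function toolOverhead(
  withTools: Usage,
  withoutTools: Usage,
  toolCount: number,
): { total: number; perTool: number } {
  const total = withTools.prompt_tokens - withoutTools.prompt_tokens;
  return { total, perTool: toolCount > 0 ? total / toolCount : 0 };
}

// Step 3: reserve a fraction of the window, then subtract everything billed as input.
function safeMaxTokens(
  contextLimit: number,
  messageTokens: number,
  toolTokens: number,
  margin = 0.15, // 10-15% per the guidance above
): number {
  const reserved = Math.ceil(contextLimit * margin);
  const usable = contextLimit - reserved;
  // Never return a negative budget; zero means "compact before sending".
  return Math.max(0, usable - messageTokens - toolTokens);
}
```

Measure the overhead once per tool set and cache it. It changes only when the schemas change, so there is no need to repeat the comparison on every request.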
Implementation
Production-ready code
Accurate OpenAI preflight with tokens-usage
import { countTokens } from 'tokens-usage'
const messages = [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Summarize this quarterly report.' },
]
// Wrong: tiktoken.encode(JSON.stringify(messages)) — misses per-message overhead
const { tokens, method, estimated } = await countTokens({
provider: 'openai',
model: 'gpt-4o',
content: messages,
mode: 'auto',
apiKey: process.env.OPENAI_API_KEY,
})
// Tool definitions add ~100–200 tokens each on top — measure via response.usage
// with and without tools on identical messages, or see the tool schema guide.
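const MODEL_LIMIT = 128_000
// Assumption: gpt-4o also caps completion length (16,384 tokens on recent snapshots); verify for your model.
const MAX_OUTPUT = 16_384
const SAFE_BUDGET = Math.max(0, Math.min(MAX_OUTPUT, Math.floor(MODEL_LIMIT * 0.85) - tokens))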
console.log(`${tokens} input tokens (${method}, estimated: ${estimated})`)
console.log(`Safe max_tokens for completion: ${SAFE_BUDGET}`)
Deep dive
What tiktoken misses vs what the API bills
The gap is structural, not proportional. Short prompts see the largest relative error. Long prompts still miss tool overhead entirely.
| Element | tiktoken.encode(text) | OpenAI chat API |
|---|---|---|
| Message role + delimiters | No | ~3–4 tokens per message |
| Reply priming | No | +3 tokens |
| Tool / function schemas | No | 100–200 tokens per tool |
| Multi-byte Unicode / emojis | Approximate | Exact server-side |
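To see why the gap is structural rather than proportional, it helps to run the numbers. The sketch below uses the per-message and reply-priming figures from the table, ignores role-label and tool tokens for simplicity, and uses made-up helper names.

```typescript
// Fixed framing cost for a request: per-message framing plus one reply primer.
function framingOverhead(
  messageCount: number,
  tokensPerMessage = 3,
  replyPriming = 3,
): number {
  return messageCount * tokensPerMessage + replyPriming;
}

// Fraction of the true input count that a raw text count misses.
function relativeUndercount(textTokens: number, overhead: number): number {
  return overhead / (textTokens + overhead);
}

// Same two-message request shape, very different text lengths.
const overheadForTwoMessages = framingOverhead(2);
const shortPromptMiss = relativeUndercount(8, overheadForTwoMessages);
const longPromptMiss = relativeUndercount(8000, overheadForTwoMessages);

console.log(`overhead: ${overheadForTwoMessages} tokens`);
console.log(`short prompt miss: ${(shortPromptMiss * 100).toFixed(1)}%`);
console.log(`long prompt miss: ${(longPromptMiss * 100).toFixed(2)}%`);
```

The overhead is the same in both cases. Only its share of the total changes, which is why a percentage-based fudge factor tuned on long prompts fails on short ones.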
Three dead ends to avoid
These patterns appear frequently in production codebases and consistently produce wrong budgets.
- Dead end 1: tiktoken.encode(prompt_text) — ignores all message framing
- Dead end 2: context_budget = model_max - tiktoken_count — overestimates available output tokens
- Dead end 3: Using response.usage only — accurate but requires a paid API call that may fail
Calibrate with response.usage after the fact
For billing reconciliation, always log response.usage.prompt_tokens from real API responses. Compare against your preflight estimate and track drift over time. tokens-usage gives you the preflight; usage gives you ground truth.
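One way to track drift is to record each preflight estimate next to the prompt_tokens value the API reports, then watch the aggregate. The `DriftTracker` class below is a hypothetical, in-memory sketch. A production system would more likely write these samples to its metrics pipeline.

```typescript
type DriftSample = { estimated: number; actual: number };

// Positive drift means the preflight undercounted relative to the billed total.
function driftRatio(sample: DriftSample): number {
  if (sample.actual === 0) return 0;
  return (sample.actual - sample.estimated) / sample.actual;
}

class DriftTracker {
  private samples: DriftSample[] = [];

  // `actual` is response.usage.prompt_tokens from a real API response.
  record(estimated: number, actual: number): void {
    this.samples.push({ estimated, actual });
  }

  // Average drift across all recorded requests.
  meanDrift(): number {
    if (this.samples.length === 0) return 0;
    const sum = this.samples.reduce((acc, s) => acc + driftRatio(s), 0);
    return sum / this.samples.length;
  }

  // Worst undercount seen so far; a floor for your safety margin.
  maxUndercount(): number {
    return this.samples.reduce((max, s) => Math.max(max, driftRatio(s)), 0);
  }
}
```

If the worst undercount approaches your safety margin, widen the margin or check whether tool schemas or a model change have shifted the overhead.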
FAQ
Common questions
Why does tiktoken show fewer tokens than the API usage field?
Chat completions add per-message overhead, tool schema tokens, and reply priming that tiktoken.encode on raw text does not include. The API usage field reflects the full request the server processed.
Should I use cl100k_base or o200k_base?
Match the encoding to your model. The GPT-4o family uses o200k_base; GPT-4 and GPT-3.5 Turbo use cl100k_base. tokens-usage selects the correct encoding when falling back to local mode for OpenAI models.
Does tokens-usage call the OpenAI API to count tokens?
With mode auto or endpoint and an API key, tokens-usage calls OpenAI /v1/responses/input_tokens. On failure it falls back to local tiktoken with proper message formatting. Use mode local for fully offline counting.
How much safety margin should I use?
Reserve 10–15% of the context window as headroom, then subtract your preflight count from what is left. Set max_tokens to that remaining budget, not the model maximum.
What about streaming responses without usage data?
With OpenAI streaming, usage arrives only in the final chunk, and only if you request it with stream_options: { include_usage: true }. Preflight counting with tokens-usage lets you make context management decisions without waiting for that data.
Related guides
Tool schema token overhead
Function schemas add 800–1,500 input tokens per turn. Measure tool overhead before your agent loop scales.
AI SDK streamText preflight
Agent loops grow fast. Count messages and tool history before streamText — not after onFinish.
Claude dynamic max_tokens
Claude rejects requests when input + max_tokens > context limit. Measure first, then set max_tokens.
AI SDK compaction triggers
Your AI SDK agent loop grows every step. This guide tells you when to compact — before safeMaxTokens hits zero and streamText returns a 400.
Start counting before you send
Add tokens-usage to your stack today. Source-available license — see LICENSE.md for terms.
npm install tokens-usage