tokens-usage

Guide

Your tiktoken count is wrong — and it is costing you 400 errors

Raw text tokenization ignores message framing, role tags, tool schemas, and reply priming. Here is what actually gets billed — and how to count it before you send.

Provider endpoint + tiktoken fallback · Chat message overhead handled · Multi-provider support

The problem

The prompt fits — until the API rejects it

You ran tiktoken.encode on your prompt text and the count looked safe, but the request came back with context_length_exceeded. Even on a tiny request the gap is visible: the usage log shows 14 tokens where tiktoken said 8.

This is not a tiktoken bug. Raw encoding counts text tokens only. OpenAI chat completions add structural overhead that raw encoding never sees: role labels and delimiters on every message, plus tool schemas and a reply primer added once per request.

  • tiktoken.encode(prompt_text) ignores per-message framing — the real count is almost always higher
  • Calculating max_tokens from a raw count leaves too little room for completion
  • Tool definitions injected into the request are invisible to simple encode()

The solution

Count the request shape, not just the text

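OpenAI publishes a num_tokens_from_messages formula in their cookbook: add tokens_per_message for each message, encode each field, add tokens_per_name when a name field is present (only the legacy gpt-3.5-turbo-0301 snapshot used a negative adjustment), then add the reply priming tokens.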
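As a rough illustration, the formula can be sketched in a few lines. This is a minimal sketch, not the cookbook's code or the tokens-usage implementation. The constants (3 per message, 1 per name, 3 for reply priming) are the values the cookbook gives for recent chat models and may differ for other snapshots. The encoder is injected as a parameter so the sketch has no dependencies, and wordEncode is a made-up stand-in that only exists to make the example runnable.

```typescript
// Sketch of a cookbook-style chat token count.
// `encode` is injected; in practice pass the encoder for your model's encoding.
type ChatMessage = { role: string; content: string; name?: string }
type Encode = (text: string) => number[]

// Assumed constants for recent chat models; verify against the cookbook for your model.
const TOKENS_PER_MESSAGE = 3
const TOKENS_PER_NAME = 1
const REPLY_PRIMING = 3

function numTokensFromMessages(messages: ChatMessage[], encode: Encode): number {
  let total = 0
  for (const message of messages) {
    total += TOKENS_PER_MESSAGE
    total += encode(message.role).length
    total += encode(message.content).length
    if (message.name !== undefined) {
      total += encode(message.name).length + TOKENS_PER_NAME
    }
  }
  // Every reply is primed with a few tokens that open the assistant turn.
  return total + REPLY_PRIMING
}

// Stand-in encoder for demonstration only: one "token" per whitespace-separated word.
const wordEncode: Encode = (text) => text.split(/\s+/).filter(Boolean).map((_, i) => i)
```

Swapping wordEncode for a real tokenizer's encode function is the only change needed to get meaningful numbers; the framing arithmetic stays the same.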

tokens-usage handles this for you. With mode auto, it tries the provider endpoint when you have an API key, then falls back to tiktoken with the correct message structure for OpenAI models.

How it works

Three steps to a safe preflight

  1. Stop counting raw prompt strings

    Never pass a concatenated string to tiktoken.encode when using chat completions. Always count the messages array structure.

  2. Budget tool definitions separately

    Function schemas are billed as input tokens but are not part of the messages array. To measure them, compare response.usage.prompt_tokens with and without tools on identical messages.

  3. Apply a safety margin

    Even with correct counting, keep 10–15% of the context window as headroom for tokenizer drift and unmeasured overhead, and size max_tokens from what remains.
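Steps 2 and 3 come down to simple arithmetic, sketched below. Both helpers and their parameter names are hypothetical; they are not part of tokens-usage or the OpenAI SDK. The two prompt_tokens readings are assumed to come from real responses on identical messages, one sent with tools and one without.

```typescript
// Hypothetical helpers for illustration only.

// Step 2: tool overhead is the difference between two prompt_tokens readings
// taken on identical messages, once with tools attached and once without.
function toolOverhead(promptTokensWithTools: number, promptTokensWithoutTools: number): number {
  return Math.max(0, promptTokensWithTools - promptTokensWithoutTools)
}

// Step 3: keep a share of the window as headroom, then subtract every known input cost.
function completionBudget(opts: {
  contextWindow: number
  messageTokens: number
  toolTokens: number
  marginPercent: number // e.g. 15 for 15% headroom
}): number {
  // Integer percent avoids floating-point surprises when flooring.
  const usable = Math.floor((opts.contextWindow * (100 - opts.marginPercent)) / 100)
  return Math.max(0, usable - opts.messageTokens - opts.toolTokens)
}
```

With a 128,000-token window, 1,200 message tokens, 150 tool tokens, and a 15% margin, the budget works out to 108,800 − 1,350 = 107,450 tokens. The result is an upper bound on max_tokens; the model's own output limit may be lower.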

Implementation

Production-ready code

Accurate OpenAI preflight with tokens-usage

typescript
import { countTokens } from 'tokens-usage'

const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Summarize this quarterly report.' },
]

// Wrong: tiktoken.encode(JSON.stringify(messages)) — misses per-message overhead

const { tokens, method, estimated } = await countTokens({
  provider: 'openai',
  model: 'gpt-4o',
  content: messages,
  mode: 'auto',
  apiKey: process.env.OPENAI_API_KEY,
})

// Tool definitions add ~100–200 tokens each on top — measure via response.usage
// with and without tools on identical messages, or see the tool schema guide.

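const MODEL_LIMIT = 128_000 // gpt-4o context window
// Keep 15% of the window as headroom; clamp at zero so an oversized prompt
// never yields a negative budget. Also cap this at the model's output-token limit.
const SAFE_BUDGET = Math.max(0, Math.floor(MODEL_LIMIT * 0.85) - tokens)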

console.log(`${tokens} input tokens (${method}, estimated: ${estimated})`)
console.log(`Safe max_tokens for completion: ${SAFE_BUDGET}`)

Deep dive

What tiktoken misses vs what the API bills

The gap is structural, not proportional. Short prompts see the largest relative error. Long prompts still miss tool overhead entirely.

Element                      | tiktoken.encode(text) | OpenAI chat API
---------------------------- | --------------------- | ------------------------
Message role + delimiters    | No                    | ~3–4 tokens per message
Reply priming                | No                    | +3 tokens
Tool / function schemas      | No                    | 100–200 tokens per tool
Multi-byte Unicode / emojis  | Approximate           | Exact server-side
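To see why the gap is structural rather than proportional, the short calculation below applies the table's figures (about 3 tokens per message plus 3 for reply priming) to a short and a long prompt. The function name is made up for this example, and tool schemas are left out; they would add a further fixed cost per tool.

```typescript
// Share of the real input count that a raw text count misses,
// using the per-message and reply-priming figures from the table.
function missedShare(rawTokens: number, messageCount: number): number {
  const overhead = 3 * messageCount + 3
  return overhead / (rawTokens + overhead)
}

const shortPrompt = missedShare(8, 1) // 6 of 14 tokens missed
const longPrompt = missedShare(8_000, 1) // 6 of 8,006 tokens missed

console.log(`short: ${(shortPrompt * 100).toFixed(1)}%, long: ${(longPrompt * 100).toFixed(2)}%`)
```

The same six tokens of framing account for roughly 43% of the 14-token request but well under 0.1% of the 8,006-token one, which matches the 8 versus 14 example at the top of this guide.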

Three dead ends to avoid

These patterns appear frequently in production codebases and consistently produce wrong budgets.

  • Dead end 1: tiktoken.encode(prompt_text) — ignores all message framing
  • Dead end 2: context_budget = model_max - tiktoken_count — overestimates available output tokens
  • Dead end 3: Using response.usage only — accurate but requires a paid API call that may fail

Calibrate with response.usage after the fact

For billing reconciliation, always log response.usage.prompt_tokens from real API responses. Compare against your preflight estimate and track drift over time. tokens-usage gives you the preflight; usage gives you ground truth.
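One way to track that drift is to log each preflight estimate next to the prompt_tokens value the API reports and summarize the difference. The sketch below is an assumption about how such a log might look; the record shape and function name are hypothetical and not part of tokens-usage.

```typescript
// Hypothetical drift summary over logged (preflight, actual) pairs.
type DriftSample = { preflight: number; actual: number }

function summarizeDrift(samples: DriftSample[]): { meanDelta: number; maxUndercount: number } {
  if (samples.length === 0) return { meanDelta: 0, maxUndercount: 0 }
  let sum = 0
  let maxUndercount = 0
  for (const { preflight, actual } of samples) {
    const delta = actual - preflight // positive means the preflight undercounted
    sum += delta
    maxUndercount = Math.max(maxUndercount, delta)
  }
  return { meanDelta: sum / samples.length, maxUndercount }
}
```

The largest undercount is the more useful number for sizing a safety margin, since a single oversized request is enough to trigger a rejection.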

FAQ

Common questions

Why does tiktoken show fewer tokens than the API usage field?

Chat completions add per-message overhead, tool schema tokens, and reply priming that tiktoken.encode on raw text does not include. The API usage field reflects the full request the server processed.

Should I use cl100k_base or o200k_base?

Match the encoding to your model. The GPT-4o family uses o200k_base; GPT-4 and GPT-3.5 Turbo use cl100k_base. tokens-usage selects the correct encoding when falling back to local mode for OpenAI models.
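If you count locally without tokens-usage, the choice can be made explicit with a small lookup. This is an illustrative sketch covering only the model families named in this answer, not how tokens-usage selects encodings internally. It returns undefined for anything it does not recognize rather than guessing.

```typescript
// Illustrative lookup only; extend it for the models you actually call.
type EncodingName = 'o200k_base' | 'cl100k_base'

function encodingForModel(model: string): EncodingName | undefined {
  if (model.startsWith('gpt-4o')) return 'o200k_base'
  // Match 'gpt-4' exactly or with a dash suffix so unrelated 'gpt-4…' names fall through.
  if (model === 'gpt-4' || model.startsWith('gpt-4-')) return 'cl100k_base'
  if (model.startsWith('gpt-3.5-turbo')) return 'cl100k_base'
  return undefined // unknown model: look it up rather than guessing
}
```

Matching on 'gpt-4-' instead of the bare 'gpt-4' prefix is deliberate: a loose prefix match would silently assign the older encoding to any newer model whose name happens to start the same way.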

Does tokens-usage call the OpenAI API to count tokens?

With mode auto or endpoint and an API key, tokens-usage calls OpenAI /v1/responses/input_tokens. On failure it falls back to local tiktoken with proper message formatting. Use mode local for fully offline counting.

How much safety margin should I use?

Keep 10–15% of the context window as headroom, subtract your preflight count from what is left, and set max_tokens to that remaining budget rather than the model maximum.

What about streaming responses without usage data?

With OpenAI streaming, usage arrives only in the final chunk, and only when the request sets stream_options.include_usage. Preflight counting with tokens-usage lets you make context management decisions without waiting for that data.

Start counting before you send

Add tokens-usage to your stack today. Source-available license — see LICENSE.md for terms.

npm install tokens-usage