Guide
Your agent sends 1,500 tokens of tools before the user says a word
Function schemas are re-sent every turn. Measure tool token overhead across your agent loop — and decide what to trim before costs compound.
countAssistantTools toggle · Tool history vs schema measurement · Per-turn overhead visibility
The problem
Hidden overhead on every agent turn
Function definitions are sent with every request and serialized into the prompt, so they are billed as input tokens on each call. A moderate schema with five tools adds 800–1,200 tokens per request. Ten well-described tools run about 1,500 tokens per turn before the user message.
In a ten-step agent loop, that is 15,000 tokens of pure tool overhead, even if the user content never grows. Most cost dashboards attribute spend to "prompts" without separating out the tool schema tax.
- Tool schemas do not benefit from conversation truncation — they repeat every turn
- Verbose parameter descriptions inflate counts linearly with tool count
- Agent loops multiply overhead: 5 tools × 200 tokens × 10 turns = 10,000 tokens
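The multiplication in the last bullet can be written as a small helper. This is a minimal sketch: `loopSchemaOverhead` is a hypothetical name, and the per-tool token figure is an input you would measure yourself rather than a fixed constant.

```typescript
// Hypothetical helper: cumulative schema tokens re-sent across an agent loop.
// tokensPerTool is an average you measure; it is not a fixed constant.
function loopSchemaOverhead(
  toolCount: number,
  tokensPerTool: number,
  turns: number,
): number {
  const perTurn = toolCount * tokensPerTool; // schema tokens sent on each request
  return perTurn * turns; // the same schema is repeated on every turn
}

console.log(loopSchemaOverhead(5, 200, 10)); // 10000
```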
The solution
Measure schema and history separately
Tool definitions and tool history are different costs. Use response.usage.prompt_tokens to measure schema overhead: compare identical messages with and without tools. Use countAssistantTools to toggle whether the tool-call and tool-result blocks already in your messages array are included in the count.
Run both measurements before deploying a new tool set. If overhead exceeds 15% of your context budget, consolidate tools, shorten descriptions, or pass a subset per route.
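The 15% rule of thumb can be turned into a preflight check. The sketch below assumes you already have a measured schema token count and a context budget of your own choosing; `checkSchemaOverhead` and its default threshold parameter are illustrative names, not part of tokens-usage.

```typescript
interface OverheadCheck {
  ratio: number; // schema tokens as a fraction of the context budget
  overBudget: boolean; // true when the ratio exceeds the threshold
}

// Hypothetical preflight check: flags a tool set whose schema overhead
// takes more than `threshold` of the context budget (default 15%).
function checkSchemaOverhead(
  schemaTokens: number,
  contextBudget: number,
  threshold = 0.15,
): OverheadCheck {
  if (contextBudget <= 0) {
    throw new RangeError('contextBudget must be positive');
  }
  const ratio = schemaTokens / contextBudget;
  return { ratio, overBudget: ratio > threshold };
}

// 1,500 schema tokens against an 8,000-token budget is 18.75%: over the line.
console.log(checkSchemaOverhead(1500, 8000));
```

A check like this fits naturally in CI next to the tool definitions, so a new tool that pushes the set over budget fails before it ships.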
How it works
Three steps to a safe preflight
1. Measure tool schema overhead via the API
Send identical messages with and without your tools array. The prompt_tokens delta is your per-request schema tax, typically 100–200 tokens per tool.
2. Measure tool history with countAssistantTools
Pass messages that include function_call or tool_result blocks. Toggle countAssistantTools to see how much prior tool turns add to the count.
3. Optimize before scaling traffic
Shorten descriptions, merge related tools, use tool subsetting per route, or use tool search on gpt-5.4+ for large catalogs.
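Tool subsetting per route can be as simple as a lookup table from route to tool names. The following is a sketch under stated assumptions: the catalog entries, the route names, and `toolsForRoute` are all hypothetical, and the `ToolDef` shape mirrors the Chat Completions function-tool format.

```typescript
// Shape of a Chat Completions function tool definition.
interface ToolDef {
  type: 'function';
  function: {
    name: string;
    description: string;
    parameters: Record<string, unknown>;
  };
}

// Hypothetical catalog; real definitions would carry full parameter schemas.
const catalog: ToolDef[] = [
  { type: 'function', function: { name: 'search_products', description: 'Search the product catalog.', parameters: { type: 'object', properties: {} } } },
  { type: 'function', function: { name: 'get_price', description: 'Get the price of a product.', parameters: { type: 'object', properties: {} } } },
  { type: 'function', function: { name: 'lookup_order', description: 'Look up an order by id.', parameters: { type: 'object', properties: {} } } },
];

// Hypothetical route table: which tools each route is allowed to send.
const routeTools: Record<string, string[]> = {
  shopping: ['search_products', 'get_price'],
  support: ['lookup_order'],
};

// Returns only the tool definitions registered for the given route.
// Unknown routes get an empty list rather than the full catalog.
function toolsForRoute(
  route: string,
  all: ToolDef[],
  routes: Record<string, string[]>,
): ToolDef[] {
  const allowed = new Set(routes[route] ?? []);
  return all.filter((tool) => allowed.has(tool.function.name));
}

console.log(toolsForRoute('shopping', catalog, routeTools).map((t) => t.function.name));
```

Returning an empty list for unknown routes is a deliberate choice here: falling back to the full catalog would quietly reintroduce the overhead the subsetting was meant to remove.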
Implementation
Production-ready code
Measure schema overhead (API) and tool history (countAssistantTools)
import { countTokens } from 'tokens-usage'
import OpenAI from 'openai'
const client = new OpenAI()
const messages = [{ role: 'user', content: 'Find me a laptop' }]
const tools = [/* your function definitions */]
// 1. Tool schema overhead: compare API usage with and without tools
const [withTools, withoutTools] = await Promise.all([
  client.chat.completions.create({ model: 'gpt-4o', messages, tools, max_tokens: 1 }),
  client.chat.completions.create({ model: 'gpt-4o', messages, max_tokens: 1 }),
])
const schemaOverhead =
  withTools.usage!.prompt_tokens - withoutTools.usage!.prompt_tokens
// 2. Tool history overhead: toggle countAssistantTools on message blocks
const history = [
  { role: 'user', content: 'Find me a laptop' },
  { type: 'function_call', call_id: 'c1', name: 'search', arguments: '{}' },
  { type: 'function_call_output', call_id: 'c1', output: '{"results":[]}' },
]
const withHistory = await countTokens({
  provider: 'openai',
  model: 'gpt-4o',
  content: history,
  countAssistantTools: true,
})
const withoutHistory = await countTokens({
  provider: 'openai',
  model: 'gpt-4o',
  content: history,
  countAssistantTools: false,
})
const historyOverhead = withHistory.tokens - withoutHistory.tokens
console.log(`Schema overhead: ${schemaOverhead} tokens/request`)
console.log(`History overhead: ${historyOverhead} tokens in current messages`)
Deep dive
The math on agent loop overhead
At GPT-4o input pricing, 15,000 overhead tokens per session is billed on top of user content, system prompts, and tool results such as web search output. The per-session amount is small, but it recurs on every session you serve.
| Scenario | Tokens per turn | 10-turn total |
|---|---|---|
| 5 tools, moderate schema | ~1,000 | ~10,000 |
| 10 tools, verbose descriptions | ~1,500 | ~15,000 |
| 20 tools (accuracy risk zone) | ~3,000+ | ~30,000+ |
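To turn the table into dollars, multiply the 10-turn totals by your model's input price. The sketch below is illustrative only: `overheadCostUsd` is a hypothetical helper, and `ASSUMED_USD_PER_MILLION` is a placeholder rate you should replace with the current price for your model.

```typescript
// Hypothetical helper: USD cost of a given number of overhead input tokens.
function overheadCostUsd(
  overheadTokens: number,
  usdPerMillionInputTokens: number,
): number {
  return (overheadTokens * usdPerMillionInputTokens) / 1e6;
}

// ASSUMPTION: placeholder input price; check your provider's pricing page.
const ASSUMED_USD_PER_MILLION = 2.5;

// Per-turn figures taken from the table above.
const scenarios = [
  { label: '5 tools, moderate schema', tokensPerTurn: 1000 },
  { label: '10 tools, verbose descriptions', tokensPerTurn: 1500 },
  { label: '20 tools', tokensPerTurn: 3000 },
];

const TURNS = 10;
for (const s of scenarios) {
  const total = s.tokensPerTurn * TURNS;
  const usd = overheadCostUsd(total, ASSUMED_USD_PER_MILLION);
  console.log(`${s.label}: ${total} tokens, about $${usd.toFixed(4)} per session`);
}
```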
Optimization strategies that actually work
Cutting a five-line tool description to one line saves those tokens on every request forever. Optimize once during development, not after the bill arrives.
- Tool subsetting: pass 4–6 relevant tools per route instead of the full catalog
- Consolidation: one tool with an action parameter vs five separate tools
- Tool search (gpt-5.4+): defer definitions until the model queries the index
- Prompt caching: can reduce cost on stable prefixes — tool schemas may not cache depending on request layout
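As an illustration of consolidation, several cart tools could collapse into one definition with an action parameter. This is a sketch with made-up names (`cartTool`, `handleCart`, the `add`/`remove`/`list` actions); whether it saves tokens for your schema is something to confirm with the prompt_tokens measurement.

```typescript
// Hypothetical consolidated tool: one definition instead of
// separate add_to_cart / remove_from_cart / list_cart tools.
const cartTool = {
  type: 'function',
  function: {
    name: 'cart',
    description: 'Manage the shopping cart.',
    parameters: {
      type: 'object',
      properties: {
        action: { type: 'string', enum: ['add', 'remove', 'list'] },
        item_id: { type: 'string', description: 'Required for add and remove.' },
      },
      required: ['action'],
    },
  },
} as const;

// Derive the action union from the schema so handler and schema stay in sync.
type CartAction =
  (typeof cartTool.function.parameters.properties.action.enum)[number];

// Dispatches on the action argument and returns the cart contents afterwards.
function handleCart(
  args: { action: CartAction; item_id?: string },
  cart: Set<string>,
): string[] {
  switch (args.action) {
    case 'add':
      if (!args.item_id) throw new Error('item_id is required for add');
      cart.add(args.item_id);
      break;
    case 'remove':
      if (!args.item_id) throw new Error('item_id is required for remove');
      cart.delete(args.item_id);
      break;
    case 'list':
      break; // nothing to change; fall through to returning the contents
  }
  return Array.from(cart);
}
```

The trade-off is that per-action requirements (such as item_id being needed only for add and remove) move from the schema into the description and the handler, so the handler has to validate them.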
Anthropic tool_use blocks
Anthropic counts tool definitions, tool_use, and tool_result content in count_tokens. Server tool token counts apply to the first sampling call only. tokens-usage handles Anthropic native payloads with the same countTokens interface.
FAQ
Common questions
What does countAssistantTools do?
When true (default), function_call, tool_use, and functionResponse blocks in your messages are included in the count. Set false to exclude tool history. It does not measure tool definition schemas — use response.usage for that.
Are tool definitions cached by OpenAI?
Tool schemas are re-sent on every request and count as input tokens each time. Prompt caching may reduce cost on stable request prefixes, but do not assume schemas are free — measure with response.usage.
How many tools is too many?
Tool-selection accuracy tends to degrade beyond 10–20 tools per request. If overhead exceeds 15% of your context budget, split into sub-agents or use tool search on supported models.
Do tool results count as input tokens on the next turn?
Yes. Tool results become part of conversation history and are billed as input on every subsequent call. Truncate large tool outputs before appending to messages.
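A minimal sketch of truncating a tool output before it enters history is shown below. `truncateToolOutput` is a hypothetical helper, and it cuts by character count as a rough stand-in for tokens; a token-accurate cut would need a tokenizer.

```typescript
// Hypothetical helper: caps a tool output at maxChars characters and
// appends a marker so the model can tell the result was cut short.
// Character count is only a rough proxy for token count.
function truncateToolOutput(
  output: string,
  maxChars: number,
  marker = '[truncated]',
): string {
  if (maxChars < 0) throw new RangeError('maxChars must be non-negative');
  if (output.length <= maxChars) return output;
  return output.slice(0, maxChars) + marker;
}

const bigResult = JSON.stringify({ results: new Array(500).fill('laptop') });
const safeResult = truncateToolOutput(bigResult, 200);
console.log(safeResult.length); // 211
```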
Can I estimate USD cost for tool-heavy requests?
Yes. tokens-usage returns a price field with USD estimate when the model exists in the pricing table.
Related guides
AI SDK streamText preflight
Agent loops grow fast. Count messages and tool history before streamText — not after onFinish.
tiktoken vs OpenAI API
tiktoken.encode(prompt) ignores message framing. That gap causes context_length_exceeded errors in production.
Claude dynamic max_tokens
Claude rejects requests when input + max_tokens > context limit. Measure first, then set max_tokens.
AI SDK compaction triggers
Your AI SDK agent loop grows every step. This guide tells you when to compact — before safeMaxTokens hits zero and streamText returns a 400.
Start counting before you send
Add tokens-usage to your stack today. Source-available license — see LICENSE.md for terms.
npm install tokens-usage