Guide
Your AI SDK agent loop is out of context. Compact on token triggers.
streamText and prepareStep do not tell you when to cut. Count input tokens before each step, compact when input reaches 80% of the context window or safeMaxTokens drops below your minimum output budget, then re-measure before you stream.
Vercel AI SDK · 80% context trigger · safeMaxTokens guard · measure → compact → re-measure
The problem
You know the count — but not when to cut
Guides 1–4 show how to count tokens before streamText — messages, tool schemas, tiktoken drift, Claude max_tokens. None answer the next question for AI SDK agent loops: when should you compact?
Developers on the Vercel Community report hitting context limits with 150+ messages and large tool results — but prepareStep does not expose usage data, so message-count heuristics become the fallback. That leads to either premature summarization (wasted API calls) or late compaction (400 errors mid-loop).
- Message-count triggers (e.g. compact at 150 messages) ignore that one web-search tool result can exceed 50 turns of chat
- Shrinking max_tokens when input grows leaves too little room for useful output — safeMaxTokens can drop below your minimum before you notice
- Re-compacting every request wastes tokens: the frontend resends full history while prepareStep compacts again on each call
The solution
Token-budget triggers, not message counts
Compact when measured input tokens cross a percentage of your context window — commonly 80% — or when safeMaxTokens falls below your minimum output budget. Count with tokens-usage before every agent step, compact the highest-impact payload first, then re-measure to confirm you are back under budget.
AI SDK v5 has no built-in compaction API yet (Issue #14017 proposes one). Until then, implement compaction in prepareStep or a pre-stream guard using tokens-usage as the decision signal.
How it works
Four steps to a safe preflight
1. Measure input tokens before each step
   Call countTokens with your full messages array at the start of every agent step. This is your pre-compaction baseline, the same count you use for max_tokens budgeting.
2. Evaluate trigger conditions
   Compact if inputTokens >= contextWindow × 0.8, or if safeMaxTokens = contextWindow - inputTokens - safetyMargin drops below MIN_OUTPUT (e.g. 2,048 for code). Either condition means the next turn risks overflow or unusable output.
3. Trim in priority order
   First truncate or drop large tool results (web search, file reads). Then remove or summarize older turns while preserving tool-call/tool-result pairs atomically. The system prompt comes last: shorten it only if you are still over budget.
4. Re-measure after compaction
   Run countTokens again on the compacted messages array. Log the before/after delta. Only call streamText when safeMaxTokens meets your MIN_OUTPUT threshold.
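To see how the two trigger conditions in step 2 interact, here is a small worked sketch. The `Budget` shape, `thresholds`, and `triggerReason` are illustrative names, not part of tokens-usage or the AI SDK, and the 8,192-token budget is a hypothetical small window used only for contrast.

```typescript
// Illustrative budget shape; all names here are assumptions for this sketch.
interface Budget {
  contextWindow: number;
  safetyMargin: number;
  minOutput: number;
  fillPercent: number; // e.g. 80 for an 80% fill trigger
}

// Input sizes at which each trigger starts to fire.
function thresholds(b: Budget) {
  return {
    // Fill trigger fires when inputTokens >= this value.
    fillTriggerAt: (b.contextWindow * b.fillPercent) / 100,
    // Output trigger fires when inputTokens > this value, because
    // contextWindow - inputTokens - safetyMargin < minOutput.
    outputTriggerAt: b.contextWindow - b.safetyMargin - b.minOutput,
  };
}

function triggerReason(inputTokens: number, b: Budget): 'fill' | 'output' | null {
  const t = thresholds(b);
  if (inputTokens >= t.fillTriggerAt) return 'fill';
  if (inputTokens > t.outputTriggerAt) return 'output';
  return null;
}

// The constants used in this guide's implementation.
const guideBudget: Budget = {
  contextWindow: 200_000,
  safetyMargin: 5_000,
  minOutput: 2_048,
  fillPercent: 80,
};

// A hypothetical small window, where the output trigger comes first.
const smallBudget: Budget = {
  contextWindow: 8_192,
  safetyMargin: 500,
  minOutput: 2_048,
  fillPercent: 80,
};

console.log(thresholds(guideBudget)); // { fillTriggerAt: 160000, outputTriggerAt: 192952 }
console.log(triggerReason(6_000, smallBudget)); // 'output'
```

With a 200,000-token window the 80% trigger fires long before the output budget runs out, so the safeMaxTokens check acts as a backstop. On a small window, or with a large MIN_OUTPUT, the output check can fire first, which is why both conditions are evaluated.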
Implementation
Production-ready code
Measure → compact → re-measure guard for AI SDK agent loops
import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { countTokens, type ModelMessage } from 'tokens-usage'

const MODEL = 'claude-sonnet-4-20250514'
const CONTEXT = 200_000
const SAFETY = 5_000
const MIN_OUTPUT = 2_048
const COMPACT_AT = 0.8 // 80% of context window
const KEEP_RECENT = 24 // sliding-window size for non-system messages

// Output room left after the input and a safety margin
function safeMaxTokens(inputTokens: number) {
  return CONTEXT - inputTokens - SAFETY
}

function shouldCompact(inputTokens: number) {
  const atThreshold = inputTokens >= CONTEXT * COMPACT_AT
  const outputTooSmall = safeMaxTokens(inputTokens) < MIN_OUTPUT
  return atThreshold || outputTooSmall
}

// One place for the count call so before/after use identical settings
function measure(messages: ModelMessage[]) {
  return countTokens({ provider: 'anthropic', model: MODEL, content: messages, mode: 'auto' })
}

function compactHistory(messages: ModelMessage[]): ModelMessage[] {
  // 1. Truncate large tool results first (highest ROI)
  const trimmed = messages.map((m) => {
    if (m.role !== 'tool') return m
    const content = typeof m.content === 'string' ? m.content : JSON.stringify(m.content)
    if (content.length <= 4_000) return m
    // Caveat: this replaces the content with a plain string. If your message
    // type requires structured tool-result parts, truncate the text inside each part instead.
    return { ...m, content: content.slice(0, 4_000) + '\n...[truncated]' }
  })
  // 2. Drop oldest non-system turns (keep last N + system)
  const system = trimmed.filter((m) => m.role === 'system')
  const rest = trimmed.filter((m) => m.role !== 'system').slice(-KEEP_RECENT)
  // Never start the window on an orphaned tool result: its tool call was just dropped
  while (rest[0]?.role === 'tool') rest.shift()
  return [...system, ...rest]
}

async function agentStep(messages: ModelMessage[]) {
  let { tokens: inputTokens } = await measure(messages)
  if (shouldCompact(inputTokens)) {
    const before = inputTokens
    messages = compactHistory(messages)
    inputTokens = (await measure(messages)).tokens
    console.log(`Compacted: ${before} → ${inputTokens} tokens`)
  }
  // Fail loudly instead of forcing MIN_OUTPUT into a window that cannot hold it
  const maxOutputTokens = safeMaxTokens(inputTokens)
  if (maxOutputTokens < MIN_OUTPUT) {
    throw new Error(`Only ${maxOutputTokens} output tokens left after compaction; need ${MIN_OUTPUT}`)
  }
  return streamText({
    model: anthropic(MODEL),
    messages,
    // AI SDK v5 option name (v4 called it maxTokens). Cap it at your model's output limit if that is lower.
    maxOutputTokens,
  })
}
Deep dive
Three trigger conditions that actually work
Message-count thresholds (compact at 150 messages) appear in Vercel Community examples but ignore payload variance. Token-budget triggers align with production libraries and provider docs.
| Trigger | When to fire | Source |
|---|---|---|
| 80% context fill | inputTokens >= contextWindow × 0.8 | agentic_context_compaction default trigger_at_percent |
| Output budget exhausted | safeMaxTokens < MIN_OUTPUT | Claude dynamic max_tokens guide pattern |
| Provider threshold | Rendered tokens cross compact_threshold | OpenAI context_management docs |
What to trim first — and what to keep
Compaction is not random deletion. Preserve atomic tool-call/tool-result pairs — breaking them causes API validation errors. The OpenAI Agents context compaction library documents boundary-aware compaction for this reason.
- Priority 1: Large tool results — web search JSON, file reads, API dumps. Claude Cookbook notes discarding detailed tool results while keeping completion summaries.
- Priority 2: Older conversation turns — summarize or drop turns beyond a sliding window. Vercel Community recommends keeping the last ~20 messages after summarization.
- Priority 3: System prompt — shorten only as a last resort. Anthropic documents system tokens as fixed overhead on every request.
- Never trim: The most recent user message, in-flight tool pairs, and any compaction summary checkpoint.
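The priorities above can be sketched as a sliding window that treats a tool call and its results as one unit. The `Msg` shape below is a simplified stand-in for your real message type, and `groupTurns` and `slidingWindow` are hypothetical helpers. The sketch assumes tool results always directly follow the assistant message that requested them.

```typescript
// Simplified message shape for illustration only; not the AI SDK's ModelMessage.
type Role = 'system' | 'user' | 'assistant' | 'tool';
interface Msg {
  role: Role;
  content: string;
}

// Group each non-tool message with the tool results that immediately follow it,
// so a window boundary can never fall between a tool call and its result.
function groupTurns(messages: Msg[]): Msg[][] {
  const groups: Msg[][] = [];
  for (const m of messages) {
    const last = groups[groups.length - 1];
    if (m.role === 'tool' && last !== undefined) last.push(m);
    else groups.push([m]);
  }
  return groups;
}

// Keep system messages, the last `keepGroups` groups, and always the most
// recent user message even if it falls outside the window.
function slidingWindow(messages: Msg[], keepGroups: number): Msg[] {
  const system = messages.filter((m) => m.role === 'system');
  const groups = groupTurns(messages.filter((m) => m.role !== 'system'));

  let lastUser = -1;
  groups.forEach((g, i) => {
    if (g[0]?.role === 'user') lastUser = i;
  });

  let start = Math.max(0, groups.length - Math.max(0, keepGroups));
  if (lastUser >= 0 && lastUser < start) start = lastUser;

  const kept: Msg[] = [];
  for (const g of groups.slice(start)) kept.push(...g);
  return [...system, ...kept];
}
```

Windowing by group rather than by raw message count is the design choice here: the window size becomes approximate, but the output never begins with a tool result whose call was dropped.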
Micro-compact before LLM summarize
contextkit describes a 3-tier strategy used in production: micro-compact (free — trim old tool results without an LLM call), auto-compact (LLM summarization when approaching limit), and a circuit breaker after consecutive failures. Start with free trimming before paying for a summarization call.
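A minimal sketch of that tiering as a decision helper is shown below. It is not contextkit's API: `CompactionPlanner`, the tier names, and the meaning of "failure" (a summarization attempt that errored or did not get back under budget) are assumptions made for illustration.

```typescript
// Which compaction tier to run next. Names are illustrative.
type Tier = 'none' | 'micro' | 'summarize' | 'halt';

class CompactionPlanner {
  private failures = 0;
  private readonly budget: number;
  private readonly maxFailures: number;

  constructor(budget: number, maxFailures = 3) {
    this.budget = budget;
    this.maxFailures = maxFailures;
  }

  // Pick the cheapest tier that could bring the input back under budget.
  nextTier(inputTokens: number, microTried: boolean): Tier {
    if (inputTokens < this.budget) return 'none';
    if (!microTried) return 'micro'; // free: trims old tool results, no LLM call
    if (this.failures >= this.maxFailures) return 'halt'; // circuit breaker is open
    return 'summarize'; // paid: one LLM summarization call
  }

  // Report the outcome of a summarization attempt.
  recordSummarize(succeeded: boolean): void {
    this.failures = succeeded ? 0 : this.failures + 1;
  }
}

// Usage: re-measure after each tier and ask the planner again.
const planner = new CompactionPlanner(160_000, 2);
console.log(planner.nextTier(170_000, false)); // 'micro'
console.log(planner.nextTier(170_000, true)); // 'summarize'
```

A production circuit breaker would usually also reset after a cooldown period; this sketch only resets when a summarization succeeds.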
Provider-side compaction vs client-side
OpenAI supports server-side compaction via context_management with compact_threshold on the Responses API: when the rendered token count crosses the threshold, compaction is triggered in-stream. Anthropic recommends server-side compaction for Opus 4.6; the Claude Agent SDK compaction_control uses a default context_token_threshold of 100,000 tokens.
AI SDK Issue #14017 proposes a provider-agnostic compact() primitive with compactWhen: hasMoreTokensThan(n) for ToolLoopAgent — still open and not merged. Until then, client-side compaction with tokens-usage preflight is the portable pattern across OpenAI, Anthropic, and Google.
- OpenAI example: compact_threshold: 200000 in context_management
- Anthropic SDK default: context_token_threshold: 100000
- AI SDK v5: no built-in compaction — use prepareStep + tokens-usage (Vercel Community, Dec 2025)
Avoid re-compaction on every request
A common pitfall: the frontend sends all messages for UI consistency, prepareStep compacts every time, and summarization runs on every request. Track compaction state (summary + compactedUpToIndex) on the frontend or persist it server-side. Only re-summarize when new content crosses the threshold again.
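One way to track that state is sketched below. The `CompactionState` fields follow the names in the paragraph above; everything else (`ChatMsg`, `applyCompactionState`, `needsNewSummary`) is hypothetical. The sketch assumes the history array excludes the system prompt, that `compactedUpToIndex` is exclusive, and that your provider accepts the summary as a leading system message.

```typescript
// Simplified message shape for illustration only.
interface ChatMsg {
  role: 'system' | 'user' | 'assistant' | 'tool';
  content: string;
}

// Persisted alongside the conversation (frontend state or server-side store).
interface CompactionState {
  summary: string;
  compactedUpToIndex: number; // messages [0, compactedUpToIndex) are covered by `summary`
}

// Rebuild the model-facing view from the full history without re-summarizing.
function applyCompactionState(fullHistory: ChatMsg[], state: CompactionState | null): ChatMsg[] {
  if (state === null || state.compactedUpToIndex <= 0) return fullHistory;
  const summaryMsg: ChatMsg = {
    role: 'system',
    content: `Summary of earlier conversation: ${state.summary}`,
  };
  return [summaryMsg, ...fullHistory.slice(state.compactedUpToIndex)];
}

// Summarize again only when the already-compacted view crosses the trigger.
function needsNewSummary(viewTokens: number, triggerAt: number): boolean {
  return viewTokens >= triggerAt;
}
```

The frontend can keep sending the full history for UI consistency. The server applies the stored state first, counts tokens on the resulting view, and pays for a new summary only when `needsNewSummary` returns true.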
Sources
- Vercel AI SDK — Compaction API proposal (Issue #14017)
- OpenAI — Server-side compaction (context_management)
- Claude Cookbook — Automatic context compaction
- Vercel Community — Conversation compaction with AI SDK v5
- contextkit — 3-tier compaction strategy
- agentic_context_compaction — trigger_at_percent default 0.8
FAQ
Common questions
Should I compact at 80% or wait until safeMaxTokens is zero?
Compact at 80% of your context window (inputTokens >= contextWindow × 0.8) to leave room for the next tool result and output tokens. Also compact immediately if safeMaxTokens drops below your MIN_OUTPUT: waiting until it reaches zero leaves no room for output, so responses come back truncated or the request is rejected.
Why not use message count instead of token count?
Message count ignores payload size. One tool result from a web search or file read can exceed dozens of chat turns. Vercel Community developers compact at 150+ messages, but token-budget triggers (80% fill) match how OpenAI and Anthropic compaction APIs actually work.
Does AI SDK have a built-in compaction API?
Not yet. Issue #14017 proposes compact() and compactWhen for ToolLoopAgent but remains open (April 2026). Vercel Community confirms AI SDK v5 requires manual compaction via prepareStep or frontend state tracking.
What is the measure → compact → re-measure loop?
Count tokens with tokens-usage before compaction (baseline), apply trimming or summarization, then count again to verify you are under budget. Log the delta for observability. Only stream when the post-compaction count satisfies your trigger conditions.
Can I use provider-side compaction instead?
The OpenAI Responses API supports context_management with compact_threshold for server-side compaction, and Anthropic recommends server-side compaction for Opus 4.6. Both are provider-specific. Client-side compaction with tokens-usage works across OpenAI, Anthropic, and Google in AI SDK agent loops.
Will compacting invalidate prompt caching?
Yes. The Issue #14017 discussion notes that modifying past tool calls and results invalidates the provider-side cache. Batch the masking of old tool results and apply it only when a trigger condition is met. Do not compact on every turn if cache efficiency matters for your cost model.
How does this connect to the other tokens-usage guides?
Guides 1–4 teach measurement: preflight counting, tiktoken accuracy, Claude max_tokens, and tool schema overhead. This guide closes the loop: measure → decide → compact → re-measure. Start with count-tokens-vercel-ai-sdk, then apply triggers here.
Related guides
AI SDK streamText preflight
Agent loops grow fast. Count messages and tool history before streamText — not after onFinish.
Claude dynamic max_tokens
Claude rejects requests when input + max_tokens > context limit. Measure first, then set max_tokens.
Tool schema token overhead
Function schemas add 800–1,500 input tokens per turn. Measure tool overhead before your agent loop scales.
tiktoken vs OpenAI API
tiktoken.encode(prompt) ignores message framing. That gap causes context_length_exceeded errors in production.
Start counting before you send
Add tokens-usage to your stack today. Source-available license — see LICENSE.md for terms.
npm install tokens-usage