tokens-usage

Guide

Your AI SDK agent loop is out of context. Compact on token triggers.

streamText and prepareStep do not tell you when to cut. Count input tokens before each step, compact when you hit 80% of the context window or safeMaxTokens drops below your minimum output budget, then re-measure before you stream.

Vercel AI SDK · 80% context trigger · safeMaxTokens guard · measure → compact → re-measure

The problem

You know the count — but not when to cut

Guides 1–4 show how to count tokens before streamText — messages, tool schemas, tiktoken drift, Claude max_tokens. None answer the next question for AI SDK agent loops: when should you compact?

Developers on the Vercel Community report hitting context limits with 150+ messages and large tool results — but prepareStep does not expose usage data, so message-count heuristics become the fallback. That leads to either premature summarization (wasted API calls) or late compaction (400 errors mid-loop).

  • Message-count triggers (e.g. compact at 150 messages) ignore that one web-search tool result can exceed 50 turns of chat
  • Shrinking max_tokens when input grows leaves too little room for useful output — safeMaxTokens can drop below your minimum before you notice
  • Re-compacting every request wastes tokens: the frontend resends full history while prepareStep compacts again on each call
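To make the first bullet concrete, here is a small sketch comparing 50 short chat turns against one large tool result. The roughTokens helper and its 4-characters-per-token ratio are assumptions for illustration only, not how tokens-usage counts, and the payload sizes are made up.

```typescript
// Rough heuristic for illustration only: ~4 characters per token (assumption).
function roughTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// 50 short chat turns of about 200 characters each (hypothetical sizes).
const chatTurns: string[] = Array.from({ length: 50 }, () => 'x'.repeat(200))

// One web-search tool result of about 60,000 characters of JSON (hypothetical size).
const toolResult: string = 'y'.repeat(60_000)

const chatTokens = chatTurns.reduce((sum, turn) => sum + roughTokens(turn), 0)
const toolTokens = roughTokens(toolResult)

// A message-count trigger only sees 51 messages; the token split tells the real story.
console.log({ messages: chatTurns.length + 1, chatTokens, toolTokens })
```

Under these assumptions a 150-message trigger sees 51 messages and does nothing, while a single message carries six times the tokens of the other 50 combined.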

The solution

Token-budget triggers, not message counts

Compact when measured input tokens cross a percentage of your context window — commonly 80% — or when safeMaxTokens falls below your minimum output budget. Count with tokens-usage before every agent step, compact the highest-impact payload first, then re-measure to confirm you are back under budget.

AI SDK v5 has no built-in compaction API yet (Issue #14017 proposes one). Until then, implement compaction in prepareStep or a pre-stream guard using tokens-usage as the decision signal.

How it works

Four steps to a safe preflight

  1. Measure input tokens before each step

    Call countTokens with your full messages array at the start of every agent step. This is your pre-compaction baseline — the same count you use for max_tokens budgeting.

  2. Evaluate trigger conditions

    Compact if inputTokens >= contextWindow × 0.8, or if safeMaxTokens (contextWindow - inputTokens - safetyMargin) drops below MIN_OUTPUT (e.g. 2,048 for code). Either condition means the next turn risks overflow or unusable output.

  3. Trim in priority order

    First truncate or drop large tool results (web search, file reads). Then remove or summarize older turns while preserving tool-call/tool-result pairs atomically. System prompt is last — shorten only if still over budget.

  4. Re-measure after compaction

    Run countTokens again on the compacted messages array. Log before/after delta. Only call streamText when safeMaxTokens meets your MIN_OUTPUT threshold.

Implementation

Production-ready code

Measure → compact → re-measure guard for AI SDK agent loops

typescript
import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
import { countTokens, type ModelMessage } from 'tokens-usage'

const CONTEXT = 200_000
const SAFETY = 5_000
const MIN_OUTPUT = 2_048
const COMPACT_AT = 0.8 // 80% of context window

function safeMaxTokens(inputTokens: number) {
  return CONTEXT - inputTokens - SAFETY
}

function shouldCompact(inputTokens: number) {
  const atThreshold = inputTokens >= CONTEXT * COMPACT_AT
  const outputTooSmall = safeMaxTokens(inputTokens) < MIN_OUTPUT
  return atThreshold || outputTooSmall
}

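function compactHistory(messages: ModelMessage[]): ModelMessage[] {
  // 1. Truncate large tool results first (highest ROI).
  // Note: this flattens structured tool content to a string. Adapt it if your
  // SDK or provider validates the shape of tool-result parts.
  const trimmed = messages.map((m) => {
    if (m.role !== 'tool') return m
    const content = typeof m.content === 'string' ? m.content : JSON.stringify(m.content)
    if (content.length <= 4_000) return m
    return { ...m, content: content.slice(0, 4_000) + '\n...[truncated]' }
  })

  // 2. Drop oldest non-system turns (keep last N + system)
  const system = trimmed.filter((m) => m.role === 'system')
  const rest = trimmed.filter((m) => m.role !== 'system')
  const recent = rest.slice(-24)

  // 3. Never start the window on a tool result whose tool call was just dropped:
  // an orphaned result fails provider validation
  while (recent.length > 0 && recent[0].role === 'tool') recent.shift()
  return [...system, ...recent]
}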

async function agentStep(messages: ModelMessage[]) {
  let { tokens: before } = await countTokens({
    provider: 'anthropic',
    model: 'claude-sonnet-4-20250514',
    content: messages,
    mode: 'auto',
  })

  if (shouldCompact(before)) {
    messages = compactHistory(messages)
    const remeasured = await countTokens({
      provider: 'anthropic',
      model: 'claude-sonnet-4-20250514',
      content: messages,
      mode: 'auto',
    })
    console.log(`Compacted: ${before} → ${remeasured.tokens} tokens`)
    before = remeasured.tokens
  }

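  // Fail fast if compaction did not free enough room. Clamping up to MIN_OUTPUT
  // here would request more output than the context window can hold.
  const budget = safeMaxTokens(before)
  if (budget < MIN_OUTPUT) {
    throw new Error(`Output budget ${budget} is below MIN_OUTPUT (${MIN_OUTPUT}) after compaction`)
  }

  return streamText({
    model: anthropic('claude-sonnet-4-20250514'),
    messages,
    // AI SDK v5 option name (v4 called it maxTokens). Cap this at your model's
    // output limit if that limit is lower than the remaining budget.
    maxOutputTokens: budget,
  })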
}

Deep dive

Three trigger conditions that actually work

Message-count thresholds (compact at 150 messages) appear in Vercel Community examples but ignore payload variance. Token-budget triggers align with production libraries and provider docs.

Trigger | When to fire | Source
80% context fill | inputTokens >= contextWindow × 0.8 | agentic_context_compaction default trigger_at_percent
Output budget exhausted | safeMaxTokens < MIN_OUTPUT | Claude dynamic max_tokens guide pattern
Provider threshold | Rendered tokens cross compact_threshold | OpenAI context_management docs
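The first two rows are client-side checks, and it helps to log which one fired. A minimal sketch follows: the Budget shape, the firedTriggers name, and the small-window numbers are hypothetical, chosen to show that the output-budget trigger can fire before the 80% trigger on small context windows. The provider threshold row runs server-side and is not modeled here.

```typescript
type Trigger = 'context-fill' | 'output-budget'

interface Budget {
  contextWindow: number
  safetyMargin: number
  minOutput: number
  compactAt: number // fraction of the context window, e.g. 0.8
}

// Returns every client-side trigger that fires for a measured input size.
function firedTriggers(inputTokens: number, b: Budget): Trigger[] {
  const fired: Trigger[] = []
  if (inputTokens >= b.contextWindow * b.compactAt) fired.push('context-fill')
  const safeMaxTokens = b.contextWindow - inputTokens - b.safetyMargin
  if (safeMaxTokens < b.minOutput) fired.push('output-budget')
  return fired
}

// Same constants as the guard in this guide.
const large: Budget = { contextWindow: 200_000, safetyMargin: 5_000, minOutput: 2_048, compactAt: 0.8 }
// Hypothetical small-window model with a larger minimum output.
const small: Budget = { contextWindow: 16_000, safetyMargin: 2_000, minOutput: 4_096, compactAt: 0.8 }

console.log(firedTriggers(150_000, large)) // no trigger: 150k < 160k, 45k output left
console.log(firedTriggers(165_000, large)) // context-fill: 165k >= 160k
console.log(firedTriggers(11_000, small)) // output-budget: only 3,000 tokens left, below 4,096
```

With a 200k window the 80% trigger always fires first; with the small window the output budget runs out at 11,000 input tokens, well before the 12,800-token fill threshold. That is why the guard checks both.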

What to trim first — and what to keep

Compaction is not random deletion. Preserve atomic tool-call/tool-result pairs — breaking them causes API validation errors. The OpenAI Agents context compaction library documents boundary-aware compaction for this reason.

  • Priority 1: Large tool results — web search JSON, file reads, API dumps. Claude Cookbook notes discarding detailed tool results while keeping completion summaries.
  • Priority 2: Older conversation turns — summarize or drop turns beyond a sliding window. Vercel Community recommends keeping the last ~20 messages after summarization.
  • Priority 3: System prompt — shorten only as last resort. Anthropic documents system tokens as fixed overhead every request.
  • Never trim: The most recent user message, in-flight tool pairs, and any compaction summary checkpoint.
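A minimal sketch of the Priority 2 sliding window that honors the never-trim rule for tool pairs. The Msg shape, slidingWindow, and hasOrphanedToolResult are hypothetical names for illustration; AI SDK messages store tool calls and results as structured content parts, so adapt the checks to your message type. The sketch assumes tool results immediately follow the assistant message that issued the calls.

```typescript
// Simplified message shape for illustration only.
interface Msg {
  role: 'system' | 'user' | 'assistant' | 'tool'
  text: string
  toolCallIds?: string[] // on assistant messages that issue tool calls
  toolCallId?: string // on tool messages that answer one call
}

// Keep system messages plus roughly the last `keep` others. If the cut lands on
// a tool result, move it earlier so the issuing assistant message is kept too.
function slidingWindow(messages: Msg[], keep: number): Msg[] {
  const system = messages.filter((m) => m.role === 'system')
  const rest = messages.filter((m) => m.role !== 'system')
  let start = Math.max(0, rest.length - keep)
  while (start > 0 && rest[start]?.role === 'tool') start--
  return [...system, ...rest.slice(start)]
}

// True when a tool result has no matching tool call in the same list.
function hasOrphanedToolResult(messages: Msg[]): boolean {
  const issued = new Set<string>()
  for (const m of messages) for (const id of m.toolCallIds ?? []) issued.add(id)
  return messages.some((m) => m.role === 'tool' && !issued.has(m.toolCallId ?? ''))
}

const sample: Msg[] = [
  { role: 'system', text: 'You are a research agent.' },
  { role: 'user', text: 'Find recent papers.' },
  { role: 'assistant', text: '', toolCallIds: ['c1', 'c2'] },
  { role: 'tool', text: 'result 1', toolCallId: 'c1' },
  { role: 'tool', text: 'result 2', toolCallId: 'c2' },
  { role: 'assistant', text: 'Here is what I found.' },
  { role: 'user', text: 'Summarize the first one.' },
]

// Naive "last 3" slicing starts on the result for c2 and orphans it.
const naive = [...sample.slice(0, 1), ...sample.slice(-3)]
const windowed = slidingWindow(sample, 3)
console.log(hasOrphanedToolResult(naive), hasOrphanedToolResult(windowed)) // true false
```

The window can end up slightly larger than `keep`. That is the trade: a few extra tokens instead of a validation error, which the re-measure step will catch if it matters.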

Micro-compact before LLM summarize

contextkit describes a 3-tier strategy used in production: micro-compact (free — trim old tool results without an LLM call), auto-compact (LLM summarization when approaching limit), and a circuit breaker after consecutive failures. Start with free trimming before paying for a summarization call.
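One way to wire the three tiers together is a small planner that decides which tiers to try for a given token count. This is a sketch under stated assumptions, not contextkit's implementation: createPlanner, the policy fields, and the thresholds are hypothetical, and the caller is expected to re-measure after micro-compact and skip auto-compact if the count is already back under budget.

```typescript
type Tier = 'micro-compact' | 'auto-compact'

interface CompactionPolicy {
  microAt: number // token count where free trimming starts (hypothetical)
  autoAt: number // token count where LLM summarization becomes eligible (hypothetical)
  maxFailures: number // consecutive summarization failures before the breaker opens
}

function createPlanner(policy: CompactionPolicy) {
  let consecutiveFailures = 0
  const breakerOpen = (): boolean => consecutiveFailures >= policy.maxFailures

  return {
    // Ordered list of tiers to try: free trimming always comes first.
    plan(inputTokens: number): Tier[] {
      if (inputTokens < policy.microAt) return []
      if (inputTokens < policy.autoAt || breakerOpen()) return ['micro-compact']
      return ['micro-compact', 'auto-compact']
    },
    // Call after each summarization attempt; a success closes the breaker.
    recordSummarization(ok: boolean): void {
      consecutiveFailures = ok ? 0 : consecutiveFailures + 1
    },
    breakerOpen,
  }
}

const planner = createPlanner({ microAt: 120_000, autoAt: 160_000, maxFailures: 3 })
console.log(planner.plan(100_000)) // nothing to do
console.log(planner.plan(130_000)) // free trimming only
console.log(planner.plan(170_000)) // trim first, then summarize if still over
```

When the breaker is open the planner stops offering auto-compact, so a failing summarizer cannot burn an API call on every step. What to do instead (harder trimming, or ending the loop) is a product decision left to the caller.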

Provider-side compaction vs client-side

OpenAI supports server-side compaction via context_management with compact_threshold on the Responses API: compaction triggers in-stream when the rendered token count crosses the threshold. Anthropic's Opus 4.6 guidance recommends server-side compaction; the Claude Agent SDK compaction_control uses a default context_token_threshold of 100,000 tokens.

AI SDK Issue #14017 proposes a provider-agnostic compact() primitive with compactWhen: hasMoreTokensThan(n) for ToolLoopAgent — still open and not merged. Until then, client-side compaction with tokens-usage preflight is the portable pattern across OpenAI, Anthropic, and Google.

  • OpenAI example: compact_threshold: 200000 in context_management
  • Anthropic SDK default: context_token_threshold: 100000
  • AI SDK v5: no built-in compaction — use prepareStep + tokens-usage (Vercel Community, Dec 2025)

Avoid re-compaction on every request

A common pitfall: the frontend sends all messages for UI consistency, prepareStep compacts every time, and summarization runs on every request. Track compaction state (summary + compactedUpToIndex) on the frontend or persist it server-side. Only re-summarize when new content crosses the threshold again.
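A minimal sketch of that state tracking. The CompactionState fields mirror the summary and compactedUpToIndex named above; the ChatMsg shape, the function names, and the choice to inject the summary as a system message are assumptions for illustration. It also assumes the system prompt is passed separately, so the history holds only user and assistant turns.

```typescript
interface ChatMsg {
  role: 'system' | 'user' | 'assistant'
  text: string
}

interface CompactionState {
  summary: string
  compactedUpToIndex: number // messages before this index are covered by `summary`
}

// Build the model-facing view from the full UI history without re-summarizing.
function applyCompactionState(full: ChatMsg[], state: CompactionState | null): ChatMsg[] {
  if (!state) return full
  const summaryMsg: ChatMsg = {
    role: 'system',
    text: `Summary of earlier conversation: ${state.summary}`,
  }
  return [summaryMsg, ...full.slice(state.compactedUpToIndex)]
}

// Summarize again only when the measured view crosses the trigger.
function needsResummarize(viewTokens: number, contextWindow: number, compactAt = 0.8): boolean {
  return viewTokens >= contextWindow * compactAt
}

// The frontend still sends all 10 messages; the server reuses the stored summary.
const fullHistory: ChatMsg[] = Array.from({ length: 10 }, (_, i): ChatMsg => ({
  role: i % 2 === 0 ? 'user' : 'assistant',
  text: `turn ${i}`,
}))
const saved: CompactionState = { summary: 'User is debugging a token budget.', compactedUpToIndex: 8 }

const view = applyCompactionState(fullHistory, saved)
console.log(view.length) // 3: the summary plus the 2 turns after index 8
```

Count tokens on the view, not on the full history. If needsResummarize returns true, summarize the view, store the new summary, and advance compactedUpToIndex to the end of what the summary now covers.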

FAQ

Common questions

Should I compact at 80% or wait until safeMaxTokens is zero?

Compact at 80% of your context window (inputTokens >= contextWindow × 0.8) to leave room for the next tool result and output tokens. Also compact immediately if safeMaxTokens drops below your MIN_OUTPUT — waiting guarantees truncated or failed responses.

Why not use message count instead of token count?

Message count ignores payload size. One tool result from a web search or file read can exceed dozens of chat turns. Vercel Community developers compact at 150+ messages, but token-budget triggers (80% fill) match how OpenAI and Anthropic compaction APIs actually work.

Does AI SDK have a built-in compaction API?

Not yet. Issue #14017 proposes compact() and compactWhen for ToolLoopAgent but remains open (April 2026). Vercel Community confirms AI SDK v5 requires manual compaction via prepareStep or frontend state tracking.

What is the measure → compact → re-measure loop?

Count tokens with tokens-usage before compaction (baseline), apply trimming or summarization, then count again to verify you are under budget. Log the delta for observability. Only stream when the post-compaction count no longer trips your trigger conditions.

Can I use provider-side compaction instead?

The OpenAI Responses API supports context_management with compact_threshold for server-side compaction. Anthropic's Opus 4.6 guidance recommends server-side compaction. Both are provider-specific; client-side compaction with tokens-usage works across all three providers in AI SDK agent loops.

Will compacting invalidate prompt caching?

Yes. Issue #14017 discussion notes that modifying past tool calls and results invalidates provider-side cache. Mask old tool results in batches, and only when your budget constraints require it; do not compact on every turn if cache efficiency matters for your cost model.

How does this connect to the other tokens-usage guides?

Guides 1–4 teach measurement: preflight counting, tiktoken accuracy, Claude max_tokens, and tool schema overhead. This guide closes the loop: measure → decide → compact → re-measure. Start with count-tokens-vercel-ai-sdk, then apply triggers here.

Start counting before you send

Add tokens-usage to your stack today. Source-available license — see LICENSE.md for terms.

npm install tokens-usage