Skip to content
Foundationshigh

Token Economics & Cost

Tokens are the billing unit. Input tokens (prompt) and output tokens (response) are priced separately, with output usually 4–5x more expensive. Sonnet 4.6 is around $3/M input, $15/M output; Opus 4.6/4.7 is around $5/M input, $25/M output — about 1.7× Sonnet, a significant convergence from the ~5× Opus/Sonnet gap in the Claude 3 era. Verify current pricing on the Anthropic console; it adjusts with model releases. Costs add up fast in long agent loops.

Memory anchor

Tokens are taxi meter clicks — input clicks every time you pull up to a stoplight (turn), output clicks while driving. Caching is the monthly transit pass — 90% off if you board before the pass expires (5 min).

Expected depth

Each turn re-sends the entire conversation as input — so a 50-turn agent run with 50K tokens of accumulated context pays for 50K input tokens × 50 turns = 2.5M input tokens before counting output. Mitigations: prompt caching (90% discount on cached prefix), batching, smaller models for routine work (Haiku for grep/file reads, Sonnet for design). Track cost per session; flag runaway loops where input grows without progress.

Deep — senior internals

Prompt caching has two TTL options: 5-minute (default since March 2026, ~25% write premium) and 1-hour (explicit `"ttl": "1h"` in cache_control, 2× write premium). Before March 2026 the default was 1 hour, which is a common gotcha in older example code. Cache-aware design: stable system prompt + CLAUDE.md at the front (cached), volatile user content at the end (not cached). Reads are 90% off — break-even is roughly 2 uses on the 5-minute tier. For high-frequency agents (CI bots, scheduled tasks), cache hit rate is the dominant cost driver. The 1-hour TTL is worth it for long-running tasks (>5 min between turns) where you'd otherwise re-cache. Output tokens dominate when responses are long (code generation, explanations); choose terse output formatting in those cases.

🎤Interview-ready answer

Tokens are how LLMs are priced — input and output separately, output usually 4–5x more expensive. The non-obvious cost in agents is that every turn re-sends the full conversation, so a long loop with verbose tool output gets expensive quickly. Prompt caching is the biggest lever: a stable prefix (system prompt + CLAUDE.md) gets cached for 5 minutes at a 90% discount on reads. I structure agent workflows to keep cache hits high — stable prefix at the front, volatile user content at the end, no sleeps longer than 5 minutes mid-loop.

Common trap

Ignoring prompt caching. Without it, a typical Claude Code session costs 3–5x more than it needs to. The cache is mostly automatic but breaks if you change the system prompt mid-session or sleep past 5 minutes.

Related concepts