Context Window & Management
The context window is the maximum number of tokens (input plus output) a model can process in a single request. Claude Sonnet 4 launched with a 200K-token window (roughly 150K words); newer Sonnet and Opus releases (4.5+) offer up to 1M tokens for longer contexts. GPT-4 Turbo has 128K; OpenAI's o-series and GPT-5 limits vary by model. Once the window is full, you must compress, prune, or fail.
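Budgeting against these limits starts with counting tokens. A minimal sketch, assuming the common rough heuristic of ~4 characters per English token (real counts come from the model's tokenizer; the function names and the 200K/8K defaults are illustrative, not from any API):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose.
    A budget heuristic only; the model's own tokenizer gives exact counts."""
    return max(1, len(text) // 4)

def fits_in_window(messages: list[str], window: int = 200_000,
                   reserve: int = 8_000) -> bool:
    """Check whether the conversation plus a reserved output budget
    fits inside the window."""
    used = sum(estimate_tokens(m) for m in messages)
    return used + reserve <= window
```

The `reserve` parameter captures the "room for response" idea: output tokens count against the same limit, so a request that exactly fills the window leaves no space to answer.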
The context window is a desk: pile it with unread papers (whole-file reads) and there is no room left to write. The tidier the desk, the better the work.
Window structure: system prompt (CLAUDE.md, tool definitions, persistent memory) + conversation history + the current user message + room for the response. Long agent runs fill the window with tool results: file contents, command output, search results. Past roughly 80% utilization, quality degrades (recency bias, lost-in-the-middle effects). Mitigations: file reads with offset/limit, summarizing old turns, dispatching subagents to absorb verbose work, and caching the static prefix.
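One of those mitigations, trimming old turns, can be sketched as a sliding window that always keeps the system prompt and as many recent turns as fit a budget. This is a toy version under the ~4 chars/token assumption; a real harness would summarize the dropped turns rather than discard them:

```python
def prune_history(system: str, turns: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the newest turns that fit the token
    budget; older turns fall off. Costs are approximated at ~4 chars/token."""
    cost = lambda t: len(t) // 4 + 1
    kept: list[str] = []
    remaining = budget - cost(system)
    for turn in reversed(turns):        # walk newest-first
        if cost(turn) > remaining:
            break                       # everything older is dropped
        kept.append(turn)
        remaining -= cost(turn)
    return [system] + list(reversed(kept))
```

Keeping the system prompt unconditionally mirrors the structure above: the static prefix is the one part of the window that must survive every prune.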
Different models behave differently at the edges of the window. Claude Code has auto-compaction, which summarizes old turns automatically. Some harnesses implement memory files (Claude Code's CLAUDE.md-style memory) so persistent facts don't live in the rolling window. Position matters: information at the very start (system prompt) and very end (most recent turn) is recalled best; the middle is the "lost in the middle" valley. For RAG-style agents, retrieval beats stuffing: fetch only what's relevant per turn. Budgeting: a rough rule is to keep working context under 50K tokens for predictable quality, even with a 200K or 1M limit, because quality degrades long before the hard cap.
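The "retrieval beats stuffing" point can be illustrated with a deliberately tiny retriever: score candidate chunks by word overlap with the current query and put only the top few into the window. The scoring is a stand-in for real embedding search; the function name and `k` parameter are illustrative:

```python
def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Fetch only the chunks most relevant to this turn (toy scoring:
    shared-word count) instead of stuffing every document into context."""
    q = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]
```

Per-turn retrieval also sidesteps the lost-in-the-middle problem: relevant material arrives at the end of the window, the position recalled best.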
The context window is the LLM's working memory: everything the model "sees" in a single forward pass. The practical challenge in agents is that tool results accumulate fast; reading three files, running a grep, and inspecting a build log can easily consume 50K tokens. The mitigations above apply directly: read file ranges instead of whole files, use grep for targeted searches, dispatch subagents to absorb verbose work, and prompt-cache the static prefix. Since quality degrades well before the hard limit, I aim to keep the working portion under roughly 50% utilization.
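"Absorbing verbose work" ultimately means keeping raw tool output out of the parent's window. A minimal sketch of one such compressor (the head/tail elision strategy and the function name are assumptions, not any harness's actual API):

```python
def absorb(tool_output: str, head: int = 20, tail: int = 5) -> str:
    """Compress a verbose tool result before it enters the window:
    keep the first and last lines, elide the middle with a count."""
    lines = tool_output.splitlines()
    if len(lines) <= head + tail:
        return tool_output              # short output passes through intact
    omitted = len(lines) - head - tail
    return "\n".join(lines[:head]
                     + [f"... [{omitted} lines omitted] ..."]
                     + lines[-tail:])
```

A subagent is the stronger version of the same idea: the full output lives in the subagent's window, and only a summary returns to the parent.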
A common anti-pattern: reading an entire file when you need 20 lines. Each unnecessary whole-file read shrinks the room left for reasoning and future tool calls, and trains a habit that breaks at scale.
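The fix is the offset/limit read mentioned earlier. A minimal sketch (the function name and defaults are illustrative; `itertools.islice` avoids loading the rest of the file into memory):

```python
from itertools import islice

def read_range(path: str, offset: int = 0, limit: int = 50) -> str:
    """Read only `limit` lines starting at line `offset`,
    instead of pulling the whole file into context."""
    with open(path) as f:
        return "".join(islice(f, offset, offset + limit))
```

Pairing this with a targeted grep (find the line number first, then read a small range around it) keeps even large-codebase navigation within a few hundred tokens per step.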