Context Windows

A context window is the maximum amount of text (measured in tokens) that a large language model can process in a single interaction—encompassing both the input (prompt, documents, conversation history) and the output (the model's response). It is one of the most practically significant parameters defining what an AI system can do.
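Because the window covers both input and output, an application must reserve space for the response when deciding whether a prompt fits. A minimal sketch of that budget check, using a rough 4-characters-per-token heuristic for English text (real systems would use the model's actual tokenizer; the function names and the 4,096-token default here are illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    This is a heuristic, not a tokenizer."""
    return max(1, len(text) // 4)

def fits_in_window(prompt: str, max_output_tokens: int,
                   context_window: int = 4096) -> bool:
    """Check that the prompt plus reserved output space fits the window.
    The window must hold BOTH the input and the generated response."""
    return estimate_tokens(prompt) + max_output_tokens <= context_window

# A short prompt with room for a 500-token answer fits a 4K window;
# a very long document does not.
print(fits_in_window("Summarize this paragraph for me.", 500))
```

The key design point is that the output budget is subtracted up front: a 4K-window model given a 4,000-token prompt has almost no room left to respond.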

Context windows have expanded dramatically. GPT-3 (2020) supported 4,096 tokens—roughly 3,000 words. By 2024, Claude offered 200,000 tokens and Gemini reached 1 million. In 2025-2026, experimental models push toward 2 million tokens and beyond. This expansion transforms capability: a model with a 4K context can answer a question about a paragraph; a model with a 200K context can analyze an entire codebase, legal contract, or research paper in one pass.

The engineering behind long contexts is non-trivial. Naive self-attention scales quadratically with context length: doubling the context quadruples the compute and memory needed for the attention score matrix. Innovations like FlashAttention (IO-aware attention computation), RoPE (Rotary Position Embeddings for position encoding), sliding-window attention, and KV-cache optimization have made long contexts practical. High Bandwidth Memory (HBM) is critical because the key-value cache grows linearly with context length and must stay in fast memory.
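The two scaling laws above can be made concrete with back-of-the-envelope memory formulas. The sketch below is illustrative only: the Llama-2-7B-like configuration (32 layers, 32 heads, head dimension 128, fp16) is an assumption chosen for the example, and real implementations like FlashAttention avoid ever materializing the full score matrix:

```python
def attention_score_bytes(seq_len: int, num_heads: int,
                          dtype_bytes: int = 2) -> int:
    """Memory for naive attention scores: one (seq_len x seq_len)
    matrix per head. Quadratic in context length."""
    return num_heads * seq_len * seq_len * dtype_bytes

def kv_cache_bytes(seq_len: int, num_layers: int, num_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Memory for the KV cache: keys and values (hence the factor 2)
    for every layer and every token. Linear in context length."""
    return 2 * num_layers * num_heads * head_dim * seq_len * dtype_bytes

# Illustrative Llama-2-7B-like config (an assumption for this example):
# 32 layers, 32 heads, head_dim 128, fp16 (2 bytes per value).
for n in (4_096, 8_192):
    scores = attention_score_bytes(n, num_heads=32)
    cache = kv_cache_bytes(n, num_layers=32, num_heads=32, head_dim=128)
    print(f"{n:>6} tokens: scores {scores / 2**30:.1f} GiB, "
          f"KV cache {cache / 2**30:.1f} GiB")
```

Doubling the sequence length quadruples the naive score-matrix memory but only doubles the KV cache, which is why the cache (not the scores) is what must persist in HBM during generation.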

For AI agents, context windows define working memory. An agent operating on 14-hour autonomous task horizons must maintain context about what it's doing, what it's already tried, and what it's learned. Combined with RAG (which extends effective context by retrieving relevant information from external stores) and vector search, long context windows enable agents to work on complex projects that require understanding thousands of interconnected details simultaneously.
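One way an agent can manage this working memory is to assemble each turn's context from two sources: retrieved notes (the RAG side) plus as much recent conversation history as the remaining budget allows, dropping the oldest messages first. This is a hedged sketch of that assembly step, not any particular agent framework's API; the function names and the 4-characters-per-token heuristic are assumptions:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token); a heuristic."""
    return max(1, len(text) // 4)

def build_agent_context(history: list[str], retrieved: list[str],
                        budget: int = 200_000) -> list[str]:
    """Assemble a context: retrieved notes first, then as much recent
    history as fits. Newest messages are kept in preference to oldest."""
    used = sum(estimate_tokens(note) for note in retrieved)
    kept: list[str] = []
    for msg in reversed(history):       # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break                       # budget exhausted: drop the rest
        kept.append(msg)
        used += cost
    return retrieved + list(reversed(kept))  # restore chronological order
```

Real agents layer more on top of this (summarizing dropped history, scoring retrieved notes by relevance), but the core constraint is the same: everything the agent "remembers" in a given step must fit inside the window.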