I saw a tweet last week from someone who was genuinely excited about the 1M token context window. “We had 4,700 lines of markdown state machines,” they wrote, “and the 1M window made it unnecessary.”
I read that three times trying to figure out if I was missing something. You had 4,700 lines of state management scaffolding… and your solution was to not manage state at all? To just dump everything into a bigger pile and hope the model figures it out?
That’s not progress. That’s substituting scale for architecture.
The bigger-is-better fallacy
Every few months, a model vendor announces a bigger context window. 32K became 128K. 128K became 200K. Now it’s 1M. And every time, the same chorus: “This changes everything. Now we can fit the whole codebase in context.”
I’ve been building on top of these models for a while now, and I’m increasingly convinced this is a trap. Not because bigger windows are bad — they’re fine. But because they solve the wrong problem, and solving the wrong problem convincingly is worse than not solving it at all.
Take a real task: “add rate limiting to the API gateway.” The relevant context is maybe 2,000 tokens — the gateway middleware file, the team’s convention for cross-cutting concerns, and the decision from six months ago to use a particular algorithm. Everything else in your codebase is noise.
At 200K tokens, you stuff in 50K of source files and hope the model finds the right ones. At 1M tokens, you stuff in 250K of source files and hope the model finds the right ones. The search space got 5x bigger. The signal didn’t change. You’re paying 5x more for the model to attend to 5x more irrelevant code.
What a bigger window can’t do
Here’s what keeps bugging me. No matter how big the window gets, it can’t solve the actual problems:
It can’t remember yesterday. 1M tokens is 1M tokens of this session. Close the tab, it’s gone. The decision your teammate made last week about the database schema? Not in the window. Never was. A bigger window doesn’t create memory. It creates a bigger ephemeral scratch pad.
It can’t represent knowledge that isn’t code. Why did the team choose Postgres over DynamoDB? Who’s the expert on the billing system? What happened in the last production incident? You could have a 10M token window — none of this would be in it, because it doesn’t exist in files.
It doesn’t scale economically. 1M tokens at current pricing is 10-50x more expensive than a focused 20K-token request. And you’re paying that premium for context the model probably won’t use. Token prices will drop — they always do — but the fundamental economics don’t change: you’re paying for the model to attend to irrelevant code. Cheaper irrelevance is still irrelevance.
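To make the economics concrete, here’s the back-of-envelope version. The $3-per-million-input-tokens price is an illustrative assumption, not any vendor’s actual rate — the ratio is what matters:

```python
# Back-of-envelope cost comparison for a single request.
# ASSUMPTION: $3 per 1M input tokens (illustrative, not real pricing).
PRICE_PER_M_INPUT = 3.00

def request_cost(input_tokens: int) -> float:
    """Input-token cost of one request, in dollars."""
    return input_tokens / 1_000_000 * PRICE_PER_M_INPUT

full_window = request_cost(1_000_000)  # dump-everything request
focused = request_cost(20_000)         # curated 20K-token request

print(f"full window: ${full_window:.2f}")   # $3.00
print(f"focused:     ${focused:.2f}")       # $0.06
print(f"ratio:       {full_window / focused:.0f}x")  # 50x
```

The absolute numbers will age; the 50x ratio between dumping everything and sending a curated slice won’t.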
The number I keep coming back to
500 tokens of the right knowledge outperforms 500K tokens of raw transcript.
I’m not being hyperbolic. A typed decision item — “we use leaky bucket for rate limiting, chosen over token bucket for smoother traffic shaping” — is 40 tokens. The conversation where that decision was discussed, with all the back-and-forth, the code examples, the alternatives explored, the tangents? Maybe 50K tokens. The 40-token version is more useful for future sessions than the 50K-token version, because it’s structured, searchable, and directly applicable.
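To show what “structured, searchable” could mean in practice, here’s a sketch of a typed decision item. The schema, field names, and the chars-per-token heuristic are all hypothetical — the point is the shape, not the API:

```python
# Hypothetical schema for a typed decision item -- field names are
# illustrative, not any real product's API.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DecisionItem:
    summary: str              # the decision itself, ~1 sentence
    rationale: str            # why, and over what alternative
    scope: str                # where it applies, e.g. "api-gateway"
    decided_on: date          # when, so staleness is knowable
    tags: list[str] = field(default_factory=list)

    def approx_tokens(self) -> int:
        # Crude heuristic: ~4 characters of English per token.
        return max(1, len(f"{self.summary} {self.rationale}") // 4)

item = DecisionItem(
    summary="Use leaky bucket for rate limiting",
    rationale="Chosen over token bucket for smoother traffic shaping",
    scope="api-gateway",
    decided_on=date(2024, 9, 12),  # illustrative date
    tags=["rate-limiting", "architecture"],
)
print(item.approx_tokens())  # a few dozen tokens, not 50K
```

Note what the 50K-token transcript doesn’t have and this does: a scope to filter on, tags to search on, and a date to judge staleness by.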
This is the difference between recall and understanding. A 1M-token window gives you recall — the raw material is somewhere in there. Understanding requires structure — knowledge that’s been extracted, typed, scoped, and made retrievable.
The compounding problem
The deepest issue with bigger windows, and the one nobody talks about: they don’t compound.
Remember that tweet about 4,700 lines of state management? Those lines existed because the tool has no memory. The context window was being used as a substitute for persistent state. The engineer was hand-building a memory system in markdown because the product didn’t have one. A bigger window doesn’t fix that — it just lets you defer hitting the wall.

And when someone tells you “the bigger window made our state management unnecessary,” what they’re really saying is they replaced engineered state with hoping the model can find things in a larger pile of raw transcript. I’ve seen how that ends — the model confidently cites something from turn 3 that was corrected in turn 47, because both are in context and the model has no way to know which one is current.
But the bigger issue is across sessions. Session 1: you discuss the architecture and make decisions. 1M tokens of context. Session 2: you start fresh. Zero tokens of context. Everything from session 1 is gone. You spend 20 minutes rediscovering what you decided yesterday.
I’ve watched this happen on my own team. An engineer has a productive Monday session — makes decisions, discovers patterns, fixes subtle bugs. Tuesday morning, new session. The AI doesn’t know any of it. The engineer re-explains the relevant parts. Wednesday, same thing. By Friday, they’ve spent hours re-teaching instead of building.
A system that extracts the 500 tokens of durable knowledge from Monday’s session and injects them into Tuesday’s session is infinitely more valuable than one that could hold 10M tokens but forgets everything overnight.
What we do instead
Here’s what we built. After every session, the AI extracts durable knowledge — decisions, error patterns, coding conventions, architectural insights — and stores them as structured, typed, searchable items. A 50K-token conversation about rate limiting produces a 40-token decision item: “we use leaky bucket for rate limiting, chosen over token bucket for smoother traffic shaping.” Next session, that item is automatically injected into the AI’s context before it writes a single line of code.
The AI doesn’t need to rediscover what was decided last week. It doesn’t need 50K tokens of transcript to understand the team’s rate limiting approach. It needs 40 tokens of extracted knowledge and the right retrieval system to surface them at the right moment.
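A minimal sketch of that extract-then-inject loop, under loud assumptions: `extract_session` stubs in for what is really a model call, the in-memory list stands in for a persistent store, and keyword overlap stands in for real indexing and retrieval:

```python
# Sketch of the extract-then-inject loop. The store is an in-memory
# list and retrieval is naive keyword overlap -- stand-ins for a real
# persistence and indexing layer.
knowledge_store: list[dict] = []

def extract_session(transcript: str) -> list[dict]:
    """Distill durable items from a session transcript.
    In practice this is an LLM pass; here we stub its output shape."""
    return [{
        "type": "decision",
        "scope": "api-gateway",
        "text": "use leaky bucket for rate limiting, chosen over "
                "token bucket for smoother traffic shaping",
    }]

def inject_context(task: str) -> str:
    """Build a prompt prefix from stored items that overlap the task."""
    words = set(task.lower().split())
    hits = [i for i in knowledge_store
            if words & set(i["text"].lower().split())]
    return "\n".join(f"[{i['type']}] {i['text']}" for i in hits)

# Session 1 ends: extract and store the 40-token item.
knowledge_store.extend(extract_session("...50K-token transcript..."))
# Session 2 starts: inject before the model writes any code.
print(inject_context("add rate limiting to the API gateway"))
```

The loop is the point: extraction happens at session end, injection at session start, and nothing depends on the window being big enough to hold the original transcript.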
We have five layers of this — from the current turn all the way back to the full organizational knowledge base. Each layer is progressively broader and retrieved differently: the most recent context is always present, project-level rules are auto-injected, and the full history is searchable on demand. The 1,000th session has access to everything the organization learned in the first 999 — not because the window is big enough, but because the knowledge was extracted, structured, and made retrievable.
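One way to sketch that layering — the layer names, retrieval rules, and token estimate here are all assumptions for illustration, not the actual system:

```python
# Sketch of layered context assembly under a token budget. Layers run
# narrowest-first: nearby context is always included, broader layers
# contribute until the budget runs out. Names are illustrative.
from typing import Callable

Layer = tuple[str, Callable[[str], list[str]]]  # (name, retriever)

def assemble_context(task: str, layers: list[Layer], budget: int) -> list[str]:
    context, used = [], 0
    for name, retrieve in layers:
        for item in retrieve(task):
            cost = len(item) // 4 or 1  # crude ~4 chars/token estimate
            if used + cost > budget:
                return context          # budget exhausted; stop here
            context.append(f"[{name}] {item}")
            used += cost
    return context

layers: list[Layer] = [
    ("session", lambda t: ["User is adding rate limiting to the gateway."]),
    ("project", lambda t: ["Convention: cross-cutting concerns live in middleware."]),
    ("org",     lambda t: ["Decision: leaky bucket over token bucket for smoother shaping."]),
]
for line in assemble_context("add rate limiting", layers, budget=200):
    print(line)
```

The budget is the design choice that a 1M window lets you avoid making — and avoiding it is exactly how you end up paying for 250K tokens of noise.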
Could you build something like this on top of a 1M-token window? Technically, sure. But you’d be using the window as a database, which it isn’t. It has no indexing, no typing, no scoping, no persistence, and no way to distinguish current truth from historical artifact. You’d be rebuilding a knowledge system inside a context window that was designed for conversation.
Where this leaves us
The window size arms race is a distraction. The vendors want you to think the constraint is window size because they can sell you a bigger window. The actual constraint is that AI tools are stateless — they have no persistent memory, no organizational knowledge, no compounding intelligence.
A bigger window treats the symptom (not enough room) instead of the disease (no durable knowledge). And it treats it expensively.
The shift isn’t 200K to 1M. It’s from stateless to stateful. From tools that forget to tools that learn. That’s not a context window problem. That’s an architecture problem. And architecture doesn’t get solved by making the window bigger.
Next: why adding more agents doesn’t help either — and why the agent explosion is really a confession that the tools don’t know enough.