
How do you currently manage context in your Claude Code sessions?
The 1M token context window for Claude Code went GA in March with no premium surcharge.
Sonnet 4.6 and Opus 4.6 both support it at flat standard rates.
My natural response was to stop managing context and load everything in.
I operated that way for two weeks and my usage costs went up without my output getting better.
So I spent a weekend reading up and testing everything I could to keep the output quality while cutting the cost.
What follows are the takeaways, obvious in hindsight but not before, about what you load into your sessions and how you structure them.
DevTools of the Week
1. Model-agnostic coding agent with a free ad-supported tier. The only serious Claude Code alternative that lets you swap models mid-session without losing context.
2. Context7 MCP
Fetches live, version-specific library docs at call time and injects them into your agent context. Kills hallucinated API calls from stale training data.
3. Git Pitcher
Turns your repo into structured plans and prompt packs for downstream agents.
What the 1M Window Changes
Every message you send in Claude Code re-sends the entire conversation to the API.
Turn 1 sends your system prompt, CLAUDE.md, tool definitions, and your first message.
Turn 50 sends all of that plus 49 rounds of conversation and tool call output.
At 200K tokens, sessions would hit compaction around turn 50-80 depending on how much tool output accumulated.
When compaction fired, Claude summarized the conversation and continued from the summary, dropping the details.
At 1M tokens you have five times the space; in a normal session, you won't hit the limit at all.
The practical change is that long, complex multi-file tasks can now run to completion without losing context mid-session.
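To make that concrete, here's a back-of-envelope model of per-turn input growth. The token sizes are assumptions for illustration, not measured Claude Code numbers:

```python
# Rough model of how per-turn input grows in an agentic session.
# FIXED_PREFIX and TOKENS_PER_TURN are assumed sizes, not measured ones.
FIXED_PREFIX = 15_000      # system prompt + CLAUDE.md + tool definitions
TOKENS_PER_TURN = 3_000    # avg message + assistant reply + tool output

for turn in (1, 10, 50, 80, 200):
    input_tokens = FIXED_PREFIX + TOKENS_PER_TURN * (turn - 1)
    print(f"turn {turn:>3}: ~{input_tokens:,} input tokens")

# turn  50: ~162,000 -> nearing the old 200K compaction zone
# turn  80: ~252,000 -> would have compacted at 200K; trivial for 1M
```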
Where Costs Come From
Sonnet 4.6 is $3 input / $15 output per million tokens.
Output costs 5x input, which is a steep asymmetry.
The real bill comes from output: 10K tokens of generated code at $15/million is $0.15 per turn.
I tested 20 turns on a complex task and spent $3 on output from a single session before even accounting for the growing input.
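The arithmetic behind that, as a sketch using the rates above:

```python
# Output-only cost of the 20-turn session described above.
OUTPUT_PRICE = 15 / 1_000_000    # Sonnet 4.6: $15 per million output tokens

tokens_per_turn = 10_000         # generated code per turn, as in the example
turns = 20

per_turn = tokens_per_turn * OUTPUT_PRICE    # $0.15
session = per_turn * turns                   # $3.00
print(f"${per_turn:.2f} per turn, ${session:.2f} per session (output only)")
```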
So I found three things that drive your bill up without improving your output:
1/ Carrying stale context after a task is done
2/ Loading a bloated CLAUDE.md every session regardless of what you're doing
3/ Running extended thinking on tasks that don't need deep reasoning.
The 1M window gives these problems room to compound.
The Prompt Caching Layer
Anthropic published a post last week from the Claude Code team that everyone should read in full.
Their headline claim is that they declare internal SEVs when their prompt cache hit rate drops too low.
Cache hit rate is a first-class metric for them because of the discount: 90% off cached input tokens.
A 2,000-token system prompt costs $0.006 normally.
Cached, it costs $0.0006.
The critical constraint is that caching is a prefix match.
Any change to the cached prefix invalidates everything after it.
Claude Code's entire architecture is built around this: stable pieces at the top, conversation at the bottom.
If you put anything dynamic near the top of your system prompt (like a timestamp, session ID, anything that changes per request), you break caching for the entire session and pay full input price on every turn.
If you're building your own agent harness on the raw API, you have to set cache_control explicitly.
Claude Code does it automatically and yet most devs building custom tooling skip it and pay 10x more on input than they need to.
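If you are on the raw API, a minimal sketch of an explicit cache breakpoint looks like this; the model id and prompt contents are placeholders, not values from the post:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "..."   # stable instructions: no timestamps, no session IDs
PROJECT_NOTES = "..."   # CLAUDE.md-style project context, also stable

response = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder id, use whatever model you run
    max_tokens=2048,
    system=[
        # Stable blocks go first. The cache_control marker sets a breakpoint:
        # everything up to it is written to, then read from, the prompt cache.
        {"type": "text", "text": SYSTEM_PROMPT},
        {"type": "text", "text": PROJECT_NOTES,
         "cache_control": {"type": "ephemeral"}},
    ],
    # Dynamic content stays below the breakpoint, so it never invalidates the prefix.
    messages=[{"role": "user", "content": "Refactor the auth middleware."}],
)
print(response.usage)  # check cache_creation_input_tokens / cache_read_input_tokens
```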
The Cost Breakdown
Here's what a session that previously would have triggered compaction costs now.
Total input per request at turn 30: ~124,500 tokens.
Without caching, that's 124,500 × $3/million ≈ $0.37 per turn, or roughly $11.20 of input across 30 turns.
With caching, the 44,500 stable tokens (system prompt + CLAUDE.md + file context) cache after turn 1. Cached tokens cost $0.30/million instead of $3.
Your per-turn input cost drops to roughly $0.25 and keeps falling as the cache stabilizes.
Add output at 30 turns × 800 average output tokens × $15/million (about $0.36 for the whole session) and the totals land around $8.00 with caching vs $11.55 without.
That's roughly 30% cheaper in one session, and across a month of heavy use it's easily $50+ in savings.
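Here's the same breakdown as a sketch you can rerun with your own numbers (it ignores the small 1.25x cache-write premium on turn 1):

```python
# Reproduces the turn-30 breakdown above; token counts are this post's estimates.
IN, CACHED_IN, OUT = 3 / 1e6, 0.30 / 1e6, 15 / 1e6   # $ per token

stable, dynamic, output = 44_500, 80_000, 800        # tokens per turn at turn 30
turns = 30

no_cache = turns * ((stable + dynamic) * IN + output * OUT)
with_cache = turns * (stable * CACHED_IN + dynamic * IN + output * OUT)

print(f"without caching: ${no_cache:.2f}")           # ~$11.56
print(f"with caching:    ${with_cache:.2f}")         # ~$7.96
print(f"saved: {1 - with_cache / no_cache:.0%}")     # ~31%
```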
What To Do, Session By Session
Tip 1: Keep your CLAUDE.md under 200 lines.
Every token in that file loads at session start and re-sends on every turn, whether or not it's relevant to what you're doing.
PR review workflows, database migration steps, and deployment checklists all sit in context even when you're writing a UI component.
Move specialized workflows into Claude Code Skills, which load on demand, and keep CLAUDE.md for the things that are always true: naming conventions, test commands, branch patterns.
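As a sketch of what survives that split; the project details here are hypothetical:

```markdown
# CLAUDE.md — only the always-true facts (~30 lines, not 500)
- Package manager: pnpm. Tests: `pnpm test`. Lint: `pnpm lint`.
- Branches: feature/<ticket-id>-short-name, squash-merge only.
- Naming: camelCase functions, PascalCase components.
<!-- PR review, migrations, deploy checklists -> .claude/skills/, loaded on demand -->
```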
Tip 2: Clear between unrelated tasks.
Use /clear when switching from a bug fix to a new feature.
Stale context from the previous task adds input tokens on every subsequent message without contributing to quality.
If you want to keep the high-level context but drop the detail, use /compact with custom instructions, for example:
"/compact Focus on: current file structure, architecture decisions made, active errors. Drop: individual file contents already written, completed tool calls."
Tip 3: Turn off extended thinking for simple tasks.
Extended/adaptive thinking is on by default, and the budget can hit 30-40K thinking tokens per request on a hard problem.
Thinking tokens bill as output tokens at $15/million.
For tasks that don't need deep reasoning, drop the effort level: leave it on for complex architectural decisions and turn it off for everything else.
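In Claude Code that's a settings toggle; on the raw API, extended thinking is opt-in per request, so you can gate it yourself. A sketch, with an arbitrary budget:

```python
import anthropic

client = anthropic.Anthropic()

def ask(prompt: str, hard: bool):
    # Thinking tokens bill as output tokens, so only buy them when the task
    # actually needs deep reasoning.
    kwargs = {}
    if hard:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 16_000}  # arbitrary
    return client.messages.create(
        model="claude-sonnet-4-6",   # placeholder id
        max_tokens=20_000,           # must exceed the thinking budget
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
```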
Tip 4: Audit what's in your context.
Run /context and you'll often find MCP servers loaded that you're not using, tool definitions from 6+ servers adding thousands of tokens each session, and file reads from earlier still cached in context.
Disable unused MCP servers with /mcp.
The Toolradar guide published last month flagged that 5+ loaded MCP servers with 15 tools each can consume 50,000-75,000 tokens before you send a single message.
Three servers is the practical ceiling before you're paying for context you don't need.
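The arithmetic behind those numbers, assuming an average tool schema size:

```python
# Why idle MCP servers are expensive: every tool schema ships on every request.
servers, tools_per_server = 5, 15
tokens_per_tool = 700        # assumed average for one JSON schema definition

overhead = servers * tools_per_server * tokens_per_tool
print(f"~{overhead:,} tokens before you send a single message")   # ~52,500
```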
When 1M Is The Right Tool
Auto-compaction is good engineering, but it loses things: specific variable names and types from files read early in the session, the reasoning behind architectural decisions made in earlier turns, and error details from tool calls that happened before the summary window.
For quick tasks, compaction is fine. For longer ones, that information loss shows up as subtle bugs and inconsistencies that are hard to trace.
My Take
I spent two weeks treating the 1M window as a license to load everything and move fast, and my usage bills reflected that.
The part worth pulling from the Claude Code team's caching post is that compaction breaks your cache.
The summarization call uses a different system prompt, so the cache prefix diverges at the first token and you get zero savings on that call.
Running the summary with the same system prompt prefix as your main agent fixes this, and it changes how you think about structuring any long-running session.
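A sketch of that fix, continuing the raw-API example from earlier; the function and summary prompt are mine, not from Anthropic's post:

```python
# Reuse the main loop's stable blocks verbatim and in the same order: caching
# is a prefix match, so an identical prefix keeps the summarization call warm.
SYSTEM_BLOCKS = [
    {"type": "text", "text": SYSTEM_PROMPT},
    {"type": "text", "text": PROJECT_NOTES,
     "cache_control": {"type": "ephemeral"}},
]

def summarize(history: list[dict]) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",   # placeholder id
        max_tokens=2048,
        system=SYSTEM_BLOCKS,        # identical prefix -> cache hit, 90% off
        messages=history + [{
            "role": "user",
            "content": "Summarize this session: architecture decisions, "
                       "active errors, current file structure.",
        }],
    )
    return response.content[0].text
```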
If you're still on Sonnet 4.5 with the 1M beta, you're paying a 2x input and 1.5x output surcharge on sessions that cross 200K tokens.
Moving to Sonnet 4.6 removes that entirely at the same base price.
Until next time,
Vaibhav 🤝🏻
If you've read this far, you might find this interesting.
#Partner 1
200+ Proven Ways to Make Money With AI in 2026
The next wave of millionaires will be people who figured out how to make AI work for them.
The window to get ahead is still open. But not for long.
Here are 200+ proven ways to make money with AI in 2026.
Sign up for Superhuman AI, the free daily newsletter read by 1M+ professionals, and get instant access to all 200+ ways to profit from AI this year.
#Partner 2
The best prompt engineers aren't typing. They're talking.
Power users figured this out early: speaking a prompt gives you 10x more context in half the time. You include the edge cases, the examples, the tone you want — because talking is fast enough that you don't skip them.
Wispr Flow captures everything you say and turns it into clean, structured text for any AI tool. Speak messy. Get polished input. Paste into ChatGPT, Claude, Cursor, or wherever you work.
89% of messages sent with zero edits. 4x faster than typing. Works system-wide on Mac, Windows, and iPhone.