
What's eating most of your agent's quality?
I spent two months thinking my agent had a prompting problem because the outputs were inconsistent and citations were wrong half the time.
I kept rewriting the system prompt and nothing fundamentally changed.
The problem was that I was handing the agent a task without the information it needed to do that task.
And fixing it at the context level changed my output quality more in a week than months of prompt iteration had.
Context engineering is the work of deciding what information enters the agent's window, when it enters, in what order, and how much it fits before quality degrades.
So we're going to build a research agent from scratch, twice.
Once with the Claude API directly and once with Claude Code, using the same concepts on different surfaces.
What You Need
Python 3.10+
anthropic SDK: pip install anthropic
tavily-python for web retrieval: pip install tavily-python
Anthropic API key and Tavily API key (both have free tiers)
Part 1: Claude API
The problem
This runs, and Claude produces something. But it's working entirely from training data, with no retrieval and nothing you can verify.
The output quality is capped by what Claude already knows, which means it's stale before you read it.
The deeper issue is that you gave the model a task without giving it any information.
The solution
Run it:
Multi-turn research without context rot
The single-turn version is tidy because the work is one-shot, which real research rarely is.
You ask a follow-up, then another, and five turns in the context are clogged with resolved questions and stale material from the start of the conversation.
Here's how to keep the session from degrading as that happens.
Cheap-model summarization holds up on simple research. When a multi-turn session builds on nuanced technical detail across many turns,
Haiku flattens some of it.
I've had sessions where the summary dropped a key constraint from turn 2 that the rest of the conversation depended on, and I didn't catch it until the output was wrong.
For complex research, compact with Sonnet and pay the cost.
Part 2: Claude Code
The same agent can live as a Claude Code workflow as the four operations are identical; the difference is that configuration files make the decisions instead of Python code.
Claude Code loads those files at startup, so the agent knows your context architecture without being told.
This is the better setup when you do research regularly inside a project and don't want to re-explain your conventions every session.
Project structure
CLAUDE.md
This is the first thing Claude Code reads in every session. Be specific here, because vague instructions in CLAUDE.md produce vague output.
My long-running mistake with CLAUDE.md was that I wrote it once and never touched it again.
Claude Code was working from a map that no longer matched the territory, and most of the bugs I was chasing traced back to it.
Treat CLAUDE.md as a living document and update it whenever a decision changes.
.claude/rules/source-handling.md
Rules fire only for matching file patterns. This one loads when Claude Code is working with source data,
.claude/skills/research-report/SKILL.md
report-template.md
Running it
Claude Code reads CLAUDE.md on startup, loads source-handling.md when it touches sources, runs the skill workflow, and saves the report.
You don't repeat instructions or paste the context architecture each session, because it already lives in the config files.
Summary
What to extend next
The agent above handles one topic per session. A few extensions take it further.
Add a planning step.
Before retrieval, use Haiku to decompose the topic into three to five specific sub-questions, then retrieve sources for each one separately.
Targeted retrieval makes the final synthesis better.
Add source validation.
After retrieval, run a quick pass per source: "is this relevant to [topic]? yes or no." discard the no results before synthesis.
Make the Claude Code version stateful.
Add a research-index.json that tracks past topics, source URLs, and report locations, and reference it in CLAUDE.md.
Claude Code can then see what's already been researched and surface relevant prior work when you start a new topic.
My take
I wasted two months on prompt iteration before I looked at the context layer, and that wasn't a one-off.
Almost every agent problem I've debugged since traced back to what the model could see, not how I asked.
The reason context engineering doesn't get talked about is that it's unglamorous. retrieval, truncation, isolation, offloading.
None of it sounds like AI research. but it's what moved my output quality.
If your agent is inconsistent and a better model barely helps, look at what's in the window before you touch the prompt again.
Reply below: what's the worst context bug you've debugged?
Until next time,
Vaibhav 🤝🏻
If you read till here, you might find this interesting
#Partner 1
Your first HR system, implemented right
Rolling out your first HR tool? Get a step-by-step guide to avoid common mistakes, drive adoption, and build a scalable HR foundation.
#Partner 2
What if your job search ran automatically 24/7?
AIApply is your AI Career Agent working 24/7 to find the best jobs online, tailor every application to your profile, and automatically apply on your behalf so you can spend less time job hunting and more time landing interviews.


















