
What decides which model goes in your agent?
Most people pick an agent model on one of those four, and so did I until this week.
Two models landed this month for running agents on their own, for hours at a stretch.
GPT-5.5 came on April 23 and Qwen 3.7 Max more recently.
On Terminal-Bench 2.0, the benchmark built for exactly that work, GPT-5.5 leads by 13 points, 82.7 to 69.7.
It also costs 4x as much on output and 2x as much on input.
Benchmark-watchers and price-watchers are looking at the same two models and coming to opposite conclusions.
I didn't want to settle it from a table of scores.
So I gave both the same job on the same machine, told them not to check in with me, and watched what they did.
Before We Begin
DevTools That Caught My Attention
1. Boxes.dev: Gives every Claude Code or Codex thread its own cloud VM with a full dev environment, so you can run and steer agents from desktop or mobile instead of leaving a laptop cracked open all day. It runs on your existing Codex or Claude Code subscription.
2. Paseo: An open-source app that drives Claude Code, Codex, Copilot and OpenCode through one interface across desktop, mobile, CLI and web, with voice control and a self-hosted, no-telemetry design.
3. Lowfat: A pluggable command-line filter that sits between you and an LLM and strips noise from the context before it gets sent, aimed at cutting the token bill on agent-heavy workflows. The author reports it trimmed 91.8% of his tokens in testing.
The Test
I wanted a task that couldn't be faked.
Fixing a bug in an existing repo is an option, but it skips whether a model can build from scratch and then live with what it built.
So I handed each one an empty directory and the whole job: write a small service, test it, extend it, and keep the tests green after the change.
That's the loop where a solid agent shows itself, because somewhere in there it has to trip over its own setup and recover without me.
Same prompt to both, verbatim:

GPT-5.5 ran in Codex at medium reasoning, and Qwen 3.7 Max in OpenCode.
The Run
Both got to green eventually. The difference was in how.
Qwen barely broke stride.
It tried pip3 install, hit the externally-managed-environment error, and switched to a venv on the next move.
From there it built the service, wrote the tests, added cancel, and ran it all again: eight tool calls with no detours.
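For reference, the recovery Qwen made can be scripted in a few lines. This is a minimal sketch, not the commands either agent actually ran: the `create_venv` and `venv_python` helpers are hypothetical names, and it assumes a system Python that refuses `pip3 install` outside a virtual environment, which is where the externally-managed-environment error comes from.

```python
import os
import sys
import venv
from pathlib import Path


def venv_python(env_dir: Path) -> Path:
    """Return where the interpreter lives inside a venv (Windows or POSIX layout)."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_venv(env_dir: Path, with_pip: bool = True) -> Path:
    """Create an isolated environment and return the path to its interpreter."""
    builder = venv.EnvBuilder(with_pip=with_pip, symlinks=os.name != "nt")
    builder.create(env_dir)
    return venv_python(env_dir)
```

Calling `create_venv(Path(".venv"))` and then running the returned interpreter with `-m pip install pytest` keeps the install inside the environment, so pip never touches the system interpreter that raised the error.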
GPT-5.5 hit the same wall and fumbled it.
It tried to run the suite three ways: pytest, then python -m pytest, then python3 -m pytest, each time calling a test runner that wasn't installed yet.
Only after the third miss did it build a venv, the move Qwen had made instantly.
Then it did something Qwen never did: it stopped and asked me. Installing dependencies tripped Codex's sandbox, so it paused for permission instead of pushing through.
Its own final report said it finished "mostly, but not fully in the strict sense." Qwen just kept going.

The tests told the same story in miniature.
The spec asked for two cancel-path tests, success and 409.
Qwen wrote three and left no warnings. GPT-5.5 wrote the two, then carried a pytest warning through both runs without chasing it down.
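To make "success and 409" concrete, here is a hedged sketch of what the two cancel-path tests could look like. The prompt isn't reproduced above, so everything here is hypothetical: the `ItemStore` class, its method names, and the rule that cancelling an already-cancelled item is what produces the 409. It uses a plain in-memory object instead of a web framework so it stays self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class ItemStore:
    """Hypothetical in-memory stand-in for the service's state."""
    items: dict = field(default_factory=dict)

    def create(self, item_id: str) -> tuple[int, dict]:
        self.items[item_id] = "active"
        return 201, {"id": item_id, "status": "active"}

    def cancel(self, item_id: str) -> tuple[int, dict]:
        if item_id not in self.items:
            return 404, {"error": "not found"}
        if self.items[item_id] == "cancelled":
            # Assumed conflict rule: a second cancel is rejected with 409.
            return 409, {"error": "already cancelled"}
        self.items[item_id] = "cancelled"
        return 200, {"id": item_id, "status": "cancelled"}


def test_cancel_success():
    store = ItemStore()
    store.create("a1")
    status, body = store.cancel("a1")
    assert status == 200
    assert body["status"] == "cancelled"


def test_cancel_conflict_returns_409():
    store = ItemStore()
    store.create("a1")
    store.cancel("a1")  # first cancel succeeds
    status, _ = store.cancel("a1")  # second cancel hits the conflict path
    assert status == 409
```

Saved as a test file, pytest would collect both `test_` functions; the point is only the shape of the two paths, not the real service's API.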
Side by side:

Then there's the bill.
OpenCode showed Qwen's whole run landing at $0.08.
Codex doesn't put its spend on screen, but the math is plain: GPT-5.5 costs 4x as much per output token and made more calls to get to the same place.
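A rough way to bracket the GPT-5.5 bill from the numbers above: if it had used exactly the same tokens as Qwen's $0.08 run, its cost would be 2x on the input part and 4x on the output part. The sketch below works that out. The `price_multiplier` function and the input/output splits are assumptions for illustration, since neither harness reported a token breakdown, and it ignores caching and the extra calls GPT-5.5 made.

```python
def price_multiplier(
    input_share: float,
    input_ratio: float = 2.0,
    output_ratio: float = 4.0,
) -> float:
    """Bill multiplier for the pricier model at identical token counts.

    input_share is the fraction of the cheaper model's bill that came from
    input tokens (0.0 means all output, 1.0 means all input).
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_ratio * input_share + output_ratio * (1.0 - input_share)


QWEN_RUN_USD = 0.08  # figure OpenCode reported for the whole run

for share in (0.0, 0.5, 1.0):  # hypothetical input/output splits
    estimate = QWEN_RUN_USD * price_multiplier(share)
    print(f"input share {share:.0%}: about ${estimate:.2f}")
```

At equal token counts that puts GPT-5.5 between $0.16 and $0.32 for the same job, and the extra calls it made would push the real figure above whatever point in that range applies.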
When Each One Wins
I’d call this a split decision.
GPT-5.5 still owns the hard, long-haul jobs, and one task doesn't change that.
The 13-point Terminal-Bench lead is legit, and it shows up on work gnarlier than mine.
Qwen wins the loop most of us live in, the ordinary work of adding something and keeping everything else green.
That's the bulk of daily agent work, and Qwen handled it in fewer steps and recovered faster, at a quarter the output cost.
Over a long run the gap widens, since its cached input price drops 90% and a deeper context means more of each request is served from cache.
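Why the gap would widen with depth can be sketched with a toy agent loop. This assumes the 90% figure is a discount on cached input tokens, that every token sent on an earlier turn is a cache hit on later turns, and that each turn adds a fixed number of new tokens; `input_cost_units` and its parameters are hypothetical, and costs are in units of one full-price input token.

```python
def input_cost_units(
    turns: int,
    new_tokens_per_turn: int,
    cached_discount: float = 0.90,
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) for the input side of an agent loop."""
    uncached = 0.0
    cached = 0.0
    context = 0  # tokens already sent on earlier turns
    for _ in range(turns):
        # Without caching, the whole context is re-billed at full price each turn.
        uncached += context + new_tokens_per_turn
        # With caching, the repeated prefix is billed at the discounted rate.
        cached += context * (1.0 - cached_discount) + new_tokens_per_turn
        context += new_tokens_per_turn
    return uncached, cached


for turns in (2, 10, 50):
    full, discounted = input_cost_units(turns, new_tokens_per_turn=1000)
    print(f"{turns} turns: cached input costs {discounted / full:.0%} of uncached")
```

Under these assumptions the cached share of each request grows every turn, so the discounted bill shrinks as a fraction of the full-price one the longer the run goes.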
Two caveats:
The models ran in different harnesses, and part of the "GPT-5.5 stopped to ask" gap is Codex's approval config.
And this was an easy task, a four-endpoint service that never stress-tests Qwen's 35-hour marathon or GPT-5.5's long-horizon lead.
The launch catch still stands too.
Qwen is closed-weights, API-only on DashScope, so "Chinese model" doesn't mean self-hostable.
Reviewers also flagged that its low hallucination score leans on the model abstaining more often, which is worth testing on your own data before you ship.
My take
I expected GPT-5.5 to win this, and not by a little.
The agentic-coding gap is wide, this was an agentic coding task, and the favorite was supposed to walk it.
It didn't.
The cheaper model got there in fewer moves and cleared its one snag faster.
None of that shows up on a leaderboard.
Terminal-Bench doesn't tell you how many wrong turns a model took, what they cost, or whether it stopped halfway to ask permission.
On a real loop, those are the numbers that hit your bill and your patience.
"Which model is best" stopped being the useful question a while ago.
What matters is how a model behaves when it hits a wall and no one's watching.
Until next time,
Vaibhav 🤝🏻