GPT-5.5 hallucinated a file that doesn't exist. Has a model ever confidently misled you in a codebase?


Opus 4.7 and GPT-5.5 shipped last week, and not without controversy, as always.

The benchmarks are split, and so are the opinions. So I ran two tests of my own, and the results don't fully agree with either camp.

DevTools of the Week

A macOS menu bar app that surfaces all your open pull requests from GitHub, GitLab, and Azure DevOps in one place, with color-coded labels by project and real-time notifications. Supports multiple accounts across providers, customizable filters, and launches at login so you never lose track of review queues.

A spec-driven platform where you describe what an AI agent should do in plain English and get a typed, tested, versioned REST API endpoint in under 60 seconds. It handles model routing across OpenAI, Anthropic, Google, and Perplexity, plus built-in observability, prompt versioning, and automated test generation so teams ship agents instead of building infrastructure.

An AI product engineer that connects to PostHog, GitHub, and Slack to autonomously watch session replays, surface bugs, and identify conversion drop-offs before users churn. It can also propose and run A/B experiments, submit variants as PRs, and answer ad-hoc questions about how users interact with your product.

The Tests

GPT-5.5 scores 82.7% on Terminal-Bench 2.0, where Opus 4.7 scores 69.4%.

On SWE-bench Pro, it flips: Opus at 64.3%, GPT-5.5 at 58.6%.

OpenAI disputes Anthropic's eval setup and Anthropic disputes OpenAI's. Both are probably partially right.

The benchmarks give us a split verdict. So I ran two tests on codebase reasoning and silent bug detection. Here's what happened.

Test 1: Codebase Reasoning

The setup: Hono's full source repository, around 40 files, TypeScript, covering the router, middleware pipeline, context handling, and adapter layer for different runtimes.

Both models received identical context.

The prompt:

Here is the full Hono codebase. Do not summarize it. Answer three questions with specific file and function references:

  1. What is the riskiest architectural assumption that would be painful to undo at scale?

  2. If I needed to add distributed tracing across all middleware, where exactly would I hook it in and what would break?

  3. Is there anywhere two separate parts of the codebase are solving the same problem with different implementations?

Output:

GPT-5.5 started with a broad regex search across the repo, spending more tool calls on orientation than on verification.

For question 1, it identified the mutable shared Context object as the core risk: middleware cooperatively mutates the same context before and after `await next()`, a pattern baked into built-in middleware like logger, compress, etag, and timing.
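The pattern being flagged is the classic onion middleware model. A minimal sketch (illustrative, not Hono's actual source; all names here are made up):

```typescript
// Onion-model middleware sketch: every middleware shares one mutable
// context and can mutate it both before and after calling next().
type Ctx = { headers: Record<string, string>; status: number };
type Middleware = (ctx: Ctx, next: () => Promise<void>) => Promise<void>;

function compose(middleware: Middleware[]): (ctx: Ctx) => Promise<void> {
  return async (ctx) => {
    let index = -1;
    async function dispatch(i: number): Promise<void> {
      if (i <= index) throw new Error("next() called multiple times");
      index = i;
      const fn = middleware[i];
      if (!fn) return;
      await fn(ctx, () => dispatch(i + 1));
    }
    await dispatch(0);
  };
}

// A timing-style middleware mutates the shared ctx on the way out.
const timing: Middleware = async (ctx, next) => {
  const start = Date.now();
  await next(); // all downstream middleware run inside this await
  ctx.headers["x-response-time"] = `${Date.now() - start}ms`;
};

const handler: Middleware = async (ctx) => {
  ctx.status = 200;
};
```

Because every middleware holds a reference to the same mutable object, changing the context's shape or mutation order later means auditing every middleware that touches it, which is why it is painful to undo at scale.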

For question 2, it found the same primary hook point as Opus but caught one thing Opus missed: combine.every() deliberately freezes routeIndex during nested middleware execution, which flattens span naming if you key spans off routeIndex.
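In an onion pipeline, hooking in tracing generally means wrapping `next()` in a span. A generic sketch of that hook point (the `routeIndex` caveat above is Hono-specific and not modeled here):

```typescript
// Generic tracing-middleware sketch: open a span before next(),
// close it after all downstream middleware finish.
type Span = { name: string; start: number; end?: number };
const spans: Span[] = [];

type Next = () => Promise<void>;

async function traced(name: string, next: Next): Promise<void> {
  const span: Span = { name, start: Date.now() };
  spans.push(span);
  try {
    await next(); // everything downstream runs inside this span
  } finally {
    span.end = Date.now(); // close even if downstream throws
  }
}
```

The subtlety both models were probing: whatever value you use for `name` has to be stable across nested middleware execution, or nested spans get flattened into one.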

For question 3, it found the route matching split between RegExpRouter and TrieRouter and the streaming helper duplication. Missed the form data parsing duplication.

Opus 4.7 dispatched two parallel subagents before writing a single word: 34 and 49 tool calls respectively, then it read 5 more files to verify.

For question 1, it identified the single-handler fast path in hono-base.ts:424-441. When the router matches one handler, Hono skips compose() entirely.

The two paths have different semantics for next() call guards and context.finalized behavior, so middleware written against one silently breaks on the other.
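To make the divergence concrete, here is a minimal sketch of two dispatch paths with different next() semantics (assumed for illustration, not hono-base.ts itself):

```typescript
// Sketch: the composed path guards against next() being called twice,
// while a single-handler fast path hands the handler a bare no-op.
type Handler = (next: () => Promise<void>) => Promise<void>;

async function composedDispatch(h: Handler): Promise<void> {
  let called = false;
  await h(async () => {
    if (called) throw new Error("next() called multiple times");
    called = true;
  });
}

async function fastPathDispatch(h: Handler): Promise<void> {
  // Fast path: no guard, next() does nothing.
  await h(async () => {});
}

// A handler that incorrectly calls next() twice:
const sloppy: Handler = async (next) => {
  await next();
  await next();
};
```

The same handler throws on one path and resolves silently on the other, which is exactly the "silently breaks" failure mode Opus described.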

For question 3, it found the form data parsing duplication between validator.ts and body.ts that GPT-5.5 missed.

Winner: Tie.

Test 2: Silent Bug Detection

The setup: A TypeScript rate limiter module with three intentional bugs planted at different severity levels.

The bugs: a sliding window implementation with a race condition under concurrent load, an off-by-one in the expiry cleanup, and missing validation that allows negative limits to be set without error.
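For readers who want a feel for what these bug classes look like, here is a hypothetical reconstruction (my sketch, not the actual test file):

```typescript
// Hypothetical sliding-window limiter illustrating the three planted
// bug classes. All names and details are invented for illustration.
const WINDOW_MS = 1000;

class SlidingWindowLimiter {
  private counts = new Map<string, number[]>();

  constructor(private limit: number) {
    // Bug class 3: no validation here. A negative limit slips through,
    // and the `limit > 0` check below silently disables limiting.
  }

  async allow(key: string): Promise<boolean> {
    const now = Date.now();
    const seen = this.counts.get(key) ?? [];
    // Bug class 2: off-by-one — `>=` keeps timestamps exactly at the
    // window boundary, so a boundary request can be counted twice.
    const fresh = seen.filter((t) => t >= now - WINDOW_MS);
    if (this.limit > 0 && fresh.length >= this.limit) return false;
    // Bug class 1: an await between the read above and the write below
    // lets two concurrent calls read the same array, then one write
    // clobbers the other — an undercount.
    await Promise.resolve();
    this.counts.set(key, [...fresh, now]);
    return true;
  }
}
```

Sequentially this limiter behaves correctly, which is what makes all three bugs silent: only concurrency, boundary timing, or a bad config value exposes them.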

Both models were given the file and its test suite with no indication bugs exist.

The prompt:

Audit this code. Find every place where a failure could be silent, a value could be wrong under specific conditions, or an assumption is made that isn't enforced. Do not flag style issues. Only flag things that could cause incorrect behavior in production. For each finding, state the exact condition that triggers it.

Output:

GPT-5.5 found the off-by-one in expiry cleanup and explained it correctly: requests at the boundary of the time window can be counted twice if they arrive within the same millisecond as a cleanup cycle.

It missed the race condition and the negative limit bug entirely. 

It also flagged two false positives: a Date.now() precision concern the implementation already handles, and a style preference that isn't a bug.

One real find out of three planted.

Opus 4.7 found all three.

The race condition explanation was specific: get and set on the sliding window aren't atomic, so two concurrent requests can both read the same count, increment locally, and one write overwrites the other, producing an undercount that only surfaces under load.
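One standard fix for this class of race is to serialize the read-modify-write section per key, for example with a promise-chain mutex. A general-purpose sketch (my pattern, not something either model proposed):

```typescript
// Per-key promise-chain mutex: serializes async read-modify-write
// sections so concurrent callers can't interleave between the read
// and the write.
class KeyedMutex {
  private tails = new Map<string, Promise<void>>();

  async run<T>(key: string, fn: () => Promise<T>): Promise<T> {
    const tail = this.tails.get(key) ?? Promise.resolve();
    let release!: () => void;
    this.tails.set(key, new Promise<void>((r) => (release = r)));
    await tail;          // wait for the previous holder of this key
    try {
      return await fn(); // critical section runs alone for this key
    } finally {
      release();         // let the next queued caller proceed
    }
  }
}
```

Wrapping the limiter's check-then-write in `mutex.run(key, ...)` makes the two concurrent reads impossible, at the cost of queueing requests per key.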

For the negative limit, it caught that the code accepts any number type without a guard, meaning a caller passing -1 gets a rate limiter that allows infinite requests.
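The guard that was missing is a one-liner; a sketch of the kind of check that would have closed the hole (hypothetical, not the test file's actual API):

```typescript
// Validate a rate limit up front instead of silently configuring a
// limiter that never limits. Rejects NaN, zero, negatives, and
// non-integers.
function assertPositiveLimit(limit: number): void {
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new RangeError(`rate limit must be a positive integer, got ${limit}`);
  }
}
```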

For the off-by-one, it found the same issue GPT-5.5 found with the same explanation.

One false positive on parseInt coercion that isn't a real bug in this implementation.

Winner: Opus 4.7.

My Take

Last edition I ran both models on the same bug set and found that the 15 bugs only one model caught were the ones that would have shipped.

The conclusion was to run both.

This edition gave me a different reason to reach the same conclusion.

On bug detection, Opus was clearly better. 

On codebase reasoning, they didn't agree on what was risky, and both were right.

So, we still don’t have a clear winner. I'm continuing to use both.

What about you?

Until next time,
Vaibhav 🤝🏻

If you read till here, you might find this interesting

#Partner 1

Deloitte: Robot “Adoption is Accelerating Exponentially”

Robots are going from niche to mainstream, per Deloitte. They say it’s especially true in places where “physical AI solves real problems.” Take the $1 trillion fast-food market, where brands turn to robots to alleviate 144% labor turnover. 

Miso’s Flippy Fry Station AI robot has already been adopted by major brands like White Castle, frying 5M+ baskets of food to date. That earned strategic investment from industry powerhouse Ecolab and a unique collaboration with NVIDIA.

Now, after acquiring Zignyl, the powerful restaurant-operations tool, Miso brings powerhouse operators like Cinnabon, Jamba, and Jersey Mike’s under its umbrella.

Next up? Miso’s scaling across a $4B/year revenue opportunity. Join 39,000+ people as an early-stage Miso investor before they reach 100,000+ target locations.

This is a paid advertisement for Miso Robotics’ Regulation A offering. Please read the offering circular at invest.misorobotics.com.

#Partner 2

Hiring in 8 countries shouldn't require 8 different processes

This guide from Deel breaks down how to build one global hiring system. You’ll learn about assessment frameworks that scale, how to do headcount planning across regions, and even intake processes that work everywhere. As HR pros know, hiring in one country is hard enough. So let this free global hiring guide give you the tools you need to avoid global hiring headaches.
