
When you use AI with an image, what do you need it to do?
Yesterday, Anthropic released Claude Opus 4.7.
Along with it, the Mythos Preview benchmark numbers sneaked in, and holy shit.
If you haven't seen them, check them out. It makes sense why Anthropic isn't releasing them publicly.
The announcement was mostly about agentic workflows, software engineering, and all the dev stuff.
We’ll explore that in upcoming editions, but buried in it was this specific claim: "substantially better vision."
I wanted to know if that was real.
So I ran tests across three models that have consistently been the strongest on visual tasks.
But before that, let’s catch up on AI this week:
TOOLS of the Week
1. X-Pilot: Upload a PDF, PPT, or document and it turns it into a narrated course video with animated visuals. You can make changes using plain English commands like "shorten the intro" and it applies them instantly.
2. ClayHog: Tracks how your brand shows up in AI-generated answers across ChatGPT, Gemini, Perplexity, Google AI, and Claude. Gives you prompt tracking, competitor monitoring, and citation tracking in one dashboard.
3. Avec: A free iOS email app that surfaces your most important Gmail messages one at a time and learns your preferences as you swipe. Think of it as Tinder for your inbox, minus the existential dread.
What I tested and why
Seeing and reasoning are different things.
Most models can tell you what's in an image, but only a few can tell you what it means, what's wrong with it, or what you should do about it.
So I built four tests around that gap:
A cluttered checkout screen: can it diagnose real UX problems and suggest specific fixes?
A physics diagram: can it extract information from a visual and solve a multi-step problem?
A tense startup meeting: can it read beneath the surface and infer what's going on?
A chaotic Kanban board: can it identify constraints and build a real plan from what it sees?
Test 1: The checkout screen
The prompt: You are a product designer…[See full prompt here]
Input:
I gave all three models a fictional Indian e-commerce checkout page, which had five UX problems deliberately baked in.
I wanted to see who could prioritize the problems correctly and suggest fixes specific enough to implement.
Output:
All three caught the main issues. But Opus 4.7 went further.
It caught the duplicate rupee symbol in the Pay Now button (₹ ₹807.00), a small bug neither of the others noticed.
It also rewrote the order summary as a formatted table. That's a proper handoff.
Gemini spotted the same issues but stayed at the description level. Opus 4.6 was close but more verbose without adding anything new.
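As an aside, the double-symbol bug is cheap to catch in code even if it's easy to miss by eye. A minimal sketch of that kind of lint check (the regex and function are mine, not from any model's output):

```python
import re

# Flags a currency symbol that repeats before an amount,
# e.g. "₹ ₹807.00" -- the duplicate-symbol bug from the checkout test.
DUPLICATE_CURRENCY = re.compile(r"([₹$€£])\s*\1")

def has_duplicate_symbol(label: str) -> bool:
    """Return True if a UI string repeats a currency symbol."""
    return bool(DUPLICATE_CURRENCY.search(label))

assert has_duplicate_symbol("Pay Now ₹ ₹807.00")      # the bug
assert not has_duplicate_symbol("Pay Now ₹807.00")    # the fix
```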
Winner: Opus 4.7
Test 2: The physics diagram
The prompt: Solve this problem step by step…[See full prompt here]
Input:
A classic mechanics problem: Block A (4kg) on a 30° incline, connected by a string over a pulley to Block B (6kg) hanging vertically. Coefficient of friction 0.3. Find the acceleration.
I already knew the answer going in: 2.96 m/s².
This test was about how carefully the models interpret the diagram before solving.
Output:
All three got it right, showed their working, and self-checked.
Gemini was the clearest teacher.
It was the most explicit about why each step followed from the previous one, and why the motion arrows in the diagram determined the direction of friction.
If I were a student, that's the explanation I'd want.
One detail: before solving, Opus 4.7 flagged that it wanted to note something about the target answer.
A small metacognitive signal that I thought was cool, but not decisive.
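For reference, the arithmetic checks out. A minimal sketch of the force balance the models had to extract from the diagram (assuming g = 10 m/s², which is what the stated 2.96 m/s² answer implies, with B descending and friction therefore acting down the incline on A):

```python
import math

# Two-body pulley problem: hanging block B drags block A up the 30° incline.
g = 10.0                  # m/s^2; the 2.96 answer implies g = 10, not 9.8
m_a, m_b = 4.0, 6.0       # kg
theta = math.radians(30)  # incline angle
mu = 0.3                  # coefficient of friction

# Take "B down, A up the incline" as positive. Friction opposes A's motion.
driving = m_b * g                                              # weight of B
resisting = m_a * g * math.sin(theta) + mu * m_a * g * math.cos(theta)
a = (driving - resisting) / (m_a + m_b)

print(f"a = {a:.2f} m/s^2")  # -> a = 2.96 m/s^2
```

The direction check is the whole game here: you have to confirm B's weight actually beats gravity plus friction on A before you can fix the sign of the friction term, which is exactly what those motion arrows Gemini explained were encoding.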
Winner: Draw
Test 3: The startup meeting
The prompt: Look at this image and answer…[See full prompt here]
Input:
I generated an image of a tense strategy session.
Output:
Opus 4.7's response was in a different class.
It noticed the spilled coffee and connected it to the Q2 plan sitting underneath it.
It also read the garbled text on the whiteboard and called it out as either a rushed capture problem or a team clarity issue.
It spotted the competitor analysis sitting untouched and said: "they're doing internal blame-assignment when the answer might be external."
Its prediction for what happens next is a solid inference from a still image.
Opus 4.6 was solid: it noticed the revenue dip driving the conversation and made a reasonable call that the meeting would lead to a product pivot.
Gemini spotted the garbled text too, but used it to conclude the image was AI-generated and spent most of its response on that observation instead of reasoning about the scene.
Winner: Opus 4.7
Test 4: The Kanban board
The prompt: You are an operator looking at this dashboard…[See full prompt here]
Input:
A project board for a fictional D2C coffee brand called Brewhaus.
Output:
All three produced actionable plans, but the quality of constraint identification varied.
Opus 4.7 caught that two card IDs (BRW-51 and BRW-57) each appeared twice. Neither of the other models noticed.
That's a data hygiene issue that signals either miscounted work or a team that's lost track of what it's building.
It also named Priya as the bottleneck directly and built the plan around redistributing her load specifically, not just "reduce WIP" as a general principle.
And its trade-off section was candid.
It said cutting the subscription model means the launch becomes "a packaging-and-PR event" and called that a real business consequence.
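Side note on that duplicate-ID catch: it's trivial to verify once a model surfaces it from a screenshot. A minimal sketch (BRW-51 and BRW-57 are from the board; the other IDs are placeholders I made up):

```python
from collections import Counter

# Card IDs as read off the board. BRW-51 and BRW-57 each appear twice;
# the rest are hypothetical filler.
cards = ["BRW-48", "BRW-51", "BRW-51", "BRW-55", "BRW-57", "BRW-57", "BRW-60"]

duplicates = [card for card, n in Counter(cards).items() if n > 1]
print(duplicates)  # -> ['BRW-51', 'BRW-57']
```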
Gemini was well-structured but felt templated.
Winner: Opus 4.7
So, is the claim true?
Partially.
The "substantially better vision" framing made me expect sharper image perception in regards to higher resolution understanding and catching more visual detail.
That didn't really change.
What improved is the inference layer.
Opus 4.7 does see a bit more clearly than 4.6, but mostly it reasons further from what it sees.
The gap on pure perception tasks is narrow, but on tasks that require going from image to insight to action, it's wider.
One surprising result for me was Gemini.
On three of four tests it was solid, structured, capable, and occasionally clearer than either Claude model.
Test 3 was the exception: when the image was ambiguous or imperfect, it defaulted to meta-observation instead of reasoning through the ambiguity.
Final Scorecard:
UI Reasoning: Opus 4.7
Diagram Reasoning: Draw
Real World Reasoning: Opus 4.7
Planning from Visuals: Opus 4.7
My Take
I expected to either be impressed or have something to debunk. Neither happened.
Test 3 showed me Opus 4.7 can go further with an image than anything I've used before, but it didn't change how I think about any of this.
It's a better tool doing the same job better. That's what most upgrades are.
I won't be routing all my image tasks through 4.7. The tokens add up and the gap isn't wide enough to justify it for everything.
I'll use it when the task actually needs that inference depth. For everything else, any model is fine.
All prompts and the full model outputs are in the prompt book if you want to run these tests yourself.
Until next time,
Vaibhav 🤝🏻
If you read this far, you might find this interesting
#AD 1
Someone just spent $236,000,000 on a painting. Here’s why it matters for your wallet.
Late last year, a Klimt sold for the highest price ever paid for modern art at auction.
An outlier, sure, but it wasn't a fluke. U.S. auction sales grew 23.1% in 2025. The $1-5mm segment even grew 40.8% YoY.
Now, the S&P, teetering on all-time highs, just posted its worst quarter since 2022, oil was up 94% (briefly), and Moody's puts recession odds at 48.6%.
Each environment is unique, but after dot-com, post-war and contemporary art grew about 24% annually for a decade. After 2008, about 11% for 12 years.
It’s also had near-zero correlation with the S&P 500 since ‘95.*
Now, Masterworks lets you invest in shares of artworks featuring legends like Banksy, Basquiat, and Picasso.
$1.3 billion invested across over 500 artworks.
28 sales to date.
Net annualized returns on sold works held 12+ months include 14.6%, 17.6%, and 17.8%.
Shares can sell quickly, but my subscribers can skip the waitlist:
*Investing involves risk. Past performance is not indicative of future returns. See important Reg A disclosures at masterworks.com/cd.
#AD 2
Hiring in 8 countries shouldn't require 8 different processes
This guide from Deel breaks down how to build one global hiring system. You’ll learn about assessment frameworks that scale, how to do headcount planning across regions, and even intake processes that work everywhere. As HR pros know, hiring in one country is hard enough. So let this free global hiring guide give you the tools you need to avoid global hiring headaches.