0:00
And Doug, there's nowhere I wouldn't go to help someone customize and save on car insurance with Liberty Mutual.
0:08
Even if it means sitting front row at a comedy show.
0:12
Hey everyone, check out this guy and his bird. What is this, your first date?
0:16
Oh no, we help people customize and save on car insurance with Liberty Mutual together. We're married.
0:20
Ah! You to a human, him to a bird?
0:22
Yeah, the bird looks out of your league anyways.
0:24
Only pay for what you need at Liberty Mutual.com.
0:27
Liberty, Liberty, Liberty, Liberty.
0:31
Capital One's tech team isn't just talking about multi-agentic AI.
0:35
They already deployed one.
0:37
It's called Chat Concierge.
0:39
And it's simplifying car shopping using self-reflection and layered reasoning with live API checks.
0:45
It doesn't just help buyers find a car they love.
0:47
It helps schedule a test drive, get pre-approved for financing,
0:51
and estimate trade-in value.
0:53
Advanced, intuitive, and deployed.
0:55
That's how they stack.
0:57
That's technology at Capital One.
0:59
An engineer at AMD recently analyzed roughly 7,000 AI coding sessions.
1:07
Yeah, that massive data set.
1:09
Right. And the results were honestly shocking.
1:11
The AI's reasoning depth suddenly dropped by 73%.
1:15
It just stopped thinking.
1:17
Exactly. It completely stopped reading files.
1:19
And well, it started breaking things.
1:21
Absolutely devastated developer workflows.
1:23
I mean, it became a reckless liability overnight.
1:25
Welcome to The Deep Dive.
1:27
Today, we're looking at Claude Opus 4.7.
1:29
The big question is whether it's a true upgrade or, you know,
1:33
just a bandaid for the massive complaints users had with 4.6.
1:37
Right. So we are walking through five brutal side-by-side tests.
1:41
We've got financial analysis, SaaS modeling, hard coding, legal reasoning, and vision.
1:45
And we'll see exactly where it wins and where it completely fails.
1:47
Plus how it stacks up against Gemini 3.1 Pro and GPT 5.4.
1:51
But to really understand why these tests matter, you have to look at the drama first.
1:55
Right. The drama that forced Anthropic to release 4.7.
1:59
Yeah. The fall of 4.6 was rough.
2:01
Editing files without reading them jumped from 6% to nearly 34%.
2:05
Wow. Users had to interrupt it 12 times more often.
2:09
It made up fake Git commit hashes.
2:11
Git commit hashes are unique IDs for saved code changes.
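As a minimal illustration of why those IDs are unique to the content they describe, here is how git derives the ID for a file's contents (a "blob"); a full commit hash is computed the same way over the commit object:

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the object ID git assigns to file content (a "blob").

    Git hashes a small header plus the raw bytes, so any change to the
    content yields a completely different 40-character hex ID.
    """
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Identical content always maps to the same ID...
a = git_blob_id(b"hello\n")
# ...while a one-byte change produces an unrelated ID.
b = git_blob_id(b"hello!\n")
```

This is why a model can't plausibly "make up" a commit hash: a real hash is a checksum of actual repository content, and an invented one points at nothing.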
2:15
Right. And it referenced fake APIs.
2:18
Its accuracy on BridgeBench just completely plummeted.
2:21
I have to admit. Yeah.
2:23
But I still wrestle with prompt drift myself.
2:27
Watching a model confidently go rogue halfway through a task is incredibly frustrating.
2:31
It destroys your trust in the tool.
2:34
So 4.7 brought in some serious fixes.
2:37
Like the new effort level. Right. The X-high setting.
2:39
Yeah. It forces the model to compute longer.
2:42
And they added an ultra review command for a secondary review pass.
2:46
And the context window. The context window is the model's short-term memory during a chat.
2:51
They pushed it to 1 million tokens massive.
2:54
Whoa. Imagine stacking Lego blocks of data until you fit an entire company's history into one session.
3:01
But the catch is the new tokenizer.
3:03
It means it costs 1 to 1.35 times more tokens.
3:07
Right. It's more expensive.
3:09
Yeah. But biomolecular reasoning safety jumped from 30.9% to 74%.
3:14
So did anthropic actually build a smarter model?
3:18
Or just turn the safety knobs back to where they used to be?
3:21
Well, a jump that huge and a hard science proves it's foundational.
3:24
You can't just tweak safety dials to double accuracy.
3:27
So it's a real foundational upgrade, not just a quick settings patch.
3:31
Exactly. It's a real architectural shift.
3:33
Okay. Let's unpack this.
3:35
4.7 is truly smarter.
3:37
It should follow strict instructions without losing its mind.
3:40
Right. Let's look at the financial chart test.
3:42
We gave both models a 12 month Nvidia stock chart.
3:46
The prompt demanded exactly four numbered sentences.
3:49
Just history, key signal, hidden risk and concrete action.
3:53
Right. No fluff allowed.
3:55
And 4.6 completely ignored the formatting.
3:59
It wrote this panicked rambling paragraph instead.
4:02
But 4.7 followed the rules perfectly.
4:04
Four clean sentences.
4:06
But what's fascinating here is the actual insight it provided.
4:10
It noticed the 12 month chart was hiding a 95% gain.
4:14
It looked like a flat line.
4:15
Right. Which is a massive risk most retail traders miss entirely.
4:20
It even suggested a concrete 5% position sizing rule with weekly tranches.
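The transcript doesn't show the model's actual numbers, so as a hypothetical sketch of what a "5% position with weekly tranches" rule means in practice (the portfolio value and four-week split here are made up):

```python
# Hypothetical sketch of a capped position entered via equal weekly buys
# (dollar-cost averaging). position_pct and weeks are illustrative, not
# taken from the model's answer.

def tranche_plan(portfolio_value: float, position_pct: float = 0.05,
                 weeks: int = 4) -> list[float]:
    """Split a size-capped position into equal weekly tranches."""
    target = portfolio_value * position_pct   # total capital at risk, e.g. 5%
    per_week = target / weeks                 # equal slice per week
    return [round(per_week, 2) for _ in range(weeks)]

plan = tranche_plan(100_000)  # $100k portfolio -> $5,000 total position
```

Spreading the entry over several weeks limits the damage if the "hidden" run-up the model flagged reverses right after you buy.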
4:26
Why does formatting matter so much if the financial advice from both models was still decent?
4:32
Because skipping structural rules is a huge red flag.
4:35
It shows attention decay.
4:36
If it ignores simple constraints, you can't trust it on larger tasks.
4:40
Right. Sloppy formatting means the model isn't paying attention to your actual instructions.
4:45
Precisely. It's a foundational processing flaw.
4:47
So formatting is one thing.
4:49
But what happens when the logic in the prompt itself is fundamentally flawed?
4:53
Oh, this is the B2B sauce model test. It's totally a trap.
4:57
We asked for 12 months of projections.
5:00
Three pricing tiers, churn, marketing spend.
5:03
But the starting numbers were secretly broken.
5:05
Yeah, 4.6 fell right into it.
5:07
It built a beautifully polished spreadsheet immediately.
5:10
But it built it blindly based on bad math.
5:14
But 4.7 totally pumped the brakes.
5:15
It flagged four massive issues before writing a single formula.
5:19
Yeah. It pointed out the 150K cash would burn out by month four.
5:23
And it caught that net revenue retention was mathematically uncomputable.
5:26
Right. Because we didn't give it any expansion data.
5:30
It also noted that a 4% monthly churn equals a brutal 39% annual churn.
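The two checks described above are simple arithmetic, and they're worth seeing worked out. The $40k/month burn rate below is a hypothetical input standing in for the test's broken starting numbers; the churn compounding is exact:

```python
# Sanity checks of the kind the model is described as running before
# building the spreadsheet. The monthly burn figure is hypothetical.

def months_of_runway(cash: float, monthly_burn: float) -> float:
    """How many months until the cash balance hits zero."""
    return cash / monthly_burn

def annual_churn(monthly_churn: float) -> float:
    """Churn compounds: the surviving fraction after a year is (1 - m)^12."""
    return 1 - (1 - monthly_churn) ** 12

# $150k cash at a hypothetical $40k/month burn dies before month four.
runway = months_of_runway(150_000, 40_000)   # 3.75 months
# 4% monthly churn quietly means losing ~39% of customers per year.
yearly = annual_churn(0.04)
```

The churn result is the non-obvious one: 4% a month sounds mild, but compounding turns it into losing well over a third of the customer base annually.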
5:36
It's like hiring an accountant.
5:38
4.6 just files the bad paperwork.
5:41
Yeah, without saying a word.
5:43
4.7 stops you and says, hey, you're going bankrupt.
5:46
It's an incredible self-correction feature.
5:48
Does this pushback feature make 4.7 harder to use for quick simple tasks?
5:54
I mean, yeah, if you just want a quick template that hesitation adds friction.
5:58
But for business strategy, that friction is vital.
6:01
Got it. So it prioritizes business usability over just giving a fast, pretty answer.
6:05
Exactly. A fast, wrong answer is still wrong.
6:08
So we know it catches bad math.
6:10
But what about the hard coding redemption test?
6:13
This is what made 4.6 infamous.
6:15
Right. Legacy code is chaotic.
6:17
One wrong move breaks the whole app.
6:19
We ran an express API refactor test.
6:22
We asked it to add an endpoint and refactor middleware.
6:25
And we explicitly said, don't break existing routes.
6:27
Right. And it had to read the files before editing.
6:29
Well, 4.6 gave vague bullets.
6:31
It didn't name validation libraries.
6:33
No backward compatibility plan either.
6:35
Right. You couldn't run it safely without a dozen follow-up questions.
6:39
Here's where it gets really interesting.
6:41
4.7 wrote a PR style plan.
6:44
It independently chose Joi from the package file.
6:47
It handled backward compatibility with default subdocuments.
6:51
Default subdocuments are nested records filling in missing data automatically.
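This isn't the model's actual code, but the idea is easy to sketch: records stored before a schema change get their missing nested fields filled from defaults, so old data keeps working after the refactor. The field names below are hypothetical:

```python
# Sketch of the "default subdocument" pattern: merge defaults into a record
# recursively, never overwriting data that's already present. All field
# names here are illustrative.

DEFAULTS = {"notifications": {"email": True, "sms": False}}

def with_defaults(record: dict, defaults: dict) -> dict:
    """Fill in missing (possibly nested) keys without clobbering existing values."""
    merged = dict(defaults)
    for key, value in record.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_defaults(value, merged[key])  # recurse into subdocs
        else:
            merged[key] = value  # existing data wins over the default
    return merged

# An old record that predates the "email" field still reads correctly:
old_user = {"name": "Ada", "notifications": {"sms": True}}
user = with_defaults(old_user, DEFAULTS)
```

Applying defaults at read time like this is what lets existing imports and stored documents survive a schema change untouched.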
6:56
Exactly. It made sure existing imports wouldn't break.
6:59
Execution ready immediately.
7:01
It anticipated the blast radius of its changes across the whole system.
7:05
If I'm not a developer, why should I care how an AI writes an API endpoint?
7:09
Because it proves deep architectural foresight.
7:12
It maps out dependencies before making irreversible changes.
7:16
Because it proves the model now plans complex multi-step actions
7:20
before recklessly executing them.
7:22
Exactly. It thinks before it types.
7:24
Planning in short bursts is one thing.
7:26
How does this critical thinking hold up with a million token memory?
7:29
The massive context flood.
7:31
Right. We uploaded six PDFs, 180,000 words of due diligence.
7:35
Decks, legal term sheets, surveys.
7:38
The task was to find every legal risk and write a 300 word memo.
7:43
And 4.6 acted like a junior analyst.
7:46
It just dumped a flat list of risks by document.
7:49
Accurate, but totally overwhelming.
7:52
But 4.7 acted like senior legal counsel.
7:55
It tiered the risks by severity.
7:57
Tier one for securities exposure.
7:59
Tier two for marketing misstatements.
8:01
It explicitly named consequences too.
8:03
Right. Warning the CEO about rescission and personal liability.
8:06
Is the difference here about having a better memory or having better reasoning?
8:10
Oh, it's definitely better reasoning.
8:12
Both models remembered the exact same facts.
8:15
But only 4.7 understood the hierarchy of those facts.
8:18
Right. They remembered the same facts.
8:19
But 4.7 actually understood how to prioritize them.
8:22
Yeah, it connects the dots across hundreds of pages.
8:25
Okay. So it handles text and code.
8:27
But anthropic claims 4.7 also fixed vision.
8:30
Let's look at the pixels.
8:31
Yeah, vision is tough.
8:32
We use two messy images.
8:34
A dense analytics dashboard with tiny numbers.
8:36
And a smudged whiteboard with color-coded arrows.
8:39
4.6 pulled the numbers into a table.
8:42
But it hid its mistakes completely.
8:44
Right. The retailer names were physically cropped out of the image.
8:47
So 4.6 just guessed.
8:49
It wrote A and S and pretended it was fine.
8:51
It hallucinated confidence.
8:53
Which is the worst trait an AI can have.
8:55
But 4.7 explicitly flagged that the labels were illegible.
8:59
It proposed a workaround.
9:01
It suggested labeling rows R1 to R8 instead.
9:05
And it caught a year-over-year error that 4.6 completely hallucinated right past.
9:10
You know, the true mark of intelligence is stating exactly what you cannot see.
9:14
Why did 4.6 try to hide the fact that it couldn't read the cropped names?
9:18
It's an alignment issue.
9:20
Older models mistakenly think that guessing looks more helpful than admitting failure.
9:24
It was prioritizing a complete looking answer over an honest partially incomplete one.
9:29
And Anthropic trained 4.7 to value honesty.
9:32
So 4.7 destroys 4.6.
9:34
But how does it stack up against the other heavyweights?
9:37
Right. Nobody works in a vacuum.
9:39
You've got Gemini 3.1 Pro and GPT 5.4 out there.
9:42
So what does this all mean for your wallet?
9:44
Let's look at the master matrix.
9:46
Use Claude Opus 4.7 for hard coding and deep math.
9:50
Basically tasks where a mistake is expensive.
9:52
Exactly. That's when you use the X-high effort setting.
9:55
And what about Gemini 3.1 Pro?
9:57
Use Gemini if you're dumping video, audio and documents into a single, massive, long, multi-modal session.
10:06
And GPT 5.4? Use that for raw speed.
10:08
Fast research and rapid creative brainstorming.
10:11
So 4.7 gave up ground on raw speed to win on accuracy and self-correction.
10:16
Yeah. It's a deliberate trade-off.
10:18
If 4.7 costs more tokens and is slower, is it still worth keeping as a daily driver?
10:23
It absolutely is. You just need to stick to default effort for simple tasks to save money.
10:28
Yes. But only if you stick to default settings for simple everyday tasks.
10:32
Right. You just have to manage it actively.
10:34
Let's sum up this deep dive. Claude Opus 4.7 isn't just a patch.
10:38
It's a massive return to form.
10:40
Editing discipline is back. Hallucinations are down.
10:43
And it actually pushes back on bad assumptions.
10:45
But remember, it costs more tokens.
10:48
So use that X-high effort setting strategically.
10:51
Yeah. Don't use it to summarize simple emails.
10:53
We saw in the SaaS test that 4.7 actively pushed back on a flawed business plan before executing it.
11:00
As these models get better at telling us we're wrong,
11:03
at what point do they transition from being tools we command to partners that actually manage us?
11:08
Think about that next time you hit send on a prompt.
11:11
It's a huge shift in the dynamic.
11:13
Thank you for joining us for this deep dive.