0:00
And Doug, there's nowhere I wouldn't go to help someone customize and save on car insurance with Liberty Mutual.
0:08
Even if it means sitting front row at a comedy show.
0:12
Hey everyone, check out this guy and his bird. What is this, your first date?
0:16
Oh no, we help people customize and save on car insurance with Liberty Mutual together. We're married.
0:20
Ah! You to a human, him to a bird?
0:22
Yeah, the bird looks out of your league anyways.
0:24
Only pay for what you need at Liberty Mutual.com.
0:27
Liberty, Liberty, Liberty, Liberty.
0:31
Capital One's tech team isn't just talking about multi-agentic AI.
0:35
They already deployed one.
0:37
It's called Chat Concierge.
0:39
And it's simplifying car shopping using self-reflection and layered reasoning with live API checks.
0:45
It doesn't just help buyers find a car they love.
0:47
It helps schedule a test drive, get pre-approved for financing,
0:51
and estimate trade-in value.
0:53
Advanced, intuitive, and deployed.
0:55
That's how they stack.
0:57
That's technology at Capital One.
0:59
An engineer at AMD recently analyzed roughly 7,000 AI coding sessions.
1:07
Yeah, that massive data set.
1:09
Right. And the results were honestly shocking.
1:11
The AI's reasoning depth suddenly dropped by 73%.
1:15
It just stopped thinking.
1:17
Exactly. It completely stopped reading files.
1:19
And well, it started breaking things.
1:21
Absolutely devastated developer workflows.
1:23
I mean, it became a reckless liability overnight.
1:25
Welcome to The Deep Dive.
1:27
Today, we're looking at Claude Opus 4.7.
1:29
The big question is whether it's a true upgrade or, you know,
1:33
just a bandaid for the massive complaints users had with 4.6.
1:37
Right. So we are walking through five brutal side-by-side tests.
1:41
We've got financial analysis, SaaS modeling, hard coding, legal reasoning, and vision.
1:45
And we'll see exactly where it wins and where it completely fails.
1:47
Plus how it stacks up against Gemini 3.1 Pro and GPT 5.4.
1:51
But to really understand why these tests matter, you have to look at the drama first.
1:55
Right. The drama that forced Anthropic to release 4.7.
1:59
Yeah. The fall of 4.6 was rough.
2:01
Editing files without reading them jumped from 6% to nearly 34%.
2:05
Wow. Users had to interrupt it 12 times more often.
2:09
It made up fake Git commit hashes.
2:11
Git commit hashes are unique IDs for saved code changes.
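As a minimal illustration of why those IDs are unique to the content they describe, here is how git derives the ID for a file's contents (a "blob"); a full commit hash is computed the same way over the commit object:

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the object ID git assigns to file content (a "blob").

    Git hashes a small header plus the raw bytes, so any change to the
    content yields a completely different 40-character hex ID.
    """
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# Identical content always maps to the same ID...
a = git_blob_id(b"hello\n")
# ...while a one-byte change produces an unrelated ID.
b = git_blob_id(b"hello!\n")
```

This is why a model can't plausibly "make up" a commit hash: a real hash is a checksum of actual repository content, and an invented one points at nothing.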
2:15
Right. And it referenced fake APIs.
2:18
Its accuracy on BridgeBench just completely plummeted.
2:21
I have to admit. Yeah.
2:23
But I still wrestle with prompt drift myself.
2:27
Watching a model confidently go rogue halfway through a task is incredibly frustrating.
2:31
It destroys your trust in the tool.
2:34
So 4.7 brought in some serious fixes.
2:37
Like the new effort level. Right. The X-high setting.
2:39
Yeah. It forces the model to compute longer.
2:42
And they added an ultra review command for a secondary review pass.
2:46
And the context window. The context window is the model's short-term memory during a chat.
2:51
They pushed it to 1 million tokens massive.
2:54
Whoa. Imagine stacking Lego blocks of data until you fit an entire company's history into one session.
3:01
But the catch is the new tokenizer.
3:03
It means it costs 1 to 1.35 times more tokens.
3:07
Right. It's more expensive.
3:09
Yeah. But biomolecular reasoning safety jumped from 30.9% to 74%.
3:14
So did anthropic actually build a smarter model?
3:18
Or just turn the safety knobs back to where they used to be?
3:21
Well, a jump that huge and a hard science proves it's foundational.
3:24
You can't just tweak safety dials to double accuracy.
3:27
So it's a real foundational upgrade, not just a quick settings patch.
3:31
Exactly. It's a real architectural shift.
3:33
Okay. Let's unpack this.
3:35
4.7 is truly smarter.
3:37
It should follow strict instructions without losing its mind.
3:40
Right. Let's look at the financial chart test.
3:42
We gave both models a 12 month Nvidia stock chart.
3:46
The prompt demanded exactly four numbered sentences.
3:49
Just history, key signal, hidden risk and concrete action.
3:53
Right. No fluff allowed.
3:55
And 4.6 completely ignored the formatting.
3:59
It wrote this panicked rambling paragraph instead.
4:02
But 4.7 followed the rules perfectly.
4:04
Four clean sentences.
4:06
But what's fascinating here is the actual insight it provided.
4:10
It noticed the 12 month chart was hiding a 95% gain.
4:14
It looked like a flat line.
4:15
Right. Which is a massive risk most retail traders miss entirely.
4:20
It even suggested a concrete 5% position sizing rule with weekly tranches.
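The transcript doesn't show the model's actual numbers, so as a hypothetical sketch of what a "5% position with weekly tranches" rule means in practice (the portfolio value and four-week split here are made up):

```python
# Hypothetical sketch of a capped position entered via equal weekly buys
# (dollar-cost averaging). position_pct and weeks are illustrative, not
# taken from the model's answer.

def tranche_plan(portfolio_value: float, position_pct: float = 0.05,
                 weeks: int = 4) -> list[float]:
    """Split a size-capped position into equal weekly tranches."""
    target = portfolio_value * position_pct   # total capital at risk, e.g. 5%
    per_week = target / weeks                 # equal slice per week
    return [round(per_week, 2) for _ in range(weeks)]

plan = tranche_plan(100_000)  # $100k portfolio -> $5,000 total position
```

Spreading the entry over several weeks limits the damage if the "hidden" run-up the model flagged reverses right after you buy.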
4:26
Why does formatting matter so much if the financial advice from both models was still decent?
4:32
Because skipping structural rules is a huge red flag.
4:35
It shows attention decay.
4:36
If it ignores simple constraints, you can't trust it on larger tasks.
4:40
Right. Sloppy formatting means the model isn't paying attention to your actual instructions.
4:45
Precisely. It's a foundational processing flaw.
4:47
So formatting is one thing.
4:49
But what happens when the logic in the prompt itself is fundamentally flawed?
4:53
Oh, this is the B2B sauce model test. It's totally a trap.
4:57
We asked for 12 months of projections.
5:00
Three pricing tiers, churn, marketing spend.
5:03
But the starting numbers were secretly broken.
5:05
Yeah, 4.6 fell right into it.
5:07
It built a beautifully polished spreadsheet immediately.
5:10
But it built it blindly based on bad math.
5:14
But 4.7 totally pumped the brakes.
5:15
It flagged four massive issues before writing a single formula.
5:19
Yeah. It pointed out the 150K cash would burn out by month four.
5:23
And it caught that net revenue retention was mathematically uncomputable.
5:26
Right. Because we didn't give it any expansion data.
5:30
It also noted that a 4% monthly churn equals a brutal 39% annual churn.
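The two checks described above are simple arithmetic, and they're worth seeing worked out. The $40k/month burn rate below is a hypothetical input standing in for the test's broken starting numbers; the churn compounding is exact:

```python
# Sanity checks of the kind the model is described as running before
# building the spreadsheet. The monthly burn figure is hypothetical.

def months_of_runway(cash: float, monthly_burn: float) -> float:
    """How many months until the cash balance hits zero."""
    return cash / monthly_burn

def annual_churn(monthly_churn: float) -> float:
    """Churn compounds: the surviving fraction after a year is (1 - m)^12."""
    return 1 - (1 - monthly_churn) ** 12

# $150k cash at a hypothetical $40k/month burn dies before month four.
runway = months_of_runway(150_000, 40_000)   # 3.75 months
# 4% monthly churn quietly means losing ~39% of customers per year.
yearly = annual_churn(0.04)
```

The churn result is the non-obvious one: 4% a month sounds mild, but compounding turns it into losing well over a third of the customer base annually.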
5:36
It's like hiring an accountant.
5:38
4.6 just files the bad paperwork.
5:41
Yeah, without saying a word.
5:43
4.7 stops you and says, hey, you're going bankrupt.
5:46
It's an incredible self-correction feature.
5:48
Does this pushback feature make 4.7 harder to use for quick simple tasks?
5:54
I mean, yeah, if you just want a quick template that hesitation adds friction.
5:58
But for business strategy, that friction is vital.
6:01
Got it. So it prioritizes business usability over just giving a fast, pretty answer.
6:05
Exactly. A fast, wrong answer is still wrong.
6:08
So we know it catches bad math.
6:10
But what about the hard coding redemption test?
6:13
This is what made 4.6 infamous.
6:15
Right. Legacy code is chaotic.
6:17
One wrong move breaks the whole app.
6:19
We ran an express API refactor test.
6:22
We asked it to add an endpoint and refactor middleware.
6:25
And we explicitly said, don't break existing routes.
6:27
Right. And it had to read the files before editing.
6:29
Well, 4.6 gave vague bullets.
6:31
It didn't name validation libraries.
6:33
No backward compatibility plan either.
6:35
Right. You couldn't run it safely without a dozen follow-up questions.
6:39
Here's where it gets really interesting.
6:41
4.7 wrote a PR style plan.
6:44
It independently chose Joi from the package file.
6:47
It handled backward compatibility with default subdocuments.
6:51
Default subdocuments are nested records filling in missing data automatically.
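This isn't the model's actual code, but the idea is easy to sketch: records stored before a schema change get their missing nested fields filled from defaults, so old data keeps working after the refactor. The field names below are hypothetical:

```python
# Sketch of the "default subdocument" pattern: merge defaults into a record
# recursively, never overwriting data that's already present. All field
# names here are illustrative.

DEFAULTS = {"notifications": {"email": True, "sms": False}}

def with_defaults(record: dict, defaults: dict) -> dict:
    """Fill in missing (possibly nested) keys without clobbering existing values."""
    merged = dict(defaults)
    for key, value in record.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_defaults(value, merged[key])  # recurse into subdocs
        else:
            merged[key] = value  # existing data wins over the default
    return merged

# An old record that predates the "email" field still reads correctly:
old_user = {"name": "Ada", "notifications": {"sms": True}}
user = with_defaults(old_user, DEFAULTS)
```

Applying defaults at read time like this is what lets existing imports and stored documents survive a schema change untouched.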
6:56
Exactly. It made sure existing imports wouldn't break.
6:59
Execution ready immediately.
7:01
It anticipated the blast radius of its changes across the whole system.
7:05
If I'm not a developer, why should I care how an AI writes an API endpoint?
7:09
Because it proves deep architectural foresight.
7:12
It maps out dependencies before making irreversible changes.
7:16
Because it proves the model now plans complex multi-step actions
7:20
before recklessly executing them.
7:22
Exactly. It thinks before it types.
7:24
Planning in short bursts is one thing.
7:26
How does this critical thinking hold up with a million token memory?
7:29
The massive context flood.
7:31
Right. We uploaded six PDFs, 180,000 words of due diligence.
7:35
Decks, legal term sheets, surveys.
7:38
The task was to find every legal risk and write a 300 word memo.
7:43
And 4.6 acted like a junior analyst.
7:46
It just dumped a flat list of risks by document.
7:49
Accurate, but totally overwhelming.
7:52
But 4.7 acted like senior legal counsel.
7:55
It tiered the risks by severity.
7:57
Tier one for securities exposure.
7:59
Tier two for marketing misstatements.
8:01
It explicitly named consequences too.
8:03
Right. Warning the CEO about rescission and personal liability.
8:06
Is the difference here about having a better memory or having better reasoning?
8:10
Oh, it's definitely better reasoning.
8:12
Both models remembered the exact same facts.
8:15
But only 4.7 understood the hierarchy of those facts.
8:18
Right. They remembered the same facts.
8:19
But 4.7 actually understood how to prioritize them.
8:22
Yeah, it connects the dots across hundreds of pages.
8:25
Okay. So it handles text and code.
8:27
But anthropic claims 4.7 also fixed vision.
8:30
Let's look at the pixels.
8:31
Yeah, vision is tough.
8:32
We use two messy images.
8:34
A dense analytics dashboard with tiny numbers.
8:36
And a smudged whiteboard with color-coded arrows.
8:39
4.6 pulled the numbers into a table.
8:42
But it hid its mistakes completely.
8:44
Right. The retailer names were physically cropped out of the image.
8:47
So 4.6 just guessed.
8:49
It wrote A and S and pretended it was fine.
8:51
It hallucinated confidence.
8:53
Which is the worst trait an AI can have.
8:55
But 4.7 explicitly flagged that the labels were illegible.
8:59
It proposed a workaround.
9:01
It suggested labeling rows R1 to R8 instead.
9:05
And it caught a year-over-year error that 4.6 completely hallucinated right past.
9:10
You know, the true mark of intelligence is stating exactly what you cannot see.
9:14
Why did 4.6 try to hide the fact that it couldn't read the cropped names?
9:18
It's an alignment issue.
9:20
Older models mistakenly think that guessing looks more helpful than admitting failure.
9:24
It was prioritizing a complete looking answer over an honest partially incomplete one.
9:29
And Anthropic trained 4.7 to value honesty.
9:32
So 4.7 destroys 4.6.
9:34
But how does it stack up against the other heavyweights?
9:37
Right. Nobody works in a vacuum.
9:39
You've got Gemini 3.1 Pro and GPT 5.4 out there.
9:42
So what does this all mean for your wallet?
9:44
Let's look at the master matrix.
9:46
Use Claude Opus 4.7 for hard coding and deep math.
9:50
Basically tasks where a mistake is expensive.
9:52
Exactly. That's when you use the X-high effort setting.
9:55
And what about Gemini 3.1 Pro?
9:57
Use Gemini if you're dumping video, audio and documents into a single, massive, long, multi-modal session.
10:06
And GPT 5.4? Use that for raw speed.
10:08
Fast research and rapid creative brainstorming.
10:11
So 4.7 gave up ground on raw speed to win on accuracy and self-correction.
10:16
Yeah. It's a deliberate trade-off.
10:18
If 4.7 costs more tokens and is slower, is it still worth keeping as a daily driver?
10:23
It absolutely is. You just need to stick to default effort for simple tasks to save money.
10:28
Yes. But only if you stick to default settings for simple everyday tasks.
10:32
Right. You just have to manage it actively.
10:34
Let's sum up this deep dive. Claude Opus 4.7 isn't just a patch.
10:38
It's a massive return to form.
10:40
Editing discipline is back. Hallucinations are down.
10:43
And it actually pushes back on bad assumptions.
10:45
But remember, it costs more tokens.
10:48
So use that X-high effort setting strategically.
10:51
Yeah. Don't use it to summarize simple emails.
10:53
We saw in the SaaS test that 4.7 actively pushed back on a flawed business plan before executing it.
11:00
As these models get better at telling us we're wrong,
11:03
at what point do they transition from being tools we command to partners that actually manage us?
11:08
Think about that next time you hit send on a prompt.
11:11
It's a huge shift in the dynamic.
11:13
Thank you for joining us for this deep dive.