
Beth Lyons and Andy Halliday open the show with a focused breakdown of GPT-5.4, framing it less as a universal leap and more as a strong advance in white-collar knowledge work and real-world task performance. Much of the conversation compares GPT-5.4 with Gemini 3.1 Pro Preview, Claude models, Codex, and other systems across benchmarks like GPT-Val, coding, long-context reasoning, hallucination resistance, and visual reasoning, with repeated emphasis that users still need to pick models based on the actual job to be done. Beth also shares a practical complaint about Gemini hallucinating around silent screen recordings and uses that to argue for a more dependable “colleague layer” in agentic systems. Later, Karl Yeh joins to talk through hands-on experience with GPT-5.4 in Codex, comparisons with Claude in Excel and Gemini in Sheets, and where the new release feels genuinely useful in day-to-day work.
Key Points Discussed
00:00:18 Welcome and setup for a GPT-5.4-focused episode
00:02:47 GPT-Val and white-collar knowledge work framing
00:08:51 Benchmark comparison across GPT-5.4, Claude, Gemini, and others
00:16:26 Gemini strengths in video and visual reasoning
00:18:05 Beth’s Gemini transcription / hallucination workflow example
00:23:54 “Then we’ll move to more news” and handoff to Karl Yeh
00:24:24 Karl Yeh on real-world use cases over benchmarks
00:55:30 Closing recommendations: try GPT-5.4, use Codex, newsletter and community plug
The Daily AI Show Co Hosts: Beth Lyons, Andy Halliday, Karl Yeh
Hey, good morning, everybody. It is Friday, March 6th. Yes, it's March. It seems weird.
Because it's March 2026, which is also weird. It seems like it was just yesterday that we were in 2025.
And today is, I think, show number 675. I'll double-check that, but I believe that's our number today.
My name is Beth Lyons. I am hosting today with me in the studio is Andy Halliday.
And how are things? Andy, what's going on in your world?
Well, I'm glad to be back. I had a wonderful 10 days of skiing in Canada with, you know, long ago friends from the Canadian ski scene.
And I got to tell you, you know, you go away and don't pay attention for a week, and, just as a warning to the Daily AI Show listeners, it's going to be really hard to catch up.
There are big splashes every day. And of course today's big splash is the release of GPT-5.4.
And so I look forward to kind of laying out, you know, just how that compares across a number of measures
against all the other, you know, very competent models out there.
And, you know, teaser here. It wins in a couple of important ways.
But there are other places where Gemini 3.1 Pro Preview is still the leader, dramatically so; we'll show some graphics about that.
So it's interesting. Yes. We've been coming to this point over several months, maybe even almost a year, where there are still measurable differences on the benchmarks.
Right: this one got 86.8% versus 84.2% for that other one.
They may be statistically significant, but are they experientially significant, right? We keep coming back to that.
Now Karl, who was playing with it yesterday, was definitely a "yes, this is experientially significant for me, go play with it." So yeah, Andy, let's get us kicked off.
Sure. Okay. I'm going to put it first in the context of where, you know, GPT 5.4 has been tuned and focused and is definitely advancing the state of the art.
And that is in real white collar knowledge worker job execution.
So this is where it really shines. For context, there's this big company out there called Palantir, and its renegade CEO, Alex Karp, was in some kind of presentation to people in the tech industry.
And he told the attendees that the companies that are building AI can't simultaneously eliminate white-collar jobs and distance themselves from the US military, as Anthropic is trying to do, and expect that there won't be consequences.
He said, quote, "If Silicon Valley believes we are going to take away everyone's white collar job and you're going to screw the military, if you don't think that's going to lead to nationalization of our technology, you're retarded."
That was his comment. That's the kind of bold way he phrases these things. And then Elon Musk responded on X, replying to a post reporting what Karp had said: "Good point."
Okay, and all of this is happening just after Block, the company behind Square and its payment system, announced plans to cut 4,000 employees, which is nearly half of their employees.
And he said the decision was because AI-enabled productivity gains make it so that, quote, "intelligence tools have transformed what it means to build and run a company."
And we've been touting that ourselves: the eventuality of $1 billion operating companies that are driven by one person along with AI.
And that's the impact of it all. Okay, so how does this relate to GPT 5.4? The way it relates is that there is a benchmark test that OpenAI built expressly to measure what the AI is accomplishing compared to human experts in a wide range of industries, and it's called GPT-Val.
Imagine that: since they built the benchmark, they're building their models against that benchmark, and they're at the top of that benchmark.
They do surprisingly well on the test that they built. That's right. What a surprise.
But that test, I think it's worth discussing a little bit what it really is and how it's graded, so that you realize it's not something that could easily be gamed. OpenAI can't game the test, and everybody else can take the test too.
So the way it works is, they work with a large number of experienced professionals across a wide range of industries. Let me read them off:
finance and insurance, retail trade, wholesale trade, real estate and rental and leasing, government, manufacturing, professional, scientific, and technical services, health care and social assistance, and information (audio and video technicians, producers, directors, news analysts, reporters).
And under each one of those, they have individual experts who helped create a set of tasks that are representative of what they do.
So imagine, as a consultant in AI, you go into a company and try to evaluate what the business processes are, you define the task, and then you try to automate it.
Well, GPT-Val (that's GPT-Val, not GDP) has done exactly that.
And all of these example problems on the test are real-world tasks that have to be accomplished by people, as developed by the experts in each of these industries.
Now, the tasks went through rounds and rounds of expert review, and they now have a fully reviewed set of complex tasks for each industry.
Now, Ethan Mollick said about this test that it is the most important and relevant test of AI when it comes to doing real application work with it.
And he also said that in head-to-head competition with human experts on tasks that require four to eight hours of a human's time, GPT-5.2 (not GPT-5.3, and not GPT-5.4, but GPT-5.2) won 71% of the time, as judged by other humans.
And now, just a few months later, GPT-5.4 exceeds the performance of human professionals on those tasks 83% of the time.
Now, that's what's happening in GPT-Val, and it's really quite remarkable.
But it's not the only benchmark, so I'm going to show you graphically where the other players are.
And these aren't exactly normalized, I don't think, with respect to the y-axis here.
Let me get there. I'm fumbling a little bit. Let's go to this window here. Okay.
Now you're looking at graphs. On the top left, this is GPT-Val from artificialanalysis.ai, which uses that test and actually allows both AI and human evaluators on the Artificial Analysis platform to show where each model performs relative to the other models.
So here, instead of the Elo numbers that the actual test produces, where GPT 5.4 wins, this shows a percentage of tasks.
So here we have GPT-Val AA agentic real-world tasks. And you see GPT 5.4 is at 58% while Sonnet 4.6 is at 57%.
So it didn't really blow Sonnet away. It's interesting that Opus, the longer-reasoning model with reasoning set to max, is right behind Sonnet 4.6.
On agentic real-world tasks, Anthropic actually holds three positions: Sonnet 4.6 at 57%, Opus 4.6 without max thinking right behind it, and then at 55%, just a couple of points back, Opus 4.6 max. So GPT 5.4 got a one-percentage-point advantage in this set of evaluations.
But now I'm going to scroll down and see where other tests show that GPT 5.4 really hasn't blazed forward and broken new ground.
When it comes to coding, on Terminal-Bench Hard over here on the right, you see GPT 5.4 moved past the 53% and 54% performances of Sonnet 4.6 and Gemini 3.1 Pro Preview, bumping up four more points to 58%, above Gemini 3.1 Pro Preview.
Okay, I'm going to just point out, on this next measure, agentic tool use, that GLM 5 and Kimi K2.5, a couple of open models from Z.ai and Moonshot AI, if I'm not mistaken, are at the top.
And Gemini 3.1 Pro Preview is really good here too; you have to go all the way back to the middle of the pack to find GPT 5.4 on agentic tool use.
So it doesn't do that well along all dimensions of testing AI models. It is near the lead here, just behind 5.3-codex, in long-context reasoning.
And, you know, Anthropic has recently come out with million-token context window performances, and it's back here at the 71% level.
Whereas GPT 5.4 gets 74%. Not really huge differences: if you think about what percentage increase it is from 71 to 74, it's around four percent. It's not a bunch.
Here's another one, AA Omniscience, which is about how much knowledge the model has. Gemini 3.1 Pro Preview is well ahead at 55%, with GPT 5.4 at 50%. That's five points, again not a huge relative increase, but well ahead; and 46% for Opus 4.6 max.
Say what that measure was again? Omniscience is about how accurate the responses are in the context of knowledge: if the model is giving you factual responses, how often does a response reflect the real-world factual basis behind it?
And you can see, you can go all the way down to some very small models that are performing pretty poorly, at 16% for Nvidia Nemotron.
What's interesting is Claude 4.5 Haiku, right? That's 4.5, not 4.6, remember. There are some really low performances on omniscience.
So if you're asking pertinent questions, you'll want to make sure that your Perplexity is using one of the models up at the top end of the AA Omniscience accuracy test.
If you're asking Haiku omniscience questions, you've misunderstood your assignment and the tools that have been presented to you.
That's right. This comparison, showing all the different performances across models on different sorts of task assignments, reflects an important point: you need to pick the right model for the job that you're doing, not just take the latest and greatest performer on one thing.
If you're choosing in the context of work performance, you can be sure that the ones at the top of the GPT-Val charts are really good choices.
Now look at Humanity's Last Exam, which is reasoning and knowledge. Here Gemini 3.1 Pro Preview is well ahead of the pack at 45%.
GPT 5.4 is nestled in at 42%, codex is at 40%, and then Claude Opus 4.6 on max reasoning is at 37%.
So that's impressive, right? It's impressive that 5.4 has sidled up to where Gemini 3.1 Pro Preview has been for a while.
In my mind, I keep thinking about the frontier model competition being between OpenAI and Anthropic.
But wait: Gemini 3.1 Pro Preview, on reasoning and knowledge, and on other things like SciCode right here, is pretty high above, at 59% versus Opus 4.6 max at 52%.
Even before GPT 5.4 came out, Gemini was already blasting well past the leaders on that measure.
Here's physics reasoning, and on that one, a very impressive performance: it passed a bunch of others, including Gemini 3.1 Pro Preview, to come out with the top number, 20%, on physics reasoning,
while Claude 4.6 Opus comes out at only 13%. So that's a big jump.
And that black line, sorry, it's all very tiny for my eyes because I'm looking at a laptop screen, that black line is GPT 5.4.
Who's leading in physics reasoning? On physics reasoning, 5.4 is now the state of the art.
And will you pop the link to that in the chat? Because Garrett said he would love to look at that chart.
Artificial Analysis is really the go-to if you want to compare models. Check that out. Okay, I'm going to stop sharing here.
So I have something that I want to say about Gemini, though, from a user standpoint, because it does so well on so many of these benchmarks.
It has unique capabilities, although DeepSeek may match these when it eventually drops. But at the moment, Gemini is the only thing you can talk to about actual video; it isn't just screenshotting it and processing a bunch of screenshots.
Gemini can really analyze the video, watch the video with you, and say something about it.
On visual reasoning, Gemini 3.1 pro preview is the top of the charts at 82%.
Gemini 3 flash is ahead of GPT 5.4 extra high.
That is interesting.
Because the prices between those two are very different. Don't overlook Gemini Flash.
I will say, from my user standpoint, however, that Gemini is still, personality-wise, unreliable for me.
So my workflow, and I have said this on the show before many times, is that when I find something on X while I'm on my phone, I just screen record the scroll.
I give it to Gemini to transcribe.
And so now I have the text, right? Much easier than trying to copy text on a phone or trying to get something out of X, and I get the comments that I'm interested in too.
And I had misunderstood how to screen record with sound. Also, a helpful tip if you did not know this: to screen record with a voiceover, you have to hold the record button on your iPhone and it will pop up a little option asking whether you'd like the mic on as well. I did not know that, so I thought I had given a voiceover commenting on the post that I wanted to talk about on the show.
Gemini came back with the transcription, and it had also given me a full "transcription" of my commentary for that video, which was not at all what I said.
And I was like, go back to what I said, right? Go back to what I said and give me the direct quote that is telling you that that's what I said.
And it gave me, again, a completely hallucinated direct quote.
And I said, okay.
So it sounds like this video is completely silent. Is that true?
And right then we get an, "Oh, you got me. It's true. I was just making stuff up." And then it did it again in that same conversation.
"You got me, it's totally true. Yep, it was silent, and I made all of this stuff up. Here's what you really meant when you said... but when you watch the video..."
And I'm like, if it's silent, how do you know what I mean?
Right?
Like that kind of interaction is still a problem for me.
And when it gets lazy, when it's like, "I gave you the first four comments, and then I gave you the two at the end; you didn't need the whole thing, did you?"
That makes it an unreliable tool for me.
Yeah, I want to show you one other thing here on this very point you're discussing, which has to do with the reliability of a model when it comes to hallucinations.
This is, again, on Artificial Analysis. This is the Omniscience Index.
On the Omniscience Index, higher is better; it measures knowledge reliability and hallucination.
It rewards correct answers (that's the test where Gemini 3.1 Pro Preview is the leader), and it penalizes hallucinations, with no penalty if the model says "I don't know" and refuses to answer.
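To make that scoring rule concrete, here is a minimal sketch of how an index like that could be computed from a set of graded answers. The three-way split (correct, hallucinated, abstained) comes straight from the description above, but the equal +1/-1 weighting is an assumption for illustration, not Artificial Analysis's published formula.

```python
# Hypothetical sketch of an omniscience-style index: reward correct answers,
# penalize hallucinations, and treat "I don't know" as neutral.
# The +1/-1 weights are an assumption, not Artificial Analysis's actual spec.
def omniscience_index(graded_answers):
    """graded_answers: list of 'correct', 'hallucinated', or 'abstained'."""
    score = 0
    for grade in graded_answers:
        if grade == "correct":
            score += 1      # reward a factual answer
        elif grade == "hallucinated":
            score -= 1      # penalize a confident wrong answer
        # 'abstained' adds nothing: no penalty for saying "I don't know"
    return 100 * score / len(graded_answers)

# Example: 40% correct, 30% hallucinated, 30% abstained -> index of 10.0
grades = ["correct"] * 4 + ["hallucinated"] * 3 + ["abstained"] * 3
print(omniscience_index(grades))
```

Under a rule like this, a model that answers less but hallucinates less can outscore one that answers everything, which is consistent with the chart described next.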
And look at how Gemini 3.1 Pro Preview exceeds the performance of Opus 4.6 max and Sonnet, and how Gemini 3 Flash (Gemini 3, not even 3.1) is pretty well at parity with Sonnet 4.6.
So you can go to Gemini with a very inexpensive model and get a very reliable knowledge-based response from Flash, with a very low hallucination rate.
And here you see GPT 5.4 on extra-high reasoning only gets six points on this index, compared to 33 for Gemini 3.1 Pro Preview.
So when you... go ahead.
Yeah, and I want to say hi, Karl. Thank you. But also, I find, particularly as we move into the agentic era, with models being able to do more and more work on their own, that I need the model to work with me.
Like, you're getting a transcript, but you're mis-transcribing something; the fact that every time I say "Claude" you think I said "cloud," we can work with that.
That's okay, I've modified my setup for that. But for big things, I need you to be able to give me back the information that I'm missing.
Because as we move into agentic work, I don't want to be, and ultimately can't be, the one who's double-checking all of the things. If I've asked you to transcribe my voiceover
and you don't hear one, that's information to give back to me. Right?
And that was part of the conversation too. Like, at what point in my instruction did you decide "she must mean that if there is no voiceover, I should pretend there was one and invent what she might have said"?
Where is that in the instruction, and why isn't the default, "hey, I don't hear a voiceover, point me to it"? Right? Which is what Claude would say.
And I am doing work on this agentic layer, what I call the colleague layer: a compound-learning thing that I am doing with the way I work with the agents, specifically because I think these things are going to be more and more necessary.
Like, "I know, Beth, when you say this, this is what you mean," right? Because we've worked together, and now it's getting shorthanded.
And I will talk more and more about that as we go on. One thing I want to say, and then we'll move to more news, and maybe Karl will share what he has been doing with the GPT-5.4 drop.
Yesterday was women's college awareness day, something like that, and I missed the day. I went to Hollins; that's what this is, the green and the gold. I have the green and the gold on. Thank you very much, shout out to Hollins, where I learned so much. All right.
Karl, what have you got?
What's happening?
How are you doing today?
So one thing I would like to point out, and I think I caught everything on the benchmarks discussion:
I think everybody knows this, but for those who don't, you've got to test it on your own use cases, because those benchmarks are useless unless you can leverage the model for whatever you need to do.
You're going to get people who are like, this one fits me, it does the work I do because it does this, versus this one works better for me because it does that.
And there was an article a while back, I can't remember by whom, saying that the Chinese models were actually terrible at real-world use cases, just not on the benchmarks: because they were targeting the benchmarks, the models performed well against benchmarks.
But when you put them into actual real-world tests, not simulated ones, they didn't perform as well. So you've still got to be extremely careful about what you read into that; not that the benchmarks are not important to look at.
But I think now it's getting to the point where you've got to know how to run it yourself. And as Andy said, pick the right model for your task, based on all the context, all the data, everything related to what you personally or your business is doing. Having seen it company to company to company, different models just work differently in that context.
Some work way better than I expected, some didn't; somewhere Gemini was better, somewhere Claude was better, somewhere ChatGPT was consistently better. So it's a very, very interesting place we're in now. What I did find for me, and I don't know if any of you work in Codex, especially the desktop app: I think the best part of this release has been 5.4 in Codex.
I can pretty much say it does 90, 95% of everything that I want Claude Code to do. And now the UI... for a CLI person it doesn't matter, but if you wanted to run it like a Cowork, it can.
And you actually don't get rate-limited as much in Codex as you do in Cowork. I think that's something you've got to consider too; you get rate-limited quite a bit in Cowork, or your conversations get compacted.
But that's what I've been finding out: it's been awesome to work with, and they have a fast mode in Codex too. It's not as fast as Spark, but 5.4 in Codex is pretty darn good. I just started a couple of mobile apps that I was trying to build in there, so that was pretty good. And even non-technical tasks.
Obviously Cowork has a better UI overall. And then I started using it in Excel, because that's a big one for our clients.
It kind of sucked for one of them, because they bought a bunch of Claude licenses, and now they're like, oh crap, we've got this too. When they were testing it out on the enterprise side, it was pretty good from a charts-and-graphs perspective.
Claude is still significantly better there, but for everything else it's on par. So the question is, do you still need it? The only reason this company bought Claude is because it was so good in Excel; it was really, really good.
Now you have this right there. And if it's coming to Sheets as-is, that's a pretty significant thing. So anyway, that was just my experience with 5.4.
Let me ask you about Sheets. I use Google over Excel; I don't even own an Excel or Office 365 subscription anymore. And you can move back and forth between the Sheets and Excel formats pretty easily.
Is Claude in Excel now matched by something from Codex? Does OpenAI have a similar thing integrated inside Excel with the same level of capabilities?
Yeah, it's literally side by side; I can show you what that actually looks like. One thing, from when I was testing... let's see here.
When I was testing Gemini in Sheets versus Claude in Excel, Gemini is nowhere near, which is very odd, because I would think that in Sheets, Gemini would do a better job.
But in terms of really complex cross-tab analysis, it's weird; it's just not as good.
So when you're in Sheets, there's a Gemini panel on the right, and I've not had much success getting Gemini to do anything meaningful in Sheets.
So the workaround I would use there is to export it as an Excel document, use Claude in Excel, and then just re-import the Excel file.
That's a pain in the neck for a human to do. It's like, people are either in the Microsoft environment or the Google environment, and for the Google folks it's better to just use Gemini, because the time savings you get are kind of offset by the pain in the neck of converting to CSV and going back and forth.
And even then, CSVs don't have tabs; it's just a pain. But now, if you have this, it's actually not too bad.
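If you wanted to script that export-and-reimport round-trip rather than clicking through it, a minimal sketch with pandas might look like the following. The file names are made up, and it assumes the sheet was exported as .xlsx, which, unlike a CSV, preserves multiple tabs (reading and writing .xlsx also requires the openpyxl package).

```python
# Sketch of the Sheets -> Excel -> Sheets round-trip Karl describes.
# Assumes the Google Sheet was downloaded as report.xlsx; names are invented.
import pandas as pd

# Read every tab of the exported workbook (a CSV export would lose the tabs).
tabs = pd.read_excel("report.xlsx", sheet_name=None)  # dict: tab name -> DataFrame

# ...work on the file in Excel (e.g. with Claude), then write all tabs back out.
with pd.ExcelWriter("report_processed.xlsx") as writer:
    for name, df in tabs.items():
        df.to_excel(writer, sheet_name=name, index=False)

# Re-importing to Google Sheets is then File > Import in the Sheets UI.
```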
By the way, am I sharing my screen? Sorry.
No. Can you add it to stage? Do you have an "add to stage"?
I thought I was already presenting. Which I was.
Sorry. Nope, you've got to say, "Beth, put it on the stage."
Okay.
Um, there we go. Yep, there we go.
Now, you've got it here, right? So you've got essentially both Claude and ChatGPT now.
So I'm going to do one which is actually creating a dashboard. So I'm going to run this; in Claude I'm going to put it on Opus,
and then in ChatGPT, I'm going to put that on Heavy.
We may have to come back to this, because I've never tried Heavy before and who knows how long this will take.
Okay.
Anyway, so this, again... what are you using? So you're in...?
Yeah, so I'm in Excel. So this is... I've stripped essentially all the private information, and I've also reconfigured the data. So this is a multi-entity deck a company sent me... I mean, a master sheet for a specific project.
And you can see it has multiple business units consolidated. It is a pretty complex spreadsheet, right? That's the whole point I was getting at.
So the whole point here is: using the consolidated financials tab, build me a five-year DCF model.
So essentially it's supposed to build a brand-new financial model based on the sheet.
And so that's... yes, always allow.
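For anyone who has not built one, this is roughly the calculation a five-year DCF boils down to: discount five years of projected free cash flow at a discount rate, then add a terminal value. A minimal sketch follows, with every number invented for illustration; none of these figures come from Karl's spreadsheet.

```python
# Minimal sketch of a five-year DCF with a Gordon-growth terminal value.
# All inputs are invented for illustration.
fcf = [120.0, 132.0, 145.0, 158.0, 171.0]  # projected free cash flow, years 1-5
wacc = 0.10                                 # discount rate (WACC)
g = 0.02                                    # perpetual growth rate after year 5

# Present value of each forecast year: FCF_t / (1 + WACC)^t
pv_fcf = sum(cf / (1 + wacc) ** t for t, cf in enumerate(fcf, start=1))

# Terminal value at end of year 5, discounted back to today
terminal_value = fcf[-1] * (1 + g) / (wacc - g)
pv_terminal = terminal_value / (1 + wacc) ** len(fcf)

enterprise_value = pv_fcf + pv_terminal
print(f"Enterprise value: {enterprise_value:,.1f}")
```

The real work in the demo is the part the model does, pulling the projections out of the consolidated tab and laying the model out in the sheet; the arithmetic itself is the easy part.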
And so in this case it's building... you know what I should have done?
I actually want to see how they interact with each other, because each is supposed to build a new dashboard.
So I'm curious to see if they'll build two dashboards, or whether they'll screw each other up.
Yeah. So as you can see there... oh, this is great.
You can see the ChatGPT version, and then you'll see the Claude version, and you'll see them be built at the same time.
So you'll kind of see what they actually look like in practice.
But regardless of which one you use: if you step back a couple of paces and think about it from a financial and accounting perspective, it would be insane for any company not to have one of these tools now.
Just the amount of work that you can do in here, faster. The only thing it can't really do yet is work with comparisons.
So that's it. It's interesting what it will do. So anyway.
What I do notice between ChatGPT and Claude is that, from a design perspective, Claude still does so much better.
The graphs and the charts and how it's laid out are better than ChatGPT's. I think they're trying to address that.
But just from the two, you can kind of start seeing where each is coming from and how they're presenting the information.
I would imagine both will be pretty accurate. Though, you know, in real use we'd probably prompt it a little more than that. But yeah, that's it.
So Garth is saying, yeah, Claude is much more thoughtful.
And it makes me wonder if that level of design is then solved once you work something out with Claude, get it where you want, and start using it as an example.
Not in Excel, but in Codex, what I find is that with 5.4 I can definitely notice a difference versus Codex 5.3.
When I'm working with it in ChatGPT, when you say that level of thoughtfulness... I don't know, I feel both of them are thoughtful.
For the things I do, I've noticed significant improvement in ChatGPT, to the point where I'm like, yeah, both of these kind of get what I do.
And a lot of the time I do both; I prompt in both places.
I just hadn't been able to use a lot of the agent features in ChatGPT until now, mostly in Codex, where I'm like, OK, this actually kind of gets what I need done and applies it.
So, yeah, I'm just curious to see. And it's doing... it's cleaning up the charts.
But this one, as you can see, gives you the charts, right? We'll see what it does when it finishes, how they want to present it and whatever else.
You can kind of see here the different ways they present the information, but it's not done yet, so I have no clue what the final product for either will be.
But again, this is a pretty significant thing. It would take me probably multiple days, if not a week, to figure out how to build a five-year DCF model. This is not my area.
But for people who do this on a regular basis... I showed them heat mapping and stuff on their data, and they're like, do you know how long it takes us to do that?
And I'm like, I don't know how long it takes you. And it just did it in however long that took, right? So I'm like, sure.
I feel like we're looking at the difference between, like, 2026 and 2002, or maybe 1995, because that one looks very much like old Windows, right?
Yeah, and the new one is a little more mature.
If I put in the context... this one didn't need the extra context; that one would. So if I'm like, hey, make it more visual... if I do that... OK. Now, what is really cool about this:
So do you notice here... wait a second. What am I on? OK.
In Claude, do you notice there is no undo? In Claude, once it has done something, it's done. And this is the big caution I put to anybody who's working in this: so many companies have a live Excel spreadsheet that is the source of truth for everybody.
And it's live, and some people have access, some people don't.
If you run Claude on it, it makes calculations, and you don't know what it did. There's no way to go back; I haven't seen any button that says go back.
It's nicer in ChatGPT: it has an undo. It's like, oh, I just undo.
That, to me... like I said, you really need to duplicate that spreadsheet and work on the copy. Do not let Claude do whatever it wants to the original. It'll do stuff to it that...
Yeah.
Anyway, so this is the final.
Well, that's what revisions are for, right?
But what you're saying is that Claude does it so quickly that there's no new rev; it's not saving in between.
But I'm curious: if you gave it, as part of the instruction, "I need you to save at each mile marker of changes," then you would be able to use the revision rollback.
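Beth's mile-marker idea is easy to approximate outside the spreadsheet tools today: snapshot the file before each agent pass, so there is always something to roll back to. A minimal sketch, with made-up paths; this is a generic workaround, not a built-in feature of Claude or ChatGPT in Excel.

```python
# Sketch of the "save at each mile marker" idea: snapshot the workbook
# before each agent edit so there is always a version to roll back to.
import shutil
import time
from pathlib import Path

def checkpoint(workbook: str, checkpoint_dir: str = "checkpoints") -> Path:
    """Copy the workbook into checkpoint_dir under a timestamped name."""
    src = Path(workbook)
    dest_dir = Path(checkpoint_dir)
    dest_dir.mkdir(exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 preserves file metadata
    return dest

# Before letting an agent touch the live file (path is made up):
# checkpoint("master_sheet.xlsx")  # -> checkpoints/master_sheet-20260306-101500.xlsx
```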
You have Claude and ChatGPT Codex in here, right? If they're both working, how do you keep those things from colliding? What's going on here with these individual AIs making changes to your sheet?
So with ChatGPT, I was kind of wondering what would happen. It created its own tab, a DCF tab.
Claude had its own tab, the dashboard tab.
So these two essentially each did their own thing. Okay.
But what if you asked ChatGPT to make changes to all the source tabs?
Look, actually, you know what? Let's do that. Let us do that. Okay. So here's one that will be hilarious. Okay.
So I'm going to put in this one: add an EBITDA row to each BU.
And thank goodness this is a copy. Not again...
So each business unit will now have an EBITDA row, right? So I'm going to put that on both.
It's still working on that DCF, eh.
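As an aside, the row being requested is a simple derived line: EBITDA is typically operating income with depreciation and amortization added back. A minimal sketch with invented numbers, not figures from the demo workbook:

```python
# Sketch of what "add an EBITDA row to each BU" computes per business unit.
# EBITDA = operating income + depreciation + amortization. Numbers invented.
business_units = {
    "BU North": {"operating_income": 80.0, "depreciation": 12.0, "amortization": 3.0},
    "BU South": {"operating_income": 65.0, "depreciation": 9.0, "amortization": 2.5},
}

for name, fin in business_units.items():
    fin["ebitda"] = fin["operating_income"] + fin["depreciation"] + fin["amortization"]
    print(f"{name}: EBITDA = {fin['ebitda']:.1f}")
```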
Okay. So what's interesting here is, you know how it built out that form? Oh, I asked it to... I'm going to stop it. I don't care about this anymore.
Okay. So I'm just going to make it standard, so I can do this. So: "add an EBITDA row." And then: "and add an EBITDA row." I don't know what it will do.
And they don't know what you will do. Well, let's look.
Do you see Garrett's comment? "Make it break, make it break."
Can it do this? Does it flow?
You know what we can do? We can add another one. We can add Copilot agent mode and ask it to do the same thing.
So... yes. So now we have three of them all operating at the same time, doing the exact same thing.
So there was a discussion on X from a couple of different people, and one of them was associated with Andreessen Horowitz.
And they were asking why there aren't more tech Twitch streams, right? Because there are a couple of tech Twitch streams, but there aren't really many.
And this, I think, is the tech Twitch stream, Karl. But, but wait...
What else can we do? Is it able to do this next piece?
I don't know what is happening. I see there's an EBITDA row, right? I just don't know what tool made it happen.
"Inserting the EBITDA row." I'm like, okay, who's inserting? What are you inserting?
And this is... yeah, I don't know what's happening. I don't know what to do.
And what's the difference between these two, rows 22 and 44? I have no clue.
I like Agent Mode: "Hey, it looks like there is an EBITDA row. Let me check if it exists." It's like, wait a second, there already is one. Yeah.
So this is the disadvantage of doing all three at the same time.
Thank you for going on this journey with us. We do not recommend trying this at home: if you would like to know what any of them is doing, don't run three at the same time. That's great.
See: "I see all of them already have one. Let me check it."
So I wonder what this will do. Will it do anything?
Wait: "I need to add an EBITDA after that. Let me check if there's space."
So maybe it'll add another one. They're going to keep correcting each other. This is... oh, well.
All right. So the technical difference between those two rows...
We lost Karl.
Well, he pressed the wrong button on that one. The agents booted him out of the spreadsheet. "Karl? Absolutely not."
I think you're right, Andy. That, as many of you know, is what's known as too many tabs open: you click the wrong tab and it exits you out of the program you were in.
Yeah. Okay.
Well, we've successfully arranged a demonstration of the incorporation of GPT-5.4 into Excel, I think.
Yeah.
Right.
Yeah.
Okay.
That was cool.
I like that.
There are a bunch of things that dropped that I'm interested in, but Google Workspace CLI tools dropped as a repo a couple of days ago.
What that means is that you no longer need to go through MCP; you can actually work with the tools directly from your command line.
And I'm not sure how that works with an IDE, but I think it must work with that too.
Karl! Did it break?
Yeah, you broke our system.
I imagine the agents got together and were like, "You're making us look bad, Karl."
No... yeah. All three of them just went, hey, wait a minute. All three of them rebelled. It's like, "I'm not doing this work anymore." They kicked Karl.
That's what you get.
If you make them fight, they will crash your computer.
Thanks, guys.
Oh, that's amazing.
Okay.
Cool.
I was mentioning while you were gone: the Google Workspace apps repo dropped, so you can use the Workspace apps directly.
I wonder if that's a way to get an agent to do some of the things you were doing in the actual desktop app, but from a command-line interface.
Like, could I have Claude do something with... or, sorry, I realize it's a Google repo; I would likely have Claude direct the repo, which is why I'm saying that.
But let's just say: could Gemini in the CLI do some of what you were doing, through code and commands, and not actually have the visual until it's done, or maybe have the visual while it's working on it, but without using computer use or a browser plugin to do it?
I'm lost now. When you mention a repo, are you talking about a GitHub replacement that Google's providing?
Yes, Google's: the Google Workspace CLI repo. It dropped two days ago.
It's one command-line tool for Drive, Gmail, Calendar, Sheets, Docs, Chat, admin, and more, dynamically built from Google's discovery service, and it includes AI agent skills.
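Google's Discovery Service is a real, public catalog of its API surfaces, and "dynamically built" presumably means generating clients from it the way google-api-python-client does. Here is a minimal sketch of that pattern, assuming you already have OAuth credentials in `creds` (obtaining them is out of scope here); this illustrates the underlying mechanism and is not code from the repo itself.

```python
# Sketch of building a Workspace client dynamically from Google's Discovery
# Service using google-api-python-client. Assumes `creds` holds valid OAuth
# credentials for the Drive scope.
from googleapiclient.discovery import build

def list_drive_files(creds, page_size=10):
    """List a few Drive files via a dynamically discovered client."""
    # build() fetches the Drive v3 discovery document and generates the client
    drive = build("drive", "v3", credentials=creds)
    resp = drive.files().list(pageSize=page_size, fields="files(id, name)").execute()
    return resp.get("files", [])
```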
Sorry.
I just moved to a different client and audience.
Because what y'all were doing was visually interesting, but now I'm wondering what could happen without watching it happen.
And Karl's like... I need to drink some water.
But the only thing that I know is, if you're in the Google environment, when I read this, I'm like, you probably want to use it.
The thing is, I haven't starred it or seen anyone use it, so I don't know anything about it yet. I'll do some tests this weekend.
I do have something completely different.
So: Meta is sued for privacy violations after workers reviewed private... yeah, intimate smart-glasses footage.
And I was like, yeah, it was inevitable that would happen.
Swedish investigators just revealed what Meta's privacy-by-design smart glasses actually do with your most intimate moments.
Workers in Nairobi reviewed footage that includes nudity, sex, undressing, toilet use. Of course you're going to get personal footage.
Wait, wait, wait, what? What the hell is somebody who's undressing or going to the toilet doing with their Ray-Ban Meta cameras on?
It's like they took the glasses off and set them down, and they're sitting there facing the toilet or the shower, right? And you didn't turn them off; you just took them off your face.
Could be. That's one possibility.
Oh, yeah: "I know a good use for my Ray-Ban Meta glasses. I have a new Tinder date tonight. I'm going to put my glasses on the nightstand, and she won't know."
Yes. It's just very weird.
Well, it's interesting, because this relates to what we tell our clients: hey, you've got to make sure your AI policies are regularly updated.
One of our clients told us, hey, there's somebody walking around here with Meta Ray-Ban glasses. For work? Like, no. So what should be put in our policy about that?
Because you can have people just walking around, clicking away, taking pictures and video of, forget personal stuff, your business accounts, you putting in your password, all those things people could be capturing.
So I'm not surprised by this.
Yeah, this is Meta; you're going to have to assume that.
I thought that risk was built in if you buy these glasses.
Yeah, your data is kind of protected, but only kind of, so I was like, you shouldn't be surprised. And what are you doing filming all your private stuff anyway?
Right. And that's the piece: if you bought the glasses, you opted into filming. But the glasses don't point at you; they point at what you're watching.
And nobody you're watching opted in, including, apparently, you, from the back of the toilet, or while you shower, or... I don't know, I'm not going there. That was not informed consent.
Well, there is a story about the guy who put out, "help me identify this woman," because she broke his Meta glasses on the train.
So he had an image of her, and he's like, "help me identify this woman."
And people were like, nope, that looks like a mask; I don't think anybody knows who she is.
And it seems like maybe you got what was coming to you for videoing people on the metro, on the train.
So Gwen asked a question that maybe somebody wearing the glasses can answer: do the cameras go off when you take them off your head?
I don't think they would have a sensor in there to identify whether you're just moving around or whether you've actually taken them off and they're suddenly still. I don't know.
The other thing is that you can replace the lenses in the Ray-Bans.
Apparently Karl brings up too much content for us; the censors have removed him again.
But I think you can replace the lenses with your prescription lenses, in which case you would actually have them on most of the time.
You wouldn't wear them in the shower, though, because that would brick them in a minute.
All right. Oh, yeah, we're all trying to add you back.
I think I pissed off all three. The three AIs are like, this guy's not getting back on that show.
No, no, we're not.
It's stopped me so many times. And it's still running, too. I'm like, yeah, I've got to turn that off, because I think maybe that was causing it. I don't know. I have a lot of tabs open, though. So, yeah.
One more thing... sorry, Andy.
No, no, I was going to say: unless you have something really fascinating to say, I suggest we wrap this one. But what did you want to talk about?
This is one thing, and I put it out yesterday... no, last week. It's from Futurism: the Pope implores priests to stop writing sermons using ChatGPT.
It's like, no one's immune, because I could just see it, right?
If you're a Catholic, you know the priest does homilies. So for what, 15, 20 minutes, they talk about the readings and what they mean for real life, in some cases.
And sometimes it's really boring. Most of the time it's really boring.
But now they don't actually have to think about those 20 minutes; they can just write it in ChatGPT.
And there was one time I was at a wedding, and I was like, you know, the structure of this sounds very, very familiar; I should count it on my fingers. I was like, was it written by ChatGPT, and are they just going through it?
I bet you it had em dashes in it too.
Well, I mentioned this yesterday, maybe on the show, maybe in another conversation: I'm seeing more and more posts that have not checked the models being referenced or the year being referenced.
So I'm seeing posts that clearly have gone through editing, and it's humanized, but it's saying "this is likely to happen in Q2 2025; we'll look for that."
And it's like, what? Did you need to have someone, or a humanizer, read that?
Or, you know, "the model Claude 3.7..." I don't think we can even get 3.7 anymore. Maybe we can.
But those are pieces that flag it for me now, even when I don't see any other tells of AI writing. And those are huge tells.
Okay. So: go use GPT-5.4, if you have not yet.
Karl's recommending downloading Codex, the desktop app, which I think means Mac only. Okay. Cool. Okay, I'm going to try it there.
I don't see Jeff in the chat, but Jeff uses Codex, loves Codex, and has just the Plus subscription, so you don't have to upgrade right now. Get some tests in and do things.
Tomorrow the conundrum episode will come out. Those are wonderful.
Right now you need to get those on Spotify, so go to Spotify. And if you're there on Spotify, go ahead and rate us with the stars and say that you really love our show, because we also really love our show and want you to love it too, and we want people to know how many people really love the show.
Sunday is the newsletter day, and those will come out; you can get them at thedailyaishow.com.
If you want to continue the chat and find the links to what people talked about today, go to dailyaishowcommunity.com, and we will be back again on Monday.
Thanks so much, everybody. Take care.

