
In this episode, we explore the new features and improvements in OpenAI's latest ChatGPT 5.4 model, highlighting its enhanced capabilities in coding, knowledge work, and professional applications. We also discuss practical use cases such as mid-response prompting and advanced online research, and compare it to previous versions and competing models.
Chapters
00:00 Introducing ChatGPT 5.4
01:16 Professional Capabilities and Scale
03:12 Benchmarks and Computer Use
08:04 Cool Features: Steerability & Research
10:25 Limitations and Regulatory Concerns
Links
OpenAI has just rolled out ChatGPT 5.4.
There are actually a couple of cool features in here that I'm really excited about, things I've been wishing ChatGPT could do, and they've finally launched them.
And of course, if you look at all of their marketing, it's basically them saying this is our most capable model yet.
Of course it's the most capable model; if it wasn't, what would they even be making an update for?
So I'm going to get past all of the hype and buzz from the launch and tell you some really interesting use cases and some ways I actually think GPT 5.4 is useful.
Before we get into all of that, if you want to try all of the latest models, go check out my startup, aibox.ai.
We have the latest models from the top 15 AI companies, everything from Grok to Gemini to Anthropic to OpenAI, ElevenLabs for audio, and tons of cool image generation models.
We have 50 models on the platform in total.
You can try all of them side by side, and it's only $8.99 a month, so much cheaper than ChatGPT, but you get way more models.
You can also use it to automatically create AI workflows that complete tasks for you.
There's a ton of cool stuff going on, so go check out aibox.ai if you want access to all of the top models for only $8.99 a month, and it's 20% off if you get an annual plan.
So there's a lot of cool stuff there.
All right, let's get into what's going on.
The first thing I want to mention here is that this is called GPT 5.4 Thinking.
There's a higher-performance variant known as GPT 5.4 Pro, but both of these together are designed to handle everything from complex analysis to a lot of coding and long-running workflows across different professional software tools.
They're dubbing this their professional work tool; they're trying to get it into the hands of more working professionals.
This comes right on the back of them signing a whole bunch of deals with different consulting firms that are allegedly going to get ChatGPT into more businesses and professional environments.
At the same time, they're locked in a battle over coding tools: Google is in this too, but it's really their Codex tool against Anthropic's Claude Code.
They're really trying to push forward in how software is built with AI models and how computer use is evolving.
So this is where they're really focusing.
One of the biggest changes here is the scale.
In the API, GPT 5.4 has a context window of up to a million tokens, which basically lets it work with huge documents, really long conversations, and big data sets.
If you think about it, a huge benefit is going to be coding, where it can look at bigger code bases to actually work with.
This was something Anthropic was really crushing, and OpenAI is trying to catch up.
OpenAI also says the model is specifically more token efficient, which is one thing I'm actually excited about: it can solve the same problems using a lot fewer tokens than GPT 5.2.
So your costs are going to come down, which is actually kind of cool if you already had 5.2 running in a piece of software; even if you don't, a lot of the software you use will see costs come down a lot.
It also gets a lot faster.
So costs come down and speed goes up, and for me, that's something I'm genuinely excited about.
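To make the token-efficiency point concrete, here's a tiny back-of-the-envelope sketch. The prices and token counts below are numbers I made up purely for illustration, not OpenAI's published pricing or measured usage:

```python
# Hypothetical illustration of token efficiency: same task, fewer tokens.
# The price and token counts below are invented for the example, not
# OpenAI's actual numbers.

PRICE_PER_1K_OUTPUT = 0.01  # hypothetical $ per 1,000 output tokens

def task_cost(output_tokens: int, price_per_1k: float = PRICE_PER_1K_OUTPUT) -> float:
    """Cost of a single response, given its output token count."""
    return output_tokens / 1000 * price_per_1k

# Suppose the old model needs 4,000 output tokens to solve a task,
# and the more token-efficient model needs 2,500 for the same answer.
old_cost = task_cost(4000)
new_cost = task_cost(2500)
savings = (old_cost - new_cost) / old_cost

print(f"old: ${old_cost:.3f}, new: ${new_cost:.3f}, saved {savings:.0%}")
```

Same answer, fewer tokens billed, so the per-task cost drops even if the per-token price stays the same.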
As far as the benchmarks go, I'm not going to sit here and nitpick the percentages, but I did want to talk about some interesting use cases and why these results matter.
Specifically, it's leading on a bunch of the better-known benchmarks.
One of those is coding, and of course we know why that's important right now, but another is computer use, and that's something I'm excited about.
I feel like Anthropic is really crushing computer use: the model can look at everything on your computer, go click on stuff, and get things done for you.
This is a use case I've been using a lot with Anthropic's Claude for Chrome browser extension.
Basically, it's a button you click that opens a side chat bar.
I go to really complex UIs or complex websites; I'm not a developer, but recently, for example, I had to do some stuff on Google Cloud to set up a tool I was vibe coding on Lovable, and I needed to beef up my back end so it could do some extra fancy stuff.
I didn't really understand anything Lovable was telling me I needed to do.
So I opened up the Claude sidebar, told it, look, I'm on my Google Cloud account, here are the instructions from Lovable, and it clicked around and set things up for me.
Now, should I have a real developer look over this?
We're going to throw caution to the wind for the time being, and I can hear all the developers screaming into their headphones right now.
But at the end of the day, it got it done, my software is now functioning, and I didn't have to watch a whole bunch of long YouTube tutorials on how to set up something complex (complex for me, anyway, because I have no idea how to do Google Cloud stuff).
So this is a really incredible use case for a lot of reasons, and I think OpenAI beefing up its computer-use capabilities is really exciting, because they're going to start competing more directly with Anthropic.
It's not like Anthropic is the only one working on this; OpenAI has been doing this for a long time with agents, but it feels like it's getting a lot better.
The other thing I'm excited about is that it's getting a lot better at knowledge work.
These are the kinds of things I think everybody uses it for, so we're going to see some meaningful incremental improvements here.
OpenAI's GDPVal benchmark checks tasks across 44 different occupations, so it's kind of showing you how different professionals can use this.
The model is beating or matching what an industry professional might give you in 83% of comparisons.
That's a really big jump from the roughly 71% GPT 5.2 was getting.
So going from GPT 5.2 to GPT 5.4, we jump from 71% to 83%, which basically means it's going to be a lot better for knowledge work; a 12-percentage-point jump is pretty significant.
On the coding benchmarks, specifically SWE-bench Pro (a software engineering benchmark), the model is slightly better than the last version.
That's good, but beyond getting slightly better, it's actually quite a bit faster.
If anybody has used these software tools (specifically, we use Claude Code at AI Box), my developer sends me screenshots of these really long, elaborate tasks it's doing on our back end, our code base.
I swear it's like a goal for him to see how long he can get Claude Code to run continuously without stopping on a project he gives it.
It's funny, because I'm vibe coding stuff on Lovable and usually get a response back in a minute or two, while he has it going for three and a half hours on a task.
So if this model is faster, I'm excited, because hopefully that three and a half hours gets cut down on some of the stuff we're working on.
I also think it's getting very good at real computer interaction.
There's a benchmark called OSWorld-Verified that basically evaluates how well an AI can operate a desktop environment: it pretty much takes a screenshot and then uses keyboard and mouse commands to go click on stuff.
Right now GPT 5.4 has about a 75% success rate.
I've used ChatGPT's agents; they're not perfect, they're not my go-to, and I don't use them that much, though I wish I could use them more.
I think Anthropic is doing better at this, but a 75% success rate means they are improving; it's better than GPT 5.2, even if I still don't think it's the best.
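That screenshot-in, keyboard-and-mouse-out loop the benchmark measures can be sketched as a simple cycle. Everything below is a stubbed toy of my own (all function and class names invented for illustration), not OpenAI's or Anthropic's actual agent code:

```python
# Minimal sketch of a screenshot -> decide -> act computer-use loop.
# All names here are placeholders made up for illustration; a real agent
# would wire these to an actual model and OS automation APIs.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # "click", "type", or "done"
    payload: tuple  # coordinates for clicks, text for typing

def take_screenshot() -> bytes:
    # Stub: a real agent would capture the desktop here.
    return b"<screenshot>"

def decide_next_action(screenshot: bytes, goal: str, step: int) -> Action:
    # Stub: a real agent would send the screenshot plus the goal to the
    # model and parse its reply into an action. We fake a short script.
    script = [Action("click", (120, 340)), Action("type", ("hello",)), Action("done", ())]
    return script[min(step, len(script) - 1)]

def run_agent(goal: str, max_steps: int = 10) -> list[Action]:
    """Loop until the model says it's done or we hit the step budget."""
    history = []
    for step in range(max_steps):
        action = decide_next_action(take_screenshot(), goal, step)
        history.append(action)
        if action.kind == "done":
            break
        # A real agent would execute the click or keystroke here.
    return history

actions = run_agent("open the settings page")
print([a.kind for a in actions])  # -> ['click', 'type', 'done']
```

The 75% success rate is about how often a loop like this reaches "done" with the task actually accomplished, which is why the step budget and the quality of each decision both matter.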
There's also a major focus on how it's being used professionally.
OpenAI says the model is now significantly better at producing the kinds of deliverables people use in real work: spreadsheets, presentations, financial models, legal analysis, all of those.
They ran a bunch of different tasks, and on one performed by a junior investment banking analyst, it scored 87% compared to the 68% GPT 5.2 got.
Human evaluators also preferred its output about 68% of the time, saying it had better visuals and better structure.
So there's some cool stuff there.
Okay, cool features that you might actually use today.
This is the one I'm very excited about: what they're calling steerability.
It's available in the API too, which I think is crazy, but it's in ChatGPT.
When you're talking to ChatGPT, you can kind of see its reasoning, right?
It's thinking through some stuff, it puts a couple of steps down, and you realize it's going in the wrong direction.
Maybe you said, hey, I'm trying to find the best beach for surfing, and it goes, okay, looking at beaches in Kauai, and you think, oh crap, I'm in California, I don't want Kauai.
Then you can type a message, specifically "in California," mid-response, and it actually takes into account what you just said.
That's steerability: it incorporates your note into what it's looking at and into its reasoning, and gives you an updated response.
Basically, you can do mid-response prompts, and it will take them into account and adjust, giving you a better answer mid-response.
It's interesting because I think they did a couple of clever things here.
Normally when you ask a question, you have to wait for it to think and wait for it to reply; you sit there and you wait.
We all hate waiting, and if in the middle of waiting we're reading its line of reasoning and giving it more input and feedback, it feels like we did a lot less waiting.
We're really just reading and throwing in something, and it can get things done faster and better, rather than waiting for it to spit out the whole thing and then saying, okay, this is wrong, here's why, and here's what you should do instead.
You can do that in the middle of the response, which is really cool in my opinion.
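The surfing-beach example above can be simulated in a few lines: reasoning steps come out one at a time, and a correction injected mid-stream changes every step after it. This is purely a toy model of the behavior described in the episode, not OpenAI's actual API:

```python
# Toy simulation of mid-response steering: the "model" emits reasoning
# steps one at a time, and a user correction arriving mid-stream is
# folded into all remaining steps. Illustrative only, not a real API.

def reason(query: str, corrections: dict[int, str]) -> list[str]:
    """Produce reasoning steps; apply any correction queued for a step index."""
    context = query
    steps = []
    for i in range(3):
        if i in corrections:             # user typed something mid-response
            context += " | " + corrections[i]
        steps.append(f"step {i}: considering '{context}'")
    return steps

# Without steering, the model keeps heading toward Kauai.
baseline = reason("best surfing beach", corrections={})

# With steering, a note arriving before step 1 redirects it to California.
steered = reason("best surfing beach", corrections={1: "in California"})

print(steered[-1])  # the later steps now reflect the correction
```

The point is that the correction lands before the answer is finished, so the remaining "thinking" absorbs it instead of you waiting for a wrong answer and re-prompting from scratch.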
Something else I've focused on is online research.
Apparently it can search across a greater number of sources on the web, so instead of looking at one website, getting some data, and then looking at the next website, it searches a ton of sources at the same time and then follows leads across different pages.
It might get an idea from something it reads in one article, follow that to another article, and bounce around a lot more.
I know we've had deep research for a while, but it's doing that kind of deep research and combining all the information it gets into one coherent answer.
Basically, this is going to be more useful for the more complex questions where the information is scattered across a lot of different sites instead of sitting in one place.
It won't be relevant for every question you ask, but when you have a complex question, it's going to get you a more coherent answer quicker.
So this is great.
They also have all this, I don't know, fluff in their launch about how it hallucinates less and makes fewer factual errors and all that kind of stuff.
I don't think that's super important.
One thing we also heard is that it's going to turn you down less: if you ask a question, it's allegedly, according to them, less likely to refuse to answer.
However, our good friend Conor Grennan, who hosts the AI Applied podcast with me, tested it, and I saw a post he made on LinkedIn where he asked, is it true that an air bubble inside an IV could kill me?
Apparently it typed out the whole response to him, and then, just like we saw with DeepSeek, the Chinese censored model, where if you ask anything about Tiananmen Square it types out an answer and then it disappears with a "sorry, can't answer this," ChatGPT did the exact same thing.
This comes at a tricky moment, because New York right now is trying to pass legislation saying AI models can't answer questions about medical, health, legal, and a bunch of other areas; I think they're even trying to put hairstylists in there.
It's basically all of the different industries with regulatory capture; they just don't want people to be able to get answers for free.
So I'm, I don't know, kind of bummed about that legislation and that people are seriously considering it.
Anyway, it doesn't seem like the model is that much better on refusals, but maybe it's moving in a good direction; I'm not 100% sure.
It still feels like there are other models that are more the adult in the room, but you get pros and cons with those too.
Grok famously will answer any question you have about basically any of those topics, but there might be some other cons with Grok.
So, pros and cons to all of the models.
Thank you so much for tuning into the podcast today, guys.
If you enjoyed the episode, it would really help the show a ton if you left a rating and review wherever you listen to your podcasts.
Just drop me a note: say if you enjoyed it, say where you're from, say what topics are interesting to you.
I read all the reviews and all the comments, and it helps a ton.
Also, make sure you go check out aibox.ai if you want access to all of these latest models in one place, so you don't have to pay a $20 subscription to 10 different platforms.
It's $8.99 a month, and you get access to over 40 different AI models.
So go check it out, link in the description, aibox.ai, and I'll catch you guys all in the next episode.



