
In this episode, we explore the new features and improvements in OpenAI's latest ChatGPT 5.4 model, highlighting its enhanced capabilities in coding, knowledge work, and professional applications. We also discuss practical use cases such as mid-response prompting and advanced online research, and compare it to previous versions and competing models.
Chapters
00:00 Introducing ChatGPT 5.4
01:16 Professional Capabilities and Scale
03:12 Benchmarks and Computer Use
08:04 Cool Features: Steerability & Research
10:25 Limitations and Regulatory Concerns
Hello, it is Ryan, and we could all use an extra bright spot in our day, couldn't we? Just to make up for things like sitting in traffic, doing the dishes, counting your steps, you know, all the mundane stuff. That is why I'm such a big fan of Chumba Casino. Chumba Casino has all your favorite social casino-style games that you can play for free anytime, anywhere, with daily bonuses. So sign up now at ChumbaCasino.com. That's ChumbaCasino.com.
OpenAI has just rolled out ChatGPT 5.4. There are actually a couple of cool features in here that I'm really excited about, things I've been wishing ChatGPT could do in the past, and they finally launched them. And of course, if you look at all of their marketing, it's basically them saying, this is our most capable model yet. Of course it's the most capable model; if it wasn't, what would they even be making an update for? So I'm going to get past all of the hype and all of the buzz from the launch, and I'm going to tell you some really interesting use cases and some ways that I actually think GPT 5.4 is useful.

Before we get into all of that, if you want to try all of the latest models, go check out my startup AIbox.ai. We have the latest models from the top 15 different AI companies, everything from Grok to Gemini to Anthropic to OpenAI, ElevenLabs for audio, and tons of cool image generation models. I think there are over 50 models on the platform total. You can try all of them side by side, and it's only $8.99 a month. So much cheaper than ChatGPT, but you get way more models. And of course, you can also use it to automatically create AI workflows that complete tasks for you. So there's a ton of cool stuff going on. Go check out AIbox.ai if you want access to all of the top models for only $8.99 a month, and it's 20% off if you get an annual plan. So there's a lot of cool stuff there.
All right, let's get into what's going on. The first thing I want to mention is that this model is called GPT 5.4 Thinking, and there's a higher-performance variant known as GPT 5.4 Pro. Together, these are designed to handle everything from complex analysis to heavy coding to long-running workflows across different professional software tools. They're dubbing this their professional work model; they're trying to get it into the hands of more working professionals. And this is coming right on the back of them signing a whole bunch of deals with different consulting firms that are allegedly going to get ChatGPT into more businesses and into professional environments. At the same time, they're locked in a battle over coding tools; even Google is in this right now, but it's really OpenAI's Codex going up against Anthropic's Claude Code. They're pushing hard on how software is built with AI models and on computer use. So this is where they're really focusing.

One of the biggest changes here is scale. In the API, GPT 5.4 has a context window of up to a million tokens, which lets it work with huge documents, really long conversations, and big data sets. If you think about it, a huge benefit is going to be coding, where you can pull in bigger code bases to actually work with. This was something Anthropic was really crushing at, and now OpenAI is trying to get into it. OpenAI also says the model is specifically more token efficient, which is one thing I'm actually excited about. Basically, it can solve the same problems using a lot fewer tokens than GPT 5.2, so your costs are going to come down. If you already had 5.2 running in a piece of software, and even if you don't, a lot of the software you use will, the costs come down a lot for that, and it also gets a lot faster. So the costs come down and the speed goes up. And so yeah, for me, this is something
I'm actually excited about. So, as far as how the benchmarks look, I'm not trying to sit here and nitpick benchmark percentages, but I did want to talk about some interesting use cases and why these numbers matter. Specifically, it's leading on a bunch of the better-known benchmarks. One of those is coding, and of course we know why that's important right now. Another is computer use, which is something I'm excited about. I feel like Anthropic is really crushing it with computer use right now. Basically, the model can look at everything on your screen, go click on stuff, and get things done for you. This is a use case I've been leaning on a lot with Anthropic's Claude for Chrome browser extension. It's a button you click that opens a side chat bar, and I use it on really complex UIs and websites. I'm not a developer, but recently, for example, I had to do some stuff in Google Cloud to set up a tool I was building on Lovable, and I needed to beef up my back end so it could do some extra fancy stuff. I didn't really understand anything Lovable was telling me I needed to do. So I opened the Claude sidebar and said, look, I'm in my Google Cloud account, here are the instructions from Lovable. And it clicked around and set things up for me. Now, should I have a real developer look this over? We're going to throw caution to the wind for the time being, and I can hear all the developers screaming into their headphones right now. But at the end of the day, it got it done, my software is now functioning, and I did not have to watch a bunch of long YouTube tutorials on how to set up something complex. Complex for me, anyway, because I have no idea how to do Google Cloud stuff. So this is a really incredible use case for a lot of reasons, and I think OpenAI beefing up their computer use capabilities is exciting because they're going to start competing more directly with Anthropic. It's not like Anthropic is the only one working on this; OpenAI has been doing it for a long time with agents, but it feels like it's getting a lot better.

Okay, the other thing I'm excited for is that they're getting a lot better at knowledge work. These are the kinds of tasks I think everybody uses it for, so this is somewhere we're going to see incremental improvements. On OpenAI's GDPval benchmark, which checks tasks across 44 different occupations, so it kind of shows you how different professionals can use this, the model is matching or exceeding industry professionals in 83% of comparisons. They're saying, look, these are the tasks people across these professional industries are doing, and it's beating what an industry professional might give you in 83% of these cases, specifically for knowledge work. That's a really big jump from the roughly 71% that GPT 5.2 was getting. So going from GPT 5.2 to GPT 5.4 takes you from 71% to 83%; it's just going to be a lot better for knowledge work. And by a lot better, I mean a 12-point jump, which is pretty significant. On some of the coding benchmarks, like
SWE-Bench Pro, the software engineering benchmark, the model is getting slightly better than the last version. That's good, but beyond being slightly better, it's actually quite a bit faster. If you've used a lot of these coding tools (specifically, we use Claude Code at AIbox), you'll know why that matters. My developer sends me screenshots of the really long, elaborate tasks it runs on our back end, on our code base. I swear it's like a goal for him to see how long he can get Claude Code to run continuously, without stopping, on a project he gives it. It's funny, because I'm vibe coding stuff on Lovable and usually get a response back in a minute or two, and he has it going for three and a half hours on a task. So if this model gets faster, I'm excited, because hopefully that three and a half hours gets cut down on some of the stuff we're working on.
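Since both the cost angle and the speed angle come down to token counts, here's a minimal sketch of how "more token efficient" translates into cheaper per-request costs. Every number below (token counts and per-million-token prices) is made up purely for illustration; this is not OpenAI's actual GPT 5.4 pricing.

```python
# Hypothetical illustration: if a newer model solves the same task with
# fewer output tokens, the per-request cost drops even at identical
# per-token prices. All figures are invented for this example.

def task_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one request, given prices per million tokens."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# Same prompt, same (made-up) prices; the newer model emits fewer tokens.
old = task_cost(20_000, 8_000, 2.00, 8.00)   # older, chattier model
new = task_cost(20_000, 5_000, 2.00, 8.00)   # more token-efficient model
print(f"old: ${old:.3f}, new: ${new:.3f}")   # prints: old: $0.104, new: $0.080
```

The same logic explains the speed claim: fewer generated tokens means less time spent decoding, so a more token-efficient model tends to be both cheaper and faster on the same task.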
I think one of the things it's also very good at is real computer interaction. There's a benchmark called OSWorld-Verified that evaluates how well an AI can operate a desktop environment: it pretty much takes a screenshot, then uses keyboard and mouse commands to go click on stuff. Right now the model has about a 75% success rate. I've used ChatGPT's agents; they're not perfect, and honestly not my go-to. I don't use them that much, though I wish I could use them more; I think Anthropic is doing better here. But a 75% success rate shows they are improving. The success rate is up a bit, and it's better than GPT 5.2, though I still don't think it's the best.

There's also a major focus on how it's being used professionally. OpenAI says the model is now significantly better at producing the kinds of deliverables people use in real work: spreadsheets, presentations, financial models, legal analysis, all of those. They ran a bunch of different tasks, and on one performed by junior investment banking analysts, it got 87% compared to the 68% that GPT 5.2 got. Human evaluators also preferred it about 68% of the time, saying it had better visuals and better structure. So there's some cool stuff.

Okay, on to cool features you might actually use today. This is the one I'm
very excited about: what they're calling steerability. It's available in the API too, which I think is crazy, but here's how it works on ChatGPT. When you're talking to ChatGPT, you can see its reasoning as it thinks through the problem. It puts a couple of steps down, and you realize it's going in the wrong direction. Maybe you asked for the best beach for surfing, and you see it's looking at quiet beaches, and you think, oh no, I'm in California, that's not what I want. You can then type another message mid-response, like "specifically in California," and it actually takes what you just said into account. That's the steerability: it incorporates your note into what it's looking at and into its reasoning, and gives you an updated response. Basically, you can do mid-response prompts, and it folds them in and gives you a better answer without starting over.

It's interesting, because I think they did a couple of clever things here. One of them is that when you ask a question, you have to wait for it to think, then wait for it to reply. You sit there and wait, and we all hate waiting. But if, in the middle of waiting, we're reading its line of reasoning and giving more input and more feedback, it feels like we did a lot less waiting. We're really just reading and throwing in suggestions, and it can get things done faster and better, rather than waiting for it to spit out the whole thing and then saying, okay, this is wrong, here's why it's wrong, and here's what you should do instead. You can do all of that in the middle of the response, which is really cool, in my opinion. Something else I've been focused on is
online research. Apparently the model can search across a much greater number of sources on the web. Instead of, okay, we're looking at this website, getting some data, now we're looking at that website, it searches a ton of sources at the same time and then follows leads across different pages. It might get an idea from something it reads in one article, follow that to another article, and bounce around a lot more. I know we've had deep research for a while, but this is doing deep-research-style searching on the fly, and it combines all the information it gets into one coherent answer. So this is going to be more useful for complex questions where the information is scattered across a lot of different sites instead of sitting in one place. Not every question you ask will need this, but sometimes, when you have a complex question, it's going to get you a more coherent answer, quicker. So this is great. They also have all this, I don't know, fluff in the launch about how it hallucinates less and makes fewer factual errors and all that kind of stuff. I don't think that's super important. One thing we also heard about
is that it's going to turn you down less. So if you ask a question, it's allegedly, according to them, less likely to refuse to answer. However, our good friend Conor Grennan, who hosts the AI Applied podcast with me, was testing this. I saw a post he made on LinkedIn where he asked it, is it true that an air bubble inside an IV could kill me? Apparently it typed out the whole response to him, and then, just like we saw with DeepSeek, the Chinese censored model, where if you asked anything about Tiananmen Square it would type out an answer and then delete it with a "sorry, can't answer this," apparently ChatGPT did the exact same thing. This is also a tricky moment, because New York right now is trying to pass legislation basically saying AI models can't answer questions in certain areas: medical, health, legal, all of these different categories. I think they're even trying to put hairstylists in there. It's basically all the industries with regulatory capture; they just don't want people to be able to get answers for free. So I'm, I don't know, kind of bummed about that legislation, and people are seriously considering it. Anyway, on refusals, it doesn't seem like the model is that much better, but maybe it's moving in a good direction; I'm not 100% sure. It still feels like there are other models that are more the adult in the room, but those come with their own trade-offs. Grok, famously, will answer any question you have about basically any of those topics, but there might be some other cons with Grok. So, pros and cons to all of the models.
Thank you so much for tuning into the podcast today, guys. If you enjoyed the episode, it would really help the show a ton if you left a rating or review wherever you listen to your podcasts. Just drop me a note, say if you enjoyed it, say where you're from, say what topics are interesting to you. I read all the reviews and all the comments; it helps a ton. Also, make sure you go check out AIbox.ai if you want access to all of these latest models in one place, so you don't have to pay a $20 subscription to 10 different platforms. It's $8.99 a month and you get access to over 40 different AI models. So go check it out, link in the description: AIbox.ai. I'll catch you all in the next episode. Hello, it is Ryan, and I was on a flight
the other day playing one of my favorite social spin slot games on ChumbaCasino.com. I looked over at the person sitting next to me, and you know what they were doing? They were also playing Chumba Casino. Everybody's loving having fun with it. Chumba Casino is home to hundreds of casino-style games that you can play for free anytime, anywhere. So sign up now at ChumbaCasino.com to claim your free welcome bonus. That's ChumbaCasino.com, and live the Chumba life. Sponsored by Chumba Casino.



