
The March 11, 2026 episode opens with a discussion about public skepticism toward AI, using polling data to frame how AI is being perceived politically and socially. The hosts then move through several major stories, including Yann LeCun's new venture Advanced Machine Intelligence, a humorous token-cost comparison clip, and Andrej Karpathy's open-source auto research project for AI-driven model improvement. Later segments focus on self-improving agents, multi-model workflows and skills, and an AI-in-science feature on Zephyrus, a system that lets researchers query weather and climate data in plain English. The episode closes with a broader reflection on conversational access to complex scientific data and how that could reshape research workflows.
Key Points Discussed
00:00:44 AI Popularity and Public Perception
00:05:00 Yann LeCun’s Advanced Machine Intelligence
00:08:03 Karl Yeh Joins with the Token Cost Clip
00:12:08 Andrej Karpathy's Auto Research
00:21:12 Self-Improving Agents and Anthropic Institute
00:38:04 Multi-Model Workflows and AI Consensus
00:43:30 Turning Repeated AI Work into Skills
00:49:15 AI and Science: Zephyrus for Weather Data
The Daily AI Show Co-Hosts: Andy Halliday, Beth Lyons, Jyunmi Hatcher, Karl Yeh
Aloha, it is March 11th, 2026, and I am joined by Andy and Beth. I'm Jyunmi, and this
is the Daily AI Show.
Today, we're going to cover a whole bunch of AI news, and then I've got a little bit
of an AI in science story for you.
So let's dive in really quickly, and Andy, what story do you have for us today?
Well, I want to lead off with the question of just how popular is AI?
We see the huge numbers of people who are using it, but a new poll this month by NBC News
of 1,000 registered voters showed that there are only two items on their list with worse
net ratings than AI, and I'm sure you're just wanting to know what those two things
are. They are Iran and the Democratic Party.
Those two have the biggest gap between the number of people who voted positive
on that particular subject and the people who voted negative, but there are some others
around that same cluster at the bottom of the list that I thought were interesting.
One of them is ICE, but that doesn't surprise you.
It's way down there, but AI is below ICE.
ICE has a negative 18 difference, and AI has a negative 20.
So wow.
Okay, but there was a big chunk of neutral feeling about AI, where there wasn't so much
neutrality about the war or ICE, I thought. But yes, the absolute no.
Yes, well, lies, damned lies, and statistics, here we go. So the poll does require further
interpretation.
Frankly, I like AI.
I don't know what they're talking about. But a lot of it has to do with a lack of knowledge
about the positive benefits of using the full range of frankly inscrutable tools
that are available to us, and with the threats to society and otherwise.
So I can see why overall for voters, AI is a negative issue.
What I can also see, from my experience of people who are younger than we are, is that the
dislike, the absolute no, feels very wrapped up in climate and what
it's doing to the environment, right, which is also a runaway story.
So the idea of what it's doing to the environment has a seed of truth, but is interpreted,
I think, as more than it actually turns out to be. That doesn't mean it won't turn out
to be that, right, but it feels a little more like they think they're pretty clear on
what the harms are.
If you're not using it every day, you're not clear on what the benefits are, but the pushback
isn't about the benefits. That pushback is about the environment.
I think just stepping back, it's interesting in my lifetime that if we looked at a snapshot
of what the important voter issues were, both positive and negative, across the decades,
I didn't foresee that artificial intelligence was going to be so omnipresent in the news
and in the dialogue around what's happening in the world.
It's really front and center, whether it's in respect of war and the application of AI
to war, or in the context of the significant income disparity that's emerged since the 1970s
and been accelerated by AI. AI is touching on every one
of those things now. It's just a centerpiece in my life, frankly,
and I didn't see that coming.
No.
And you saw all kinds of things coming, Andy?
I thought I did, but like,
certainly wasn't able to take advantage of them.
Right.
Right, right.
Beth, do you have a story that's of particular interest?
I do.
So we have shared Yann LeCun's vision, or pushback on AI and large language models,
many times on this show, right?
Like, he's fairly famous for saying that a large language model isn't as smart as your cat,
because your cat has a sense of the world. And he also, this was recent, left
Meta, right?
He confirmed leaving after the Scale person came in and took over AI at Meta. Not Brian.
The Scale guy.
Alexandr Wang.
Yeah.
You?
Yeah.
So Yann LeCun's Advanced Machine Intelligence just emerged.
So that's his company, with a $1.03 billion seed round.
He's a Turing Award winner.
I'm reading this from The Rundown.
So he left in November, after 12 years with FAIR, telling Mark Zuckerberg he could build
world models faster, cheaper, and better on his own.
So the seed round is significant at one billion, basically, but the other significance
is that it's the biggest seed round investment in Europe at this point.
Yeah.
He's over in France.
He moved from Meta's headquarters, I think, or maybe he was already working from France, but
he's a French guy.
Yes.
Yeah.
And by the way, the valuation that the investors gave that company, which hasn't done anything
yet, although behind the scenes I'm sure they've got a lot:
it's three and a half billion.
So it's already a unicorn before it ever starts. It's a unicorn in the money provided
as the seed funding.
How do you spend a billion dollars on developing world models well?
Nvidia can tell you, because they're building world models, and they spend way more than
that.
And Nvidia.
And Nvidia.
Nvidia is among the backers. Nvidia, Samsung, Bezos Expeditions, Eric Schmidt, and Mark
Cuban are among the backers.
The headquarters is in Paris.
And that's interesting, because we had a little tour, when Brian was in Paris, of an incubator
that was in a converted railway station, or
something like that. That's right.
I'm not saying that he's there, but I'm not saying he's not there.
And that would be very cool.
And we're starting to see posts of people saying, Hey, I'm pleased to announce that I'm
going to be working with Yann LeCun at Advanced Machine Intelligence.
So very cool.
Yeah.
Karl's joined us.
So Karl, I'll throw you into the deep end.
Do you have a story that you want to talk about today?
Oh, yes, I do.
Hold on.
Give me one second.
There was something that I posted yesterday to the chat and I thought it was very funny.
I'm just going to let's see if I can play it.
Yeah.
Okay.
So I want to play this, and you let me know what you all think of it.
Okay.
How many tokens a year is my employer getting from me in terms of productivity?
Let me run the math real quick, assuming you're a full-time human: 40 hours a week, 50
weeks a year.
That's about 2000 hours.
If we figure your brain's doing 20,000 tokens a day, like we talked about, that's roughly
100,000 tokens per work week. Multiply by 50: 5 million tokens a year.
So yeah, your boss is basically getting 5 million little thoughts, decisions, emails,
and "why is this spreadsheet broken again" moments for whatever they're paying you.
Kind of makes you wonder who the real bargain is, huh?
So 5 million tokens a year they get from me.
What is the cost of getting 5 million tokens with AI right now?
Right now, March 2026, prices have dropped hard since last year.
For 5 million tokens, depends on the model, but let's keep it real.
With something like OpenAI's GPT 5.2, their current top dog, you're looking at about 9 bucks
for input-heavy work.
That's assuming the 5 million is mostly input; output runs more, call it 14. If it's half and half, call it
12 to 15 dollars total.
Cheaper options?
Grok's fast tier from xAI is stupid cheap, like a dollar or less for 5 million if you're mostly
prompting and not generating much.
Claude's still around 3 to 15 per million, so maybe 20 bucks tops for a balanced run.
Bottom line: your 5 million human tokens a year? An AI could replicate the raw output
for under 20 bucks, maybe 5 if you go budget. Makes you think, are we the expensive part
now?
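For anyone who wants to check the clip's arithmetic, here is a minimal Python sketch. The 20,000-tokens-a-day figure and the per-model prices are the clip's own claims, not verified numbers:

```python
# Sanity check of the clip's token math. All inputs are the clip's
# claims (20k tokens/day, the March 2026 price quotes), not verified.
TOKENS_PER_DAY = 20_000        # clip's estimate of a human brain's output
WORKDAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 50

tokens_per_week = TOKENS_PER_DAY * WORKDAYS_PER_WEEK   # 100,000
tokens_per_year = tokens_per_week * WEEKS_PER_YEAR     # 5,000,000

# Hypothetical per-million prices, back-derived from the clip's quotes.
prices_per_million = {
    "GPT 5.2 (input-heavy)": 1.80,   # ~$9 for 5M input tokens
    "Grok fast tier": 0.20,          # ~$1 or less for 5M tokens
    "Claude (balanced)": 4.00,       # $3-15/M blended, ~$20 tops
}

print(f"Human output: {tokens_per_year:,} tokens/year")
for model, price in prices_per_million.items():
    cost = tokens_per_year / 1_000_000 * price
    print(f"{model}: ~${cost:.2f} for the year")
```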
Anyway, so yeah. That's a good one, thanks for finding that, that's
excellent.
And then I thought, hold on a second, the Big Arch costs more than 9 bucks,
so the new McDonald's sandwich is worth more than my entire output for the year.
I don't know, I think there's something wrong with the token math there. You know, a token
is a smaller atom, if you will, than, you know, a completed thought.
I think there are a lot more tokens being generated than 5 million a year by a human, right?
Well, I mean, there's a reason that everybody's trying to figure out how to power AI at the
efficiency of the human brain, with selective recall, with auto-pruning recall, right, like our
memories.
And yeah, well, that was fun. I was very entertained by that, especially
if you were just listening to us and you couldn't see the facial expressions
of the guy who's asking the questions to the AI model. I don't know which AI model
it was, I've never heard that particular voice before. Anyway, his expression was really
wonderful.
You acted that one out beautifully.
Yeah.
It was also clearly created by AI and not edited, because 5.2 has not been the premier model
since the training data stopped; many models have come out since then,
right?
Right, right, right. Or it's an old one.
Well, speaking of how AI can do more things more efficiently and with fewer tokens, I wanted
to talk a little bit about Andrej Karpathy's auto research.
Have you guys covered that at all?
No? Okay, well, this came out, I think he released it on the
eighth.
But essentially, so what he did was he open sourced a project called auto research.
His X post got 8.6 million views, and the GitHub repo has crossed
8,000 stars in the last couple of days.
And so the implications are, you know, pretty big for a weekend project.
So what he did was he created a 630-line Python script, and you give an
AI agent a language model training setup, but it's for a single GPU, and you just point
it at this training script and go on with your day.
So the agent modifies the code, runs a five minute training experiment, checks if the
model improved, keeps or discards the change and repeats.
So by morning, he had roughly 100 completed experiments and he didn't have to touch anything.
So the way he puts it is, the goal is to engineer your agents to make the fastest research
progress indefinitely and without any of your own involvement.
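Taken at face value, the loop he describes is simple enough to sketch. This is not Karpathy's actual script; propose_change() below stands in for the LLM agent that edits the training code, and the train.log format is an assumption:

```python
# Minimal sketch of the keep-or-discard loop described above, not
# Karpathy's actual code. propose_change() is a hypothetical stand-in
# for the coding agent; the train.log format is an assumption.
import shutil
import subprocess

def propose_change(src: str, dst: str) -> None:
    """Ask a coding agent to write a modified copy of src to dst.
    Placeholder: in the real loop this would be an LLM call."""
    shutil.copy(src, dst)  # no-op stand-in so the sketch runs end to end

def validation_loss(log_file: str = "train.log") -> float:
    """Parse the final validation loss from a training log (assumed format)."""
    with open(log_file) as f:
        return float(f.readlines()[-1].rsplit("val_loss=", 1)[-1])

def auto_research(train_script: str, n_experiments: int = 100) -> None:
    shutil.copy(train_script, "best.py")
    subprocess.run(["python", "best.py"], check=True)   # baseline (~5 min run)
    best = validation_loss()
    for i in range(n_experiments):
        propose_change("best.py", "candidate.py")       # agent edits the code
        subprocess.run(["python", "candidate.py"], check=True)
        loss = validation_loss()
        if loss < best:                                 # keep only improvements,
            best = loss                                 # ...discard the rest
            shutil.copy("candidate.py", "best.py")
        print(f"experiment {i}: val_loss={loss:.4f} best={best:.4f}")
```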
Now what happens next is pretty interesting.
So a company called Hyperspace AI distributed the loop across a peer-to-peer network.
And on the eighth, 35 autonomous agents ran 333 experiments unsupervised.
The interesting finding: the agents on powerful
H100 GPUs used brute force to find aggressive learning rates,
while agents on weaker laptop hardware got more clever; they focused on initialization
strategies and normalization choices.
So different hardware, different optimizations both seem to work well.
At the same time, Shopify CEO Tobi Lütke adapted the framework overnight and reported
a 19% improvement, with the agent-optimized smaller model outperforming a larger model
that had been configured manually.
So why does this matter?
This is AI doing AI research.
It closes the loop researchers have been theorizing about for years: AI improving the very training
code that produces better AI.
And because it's 630 lines, open source under the MIT license, and can run on a single
GPU, anyone can run it.
So not just big labs, anyone with a graphics card and a markdown file.
The human's role shifts from experimenter to experimental designer: you define the problem
and what "better" means, and the agent grinds through the search space while you sleep.
So in terms of ML research, developing better AI, you can have your AI developing better
versions of itself.
And you can do it on a single GPU.
That was the thing that shocked me the most when I read about it is that this is single
GPU ML research and anyone could do it.
Anyone. I can run it right now on my single GPU.
So I have some questions about what the actual setup is.
There's 630 lines of Python code.
That's a little package.
And then there's a GPU.
We've talked about those two components.
But we haven't talked about the agentic LLM that's working through that set of 630 lines of code.
What's actually being improved?
Is it the 630 lines of code that self-improve, or is it an LLM that's getting improved?
Like a local model that you're basically giving the task of running the 630 lines
to kind of move forward.
Right.
So the script is running the loop of ML experiments.
So whatever your experiment stack is for developing your LLM, or your other ML-based
AI model, is what this is running.
So it's doing that improvement loop, or that improvement stack, for all of
your experiments.
And so if you've done ML research, you're just going to have this huge list of experiments
that you're trying to run on whatever model you're trying to build, improving
it from there.
I think the key takeaway is that it's an AI self-improvement loop.
It's tiny, and it runs with relatively insignificant hardware requirements.
All right.
It's like a new fine-tuning approach for a model that you may have running locally.
Yeah, you'd just have a locally running model, because you can't change the parameters on,
you know, a cloud-based model unless you have an isolated instance somewhere in the cloud.
You know, am I right about that?
Yeah, you're not, like, you know, improving ChatGPT or a GPT 5.4 or whatever.
This is more for building your own model.
Yeah, so it's a tiny model, because, when I talked about it with Grok on the show
the other day, Grok referred to it as a toy lab, right?
So you're not even necessarily able to do something like a general language model,
although it may be that daisy-chaining and doing other things, like Jyunmi just said, makes
that more possible.
So in a follow-up post on Monday, March 9th, Andrej said: three days ago, I left auto research
tuning NanoChat for about two days on a depth=12 model,
and it found 20 changes that improved the validation loss.
He tested the changes, and all of them were additive and transferred to larger
depth=24 models, right?
So he's doing the research on a tiny model.
It's generalizable, so it can move to the larger model,
stacking up all of the changes.
Today, I measured that the leaderboard's time-to-GPT-2, which is an internal measure, I'm sure,
drops from 2.02 hours to 1.8 hours.
That's an 11% improvement, and the arithmetic holds: (2.02 − 1.8) / 2.02 ≈ 10.9%.
So that's the kind of result this could give you.
And then you can make the inferences, not AI inference,
you can make the assumptions, or carry the tests from the small model to the larger one,
and then see improvement being stacked.
And the real key for this is that he says: I'm very used to doing the iterative optimization
of neural network training manually.
You come up with ideas, you implement them, you check if they work (better validation loss),
you come up with new ideas based on that, and you repeat.
In leaving it alone, he was very surprised that, on top of the tuning
he'd already done over a good amount of time, this was so successful.
Cool. Yeah, Prometheus unbound.
Yeah, those first steps anyway, right?
We're on the slope, right?
I mean, when we talk about Prometheus, we're like, okay,
so it was this, and then, boom, it was super fast.
But there was a slow ramp-up to the curve.
You just didn't notice that it was the curve while the ramp-up was happening.
It's like, oh, this is interesting, why does that matter?
Oh, there's a lot of talk about self-improving agents, agents with persistent memory, right?
Those are the things that we're talking about.
That is the starting of the ascent before the boom, I think.
Do you agree, Andy?
I see that there's probably a tipping point coming, where the rapid acceleration of
context retention, in the form of persistent memory, and also self-improvement
reinforcement learning techniques, like the 630 lines of code for tuning a model,
combine to really rapidly advance the capabilities toward AGI.
And Anthropic, just four hours ago, introduced the Anthropic Institute on that note,
to facilitate and seed conversations about powerful AI. Quote: because we believe
being forewarned is being forearmed, the Anthropic Institute will tell the world what we are
seeing and expecting for the technology we build. It will lead new research into the challenges
posed by more powerful AI and partner with others to address them.
This is a little bit, you know, like a follow-up to their
relationship breakup with the Department of War, but also, I think that all the labs are kind of
saying, hey, everybody wake up, we just need to have a little bit more of a conversation about
what's possible now and what we see coming in the relative near future.
And as a weave off of that story, did you talk, have you talked about the OpenAI robotics
leader leaving there? No, we had it teed up, but yes, it's a relevant story.
So on Saturday, Caitlin Kalinowski, the executive leading OpenAI's hardware and robotics
team, publicly resigned. Her statement was direct: AI has an important role in
national security, but surveillance of Americans without judicial oversight and lethal autonomy
without human authorization are lines that deserve more deliberation than they got.
She clarified in a follow-up: my issue is that the announcement was rushed without the guardrails
defined. It's a governance concern first and foremost. So, a little bit of context.
In the days before, the Pentagon had been negotiating with Anthropic
about deploying AI on classified networks. Anthropic pushed for strict limits:
no mass domestic surveillance, no fully autonomous weapons. Those negotiations collapsed,
and then the Pentagon did something unprecedented. It designated Anthropic a supply chain risk.
That's not symbolic. It could force companies like Nvidia to sever commercial ties with Anthropic.
Anthropic says it will fight the designation in court, which it has now done. Then, very quickly,
OpenAI announced its own agreement allowing its technology in classified environments;
even CEO Sam Altman reportedly acknowledged the rollout appeared opportunistic and sloppy.
Kalinowski's departure is especially significant because she wasn't a policy person;
she led robotics and hardware. When the person running your robotics division resigns over
concerns about lethal autonomy, that's someone with a direct line of sight into what these technologies
can do saying the process wasn't right. An OpenAI spokesperson said the agreement
creates a workable path for responsible national security uses of AI,
with red lines against domestic surveillance and autonomous weapons.
The bigger picture: Anthropic's supply chain designation sends a chilling signal to any AI
company that might want to negotiate limits on military use, and Kalinowski's resignation
sends a different signal: the people inside these companies believe the pace of
deal-making is outrunning the pace of governance.
So, a significant move. As we've seen over the last months or year,
and this might be due to the limited talent pool, or because of the, you know,
enormous sums Meta has been throwing around, the leaders and talent of these programs
do have a spotlight on them. So anytime one of them makes a move, that becomes a
significant bit of news, and why they made the move becomes more significant as well.
So to have the head of OpenAI's robotics team say, hey, this is not done correctly,
and I'm out of here, I think is a big step in saying, okay, well, it's time to re-evaluate,
because what the people working on the stuff believe doesn't seem to align with what's
happening outside of their work, their direct purview, I guess.
So, Gwen asked a question here in the thread, saying: do you think there's something
going on with the type of boots on the ground there will be? And I think you're talking about
the possibility of autonomous killing weapons that are robotic, and we have those in a sense in
the form of autonomous drones that are pre-programmed to, you know, take action in, you know,
battle space, but it's a very fearsome concept of the idea of having, you know, really fast moving,
humanoid or dog type or other format robots being able to rapidly dominate the space that's
beyond the line of advance in a military confrontation, right? Or let's just say, what if China unleashed
autonomous killing robots that, you know, didn't have a whole lot of discrimination, but were just
designed to kill anything that looked like it was wearing a military uniform in Taiwan, right?
And just release them and let them go. That's really scary because these machines can move
much faster. You know, the things we've seen in Star Wars, you know, the drone armies marching
very slowly, that's not the way it's going to look, right? These things will be very fast. You've seen
the humanoid robot capabilities already, the dancers on stage; they can outperform
human dancers. You know, they're not going to be dumb droids. They're going to be very, very fast
and accurate, and it's going to be scary, and it's a whole new level of human confrontation that's
possible with this. So yeah, I think that someone working in robotics inside OpenAI saying,
no, I really can't sign up for this, that's an important demonstration of an ethical
red line that, you know, that person has against, you know, involving themselves in the
forward development of these kinds of capabilities. Right. I think also the boots on the ground
is referencing the talk that's coming from the Department of Defense and the president
about looking at boots on the ground in the war in Iran. Yeah.
Well, moving on from the doom and gloom, I wanted to just mention to everyone who's been waiting
for Google to come forward with additional tools that will improve their abilities to hopefully
catch up with and match the abilities of Claude in Excel. Google has just released what they
call context-aware generation across the entire Google workspace. So in docs and slides,
Gemini can create full drafts or design, you know, aligned presentations using your own files,
your calendar, and web data. It'll go out and search for you and then build these documents.
In Sheets, it will populate missing fields for you. It'll use web data to get that information
if necessary. It also has automation tools that build entire dashboards from a single prompt.
So you can now talk to Gemini. I've not tried this yet, but I'm grateful that Google is finally
coming forward with some additional tools in Sheets because I've tried to move away from Excel
and work only in Sheets in order to sort of centralize both my storage and my
context, as Google would say, around, you know, Sheets that are actually operating models,
you know, for various tasks that I do. So it'll do cross-file analysis, meaning it'll provide you
with answers and summaries with direct citations across your full set of documents, whether Sheets
or Google Docs or Slides. So you can start to ask Gemini things about, and ask it to act on
and change, those files in your Google app suite, now called Workspace.
Hmm, all right. And they also announced yesterday the embeddings
model, multimodal embeddings, which can process video, audio, text, and image,
and it specifically called out the ability to process PDFs, right?
So not just text and image, but text with image. The multimodal model is supposed to have enough
understanding of the various ways that information is communicated in those modalities
that it can do a comprehensive analysis. Yeah, so this is a capability that was developed at
DeepMind, Google's division for really advanced research. And what that enables is embedding, into a
retrieval database, all the different modalities. So you have video and audio, and anything that
you put into it is semantically clustered, so that retrieval of data works
across all of those different modes. And what is very interesting, and
Brian will be very interested in hearing this is that this allows developers to build search tools
that could find, for example, find the moment in video that matches this text description.
And so now you've got a single vectorized data store that has all the information semantically
coded in very high-dimensional space. I saw something about this new embedding model: it has
a clever way of reducing a 3,000-plus-dimension vector space down to about 1,500, which is the typical
maximum dimension for vector databases. So anyway, that's really an important
development, especially with respect to our current project, Bruno, which has a video search capability
built into it. We're currently using Gemini as a way of doing that, but I'm not certain,
you know, what the vector database is that Brian's implemented. So we'll have to
talk to him about that. But this is important to that project.
Yeah, man, it's the atomization process. The slight caveat to that is that it is $12 per one million
tokens of input. And I'm not entirely sure what one million tokens means here; we'll find out, because we're
totally going to do it for the show. But, like, what does a one-hour show like ours
cost in tokens to embed? Right. Because right now,
using the levels of Gemini that we're using, with the atomization that I do
afterwards, that is like under a dollar. And in fact, the atomization itself is, I don't know,
15 cents, maybe. But the other thing that I learned was that it looks like basically
you're paying the highest rate that you need. So if you are processing a video that has audio and a
screen that has an image with text on it, right, like a PowerPoint or something like that,
you're paying the $12-per-million cost once, not also paying separately for processing the
audio, not also paying for processing these other things. And I'm sure the semantic understanding of
what's happening on the screen comes with that as well. So more experiments to come.
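If the dimension reduction Andy mentions works the way Matryoshka-style embeddings usually do, shortening a vector is just truncation plus renormalization. A minimal sketch under that assumption (Google's exact method isn't specified in the episode):

```python
# Matryoshka-style shortening of an embedding: keep the leading
# dimensions and re-normalize. Assumes the model front-loads the most
# informative dimensions; Google's exact method is not confirmed here.
import numpy as np

def shorten_embedding(vec: np.ndarray, target_dim: int = 1536) -> np.ndarray:
    """Truncate a high-dimensional embedding and re-normalize to unit length."""
    short = vec[:target_dim]
    return short / np.linalg.norm(short)

full = np.random.randn(3072)            # stand-in for a 3,000+ dim embedding
full /= np.linalg.norm(full)
short = shorten_embedding(full, 1536)   # fits typical vector-DB dimension caps
print(short.shape)                      # (1536,)
```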
Yeah, always something new. And it really does bring up, Andy, what was that term about just waiting for the
technology? Oh, yeah. It's called the waiting equation, the waiting question: do you get
an overall more economical result, and a faster result, by waiting for the technology to catch up rather
than investing early in the cycle? But you should still be using the technology, right?
Rachel Woods had a post, I think on LinkedIn the other day, probably
a week or two ago, saying: I cannot imagine being the company or the person now
who's looking at what's happening and saying, you know, I'm still going to wait till it stabilizes,
right? I'm going to wait till everything's decided, and then I'm going to jump in.
And what a learning curve they would have. Oh, absolutely. Just getting started now,
as opposed to having a year in or more. Yeah, I think conceptually just understanding those
foundations is probably the biggest hurdle. I wonder, and I guess this would be a question for
the chat or any of our viewers later: if you are new to AI, what has been the largest hump, the
most difficult, you know, learning curve that you've had to experience? Because I still
remember, when I was first getting into it, it was just trying to deluge myself
in information, and then constantly learning more and more, and spending large portions of the
day just absorbing enough to then start making forward steps. So I'm wondering if that's still
the difficult part of getting into AI: that, okay, I need to do a lot of reading,
I need to do a lot of watching of YouTube videos and you know, I need to go back and watch
all 678 episodes of the daily AI show to catch up. Which you don't have to, but we'd like it if you did.
So, right, before I move on to AI in science, I just wanted to check if there were any other
stories that either of you wanted to cover. Yeah, I have something I can probably knock out in a
couple of minutes. This is around the idea that you get better results if you, as a
marshaler of all of your AI resources, involve multiple models in the approach to whatever it is
that you're doing. And I've talked about this on the show previously, about vibe coding and using
Gemini as a coach in working with Lovable's, you know, Claude-based coding agent. And the combination
of those two is better. And I've actually had a recent experience where Lovable actually did
something that Claude Code didn't quite fully understand. And I couldn't quite figure out, because
it's not transparent on Lovable, you know, what it was that Lovable's multi-agent approach was
doing that got to a better result than what Claude Code had proposed. And, you know, Claude acknowledged,
oh, that's a much better idea, right? That kind of thing. And they hadn't been talking to each other.
Well, there's a Boston-based startup called Collective IQ. Collective,
meaning multi-model. It started as a platform for enterprises in the procurement space.
And what the CEO of that procurement company felt, and he's very technically proficient,
obviously, because he started this startup that has a really incredible capability available now
to do this collective-IQ approach, was that it was really constraining to have to decide
on a single model: I'm going to be a ChatGPT person and I'm going to use that model. It was much
better. They got better results when they had this sort of collaboration happening across models.
So they unveiled this new product that they're offering to enterprises, called an AI consensus
platform. It's designed to simultaneously query ChatGPT, Claude, Gemini, Grok, and up to 10
additional large language models before synthesizing the result into a single annotated answer.
And it doesn't just go and get individual answers; it actually
coordinates a dialogue across those models. So one model can be acting in a role that
challenges another one, and also offers its own result. It then highlights for the user
where the models agree, flags the disagreements, and aims to reduce the hallucinations and bias
that plague any single-model approach to working with AI.
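The fan-out-and-synthesize pattern Andy describes is easy to sketch in outline. Collective IQ's actual API isn't shown in the episode, so ask() below is a hypothetical stand-in for whatever SDK each model uses:

```python
# Generic sketch of the fan-out-and-synthesize pattern. Not Collective
# IQ's real API; ask() is a hypothetical stand-in for per-vendor SDKs.
from concurrent.futures import ThreadPoolExecutor

MODELS = ["chatgpt", "claude", "gemini", "grok"]

def ask(model: str, prompt: str) -> str:
    """Hypothetical single-model call; returns a canned answer here.
    Swap in each vendor's real SDK call to make this do real work."""
    return f"({model}'s answer to: {prompt[:40]}...)"

def consensus(prompt: str) -> str:
    # 1) Fan out: query every model in parallel.
    with ThreadPoolExecutor() as pool:
        answers = dict(zip(MODELS, pool.map(lambda m: ask(m, prompt), MODELS)))
    # 2) Synthesize: one model merges the answers, flagging disagreements.
    digest = "\n\n".join(f"[{m}]\n{a}" for m, a in answers.items())
    return ask("claude", "Synthesize these answers into one annotated answer, "
                         "flagging where the models disagree:\n" + digest)

print(consensus("What is the best way to reduce hallucinations?"))
```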
So, I just wanted to say, this is why I have so many subscriptions to different AI models.
I don't want to give any of them up, because I feel like each one of them has something to offer.
Right, right, because each model has its own sort of expertise, or seems to work best in a given situation.
And we also know that, from giving one model what a different model has said,
you get a different level of conversation, right? The same as if you put five people in a room
to have a conversation about a topic, it's different than going out and doing five separate
conversations, right? There is an additive, maybe a little competitive, process of, like, no,
So, Andy, how would we best implement this? Would it be just an extra step in our questions
that we're asking these models? Well, I think what Claude Cowork has made possible for me is
eliminating the necessity of copying and pasting things from Gemini back into, you know,
Claude or vice versa. Instead, now Claude Cowork can look at all of my transaction and interaction
with lovable, for example. And I just say to what, when I want to get advice or perspective or
new plans from Claude Cowork, I just ask, using whisper flow, I ask Claude Cowork to look at,
you know, my most recent conversation and it will then scroll through the entire thing.
And it digests that very rapidly within 10 or 15 seconds and then says, okay, here's what I think
you should do. Or, you know, this is what I would do as an approach. And so, I'm, you know,
co-work to TLDRs, co-works automating the conversation with anything that's open in the browser for you.
Yeah. I just want to drop one more thing. One of the things that's very cool about
what's happening with social media is that I can read tips from people who posted
in a different language. And this comes from Deep Dive, who posted
originally in Korean. And basically what they're doing is telling Claude Code: scrape
all my Claude sessions stored on this computer, organize the tasks I do frequently, and
classify which ones would be good to put into skills, right, a repeated prompt collection, plug-ins,
external integrations, agents, autonomous agents, or CLAUDE.md. And then make those, right?
If it's a repetitive task, it can be automated. And it's auditing your own use, Andy, right? Like,
that's very much in line with what you're talking about. And how cool, right?
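A rough version of that audit can be done by hand. In the sketch below, the ~/.claude/projects location and the JSONL session format are assumptions about where Claude Code keeps local history; point it at wherever your logs actually live:

```python
# Minimal sketch of auditing your own local agent sessions for repeated
# tasks worth turning into skills. The ~/.claude/projects location and
# the JSONL layout are assumptions; adjust paths for your setup.
import json
from collections import Counter
from pathlib import Path

def first_user_prompts(root: Path):
    for session in root.rglob("*.jsonl"):
        for line in session.read_text().splitlines():
            try:
                msg = json.loads(line)
            except json.JSONDecodeError:
                continue
            if msg.get("type") == "user":
                yield str(msg.get("message", ""))[:80]
                break  # one prompt per session is enough for a rough audit

counts = Counter(first_user_prompts(Path.home() / ".claude" / "projects"))
for prompt, n in counts.most_common(10):
    print(f"{n:3d}x  {prompt}")  # frequent prompts = candidates for skills
```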
Yeah. No, that's super interesting. I'm always interested in the ways that I can
make my own day-to-day easier, or automated, I guess, in a sense. Just as a complete side note,
Andy, I know you were saying you use Wispr Flow. So I found an open-source project which I
really like, called Whisper Ring. It's not as polished as Wispr Flow, but it gets the job done,
and I've been using it this weekend. That's been an overall improvement on
my interaction. Though I am kind of a little irritated with Anthropic and Claude, because I keep
hitting my limits. What's going on? Shouldn't I be able to work all day without hitting the,
you know, limit? By the way, that's another one of the advantages of using multiple models.
So I'll often just pass it over to another model if I'm
fearing the approach of either a compaction or, you know, a strict limit.
It's been a while now, but I'm not spending hours at the terminal, my
workspace. I'm not spending hours at it. I get it in bits and pieces here; you
know, I get maybe a total of two hours a day, outside of all the other responsibilities I have.
But I can get a lot done in two hours, especially when I'm using multiple models and spreading
the token limits across multiple models. It's true. It's true. It's probably my own
fault for wanting to use Opus 4.6 thinking for every single question. And thinking high,
like, yeah, versus thinking medium. I was skating along on thinking medium and
had several conversations like: what? Why would you tell me that March 10th is a Tuesday?
Do you tell me that March 3rd is a Monday? Let's just talk about that, right? Like, let's just
have some perspective conversation. And the answer is training data, more than
doing date math. So, Jyunmi, you'll be interested to know, and perhaps you missed it because
we only mentioned it in passing on the show, but in benchmark performances
in agentic coding, Sonnet 4.6 matches or slightly exceeds Opus 4.6. So, at least in my use case,
I use Sonnet. I don't use Opus at all. Well, that's probably the comparison I need
to do. I've been doing a lot of replicating of systems or projects that I've created
in ChatGPT, and I've replicated them over into Claude to see what kind of
differences there are, and if I can get better answers, you know, through that process as well. Setting
that stuff up always takes more time, because you're asking these more in-depth questions,
like, oh, I have 300 source files to upload, and you can only take 20 at a time. Okay,
well, let's start absorbing. Yeah. Also, by the way, if you haven't already:
if you try to export your data from OpenAI, it took me three days to get it. So, just as a note.
What was it? It took you three days? Three days. Three days. For what? To get an
export of all my data from OpenAI. Oh, yeah. Okay. So just because there was so much data
there? Yeah. Yeah. I'm sure the process is also a low priority on their, you know,
list of things for compute. But just as a note, it could take you three days, or more,
because they don't make a promise on that. So if you're planning to move data over,
or do a backup of your data, know that the request may take three days
to process. But that's a side note. Okay. Unless there's anything else that you all want to cover,
I think it's time for AI and science... science... science. Okay. So today's AI in science story is
about something that sounds deceptively simple, but could change how an entire scientific discipline
works. Researchers at UC San Diego have built an AI agent that they're calling Zephyrus. It
can answer plain-English questions about weather and climate data. That might not sound
revolutionary at first, but to understand why it matters, you have to understand the problem
it's solving. So right now, if a climate scientist or meteorology student wants
to ask a question like, what was the average wind speed over the Gulf of Mexico last Tuesday,
they can't just, you know, ask. They have to write code, they have to know which data set to pull
from, how to query it, how to format the output, and how to interpret the results.
The barrier to entry isn't the science, it's the data engineering. So weather and climate
science sits on top of some of the most complex high volume data sets on the planet.
We're talking about constantly fluctuating streams of temperature, atmospheric pressure,
humidity, wind speed, and precipitation, all collected by surface stations, balloons, satellites,
radars, ocean buoys, ships, and aircraft. Making sense of all that data has always required
specialized technical skills that have nothing to do with understanding the atmosphere.
So what the UC San Diego team did is build an AI agent that acts as a bridge.
Zephyrus takes a natural language question, a question in plain English, translates it into code,
queries the relevant AI weather forecasting models, retrieves the data, runs the analysis,
and gives you back a plain-language answer. The researchers, including Duncan Watson-Parris from
the Scripps Institution of Oceanography and Rose Yu from the Department of Computer Science,
describe their goal as lowering the barrier to entry to analyzing critical data.
They specifically want students and early-career researchers to be able to interact with
these data sets without needing years of coding experience first. As Yu put it, Zephyrus is a
crucial step towards creating AI co-scientists that dramatically lower the barrier to entry,
allowing students and researchers everywhere to access and reason with critical weather
and climate data at unprecedented speeds.
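The bridge pattern described here, question to code to data to answer, can be sketched generically. This is not Zephyrus's actual implementation; generate_code() stands in for an LLM call, and the dataset's variable names are hypothetical:

```python
# Generic sketch of a natural-language-to-analysis bridge, not
# Zephyrus's real code. generate_code() stands in for an LLM call;
# the 'wind_speed' variable and coordinate names are hypothetical.
import xarray as xr

def generate_code(question: str) -> str:
    """Hypothetical LLM call turning a plain-English question into
    xarray code. Canned here for one wind-speed question."""
    return (
        "result = ds['wind_speed']"
        ".sel(time='2026-03-03')"
        ".sel(lat=slice(18, 30), lon=slice(-98, -80))"  # Gulf of Mexico box
        ".mean().item()"
    )

def answer(question: str, ds: xr.Dataset) -> str:
    code = generate_code(question)   # 1) question -> analysis code
    scope = {"ds": ds}
    exec(code, scope)                # 2) run the code against the data
    return f"Average: {scope['result']:.1f} m/s"  # 3) plain-language answer

# Usage: answer("Average wind speed over the Gulf last Tuesday?", ds)
```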
Now, it's important to be honest about what Zephyrus is right now. The team reports it performs
well on basic and intermediate tasks, but it still struggles with complex, multi-step queries.
This is a first-generation tool, not a finished product. They're presenting it at the
International Conference on Learning Representations, ICLR, in Rio de Janeiro this April.
But here's why this story is bigger than one research paper.
Zephyrus arrives at a moment when AI weather forecasting itself is going through a revolution.
Just last month, NOAA, the National Oceanic and Atmospheric Administration,
deployed a new suite of AI-driven global weather prediction models into operational use.
Their system, called AIGFS, can produce a 16-day weather forecast using just 0.3%
of the computing resources of the traditional Global Forecast System. And it finishes in about
40 minutes. To put that into perspective, the traditional system requires
vastly more supercomputer time to produce essentially the same forecast.
All right. The system was built on Google DeepMind's GraphCast model, then fine-tuned with NOAA's
own Global Data Assimilation System data. That additional training with NOAA's data
improved performance, especially when using GFS-based initial conditions. And it has already
shown results: significantly better tropical cyclone track predictions at longer lead times.
The trade-off is that intensity forecasting still needs improvement, which future versions
will address. But for track accuracy, which tells you where the hurricane is going, the
AI system is already outperforming the traditional model.
Yeah. We're with you, bro. Wow.
Let's see if I can get through this. Excuse me. Separately, researchers at the University of
Washington published a study showing something even more ambitious. They built an AI model
called DLESyM that can simulate a thousand years of Earth's current climate in just 12 hours,
running on a single processor. The same simulation on a state-of-the-art
supercomputer would take approximately 90 days. And the model didn't just run fast; it ran well.
The researchers compared its output to the four leading models from the IPCC's
Coupled Model Intercomparison Project, excuse me, I'm so sorry, which is the gold standard:
physics-based climate models that run on supercomputers and inform the Intergovernmental Panel on Climate
Change. DLESyM simulated tropical cyclones and the Indian summer monsoon better than those
leading models. In the mid-latitudes, it captured month-to-month and year-to-year weather variability
at least as well, including atmospheric blocking events. Those are the ridges that keep regions stuck
in heatwaves or cold snaps by deflecting incoming weather systems. The lead researcher,
Dale Durran, frames the tool's purpose this way: helping scientists answer whether a given
extreme weather event is the kind of thing that happens naturally within our current climate
or something that defies the odds. When we talk about 100-year storms that seem to happen
every few years now, this is the kind of tool that helps determine whether the climate itself
has shifted. Meanwhile, at Boston University, climate scientist Elizabeth Barnes, the university's
inaugural Dalton Family Chair in Environmental Data Science and Sustainability,
is working on a complementary piece of the puzzle. Her focus isn't just on making AI
predictions better, it's on making them trustworthy. So Barnes works on explainable AI,
cracking open the black box to understand how the models learn from the data and quantifying how
much certainty we should attach to any given prediction. As she puts it: uncertainty
quantification is huge in earth sciences. We do a lot of predictions. You can think of it as
weather forecasting, but we go out years to decades into the future. We don't know exactly what's
going to happen, so uncertainty has to be a part of what we produce. This matters because the AI
model that gives you a confident wrong answer about next week's hurricane track is worse than
no answer at all. The ability to say, here's our prediction, here's how confident we are,
and here's why, is what separates a useful tool from a dangerous one.
Okay. So, with all of these examples, what you have is a convergence across multiple fronts.
AI weather models are getting dramatically more accurate, outperforming
physics-based models that took decades to develop. They're getting dramatically faster,
from 90 days on a supercomputer to 12 hours on a single processor. They're getting dramatically
cheaper; NOAA's system uses 0.3 percent of the compute of the system it's supplementing.
And researchers are working on making the models explainable and trustworthy, not just fast.
But all of those capabilities are only useful if scientists can actually access and
interrogate the data they produce. That's the gap Zephyrus is trying to fill. Instead of writing code
to query a data set, you ask a question in English. Instead of spending hours on data wrangling,
you spend your time on the science. So the researchers see meteorology as a test case for
something much broader. Weather prediction, they note, is a perfect proving ground because it combines
large complex data sets that change over time with the need to reason about those data in plain
language. If it works here, the same approach could work in genetics, material science,
epidemiology, any field drowning in data that's hard to access.
As Watson-Parris puts it: we want to increase the speed at which we can reason about multimodal
data and learn about the Earth by making it easier for students and young scientists to interact
with different data sets. The tool isn't there yet. It handles the basic and intermediate queries,
but stumbles on complex ones. But the vision, that you should be able to talk to your scientific
data and have it talk back in a language you can understand, is one that could
reshape how research gets done across the sciences. And the timing is right. The AI models are generating
the data already; the computing infrastructure is ready, for the most part. What's been missing is
the interface between the scientists and the data and Zephyrus is the first step toward filling
that gap. So that is your AI and science story. AI in weather prediction and weather science
seems to be moving along at a steady clip. Zephyrus's original paper, I think, came out
in October of last year, and now they've moved on to presenting their actual model and
program in April. So not a lot of time has passed from the initial release of the paper to
putting it into circulation, if you'll pardon the pun.
I like the general conclusion that this is the development of the ability to talk naturally
to a data set that's very complex and changing continuously. And it's a more natural interface,
obviously, for humans to have an ongoing conversation about something in order to develop
understanding and action. And I think broadly that this beautiful example of how we're at the
edge of the application of AI to really complex data in weather forecasting is really an analog
for what has happened in the AI era since the emergence of ChatGPT, which is that now you can talk
to the computer. And it's a conversational dialogue. And now, with the latest frontier models,
you're talking to a computer that is much, much bigger and more knowledgeable than anything we
imagined before. The training that can be condensed into a single frontier model of only a trillion-
plus parameters is astounding. The level of knowledge, and now reasoning capability, that's
packed into that, that you can speak to naturally, and it can speak back to you in whatever
style or language you ask it to. That's really amazing to me, and I'm really happy that I
got to live to experience it. Yeah, strong. Well, all great topics that we've covered today.
And if you want to continue that conversation, please join our community at dailyaishowcommunity.com.
There you can continue that conversation and share all of your ideas. We're going to wrap it up
for today. I want to thank everybody who joined us in chat, and everyone who's watching
a little later. You can always catch us Monday through Friday, same bat time, same bat channel.
If you've enjoyed any of this, please make sure to click the
things that make the things. And remember to keep your minds and your hearts open. Aloha.

