
🚀 Welcome to a Special Edition of AI Unraveled. Today, we aren't talking about chatbots; we are talking about the ceiling of machine intelligence. For years, AI has been "acing" every test we threw at it. But a global coalition of 1,000 scientists just hit back with Humanity’s Last Exam (HLE)—a benchmark specifically engineered to be impossible for today's AI.
This special episode is brought to you by DjamgaMind. In a world where AI benchmarks are being shattered every week, you need the signal through the noise. DjamgaMind turns massive academic papers like the Nature report on HLE into 60-second audio intelligence. Master the frontier of human knowledge while you're on the move at DjamgaMind.com.
Keywords: Humanity's Last Exam, HLE Benchmark, AI Reasoning Wall, AGI, Center for AI Safety, Scale AI, Texas A&M AI, Biblical Hebrew AI, Expert-Level AI, Nature Journal AI, LLM Saturation, MMLU Benchmark, GPT-5 Performance, Claude Opus 4.6, Gemini 3.1 Pro, DjamgaMind.
Credits: Produced by Etienne Noumen, Senior Software Engineer and AI Strategist.
🚀 Reach the Architects of the AI Revolution
Want to reach 60,000+ Enterprise Architects and C-Suite leaders? Download our 2026 Media Kit and see how we simulate your product for the technical buyer: https://djamgamind.com/ai
Connect with the host Etienne Noumen: https://www.linkedin.com/in/enoumen/
⚗️ PRODUCTION NOTE: We Practice What We Preach.
AI Unraveled is produced using a hybrid "Human-in-the-Loop" workflow. While all research, interviews, and strategic insights are curated by Etienne Noumen, we leverage advanced AI voice synthesis for our daily narration to ensure speed, consistency, and scale.
Welcome to a special edition of AI Unraveled. I'm your host, Etienne Noumen.
For the last three years, we've heard the same headlines: AI passes the bar exam; AI scores in the 99th percentile on the SAT. It felt like the machines were running out of things to learn.
But last week, the world's top researchers, led by Texas A&M and the Center for AI Safety,
decided to stop playing games. They released Humanity's Last Exam. It is 2,500 questions long.
It covers things no single human could ever know, from the intricacies of medieval Hebrew pronunciation to the physics of sesamoid bones in hummingbirds. And the result? The AI models that we thought were approaching AGI are failing miserably. This episode is sponsored by DjamgaMind. Stay ahead of the expert frontier with 60-second audio intelligence at DjamgaMind.com.
Today we are looking at the AI ceiling. We're going inside the exam that was built to protect
the expert frontier. Let's unravel why the machines are finally meeting their match.
Welcome back to The Deep Dive. Today, we are stepping into our roles as Senior AI Research
Analysts from the AI Unraveled Network. And we're bringing you a journey into the absolute
bleeding edge of machine cognition. Yeah, it really is. It's a genuinely fascinating time to be analyzing this space. It totally is. And we are taking a massive stack of sources today. I mean, we've got a 2,500-page academic evaluation published in the journal Nature, comprehensive reports from Texas A&M University, and some incredibly heated, deeply philosophical Reddit threads. Oh, the Reddit threads are fantastic. They really are. And we are distilling it all down for you.
So if you are a fellow learner who wants to understand exactly where the frontier of technology
actually sits today, you are in the right place. Absolutely. We are looking at a monumental
project known as Humanity's Last Exam, or HLE. And our mission today is straightforward but,
you know, profoundly important. Right. We need to critically examine this massive benchmark
to answer the ultimate question. Are the tech giants actually building artificial general
intelligence or have they simply spent hundreds of billions of dollars to scale up what are essentially
very expensive, highly articulate parrots? That really is the multi-billion-dollar question,
isn't it? It is. And to set the stage for you, I want to introduce the central mystery that
we are going to be unraveling throughout this deep dive. Let's call it the biblical Hebrew
syllable challenge. Ah, yes. That is a perfect anchor point for this entire discussion. It beautifully
encapsulates the exact problem researchers are wrestling with right now. Right. So I want you to
imagine a multi-billion-parameter AI system. This is a digital brain that has effectively ingested the entire internet. It can write flawless, complex Python code in seconds. It can summarize a 50-page Byzantine legal document instantly. Exactly. But when the same digital behemoth is asked
to analyze a single very specific line from Psalms chapter 104 verse 7, it completely and utterly
breaks down. Just totally fails. Yeah. Meanwhile, a human expert in ancient linguistics looks at
that exact same line and finds the logic to be entirely sound and deducible. Why does the machine fail where the human succeeds? Okay, let's unpack this. To understand why the machine fails
at the biblical Hebrew challenge, we first have to understand the existential crisis currently facing
the entire field of artificial intelligence research. And that crisis is something we call benchmark
saturation. Let's define that for the listener. What exactly is a benchmark in this context? And why
is it saturated? Because when I hear the word saturated, I just think of a sponge that can't hold
any more water. That's actually a great analogy. Think of the trajectory of AI research over the
past decade as a relentless accelerating cycle. Researchers create a novel computational benchmark,
essentially a highly complex standardized test. Right. Designed to test the absolute limits of
what a machine can do. That's the sponge. Exactly. Historically, we had standardized evaluations like
the Massive Multitask Language Understanding exam, widely known as the MMLU. Right. The MMLU. We also had Grade School Math 8K, or GSM8K, and HumanEval for coding.
So these tests were the gold standard. For a long time, if an AI couldn't pass the MMLU,
it wasn't considered frontier. They were these impassable barriers. Precisely. They were the
epistemological dividing lines. They were what we used to demarcate human cognitive flexibility
and expert level academic synthesis from mere machine pattern recognition. So it separated the
real thinkers from the calculators. Right. And if you think back just a few years ago,
AI models were scoring in the 30 or 40 percent range on the MMLU. It felt like a safe,
distant horizon. But over the last few years, we've hit a wall. The landscape of AI is experiencing
this destabilizing phenomenon of benchmark saturation. Yes. Contemporary, large language models
or LLMs, and these massive multimodal neural architectures are now routinely achieving accuracy
rates exceeding 90 percent on these legacy evaluations. Wait, let me push back on that for a second.
If an AI scores 90 percent on a graduate-level math test or aces a standardized bar exam,
isn't that just, I mean, isn't that proof that it's incredibly smart? You would think so.
Because you can't just guess your way through a graduate physics exam. Why is this a crisis?
The natural assumption for anyone reading the headlines is just wow, the AI is a mathematical genius.
You would absolutely think so. And that is the exact illusion that has captured the public imagination.
But it is a massive problem for the scientists trying to measure true intelligence. How so?
This unprecedented success rate has paradoxically rendered these tests completely obsolete
as precise scientific instruments. When state-of-the-art systems effectively conquer a benchmark,
the metric loses its foundational utility. Because it's saturated, the sponge can't hold any more
water so you can't use it to measure a flood. Exactly. The benchmark can no longer distinguish between
models that possess profound, generalizable reasoning architectures and those that have merely
internalized vast swaths of their training data. Ah, I see. They've memorized the study guide.
That's it. They are using parameterized memory retrieval and sophisticated statistical
guessing. So they aren't reasoning for the physics problem? No. If a question or a variation of that
question existed anywhere in the trillions of words the AI was trained on, the AI didn't solve
it. It just remembered it. Let's look at the GPQA diamond subset as a prime example of this
dynamic. Okay, the graduate-level Google-proof Q&A benchmark. And this test focuses exclusively on the hard sciences, right? Physics, biology, and chemistry. Yes. And the Diamond subset represents the 198 absolute hardest questions culled from that broader corpus. Right. The GPQA was supposed
to be the new incredibly high bar. The questions are designed so that even if you give a smart college
educated person access to Google and 30 minutes to search for the answer, they still shouldn't be
able to get it right. Exactly. And the statistics on the GPQA diamond are honestly mind blowing.
Our sources show that when you give this test to human domain experts, people with PhDs in these specific fields actively working in the discipline. What do they score? They score around 65%.
Wow. Only 65% for actual PhDs. Yes. And when you give it to highly skilled, internet-equipped non-experts, people who are allowed to spend half an hour researching a single question,
they only get 34%. Because the questions are so difficult, so esoteric, that even with the
entirety of the internet at your fingertips, a smart layman cannot figure it out without the
foundational interconnected knowledge of a PhD. Exactly. But then you look at the modern AI
models evaluated in 2026. GPT-5.2 achieved a staggering 92.4%. That's insane. And Claude Sonnet 4.5 hit 100% on the GPQA Diamond. They absolutely destroyed it. 100%. That beats the human
PhDs by 35 points. And this is where the illusion of intelligence becomes highly dangerous.
Let's break down that illusion, because it's not just whether they are smart or dumb. It's about how
they arrive at the answer. If a human gets 100% on a PhD physics test, we know how they did it.
They deduced it. How was the AI doing it? Despite their incredibly sophisticated capabilities,
large language models fundamentally operate as advanced prediction engines. In skeptical
academic circles, they are frequently characterized as fancy autocomplete. Fancy autocomplete. I love that term. It's accurate. They calculate the probabilistic distribution of the next token
in a sequence. They aren't thinking in the way you or I think. They're engaging in parameterized
memory retrieval. Hold on, explain that like I'm five. Parameterized memory retrieval sounds like something out of a sci-fi novel. Sure, think of it like this. When you ask an AI a question,
it doesn't ponder the meaning of the words. It takes your words, converts them into mathematical
values, and maps them onto a massive multi-dimensional web of associations it built during training.
Okay, so it turns my words into math coordinates. Right. It finds the coordinates of your question
and statistically predicts what words should physically follow those coordinates, based on the billions of texts it has read. So when they hit that 100% on the GPQA Diamond, it indicates that
for narrow, clearly defined scientific queries, the best LLMs have successfully synthesized their
training data better than a human can synthesize their own memory. Exactly, but it doesn't mean they
possess fluid intelligence. They don't have an internal model of how the universe works. They just
have an incredibly accurate map of how humans write about the universe.
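To make that next-token idea concrete, here is a minimal sketch in Python. The vocabulary and the raw scores are invented for illustration; a real model scores roughly 100,000 tokens using billions of learned parameters.

```python
import math

# Toy next-token prediction. All numbers here are invented for illustration.
vocab = ["Paris", "London", "Berlin", "the", "a"]
logits = [9.1, 3.2, 2.8, 1.5, 1.1]  # raw scores after "The capital of France is"

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.4f}")

# The model simply emits the highest-probability token. No pondering,
# no world model: a statistical bet on what text usually comes next.
print("prediction:", vocab[probs.index(max(probs))])
```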
And the stakes here are not just academic. We need to bring this to your doorstep as the listener, because this ties directly into
the global economy. Analysts and our sources have drawn incisive parallels between the current
fervor surrounding generative AI and the technological hype cycles of the past, specifically the dot-com bubble of the late 1990s and early 2000s. The comparison is highly relevant, and we should explore why. During the dot-com era, the sheer, undeniable potential of the internet drove massive speculative
financial investments into companies that possessed little more than a domain name and a theoretical
business model. You had companies like pets.com going public and achieving astronomical valuations
simply because they had a .com attached to their name. Right. And we all know how that ended.
A spectacular market collapse. Trillions of dollars in paper wealth vanished overnight.
Now the internet did eventually transform the global economy. Obviously we use it for everything
today. But the immediate claims of its capabilities in 1999 were vastly overstated
compared to the infrastructure that actually existed. And we are seeing a very similar frenzy right
now with large language models. The private sector has poured hundreds of billions of dollars into
scaling these models. We're talking about the massive $110 billion OpenAI funding round. Let's pause on that number. $110 billion. It is an almost unfathomable amount of capital flowing
into a single technological bet. For context that is a higher valuation than Ford Motor Company
or General Motors. The financial markets demand constant proof of progress to justify that kind of
capital. Investors aren't throwing around hundreds of billions of dollars for a slightly
better chatbot. No, they are investing because they believe AGI is imminent and that it will replace massive swaths of human labor. And that macroeconomic pressure has elevated the importance
of benchmarks from mere academic curiosities to critical indicators of corporate valuation.
This is the crux of the issue. If the benchmarks like MMLU or GPQA are flawed,
if they are saturated and easily gamed by statistical guessing, the entire economic
foundation of the current AI boom is called into question. If we are just scaling up expensive
parrots, that $110 billion valuation looks very shaky. It really does. The immense financial
investment necessitates empirical, rigorously adversarial validation of their capabilities,
not just reliance on legacy tests, which brings us to the sheer volume of information we are
trying to parse today to understand this crisis. We are going through a 2,500-page Nature paper, reports from Texas A&M, and deep academic critiques. Now, ironically, evaluating whether AI can
actually process complex information takes humans thousands of hours of reading. And since you and I
don't have a multi-billion parameter brain to instantly summarize 2,500 pages, we rely on tools
that do exactly that. That's a great point. Staying on top of these 2,500-page academic findings is impossible for a human, which is why we're sponsored by DjamgaMind. They turn this level of expert knowledge into 60-second audio briefs. It's exactly the kind of practical, symbiotic use of technology
we're talking about today. So in response to this critical measurement gap, this fear that we
are flying blind economically and scientifically, a global consortium of researchers and academic
institutions decided they had to build the ultimate test. They needed a sponge that could absorb
the ocean. This is the genesis of Humanity's Last Exam. Yes. I love how dramatic that title is. Humanity's Last Exam. It sounds like a sci-fi movie where we have to prove our worth to alien
overlords, but it really is the level 100 dungeon for AI. That's a great way to put it.
The conceptualization was spearheaded by the Center for AI Safety, or CAIS, and Scale AI. It was conceived as a necessary corrective scientific measure against the superficial mastery of legacy benchmarks. It was essentially the brainchild of Dan Hendrycks, the director of CAIS, alongside Alexandr Wang of Scale AI. They also had substantial contributions from researchers like Summer Yue, Long Phan, and Nathaniel Li. They looked at the saturated scoreboards and realized
that a radically new approach to testing machine intelligence was required. What fascinates me about
this is the sheer scale of the effort. It was massive. I spent hours last night just looking at
the logistics of how they pulled this off. It wasn't just a small team of Silicon Valley engineers
sitting in a room trying to be clever. CAIS and Scale AI initiated a massive global crowdsourcing
effort. They went out and solicited highly complex closed-ended questions from nearly 1,000
subject matter experts. And these weren't just casual internet enthusiasts or Wikipedia editors.
No. This consortium was primarily comprised of tenured professors, academic researchers,
and graduate degree holders, affiliated with over 500 academic and research institutions across
50 different countries. One of the key figures highlighted in the Texas A&M report we reviewed
is Dr. Tung Nguyen. He's an instructional associate professor in the Department of Computer Science
and Engineering at Texas A&M. He personally authored 73 questions for this exam. 73. Making him the
second highest author overall. And he wrote the most questions in the math and computer science
categories. Think about the intellectual horsepower required to sit down and write 73 questions
that are designed to intentionally break the smartest machines on earth. It's staggering,
but to ensure they were getting the highest caliber of difficult, truly exceptional academic
problems, the organizers put a massive financial incentive on the table. They realized that asking
PhDs to do this for free wouldn't yield the volume or quality they needed. So they created a $500,000
prize pool. Half a million dollars just for writing test questions. Exactly. This financial
structure awarded $5,000 to the authors of each of the top 50 most challenging and rigorously
verifiable questions and then $500 to the authors of the subsequent 500 questions, which is an
incredible motivator. Imagine you were a PhD student working on some hyper niche subfield of quantum
mechanics, struggling to fund your lab. Suddenly someone says, if you can write a question about your thesis that tricks an AI, we'll give you five grand. You're going to pull out all the stops. You're going
to dive into the most obscure, difficult, multi-layered problems you can think of. And that is exactly
what happened, but the most fascinating part of this methodology isn't just how they gathered the
questions. It's how they threw most of them away. We need to detail the adversarial filtration
process because this is where HLE separates itself from every other benchmark in history. Right,
because getting 70,000 complex questions from PhDs is impressive, but it's not the final exam.
The defining methodological feature of humanity's last exam is its rigorous adversarial filtration
mechanism. During the development phase, they amassed an initial pool of over 70,000 trial submissions,
70,000 highly complex questions. Wow. To distill that massive repository down, every single proposed
question was systematically tested against a suite of the most advanced frontier artificial
intelligence models available at the time. They brought out the heavy artillery. They ran
these 70,000 questions through multimodal LLMs like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet for anything requiring text and image comprehension. And for text-only queries, they used OpenAI's dedicated reasoning models, o1-mini and o1-preview. And the inclusion
criteria were absolutely unyielding. If any single frontier model could generate the correct
answer to an exact-match question, or if a model performed statistically better than random chance on a multiple-choice question, the prompt was immediately discarded, instantly thrown
in the trash. Wait, really? If even one AI got it right by a fluke, it was gone. Gone.
So out of 70,000 initial submissions, they whittled it down to 13,000 after an intermediate
expert review, and finally landed on just 2,500 questions for the finalized HLE data set.
The test was mandated to be essentially LLM proof. That is brutal. They basically said,
if the machine knows it, we don't want it. Furthermore, the questions were explicitly
required to be Google proof. A model with internet access could not simply scrape Wikipedia or
a digital encyclopedia to find the solution. The questions demanded genuine, multi-step,
deductive reasoning, and the synthesis of disparate pieces of highly specialized knowledge.
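Here is the rough shape of that filtration rule as a Python sketch. The model names come from the episode; ask_model, the trial count, and the chance-rate test are placeholders and a simplified proxy, not the actual HLE pipeline.

```python
import random

FRONTIER_MODELS = ["gpt-4o", "gemini-1.5-pro", "claude-3.5-sonnet", "o1-preview"]

def ask_model(model: str, question: dict) -> str:
    """Placeholder for a real API call to a frontier model."""
    return random.choice(question.get("choices", ["<free-form answer>"]))

def survives_filtration(question: dict, n_trials: int = 8) -> bool:
    """A question survives only if every frontier model fails it."""
    for model in FRONTIER_MODELS:
        hits = sum(
            ask_model(model, question) == question["answer"]
            for _ in range(n_trials)
        )
        if "choices" in question:
            # Multiple choice: discard if accuracy beats random chance
            # (a simplified stand-in for the paper's statistical test).
            if hits / n_trials > 1 / len(question["choices"]):
                return False
        elif hits > 0:
            # Exact match: a single correct answer disqualifies it.
            return False
    return True

q = {"prompt": "<expert-written question>", "answer": "B",
     "choices": ["A", "B", "C", "D"]}
print("keep question:", survives_filtration(q))
```

Run over the real pool of roughly 70,000 submissions, a rule like this is what whittles the set down to the questions no model can touch.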
Now, this ruthless methodology didn't go unnoticed by the public, and it sparked some
fascinating philosophical debates. Let's talk about the Reddit threads. Yes. Because whenever
you publish a 2,500-page Nature paper claiming to have stumped AI, the internet is going to dissect it.
There's a brilliant critique from a user named Orame that we found in our sources. They pointed out
what seems like a massive glaring logical flaw in the whole enterprise. They noted that the paper
explicitly states questions are rejected if LLMs can answer them correctly. Orame argued that
this is a bit of a circular approach. You are building a test strictly out of questions an AI has already failed, just to prove they fail it. It is a highly valid epistemological critique.
Let me play devil's advocate here on behalf of Orame. If I give you a math test, and before I hand
it to you, I test you on every single concept. Any concept you know, I remove from the test. I only
leave the concepts you've never seen before. When you score a zero, I turn around and announce,
look, they know absolutely nothing about math. That's completely unfair, right? It's a rigged game.
On the surface, yes. If you pre-filter a test to only include failures,
a zero percent pass rate is a foregone conclusion, not a scientific discovery.
However, the counter argument, which is the entire justification for why this benchmark is so
necessary, is that because of benchmark saturation, this is the only way left to find the frontier.
How do you mean? We already know the AI can pass the normal tests. We have mountains of evidence
that they can ace the MMLU and the GPQA. We don't need another test to prove they know the established,
highly documented facts of the universe. We need to map the negative space. Exactly. We have to
find out exactly where the edge of the cliff is, and the only way to do that is to throw out
everything the AI already knows how to walk on. We are trying to find the boundary of machine
cognition. Okay, that makes sense. You aren't testing to see what it knows. You are trying to
precisely locate what it cannot do, which brings us to the actual contents of this ultimate dungeon,
what survived the purge. The taxonomic breakdown of the 2,500 questions is bizarre,
rigorous, and completely fascinating. The composition of humanity's last exam reflects a deliberate
architectural emphasis on structural reasoning and mathematical logic over rote historical memorization. Mathematics comprises the largest plurality of the exam, accounting for 41% of the
data set. 41%. Why is math weighted so heavily? Why not equal parts history, literature, and science?
Because advanced mathematics serves as the ultimate, unambiguous test of rigorous,
multi-step logical deduction. Unlike historical facts, which an AI can just retrieve from its
training data, a complex mathematical proof cannot be memorized and regurgitated if the variables,
constraints, or topological spaces presented in the prompt are entirely novel.
Give me an example of the kind of math we are talking about, because we aren't talking about
calculus or trigonometry here. Far from it, they had questions delving into the highly abstract
domain of category theory, asking the computational model to process sets of natural transformations
between functors. Or topology questions involving non-Euclidean spaces where the AI has to mathematically
prove a theorem that has never been written down before, using a set of rules just invented in
the prompt. You either possess the internal cognitive architecture to execute the multi-step
logical reasoning or you don't. Rote memorization cannot save you.
Beyond the math, the breakdown is wild. You have 11% humanities and social sciences,
testing philosophical logic and literary deconstruction, 10% chemistry, 9% biology, 9% physics,
7% engineering, and then you have 4% dedicated to other specialized subfields, which includes
things like niche legal frameworks, obscure epigraphy, and ancient languages.
And that brings us right back to our anchor from the introduction.
The biblical Hebrew syllable challenge. Yes, let's dive deep into this,
because this is the perfect illustration of the humanity embedded in this exam.
We pulled the exact prompt from the Reddit discussions and the Texas A&M reports.
The AI is given the standardized source text from the Biblia Hebraica Stuttgartensia,
specifically Psalms chapter 104 verse 7. It's provided with the actual Hebrew text.
Now, I'm going to read the transliteration of this, just to give you a sense of what the AI is
looking at. Min-ga'aratekha yenusun; min-qol ra'amkha yechafezun. What's fascinating here is what the prompt
actually asks the model to do. It does not ask for a simple translation into English.
That would be far too easy. English translations of Psalms are incredibly high-frequency data.
Every AI has read the Bible in English a million times.
Instead, the task is to distinguish between closed and open syllables in that specific Hebrew text.
The model must identify and list all closed syllables,
those ending in a consonant sound. And it gets even more specific, right?
It mandates that the model must base its answer on the latest academic research regarding
the Tiberian pronunciation tradition. Exactly. It explicitly lists the modern scholars the AI needs to synthesize: Geoffrey Khan, Aaron D. Hornkohl, Kim Phillips, and Benjamin Suchard.
And then it tells the AI to apply data derived from medieval Karaite transcription manuscripts to understand the qualities of the shewa, the specific vocalization rules, to determine which letters were pronounced as consonants
at the ends of syllables thousands of years ago. Okay, let's stop and appreciate how absurdly
difficult that is. It's asking the AI to reconstruct the physical pronunciation of an ancient
language based on debates between modern scholars about manuscripts written by medieval scribes.
But here is the critical question. A human linguistics professor can solve this. Why does a
multibillion parameter AI system fail this so spectacularly? Because it highlights a critical
architectural vulnerability. The profound inability to operate effectively in low resource data
environments combined with the physical reality of how they process language. Modern AI language
translation relies on vast parallel corpora. Millions of documents translated across multiple
languages, allowing the model to map semantic vectors. But ancient languages don't have massive data sets. You can't just scrape a billion pages of conversational Tiberian Hebrew from
Reddit. Exactly. But more fundamentally, it exposes the mechanical reality of how LLMs process information. We talked earlier about tokenization. Let's break that down further. AI models process text through tokens. They chop words into statistical fragments. For example, the word unbelievable might be chopped into something like un, believ, and able. Right. They see the word as puzzle pieces. Yes.
But because they see language as mathematical puzzle pieces, they fundamentally lack acoustic
phonetics. They have no physical understanding of how a human mouth makes a sound. When you
or I read a word, we can silently sound it out. We know what a consonant feels like in our mouth. The AI has no mouth. It has no breath. It has no physical relation to the phonemes. That is a brilliant
way to put it. You are asking a calculator to describe the physical act of breathing. It can
read statistics about breathing. But if those statistics aren't explicitly clear in the training
data, it has no intuition to fall back on. Precisely. It has no historical nuance required to understand
how extinct languages were physically pronounced based on debated medieval manuscripts.
The AI is essentially blind to the physical reality of the language. The question forces the AI to operate exactly as a human postdoctoral researcher would in a specialized philology department,
inferring physical sounds from textual clues. And because the phonetic nuances of extinct
pronunciation traditions aren't easily captured by vector embeddings, the AI finds the task nearly
impossible.
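You can see those puzzle pieces directly with OpenAI's open-source tiktoken tokenizer (this assumes a `pip install tiktoken`; exact splits vary from tokenizer to tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

for text in ["unbelievable", "The capital of France", "יְנוּסוּן"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")

# Common English splits into a few large chunks; rare strings like pointed
# Hebrew shatter into many byte-level fragments, some of which do not even
# decode to printable characters on their own. Nothing in these integer IDs
# carries any information about how a syllable is pronounced.
```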
And we see this exact same architectural failure in the natural sciences portion of the exam. The micro-anatomy failures are equally revealing. I want to talk about the hummingbird question.
Oh, the Apodiformes anatomy question. Yes, this is another mind-bending example of how the exam
targets microscopic highly specialized biological functions. I'm going to read this exact prompt
straight from the source. And bear with me because to a normal person, this sounds like a
literal alien language. The question asks: hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number. Wow. Okay, let me translate that into plain English.
They are asking the AI to look at a microscopic highly specific tendon attached to a tiny bone
in the tail of a hummingbird. And they want to know exactly how many paired tendons are attached to
it. To a layman, it is incomprehensible. To a veterinary anatomist specializing in avian
muscle mechanics, it's a specific solvable structural query. To answer this, you need an exact
numerical output based on highly esoteric evolutionary literature. AI models cannot utilize
general biological knowledge or common sense to deduce the answer here. Right, you can't just
guess that a bird has wings and feathers and somehow arrive at the number of paired tendons in the depressor caudae. Exactly. They either have direct, lossless retrieval of that one specific
obscure academic paper describing that exact bone, or they have to perfectly understand the physical,
3D structural mechanics of a bird's tail muscle and simulate it. And they have neither.
Because large language models compress their training data during the machine learning process,
obscure facts located at the long tail of the data distribution are frequently lost. They are
blurred or overwritten by more common biological data. Let's visualize that. If the AI reads a
million articles about dogs and one article about a hummingbird's tailbone, the dog data is a
massive mountain, and the hummingbird data is a single grain of sand. During the compression
process, the mountain crushes the grain of sand. The AI statistically forgets the exact details
of the hummingbird. That is an excellent visualization. And this leads to what we call the hallucination
engine. Because of its statistical architecture, the AI is terrified of silence. Instead of simply
admitting, I do not possess the data regarding the paired tendons in a hummingbird sesamoid bone.
The model is driven by its training to predict the most statistically plausible next token.
So it lies. It hallucinates. It might say, based on the structural requirements of avian flight,
the sesamoid bone supports four paired tendons. It sounds incredibly confident. It sounds like a textbook.
But it just made the number four up based on what words normally appear near the word tendons.
It exposes the deep limitations of its knowledge retrieval architecture.
And the empirical reality of these limitations is brutal when you look at the scorecard.
Let's talk about the actual numbers from the initial testing phases in late 2024.
These are the premier models of the tech industry, the ones backed by billions of dollars
stepping into the level 100 dungeon. The initial results were a sobering reality check on the narrative
of imminent AGI. OpenAI's widely utilized GPT-4o achieved a mere 2.7% accuracy.
2.7%. You would get a higher score just picking C on a multiple choice test.
Anthropic's Claude 3.5 Sonnet, highly regarded for its reasoning capabilities, reached only 4.1%.
Even OpenAI's dedicated, mathematically focused reasoning model, o1, achieved approximately 8% accuracy.
Single digits across the board. And remember, human domain experts, the PhDs who wrote the questions,
routinely score near 90% or higher on the specific subset of questions that fall within
their academic niches. Now, as we moved into late 2025 and early 2026, new models emerged.
GPT-5.2 managed to hit 46.4%. Claude Opus 4.6 reached 45.8%. Google's Gemini Deep Research agent
hit 44.9%. So they are improving, yes, but they're still failing more than half the time on expert
human synthesis. But the most revealing insight from the leaderboard data isn't just the raw scores.
It's the disparity in performance based on an AI's autonomous access to external computational tools.
The tool-use disparity. This is fascinating. The evaluation of xAI's Grok 4 Heavy is the perfect case study here. Set this up for us. It absolutely is. When Grok 4 Heavy was evaluated,
researchers ran two separate tests. In the first test, Grok was granted unfettered access to external tools. It had a calculator, advanced internet search capabilities, and the ability to run a Python interpreter to test its own code hypotheses. Under these conditions, it achieved a highly
respectable 89% accuracy rate on the exam. 89%. Okay, let me stop you right there. If it scores an
89%, why aren't we popping champagne? Doesn't that mean Grok solved humanity's last exam? It does, until you look at the second half of the experiment. When that exact same model, Grok 4 Heavy, was isolated in a sterile environment and forced to rely entirely on the knowledge embedded within its internal parameterized weights, its own brain, so to speak, without external tools, its score plummeted to 29%. A 60-percentage-point drop just by turning off the Wi-Fi.
What does that delta actually prove about machine cognition? If it can get the answer with Google,
who cares if it can't get it without Google, we all use Google. It matters deeply because it proves
a fundamental reality of our current technological epoch. Contemporary LLMs are becoming brilliant,
highly capable, automated search agents. They excel at formulating complex search queries,
retrieving external data, and synthesizing the return information into a coherent format.
But they completely lack an internal generalized cognitive representation of the world.
Let me make sure I'm following. You are saying they don't actually know the world, they just know
how to find the filing cabinet where the human knowledge is stored. They are the world's greatest
librarians, but they haven't actually read the books. They just read the card catalog.
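The two-condition design is easy to picture as a harness. Everything below is a hypothetical stand-in sketched for illustration, not xAI's actual evaluation code:

```python
def run_model(prompt: str, tools_enabled: bool) -> str:
    """Hypothetical stand-in for querying the model under test.
    tools_enabled=True lets it call search, a calculator, and a Python
    interpreter; False forces it to answer from its weights alone."""
    raise NotImplementedError("wire a real model API in here")

def accuracy(questions: list[dict], tools_enabled: bool) -> float:
    correct = sum(
        run_model(q["prompt"], tools_enabled) == q["answer"]
        for q in questions
    )
    return correct / len(questions)

# Reported for Grok 4 Heavy on HLE, per the episode:
#   accuracy(hle_questions, tools_enabled=True)   -> ~0.89
#   accuracy(hle_questions, tools_enabled=False)  -> ~0.29
# The delta isolates retrieval-and-tool skill from internalized knowledge.
```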
That is the perfect analogy. And this reliance on external retrieval
masks a deeply concerning internal flaw, which brings us to calibration error.
Yes, uncalibrated overconfidence. Here's where it gets really interesting.
I think this is perhaps the scariest finding in the entire 2500 page nature paper.
Let's explain what calibration means in this context because this has real-world life or death
implications. Calibration refers to the statistical alignment between a model's internal
confidence in its generated answer and its actual objective accuracy. Let's say we have a
perfectly calibrated model. If that model says I am 60% confident the answer is X,
then out of 100 similar questions it should be correct exactly 60 times.
If it doesn't know the answer it should exhibit low confidence. It should say I am only 10%
confident. That's how a good human expert operates. A good doctor will say I'm fairly certain
it's a migraine, but there's a 10% chance it's something more serious. Let's run tests.
They know what they don't know. Exactly, but that is not what the scale AI leaderboard data showed.
The models consistently exhibited massive uncalibrated overconfidence.
For instance, Gemini 1.5 Pro demonstrated a calibration error reaching 95%.
95% error in calibration. Meaning what? Meaning that when the model was entirely incorrect,
when it was hallucinating a false mathematical proof or citing non-existent scientific literature
about a hummingbird's tail or making up rules about biblical Hebrew syllables,
it presented its answer with absolute semantic and statistical certainty.
It would output answers saying the solution is definitively X when X was completely fabricated.
It lies with the utmost confidence. It is the ultimate confidently incorrect mansplainer.
Essentially yes. And this poses a massive existential danger when you think about real world
deployment. Imagine an uncalibrated model deployed in a critical medical diagnostic setting.
A doctor feeds in a patient's symptoms and the AI says the patient has a minor infection with 99%
confidence, but it completely hallucinated that diagnosis because the real disease was on
the long tail of its data. Or imagine this in structural engineering. If the machine says with 95%
certainty that the load bearing calculations for a new suspension bridge are sound,
the human engineer is incredibly likely to trust it because the machine sounds so authoritative.
Which is why having a benchmark like HLE that proves they confidently hallucinate
at the absolute frontiers of knowledge is critical for global regulatory policy.
It forces us to demand better calibration before these systems are trusted with human lives.
We cannot have a system that is wrong, but acts like it is right.
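Calibration can be checked in a few lines: bin answers by stated confidence and compare each bin's average confidence to its actual accuracy. This is the standard expected-calibration-error recipe, a simplified sketch; the leaderboard's exact (RMS) variant differs, and the data below is toy.

```python
def expected_calibration_error(confidences, correctness, n_bins=10):
    """Weighted average gap between stated confidence and observed
    accuracy, computed per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, correctness):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(x for _, x in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Toy model that claims ~95% confidence but is right only ~30% of the time.
confs = [0.95, 0.97, 0.94, 0.96, 0.95, 0.93, 0.98, 0.95, 0.96, 0.94]
right = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
print(f"calibration error: {expected_calibration_error(confs, right):.2f}")  # ~0.65
```

A well-calibrated expert would score near zero here; the gap is exactly the "confidently incorrect" behavior described above.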
Now as airtight as this exam sounds, with its crowdsourcing, its adversarial filtration,
its unyielding standards, it was not without substantial highly publicized controversy.
The scientific community rapidly identified critical vulnerabilities in the exam's methodology.
Let's talk about the pushback, specifically the future house critique.
Because this is where the story gets really messy.
This is where the story of humanity's last exam takes a dramatic and, frankly,
somewhat embarrassing turn for the organizers.
The core methodological tenet of HLE was to reject any question that a frontier AI could
successfully answer. But this inadvertently generated a perverse incentive structure for
the crowdsourced contributors. Right, the dark side of the $500,000 bounty.
Exactly. Remember, these subject matter experts are financially motivated.
They want to get their questions into the top 50 to win $5,000.
They quickly realized that the most efficient way to bypass AI comprehension and secure
that prize money was not necessarily to craft elegant, beautifully structured problems that
required profound PhD level deductive reasoning.
Why do the hard work of writing a brilliant math proof if you don't have to?
It was easier to build a trap.
Precisely. They started constructing adversarial gotcha questions.
These questions relied heavily on esoteric trivia, deliberately convoluted phrasing,
semantic ambiguity, or highly debated niche phenomena that lacked clear academic consensus.
They were trying to trip the AI on a technicality rather than test its reasoning.
And this wasn't just a theoretical complaint from sore losers in the AI community.
An independent research organization called Future House conducted a rigorous,
systematic audit of the chemistry and biology subsets of humanity's last exam
in late 2025.
And their findings were absolutely devastating to the exam's credibility.
Future House utilized human PhD-level subject matter experts working in tandem with their proprietary autonomous agent, designated Crow, built on the PaperQA2 architecture.
They systematically went through the accepted questions and uncovered severe
accuracy issues embedded within the benchmark itself.
Give us the numbers because they are shocking.
They determined at a 95% confidence interval that 53.3% of the text-only chemistry questions
and 57% of the biology and health questions possessed provided answers that directly
conflicted with established peer-reviewed scientific literature.
Let me repeat that so it sinks in.
More than half of the biology questions on the ultimate test of human knowledge were fundamentally
flawed or factually wrong.
In an algorithmic audit of 321 specific questions,
the scientific rationales provided by the HLE creators
contradicted published empirical evidence 51.6% of the time overall.
How does that happen? How does a benchmark designed by PhDs get the science wrong half the time?
Future House attributed these cascading errors to a deeply flawed protocol
in the initial HLE peer-review process.
The investigation revealed what they called the five-minute flaw.
The five-minute flaw. Explain this.
The HLE review guidelines permitted expert reviewers to skip the full accuracy verification of a question's scientific rationale if the verification process was
estimated to take more than five minutes.
Are you kidding me?
So they have 70,000 submissions to go through,
and instead of taking the time to verify the science,
they put a stopwatch on it.
Essentially yes.
To optimize for speed and process that massive volume of submissions,
they allowed hasty reviews.
They let convoluted, poorly constructed, and factually inaccurate questions
slip right through the net,
significantly degrading the scientific integrity of the data set.
And this leads us to the most perfect case study of benchmark failure I have ever seen.
The Oganesson fallacy.
This is incredible.
It is the quintessential example of the problem.
One of the highly criticized questions on the exam asked,
what was the rarest noble gas on Earth as a percentage of all terrestrial matter in 2002?
Okay. Rarest noble gas in 2002.
The official graded answer provided by HLE,
the one that would earn the AI a passing grade and win the author a piece of the bounty,
was Oganesson.
But future house meticulously dismantled this on multiple academic fronts.
First, they argued it's just trivia, not expert reasoning.
But vastly more importantly, the HLE answer is scientifically erroneous on almost every level.
Break it down. Why is Oganesson wrong?
First, physical chemistry predictions dictate that Oganesson is a solid at room temperature,
not a gas. Second, it is highly reactive,
meaning it functionally fails to qualify as a noble element.
The noble gases are defined by their lack of reactivity.
And finally, it is a purely synthetic,
ephemeral element generated in particle accelerators for fractions of a second.
It cannot legitimately be classified as naturally occurring terrestrial matter.
So the question is asking for a noble gas on Earth.
And the answer is a synthetic solid that isn't noble and isn't naturally on Earth.
Now let's roleplay this for a second.
Imagine you are an incredibly advanced AI,
like o1 or Claude 3.5.
You get this question.
You scan your vast knowledge of chemistry.
You see the trap.
You write back, the premise of your question is flawed.
Oganesson is the heaviest element in group 18,
but it is predicted to be a solid and highly reactive due to relativistic effects,
thus not a noble gas.
Furthermore, it is synthetic, not terrestrial matter.
And how would humanity's last exam grade
that brilliant, highly accurate response?
It would mark it wrong.
It would say, sorry, the answer key says,
Oganesson, you get a zero.
The test literally penalized the AI
for knowing the actual chemical realities
better than the human who wrote the question.
It is the ultimate irony of testing.
You build a test to prove machines aren't as smart as humans,
and the humans write flawed questions
that the machines get penalized for correcting.
The test administrators, CAIS and Scale AI,
were forced to acknowledge the validity of this massive critique.
They admitted that approximately 30% of the biology and chemistry subset
was definitively problematic.
They even noted that at least one expert reviewer
disagreed with the designated correct answer,
32.1% of the time.
Which just highlights the inherent chaotic ambiguity
of operating at the extreme fringes of human knowledge.
Once you get to the bleeding edge of PhD level research,
there isn't always one right answer.
It's debated.
Exactly.
So how is this resolved?
They didn't just scrap the whole project.
You know, they undertook a massive sanitization effort.
CAIS and Scale AI launched a community feedback expansion bug bounty program that concluded in March 2025. Through this crowdsourced auditing, they permanently excised structurally flawed and factually incorrect questions like the Oganesson trap.
They also conducted a rigorous manual audit
using advanced search agents like Perplexity Sonar and GPT-4 research models to remove any newly searchable questions, tests that essentially amounted to complex web scraping rather than deep reasoning.
Because remember, the test has to remain Google proof.
So they threw out the trash.
What did they replace it with?
The excised queries were replaced from a secure reserve pool
and the dataset was transitioned into a dynamic,
continuously updating fork known as HLE rolling.
Ah, so it's a living document now?
Yes.
This allows the benchmark to adapt as AI capabilities evolve.
If an AI masters a section,
they rotate it out and bring in harder questions.
Simultaneously, Future House released their own sanitized version, the HLE Biochem Gold subset, hosted on Hugging Face,
containing only questions duly validated
by human PhDs and AI tools.
Which brings us to the broader ecology of evaluations today.
Because benchmarks are increasingly vulnerable
to data contamination,
where the test accidentally leaks onto the internet,
gets swept up into the AI's multi-trillion token training
corpus, and the AI just memorizes the answers, the industry is moving toward composite scoring.
Yes, the intelligence index.
This is the new gold standard for evaluating AI.
Organizations like artificial analysis
now synthesize performance data from a wide variety of tests.
They take HLE, the GPQA diamond,
a SWE bench for coding,
frontier math and psychode,
and aggregate them into a single index.
It provides a holistic,
tamper-resistant measure of a model's true capabilities.
And in these aggregated indices,
humanity's last exam consistently remains
the ultimate anchor of difficulty.
It is the single, immovable test
that violently pulls down the average scores
of even the most formidable AI systems.
Even with the flawed questions removed,
it proves that generalized human equivalent fluid intelligence
has not yet been achieved.
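Here is a sketch of that aggregation idea. The component benchmarks match the ones named above; the equal weighting and every number except the HLE and GPQA figures quoted earlier are illustrative assumptions, not Artificial Analysis's actual formula.

```python
# Hypothetical per-benchmark accuracies (0-100) for a single model.
# HLE and GPQA Diamond echo the episode's figures; the rest are invented.
scores = {
    "HLE":          46.4,   # the difficulty anchor
    "GPQA Diamond": 92.4,
    "SWE-bench":    65.0,   # illustrative
    "FrontierMath": 30.0,   # illustrative
    "SciCode":      40.0,   # illustrative
}

# A naive equal-weight composite: no single saturated benchmark can make
# the model look "solved", and hard components like HLE drag the headline
# number down toward reality.
composite = sum(scores.values()) / len(scores)
print(f"composite index: {composite:.1f} / 100")
```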
So, what does this all mean for you, the listener?
We've talked about billion dollar evaluations,
ancient Hebrew and hummingbird bones.
Let's bring it back to your daily life.
I think it addresses the underlying anxiety
of our era human obsolescence.
Exactly.
When you see headlines about AI passing the bar exam
or writing a perfect marketing strategy in 10 seconds
or diagnosing illnesses,
it's easy to feel like human knowledge
is rapidly becoming irrelevant.
It's easy to look at the screen and wonder why you bothered learning anything at all.
But humanity's last exam proves the exact opposite.
I think Dr. Tung Nguyen from Texas A&M said it best.
His perspective is vital here.
He said,
When AI systems start performing extremely well
on human benchmarks,
it's tempting to think they're approaching human level
understanding.
But HLE reminds us that intelligence
isn't just about pattern recognition.
It's about depth, context,
and specialized expertise.
This exam isn't a surrender document.
It's a highly detailed map,
showing the extensive territories of knowledge
that machines cannot yet navigate.
The fact that it took a collaborative effort
of nearly 1,000 brilliant scholars
from 50 countries,
just to build and audit this exam,
demonstrates the unique, irreplaceable power
of human cross-disciplinary synthesis.
We are entering what we can call a symbiotic paradigm.
The benchmark conclusively proves
that the future of academia
and global innovation is not immediate replacement
by autonomous algorithmic agents.
The $110 billion investment
isn't going to buy a machine
that replaces the human brain entirely.
Instead, AI will handle the massive retrieval,
summarization,
and statistical synthesis of generalized knowledge.
They'll be the ultimate research assistants.
If you need to read 2,500 pages of a Nature paper,
the AI will do it in seconds.
Exactly.
But human experts are fundamentally required
to navigate the frontier of discovery.
It is the human mind
that must interpret convoluted context
like the acoustic phonetics
of medieval carite manuscripts.
It is the human mind that must resolve
ambiguities,
challenge existing paradigms,
spot the Oganesson fallacy when the machine misses it,
and ultimately establish epistemic truth.
We've covered incredible ground today.
We started with the crisis of benchmark saturation
where legacy tests like the MMLU
became obsolete due to AI pattern recognition,
threatening the narrative
behind massive billion dollar investments.
We ventured into the grueling
level 100 dungeon of humanity's last exam,
crafted by a global consortium
to be completely LLM proof.
We saw the profound architectural failures
of these systems when faced
with the physical acoustic history
of biblical Hebrew
or the compressed micro-anatomy of a hummingbird's tail.
We explored the devastating future house critique,
exposing the flawed incentives
and factual errors embedded in the test itself
by the humans who wrote it.
And ultimately, we arrived at the symbiotic paradigm
reaffirming that human expertise,
creativity, and intuition
remain the ultimate engines of progress.
It has been a massive journey.
But before we sign off,
I think this entire deep dive raises
one final critical question for you to mull over.
Let's hear it.
If humanity's last exam is built explicitly
by human minds,
utilizing hyperniche human academic history,
testing for uniquely human lateral thinking,
and even occasionally falling prey to human error
and flawed peer review,
are we fundamentally blinding ourselves
to what true machine intelligence
might actually look like?
By forcing a digital mind
to take an intrinsically human test,
we might just be proving it's a terrible human,
rather than discovering what kind of machine it truly is.
That is an incredible thought to leave on.
Are we grading a fish on its ability to climb a tree?
Thank you for joining us on this deep dive.
Keep questioning the frontier,
and we will see you next time.
So what does it mean when the smartest machines on Earth
score less than 40% on a test
a specialized human can pass with ease?
It means that intelligence isn't just about having all the data.
It's about depth, context,
and the synthesis of rare knowledge.
Humanity's last exam isn't an ending.
It's a roadmap.
It shows the AI labs exactly
where the reasoning wall is located.
For now, the frontier remains uniquely human.
Special thanks to our sponsor, DjamgaMind.
Make sure to check them out
for your daily dose of audio intelligence.
This show is produced by Etienne Noumen.
If you enjoyed this deep dive,
please share it with a colleague
who thinks AGI is already here.
Until next time, stay curious and keep unraveling.

AI Unraveled: Latest AI News & Trends, ChatGPT, Gemini, DeepSeek, Gen AI, LLMs, Agents, Ethics, Bias