
🚀 Welcome to a Special Edition of AI Unraveled. Today, we aren't talking about chatbots; we are talking about the ceiling of machine intelligence. For years, AI has been "acing" every test we threw at it. But a global coalition of 1,000 scientists just hit back with Humanity’s Last Exam (HLE)—a benchmark specifically engineered to be impossible for today's AI.
This special episode is brought to you by DjamgaMind. In a world where AI benchmarks are being shattered every week, you need the signal through the noise. DjamgaMind turns massive academic papers like the Nature report on HLE into 60-second audio intelligence. Master the frontier of human knowledge while you're on the move at DjamgaMind.com.
Keywords: Humanity's Last Exam, HLE Benchmark, AI Reasoning Wall, AGI, Center for AI Safety, Scale AI, Texas A&M AI, Biblical Hebrew AI, Expert-Level AI, Nature Journal AI, LLM Saturation, MMLU Benchmark, GPT-5 Performance, Claude Opus 4.6, Gemini 3.1 Pro, DjamgaMind.
Credits: Produced by Etienne Noumen, Senior Software Engineer and AI Strategist.
🚀 Reach the Architects of the AI Revolution
Want to reach 60,000+ Enterprise Architects and C-Suite leaders? Download our 2026 Media Kit and see how we simulate your product for the technical buyer: https://djamgamind.com/ai
Connect with the host Etienne Noumen: https://www.linkedin.com/in/enoumen/
⚗️ PRODUCTION NOTE: We Practice What We Preach.
AI Unraveled is produced using a hybrid "Human-in-the-Loop" workflow. While all research, interviews, and strategic insights are curated by Etienne Noumen, we leverage advanced AI voice synthesis for our daily narration to ensure speed, consistency, and scale.
Welcome to a special edition of AI Unraveled. I'm your host, Etienne Noumen.
For the last three years, we've heard the same headlines: AI passes the bar exam; AI scores in the 99th percentile on the SAT. It felt like the machines were running out of things to learn.
But last week, the world's top researchers, led by Texas A&M and the Center for AI Safety,
decided to stop playing games. They released Humanity's Last Exam. It is 2,500 questions long.
It covers things no single human could ever know, from the intricacies of medieval Hebrew pronunciation to the physics of sesamoid bones in hummingbirds. And the result? The AI models that we thought were approaching AGI are failing miserably. This episode is sponsored by DjamgaMind. Stay ahead of the expert frontier with 60-second audio intelligence at DjamgaMind.com.
Today we are looking at the AI ceiling. We're going inside the exam that was built to protect
the expert frontier. Let's unravel why the machines are finally meeting their match.
Welcome back to The Deep Dive. Today, we are stepping into our roles as Senior AI Research
Analysts from the AI Unraveled Network. And we're bringing you a journey into the absolute
bleeding edge of machine cognition. Yeah, it really is. It's a genuinely fascinating time to be analyzing this space. It totally is. And we are taking a massive stack of sources today. I mean, we've got a 2,500-page academic evaluation published in the journal Nature, comprehensive reports from Texas A&M University, and some incredibly heated, deeply philosophical Reddit threads. Oh, the Reddit threads are fantastic. They really are. And we are distilling it all down for you.
So if you are a fellow learner who wants to understand exactly where the frontier of technology
actually sits today, you are in the right place. Absolutely. We are looking at a monumental
project known as Humanity's Last Exam, or HLE. And our mission today is straightforward but,
you know, profoundly important. Right. We need to critically examine this massive benchmark
to answer the ultimate question. Are the tech giants actually building artificial general
intelligence or have they simply spent hundreds of billions of dollars to scale up what are essentially
very expensive, highly articulate parrots? That really is the multi-billion-dollar question,
isn't it? It is. And to set the stage for you, I want to introduce the central mystery that
we are going to be unraveling throughout this deep dive. Let's call it the biblical Hebrew
syllable challenge. Ah, yes. That is a perfect anchor point for this entire discussion. It beautifully
encapsulates the exact problem researchers are wrestling with right now. Right. So I want you to
imagine a multi-billion-parameter AI system. This is a digital brain that has effectively ingested the entire internet. It can write flawless, complex Python code in seconds. It can summarize a 50-page Byzantine legal document instantly. Exactly. But when the same digital behemoth is asked
to analyze a single very specific line from Psalms chapter 104 verse 7, it completely and utterly
breaks down. Just totally fails. Yeah. Meanwhile, a human expert in ancient linguistics looks at
that exact same line and finds the logic to be entirely sound and deducible. Why does the machine fail where the human succeeds? Okay, let's unpack this. To understand why the machine fails
at the biblical Hebrew challenge, we first have to understand the existential crisis currently facing
the entire field of artificial intelligence research. And that crisis is something we call benchmark
saturation. Let's define that for the listener. What exactly is a benchmark in this context? And why
is it saturated? Because when I hear the word saturated, I just think of a sponge that can't hold
any more water. That's actually a great analogy. Think of the trajectory of AI research over the
past decade as a relentless accelerating cycle. Researchers create a novel computational benchmark,
essentially a highly complex standardized test. Right. Designed to test the absolute limits of
what a machine can do. That's the sponge. Exactly. Historically, we had standardized evaluations like
the Massive Multitask Language Understanding exam, widely known as the MMLU. Right. The MMLU. We also had Grade School Math 8K, or GSM8K, and HumanEval for coding.
So these tests were the gold standard. For a long time, if an AI couldn't pass the MMLU,
it wasn't considered frontier. They were these impassable barriers. Precisely. They were the
epistemological dividing lines. They were what we used to demarcate human cognitive flexibility
and expert level academic synthesis from mere machine pattern recognition. So it separated the
real thinkers from the calculators. Right. And if you think back just a few years ago,
AI models were scoring in the 30 or 40 percent range on the MMLU. It felt like a safe,
distant horizon. But over the last few years, we've hit a wall. The landscape of AI is experiencing
this destabilizing phenomenon of benchmark saturation. Yes. Contemporary, large language models
or LLMs, and these massive multimodal neural architectures are now routinely achieving accuracy
rates exceeding 90 percent on these legacy evaluations. Wait, let me push back on that for a second.
If an AI scores 90 percent on a graduate-level math test or aces a standardized bar exam,
isn't that just, I mean, isn't that proof that it's incredibly smart? You would think so.
Because you can't just guess your way through a graduate physics exam. Why is this a crisis?
The natural assumption for anyone reading the headlines is just wow, the AI is a mathematical genius.
You would absolutely think so. And that is the exact illusion that has captured the public imagination.
But it is a massive problem for the scientists trying to measure true intelligence. How so?
This unprecedented success rate has paradoxically rendered these tests completely obsolete
as precise scientific instruments. When state-of-the-art systems effectively conquer a benchmark,
the metric loses its foundational utility. Because it's saturated, the sponge can't hold any more
water so you can't use it to measure a flood. Exactly. The benchmark can no longer distinguish between
models that possess profound, generalizable reasoning architectures and those that have merely
internalized vast swaths of their training data. Ah, I see. They've memorized the study guide.
That's it. They are using parameterized memory retrieval and sophisticated statistical
guessing. So they aren't reasoning for the physics problem? No. If a question or a variation of that
question existed anywhere in the trillions of words the AI was trained on, the AI didn't solve
it. It just remembered it. Let's look at the GPQA diamond subset as a prime example of this
dynamic. Okay, the graduate-level Google-proof Q&A benchmark. And this test focuses exclusively on the hard sciences, right? Physics, biology, and chemistry. Yes. And the Diamond subset represents the 198 absolute hardest questions culled from that broader corpus. Right. The GPQA was supposed
to be the new incredibly high bar. The questions are designed so that even if you give a smart college
educated person access to Google and 30 minutes to search for the answer, they still shouldn't be
able to get it right. Exactly. And the statistics on the GPQA diamond are honestly mind blowing.
Our sources show that when you give this test to human domain experts, people with PhDs in these specific fields actively working in the discipline. What do they score? They score around 65%.
Wow. Only 65% for actual PhDs. Yes. And when you give it to highly skilled, internet-equipped non-experts, people who are allowed to spend half an hour researching a single question,
they only get 34%. Because the questions are so difficult, so esoteric, that even with the
entirety of the internet at your fingertips, a smart layman cannot figure it out without the
foundational interconnected knowledge of a PhD. Exactly. But then you look at the modern AI
models evaluated in 2026. GPT-5.2 achieved a staggering 92.4%. That's insane. And Claude Sonnet 4.5 hit 100% on the GPQA Diamond. They absolutely destroyed it. 100%. That beats the human
PhDs by 35 points. And this is where the illusion of intelligence becomes highly dangerous.
Let's break down that illusion, because it's not just whether they are smart or dumb. It's about how
they arrive at the answer. If a human gets 100% on a PhD physics test, we know how they did it.
They deduced it. How was the AI doing it? Despite their incredibly sophisticated capabilities,
large language models fundamentally operate as advanced prediction engines. In skeptical
academic circles, they are frequently characterized as fancy autocomplete. Fancy autocomplete. I love that term. It's accurate. They calculate the probabilistic distribution of the next token
in a sequence. They aren't thinking in the way you or I think. They're engaging in parameterized
memory retrieval. Hold on, explain that like I'm five. Parameterized memory retrieval sounds like something out of a sci-fi novel. Sure, think of it like this. When you ask an AI a question,
it doesn't ponder the meaning of the words. It takes your words, converts them into mathematical
values, and maps them onto a massive multi-dimensional web of associations it built during training.
Okay, so it turns my words into math coordinates. Right. It finds the coordinates of your question
and statistically predicts what words should physically follow those coordinates, based on the billions of texts it has read. So when they hit that 100% on the GPQA Diamond, it indicates that
for narrow, clearly defined scientific queries, the best LLMs have successfully synthesized their
training data better than a human can synthesize their own memory. Exactly, but it doesn't mean they
possess fluid intelligence. They don't have an internal model of how the universe works. They just
have an incredibly accurate map of how humans write about the universe.
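To make that next-token idea concrete, here is a minimal sketch in Python. The vocabulary and the raw scores are invented for illustration; a real model scores roughly 100,000 tokens using billions of learned parameters.

```python
import math

# Toy next-token prediction. All numbers here are invented for illustration.
vocab = ["Paris", "London", "Berlin", "the", "a"]
logits = [9.1, 3.2, 2.8, 1.5, 1.1]  # raw scores after "The capital of France is"

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.4f}")

# The model simply emits the highest-probability token. No pondering,
# no world model: a statistical bet on what text usually comes next.
print("prediction:", vocab[probs.index(max(probs))])
```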
And the stakes here are not just academic. We need to bring this to your doorstep as the listener, because this ties directly into
the global economy. Analysts and our sources have drawn incisive parallels between the current
fervor surrounding generative AI and the technological hype cycles of the past, specifically the dot-com bubble of the late 1990s and early 2000s. The comparison is highly relevant, and we should explore why. During the dot-com era, the sheer, undeniable potential of the internet drove massive speculative
financial investments into companies that possessed little more than a domain name and a theoretical
business model. You had companies like pets.com going public and achieving astronomical valuations
simply because they had a .com attached to their name. Right. And we all know how that ended.
A spectacular market collapse. Trillions of dollars in paper wealth vanished overnight.
Now the internet did eventually transform the global economy. Obviously we use it for everything
today. But the immediate claims of its capabilities in 1999 were vastly overstated
compared to the infrastructure that actually existed. And we are seeing a very similar frenzy right
now with large language models. The private sector has poured hundreds of billions of dollars into
scaling these models. We're talking about the massive $110 billion OpenAI funding round. Let's pause on that number. $110 billion. It is an almost unfathomable amount of capital flowing
into a single technological bet. For context that is a higher valuation than Ford Motor Company
or General Motors. The financial markets demand constant proof of progress to justify that kind of
capital. Investors aren't throwing around hundreds of billions of dollars for a slightly
better chatbot. No, they are investing because they believe AGI is imminent and that it will replace massive swaths of human labor. And that macroeconomic pressure has elevated the importance
of benchmarks from mere academic curiosities to critical indicators of corporate valuation.
This is the crux of the issue. If the benchmarks like MMLU or GPQA are flawed,
if they are saturated and easily gamed by statistical guessing, the entire economic
foundation of the current AI boom is called into question. If we are just scaling up expensive
parrots, that $110 billion valuation looks very shaky. It really does. The immense financial
investment necessitates empirical, rigorously adversarial validation of their capabilities,
not just reliance on legacy tests, which brings us to the sheer volume of information we are
trying to parse today to understand this crisis. We are going through a 2,500-page Nature paper, reports from Texas A&M, and deep academic critiques. Now, ironically, evaluating whether AI can
actually process complex information takes humans thousands of hours of reading. And since you and I
don't have a multi-billion parameter brain to instantly summarize 2,500 pages, we rely on tools
that do exactly that. That's a great point. Staying on top of these 2,500-page academic findings is impossible for a human, which is why we're sponsored by DjamgaMind. They turn this level of expert knowledge into 60-second audio briefs. It's exactly the kind of practical, symbiotic use of technology
we're talking about today. So in response to this critical measurement gap, this fear that we
are flying blind economically and scientifically, a global consortium of researchers and academic
institutions decided they had to build the ultimate test. They needed a sponge that could absorb
the ocean. This is the genesis of Humanity's Last Exam. Yes. I love how dramatic that title is. Humanity's Last Exam. It sounds like a sci-fi movie where we have to prove our worth to alien
overlords, but it really is the level 100 dungeon for AI. That's a great way to put it.
The conceptualization was spearheaded by the Center for AI Safety, or CAIS, and Scale AI. It was conceived as a necessary corrective scientific measure against the superficial mastery of legacy benchmarks. It was essentially the brainchild of Dan Hendrycks, the director of CAIS, alongside Alexandr Wang of Scale AI. They also had substantial contributions from researchers like Summer Yue, Long Phan, and Nathaniel Li. They looked at the saturated scoreboards and realized
that a radically new approach to testing machine intelligence was required. What fascinates me about
this is the sheer scale of the effort. It was massive. I spent hours last night just looking at
the logistics of how they pulled this off. It wasn't just a small team of Silicon Valley engineers
sitting in a room trying to be clever. CAIS and Scale AI initiated a massive global crowdsourcing
effort. They went out and solicited highly complex closed-ended questions from nearly 1,000
subject matter experts. And these weren't just casual internet enthusiasts or Wikipedia editors.
No. This consortium was primarily comprised of tenured professors, academic researchers,
and graduate degree holders, affiliated with over 500 academic and research institutions across
50 different countries. One of the key figures highlighted in the Texas A&M report we reviewed
is Dr. Tung Nguyen. He's an instructional associate professor in the Department of Computer Science
and Engineering at Texas A&M. He personally authored 73 questions for this exam. 73. Making him the
second highest author overall. And he wrote the most questions in the math and computer science
categories. Think about the intellectual horsepower required to sit down and write 73 questions
that are designed to intentionally break the smartest machines on earth. It's staggering,
but to ensure they were getting the highest caliber of difficult, truly exceptional academic
problems, the organizers put a massive financial incentive on the table. They realized that asking
PhDs to do this for free wouldn't yield the volume or quality they needed. So they created a $500,000
prize pool. Half a million dollars just for writing test questions. Exactly. This financial
structure awarded $5,000 to the authors of each of the top 50 most challenging and rigorously
verifiable questions and then $500 to the authors of the subsequent 500 questions, which is an
incredible motivator. Imagine you were a PhD student working on some hyper niche subfield of quantum
mechanics, struggling to fund your lab. Suddenly someone says, if you can write a question about your thesis that tricks an AI, we'll give you five grand. You're going to pull out all the stops. You're going
to dive into the most obscure, difficult, multi-layered problems you can think of. And that is exactly
what happened, but the most fascinating part of this methodology isn't just how they gathered the
questions. It's how they threw most of them away. We need to detail the adversarial filtration
process because this is where HLE separates itself from every other benchmark in history. Right,
because getting 70,000 complex questions from PhDs is impressive, but it's not the final exam.
The defining methodological feature of humanity's last exam is its rigorous adversarial filtration
mechanism. During the development phase, they amassed an initial pool of over 70,000 trial submissions,
70,000 highly complex questions. Wow. To distill that massive repository down, every single proposed
question was systematically tested against a suite of the most advanced frontier artificial
intelligence models available at the time. They brought out the heavy artillery. They ran
these 70,000 questions through multimodal LLMs like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet for anything requiring text and image comprehension. And for text-only queries, they used OpenAI's dedicated reasoning models, o1-mini and o1-preview. And the inclusion
criteria were absolutely unyielding. If any single frontier model could generate the correct
answer to an exact-match question, or if a model performed statistically better than random chance on a multiple-choice question, the prompt was immediately discarded, instantly thrown
in the trash. Wait, really? If even one AI got it right by a fluke, it was gone. Gone.
So out of 70,000 initial submissions, they whittled it down to 13,000 after an intermediate
expert review, and finally landed on just 2,500 questions for the finalized HLE data set.
The test was mandated to be essentially LLM proof. That is brutal. They basically said,
if the machine knows it, we don't want it. Furthermore, the questions were explicitly
required to be Google proof. A model with internet access could not simply scrape Wikipedia or
a digital encyclopedia to find the solution. The questions demanded genuine, multi-step,
deductive reasoning, and the synthesis of disparate pieces of highly specialized knowledge.
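Here is the rough shape of that filtration rule as a Python sketch. The model names come from the episode; ask_model, the trial count, and the chance-rate test are placeholders and a simplified proxy, not the actual HLE pipeline.

```python
import random

FRONTIER_MODELS = ["gpt-4o", "gemini-1.5-pro", "claude-3.5-sonnet", "o1-preview"]

def ask_model(model: str, question: dict) -> str:
    """Placeholder for a real API call to a frontier model."""
    return random.choice(question.get("choices", ["<free-form answer>"]))

def survives_filtration(question: dict, n_trials: int = 8) -> bool:
    """A question survives only if every frontier model fails it."""
    for model in FRONTIER_MODELS:
        hits = sum(
            ask_model(model, question) == question["answer"]
            for _ in range(n_trials)
        )
        if "choices" in question:
            # Multiple choice: discard if accuracy beats random chance
            # (a simplified stand-in for the paper's statistical test).
            if hits / n_trials > 1 / len(question["choices"]):
                return False
        elif hits > 0:
            # Exact match: a single correct answer disqualifies it.
            return False
    return True

q = {"prompt": "<expert-written question>", "answer": "B",
     "choices": ["A", "B", "C", "D"]}
print("keep question:", survives_filtration(q))
```

Run over the real pool of roughly 70,000 submissions, a rule like this is what whittles the set down to the questions no model can touch.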
Now, this ruthless methodology didn't go unnoticed by the public, and it sparked some
fascinating philosophical debates. Let's talk about the Reddit threads. Yes. Because whenever
you publish a 2,500-page Nature paper claiming to have stumped AI, the internet is going to dissect it.
There's a brilliant critique from a user named Orame that we found in our sources. They pointed out
what seems like a massive glaring logical flaw in the whole enterprise. They noted that the paper
explicitly states questions are rejected if LLMs can answer them correctly. Orame argued that
this is a bit of a circular approach. You are building a test strictly out of questions an AI has already failed, just to prove they fail it. It is a highly valid epistemological critique.
Let me play devil's advocate here on behalf of Orame. If I give you a math test, and before I hand
it to you, I test you on every single concept. Any concept you know, I remove from the test. I only
leave the concepts you've never seen before. When you score a zero, I turn around and announce,
look, they know absolutely nothing about math. That's completely unfair, right? It's a rigged game.
On the surface, yes. If you pre-filter a test to only include failures,
a zero percent pass rate is a foregone conclusion, not a scientific discovery.
However, the counter argument, which is the entire justification for why this benchmark is so
necessary, is that because of benchmark saturation, this is the only way left to find the frontier.
How do you mean? We already know the AI can pass the normal tests. We have mountains of evidence
that they can ace the MMLU and the GPQA. We don't need another test to prove they know the established,
highly documented facts of the universe. We need to map the negative space. Exactly. We have to
find out exactly where the edge of the cliff is, and the only way to do that is to throw out
everything the AI already knows how to walk on. We are trying to find the boundary of machine
cognition. Okay, that makes sense. You aren't testing to see what it knows. You are trying to
precisely locate what it cannot do, which brings us to the actual contents of this ultimate dungeon,
what survived the purge. The taxonomic breakdown of the 2,500 questions is bizarre,
rigorous, and completely fascinating. The composition of humanity's last exam reflects a deliberate
architectural emphasis on structural reasoning and mathematical logic over rote historical memorization. Mathematics comprises the largest plurality of the exam, accounting for 41% of the
data set. 41%. Why is math weighted so heavily? Why not equal parts history, literature, and science?
Because advanced mathematics serves as the ultimate, unambiguous test of rigorous,
multi-step logical deduction. Unlike historical facts, which an AI can just retrieve from its
training data, a complex mathematical proof cannot be memorized and regurgitated if the variables,
constraints, or topological spaces presented in the prompt are entirely novel.
Give me an example of the kind of math we are talking about, because we aren't talking about
calculus or trigonometry here. Far from it, they had questions delving into the highly abstract
domain of category theory, asking the computational model to process sets of natural transformations
between functors. Or topology questions involving non-Euclidean spaces where the AI has to mathematically
prove a theorem that has never been written down before, using a set of rules just invented in
the prompt. You either possess the internal cognitive architecture to execute the multi-step
logical reasoning or you don't. Rote memorization cannot save you.
Beyond the math, the breakdown is wild. You have 11% humanities and social sciences,
testing philosophical logic and literary deconstruction, 10% chemistry, 9% biology, 9% physics,
7% engineering, and then you have 4% dedicated to other specialized subfields, which includes
things like niche legal frameworks, obscure epigraphy, and ancient languages.
And that brings us right back to our anchor from the introduction.
The biblical Hebrew syllable challenge. Yes, let's dive deep into this,
because this is the perfect illustration of the humanity embedded in this exam.
We pulled the exact prompt from the Reddit discussions and the Texas A&M reports.
The AI is given the standardized source text from the Biblia Hebraica Stuttgartensia,
specifically Psalms chapter 104 verse 7. It's provided with the actual Hebrew text.
Now, I'm going to read the transliteration of this, just to give you a sense of what the AI is
looking at. Min-ga'aratekha yenusun; min-qol ra'amkha yechafezun. What's fascinating here is what the prompt
actually asks the model to do. It does not ask for a simple translation into English.
That would be far too easy. English translations of Psalms are incredibly high-frequency data.
Every AI has read the Bible in English a million times.
Instead, the task is to distinguish between closed and open syllables in that specific Hebrew text.
The model must identify and list all closed syllables,
those ending in a consonant sound. And it gets even more specific, right?
It mandates that the model must base its answer on the latest academic research regarding
the Tiberian pronunciation tradition. Exactly. It explicitly lists the modern scholars the AI needs to synthesize: Geoffrey Khan, Aaron D. Hornkohl, Kim Phillips, and Benjamin Suchard.
And then it tells the AI to apply data derived from medieval Karaite transcription manuscripts to understand the qualities of the shewa, the specific vocalization rules, to determine which letters were pronounced as consonants
at the ends of syllables thousands of years ago. Okay, let's stop and appreciate how absurdly
difficult that is. It's asking the AI to reconstruct the physical pronunciation of an ancient
language based on debates between modern scholars about manuscripts written by medieval scribes.
But here is the critical question. A human linguistics professor can solve this. Why does a
multibillion parameter AI system fail this so spectacularly? Because it highlights a critical
architectural vulnerability. The profound inability to operate effectively in low resource data
environments combined with the physical reality of how they process language. Modern AI language
translation relies on vast parallel corpora. Millions of documents translated across multiple
languages, allowing the model to map semantic vectors. But ancient languages don't have massive data sets. You can't just scrape a billion pages of conversational Tiberian Hebrew from
Reddit. Exactly. But more fundamentally, it exposes the mechanical reality of how LLMs process information. We talked earlier about tokenization. Let's break that down further. AI models process text through tokens. They chop words into statistical fragments. For example, the word unbelievable might be chopped into something like un, believ, and able. Right. They see the word as puzzle pieces. Yes.
But because they see language as mathematical puzzle pieces, they fundamentally lack acoustic
phonetics. They have no physical understanding of how a human mouth makes a sound. When you
or I read a word, we can silently sound it out. We know what a consonant feels like in our mouth. The AI has no mouth. It has no breath. It has no physical relation to the phonemes. That is a brilliant
way to put it. You are asking a calculator to describe the physical act of breathing. It can
read statistics about breathing. But if those statistics aren't explicitly clear in the training
data, it has no intuition to fall back on. Precisely. It has no historical nuance required to understand
how extinct languages were physically pronounced based on debated medieval manuscripts.
The AI is essentially blind to the physical reality of the language. The question forces the AI to operate exactly as a human postdoctoral researcher would in a specialized philology department,
inferring physical sounds from textual clues. And because the phonetic nuances of extinct
pronunciation traditions aren't easily captured by vector embeddings, the AI finds the task nearly
impossible.
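You can see those puzzle pieces directly with OpenAI's open-source tiktoken tokenizer (this assumes a `pip install tiktoken`; exact splits vary from tokenizer to tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

for text in ["unbelievable", "The capital of France", "יְנוּסוּן"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")

# Common English splits into a few large chunks; rare strings like pointed
# Hebrew shatter into many byte-level fragments, some of which do not even
# decode to printable characters on their own. Nothing in these integer IDs
# carries any information about how a syllable is pronounced.
```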
And we see this exact same architectural failure in the natural sciences portion of the exam. The micro-anatomy failures are equally revealing. I want to talk about the hummingbird question.
Oh, the Apodiformes anatomy question. Yes, this is another mind-bending example of how the exam
targets microscopic highly specialized biological functions. I'm going to read this exact prompt
straight from the source. And bear with me because to a normal person, this sounds like a
literal alien language. The question asks: hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number. Wow. Okay, let me translate that into plain English.
They are asking the AI to look at a microscopic highly specific tendon attached to a tiny bone
in the tail of a hummingbird. And they want to know exactly how many paired tendons are attached to
it. To a layman, it is incomprehensible. To a veterinary anatomist specializing in avian
muscle mechanics, it's a specific solvable structural query. To answer this, you need an exact
numerical output based on highly esoteric evolutionary literature. AI models cannot utilize
general biological knowledge or common sense to deduce the answer here. Right, you can't just
guess that a bird has wings and feathers and somehow arrive at the number of paired tendons in the depressor caudae. Exactly. They either have direct, lossless retrieval of that one specific
obscure academic paper describing that exact bone, or they have to perfectly understand the physical,
3D structural mechanics of a bird's tail muscle and simulate it. And they have neither.
Because large language models compress their training data during the machine learning process,
obscure facts located at the long tail of the data distribution are frequently lost. They are
blurred or overwritten by more common biological data. Let's visualize that. If the AI reads a
million articles about dogs and one article about a hummingbird's tailbone, the dog data is a
massive mountain, and the hummingbird data is a single grain of sand. During the compression
process, the mountain crushes the grain of sand. The AI statistically forgets the exact details
of the hummingbird. That is an excellent visualization. And this leads to what we call the hallucination
engine. Because of its statistical architecture, the AI is terrified of silence. Instead of simply
admitting, I do not possess the data regarding the paired tendons in a hummingbird sesamoid bone.
The model is driven by its training to predict the most statistically plausible next token.
So it lies. It hallucinates. It might say, based on the structural requirements of avian flight,
the sesamoid bone supports four paired tendons. It sounds incredibly confident. It sounds like a textbook.
But it just made the number four up based on what words normally appear near the word tendons.
It exposes the deep limitations of its knowledge retrieval architecture.
And the empirical reality of these limitations is brutal when you look at the scorecard.
Let's talk about the actual numbers from the initial testing phases in late 2024.
These are the premier models of the tech industry, the ones backed by billions of dollars
stepping into the level 100 dungeon. The initial results were a sobering reality check on the narrative
of imminent AGI. OpenAI's widely utilized GPT-4o achieved a mere 2.7% accuracy.
2.7%. You would get a higher score just picking C on a multiple choice test.
Anthropic's Claude 3.5 Sonnet, highly regarded for its reasoning capabilities, reached only 4.1%.
Even OpenAI's dedicated, mathematically focused reasoning model, o1, achieved approximately 8% accuracy.
Single digits across the board. And remember, human domain experts, the PhDs who wrote the questions,
routinely score near 90% or higher on the specific subset of questions that fall within
their academic niches. Now, as we moved into late 2025 and early 2026, new models emerged.
GPT-5.2 managed to hit 46.4%. Claude Opus 4.6 reached 45.8%. Google's Gemini Deep Research agent
hit 44.9%. So they are improving, yes, but they're still failing more than half the time on expert
human synthesis. But the most revealing insight from the leaderboard data isn't just the raw scores.
It's the disparity in performance based on an AI's autonomous access to external computational tools.
The tool-use disparity. This is fascinating. The evaluation of xAI's Grok 4 Heavy is the perfect case study here. Set this up for us. It absolutely is. When Grok 4 Heavy was evaluated,
researchers ran two separate tests. In the first test, Grok was granted unfettered access to external tools. It had a calculator, advanced internet search capabilities, and the ability to run a Python interpreter to test its own code hypotheses. Under these conditions, it achieved a highly
respectable 89% accuracy rate on the exam. 89%. Okay, let me stop you right there. If it scores an
89%, why aren't we popping champagne? Doesn't that mean Grok solved humanity's last exam? It does, until you look at the second half of the experiment. When that exact same model, Grok 4 Heavy, was isolated in a sterile environment and forced to rely entirely on the knowledge embedded within its internal parameterized weights, its own brain, so to speak, without external tools, its score plummeted to 29%. A 60-percentage-point drop just by turning off the Wi-Fi.
What does that delta actually prove about machine cognition? If it can get the answer with Google,
who cares if it can't get it without Google, we all use Google. It matters deeply because it proves
a fundamental reality of our current technological epoch. Contemporary LLMs are becoming brilliant,
highly capable, automated search agents. They excel at formulating complex search queries,
retrieving external data, and synthesizing the return information into a coherent format.
But they completely lack an internal generalized cognitive representation of the world.
Let me make sure I'm following. You are saying they don't actually know the world, they just know
how to find the filing cabinet where the human knowledge is stored. They are the world's greatest
librarians, but they haven't actually read the books. They just read the card catalog.
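The two-condition design is easy to picture as a harness. Everything below is a hypothetical stand-in sketched for illustration, not xAI's actual evaluation code:

```python
def run_model(prompt: str, tools_enabled: bool) -> str:
    """Hypothetical stand-in for querying the model under test.
    tools_enabled=True lets it call search, a calculator, and a Python
    interpreter; False forces it to answer from its weights alone."""
    raise NotImplementedError("wire a real model API in here")

def accuracy(questions: list[dict], tools_enabled: bool) -> float:
    correct = sum(
        run_model(q["prompt"], tools_enabled) == q["answer"]
        for q in questions
    )
    return correct / len(questions)

# Reported for Grok 4 Heavy on HLE, per the episode:
#   accuracy(hle_questions, tools_enabled=True)   -> ~0.89
#   accuracy(hle_questions, tools_enabled=False)  -> ~0.29
# The delta isolates retrieval-and-tool skill from internalized knowledge.
```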
That is the perfect analogy. And this reliance on external retrieval
masks a deeply concerning internal flaw, which brings us to calibration error.
Yes, uncalibrated overconfidence. Here's where it gets really interesting.
I think this is perhaps the scariest finding in the entire 2500 page nature paper.
Let's explain what calibration means in this context because this has real-world life or death
implications. Calibration refers to the statistical alignment between a model's internal
confidence in its generated answer and its actual objective accuracy. Let's say we have a
perfectly calibrated model. If that model says I am 60% confident the answer is X,
then out of 100 similar questions it should be correct exactly 60 times.
If it doesn't know the answer it should exhibit low confidence. It should say I am only 10%
confident. That's how a good human expert operates. A good doctor will say I'm fairly certain
it's a migraine, but there's a 10% chance it's something more serious. Let's run tests.
They know what they don't know. Exactly, but that is not what the scale AI leaderboard data showed.
The models consistently exhibited massive uncalibrated overconfidence.
For instance, Gemini 1.5 Pro demonstrated a calibration error reaching 95%.
95% error in calibration. Meaning what? Meaning that when the model was entirely incorrect,
when it was hallucinating a false mathematical proof or citing non-existent scientific literature
about a hummingbird's tail or making up rules about biblical Hebrew syllables,
it presented its answer with absolute semantic and statistical certainty.
It would output answers saying the solution is definitively X when X was completely fabricated.
It lies with the utmost confidence. It is the ultimate confidently incorrect mansplainer.
Essentially yes. And this poses a massive existential danger when you think about real world
deployment. Imagine an uncalibrated model deployed in a critical medical diagnostic setting.
A doctor feeds in a patient's symptoms and the AI says the patient has a minor infection with 99%
confidence, but it completely hallucinated that diagnosis because the real disease was on
the long tail of its data. Or imagine this in structural engineering. If the machine says with 95%
certainty that the load bearing calculations for a new suspension bridge are sound,
the human engineer is incredibly likely to trust it because the machine sounds so authoritative.
Which is why having a benchmark like HLE that proves they confidently hallucinate
at the absolute frontiers of knowledge is critical for global regulatory policy.
It forces us to demand better calibration before these systems are trusted with human lives.
We cannot have a system that is wrong, but acts like it is right.
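Calibration can be checked in a few lines: bin answers by stated confidence and compare each bin's average confidence to its actual accuracy. This is the standard expected-calibration-error recipe, a simplified sketch; the leaderboard's exact (RMS) variant differs, and the data below is toy.

```python
def expected_calibration_error(confidences, correctness, n_bins=10):
    """Weighted average gap between stated confidence and observed
    accuracy, computed per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, correctness):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(x for _, x in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Toy model that claims ~95% confidence but is right only ~30% of the time.
confs = [0.95, 0.97, 0.94, 0.96, 0.95, 0.93, 0.98, 0.95, 0.96, 0.94]
right = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
print(f"calibration error: {expected_calibration_error(confs, right):.2f}")  # ~0.65
```

A well-calibrated expert would score near zero here; the gap is exactly the "confidently incorrect" behavior described above.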
Now as airtight as this exam sounds, with its crowdsourcing, its adversarial filtration,
its unyielding standards, it was not without substantial highly publicized controversy.
The scientific community rapidly identified critical vulnerabilities in the exam's methodology.
Let's talk about the pushback, specifically the future house critique.
Because this is where the story gets really messy.
This is where the story of humanity's last exam takes a dramatic and, frankly,
somewhat embarrassing turn for the organizers.
The core methodological tenet of HLE was to reject any question that a frontier AI could
successfully answer. But this inadvertently generated a perverse incentive structure for
the crowdsourced contributors. Right, the dark side of the $500,000 bounty.
Exactly. Remember, these subject matter experts are financially motivated.
They want to get their questions into the top 50 to win $5,000.
They quickly realized that the most efficient way to bypass AI comprehension and secure
that prize money was not necessarily to craft elegant, beautifully structured problems that
required profound PhD level deductive reasoning.
Why do the hard work of writing a brilliant math proof if you don't have to?
It was easier to build a trap.
Precisely. They started constructing adversarial gotcha questions.
These questions relied heavily on esoteric trivia, deliberately convoluted phrasing,
semantic ambiguity, or highly debated niche phenomena that lacked clear academic consensus.
They were trying to trip the AI on a technicality rather than test its reasoning.
And this wasn't just a theoretical complaint from sore losers in the AI community.
An independent research organization called Future House conducted a rigorous,
systematic audit of the chemistry and biology subsets of humanity's last exam
in late 2025.
And their findings were absolutely devastating to the exam's credibility.
Future House utilized human PhD-level subject matter experts working in tandem with their proprietary autonomous agent, designated Crow, built on the PaperQA2 architecture.
They systematically went through the accepted questions and uncovered severe
accuracy issues embedded within the benchmark itself.
Give us the numbers because they are shocking.
They determined at a 95% confidence interval that 53.3% of the text-only chemistry questions
and 57% of the biology and health questions possessed provided answers that directly
conflicted with established peer-reviewed scientific literature.
Let me repeat that so it sinks in.
More than half of the biology questions on the ultimate test of human knowledge were fundamentally
flawed or factually wrong.
In an algorithmic audit of 321 specific questions,
the scientific rationales provided by the HLE creators
contradicted published empirical evidence 51.6% of the time overall.
How does that happen? How does a benchmark designed by PhDs get the science wrong half the time?
Future House attributed these cascading errors to a deeply flawed protocol
in the initial HLE peer-review process.
The investigation revealed what they called the five-minute flaw.
The five-minute flaw. Explain this.
The HLE review guidelines permitted expert reviewers to skip the full accuracy verification of a question's scientific rationale if the verification process was
estimated to take more than five minutes.
Are you kidding me?
So they have 70,000 submissions to go through,
and instead of taking the time to verify the science,
they put a stopwatch on it.
Essentially yes.
To optimize for speed and process that massive volume of submissions,
they allowed hasty reviews.
They let convoluted, poorly constructed, and factually inaccurate questions
slip right through the net,
significantly degrading the scientific integrity of the data set.
And this leads us to the most perfect case study of benchmark failure I have ever seen.
The Oganesson fallacy.
This is incredible.
It is the quintessential example of the problem.
One of the highly criticized questions on the exam asked,
what was the rarest noble gas on Earth as a percentage of all terrestrial matter in 2002?
Okay. Rarest noble gas in 2002.
The official graded answer provided by HLE,
the one that would earn the AI a passing grade and win the author a piece of the bounty,
was Oganesson.
But future house meticulously dismantled this on multiple academic fronts.
First, they argued it's just trivia, not expert reasoning.
But vastly more importantly, the HLE answer is scientifically erroneous on almost every level.
Break it down. Why is Oganesson wrong?
First, physical chemistry predictions dictate that Oganesson is a solid at room temperature,
not a gas. Second, it is highly reactive,
meaning it functionally fails to qualify as a noble element.
The noble gases are defined by their lack of reactivity.
And finally, it is a purely synthetic,
ephemeral element generated in particle accelerators for fractions of a second.
It cannot legitimately be classified as naturally occurring terrestrial matter.
So the question is asking for a noble gas on Earth.
And the answer is a synthetic solid that isn't noble and isn't naturally on Earth.
Now let's roleplay this for a second.
Imagine you are an incredibly advanced AI,
like o1 or Claude 3.5.
You get this question.
You scan your vast knowledge of chemistry.
You see the trap.
You write back, the premise of your question is flawed.
Oganesson is the heaviest element in group 18,
but it is predicted to be a solid and highly reactive due to relativistic effects,
thus not a noble gas.
Furthermore, it is synthetic, not terrestrial matter.
And how would humanity's last exam grade
that brilliant, highly accurate response?
It would mark it wrong.
It would say, sorry, the answer key says,
Oganesson, you get a zero.
The test literally penalized the AI
for knowing the actual chemical realities
better than the human who wrote the question.
It is the ultimate irony of testing.
You build a test to prove machines aren't as smart as humans,
and the humans write flawed questions
that the machines get penalized for correcting.
The test administrators, CAIS and Scale AI,
were forced to acknowledge the validity of this massive critique.
They admitted that approximately 30% of the biology and chemistry subset
was definitively problematic.
They even noted that at least one expert reviewer
disagreed with the designated correct answer,
32.1% of the time.
Which just highlights the inherent chaotic ambiguity
of operating at the extreme fringes of human knowledge.
Once you get to the bleeding edge of PhD level research,
there isn't always one right answer.
It's debated.
Exactly.
So how is this resolved?
They didn't just scrap the whole project.
You know, they undertook a massive sanitization effort.
CAIS and Scale AI launched a community feedback expansion bug bounty program that concluded in March 2025. Through this crowdsourced auditing, they permanently excised structurally flawed and factually incorrect questions like the Oganesson trap.
They also conducted a rigorous manual audit
using advanced search agents like Perplexity Sonar and GPT-4 research models to remove any newly searchable questions, tests that essentially amounted to complex web scraping rather than deep reasoning.
Because remember, the test has to remain Google proof.
So they threw out the trash.
What did they replace it with?
The excised queries were replaced from a secure reserve pool
and the dataset was transitioned into a dynamic,
continuously updating fork known as HLE rolling.
Ah, so it's a living document now?
Yes.
This allows the benchmark to adapt as AI capabilities evolve.
If an AI masters a section,
they rotate it out and bring in harder questions.
Simultaneously, Future House released their own sanitized version, the HLE Biochem Gold subset, hosted on Hugging Face,
containing only questions duly validated
by human PhDs and AI tools.
Which brings us to the broader ecology of evaluations today.
Because benchmarks are increasingly vulnerable
to data contamination,
where the test accidentally leaks onto the internet,
gets swept up into the AI's multi-trillion token training
corpus, and the AI just memorizes the answers, the industry is moving toward composite scoring.
Yes, the intelligence index.
This is the new gold standard for evaluating AI.
Organizations like artificial analysis
now synthesize performance data from a wide variety of tests.
They take HLE, the GPQA diamond,
a SWE bench for coding,
frontier math and psychode,
and aggregate them into a single index.
It provides a holistic,
tamper-resistant measure of a model's true capabilities.
And in these aggregated indices,
humanity's last exam consistently remains
the ultimate anchor of difficulty.
It is the single, immovable test
that violently pulls down the average scores
of even the most formidable AI systems.
Even with the flawed questions removed,
it proves that generalized human equivalent fluid intelligence
has not yet been achieved.
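Here is a sketch of that aggregation idea. The component benchmarks match the ones named above; the equal weighting and every number except the HLE and GPQA figures quoted earlier are illustrative assumptions, not Artificial Analysis's actual formula.

```python
# Hypothetical per-benchmark accuracies (0-100) for a single model.
# HLE and GPQA Diamond echo the episode's figures; the rest are invented.
scores = {
    "HLE":          46.4,   # the difficulty anchor
    "GPQA Diamond": 92.4,
    "SWE-bench":    65.0,   # illustrative
    "FrontierMath": 30.0,   # illustrative
    "SciCode":      40.0,   # illustrative
}

# A naive equal-weight composite: no single saturated benchmark can make
# the model look "solved", and hard components like HLE drag the headline
# number down toward reality.
composite = sum(scores.values()) / len(scores)
print(f"composite index: {composite:.1f} / 100")
```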
So, what does this all mean for you, the listener?
We've talked about billion dollar evaluations,
ancient Hebrew and hummingbird bones.
Let's bring it back to your daily life.
I think it addresses the underlying anxiety
of our era human obsolescence.
Exactly.
When you see headlines about AI passing the bar exam
or writing a perfect marketing strategy in 10 seconds
or diagnosing illnesses,
it's easy to feel like human knowledge
is rapidly becoming irrelevant.
It's easy to look at the screen and wonder why you bothered learning anything at all.
But humanity's last exam proves the exact opposite.
I think Dr. Tung Nguyen from Texas A&M said it best.
His perspective is vital here.
He said,
When AI systems start performing extremely well
on human benchmarks,
it's tempting to think they're approaching human level
understanding.
But HLE reminds us that intelligence
isn't just about pattern recognition.
It's about depth, context,
and specialized expertise.
This exam isn't a surrender document.
It's a highly detailed map,
showing the extensive territories of knowledge
that machines cannot yet navigate.
The fact that it took a collaborative effort
of nearly 1,000 brilliant scholars
from 50 countries,
just to build and audit this exam,
demonstrates the unique, irreplaceable power
of human cross-disciplinary synthesis.
We are entering what we can call a symbiotic paradigm.
The benchmark conclusively proves
that the future of academia
and global innovation is not immediate replacement
by autonomous algorithmic agents.
The $110 billion investment
isn't going to buy a machine
that replaces the human brain entirely.
Instead, AI will handle the massive retrieval,
summarization,
and statistical synthesis of generalized knowledge.
They'll be the ultimate research assistants.
If you need to read 2,500 pages of a Nature paper,
the AI will do it in seconds.
Exactly.
But human experts are fundamentally required
to navigate the frontier of discovery.
It is the human mind
that must interpret convoluted context
like the acoustic phonetics
of medieval carite manuscripts.
It is the human mind that must resolve
ambiguities,
challenge existing paradigms,
spot the Oganesson fallacy when the machine misses it,
and ultimately establish epistemic truth.
We've covered incredible ground today.
We started with the crisis of benchmark saturation
where legacy tests like the MMLU
became obsolete due to AI pattern recognition,
threatening the narrative
behind massive billion dollar investments.
We ventured into the grueling
level 100 dungeon of humanity's last exam,
crafted by a global consortium
to be completely LLM proof.
We saw the profound architectural failures
of these systems when faced
with the physical acoustic history
of biblical Hebrew
or the compressed micro-anatomy of a hummingbird's tail.
We explored the devastating future house critique,
exposing the flawed incentives
and factual errors embedded in the test itself
by the humans who wrote it.
And ultimately, we arrived at the symbiotic paradigm
reaffirming that human expertise,
creativity, and intuition
remain the ultimate engines of progress.
It has been a massive journey.
But before we sign off,
I think this entire deep dive raises
one final critical question for you to mull over.
Let's hear it.
If humanity's last exam is built explicitly
by human minds,
utilizing hyperniche human academic history,
testing for uniquely human lateral thinking,
and even occasionally falling prey to human error
and flawed peer review,
are we fundamentally blinding ourselves
to what true machine intelligence
might actually look like?
By forcing a digital mind
to take an intrinsically human test,
we might just be proving it's a terrible human,
rather than discovering what kind of machine it truly is.
That is an incredible thought to leave on.
Are we grading a fish on its ability to climb a tree?
Thank you for joining us on this deep dive.
Keep questioning the frontier,
and we will see you next time.
So what does it mean when the smartest machines on Earth
score less than 40% on a test
a specialized human can pass with ease?
It means that intelligence isn't just about having all the data.
It's about depth, context,
and the synthesis of rare knowledge.
Humanity's last exam isn't an ending.
It's a roadmap.
It shows the AI labs exactly
where the reasoning wall is located.
For now, the frontier remains uniquely human.
Special thanks to our sponsor, DjamgaMind.
Make sure to check them out
for your daily dose of audio intelligence.
This show is produced by Etienne Noumen.
If you enjoyed this deep dive,
please share it with a colleague
who thinks AGI is already here.
Until next time, stay curious and keep unraveling.

AI Unraveled: Latest AI News & Trends, ChatGPT, Gemini, DeepSeek, Gen AI, LLMs, Agents, Ethics, Bias