
If you enjoy our podcast, we have some exciting news – we’ve just launched a new membership called Clearer Thinking Plus.
Members get this podcast completely ad-free, as well as two professional coaching sessions every month, access to our advanced cognitive assessment, and seven other exclusive perks.
Clearer Thinking Plus is one of the most affordable ways to get access to a high-quality coach - whether you want to improve your habits, find more effective ways to work towards your goals, or get assistance making difficult decisions. It is also a more affordable and convenient way to get all the perks we offer.
If you're not interested in coaching, you can still get ad-free access to this podcast and the other perks with our explorer plan.
Visit www.clearerthinking.org/plus to become a member today. We hope to see you there!
What makes a piece of research “public property,” and what ethical obligations does that create for critics and authors alike? When a result feels wrong but you can’t locate the “smoking gun,” how should skepticism be calibrated without sliding into cynicism? How can a field avoid mistaking the absence of obvious errors for evidence that a claim is sound? What incentives cause entire literatures to form around fragile findings, and why do they persist for so long? Why do some researchers experience replication attempts as hostility, while others experience them as a gift? What norms would make constructive public criticism more common and less personally costly? How should we weigh a paper’s contribution when its analysis is flawed but its question is valuable? When is it rational to trust “the literature,” and when is the literature itself likely to be trapped in self-reinforcing error? What would it take for scientific communities to treat uncertainty as an honest output rather than a professional liability? Can a culture of open critique exist without amplifying bad-faith attacks or anti-science narratives?
Andrew Gelman, Ph.D., is Higgins Professor of Statistics, Professor of Political Science, and director of the Applied Statistics Center at Columbia University.
Andrew, welcome to the Clearer Thinking Podcast, great to have you.
Thank you, glad to be here.
So Noah Smith once said that the scariest seven words in academia may be "Andrew Gelman just blogged about your paper."
I think there are scarier things to be said in academia.
But why do you think he said that?
I guess he was trying to be kind of funny.
I'm a very nice person, and I think when I started blogging, I remember writing something
about this and saying that in real life, I'm kind of mean, but my blog personality is
very pleasant, and so everyone thinks I'm nicer than I really am, but I guess it didn't
really work out that well.
Well, I think you have it in for bad statistics, is that fair?
I think I used to have it in for bad graphs, and people would send me bad graphs, and then
people send me bad papers, but sometimes people send me stuff, and they want me to make
fun of it, and I like it.
Someone sent us this paper about 15 years ago saying that there were more babies born
on Valentine's Day and fewer babies born on Halloween, as if to say, isn't this silly? And
I looked at the paper and it looked pretty reasonable.
They only had data on 30 days of the year, the week before and the week after
each holiday, so that's 7 plus 7 plus the holiday itself, 15 days per holiday, and 15 times two is 30.
I wanted to see all 366 days, so I posted it on the blog, and someone made a graph, and
then my collaborator, Aki, did some analysis, and it became a big research project.
All because I was nice.
Sometimes being nice pays off.
So how often when you read social science papers, do you think, oh, this is a load of crap?
I mean, you know, is it 50% of the papers you read?
Is it 1% of the papers you read?
How common a problem is this?
I don't want to answer that, because there's selection bias in what people send me.
I'll say the hardest thing is sometimes you see a paper and it doesn't seem right, but
you can't pick out exactly what's wrong with it, and I think then you have to avoid the
presumption that it's correct, just because you can't find the smoking gun.
And there are a lot of cases where something seems reasonable, and then you look at it five
years later, and the bad things were obvious, but you didn't notice at the time.
Would you say that statistical problems in scientific papers are really common, though,
even given the selection bias in the ones you read?
Yeah.
I mean, I think there's a lot of mistakes or poor analyses in my own papers, so I can
only assume that other people are even worse.
Well, given that you're one of the best statisticians out there, that doesn't bode well.
Exactly.
Yeah.
I don't know.
I mean, what does it mean?
Right?
Like, so maybe the result can still be true or valid, that's why you're supposed to kind of
look at the whole literature and hope that the literature itself is not in some sort of
trap where it's reifying itself.
How do people typically react when you point out statistical flaws in their papers?
I don't usually contact them directly because it doesn't work.
It's not very pleasant when I do.
And this came up once because I said something like, I don't enjoy interpersonal conflict,
and that made me look like a really bad person, but I'm going to give you my take on this,
my official take, which is that if you publish something, it's public.
And if you're willing to let people forward your paper and say how great it is without
asking you for permission, you should be willing to let people forward your paper and say
how it sucks without asking for permission too.
We do this project called Transparent Replications where we replicate new papers coming
out in top psych journals.
We'll use a random number generator, pick a study and go try to replicate it.
And it's interesting, we get a range of reactions because we always reach out to them and say,
hey, guess what?
You've been chosen.
Sorry.
And then we go try to replicate their paper, and you get everything from stonewalling to
being super helpful to trying to be polite but clearly really unhappy with the fact that you're
replicating their paper.
So we do see this wide range of reactions.
Oh, yeah.
They should want it.
I mean, I love criticism.
Many years ago, we did our Red State Blue State project and looked at how voting varied
by income in different states and people asked, what about white people?
Like everyone wants to know about white people in this country.
They're always asking about that, not just about all the voters.
So we did analysis of just the white people and had the maps of just whites and everybody
and I posted it and then some blogger, somewhere, some political blogger just attacked me.
He had about 12 things he said about me; maybe three or four were completely
wrong and three or four were completely irrelevant, but there were some real points.
And it turned out, for example, that if you looked at our map carefully, we estimated
that 108% of the non-whites in New Hampshire voted for Barack Obama, which like it can't
be more than a hundred percent, like statistics, right?
So my colleague and I went back and we spent three months improving our model, doing something
which now would be trivial to do but back then was a lot of work.
And I was so happy.
I didn't care that they were rude.
I really appreciated that they found a flaw in my work. And so, yeah, I think people
should be thrilled. If a replication fails, that doesn't mean the effect is real or not real,
because it can be that it's a noisy replication study; there can be real stuff happening.
But yeah, they should be thrilled.
Criticism is wonderful, people should value it more.
Well, it also points to the fact that statistics is a pretty hard subject and you hear a lot
of criticism of science these days.
It seems to be almost like a movement of criticizing science.
But the statistical parts of the criticism maybe don't get as much attention as the sort
of, you know, more tangible stuff that most people can wrap their head around.
What is the balance between the statistics and the theory? An extreme example
was the study from 2011 claiming to find that Cornell students had ESP, and, well,
everyone knows they only have that at Harvard, right?
Yeah.
And so, this was published, and it was presented as, they worked hard and played
by the rules, so they deserved to get published.
This was a solid paper.
You look back later, it had a lot of problems, but also there were some theoretical problems
in that they didn't have a very good mechanism.
So I think these go together when there's a strong mechanism and there's statistical problems
that's maybe less of an issue, because then you can go and try to do better measurements.
It reminds me of the Daryl Bem paper where they found that people could predict the future
if the future involved pornography.
Yeah, that was the one.
Oh, is that the same paper?
Well, they actually followed up with a paper with, I think, like, 82 replications of it,
which is pretty impressive, except many of the replications came before the published
paper and one of them was on spiders.
It was...
Spider porn?
Spiders, just actual spiders.
No, I mean, they had a protocol and they found all the studies that followed the protocol.
These studies had problems, too, of course.
Well, when one of the replications failed, if I recall correctly, Bem said that it's because
the pornography was not sufficiently evocative.
Well, I mean, yeah, I hate to laugh at this.
Let's take something more serious, which was a study promoted by prominent political
scientists claiming that flashing a subliminal smiley face on the screen could change people's
attitude on immigration by 18 percentage points.
Well, I don't think that's plausible, but, like, there's a theory behind it and the theory
is, as you might have heard, it's very hard to change people's opinions on immigration.
You have to start, like, shooting people and stuff, then opinions change a little bit,
but it takes a lot, right?
So there was...
The theory is not unreasonable, it's hard to change people's opinions, but you can do
an end run.
Like, people want a different opinion, but they don't...
So it's not impossible, and that's the kind of study where you'd say, I think the problem
is they need better measurements.
The real effect is going to be much smaller if it's there, so that's a little...
You can't just laugh at it, you can laugh at it, but you can't just laugh at it there.
There could be something there, and that's where the statistical ideas of measurement
are very important.
One time a friend of mine had this idea, she was like, what if you could subliminally prime
people throughout the day with just being happy, like, show them things that make them happy
all day long, subconsciously, and I was like, wow, that's an amazing idea.
So we actually tried to work on it, we worked on it and worked on it, and we eventually
realized that computer monitors don't flash things fast enough to do subliminal messaging,
and actually the real studies on it have to use special computers that have higher frame
rates.
So that is a tricky idea.
That I didn't know about.
I will say, my general feeling is that a direct approach is a little better.
If you want to convince somebody of something, it's better to talk to them about it directly.
So for me, I do the happiness treatment myself.
I spend as much time with my family as possible, because they make me very happy, and I don't
need the subliminal flashes.
That's wholesome.
All right, let's talk about the replication crisis, quite a bit less
wholesome.
Can you explain what the replication crisis is?
There were a bunch of high-profile studies where there would be maybe an entire
literature, like hundreds or even thousands of published papers.
They'd be in psychology textbooks, they'd be in popular books.
But then when people would try to replicate the study, they would fail to replicate.
And the key is that even if there might have been hundreds or thousands of papers on
a topic, each paper did something slightly different.
So no one wanted to do an exact replication, and then when people started doing that, they
were finding things that weren't replicating.
And then sometimes you can go back and just do a statistical analysis and say, this never
had a chance of replicating.
We call those dead-on-arrival studies.
Well, an interesting example of this would be ego-depletion, I don't know if you looked
in that literature in particular, but the basic idea is that they thought willpower is
kind of like a muscle, and you can kind of deplete it.
Like, imagine you've got this one bar of willpower, and as you use it, your willpower
kind of goes down, and once it's gone down, you won't be able to use it later.
So if earlier you resisted eating the cookie five times, you won't be able to resist
the cookie later on.
And what's so fascinating about it is there are literally hundreds of studies on this topic.
And yet when they eventually tried to like say, okay, let's sit down, let's really try
to replicate this carefully, a bunch of the big replications completely failed.
And that raises this bizarre question, what on earth were they doing?
How can you have hundreds of published papers on a topic when there doesn't seem to be a
phenomenon under study, it doesn't make any sense?
So I don't know about ego-depletion really, because I spent so much effort looking at
other studies that I just didn't have anything left to put into it.
No, let me, I'll give you an answer, because I have an answer to your question.
But first, let me say that I've not done a lot of psychology research; I've published
a bit in psychology, but mostly on methods.
Ego-depletion doesn't seem like the stupidest thing in the world to study.
It seems like it's worth studying.
If I were studying it, I'd want to measure the mechanism as directly as possible.
So there's a tendency in research for people to do what's in economics called a reduced
form analysis.
You do the experiment and then you look at the outcome and you say, what's the treatment
effect?
I don't think that's going to work with some complicated thing like ego-depletion.
I think they have to do a lot more qualitative work on it, but then do quantitative work
informed by that, which means measuring intermediate outcomes, asking for people's attitudes.
I'm not saying it's easy.
So part of the problem is what I've called the penicillin model of science or the take
a pill, push a button model of science, which is that there's this treatment, you do
the treatment, and the most scientific thing in the world is to have the treatment group
in the control group, put them in a black box, and then see what wins.
And that's not going to work if your treatment effect is contextual, if it's positive in
some places and negative in others.
And obviously, ego-depletion to the extent it exists, and I have no doubt that it does exist
to some extent, sometimes it's positive and sometimes it's negative, right?
Sometimes you get that boost from succeeding at the hard work and you want to do more; other times
you've depleted your ego. That's not going to show up if you just study the average.
It would be the equivalent of a pill that half the time it makes people sick and half the
time it makes people healthy.
You want to figure out when that's happening, don't just give people the pill and look
at the average.
Yeah, one of the funny things about ego-depletion in particular is it sounds so much like things
we're all familiar with, right?
If you're planning a wedding and you had to make a hundred different decisions about
like table napkins, yeah, you probably are not going to be making the best decisions
later when you're choosing your snacks, right?
And I think we have all these words for that, like exhaustion, fatigue, boredom, frustration,
and I think most people realize those do affect your decision-making.
It's just that they then took these kind of common sense notions and they tried to define
it in this very scientific, technical, sounding way, but that technical sounding way doesn't
seem to hold up.
No, well, two things.
One is I want to avoid the one-way street fallacy, which is the assumption that like,
oh, why not try this?
It can't hurt.
Well, it could hurt, right?
And again, with ego-depletion, there are going to be times when you have anti-depletion,
so your ego builds up.
It's strength from its successes.
And that's there, too.
So I think there's a very naive thing when people have an idea and then they think, well,
it either works or it doesn't work.
And so let's study it.
And we'll see if it works.
But what if it's negative in some settings and positive in others?
When I was a kid, I read this book, I can't remember which one.
It was like with four kids, like they were doing stuff together.
And they had a housekeeper who could always come up with a snappy answer to everything.
Any question they had, the housekeeper would have a saying to match it.
And so the kids wanted to find her book.
And the book is the thing that has all these sayings, like, too many cooks
spoil the broth, but then there's another saying where, like, many hands make light work.
So they said, we want to find the book that has all the sayings in both directions.
So when she says one, we can say the opposite.
So that's how I feel.
Like, when you say this is convincing, I'm like, yeah, that makes sense.
And the opposite also makes sense, you know, but there is no book.
That's the trouble.
To what extent do you think that the replication crisis is really a crisis about statistics
in particular versus say other factors of science?
I think that the most important thing about statistics is measurement.
The problem is people are sometimes picking out the wrong parts of the scientific
method.
So it's not bad to randomly assign treatments or find natural experiments.
It's not bad to measure outcomes, but it's also important to measure things carefully.
And I think people sometimes think they do this, this, and this, and they've done science.
I mean, I do have an answer, if you want.
I have an answer to your earlier question about how could this have persisted?
Yeah, how could it have persisted?
So it goes like this, you do a study.
People sometimes talk about the file drawer effect, that the studies that aren't successful
go in the file drawer and the ones that are successful get published.
That doesn't happen.
Everything gets published.
I mean, come on.
Even the null results?
You find the null result, ideally you publish the null result because it's useful to know.
Right?
Like, forget about psychology.
Think about medicine.
What if you have a promising treatment idea and it doesn't work?
Damn straight.
Are we talking about the ideal world, or what actually happens?
I'm talking about the real world and what happens.
I mean, they do publish medical studies saying this doesn't work, okay?
But what often happens in a lower-stakes setting is that there are enough things you could
look at in the data that people find a statistically significant pattern.
So the statistics part here is that to be statistically significant, the effect has to
be two standard errors away from zero.
If you have a noisy study, meaning not great measurements, then the standard error or
the uncertainty in your estimate is large.
So your estimate to appear and published has to be large.
So when they said that they thought single unmarried women during certain times of the
month were 20 percentage points more likely to vote for Barack Obama, which I don't
believe, that's because the estimate had to be at least 16 percentage points, because
anything less than that wouldn't have been statistically significant.
That's important.
I want to unpack that for a second.
Okay.
Then I'll give you the rest of the story.
Yeah, because I think what you're getting at is that if you have a very noisy measurement,
which could happen maybe because of a small sample, like you don't have that many participants
in your study.
Or, for example, noisy measurement.
Yeah, or another form of noisy measurement.
Then basically you can't detect small effects, right?
So you can't do it.
So necessarily, if you find anything, it's got to be large, which means the things you
see in the literature tend to be big overestimates.
And then the next study gets done.
You're designing a study and you say, well, from the literature, it looks like the
effect is about 20%.
So how big a study would I need to design to detect that?
So you design a study following the statistical rules, which is supposed to have 80% power,
meaning there's an 80% chance you'll get something statistically significant.
Then you do your new study.
Well, it turns out that when you do your new study, you don't see it. But you know
it had 80% power; you're supposed to find something.
But then you realize that if you analyze the data just a little differently, you do find it.
And so then it can persist.
So I think that this idea that first people think that everything they're doing is completely
kosher because they're doing random assignments and they have unbiased measurements.
And then they think the effects are real because they're statistically significant.
And then that leads them to think that they can do future studies of the same sort.
And yeah, it really happens that way.
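Here is a minimal simulation sketch of that point (the numbers are invented, not from any study discussed here): when the true effect is small and the standard error is large, the only estimates that clear the usual two-standard-error bar are big overestimates, and some even point the wrong way.

import numpy as np

rng = np.random.default_rng(0)
true_effect = 2.0        # suppose the real effect is 2 percentage points
std_error = 8.0          # a noisy study: the standard error dwarfs the effect
estimates = rng.normal(true_effect, std_error, size=100_000)   # many hypothetical studies

significant = estimates[np.abs(estimates) > 2 * std_error]     # only these clear the bar
print(f"share reaching significance: {significant.size / estimates.size:.1%}")
print(f"average size of the significant estimates: {np.abs(significant).mean():.1f}")
print(f"share of significant estimates with the wrong sign: {np.mean(significant < 0):.1%}")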
So what you're touching on, I think is sometimes referred to as p-hacking, do you want to explain
what p-hacking is?
Well, the goal is to get a statistically significant result, a result that is unlikely to have
occurred by chance alone.
It has a very small probability of occurring by chance alone, and p is for probability.
So p-hacking is a term that I don't love, but in p-hacking, you hack your data.
You keep looking at your data until you find something with a low p-value, something
that would be unlikely to occur by chance alone.
I prefer the term garden of forking paths; it's the same thing, but I don't think people
are necessarily hacking.
I think they see their data and they see a pattern, and then they do the analysis.
And one way you can see this is that people get mad at me when I say, oh, here is this result.
Well, something came up recently, it was about mind-body healing.
They had a result which was 11 standard errors away from zero, which is a lot in statistics
jargon.
That's the kind of thing that statisticians laugh about, but normal people don't.
Yeah, you're not supposed to laugh.
You're just supposed to feel that you're being treated with respect by being given the jargon.
So we did our re-analysis and it turned out well, maybe it was really two standard errors
away from zero, which is still statistically significant.
But then we said, well, there's forking paths.
There's many ways of analyzing the data.
And they were really mad at how dare you accuse us of forking paths.
We know what we're doing, but you can tell because people in their own papers, they might
have five studies and analyze each study a little differently.
It's not a bad thing.
I do that too, but it's a problem if it leads people to overstate the strength of their
evidence.
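A minimal sketch of the forking-paths point, with simulated data rather than anything from the mind-body example: even with no true effect at all, trying a handful of reasonable-looking analyses and reporting whichever one works pushes the chance of a "significant" result well above the nominal 5%.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, hits = 5_000, 0
for _ in range(n_sims):
    treat = rng.normal(size=40)    # pure noise: there is no real treatment effect
    ctrl = rng.normal(size=40)
    p_values = [
        stats.ttest_ind(treat, ctrl).pvalue,                         # analysis 1: all the data
        stats.ttest_ind(treat[:20], ctrl[:20]).pvalue,               # analysis 2: a subgroup
        stats.ttest_ind(treat[treat > -2], ctrl[ctrl > -2]).pvalue,  # analysis 3: drop "outliers"
        stats.mannwhitneyu(treat, ctrl).pvalue,                      # analysis 4: a different test
    ]
    hits += min(p_values) < 0.05   # report whichever analysis "worked"
print(f"chance of a significant result despite no true effect: {hits / n_sims:.1%}")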
I think a lot of people who don't work in science don't realize how critical this idea
of a p-value is to the way science is practiced.
It's become this sort of magical thing.
You calculate the p-value.
If it's less than .05, you can publish it.
If it's not, well, you can't publish it, or maybe you can go find something else that has
p less than .05.
I mean, yeah, but the funny thing is people know this. Scientists are not stupid.
Let's set aside people who are trying to cheat.
But even the ones who are doing their best and still do bad work, they know this.
That's the funny thing.
They don't want to be that guy, and they feel that they have strong theory and they have
a strong literature.
There's a principle of mathematics you might have heard of, you must have heard of, that
if a problem is hard, the way to solve it is to embed it in a harder problem.
If you want to prove a theorem about prime numbers, you can prove it on the space of ideals,
for example, something more general.
In statistics, the principle is that analyzing one study is hard, but doing a meta-analysis
of many studies is easier.
And I think people realize that. The trouble is that the literature the meta-analysis
draws on is just a whole literature of biased results.
And so it's too bad.
But the trouble is, the answer can't just be to tell people, hey, don't just
follow the p-value, that's just a number, blah, blah, blah.
They realize that.
They feel like they're doing the right thing, but I think they're still misled sometimes,
unfortunately.
I was talking to one social scientist, and when he talked to me about p-values, he was
like, look, everyone has their cutoff, and I was like, what do you mean everyone
has their cutoff?
He's like, well, you know, for some people if they get p equals 0.08, they'll find a way
to get it below 0.05, and for other people it's p equals 0.1, and like, that was the way
he talked about it.
It's just, you know, this is the thing, I've already done this.
Do you think that it just became accepted scientific practice that you can kind of just, oh, well,
it's probably true anyway, and like, I've got the theory to support it.
And so, yeah, sure, I didn't quite meet the threshold, but if I take out
a couple of outliers, I'm across the threshold, so it's fine.
Well, I'll say this: if right now you have a promising drug, and you do a trial on it,
and you get a result, and it's not statistically significant but it goes in a positive
direction, I'd probably think the drug is more likely to work than not. So really
the question is, what are the costs and benefits?
So there's nothing wrong with people publishing things that might be wrong, but then
when they're designing their new study, they should be realistic about potential effect
sizes.
And again, stay away from this one-way-street thinking.
So we do a study, and it seems to work in this setting, and that's great.
Maybe in another setting, you'll have a negative effect you'd want to know that, but I think
people should be more open about their uncertainty.
Whether you're a marketing manager, a product engineer, a CEO, a researcher, or a social scientist,
you sometimes need to know what lots of people think about a thing, or you might want to
have people enroll in a study or experiment.
But recruiting study participants can be time-consuming, error-prone, and expensive.
Well, good news.
Positly is here to help.
Positly addresses the common pain points that researchers encounter when recruiting
study participants.
It aims to solve common research problems and dramatically improve the speed, quality,
and affordability of human subject research.
With Positly, researchers, marketers, and product developers are empowered to produce better results by
accessing high-quality participants through an easy-to-use web interface, making it easy
to run surveys on thousands of people in mere hours, and it can now be used to recruit
people in over a hundred different countries.
To learn more and to give your research project superpowers, visit positly.com, that's
p-o-s-i-t-l-y.com.
Some people say that p-values create this huge problem in science.
We should just rip them out, get rid of them.
Do you think that they really present a problem, or is it really the way people think about
them that presents a problem?
Well, I don't really use them myself, so I think it would be fine if they had never
been invented.
I sometimes think that actually a lot of science would be better if statistics had never
been invented.
You'll have to explain that one.
Yeah. For example, you do one of these studies and, what if you just plotted the data?
So you're allowed to plot; it's only the analytic statistics that were never invented.
Of course, there are times when statistical analysis is great.
We can learn a lot.
For example, when we're doing election forecasting, we have a lot of polls.
There was this poll in Iowa that was surprising to people.
Statistical analysis is really perfect for that kind of problem, trying to figure out
what happened there.
It would be hard to do that in your head.
But if you're doing a study and you have some data, you have treatments and controls and
pre-test and post-test, make a graph, see what the data look like.
And if the data look like, here, there's a positive effect, there's an overlap between
the groups, I don't know that much is gained by doing the formal analysis.
I mean, you might as well, but you don't need certainty, right?
Statistics shouldn't be thought of as, we wrote a paper once called the triple A tranche
of statistical uncertainty and the idea was that statistics is viewed as a tool to convert
uncertainty into certainty, just like they did with the mortgage things, right?
Like you keep scraping out the statistically significant results until you find something
that is like triple A value.
And it's not really certainty; it's noisy, there's uncertainty, and that's fine.
Like taking a bunch of really garbage mortgages and then bundling them up and saying, oh, look
if I bundled them just the right way, I get triple A.
Yeah, that's how I feel with a lot of that.
But I think it's such a relief to be able to be uncertain.
To just say, well, I'm not sure, yeah, this result wasn't statistically significant.
I have some uncertainty about the effect that doesn't mean I think it's zero.
It's a wonderful feeling.
So when you read papers with p-values, typically people will say, oh, it was p less than
.05, therefore the effect was there, or p greater than .05, we found no effect, right?
Which is basically treating it as a magic threshold.
But I mean, I would argue that it's much better to not treat it as a dichotomy and just
think of it as evidence.
Would you agree with that?
Yeah, I wouldn't. I mean, the p-value is the probability that a result as extreme
as what you saw could be seen under a certain model in which the world is a random
number generator, which it's not.
So I'm not, I am interested in measures, although I would prefer to put things on a real scale.
So like when I said an 18% shift in opinion, or increasing sales by 2%, or reducing the
death rate by 5% or whatever, that's how I'd rather do it.
I don't see the benefit of framing it in terms of what's the probability that we
would have seen something as extreme or more extreme than this if the data really came
from a random number generator.
The very fact that it's so hard to say it is an indication that I think it's the wrong
thing to say.
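For what it's worth, the definition is easier to see in a tiny simulation than in words. This is only an illustrative sketch with made-up numbers: the p-value is the chance of a difference at least as extreme as the observed one in the "random number generator" world where there is no true effect.

import numpy as np

rng = np.random.default_rng(2)
n = 20                    # hypothetical group size
observed_diff = 0.65      # hypothetical observed difference between group means

# The "random number generator" world: both groups drawn from the same distribution.
null_diffs = (rng.normal(0, 1, (100_000, n)).mean(axis=1)
              - rng.normal(0, 1, (100_000, n)).mean(axis=1))
p_value = np.mean(np.abs(null_diffs) >= observed_diff)
print(f"simulated two-sided p-value: {p_value:.3f}")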
A journalist I know went around a science conference talking to people who use p-values
every day, and she asked every one of them, what's a p-value?
And she said only one of them could give a quick, succinct explanation of it.
And so, yeah, it's such an unnatural, kind of awkward idea, and everyone wants to
convert it in their head into something it's not.
They want to say, what's the probability that the result is true?
No, it's not that, right?
Right.
So when someone does something that you think is stupid, then you want to think, why are
they doing it?
So this hypothesis testing has a role, and people are concerned that in the absence, remember
I said, I think the world would be better without statistics in some way, okay?
So imagine a world in which we didn't do statistical analysis, we just made graphs, and then
people did what I said: if you see a positive result, you report it as positive and just
say you're not sure, but it looks like a promising therapy, and then maybe that would get
approved as a drug, or as an educational intervention or whatever.
Then the concern is that it creates a moral hazard by which researchers would have the
incentive to do really noisy, crappy studies, because if all you have to do is get a
positive result and publish it, you don't need to reach a p-value threshold; just do a
study with like 10 people, and 50% of the time the result will be positive.
So there's a need to not have the moral hazard of encouraging people to publish really
noisy things, and the p-value serves that need. I think there are better ways of serving
that need, but we should respect what the need is, to understand why people would be
doing it.
Yeah, and I think some people have talked about getting rid of p-values, not in the way
that you would get rid of them, but just as, like, oh, p-values are bad, we should
get rid of them. Well, it's like, well, what's the bar then, right?
At least there's a bar, like, yes, people learn to game the system, but like, if you just
get rid of the gate, does that mean it's going to be better, or now there's no bar?
I think that the bar would just be a different thing. Like, you might have
rules about looking at the graph, or, you know, people always ask this, like, well, there's
no bar, it'll be anarchy, right?
But think of all the studies that are submitted that happen to have a p-value of less than
.05, they don't all get published, right?
So the journals are still having to decide. So I think the same rules apply that they use about
plausibility of the finding, importance, you know, quality of the measurement. I think
it would be fine to reject a paper, setting aside the p-value, by just saying these measurements
are so noisy that they're not answering the question you want to be answering.
And, you know, journals already use judgment, so I think that's fine.
A topic that the general public seems to talk about a lot more with regard to science
than statistics is about politicization of science, how, oh, well, maybe, you know, because
scientists tend to be progressive, most academics tend to be progressive, that might create biases
in the way they do research, maybe it's going to be harder to publish something
that contradicts progressive narratives, maybe easier to publish results
that go with progressive narratives.
To what extent do you think that's true, in your experience?
Yeah, I think it's true.
I think, more generally, I don't have a fully formed thought on this, I'm still
thinking it through, but I do feel that there is a certain political content to unreplicable
research.
So, for example, a lot of the unreplicable research in political science and psychology
has to do with, they're being hidden forces that drive your actions.
Now, I think there are a lot of hidden forces that drive our actions, and a lot of science
is trying to find those hidden forces.
But the characteristic of these studies is not that. It's not that they're saying, oh,
you know, you're driven by your subconscious or whatever, or driven by some enzyme
and you don't know what it is.
The studies have this push a button feeling, right?
I'm a researcher.
I push this button and that causes your opinion to change, or I'm a researcher and I put
this word on the questionnaire and it causes you to walk slowly, whatever it is.
And I think, actually, there's political content to that which I disagree with.
Like, for one thing, when it's applied to political science, if it's really true that
women change, that 20% of women change who they would vote for based on what time of
the month it is, which again, I don't think is true because we have direct evidence
where we survey people, but if that were true, that would suggest that we shouldn't take
votes very seriously because voters are a bunch of like crazy people, right?
And I feel like there's a political content which is neither left nor right.
Like, in some sense it fits into a kind of cynical, extreme leftist view to say, oh,
politics is all a joke, everything's run by a few people pulling the strings.
It also is kind of consistent with a conservative view that, well, voting isn't to be trusted,
a kind of anti-democratic take.
But I think there's political content there, even though I don't know that the people doing
the studies think of it as having political content.
I think they just feel like they're doing research, but they're working within a very
political paradigm, I think.
Yeah.
I mean, I guess at the end of the day, it's part of the competition, right?
Everyone's fighting to get published.
It's incredibly hard to get published in these top journals, and, you know, if they can
get a slight edge, whether it's by using slightly fishier statistics or by making their
message a little more palatable, maybe that's part of the incentive structure.
Yeah.
Of course, if you could really push some buttons and change people's behavior, that's worth
a lot of money, right?
And so this kind of research could be potentially valuable, although I don't think it's really
working.
So I'm not really so concerned about its implications.
But if we took seriously the worldview of a lot of these failed-to-replicate studies,
it would be like, oh, constantly, everything around you is altering your behavior to a shocking
degree.
You know, if you just read the word slow on an ad, then you'd be walking slower and so
on.
Yeah.
You'd be subject to many influences.
Whether your older sibling is a man or a woman, whether the
local college football team won their game the week before the election, your various
hormone levels, whether you're married or not, et cetera, et cetera.
It would be a strange world, which isn't, I don't think, accurate.
One thing I wonder about, you know, when you criticize science and you make these valid
criticisms of science, some people might say, well, look, there's so much science denialism
these days.
There are people that believe all kinds of bizarre, crazy things that are totally
anti-science. Doesn't criticizing science sort of support their worldview or give
them fodder?
What do you think of that critique?
Well, a couple of colleagues, Erik and Witold, and I recently wrote a paper with one of
those academic titles, something like a statistical case for qualified scientific optimism.
And what we argued was that although replication rates are pretty low, in certain fields
of medicine and psychology and other fields, most results are in the right direction.
So just because something doesn't replicate, it doesn't mean that the effect is not there.
I think a lot of things that don't replicate are in the right direction, but the effects
can vary.
So I think basically the short version is, it depends on what you're studying.
I think there's a lot of high-quality science, even stuff that is not statistically
significant, but there's bad stuff too, and the quality of the measurement is
very important, and that's not something that we usually talk about.
And have you gotten the critique from people that, hey, you're supporting anti-science, like
you're giving them fodder to attack science with by pointing out all these flaws?
Well, no one would say anything mean to me personally, but I know that the science
reform movement includes a kind of science-reform reform movement, and they will argue
that the science reform movement is too negative, and so some of our recent research has been
trying to explore that.
I've heard you were a methodological terrorist, is that right?
No, I'm not.
I was called a methodological terrorist, but I think that the term terrorism is inappropriate
in New York.
Because it's a slight exaggeration, right?
Shifting topics slightly, if you look at claims about health, or even some psychological
claims, many of them involve things where it's very hard to randomize people, right?
So a classic example would be, you've probably seen studies on like, well, coffee reduces
strokes and heart disease, and if you think about, well, how would you really prove that?
It's pretty difficult to prove.
You probably have to randomize people to have different quantities of coffee, which people
really don't want to do.
You'd have to probably do it for years, and you'd probably have to do with hundreds or
thousands of people, and then monitor how many heart attacks they have.
That's a pretty darn hard study to do.
That's not to say it's impossible, and sometimes people do pull off studies like
that, but it's pretty rare.
And so much more commonly, what you're really looking at is some correlation, right?
Someone has taken some big data set where they ask people how much coffee they drink,
and whether they've had a heart attack, and they're running a correlation.
And this is what gets reported on the news a lot, right?
This is a lot of like the health information we see from the influencers, from the news,
is this kind of relationship.
So I just wanted you to comment: to what extent should we just completely dismiss this
kind of evidence, or do you think that there's something to it?
Oh, I don't think we should dismiss the evidence, but I think the serious researchers in these
fields, they are looking at intermediate outcomes and trying to understand mechanisms.
But yeah, epidemiology can tell you stuff.
So I think it's fine to report things like that.
How do you make your decision?
It's based on a number of factors, but it's not...
Yeah, no.
I mean, such studies are not useless, and even if you did have a randomized controlled
study that only applies to whoever's in the study, and it only applies to the past, not
the future.
So I wouldn't like hold out for something that wouldn't be, wouldn't itself be perfect
anyway.
You have to make your decisions now.
And the same principle applies: not making a decision is itself a decision.
How do you think about the strength of that evidence, right?
Because there are a lot of things where we don't have super high quality evidence.
All we have is, okay, there's a correlation between eating this food and this health
outcome.
Do you think of that as, well, okay, I should only slightly update my probabilities
that this is healthy, or do you think that's actually quite a bit of evidence if someone
finds a pretty strong association?
Yeah, I don't know.
In terms of living my own life, I kind of try to follow recommendations that...
I think, you know, I'm not an expert, I mean, at all, on medicine.
So I would hope that the people who are look at these studies and try to assess the
reasonableness of the mechanism.
Now, I know that coffee thing came out, and it is controversial, and there are some
people who say that it's very plausible, and there are others who say it's not.
It's hard for me to judge.
Some of this...
It is, yeah, I mean, it's easier to focus on the really bad studies, because then it's
super clear.
Like, I just don't believe some of these mind-body healing claims, because I've looked
at the evidence what they seem to consider their strongest evidence, and I don't find
it very strong.
But the dietary things, I have much less of a clear bead on.
Here's this really hilarious chart.
It's called Everything Cures and Prevents Cancer.
Yeah, I've seen that, yeah.
And they'll show like coffee and chocolate and whatever, and they'll show all the studies:
the ones claiming it causes cancer and the ones claiming it prevents cancer.
Right, well, there are people who are supposedly reading these studies, and medicine is
such a big field that they have medical review journals.
Like, there are journals that all they come out every month, and all they do is publish
review articles on various subfields.
So, people are looking into it.
I don't know how that's being done.
I'm not the right person to ask this, like how people synthesize literature.
Well, one thing that comes up in the studies, right, like obviously the naive correlation,
just like correlate coffee with heart disease, there are problems with that, right?
So it could be that actually, you know, coffee reduces heart disease, that would be great.
It could be that maybe, you know, people who have healthier hearts, for some unknown
reason, change their coffee behavior, right?
It could be there's some third factor that we don't know what it is, some unknown factor
that causes both, that affects both coffee drinking and heart disease, right?
So we can't distinguish those from the data, and you know, savvy researchers will, they
know this, right?
And they're going to try to address this.
They try to find comparable people, people who are very similar
except one drank the coffee and one didn't, and look at what happened.
And I have seen that some of these studies aren't done well.
And again, that's part of the issue: sometimes it's easy to see
a problem, but when you can't see a problem, it doesn't mean it's right either.
It's unpleasant to feel that way, but I don't know what we can do about it, other than try
our best.
Yeah, it seems like with a lot of these things embracing uncertainty is really our only
thing we can do.
Yeah, I mean, there are people who research this stuff and I'd probably trust their take
on it.
And even if they disagree, they would offer a more informed take.
Just as I'll be able to offer a more informed take on political science; even though I'm
not always right about it, I'm more likely to be able to kind of notice issues.
Well, I predict if you go dig into the health studies, you're certainly going to be disappointed,
but we're curious to hear if you ever do.
Let's talk about meta-analyses. So, you know, what do you do in science when you have
all these studies and they don't always agree with each other, right?
It puts you in a weird position: you want to trust the science, but, well, the science
says one thing, and it also says the opposite.
So sometimes people call randomized controlled trials the gold standard.
And sometimes people say, well, meta-analysis is the platinum standard: what you do is you
take a whole bunch of studies and you aggregate them together, because that's better than
any one study.
And then you draw your conclusions based on that.
I know you've done some interesting work looking at some meta-analyses and saying, hmm,
things are not always what they seem.
Well, you know, garbage in, garbage out, right?
So yeah, we looked at a meta-analysis of, it's like 200 different studies of nudging,
so-called nudging.
And yeah, but the individual studies themselves were biased in the sense that there is selection
on large effects.
So nudging is like trying to change behavior through small changes like in the environment,
right?
So.
The example that works is defaults.
So you change the default behavior on just about anything and most people will stick with
the default.
Right.
So with your savings account, if it automatically gets invested in the S&P 500, most people
will just leave it.
Whereas if it doesn't automatically get invested, most people will also just leave it.
But yeah.
And the other nudges basically don't work.
Like everything but defaults, pretty much.
So, the other kinds of nudges are like, oh, it's the cafeteria and you put the healthy
food near the checkout rather than the junk food, and, you know, people still
find the junk food anyway, whatever.
So there's a lot of literature on this.
Yeah.
It's kind of mixed. The medical meta-analyses can be a little better because medical studies
are often done in a much more controlled way.
They have to report everything.
But sometimes, like, there was a meta-analysis of some controversial COVID treatment a few
years ago, and when you looked at the individual studies, it was like, oh, there's like four
people in some study somewhere that a doctor did.
They weren't really controlled trials, and they had different outcomes.
So it was kind of garbage in.
And so, yeah.
But I think meta-analysis is wonderful.
In fact, I think you should be doing meta-analysis even if you only have one study.
Even if you have zero studies.
You'll have to explain that.
Zero studies?
Yeah.
Well, the point is you're doing it with the other studies that could have been done; I'll
say we have a prior on what the other studies would be.
Here's the story.
You do a study.
You find a result.
Here's the estimate, here's the uncertainty.
Forget about p, whatever.
You publish; this is great.
You know so much.
Then you do another study.
And the result's different because it's on different people, right?
It's different.
They have different health conditions, whatever.
So you get a different estimate and a different uncertainty in a different place.
So now you have more uncertainty.
And that's weird, that just about universally, adding a second study gives you more uncertainty.
Well, that shouldn't be right.
I mean, it shouldn't be that more data gives you more uncertainty, and so something was wrong.
And so the answer is that even if we only have one study, you should be doing the meta-analysis
and asking, well, how much could this effect vary?
I don't know.
So I have to put in a prior, I have to make a guess, but it shouldn't be zero.
So it's a very weird thing.
I think meta-analysis is a very good idea.
It's very important.
It's a very weird thing, if you think about it, that getting more data increases your
uncertainty.
But that's because when people have only one study, they tend to sort of forget that
there's variation.
So it's weird, but that's how it is.
Right.
Because if you have just one study and you're like, oh, well, this can increase the thing
by 20%, you just kind of forget that there's a huge uncertainty on that 20%.
Well, not just that there's an uncertainty in that 20%, but there's variation.
That 20% is for these people in this situation, right?
I mean, a simple example would be you have an educational treatment, and they tend to
work on students in the middle of their range, right?
Because the best students don't need it.
The worst students can't make use of it.
So if you apply it to a certain population, you can have zero effects because they're
not in the middle.
So you can have the exact same treatment, but just apply it to different kids and you'll get
different results.
So is the idea of the one-study meta-analysis basically thinking about the fact that, okay,
I did this study on this one population, but if I had broadened it to, let's say, the whole
US population, that would actually increase the uncertainty, and I can actually think about
what effect that would have, even though I've never run that study?
Yeah, that's the idea.
You would create hypothetical other studies and make assumptions about them.
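A minimal sketch of the one-study meta-analysis idea, with invented numbers: the uncertainty about the effect in some new setting combines the study's own standard error with a prior guess at how much the effect varies across settings, so it is wider than the study's reported interval.

import numpy as np

estimate, std_error = 0.20, 0.08   # one hypothetical study: a 20-point effect, standard error 8 points
between_study_sd = 0.10            # prior guess: how much the effect varies across settings

total_sd = np.sqrt(std_error**2 + between_study_sd**2)
print(f"uncertainty for this study's own population: +/- {2 * std_error:.2f}")
print(f"uncertainty for the effect in a new setting:  +/- {2 * total_sd:.2f}")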
When meta-analyses are conducted, you know, they're combining many different studies
to try to get a better answer. You know, we talked about garbage in, garbage out.
How do you think about what should be included, versus which studies you should just say,
don't even put them in the meta-analysis?
I think you could, well, sometimes people say they'll include stuff, but they'll down-weight
it if the data aren't as good.
I don't believe in down-weighting things, rather what I will do is add in a bias term with
uncertainty.
So I'll say, oh, this study could be very biased.
And so instead, some studies are just unbiased for their local population, they're really
clean, but usually they're potential biases.
And so you have a sense of how large those biases could be.
So when we did our election forecasting model, we allowed for the possibility that polls
could be off systematically one direction or the other.
So that gave us more uncertainty about our forecast.
And we set our prior for that based on previous data, previous polling years.
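A minimal sketch of what that bias term can look like, with invented numbers: instead of an ad hoc down-weight, each study's variance is inflated by a guess at how far off it could plausibly be, and then the studies are pooled in the usual inverse-variance way. (A fuller version could also let the bias shift the estimates, not just widen them.)

import numpy as np

estimates = np.array([0.25, 0.10, 0.18])      # three hypothetical study estimates
std_errors = np.array([0.05, 0.04, 0.06])     # their reported standard errors
possible_bias = np.array([0.10, 0.02, 0.05])  # assumed scale of potential bias for each study

variances = std_errors**2 + possible_bias**2  # bias enters as extra uncertainty
weights = 1 / variances
pooled = np.sum(weights * estimates) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
print(f"pooled estimate: {pooled:.3f} +/- {2 * pooled_se:.3f}")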
Right.
Because usually when people combine studies, there's essentially an implicit assumption
that they're not systematically biased, right?
They're assuming, okay, yeah, this study's not as good, but that just means it introduces
more noise.
But if they're biased, then all bets are off, right?
Yeah.
And people know about this.
There's a whole field of biostatistics and they've been writing papers about this for decades.
So it's not like I came up with this idea or I'm coming in and cleaning up the field.
People know about modeling bias.
But the other thing is that meta-analysis tools, I don't know, maybe for the past 10 years
or so, have become just much more accessible.
So you can throw data into a computer program and it spits out the meta-analysis.
So then, of course, you do get people doing a simple analysis that gives them what
they want.
But to me, all these ideas fit together, and it's not like people aren't working on it.
People know about it.
It's not a big secret.
Nothing's a big secret.
That's the secret.
You know, another thing comes up when you're combining studies, which is, can you really
combine them?
Are they similar enough?
For example, let's say you're trying to answer a very reasonable question.
Does meditating reduce anxiety?
It sounds like a reasonable question.
But what do I mean by meditating?
Because you could have everything from, you know, I used an app on my phone five minutes
a day for a week to, I went on a three-month meditation retreat where I was with a guru
and didn't speak to another human.
And also, what do you mean by reducing anxiety?
There are probably, you know, at least a dozen different ways you could measure that,
over different time periods, using different scales, and so on.
And then, yeah, then you want to say, okay, we'll have a bunch of studies on meditation
and anxiety.
Somehow I want to answer the question, does meditation help anxiety?
How do I even merge all this stuff together?
Well, you definitely start by not asking the question that way, because I would ask, in
what settings does it work, and for which people?
For me, for example, put me in a three-month yoga retreat where I can't talk and it would
make me very upset.
I can tell you that.
That would be horrible.
So the way you can do this in meta-analysis: ideally you analyze the individual-level data,
so you can say, well, there are certain people, like me, who it's going to be
counterproductive for. You can also have study-level predictors.
So in your meta-analysis, you have characteristics of the study, and that should be part of the
analysis.
Part of what you'd be attempting to learn is where it works and where it doesn't.
Now, the trouble with that is it's very hard to estimate these variations,
so you'll just have a lot of uncertainty at the end.
But again, that's fine, as long as you're willing to accept that.
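A minimal sketch of study-level predictors, with invented numbers: instead of asking for one pooled "does meditation help?" number, regress each study's estimate on a characteristic of the study, such as the length of the program, weighting more precise studies more heavily.

import numpy as np

estimates = np.array([0.10, 0.15, 0.40, 0.55])   # hypothetical per-study effects
std_errors = np.array([0.08, 0.10, 0.12, 0.15])  # their standard errors
weeks = np.array([1.0, 2.0, 12.0, 13.0])         # study-level predictor: length of the program

X = np.column_stack([np.ones_like(weeks), weeks])  # intercept plus program length
W = np.diag(1 / std_errors**2)                     # weight precise studies more
coef = np.linalg.solve(X.T @ W @ X, X.T @ W @ estimates)
print(f"baseline effect: {coef[0]:.2f}, change per extra week of practice: {coef[1]:.3f}")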
Let's talk about polling.
So in the first Trump election, a lot of people were shocked that Trump won.
A lot of polls said he had, you know, it seemed to say that he had almost no chance.
What actually happened there, like from a statistical point of view?
Well, you know, the polls did very well in 2015 and 2016.
So when Trump was running for the Republican nomination, lots of people thought he didn't
have a chance.
I didn't have a chance.
But I'll tell you, he was leading in the polls.
So the polls provided a lot of information that people didn't want to hear.
Now, in the general election, the polls were off by about two or three percentage points.
And I think that had to do with, we've looked into it, but I think this had to do with who
was voting: Trump attracted a bunch of people who were typical non-voters.
And non-voters typically don't respond to political polls either.
So I think the polls were missing a bunch of people who were going to vote for Trump.
Now being off by two or three percent is not so bad.
People are out there like, oh, response rates are so low.
It is.
Survey response rates are really low.
It would be crazy.
Don't respond to surveys.
It's silly.
Unless they pay you.
I mean, why should you give them your time, right, so they can make money?
Back in the 1950s, response rates are high.
I'll tell you this.
If you were polled by the Gallup Poll back in the 1950s, I would advise you to answer the
poll.
Because you'd be one of 1,500 people, and the result would be you'd be having an effect
on the news headlines all over the country.
You could be the pivotal responder who made a certain policy look more or less popular.
But nowadays, so many polls are happening.
I once read a paper where they polled people to ask them how often they had been polled,
which is tricky, because you get an overestimate, since the people who don't answer polls
aren't in your sample.
So declining response rates are actually logical, but over the decades, the pollsters have
drained the aquifer of public participation.
But the polling errors now are no higher than they were 60, 70 years ago, because it's
like we're running to keep up.
We do a lot more statistics now to analyze the data.
So let me tell you a bunch of years: 1880, 1884, 1888, 1960, 1976, 2000, 2004 maybe, 2016,
2020, 2024. Those are the close presidential elections. I wrote a paper about this.
There's a period after the Civil War where there were a bunch of very close elections.
And then there weren't any for a while.
Now what I'm saying is for most of American history, if you could predict the election
within 3 percentage points, that's pretty good.
So yeah, Ronald Reagan's going to win, got it, you know, like you're off by 3%.
But recently we've had a bunch of close elections.
So what I think is the problem is it just happens that in 2004 and 2008, the polls happen
to be very accurate.
Just, like, it's by luck a little bit, because there's variation.
They happened to have very little bias.
And those are elections that were surprising.
In 2000, people thought Gore was going to win, and he did win the popular vote, but not by
very much, right?
So the polls said the election would be close, and people weren't, you know, losing it over
the polls.
2004 was actually a little closer than people thought it would be.
Again, the polls were accurate.
2008, people thought Obama's not really going to win.
Who's really going to vote for this guy?
But actually the polls were accurate.
So after that, a bunch of people did poll aggregation.
And I think a lot of people, consumers of polls, thought that they were awesome.
2012, the polls were off by a little bit, but they predicted the winners.
So people didn't notice.
Then 2016 came along.
People had unrealistic expectations, so that's my story.
So basically, yeah, I mean, polls being off by two or three percentage points is pretty good.
And you're not going to predict the winner in a very close election.
But nobody said you had the right to predict the winner in a very close election.
I mean, nobody owed you that.
Wouldn't it be better if we just didn't even think about it?
Like, instead of everyone tracking the polls, like, for the three weeks prior?
Oh, they do too many polls.
It's ridiculous.
I think it's good to have some polls.
I mean, there are, like, election security issues.
I mean, there are countries where you say, well, the election's much different than the
poll. That's a concern.
Also, most of polling is not about the election.
Most of polling, well, political horse race polls are loss leaders for pollsters, right?
Pollsters make their money from market research.
But then they throw some political questions on so they can get in the news.
And then they get cited as, you know, so-and-so pollster.
But also, you know, polls about public opinion are valuable.
It's probably good to know that something like 60% of Americans support the death penalty.
I haven't looked at the data recently, but it's something like that.
Obviously, there's been a lot of policy polling recently.
I think that can be very valuable.
Science is built on replication.
Our confidence that a particular hypothesis is true increases the more times we can conduct
experiments and get results that are consistent with the original research.
Unfortunately, psychology and other social science fields have been undergoing a replication
crisis for the past several years, meaning that researchers have tried but failed to replicate
experimental results from the past few decades.
And this is deeply troubling because it calls into question many of the things we thought
we knew about how humans work.
To help solve this replication crisis in psychology, the team at Clearer Thinking has launched
a project called Transparent Replications that seeks to celebrate high-quality research while
also shifting incentives toward more replicable, reliable methods.
They accomplished this by conducting rapid replications of recently published psychology
and human behavior studies in prominent academic journals, with the aims of celebrating
the use of open science best practices, improving reliability, and promoting clarity.
Once the Transparent Replications team has completed a replication, they make their results
freely available on their website for anyone to read.
To read those results and other essays by the team, visit replications.clearerthinking.org.
So I've heard that only something like 10% of people answer their cell phone for calls from
numbers they don't recognize.
How do they deal with that?
That seems devastatingly bad if you're trying to poll people.
A lot of pollsters use internet panels now, so what you do is you collect people, you
pay them to occasionally answer surveys.
You try to have panels of people that are demographically diverse.
Yeah, you work really hard to do that, and then when you do your survey, you make sure
you get the right number of young people and old people and different ethnic groups and
so forth.
And that's not perfect.
So in 2024, there were polls that did this and still got the wrong answer, although there
were some that did very well.
So how to do it is tricky.
That's for sure.
How to reach the people is tricky.
How to get the list of people is tricky.
How to get people to respond, even after you've paid them is tricky.
And how to do the analysis is tricky. It's a whole industry.
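As a rough illustration of the kind of adjustment pollsters do, here is a minimal post-stratification sketch in Python with invented numbers: each respondent is weighted so that the sample's age mix matches assumed census shares. Real polling adjustments (raking, multilevel regression and post-stratification) are far more elaborate.

```python
import pandas as pd

# Hypothetical survey of 8 respondents with a yes/no opinion question.
sample = pd.DataFrame({
    "age_group": ["18-34", "18-34", "35-64", "35-64", "35-64", "65+", "65+", "65+"],
    "supports":  [1, 1, 0, 1, 0, 0, 0, 1],
})

# Assumed population shares for each age group (e.g., from the census).
population_share = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

# Post-stratification: weight each respondent so that the sample's age
# mix matches the population's age mix.
sample_share = sample["age_group"].value_counts(normalize=True)
sample["weight"] = sample["age_group"].map(lambda g: population_share[g] / sample_share[g])

raw = sample["supports"].mean()
weighted = (sample["supports"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"raw estimate of support:      {raw:.2f}")
print(f"weighted estimate of support: {weighted:.2f}")
```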
I feel like I would be letting people down if I didn't ask you at least one question about
Bayesianism, since you probably don't know this, but he's maybe the most famous Bayesian
in the world.
So yeah, make your plug for, why do you use Bayesian methods when very few people do?
And what are they, just for the people who don't know about it?
Bayesian statistics is, how do I say this, well, the simplest way of saying it is that you
are trying to learn something that you don't know, you have data on this, and you combine
the data with your priors.
Kind of your prior beliefs, is that right?
I like to say prior information rather than prior belief. Because actually, you're not just
combining your priors with your data; you're combining your prior information with a model
connecting the data to the thing you care about.
So you have to think about, there's something I care about and there's data and there's
a model connecting the data, like a model saying, this is a good measurement of that.
I did it and I did a randomized experiment, blah, blah, blah.
And then there's another model connecting the thing that you care about that you want to
learn to your previous knowledge, and that's your prior.
And that's a model also.
And why is it useful?
Because we have a lot of situations where we have a lot of prior knowledge.
We have a lot of previous elections, previous polls.
Even a medical study on an unknown topic, I do a new medical study, we have databases
of old medical studies and we have a sense of how large effects typically are.
And from that, we have a prior which can help us get more realistic estimates.
It's not magic.
But if I have an estimate, I say, this is statistically significant, and it looks like
it reduced the death rate by 4%, then I can do a Bayesian analysis and say, well, my best
estimate is not 4%, it's 2%.
And here's my estimate of the chance that it's actually negative, and that's a little better.
So I think we can do a little better, that's why I like to do that.
Sometimes you have very little data, and then it's very valuable.
So for example, if you want to do inference for subgroups, where you just don't have a lot
of data in each subgroup, then your prior model is going to be more important.
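Here is a minimal sketch of the shrinkage Gelman describes, using a normal prior and a normal likelihood with made-up numbers chosen so that a reported 4-point effect shrinks to about 2 points; the prior mean and standard deviation are assumptions for illustration, not values from any real database of studies.

```python
import numpy as np
from scipy.stats import norm

# A new study reports an estimated 4-point reduction in the death rate,
# with a standard error of 2 points (made-up numbers).
estimate, se = 4.0, 2.0

# Assumed prior from past studies: effects of this kind are usually
# small, centered at 0 with standard deviation 2.
prior_mean, prior_sd = 0.0, 2.0

# Normal-likelihood, normal-prior conjugate update (precision weighting).
prec_data, prec_prior = 1.0 / se**2, 1.0 / prior_sd**2
post_sd = np.sqrt(1.0 / (prec_data + prec_prior))
post_mean = (prec_data * estimate + prec_prior * prior_mean) / (prec_data + prec_prior)

# Posterior probability that the true effect is actually harmful (< 0).
p_negative = norm.cdf(0.0, loc=post_mean, scale=post_sd)

print(f"posterior mean {post_mean:.1f}, posterior sd {post_sd:.1f}")
print(f"P(true effect < 0) = {p_negative:.2f}")
```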
Right.
So just to explain this a little bit: normally, in typical frequentist statistics, you're
running a study, you're just going to analyze that data and think about that data you
just collected.
But with Bayesian methods, you can say, well, look, we have all these past studies on,
let's say, related topics, we can say, well, we know how often it occurs that we get
an effect that's 5%, 10%, 20%, and so on, and you can integrate that information to essentially
improve your estimate, is that right?
Yeah.
And people think it's cheating, because it's like, well, look, I want to prove something
works.
So I'll just throw in a prior saying it works.
So you can do that.
I think you just have to be transparent.
If someone publishes a paper and says, well, here's my data, and here's my prior.
My prior is, I'm pretty sure this is already going to work.
Then I can say, well, this isn't, you know, that's not my prior, like, you know, I don't
believe that.
So transparency is necessary, is a necessary part of that, and in a way that is different
than classical statistics, because a lot of classical statistics is set up as automatic.
You just push a button and run the analysis and the number comes out.
And here we really are saying, be very explicit about your assumptions.
I think that's super important.
Yeah.
I think on the other point you were making, you know, imagine you're trying to model what
every single state in the country believes about a thing, and some states you have lots
of data.
And some states you have very little data, right? With normal statistics, you just conclude,
well, in the states where we don't have much data, we just have this really huge uncertainty.
But Bayesianism says, well, maybe you can use information from the other states to learn
about them.
Well, let's be fair to those other people.
Yeah.
Yeah.
It's not that they would do nothing. What they would do is say, well, I don't have enough
data in this state, so just this time, I'll use the old estimate from this state, or I'm
going to combine these.
So like, if you look at the CDC, when they have maps of disease, for rare diseases, they
do it by county, but in a state like Texas, which has hundreds of counties, they'll combine
small counties, right?
So people, what they do is they combine data to get a stable estimate.
But the trouble with that is then you lose your specificity.
It's a little like Heisenberg's uncertainty principle.
Like you can get a precise estimate, as long as you're willing to estimate something
that's spread out over the whole country, or you can get a very noisy estimate about something
local.
And if you want a more precise estimate about something local, I think you have to do more
modeling.
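To illustrate the partial-pooling idea, here is a small sketch with invented state-level numbers: each raw estimate is pulled toward the national average, with more pulling where the sample is small. The between-state variance is simply assumed here; a real multilevel model would estimate it from the data.

```python
import numpy as np

# Hypothetical state-level polling: sample size and "yes" responses.
n = np.array([2000, 500, 40, 12])
yes = np.array([1100, 240, 26, 9])

p_hat = yes / n                    # raw per-state estimates
overall = yes.sum() / n.sum()      # national estimate

# Simple partial pooling: shrink each state toward the national rate,
# with more shrinkage where the sample is small.  The between-state
# variance tau2 is just assumed; a real multilevel model would
# estimate it from the data.
tau2 = 0.01
var_hat = p_hat * (1 - p_hat) / n          # sampling variance of each raw estimate
shrink = var_hat / (var_hat + tau2)        # 0 = keep raw, 1 = use national rate
pooled = (1 - shrink) * p_hat + shrink * overall

for i in range(len(n)):
    print(f"state {i}: n={n[i]:4d}  raw={p_hat[i]:.2f}  pooled={pooled[i]:.2f}")
```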
Before we wrap up, what do you wish people knew about statistics?
Hmm, I don't know.
I don't know that people need to know about statistics.
I mean, there are other things maybe they should be knowing about.
I guess they should be aware of uncertainty and variation.
The psychologists Tversky and Kahneman wrote a paper in 1971 called Belief in the Law of
Small Numbers, and there isn't such a law of small numbers; it's a belief, and the belief
is that the part is representative of the whole.
And so if I do a survey of 100 people, I should get roughly 52 women and 48 men.
But if I get a small sample or if I have noisy measurements, the part won't necessarily look
like the whole. And, to get back to your thing, to connect this:
People tend to think that all the evidence goes in the same direction.
And when they write a paper, it's like, look at all the evidence.
But in real life, the evidence doesn't all go in the same direction, right?
Things happen, you will see things that go the wrong way because there's like things
you hadn't thought of.
And maybe when thinking about science or life, people should be aware of that.
Aware that the evidence won't all go in the same direction, but also aware that it's very
natural not to realize that.
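A quick simulation makes the "law of small numbers" point: with the same true population share, small surveys stray from the population far more often than people intuitively expect. The 52% figure and the sample sizes here are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# True population share: 52% women.  Simulate many surveys of size 100
# and of size 10 and see how often the sample is far from the population.
true_p = 0.52
for size in (100, 10):
    samples = rng.binomial(size, true_p, 100_000) / size
    off_by_10 = np.mean(np.abs(samples - true_p) > 0.10)
    print(f"n={size:3d}: share of surveys off by more than 10 points: {off_by_10:.2f}")
```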
Before recording today, I learned a mind-blowing fact from you and it wasn't about statistics
and it's my final question for you.
You told me that you don't have a cell phone and you never check your email before 4 p.m.
Is this the secret to your incredible productivity?
I think I was productive before cell phones were in common use and at that time, I was
also checking my email before 4.
So I don't think that would be the source of my productivity.
Andy, thank you so much for coming on.
Really appreciate it.
Thank you.
All right, let's just do a few questions from the audience and we'll be sure to repeat
them just so we can get them on the recording.
I read something over the weekend where all the top frontier models, the LLMs, were asked:
if you have a car wash a hundred yards from your house, should you walk there or drive
there?
You need to go to the car wash.
I think all of them except one got it right, and I'm not going to name it because I can't
remember which one got it wrong.
This is probably a somewhat irrelevant way of asking: how do LLMs do at statistical
problems?
Are they an entry into the world of statistics that change things dramatically?
I'm sure in most areas, anyway, they express a very high degree of confidence in the
quality of their answers, but is it a useful tool in ferreting out fraud
or mistakes?
What is your experience with LLMs?
How do you think LLMs are impacting that world?
Do we need to repeat that or no?
Okay.
Cool.
I've never used a chatbot myself, actually, so I can't, but I have some answers to this
because I've talked to people about it.
The answer is a chatbot can be super useful in a lot of ways.
First, it's like a version of Googling stuff, so it's a way of searching.
I use Google because I'm an old person, but this is like what people do, so you could
do what you do when you find something in Google, you have to click on the reference
and look.
Similarly, if you use the chatbot, you can look stuff up, and then you can try to find
it elsewhere and check it.
In that way, it's like having a really accessible textbook.
Even if the chatbot is doing nothing but rearranging words that are already on the Internet,
if you ask something about statistics, you get a lot of technical stuff that's probably
pretty good.
You're not asking about a conspiracy theory, right?
It's as good as the quality of what goes into it, so I think when it's a technical subject
or how to repair your fridge or whatever, it's probably pretty good because there aren't
a lot of fake owner's manuals out there.
Beyond that, it can be used for coding, so I've not done this, but I've been told that
if you want to make a graph, if you're willing to work a little bit with the chatbot, you
can get it to make really nice graphs.
You give it some data.
If you just say make a graph, it might make a horrible graph, but you give the data, and
then you can say, I want to write a program in Python that does a grid of this graph and
this and so forth.
Apparently, that's very effective.
Can you throw the data in?
Can I just throw data in and say, please perform a statistical analysis or throw a paper
in and say, please find the flaws in this paper?
I guess it'll do something.
I doubt it would be very useful to me, but it's a little bit like people say, well, you
could use a chatbot to write a scientific paper.
Well, could I use it to write my scientific papers?
Well, maybe sometimes, like maybe sometimes I'm asked to write a review article about a subfield
and I could maybe have the chatbot do it and then clean it up, but most of my scientific
papers are research, where I don't even know what I'm trying to do when I start.
There's no way it would work for that, but that doesn't mean it's valueless.
We shouldn't judge a tool by what it can't do, right?
So when you're looking at clinical trials for drugs, are there any common smells that
you see when you look at the study and you're like, oh, this is clearly they're trying
to massage an outcome here?
I mean, ideally, you wouldn't have to worry about what they're trying to do.
So in the ideal world, all the data would be available, so you could reanalyze.
I think the hard thing is, most studies are just more complicated than you'd think.
It's very rare, even in a clinical trial, that you're just comparing two groups, like you
might have time series involved, multiple measurements, the analysis can be kind of complicated.
Sometimes, I guess sometimes you see problems, like sometimes they're not adjusting for
age or like, what if you adjust for age linearly?
Well, that might work in some settings, but it wouldn't work for something like COVID,
right?
Where there's a big nonlinear pattern.
I like to see people show the raw results and then show what happens after you do the
adjustment.
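Here is a small sketch of the "show the raw result and the adjusted result" practice, with simulated data in which the control group happens to be older and the outcome rises nonlinearly with age; the quadratic age term is just a stand-in for any flexible, non-linear adjustment.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated trial where the control group happens to be older and the
# outcome (say, a risk score) rises nonlinearly with age.
n = 500
treated = rng.integers(0, 2, n)
age = rng.uniform(30, 90, n) + 5 * (1 - treated)          # controls skew older
outcome = 0.02 * (age - 30) ** 2 - 2.0 * treated + rng.normal(0, 5, n)

# Raw comparison of group means.
raw_diff = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# Age-adjusted comparison: least squares with a quadratic age term,
# rather than assuming the age effect is linear.
X = np.column_stack([np.ones(n), treated, age, age**2])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)

print(f"raw difference (treated - control): {raw_diff:+.2f}")
print(f"age-adjusted treatment estimate:    {coef[1]:+.2f}")
```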
Yeah.
Yeah, that's always an interesting one, when they report the sort of more complex analysis,
but then they don't report the simpler analysis.
That's something that I've noticed.
When we go replicate papers, that's like a point of suspicion, we're like, why didn't
they report that?
You'd expect them to report the simple thing.
So let's look deeper, and we'll redo the simple analysis and make sure it gets the same
answer.
Yeah, I want the explanation.
Like, this looks like it worked, but actually there were more old people in the control group,
or whatever it is. You want to have some story, some understanding; not that the story is
always right, but at least it can guide your thinking.
I'm interested in your final point, which I found very compelling about that in the
real world we should expect the evidence to stack up in different directions and not
all go in the same direction.
Assuming that is true, in light of that, how would you change either academic processes in
any field that you think suffers in this way, or just the general discourse, to take into
account that complexity, other than just saying more meta-analyses?
I have some ideas, some of which are already being done in different places.
So one idea is sometimes you can publish the design of the paper, and then the data come
out and you're going to publish the data no matter what.
Like a registered report, I don't think it should be required, but it should be allowed.
Sometimes people say, well, I don't want to do a registered report because I don't want
to restrict what I'm doing, but the registered part is just a subset.
You could do other analyses, too.
Another thing is to divide things up.
If someone does a great study, they could publish just the data, so you shouldn't have to
publish an analysis with it.
If the data are interesting, you should be able to just publish a paper that is the data,
and then other people can analyze the data.
So yeah, I guess moving towards publishing everything.
Sometimes people create big data sets.
I think I remember one that was a collection of genetic information from large populations,
but they didn't associate any studies with it.
So someone can trawl through these big data sets.
How should someone approach such a giant data set to make sure that they're not just finding
outliers that look significant? They didn't collect the data themselves, but they do want to
see if there's anything interesting in that data set.
Sometimes if it's a huge data set, you can break it up into training and test sets.
That's not going to work so well in economics and political science.
We kind of know the data that are out there.
We have some number of elections, and you can't always do that, but if you can, that would
be good.
Yeah, it's interesting because in machine learning, you see that a lot, but it feels like
psychology almost never does that, and it's a powerful technique, right?
You split some of the data out and say, this is the data I'm going to work with; I can do
any analysis I want, go wild on it, but I'm going to put some in a vault, and
only at the very end double-check that it still matches on the data I haven't looked
at yet.
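Here is a minimal sketch of the vault idea with made-up data: explore freely on one half, and only at the end check whether the finding holds on the untouched half.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up dataset: 200 participants, an exposure, and an outcome.
n = 200
exposure = rng.normal(size=n)
outcome = 0.3 * exposure + rng.normal(size=n)

# Lock half the data in a "vault" before doing any exploration.
idx = rng.permutation(n)
explore, vault = idx[: n // 2], idx[n // 2:]

# Explore freely on the training half (here just a correlation, but in
# practice this is where all the flexible analysis happens).
r_explore = np.corrcoef(exposure[explore], outcome[explore])[0, 1]

# Only at the very end, check once whether the finding holds up on the
# untouched half.
r_vault = np.corrcoef(exposure[vault], outcome[vault])[0, 1]

print(f"correlation on exploration half: {r_explore:.2f}")
print(f"correlation on held-out half:    {r_vault:.2f}")
```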
Yeah, I mean, ideally that's what the scientific process is.
So I write a paper, I analyze the hell out of my data, and publish the results which
seem reasonable to me, then other people can follow up.
People don't always want to follow up on my ideas; they have their own ideas, right?
I mean, it's like what you said. I'd be thrilled if people would replicate my studies, but
people don't typically replicate their own studies because, you know, it's work, it's
expensive.
I've heard an estimate, I think for one subfield, that less than 4% of papers
ever get replicated.
That might even be an overestimate, maybe way less than that.
Sometimes people replicate their studies and the replications fail, but they do an
analysis to make it look like it succeeded.
That's another problem.
I guess it's more of a comment and a reflection, as a scientist doing not clinical trials
but the discovery side of science, and having been a reviewer for academic journals as
well. I see so much bad statistics being done, like they're not using the right type
of statistics, and I pick it apart: N equals three, and they use parametric tests when,
you know, we have small sample sizes, so we do what we can. And it goes along with the
comments about hiding their data in bar graphs, right?
So I like the comment where you said we shouldn't use statistics in science, and I think that
would help kind of solve that issue of trying to force out that p-value, or finding the
statistic we should use to get that significant value, because that's the only way we can
publish in these big journals, right?
So it would be awesome to figure out a way to push the idea of not needing statistics
on these smaller discovery studies, and then using them bigger in clinical trials.
I think one of the paradoxes of a lot of medicine, and also policy analysis, is that
big decisions get made based on small studies, and once the study is out, then things are
just done, so it's uncontrolled.
So in the ideal world, I think that our experiments would be more realistic, and I think the real
world would be more experimental, so ideally, there's no reason you couldn't keep randomly
assigning treatments even after they've been approved, because there's multiple treatments.
Now, it's a little weird, do I want my doctor randomly assigning a treatment?
Well, you could randomly assign encouragement, I mean, there are things that could be done.
I'm not saying it would be easy.
There was one trial where, when you went to the doctor, if there were two treatments and
it was literally unknown which was better, the doctor had the option to push a randomize
button, because nobody knows, and I think that's a pretty cool example of that, yeah.
I think just in general, if you think about who is implementing policies, like doctors,
nurses, teachers, police officers, various people who are implementing policies, we,
as researchers, academics, or whatever we are, should respect those people.
So the idea is that a medical treatment, I mean, sometimes it really is just a pill, but
typically it's a therapy of some sort, and the doctors and nurses who are implementing
it should be active, they should be part of it. It's not like we figured out
what works and here's how you do it. And the same with educational interventions:
it's not like here's a kit and the teacher pushes the button; the teacher has to be involved,
the police officers have to be involved and committed, and so forth.
So that's like the big picture, I think.
All right, last question.
To what extent should intuition inform, you know, your calculations in the sense of,
you have an intuition about some causal chain or some effect, that, you know, A just has to
cause B, you kind of see it, you feel it, and you know, as humans, we have so many bits
of information that we experience, and then when you do these tests, you know, you get,
you know, 20 bits, you test 20 people for like one binary thing, and whatever it's not
much information, like what is the interplay between the very small amount of data you get
from your samples versus your lived experience that informs, like, you know, everything
else?
Well, I mean, I guess that's kind of the point that we often have a lot of prior information.
I mean, if you think about like a drug trial, you say, well, they do all sorts of experiments
to figure out like what's a good dose, and then they do the big clinical trial and everything
is frozen, but it's kind of weird, the clinical trial has thousands of people, but they're
deciding the dose based on some small amount of information.
So that obviously they have prior knowledge.
The easy answer to your question is you should be able to use your intuitions or prior
knowledge to design your study.
So if I believe strongly that there's a certain pathway, then that would suggest I should
take measurements along that pathway, and it also might suggest doing a control that
shouldn't go through the pathway, a negative control.
So at the very least, you should be able to use that to design your study.
In the analysis, it's more, as we said, it's a little more controversial, but the design's
more important than the analysis anyway.
Andrew, thanks so much for coming.
Thanks for joining us.
We always love to hear from our audience, so if you have any questions or comments, send
us an email at clearthinkingpodcast@gmail.com.
This episode was edited by tone support and transcribed by We Amplified.
We would really appreciate it if you could rate and review us wherever you're listening
to this, as it helps others find out about the show.
And did you know you can now watch videos of the podcast?
Just go to youtube.com/@clearthinkingpodcast.
Stay tuned in by subscribing to our email newsletter, One Helpful Idea.
Each week we'll send you one valuable idea that you can read about in just 30 seconds,
along with that week's new podcast episode, an essay by Spencer, and announcements about
upcoming events.
To sign up for our newsletter, access transcripts, or learn more about our guests, visit podcast.clearerthinking.org.

Clearer Thinking with Spencer Greenberg