This episode of the stand-up is going to be extra special,
because Casey is going to do the intro.
Casey, what are we talking about today?
Hello, everyone, and welcome to the stand-up.
The number 45-6 best tech podcast on Spotify,
according to the most recent, uh, something.
Row.
Uh, anyway, sorry.
Today, on the stand-up, I wanted to cover something.
I'm going to talk about the AWS outage that happened in October.
But I'm doing so because I kind of wanted to talk about a bigger thing,
which is the idea of actually understanding something
versus saying you understand something.
So like, one of the things that happens a lot,
especially, I think, to people who are
earlier in their programming career, like, if you're a junior programmer or something
you're coming in, um, and I know this was certainly true of me,
is you, you want to seem like you know stuff, right?
Like, you don't want to seem like you don't understand what's going on.
So there's a lot of, like, external pressure,
whether it's really there or not, you feel like you should kind of say that you
understood something or, or pretend to understand something.
Even if it's like a little bit hazy or you didn't quite get it.
And even if it wasn't your fault, like, even if the thing wasn't explained properly
or didn't include, like, important information,
you're still incentivized to basically act like you knew what it was, right?
Because it just makes you seem smarter or something or at least doesn't make you seem junior, right?
And so one of the things that at least I've found as I got older and programmed,
had more programming experience and things like that is nowadays,
I, like, almost overask for things to be explained.
Like, I'll, like, I don't care about looking dumb at all.
I'm like, wait a minute, go back.
Like, I didn't understand that part.
Like, what do you mean by this or like, what's that term mean or whatever?
Uh, because now I just don't really care about that.
Like, I'm not as worried.
And I want to actually know because I've had so much experience programming where I thought I
knew something or I pretended I knew something and it came back to bite me.
I'm like, I want to actually know.
Like, I want to be sure that when I have an explanation of a bug or I think I know the reason
of a performance slowdown, I always in the back of my head, I'm like,
if I haven't really gotten to the bottom of this, it could be something else.
It could be it could be that the real thing is still hiding in there.
And I just don't know because I haven't really looked at it all the way.
I'm just I'm moving on because it's convenient or whatever.
And so, uh, the reason that I wanted to talk about the DynamoDB outage is because
recently there's been kind of a string of high-profile outages.
So there was like a big one that took down Google and it turned out it was a thing where like they
they didn't handle a field being empty, right?
So the way they were programming it,
they were like, okay, we have this thing, we load some JSON.
And if there's nothing in the JSON, we just, like, dereference a null pointer or something,
right? It was like literally that, right?
And then there was one with CrowdStrike, where they took down the entire world
with blue screens. And they gave a really good explanation of it.
They were like, here's this certain array sizing thing we do,
and we had too many rules, so it, like, overflowed the array, right?
And so these were pretty good when they gave what they call RCAs, or root cause
analyses, right? When they said like, here's why we went down. When I read them, I didn't feel like
there were a lot of unanswered questions in my mind. Like maybe I didn't know like literally the
line of code that because they maybe didn't publish literally the piece of code. But they gave me
enough that I was like, okay, I understand how someone wrote this code. And I understand the stupid
thing that they did, right? That like, okay, don't do that thing. I understand. And I'm
totally, like, okay. With the DynamoDB one, because it came up on this podcast, right? We talked about
it when that dude at the Guitar Center, right, was like, I overheard someone talking. Yes.
The pub, right? Yes, incredible. Here we see the elusive programmer, a simple creature that spends
most of its time working alone, often in darkness. Wait, what's this? Someone being wrong on the
internet. Our coder springs into action, reaching top speeds of 120 words per minute before, flash!
A light mode website. The natural enemy of these code lovers stuns our friend. The chase is called
off. We'll have to get them next time. When not on their computers, they can spend hours drawing
crude symbols on something they call whiteboards. Researchers have discovered thousands of
these, often with more than a dozen in a single office. However, no linguist has yet
deciphered what their purpose is. Vain creatures, their bodies have evolved over millennia
to be able to sit in unusual postures while looking at themselves online. This will often last for
many hours, using the excuse they're waiting for code review when pressed as to why they're so inactive.
And finally, after a long day of accomplishing very little, our keyboard warrior's ready for bed.
Quick read and it's lights out. Good night, little coder.
So how do I sleep so well at night? Well, I have Sentry to help me crush those bugs, and I'm not
talking about, like, little teeny tiny South Dakota bugs that die in the winter. I'm talking
about big, mean jungle bugs. And I'm not scared of any of them, by the way, just...
but I can squash those bugs with Seer by Sentry. So I was kind of a little more motivated about that
one to go like, okay, let me go see how much information they've posted. And I had
already kind of read, afterward, a summary where they posted an RCA, and it was very
vague. Like the RCA just did not really explain very much. I then noticed that they posted a full
presentation at re:Invent in December, or I guess I don't know if re:Invent was in December,
but the video went up in December of the re:Invent presentation where they covered this outage.
So I went and watched all of that. And after having read the entire RCA and watched the entire
presentation, I still was left going, I don't see an actual explanation of the bug here, right? Like
I'm trying to figure out what the actual bug was. And it just wasn't ever explained. And so what
I kind of wanted to do was just talk about that. Go through why I don't think they explained what
the bug was. And just use that as an example of like, I don't think people should just go, oh, okay,
I get what the bug was. Because people have replied to me and gone, oh, let me explain
to you what the bug was. And then they just explain the same thing again. I'm like, that's not
the bug, right? So everyone, see, is incentivized to go like, I understand it, right?
It's like, no, if you can't tell me what the actual bug was, then we're not done here,
right? Like we should have that fuller explanation. So does that all make reasonable sense? Like,
what I'm saying? Yeah, first off, I just want to say, I knew exactly what you were saying, Casey.
Like, right from the start, right, Teej? Like right away, you were like, okay, I know, I know
exactly what you're saying. No questions on my end. No blockers. Thanks everybody. I'm great.
I'll see you guys tomorrow. You know, no problem. I just want to say I really like listening to Casey
talk on the podcast when I listen on Spotify, but also just right now, like I could listen to you
talk for an hour. Great shout-out for the Spotify. I was just going to say, like, especially when
you listen on Spotify, your voice quality is incredible. You also get the bonus extras, right? You
get all the banter before and after the actual. Yeah, we started posting longer versions on
Spotify that are like more of the extra. Yeah, more of the not-on-topic stuff, but a little
more. Yeah, because the live audience gets the yapening, they get to come in here, they get to hear
about Trash and his Pokemon addiction, which you probably don't even know about because you weren't
listening to this on YouTube, right? You don't get to hear all the fun stuff. That's kind of a
hard sell for the first 10 minutes of a YouTube video. It's a very hard sell for a YouTube video.
I'm going to watch four guys talk about something I don't even understand and it's called
DynamoDB. Since we're starting the podcast, maybe we should introduce Adam. That's a very good point.
We haven't done any introductions at all. Tell us a little bit about why you're on the podcast.
Because I am at TJ's house. I'm at number one. I'm not for this today.
TJ requires all people who visit his house to be on the podcast. It's been awkward
a couple of times. Who are you really, other than an AWS hero? I'm not even that. I was an AWS hero.
All right. You can get kicked out of the superhero group? How's that work? You don't get renewed.
I was a one-term hero. Is it like a paid up? Say you paid a beard hero? No. I didn't really care
about AWS anymore. Talk about it ever. Maybe he's not a hero anymore. So now he's a villain.
Casey looks like he's part of, like, some murder mystery. He's standing there.
Dude, we're about to get, what is it? Nick Hilt? What's the person that does all the drawing
on the board and then shows up? Casey Muratori. That's what you know. Myuratori. Is it myuratori?
Or is it moratori? Oh my god. You're about to do visuals, aren't you? So yes, I know. This is
the best podcast. Literally, this is the best one to be a part of. It's pronounced
myuratori by my family. Like almost like there was a Y there, like myuratori. But that's correct.
It doesn't really make any sense because it's an Italian name, and in Italian it'd be muratori.
Muratori. It's muratori in Italian.
So why, how it got "myur", I have no idea. That was some Italian American, like,
immigrant thing that happened, I guess. I don't know. So here's effectively what they said.
They have these things called API endpoints, maybe they call them, right? And these are the
domain address, like if you look up in DNS, it's the name that you're going to look for to
know who you're supposed to send your DynamoDB requests to. And these things I guess
look like this. And Adam can probably confirm this because he is or was a hero.
They look like what? Oh, it's behind. Yeah, these things are behind because our video
disappeared on rivers. Yeah. Okay. Oh, there we go. So they look like dynamodb.us-east-1.api.aws
or something like this. And I guess it depends whether
you're using IPv6 or IPv4, like they have different names depending on things, or whether you're
using like a specific, like they talked about governments use like a different one or whatever. So
these names are names that you effectively hard code, I guess, into your application where
you're like, when I need to do something with DynamoDB, I'm going to ask for this. Does
this make sense? Right. And does that sound right, Adam? Because I don't use AWS stuff.
Yeah. Yeah. That's all right. So, you know, you you asked for something like this and you're
going to send your information perfectly. I mean, I know what he's saying. Yeah. Yeah. So that
then is going to redirect you somewhere because obviously there isn't like one machine that's
going to handle all the dynamodb traffic in the entire universe. Even if you subdivide it by
region, which you can see here, you're kind of supposed to pick a region. I guess you don't you
don't send it to some main address. You send it to a regional address or maybe there is a main
address you can use that will figure it out. I don't know. But anyway, at some point you're talking
to this and this needs to point to effectively like a load balancing scheme. So this thing is
supposed to point to effectively what they called a DNS tree, although they never really explained
the tree nature of it at all. It sounded more just like like a weighted array, if you will,
where you just said, here's a bunch of machines and you're going to pick those machines based on
weights that we set so that we can load balance, right? So if a machine gets behind, maybe we set
its weight to lower. And if a machine seems kind of empty, we set its weight to higher. And so
they called it a tree. So I'm assuming it's a tree. They never explained what the tree part of it was.
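As a rough illustration of the "weighted array" idea described above, here is a minimal Python sketch. The backend addresses and weights are made up for the example, and this is not how Route 53 is implemented; it only shows what picking a machine in proportion to a weight looks like.

```python
import random

# Hypothetical backends and weights (assumptions, not from the presentation).
# A higher weight means the machine should receive more traffic.
WEIGHTS = {"10.0.0.1": 5, "10.0.0.2": 1, "10.0.0.3": 3}

def pick_backend(weights, rng):
    """Pick one backend with probability proportional to its weight."""
    hosts = list(weights)
    return rng.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]

# Simulate many lookups and count how often each backend is chosen.
rng = random.Random(0)
counts = {host: 0 for host in WEIGHTS}
for _ in range(9000):
    counts[pick_backend(WEIGHTS, rng)] += 1
```

Lowering a machine's weight when it falls behind, as described, simply makes it less likely to be picked on the next lookup.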
But this name is supposed to point... Hold on for one quick second. By the way,
someone did get their L6 promotion based on that tree. So I do think next time you should find
out what that tree is because that meant a lot to somebody. Okay, there was a packet and engineers
happened. I do agree the tree is probably important. It's just not important for the bug.
And so I will say there was no need for them to explain the tree. So I'm
okay that they skipped out on what the tree is doing. But I got a quick question as well. Yes.
Is it called a tree because it's a root cause analysis or no?
Yeah, I passed that.
I know, our jokes were too off topic. I'm sorry. I'm sorry. Okay. So anyway, this is supposed
to point to that. And that's sort of this load balancing scheme, basically, of DNS entries.
And the way that they describe this in their presentation is they would use a thing like,
I'll say, plan145.ddb.aws, right? Now this is the root of that tree,
I guess, not root cause analysis, but this tree. This would contain, like, this is the top
level record of a bunch of records that allow it to do its load balancing. And I assume Route 53
kind of has this load balancing capability. I'm reading between the lines of the presentation. They
didn't say that outright, but I'm assuming Route 53, which they're doing all this through, you know,
which is their own DNS thing, allows that load balancing to happen by you just setting stuff up
in here that says how the load balancing should be working right now. And then it will
pick the correct machine based on some kind of randomization and the weights or whatever.
Now what they said was this name, which really does exist. And apparently there's a tree or
something like this. This name is one that they just kind of used for their presentation. They never
actually used a human readable name for this plan like 145 that I've written here or whatever.
It was really a hash of something. So it would really be like, you know, 0afe12, you know,
9a or something like that, right? That is actually what would be there. So if you went and looked,
you would not see a human readable name, or at least at that time, you wouldn't, I guess,
you wouldn't see like plan 145, you'd just see that. And so the idea was, okay, a user goes to use
it. They query this name, Route 53 will direct them, like, to here. And this thing is some kind of a
load balancing tree that Route 53 can use that will allow you to get where you need to go, right?
That will give you an actual machine you can send traffic to eventually. Again, they did not
describe any of that. So I have no idea how any of that works. I've never touched or used Route 53.
So I have no idea. But we'll just assume that that happens because it doesn't matter for this bug.
We do have an AWS hero. So if you do, if you are confused, you can always ask Adam and he may have
further insights. I mean, yeah, go for it. Well, Route 53 does have a lot of different ways you can
like split the traffic. So yes, weighted is one of them. And that sounds like what they described.
So somehow they've set up these records with that. And they just didn't say how, but something,
something in a tree form. My guess is the tree has, like,
a couple of weights at the top that branch out to more weights or something like that, because
that's easier for it to deal with because there's a lot of them or something, who knows. Anyway,
I have no idea. Point being, this is what's supposed to be happening normally. Now the reason that
this is called plan 145 here, even though it actually would have been some hash code, but they refer
to it as like plan 145, is the load balancing, as you might imagine, has to be kind of continuous
because the DynamoDB machines are doing stuff all the time. They're becoming more overloaded.
Machines are going down or crashing or who knows what could be happening, being taken
offline. New capacity can be added. And so this stuff has to be updated constantly, like all the time.
So this main API endpoint that you connect to, it constantly has to have that tree that it's
pointing to be adjusted. And so the way that they do that is they create another tree, the tree that
they're going to move to, right? They create like plan 146 or something. And they make the whole tree
here. And then when they're ready, like when this tree is done, they take this, you know, this
record here and instead of it pointing to that one, they point to this one, right? So they make the
new one and they move over to it by just changing that name. Now, for some reason, and this reason
is not really explained, the way that they've set up that process is they split it into two pieces.
There's something called a planner, which figures out what the new tree should look like basically.
So you can imagine there's some machine called a planner. And I don't know if it's an actual
machine or if it's just a process running on some machine that's running other things, who knows?
But there's something called a planner. And as far as I could tell, there's only one, meaning there's
just a planner that sits there and figures out what should the new plan look like that we're going
to switch to. And it's constantly doing this. So it generates plan 145, then it generates plan 146,
then it generates 147, 148, nine, you know, blah, blah, blah, blah, right? And it just keeps
putting out plans for all of eternity, because that's its job. Now, it never actually creates them
apparently. Its job is not to ever make them in Route 53. It's just to figure out what they
would be if someone were to put it into Route 53. Then they have three enactors.
These enactors get the plan from the planner and they put it into Route 53.
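To make the planner/enactor split and the "build the new tree, then repoint the name" step concrete, here is a small Python sketch. It uses a plain dictionary as a stand-in for the DNS zone. The function names, the weights, and the record layout are all assumptions for illustration, not anything AWS described, and none of this is the Route 53 API.

```python
# Toy in-memory stand-in for a DNS zone: record name -> value.
zone = {}
ENDPOINT = "dynamodb.us-east-1.api.aws"  # name as read out on the show; illustrative only

def plan_name(n):
    """Human-readable plan name; the real names were said to be hashes."""
    return f"plan{n}.ddb.aws"

def planner(n, weights):
    """Only works out what the records would be. It writes nothing to the zone."""
    return {"number": n, "name": plan_name(n), "weights": dict(weights)}

def enact(zone, plan):
    """Create the new plan's records first, then repoint the endpoint at them."""
    zone[plan["name"]] = plan["weights"]
    zone[ENDPOINT] = plan["name"]

def resolve(zone, name):
    """Follow one level of indirection: endpoint -> plan name -> weights."""
    target = zone.get(name)
    return zone.get(target) if isinstance(target, str) else target

# Two consecutive plans with made-up weights.
enact(zone, planner(145, {"10.0.0.1": 5, "10.0.0.2": 1}))
enact(zone, planner(146, {"10.0.0.1": 2, "10.0.0.2": 4}))
```

After the second call the endpoint resolves through plan 146, while plan 145's records are still sitting in the zone, which is why some kind of cleanup of old plans becomes necessary.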
Does this make sense? Now, one planner, as far as I understand from the presentation, three enactors.
There was no explanation for why this would be the case. They said the reason there are three enactors
is because it's supposed to be fault tolerant, like if one of them goes down or something. But
they never explained why you wouldn't then need three planners, because if the planner went down,
then the enactors have nothing to enact. So it didn't really make any sense. So there wasn't an
explanation in the thing about why this structure looks the way it does. It's not really that important
to the bug that it looks this way, although it kind of is, as we'll see later. So I was a little
weirded out by the fact that they didn't justify this, but that's fine. So hopefully that makes sense.
We have a planner. We have three enactors. The enactors are all trying to enact this plan, right?
Now, what happens here is that for, again, reasons where the only thing they said in the
presentation was it makes it easier to reason about. This is the only reason they gave. They said
it makes it easier to reason about. Because it makes it easier to reason about, these enactors use
serialization. So instead of them just trying to create records, and if the records are already there,
just not creating them or something. In other words, I have three people running. We all want to create,
you know, let's say this top level record, plan146.ddb.aws, right? We all are trying to do that.
One of us does it first. The next person tries to do it, and it's already there or something, right?
We're all trying to create the same record. So in theory, we could just have three people randomly
hammering on whatever part of the plan they're trying to hammer on, and in theory, it should kind
of all work, right? And I sort of got the sense, although he didn't come out and say it,
I sort of got the sense from the presenter that he would agree with what I just said,
meaning that they could have just had them run arbitrarily, and it would or should be okay.
But he said they used serialization to make it easier to reason about. What that means is
instead of these enactors just hammering on it like that, what they do instead is they attempt
to acquire a lock for whatever the endpoint is that they're trying to update. So in other words,
if this person is trying to update one of these things, and I got the sense that it was if you're
trying to update this one, but it could have been if you're trying to update this one, or it could
have been on both. They never really 100% said, if I remember correctly, exactly where the locking
was occurring. But the locking occurs by them going, okay, I'm going to create a lock that is a
DNS record. And by using the fact that Route 53 has the idea of an atomic, which is, you know,
I can do two things. And if they both wouldn't succeed, then it won't do either of them.
They basically made a locking system that locks via Route 53. So Route 53's DNS records are
actually the lock record, if that makes sense. Can I ask a quick question? Yes. You said it does
this through serialization. I don't quite understand what that means because I thought serialization is
just converting from one memory representation to a different memory representation of something, and I'm struggling.
Different serialization. So yes, that is serialization. In this case, we literally mean
temporal serialization. Meaning they wanted these enactors to have some kind of a way in which they
would organize their behavior into an order rather than just being arbitrary, and the way
that they did that was locking. Okay. So what will happen is instead of this person just
doing whatever it is they're going to do, like, okay, I'm going to, like, I finished this,
I'm going to point this guy at plan 146 now. Instead of doing that, it attempts to acquire a lock
on like this, right? And if it doesn't get the lock, it won't make the change.
So only one of these enactors can be in the process of updating this at any given time.
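The locking scheme described here, where the lock is itself a record that can only be created if it does not already exist, can be sketched in Python roughly as follows. The `Zone` class, the `lock.` naming, the retry count, and the delay values are all hypothetical. The presentation did not say where exactly the lock lives or how the retries are timed.

```python
class Zone:
    """Toy record store with a conditional create. Hypothetical, not Route 53's API."""

    def __init__(self):
        self.records = {}

    def create_if_absent(self, name, value):
        """Create the record only if it does not exist; report whether we created it."""
        if name in self.records:
            return False
        self.records[name] = value
        return True

    def delete(self, name):
        self.records.pop(name, None)

def try_update(zone, enactor_id, endpoint, plan_name, max_attempts=5, sleep=lambda s: None):
    """Take the lock record, repoint the endpoint, release; back off when locked.

    Returns the attempt number that succeeded, or None if every attempt failed.
    The sleep callable is injectable so the example runs instantly.
    """
    lock = f"lock.{endpoint}"
    delay = 0.1
    for attempt in range(1, max_attempts + 1):
        if zone.create_if_absent(lock, enactor_id):
            try:
                zone.records[endpoint] = plan_name
            finally:
                zone.delete(lock)  # always release, even if the update raised
            return attempt
        sleep(delay)  # someone else holds the lock: wait, then retry
        delay *= 2
    return None
```

Note that this sketch has the same open question raised in the discussion: if an enactor dies between taking the lock and releasing it, nothing here ever clears the lock record.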
Does that make sense? Now again, exactly what they were trying to do with that was never explained.
They just said makes it easier to reason about and left it there. So I don't know why they
thought this was an improvement. And amusingly, it's what ends up uncovering the bug. So it wasn't
an improvement. If anything, it was probably bad. But so Casey, are you saying
they don't have a good reason for saying we're going to make the enactors run almost like
one at a time? Why do they have three enactors? I don't understand. Like,
why do they not just have one? They just don't say that. We don't know why. And they didn't quite
explain like, I didn't really hear an explanation for how you have three concurrent enactors.
You expect them to be able to go down, which is why you have three. Right. But they're taking a lock.
So what happens if this guy takes the lock and then goes down? Like, I didn't hear an explanation
for that either. So this was all very confusing to me. Like, I, I, I'm not complaining about it
as part of what we're talking about here because it's not important for the cause to me. But as a
presentation, I had so many questions. Like, I was like, I don't understand why you did any of this
to be completely honest, right? And maybe that's again, part of it could just be that I don't use
idiotic services. It might be that some of these things would be obvious if you are someone who
regularly uses Route 53 or something, you'd be like, Oh, it's because locks can be set to a time
out or, I mean, I don't know, right? But anyway, so yeah, they're doing that. And what ends up
happening, the thing that uncovers the bug, is that
these enactors, when they don't get the lock, they just do like a backoff, right? They'll
basically just do like, okay, let me wait and I'll try again. So this enactor tries to get
the lock, but somebody else already has the lock. So he just waits a little while. He tries to get
the lock again. That's what will happen, right? And what they said happened was they hit a pathological
case, quote unquote, where one of the enactors, you know, has enacted some plan. And that plan,
let's say, was pretty old. I think 110 was an example that they used. So it enacted
plan 110. And it wants to point, you know, it's like, I've got to set the API to point to my 110.
It tries to get the lock to update dynamodb.us-east-1 or whatever and fails because someone else
is enacting plan 111 or something like that, right? Or plan 109, it could have been a previous plan.
So the other enactors are doing it. It can't do it. It backs off, right? Remember, this
enactor here, we're on 110. It's trying. It really wants to enact it. It tries again.
Someone else has the lock now. It tries again. Still locked. This person's sitting on 110,
desperately trying to enact it, can't do it. Apparently, this just happens so many times that the other
enactors, and the planner is just churning out new plans this whole time, right? The other enactors,
they get up to like 145 or something, and 146. They're enacting plans that are way ahead
of 110, right? And this guy's still stalled because he just, unluckily, never gets the lock, right?
Finally, at some point, after plan 145 has already been enacted and pointed to by some other
enactor and all that stuff, plan 110, this enactor, still trying to do it, finally gets the lock.
He's like, yeah. And so then he says, okay, we're pointing to 110 now. Yes, right?
So now it's on a super old, stale plan, but this really shouldn't be a problem, right? Because
eventually, the next time some enactor has something, it's going to be a much later plan. They'll
just enact plan, you know, 146 or seven or eight or whatever, and we'll re-point it back to that,
and we're back to a fresh plan. So everyone will just have bad load balancing for like a few minutes,
but then it'll be fine, right? They did have bad load balancing for at least a few minutes, right?
Yes, true. Well, it's a lot worse than that. That's what was supposed to happen, right? Meaning,
that's how they would expect this to work, too. Okay, the problem is they also didn't want
Route 53 to become clogged with all of these records. Because if they just left them around,
eventually, after, you know, three months, you have like eight billion records that you
stuffed into Route 53 for every, you know, couple minutes you're putting in this big tree of
weights and stuff, they were like, okay, at some point we should just clean up these plans.
So enactors also look for plans that are older than a certain amount. And if they are older than
a certain amount, they'll delete them. So what happened was they pointed to plan 110. This
enactor finally gets the lock, it points to 110. Another enactor's like, oh, wow, 110, man, that is old.
We should get rid of that, and deletes it. So now dynamodb.us-east-1.api.aws is pointing at
a record that can't be resolved, right? Again, it wouldn't actually
look like plan 110. It would look like 0afe129a, some hash, dot, right, ddb.aws.
But it's pointing at that name. And if you ask that name, you get nothing.
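The sequence just described (a delayed enactor repoints the endpoint at plan 110, then cleanup deletes plan 110) can be replayed in a few lines of Python. The retention threshold `KEEP_LAST` and the data layout are assumptions made for this sketch; the actual age cutoff was not stated.

```python
KEEP_LAST = 10  # hypothetical retention window; the real threshold was not given

def cleanup(plans, newest, keep_last=KEEP_LAST):
    """Delete every plan more than keep_last versions behind the newest one."""
    stale = [n for n in plans if n <= newest - keep_last]
    for n in stale:
        del plans[n]
    return stale

def resolve(pointer, plans):
    """Follow the endpoint pointer; None stands in for 'no records found'."""
    return plans.get(pointer)

# Plans 110 through 145 exist, and the endpoint currently points at 145.
plans = {n: f"tree for plan {n}" for n in range(110, 146)}
pointer = 145
before = resolve(pointer, plans)

# The delayed enactor finally wins the lock and repoints to its stale plan.
pointer = 110
# Another enactor's cleanup pass then removes everything it considers old.
cleanup(plans, newest=145)
after = resolve(pointer, plans)  # the endpoint now points at a deleted plan
```

This only reproduces the dangling pointer. It does not explain why a later enactor run failed to repoint the endpoint at a fresh plan afterward.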
So what would happen at that point is everyone who was trying to get an endpoint to send stuff to
would get back an unresolvable name, basically, right? And I don't really know what happens in Route 53
when that occurs. But you would basically be getting back something that you either couldn't use
or just gobbledygook for an IP, who knows. But whatever it was, if you attempted to actually use it,
you weren't going to get a response. Interesting. Is this because AWS doesn't use enough Rust?
Because that's obviously a use-after-free bug. So, good point. I think Rust would have solved that,
right? If you rewrote Route 53 entirely in Rust, obviously, all of these problems are not there.
No, to be specific, I do think in the presentation, they did say, not about Rust, but they did
say what would happen specifically, which is I think when you asked for this thing or either this
thing or this thing, I don't know which one they were referring to, because I can't quite remember,
you would just get back a thing that says no records found. So that's the end game of what would
happen, whether it was from asking for this or asking for that, I'm not sure, but to just get back no
records found. That's what you would have received when you were trying to call that API. So whatever
whatever library you were using to use DynamoDB, it would just be like, hey, no records found,
bro, sorry. So this, if you ask anyone on the internet, they're all like, yes, they explained
the bug. That's the bug. The bug is that there was this race condition, because everyone,
as soon as you say race condition, everyone's brain shuts off. They're like, oh, okay,
well, it was a race condition done, right? So they're like, it's a race condition. They explained
it. It's like, no, they didn't explain it, because if you think about what would happen here,
immediately after this, everyone's getting this, and a new enactor will just
enact a new one, right? And so the bug, right, is why didn't that occur? That's the actual,
the actual RCA that I wanted to see: why didn't the next enactor come and fix it?
Can I, can I throw out something else? Wouldn't it also be a bug, like, why write a record so old
that it should be deleted immediately? Well, it wasn't. It was because this guy had written
it quite a long time ago. And it was, the way, well, I mean, if you're asking why didn't they
write enactors with better code? Yeah, that's a pretty cool thing. Okay, because, okay,
because he's like, if you're updating to something that should be deleted immediately, isn't like,
that feels like the problem right there. You've done something wrong long before.
Yeah, even though it doesn't really fix the theoretical structure of this thing,
a simple check in this guy, where after he finished backing off on the lock, he should maybe check
to see whether he's about to set this to something that he would delete if he was running his
deletion code, is probably a good safety measure. But yeah, so 100% agree with him. Okay, but
that enactor worked really, really hard to get that record. It's been waiting a long time.
And it's going to have its Pokemon cards. And he won't ever wait it. So just let him write the record.
Okay. So, so I want to hear about that. Unfortunately, if you look at the presentation and you look
at the RCA, it's nowhere to be found. The presentation at least has one 12 second little tiny chunk,
where it does say, where the bug roughly would be. And so let me explain what that is.
So what apparently occurs alongside this. So when you do dynamodb.us-east-1.api.aws,
when you point that at your plan, you also do another operation at the same time.
And that operation is to set rollback. I think it's ddb.rollback.aws. I don't remember exactly
what it is here. There is a rollback record. It sets that record to whatever the old plan was.
So if we were here pointing at 145 and we're now going to point at 110, right? This old
enactor is like, I'm moving to 110. It attempts to take whatever this name was,
right, currently, and move that name, which would have been plan 145,
move that so that the rollback address points at the old plan, right? And this is just for debugging.
Or, you know, it's basically just for operator ease, right? If they want to roll back to the
previous plan or something like that, or if you just want to know what the previous plan was,
you can see it here, right? That's part one of how they set up the failure. I would want to point
out one thing here, which was that this also didn't make any sense to me, because I was like, okay,
you're telling me that these things update every minute or something. What good is it to have one
of those? Like, by the time you've even logged in, it's been updated from the one that you wanted to
roll back to, to some new thing that's actually the plan you don't want, because everything went down,
right? Like, you don't want this. You just want these names in a list, so you can
be like, what was it at 12:30? Like, that one, right? So this made no sense to me. I have literally
no idea why this would ever be good, right? It did not sound like it would do the thing you actually
want, which is to be able to mark a point in time and go, we need to go back to 1 p.m. because
everything went to crap after that, right? Anyway, so that didn't make sense to me, but again,
it's not exactly central to the bug. So I didn't ask why. I'm just saying, okay, that's the thing
it had to do. And it's going to let you roll back one version, is what you're saying? Yeah, even though
the other trees do exist, so you easily could, by just knowing what the name was. So all this
is doing is putting a human-readable name on something you almost certainly don't care about, right?
But they can't really store that much stuff, Casey. I don't think they can really
put, like, I don't know, Adam, like, they don't have a lot of scale there, right? You like
the reviews? Oh, that's a lot of lines. If it were me, I would have just made this a timestamp,
if that's what you wanted, right? I would have said, when did the planner, or when did this person,
point to this thing? Like, when you get the lock, you change this name to the timestamp and
update this in one atomic operation. So then you just know, if I want to roll back to 1 p.m., I just look for
whichever had, you know, the latest timestamp not after that time. And
that's what we were running at that time. That's what I would have done, right? But I don't know. So
I have no idea why they did this. They did what they did. You know, it might make perfect sense.
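The timestamp idea can be sketched as a lookup for the latest entry not after a given time. This is only an illustration of the alternative being described, not how AWS stores anything: the history list, the plan names, and the minutes-since-midnight timestamps are all made up.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def plan_active_at(history: List[Tuple[int, str]], when: int) -> Optional[str]:
    """Return the plan that was being pointed at as of `when`.

    `history` holds (timestamp, plan_name) pairs sorted by timestamp, one per
    switch. The answer is the entry with the latest timestamp not after `when`,
    or None if nothing had been enacted yet.
    """
    timestamps = [ts for ts, _ in history]
    index = bisect_right(timestamps, when)  # count of entries with ts <= when
    if index == 0:
        return None
    return history[index - 1][1]


# Hypothetical history, timestamps in minutes since midnight.
history = [
    (12 * 60 + 0, "plan-143"),
    (12 * 60 + 30, "plan-144"),
    (13 * 60 + 5, "plan-145"),
]

# "We need to go back to 1 p.m.": the plan switched in at 12:30 was live then.
print(plan_active_at(history, 13 * 60))
```

With a list like this, rolling back to any point in time is a lookup, rather than depending on a single record that only remembers the one previous plan.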
Again, I have no knowledge of their system. All these things may make perfect sense. So I'm not
really, I'm just saying I don't understand them. They might not be bad ideas, right? They might
be good ideas if you understood the rest of the system. So anyway, what they say, and this is
all we get, is that this operation, meaning setting the rollback to point to the old plan,
which in this case would have actually been newer in some cases, right? So it's not really
the old plan, it's the previously pointed-to plan, which may be older, may be newer. Doing that activity,
if that plan no longer existed, meaning it had been deleted like this, then the enactor
stops permanently. So once you get into a state with DynamoDB US East 1,
right, where we do the whole sequence of steps that we said here and this plan gets deleted, so
now this is pointing at an invalid, like, unresolvable name. We cannot resolve plan-110,
which is actually some hex code. But whatever that was, we can't resolve that anymore. Once that
state is true, then the next time an enactor comes and tries to make it point to a new plan,
whatever that new plan is, when it actually gets this far and tries to set the
rollback, that will crash it permanently. Therefore, all three enactors will now stop, because eventually
all three will try to enact a new plan. They will try to set the rollback first to point to whatever
the, uh, the old plan was, find that there's no plan there, and that apparently is just a hard crash.
Now, I thought the three enactors were supposed to make it so that it had redundancy.
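A toy model makes the sequence easier to follow. It only encodes the behavior as described in the presentation (setting the rollback stops the enactor for good if the old plan no longer exists). The class, the record keys, and the plan names are all hypothetical, and the real system is certainly more involved.

```python
from typing import Dict, List, Optional, Set


class Enactor:
    """Toy enactor; `running` going False stands in for the permanent stop."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = True

    def enact(self, plans: Set[str], records: Dict[str, Optional[str]], new_plan: str) -> None:
        if not self.running:
            return
        old_plan = records["active"]
        # Step 1: point the rollback record at the previously active plan.
        if old_plan not in plans:
            self.running = False  # as described: the enactor stops and never retries
            return
        records["rollback"] = old_plan
        # Step 2: point the active record at the new plan.
        records["active"] = new_plan


def run_round(enactors: List[Enactor], plans: Set[str],
              records: Dict[str, Optional[str]], new_plan: str) -> int:
    """Let every enactor attempt the new plan; return how many are still running."""
    for enactor in enactors:
        enactor.enact(plans, records, new_plan)
    return sum(1 for enactor in enactors if enactor.running)


# The state after the race: the active record names a plan that was cleaned up.
plans = {"plan-145", "plan-146"}
records = {"active": "plan-110", "rollback": "plan-145"}
enactors = [Enactor("a"), Enactor("b"), Enactor("c")]

print(run_round(enactors, plans, records, "plan-146"))  # enactors still running
print(records["active"])  # still the unresolvable name
```

Nothing in this model depends on how the active record came to name a missing plan, which is why three copies of the same enactor provide no redundancy against it.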
Now again, this is why I get grumpy with people online who are replying like,
"it was a race condition." It wasn't a race condition. The race condition is not necessary for this. The race
condition is just why you ended up with this name being unresolvable. But if you didn't have
whatever code did this badly, it wouldn't have mattered. You never would have known. You would have
had a momentary, like, you know, minute outage of DynamoDB or something, but I'm guessing there
are minute outages of DynamoDB from time to time, right? Like that's not global news. What's global
news is taking it down permanently, which is what happened here. And until an actual human goes and
figures this out, resets it, and gets these enactors going again, it's just gone, right? It's just out
permanently. So hours, potentially, right? And it was long enough, I guess, in this case to then
have cascading failures. You would never have had that if there's a momentary outage, like, if some
people momentarily got an unresolvable name or no records, right? Then they would just try again.
That's usually what happens with DNS, like, that's like your phone, you went through a tunnel,
right? That's all that would have been. So I want to know, what did the code look like here?
How did you write something that fails if this wasn't a valid name, which it wouldn't even be on
startup, meaning if you were starting this system and the operator hadn't pre-configured it,
it wouldn't be pointing to anything, right? That's the default case that you would think you'd start with.
So if you're going to do this, you would think you would just handle that case, because the rollback
address could just not point to anything, right? Just take whatever this is; if it's nothing,
set the rollback address to nothing, done, right? So there's something really weird about the way
they wrote this code. And that is what should have been in the RCA. That's the whole bug to me.
This is just set dressing for how we ended up having this thing point to nothing.
But the same bug would have occurred if someone had accidentally deleted this record. Like,
some operator was just like, oops, crap, I set it to nothing. The same bug would have happened,
according to the presentation, right? So the root cause is not the race condition. The race condition
is an aside. Does that make sense? Quick question. So I am trying to, I'm legitimately thinking
through this. And so that means the thing that sets the rollback probably assumes some sort of
struct with a bunch of memory or something has been passed in, does some sort of
access, and it explodes? Or do you think this is the same style of bug as the one line that
took down Cloudflare, which is they just assume it's there and unwrap it? It's in Rust, it is memory-safe
Rust, unwraps it, explodes. I really don't know. My guess, like in my head, I was like,
what is the thing that I see people do a lot of times where I'm always like, why would you ever do
this? But it's just because that's the way they learned to program. And I was thinking like,
if you were writing in one of these languages that likes to throw exceptions for error conditions,
this would be a great example of that. So if you had a thing where you were like, oh,
I went to go get the DNS record that this thing points to. And normally, in a sane programming
environment, no one is throwing an exception there. If they get back nothing, they just return
nothing, right? And then when the person goes to set ddb.rollback.aws, they just set it to nothing,
which is the correct behavior. Like, literally the value nothing flows correctly
through this flow. So if you were writing it, since it is a core foundation service,
assuming you were trying to write something that was fault tolerant, you would never do something
like throwing an exception. So in my brain, I'm thinking, I bet what happens in here is when you ask
for this record, they just use some library call or something that throws an exception when the record
doesn't exist. And it just threw an exception in the enactor. That's my guess, right? And
I could be very wrong about that because it's just a wild guess, right? But this is why I want to see
the RCA. What was it? It could be exactly the stuff that Trash was talking about. I mean,
it could be the stuff that Prime was talking about, could be the stuff that I just said, could
be anything. And I want to know, because that's where the actual education would be here.
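To make the guess concrete, here are the two lookup styles side by side. This is purely a sketch of the guess voiced above. Nobody outside AWS knows what the real code does, and the dictionary of records, the "active" key, and all function names are hypothetical. The name ddb.rollback.aws is only the half-remembered record name from the discussion.

```python
from typing import Dict, Optional


class RecordNotFound(Exception):
    """Raised by the strict lookup when a name does not resolve."""


def lookup_strict(records: Dict[str, Optional[str]], name: str) -> str:
    """The guessed-at style: a missing record is treated as an exceptional event."""
    target = records.get(name)
    if target is None:
        raise RecordNotFound(name)
    return target


def lookup_tolerant(records: Dict[str, Optional[str]], name: str) -> Optional[str]:
    """The style argued for: a missing record is just the value 'nothing'."""
    return records.get(name)


def set_rollback(records: Dict[str, Optional[str]], active: str, rollback: str) -> None:
    """Copy whatever the active record points at into the rollback record.

    If the active record points at nothing (fresh startup, operator typo, or a
    race), the rollback record is simply set to nothing as well.
    """
    records[rollback] = lookup_tolerant(records, active)


# Fresh system: nothing configured yet, and nothing breaks.
records: Dict[str, Optional[str]] = {}
set_rollback(records, "active", "ddb.rollback.aws")
print(records)

# The strict variant turns the same situation into an error the caller must catch.
try:
    lookup_strict(records, "active")
except RecordNotFound as missing:
    print("strict lookup raised for:", missing)
```

In the tolerant version the startup case, the deleted-record case, and the race all take the same ordinary path, so there is no special state left to crash on.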
Avoiding this race condition is completely unimportant. This race condition could have lived
there. And while it was important eventually to fix it, to avoid those once-a-year weird outages
for five seconds or something, it is not actually the thing that we most want to learn. What we most
want to learn is: don't write this thing. And we don't know what this thing even was. So how do we not
write it? This is why I think it was a bad RCA. Does that make sense? Yes. Yes.
All right. What is most of AWS written in, Adam? It was Java when I was... I was about to say, someone from
the chat said Scala. They said they worked at AWS for seven years, and they said most of it's written
in Scala. Well, that's technically Java with extra steps. And that will anger all of them endlessly.
So, that's really it for me. This was a thing where I was like, I don't feel like
I saw the explanation. And I actually feel like it's important to hear, because there was a bad
programming practice at the bottom of this somewhere. And I want to know what it was.
Especially because it helps people like me. You know, I don't really do a lot of
architecture education right now. But at some point, I probably would like to do some of that,
because I think there's a lot of bad architecture out there. And so I kind of try to pay attention
to these things. Like, what are the kinds of architectural mistakes that people are making? And I bet
this was one of them, right? And so I'd like to know, I'd like to know.
Yeah. I think what I would expect is at least one simple, reproducible example of
why it blew up, like a whole little code snippet. And this is something you brought
up earlier, kind of how we approach these types of things. Like when I'm reviewing
someone's code and I see something that looks weird, I will always do my best to make my own
little sandbox and prove my theory out. And then actually show the code, like, this is
why this is probably wrong. Here's a small, simple, reproducible step. So I would expect
something like that. And that also helps me truly understand. Because a lot of people,
like you said, they'll see something and go, that looks funny, but I don't know why it looks funny.
But I can't stop there. I've got to actually build it out and then understand. So
that's what I would expect. And you know, like I said, the CrowdStrike and the Google outages,
I thought were better at just telling you that. They were like, look, it was a null pointer
deref in here. Or it was an out-of-bounds array access, because we thought there's only going to be 20 and
we put 21 in the config file, right? And like, okay, I know exactly what kind of code, you know,
is causing that kind of problem, right? And furthermore, to an earlier comment,
literally, as far as I know, everyone who programs in Rust only does it so that occasionally,
when they see something like this, they can say, well, if they had written it in Rust, it wouldn't
have happened. They were not given enough information to even make that comment. They probably made it
anyway, to be fair, but they were not given it. So one rule that should be followed
in RCAs is you have to give, uh, Rustaceans enough information to, if they so choose, correctly say
that it would have been prevented in Rust. And this, we do not have that. We do not know whether
this would have been prevented in Rust. We have no idea. It probably wouldn't have, but we don't know.
Well, Casey, we do have a pretty good chance, because it probably would have never shipped. So
that would have prevented it. True. We would have zero enactors because we would have no
set of enactors. Yeah. Cloudflare does a really good job at this as well. They go in and show
a lot of lines of code and say, this is exactly what's going on. This is, you know, even
though the problem's up here, this is the line that exploded due to all these previous conditions.
That was me making fun of Rust with the unwrap, which actually wasn't truly the problem. But you
know, it's just like all these things kind of happen. So they do a really good job. I'm surprised at
how poor of a job AWS has done for this one. Well, and the other thing too is, it was one of
those things where it makes me unnecessarily suspicious of you, right?
When I read this, I'm like, are you hiding something? Did you not really figure out what the bug was?
Like, you talked all about this race condition, but even from your own presentation, I can tell the
race condition really wasn't important. That was just what led to the record having
been set to nothing, but who cares, right? Like, that's something that's nice to put
in the RCA as an explanation of why this bug occurred now as opposed to some other time,
but it's not the bug. So it's weird to me, like, when I see an RCA that doesn't talk about the bug,
now I'm suspicious, right? And unnecessarily so, because if you actually did find it, then just tell
me, and now I know you found it, right? So I think it also is a confidence boost
for the people who are looking from the outside who want to know, can they trust this kind of
deep thing? If it looks like you actually found the bug, I have a little more confidence in you.
If it looks like you have no idea what the bug was, or don't seem to understand what the bug was,
then I'm more concerned. And so I think that's also another reason to do this in your
RCA. It provides confidence to your customers. Maybe that's why they fired Adam as an AWS hero,
too. Maybe it's all connected. They didn't want him exposing these dirty secrets. Yeah,
he was too much. He knew too much. Could you give a quick, like, three-minute summary of the
guitar shop, like what that was revealing? Because I'm trying to remember what it was,
because it involved like a single point of failure guy who was out here for this failure as well.
So I don't know how to reconcile the two things. And of course we have no idea
if either are telling us the truth now, right? Because this was such a bad RCA, I have no idea if
it's correct or not. But the password. Yes, the password was wishbone 12, I think. There you go.
I always try. That's my recollection anyway. So yeah, that story was that
there was a thing that was designed to copy configurations. And that thing had kind of gone rogue
and could not be stopped. Like, it was just copying configurations totally
incorrectly. And it needed to be fixed or repaired or something. And we don't have any more
information because it was an overheard conversation, right? And so does that comport with this?
Well, a little bit, because those enactors do sound like the kind of thing that would be running a
configuration copy. But on the other hand, it's not really a configuration for machines. Like, a
DNS entry is a DNS entry. It's not really a configuration. So I would say the two
stories don't line up that well. And so that's another reason why I was kind of hoping that this RCA
was a little bit more believable, because I wanted to know for sure that the story was false. And I
still don't really know, based on how bad this RCA is. What if the tool that the guy wrote to
copy the configs is just literally the enactor? Like, they just productionized it and
they haven't changed it in seven years. That was kind of my, I don't know, connecting the dots
there. He's like, guys, I wrote that as a way for me to test stuff in my local environment.
And you just decided to make three enactors and put them next to each other and brought. I don't
how did this happen? I do. I have alternative questions. Alternatively, is it the rollback? Because that's
the one that did the copying of like, hey, here's the previous one, right? And so I'm going to copy
the previous one. Then it gets like this null issue going on. And it's like the script never
encountered a null, goes rogue, and starts writing over and over and over and over and over again
to where you can't do anything. I don't know. All I know is that, as far as I can tell
from their explanation, going only on what they were providing, I still just don't think the race
condition's even relevant, because again, literally an accidental update to the Route 53
endpoint would have taken down all three enactors immediately. Because according to them,
all that's required to stop them is if the endpoint points at an unresolvable name.
That's all you need. And so if that's really true, literally an operator typo could have taken
all this down. No race condition necessary, right? And so again, the RCA just does not do a good
job convincing me that you've talked about what the real bug was, because I can think of so many
ways that you could have triggered this exact same thing that don't involve this race condition
that you spent the entire RCA telling me was the bug, but I don't think it is.
Well, we'd like to extend a formal invitation to Jeff Bezos. Do you want to come in here and explain
yourself? I believe it's Andy Jassy now. Yeah, Andy Jassy is the man you're looking for.
Or wait, I'm going straight to the top. Jassy was previously AWS as well. He's not
AWS anymore. He's head of Amazon. Yeah, that's what we want. I mean,
we want the president. Bezos was the previous head. Now he's just chairman. So he's no longer day-to-day,
you know, he's gallivanting. So he wants one more bootcamp around. I want a real chairman analysis,
though. What's going to both? I want to, that's an art that he is. The real chairman analysis. What's
the real chairman? The armchair, the armchairman analysis. The armchairman. The armchairman.
I don't know what I'm looking for. The armchairman. Dude, why is an armchairman a phrase?
Like, I'm armchairman of the board. I'm pretty sure that's the guy that does the chips, right?
The armchairman. The armchairman. I don't get the joke. I'm sure it's. Armchair. Armchair.
Armchair machine. Yeah. Oh my god. Not potato chips. I don't trash this on the pod.
Oh my god. Dude, wait a minute. Wait a minute. Wait a minute. Wait a minute. Wait a minute.
This is our sponsor opportunity. Potato chip companies make a line of potato chips where the bags
are labeled by actual chip. So it's like this is a 9950X3D chip. Like the bag just has that on it.
And it's specifically potato chips for developers. I have been trying to convince
sun chips to sponsor us forever because think about it. Sun. Sun microchips.
Yes. That's what I've been saying. Oh, I was not. I was not going on that. It's a match.
What were you trying to say? It's like garden salsa. Yeah. It's a garden.
Programmer's garden salsa.
Yes, microchips. Sun microchips. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah.
That's the one that we need. If we can get a sun chip sponsorship, that's the perfect one.
It's a match made in heaven. Okay. It really is. Okay. But Casey, I've been tweeting at them for like five
years. They don't care. They don't care. They're just like, who is this guy? And why does he keep
tweeting at us? I would say if I could find the sun chips by all the sun chips to eat here.
This is this is May 28, 2001. Okay. We're really digging.
2020. Wow. Dude, you are really working hard on these chips.
Yeah. Well, I was writing a lot of Lua at the time. So I thought that it would go nicely
together. Lua, like moon, sun, right? It's like moon does the sun. We're echoing back to each
other here. So that was a classic hashtag computer chips. There he goes.
That's a new terminal line. They were not. Oh, they were. That was part of the meme.
They did not reply. They've never replied to me. They've never interacted with me.
So disappointing. I know. So where I'm still waiting, still waiting. So disappointing.
I can't. You like sun chips, though, right? A harvest cheddar, specifically.
Are you a harvest chip? Are you kidding me? Oh, my gosh. Hold on. Hold on. Hold on.
I am a fan. When we were at the tower, all those garden sauces,
threw them to the side and disgusting. Oh, wow.
It's everything. Nope. It's not. You can't be a part of this. I thought we had the same taste in
snacks. Well, I ate all the harvest cheddar's over that tower. That's actually great,
because then we make a good team. See? That's what I'm saying. You're like, oh, you know, yeah.
I actually enjoy French onion also. I don't think. What is there a French? I don't think I've
said green pack green packs French on blue packs. A ridge, which I'm not. It is original.
That is not that good. Adam, are sun chips vegan? I've never had a sun chip in my life,
I don't think. I don't know if they're vegan. Please, you know,
they may not have any. Oh, sorry. Oh, my goodness. Okay. Well, I think this is probably
actually a reasonable time to end at this point. Yeah. I can't believe Casey wrote backwards
that whole time like that. That's what I was going to ask. Yeah. It's running backwards.
Yeah, but you don't use to it. It's right hand. He writes forwards. He trained his left hand
to write backwards. That is not how you're not really. Yes. So it's like right hand. He did it.
There. He does it the right way. He did not write his right hand backwards, but you can watch.
He does his left hand. He does it backwards. Are you guys trolling? No. How do you just
know that? Casey's nodding. Yes. I said, no, I'm not against like you use the left hand. That's
how you do it. Because, so, if you're ambidextrous, this would be hard.
But if you're used to writing with your right hand, then when you train your left hand to write
backwards, that just seems normal, right? Because there's not another thing. Yeah.
But you are trolling. You don't actually write backwards. Yeah, we're trolling.
Yeah. Well, you think he would ask you, right? But actually I can kick with those
fingers. So many people asked me that. I was watching so closely your hand in it. I couldn't
figure out if it could. It was like, no, I think that's some right way. Like I'm trying to like
stand where you would be standing. I could not get that question out of that. Adam starts writing
on his fake on. Yeah. It's best for writing backwards for six months. Come out of my
I don't know how you do a Casey. So it just seems magical to me. It's just the mirrored camera,
right? Thank you, Trash. Yes. Okay. No, no, but how do I see through it?
What do you mean?
It works through the glass, right? Because if I'm writing on one side, you see it backwards.
So if you just mirror it, does it actually come out correctly? I haven't done
the math to know. Right. This is a wrap. Wait, no, wrong, false. What did we say at the start
of this thing? If you actually want to know, you ask questions. Okay. Don't just fricking
pretend. And Prime is doing the exact thing he should be doing, which is saying, I'm not sure,
does that actually work? Yeah, because my only question is how many times do you reverse it,
right? Because you're going to, if I draw like this, I draw like this. But then it goes through
if you're looking at it from the other perspective, for you, yours, it's going to be flipped, right?
There you flip this thing. It's going to look like it's going to, oh, that's vertical.
The glass was clearly between the camera and him. Yeah, the glass is clear. So that's backwards
looking through it. And so then you reflip it again. Correct. Work out. Okay. Maybe if you have
sure. I buy it. Yeah. If you were, like, you got the right now, you wrote the letter D,
right? Yeah. Of course, unfortunately, that could be letter B. So you probably want to write a letter
that's a different one. Give me a different one. There we go. So you write the letter Z.
You're writing it this way. You're looking at it. So think of yourself as standing and you're
looking at the board, you write the letter Z, right? Now, if you were to walk around behind your
screen and look back at it, you'd see Z flipped. Yeah. So all you have to do is just flip it one more
time and it's correct. Okay. Sure. You guys. One thing that gets people always really riled up.
Sure. Yeah. I draw my S's. Yeah.
Bottom up S. Are you kidding me right now? But then look how beautiful they are. I mean,
that's not a thing. No. It's like a serpent. You can tell when it's done that it's wrong.
Your Z. Do you actually write your Z like that too? I write my Z that way. Yeah. Sorry, Matt.
Sorry, Trash. We did math. So we cross our sevens. Yeah. Okay. I do, I do my, I do my sevens
like that. I don't do it. It's called Z by the way. But sure. Oh, dang. Oh, all right. Got it.
That's fine. Also, can I can I can I show you guys something? I have a very magical ability.
Are you ready for this one? Yes. I drew an ampersand first try without fumbling it.
It's considered pretty impossible by many people standards when they first draw an ampersand.
It's very difficult. Also, can you zoom in on the line? It looks pretty good from where I'm at.
Almost a perfect line right there too. That's computer enhanced. Now you can see that I moved it
right here and it moved it right here. You can see the two spots. So I do enhance your enhance.
That's why you got to go to computer enhanced. The way you write your S, that has to slow you down
so much. I just decide to write really nice instead. I don't care. I write. I don't care.
Why do the trotts fly on Trash? Can you explain that to me? Yeah. I don't know. It's just to me.
This is not the same one already. Hold your pencil like this. Can you do it?
No. How do you hold a pencil like this? Okay. Well, there's your problem. You hold it like a weirdo.
Dude, we were just talking about drum line. What do you hold it like this? Yeah. Like a proper, yeah.
Neither of those. No. That's not how I hold it either. What are pencils and pens for? There's too many fingers involved.
Right. Right on top. Boom. That's how you do it. Can you guys do this? Can you guys do this?
I can do that. I can do that. Okay.
Oh, we see this. Please make that be the opening. Can you guys do this?
And then it goes down. And then you cut to the start of the podcast.
Dreshing. That reminds me of I was at my I was at my friend's house. We were all sledding
together with us and our kids. His son goes down the hill and he goes and this is how
a 13 year old sleds. He jumps down the hill immediately gets the sled caught down
face person. The snow. It just bips into the sky. He's stuck right.
So now every time I see him at church, I'm like, and this is how a 31 year old runs.
That is so mean.
I think he's hilarious. He cracks up every time.
I'm going to do one more little trick for you. Okay. Are you ready?
All right. You got to hold a pencil like this. All right. And then you without taking your
hands off the pencil. Okay. We did this one. Yeah. You got to end like that.
Oh, I knew the answer at one point.
Okay. You just don't take your hands off of it. I can't do that. That's the one that. Okay. It's very simple. Okay.
So you put it right here. Right. And then you got to turn it. And then you just put your hands
the other way. It's like three pixels on my son.
Yeah. It just doesn't even look like it. It's a pencil. Whatever. Can you do this? Boom.
But you can't do it. Can you spell blood with your hand?
Oh, trash. Bro, you get to get this podcast. All right. Trash. All right. Trash. Can you
fortnight dance? Probably not. Yeah. Let's see a trash. No, please don't try to do this.
You don't want to do this. I don't want to do it. Did you just be in me? No, no, no, no, no, I'm good. I'm good.
If you hit an orange justice on this podcast, you will literally never stop orange justice
from here until the universe ends. No, we're good. End it.
All right. Thank you, everybody, for watching. I'm terribly sorry about this.
Number 50 tech podcast. Yeah, I'm sure. We got out of this one. We didn't even have to show a
nipple. So that was a good episode. Can you believe I told him to go to Spotify for this?



