
The ugliest and jankiest hacks we’ve put into prod, and a few of the worst things we’ve seen other people get away with.
Support us on Patreon and get an ad-free RSS feed with early episodes sometimes.

Subscribe to the RSS feed.
This Late Night Linux family podcast is made possible by our patrons.
Go to latenightlinux.com slash support for details of how you can join them.
Support us on Patreon for access to ad-free episodes and early releases.
That's latenightlinux.com slash support.
I'll go first, because I'm not sure if mine is pragmatism or jank, so you can all be the judge.
No difference.
Sure.
That's what I told myself at the point that I rolled it out to prod.
So mine was Nginx try_files and s3cmd.
Oh my God.
I already know what this is.
All right.
Next person.
The time I installed an agent on a cluster when I was definitely not supposed to.
And Shane.
So mine is building a platform with, how would I describe it, an IaC injection hack.
Love it.
All right.
And tell us all.
So I had a moderately sized data set of images that I needed to migrate to S3.
When I say moderately sized, I'm talking tens of petabytes of images that I needed to move to S3.
Okay.
That is quite big.
Yeah.
I mean, this may or may not have been every image of every product sold in any European
supermarket that was stored on an EMC.
I have many questions, but go on.
So we were moving this for various reasons off of the on-prem SAN and into S3, but what we didn't have was a good way of performing that move with an ever-changing data set without downtime.
So, just so I understand: the images are being used live and they are changing, and you want to push them up while that's happening, effectively. You can't stop them to do this migration.
Correct.
Yes.
So what do you do?
Like I said, we used try_files and s3cmd.
So effectively, we had a database which stored the path on disk where the image file was.
And I won't go into too many details.
But effectively, if a new version of that image existed, we then used Nginx try_files to see whether it was stored in S3 or not. If it wasn't stored in S3, we then presented the user the copy from the on-prem SAN, but then used s3cmd to perform the upload of that image to S3.
So what this meant was that as users were accessing the images in real time, we were triggering
an upload to S3.
And then pointing to the version that's in S3.
Well, we didn't have to, right, because try_files would try S3 first.
So eventually, all of the images that were currently accessed would end up in S3.
And we'd just have a bunch of legacy images, that weren't accessed particularly often, on our on-prem SAN. And then we, at some point, did a cutover: upload new images directly to S3 and then just sync the delta of the old stuff.
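The exact config wasn't shared, but the flow described (try S3 first, fall back to the SAN copy, and upload the miss as it's served) can be sketched roughly like this. The bucket name, mount path, and upload hook are all hypothetical, and this uses an `error_page` fallback rather than `try_files`, since the real details weren't given:

```nginx
# Hypothetical sketch of the S3-first, migrate-on-access flow.
location /images/ {
    # Ask S3 for the object first.
    proxy_pass https://example-images.s3.amazonaws.com;
    proxy_intercept_errors on;
    # S3 returns 403/404 for a missing key: fall back to the SAN copy.
    error_page 403 404 = @san;
}

location @san {
    root /mnt/san;   # hypothetical on-prem SAN mount
    # Something out-of-band tails the access log for @san hits and runs:
    #   s3cmd put "/mnt/san$uri" "s3://example-images$uri"
    # so each missed image is uploaded to S3 as it is served.
}
```

Once an object lands in the bucket, the proxy branch serves it directly and the SAN is never touched again for that image, which is exactly the on-demand migration behaviour described.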
I think this is brilliant.
This is like, on-demand migration.
It's great.
That is exactly what it was, right?
So every time that an image was accessed, it was migrated to S3 as it was accessed.
You're right.
It literally was an on-demand S3 migration based on user activity.
Kind of reminds me of that.
Was it Cloudflare who came up with the S3-compatible thing that did that sort of live migration of existing data?
Yeah, I mean, it's basically that, but before that existed, right?
Because it basically meant that the most popular images were then migrated to S3 sooner.
So we had this weird load profile on the on-prem SAN, where the reads dropped dramatically during the first 48 hours, and then it tailed off more and more from there, as the more unpopular grocery products started getting accessed.
Not only did you solve this in an elegant way with no downtime, but you also did the grocery company's job, where at the end of it you could tell them: these are your least popular products, drop them, or buy something else.
Well done.
Well, I didn't have the foresight to tell our product management to monetize that at the time. But you're right, we could have literally monetized our Nginx access logs to a grocery store.
That would have been amazing.
There is something wrong with me that the first thing I thought and assumed when you
said images was some sort of disk image, you mean actual like PNGs of like cabbages
and broccoli and stuff.
Okay.
He did say petabytes.
Like, I can understand why you'd say that.
You see?
He's totally logical.
I'm not insane.
Yeah, but sure.
Imagine that these images are really high-res, like 5,000 by 5,000 PNGs of every side of every box of every product sold in most grocery stores. That mounts up pretty quickly.
That's how you end up with petabytes of images.
So this is why I said that I couldn't decide if this was pragmatic or stupid because it
was an entirely in house thing that we built and maintained.
And I don't know what we would have done in terms of error handling, but there's probably
a whole bunch of stuff we missed that we were just lucky on, but it got the job done.
And we managed to move all of the data without any downtime.
The line between genius and hack is so thin.
All right, Sean, your turn.
So if you've ever worked with a platform team, then you're given a platform and a set of rules. And they say you have to do this within these constraints.
And me, I like pushing boundaries.
I like running things I'm definitely not supposed to.
So I decided I wanted database monitoring. And the only way I could really get that was to set up an agent on the cluster. The problem being, they'd already done that for, like, everybody. They just didn't have the one feature for database monitoring that I wanted. So I had to figure out a way to install an agent without actually installing the agent on the whole cluster.
That sounds like an easier approach than just going and talking to the people who installed
the agent.
Look, I tried, they said they had a very busy roadmap and I said, I'll do it myself.
Well, it was not pretty.
What I ended up having to do was pull the image and run it with a bunch of configs to basically turn off every single other feature except the one database monitoring thing that I wanted. And then you have to give it a config file that basically says: here are all the databases I want you to monitor, and here's how you connect to them. And the problem I had was that the secrets that held the credentials to talk to these databases were in other namespaces that the service account that would be running the pod did not have access to.
Okay.
So is this a symptom of the fact that you're effectively running this where you shouldn't be, because the database team who does the real monitoring sits in a different namespace?
Yes.
And man, I did not have a good solution to this. Do you guys have any solutions? Any ways you guys would have tackled this?
I mean, you're fighting such an uphill battle against the system around you.
I probably would have looked around and went, I don't think I'm supposed to be doing this.
Would be my first thing.
No, it's not a solution, I don't think, Sean. Was it really worth it, that one monitoring feature, whatever it was?
So your problem is that you're trying to get the secrets from one namespace out and
into another one so that you can use them in that other namespace.
Yeah.
But that wasn't possible.
And the thought never occurred to me, Shane, because I do what I want.
So what I ended up doing was basically copying over all of those secrets into my namespace.
Don't do that.
And then I had to use some weird templating language they had for environment variables to kind of inject them into this config map that I'd also generated. And it had a list of like 20 databases it was supposed to monitor.
It was ugly.
It was not easy to parse, but it worked.
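The secret-copying part of the hack looks roughly like this. The namespace and secret names are invented for illustration, and this is exactly the kind of thing your security policy (and your platform team) probably forbids:

```shell
# Hypothetical names throughout. Copy a secret out of the namespace that
# owns it into your own, stripping the metadata that pins it to its origin.
kubectl get secret db-credentials -n database-team -o json \
  | jq 'del(.metadata.namespace,
            .metadata.uid,
            .metadata.resourceVersion,
            .metadata.creationTimestamp)' \
  | kubectl apply -n my-monitoring -f -
```

Note that this snapshots the credentials at copy time: if the owning team rotates them, the copies silently go stale, which is one of several reasons this pattern tends to end badly.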
One might think that this wasn't what you were supposed to be doing.
Look, I needed database monitoring.
I needed it.
But did you need it so badly that you had to build some jank solution that was probably against your security policy? I don't know, man, it feels like a thing where I would have probably tried to get better upstream support for this rather than just janking it.
Yeah.
I got nothing.
Y'all are right.
That was bad.
All right, Shane.
Tell us your truths.
So yeah, I once built this platform, and I gave database secrets to this maniac, who decided to copy them into his own namespace to get better monitoring.
To be fair, my actual story isn't too far off that.
You know the way platforms are supposed to abstract away the whole cloud environment and have like a bespoke interface for your developers?
That's kind of the point of it, right?
And not let me do what I want.
Yes.
Yeah, yeah, yeah.
Exactly.
If you want a new thing, Sean, then you ask for it and we'll prioritize it on our very, very big backlog, and you'll get it done in like six months. It'll be fine.
It'll be great.
Oh, my God, Shane.
You're not important, Sean.
Don't worry about it.
Trauma.
So back in the early days of AWS, I was building a platform around Lambda. People could request Lambdas. And what we were doing was taking a YAML file, templating that into some CloudFormation, and making the requests.
We put in some guard rails and some bits and that was it.
Some of that was a bit more complicated than we were willing to put the effort into. And so instead of, like, just doing the right thing, we just let people insert CloudFormation when they made their request.
Okay.
So let me play that back to you.
So you're building your platform engineering product, and the way people are supposed to interact with it is with a YAML file. But because you didn't want to build all the features that would be required to service all of your users, you did the basic features and then accommodated the additional stuff that people needed with dodgy CloudFormation hacks.
Basically, yes, which meant that if you knew what you were doing, you could insert any arbitrary CloudFormation. There was a bit of an impact to that, which was that clever engineers realized they could just request whatever the hell they wanted. So within the CloudFormation of this Lambda, they'd injected: I also want a couple of EC2 servers. And I'm not going to use any of your guardrails, so screw you. And the hacks around this were varied.
Someone also requested a VPC on top of our existing VPC and broke all our networking. So it was a hack that was quite painful to get rid of, because people then later on depended on it for production, which was a real joy to roll back.
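A hypothetical sketch of what those request files might have looked like. The field names are invented, but the shape is the one described: a small validated section for the Lambda, plus an escape hatch that got pasted into the generated stack verbatim:

```yaml
# Hypothetical platform request file. Only the 'function' section was
# validated; anything under 'extraCloudFormation' was merged into the
# generated stack verbatim -- the injection point.
function:
  name: image-resizer
  runtime: python3.12
  memorySize: 256

extraCloudFormation:
  Resources:
    DefinitelyJustALambdaHelper:     # actually a couple of EC2 servers
      Type: AWS::EC2::Instance
      Properties:
        InstanceType: c5.4xlarge
        ImageId: ami-0123456789abcdef0   # placeholder
```

Because the merged stack was deployed with the platform's own CloudFormation permissions, the escape hatch inherited every permission the platform had, which is how the surprise EC2 instances and the extra VPC got through.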
Is this why you guys don't allow us to do things that you don't want us to?
Yeah.
Yeah.
You'd have bet our platform engineers would never have let that in there, but it was in there.
And so we had a nightmare.
It was a nightmare to roll back, but in a way, it was great product feedback.
What are the features that our platform doesn't have?
Oh, it's all this hacky crap that people have done and injected into our Lambda CloudFormation. That's what we learned when we realized the error of our ways.
Yeah, this is all valuable feedback, just in the form of people hacking around things.
Exactly.
Exactly.
It's all learning.
Yeah, we've all seen remote code execution in the past, but this is like bugs that allow people to remotely deploy infrastructure.
I mean, I'm baffled that you put no guardrails in place around this, frankly, like you didn't
even think to limit the scope of the role that those lambdas were assuming to stop them
deploying services that you didn't want.
So it was literally spitting out CloudFormation and then importing the CloudFormation that the engineer provided. So it had to have as many permissions as CloudFormation needed. And maybe we kind of gave CloudFormation too many permissions, and probably should have said, okay, let's just provide these services only with the developer-facing CloudFormation. Like, let's not spin up another VPC.
That's what I'm thinking, right? You could have at least given it a deny on VPCs.
Look, if we couldn't be bothered to, like, template this properly, we're probably also not going to be doing proper security on the back of it.
No, don't do that.
Let me do what I want.
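The "deny on VPCs" idea being suggested is a standard IAM pattern: an explicit deny attached to the role the stacks are deployed with wins over any allow, so it would have blocked the surprise networking regardless of what got injected. The actions listed here are just illustrative examples:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "NoSurpriseInfrastructure",
      "Effect": "Deny",
      "Action": [
        "ec2:CreateVpc",
        "ec2:RunInstances"
      ],
      "Resource": "*"
    }
  ]
}
```

As the discussion later notes, the hard part is that a denylist like this has to chase a very large, moving API surface, which is why an allowlist of the developer-facing resource types is usually the sounder (if more laborious) approach.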
While we were talking about this, though, I was just genuinely terrified about what AI is going to do. What would AI do with that knowledge? When it's trying to be very helpful servicing some developer's request, and it's learned: I can just insert arbitrary IaC in here. It'll just do whatever it wants. It's going to be great.
So be afraid, platform engineers: prompt injection is a real thing. Someone can think they're getting some sort of cool design review skill, and then all of a sudden, boom, AWS bill gone up 1,000%. But you have an alert for that, right?
Yeah, yeah, yeah.
Absolutely.
Yeah, totally.
The thing is, it wouldn't even be the developer, it would be like, oh, I've built this
thing.
Please deploy it and it just finds a hack and does its best.
There really isn't a great way of solving that problem though, is there?
I mean, platform engineering is almost by definition a mechanism to create standards
rather than giving that flexibility.
So is there a good way of balancing that kind of rigour with the flexibility that developers
are inevitably going to want?
Let's pretend that the whole 10x productivity AI is giving developers is true, let's pretend that. What I mean is that platform engineers become a bigger bottleneck, and coding, generating Terraform or whatever else, won't be the constraint if AI is going to solve all this.
So it's going to be a weird sort of challenge to the static interfaces that platforms have right now, in that: my AI is dumping out a load of slop, I'm so productive, but the platform is slowing me down.
So I genuinely don't know. It'll depend on your organization and the risks, but in the era of just a torrent of slop code coming at your platform, I don't know. I am going to be doing a talk at the next Platform Engineering Day about this particular issue, so I should probably have figured it out before I do that talk.
Is there a world in which you can conceptually say what you worry about?
I'm thinking of things like Azure Policy or some of those tools where you say, look,
I don't want anyone from this team provisioning the expensive stuff or I don't want these
people to be able to deploy anything except these particular images.
And then give the platform flexibility to the developers while still enforcing your
guardrails or is that a bit of a fiction?
The trick is going to be that the clouds are just so vast and so complicated; codifying a blacklist, or a whitelist, to give that flexibility is going to be a hard problem to solve. So I think it's almost like we need a new language around this, to figure out a better way to enable flexibility while also keeping those guardrails.
There are ways around it, right, like you can do this. It's just that it's going to be a very, very, very long blacklist, or it's going to be a whitelist, essentially.
So I don't know.
I don't see any tools good enough yet.
I think rather than trying to come up with the perfect set of rules because that will
always be a moving target, I think you really need to solve this at the platform level with
platform engineering.
What does that mean?
That's so vague.
I know.
That means, for example with my earlier issue, right: it was a shared cluster, a multi-tenant cluster, so multiple people were running in different namespaces. If I had been given or provisioned my own cluster, where I could install my own CRDs or operators or agents, whatever, then I wouldn't have had to go around it.
However, I can totally understand that it's a very different problem and skillset to manage a couple of tenants in different namespaces within a Kubernetes cluster (one, two, three, four, a small number) versus having n number of Kubernetes clusters to manage on top of that, of course with all their custom code, all their custom operators and stuff. So it's a balance.
All right.
So to quickly divert attention: instead of the worst hack you've done, what is the worst hack you've discovered?
I like this.
It takes the blame off of me.
I think, okay, I think I have some good ideas here.
How about a mission-critical system that depended on very old, like, mainframe direct database reads and writes, and they had to reboot this computer every day so it wouldn't crash. And so for a given amount of time every day, the whole system for this whole company would stop between certain hours for the mainframe to restart.
That kind of reminds me, was it Windows 98 that had that maximum uptime or something and
people had to reboot machines for that, or if I just completely made that up?
No, that was running in a lot of early Boeing aircraft, and they had to time the reboots
to flight hours, I think, so that they could reboot whatever Windows system it was between
flights, so that it didn't crash during a flight, which is slightly terrifying.
All right Shane, what's the, what's the best or worst thing you've discovered?
God, I hate how bad my memory is, because, like, back in my software engineering days, the amount of absolute trash I spotted... I think it's probably more a theme than a specific issue, but, like, permissions and PHP developers. When in doubt, just give it a chmod 777, all day every day, don't worry about it. That just means everything works, so that's definitely a theme.
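For the record, the lazy fix and a saner baseline side by side. The paths here are purely illustrative, and the web server's ownership model (e.g. a www-data user) is an assumption left as a comment since chown needs root:

```shell
# The anti-pattern: make everything work by making everything world-writable.
#   chmod -R 777 /var/www        # please don't

# A saner baseline for a web root: directories traversable by everyone,
# files readable by owner and group only. (In a real deployment you'd also
# chown files to the web server's user, e.g. www-data; omitted here.)
install -d -m 755 demo_webroot
touch demo_webroot/index.php
chmod 640 demo_webroot/index.php
stat -c '%a' demo_webroot/index.php   # prints: 640
```

The point is that 777 hands write access to every local user and process, so "everything works" includes anything malicious that lands on the box.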
There's, I swear, at the back of my brain, some really bad stuff. Maybe I'll remember at the end, but at the moment I cannot think of it, other than to say: dark times.
I particularly like it when you're in a meeting and somebody is describing what they've
done, and it becomes slowly clear that they realize just how bad their situation is,
and they have to sort of slowly realize it in front of their boss, and maybe I'm just
a bit of a sadist.
I've got a good one.
I remembered one.
I was just saying that.
I don't know if this is a hack, more a way of working that was quite interesting, which was: I joined, as a contractor, a team where they didn't really understand how Git worked, or version control. So their version control was to just email each other date-and-time-stamped zip files of the code base every so often.
So that was one of my early jobs to teach them basic version controls.
That was good.
Well, you mean you just didn't spend your time, like, manually merging their zip files together?
No, I'm not rebuilding your history from email attachments.
Oh, my God.
Jill touched this file on this date and Bill touched this file on that date.
And, oh dear, there's a merge conflict, what do I do? And presumably there's an email blame there, where you just figure that one out and send them an email.
I mean, the data is actually all there. You could probably do this, but that's not the shape of the problem.
It's entirely possible to piece that together, but there's a reason the version control
systems exists.
Yeah, yeah.
We're not just still passing code around on floppy disks now.
And I was moving from using just a standard sort of file server with duplicity for my backups
to using ZFS snapshots.
I did spend a while trying to figure out if I could somehow recreate my duplicity backup history by rolling through those backups on a ZFS dataset and somehow recreating them as snapshot history.
It was an interesting technical challenge, but in the end, I decided to just let those
things age out and move over.
Yeah.
I mean, there's some of these things that you just have to accept are gone.
There's a price you pay with these migrations, and that's probably one of them, right?
I tried to convince my wife that message history is also one of these, every time she gets a new phone, but it never quite works.
And Gary, what's the worst thing you've discovered?
I mean, I've discovered so many hacky things over the years.
And I once interviewed at a place that proudly showed me their development cluster, which was a bunch of Shuttle XPCs in an IKEA Kallax shelf in their office. I've come across production data, that developers didn't want to ask for storage to be configured for, sitting on a USB hard drive on a dev server; failover mechanisms that were ADSL connections hung off the back of firewalls, undocumented, that then had VPNs into production cloud environments.
Oh my God.
So much of this stuff that I don't want to go into detail on, but yeah, I've come across my fair share of jank over the years.
Well, Gary, I think I need to go back and listen to that list again in slow motion. So to let me do that, I think this is a good place to wrap it up.
If you've got any questions or comments or your own dirty hacks you've done or discovered,
please send those in to show@hybridcloudshow.com.
We'll be back in two weeks, until then, I've been Aaron.
I've been Gary.
I've been Sean and I've been Shane.
See you later.
