
If you're responsible for keeping the network up,
you know the pain: slow polling, missing data,
and a lot of guessing during incidents.
StatSeeker changes that.
It auto-discovers your entire network
and polls every device every 60 seconds,
giving you full-fidelity historical data when you need it most.
No sampling, no blind spots, just answers.
Start your free trial at statseeker.com/netops,
and now let's go to the show.
Welcome to Total Network Operations,
the podcast dedicated to hardworking network operators
like you who deliver all the packets.
Our mission is simple.
We want to bring out great ideas in net ops.
I'm your friendly neighborhood podcast host, Scott Robohn.
And today, we have the great pleasure of having Russ White join us.
Russ, say hi.
Hi.
Russ, what do you do?
What exactly?
What is it that you do?
I say hi.
Okay.
Do you have any plans?
You don't have anything in the background prompting you
on what to say and when to say it?
No, nothing like that.
I don't have Audrey.
Audrey is Tom's, not mine.
What do I do?
I work for Nokia.
And I don't know.
I work on hyperscale accounts and enterprise accounts
and enablement and trying to teach people data center fabrics
and stuff.
So I'm going to talk to you.
To teach people data center fabrics and stuff, just stuff.
Whatever's available.
That's the current hat.
But I'd love you to tease out.
We're going to talk about design for operations today.
We want to make sure we connect the front end of the design
life cycle to how people are actually going to use it.
You've got some interesting experience there over the years.
Why don't you just pull out a couple of those things
that you've done that show you've had a foot in the game
when it came to actually supporting the network?
Yeah.
So I think the biggest ones were,
well, of course, I was in Cisco TAC for years
and then global escalation, which was still,
I mean, I always say this,
it was still one of my favorite jobs.
Because no matter how bad the network's broken,
when you walk into it,
it's never going to get worse.
And never is a long time, my friend.
It's never going to get worse.
And you're always going to walk out looking like a hero.
It's either that or Cisco loses the account.
What other option is there?
Right.
It's not that much middle ground there.
And then let's see, after all that,
of course, I worked at LinkedIn for a while.
I worked at Verisign for a while.
And I was at Akamai for a while.
So I've actually been on the operational side as well.
And very different environments.
And those three providers in particular,
like, LinkedIn probably
didn't have the real-time criticality requirements that Akamai had.
And maybe not even Verisign.
No, I would actually say,
so at Verisign, the biggest problem we had was scale.
And of course, I was in Verisign Labs.
So it was innovation and scale.
It wasn't as much straight-up operations.
But how do you build a DNS system that gives consistent answers
across thousands of servers globally
with all the connectivity that goes into that?
And how do you build big data centers that can support
other services than DNS?
Because Verisign does other things.
At LinkedIn, it was all about speed, speed to load page.
How long does it take to get the LinkedIn app to come up?
Sure.
That was the primary determinant and cost, right?
Like Microsoft bought LinkedIn.
One of my interesting experiences at LinkedIn
was they walk in the door and they're like,
you must have hundreds of people
running this network?
No, we had less than 40,
including NREs, including SREs, design, deployment.
It was something less than 40 people.
Running a network that spanned the entire world
and had multiple data centers, nine or so,
with hundreds of thousands of ports in each data center
supporting all sorts of weird workloads.
And so, like, it was a lot about cost cutting,
making sure that we were providing service
and reducing complexity.
And then, you know, Akamai was a little bit different.
I was on the Prolexic side.
So it was much more about always up,
although I could tell you stories about that too.
But anyway, yeah.
Having a service available versus the network element
being available, those are two different things, right?
That's an important distinction.
So, yeah.
So I've been in places where I had skin in the game
for doing it.
And then even to the point of doing FRRouting open source
and having been in development at Cisco
and, you know, involved with the engineering side
of Juniper and designing protocol stuff,
one thing that I find in that culture,
on the vendor side of the network,
is there's often very little thought about,
what does this look like when you deploy it?
Yeah.
But what happens?
Great.
Okay.
So great.
I've written a protocol.
And as a very simple, stupid example,
when I was at Cisco, we had this thing called the CLI team.
It wasn't really a team.
It was, whatever it was,
a UI-something mailing list.
And anytime you, a coder,
wanted to stick something in the UI,
the CLI, you had to get approval from this mailing list.
Okay.
And so I was on this mailing list
and I was one of the people who would make comments
and offer suggestions or whatever.
And I remember when Cisco Express Forwarding came out.
And in fast switching,
it was always ip first.
Like, to turn on fast switching,
it was ip fast.
And in CEF, it was cef ip.
Right.
Which was so maddening to me
from an operational perspective,
something as simple and stupid
as the order of the commands.
Right.
Because at two o'clock in the morning,
is the protocol first?
I don't know.
Yeah.
No, I'm going to argue it's not.
Not at 2 a.m.
Whether it's a regular maintenance window
or, you know, recovering from an outage,
attempting to recover from an outage.
And there are, I'm sure,
multiple places to put blame on things like this.
Yeah.
When you're a company the size of Cisco, right?
And you've had,
you have multiple trains of IOS
and probably different developers associated
with different things in those different trains.
It is easy to have drift
and, like, not have command line consistency.
I'm not trying to give anybody an excuse here,
but how would you attack that at a Cisco?
Like, how do you make sure,
how do you make the UI consistent across the platforms
and across the trains?
Is that a pipe dream?
I mean, what do you think?
Yeah, I think it's a pipe dream,
but I think there is a way to do it,
which is to define a grammar.
Okay.
Not a dictionary.
And that's what people do,
is they define a dictionary first.
That's actually not the problem.
The problem is grammar.
Which is?
It is to, even if it's arbitrary,
Right.
assign certain words as verbs,
or classes of words as verbs.
Sure.
And certain classes of words as nouns.
Right.
So they're used as subjects or objects,
and certain classes of words as adjectives
or adverbs.
Right.
And that sounds so stupid when you say it out loud,
but, like, it would enforce a consistency.
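The verb-noun grammar idea can be sketched in a few lines. This is a hypothetical illustration: the word classes, command names, and check are invented for the example, not any vendor's actual CLI review tooling.

```python
# Hypothetical word classes; the point is that the classes exist and are
# enforced consistently, not which specific words land in which class.
VERBS = {"show", "set", "clear", "enable"}
NOUNS = {"ip", "ipv6", "bgp", "ospf", "isis"}          # protocols as subjects/objects
MODIFIERS = {"cef", "fast-switching", "route-cache"}   # adjective-like feature words

def follows_grammar(tokens):
    """Enforce one arbitrary-but-consistent order: verb, then noun, then modifiers."""
    if not tokens or tokens[0] not in VERBS:
        return False
    if len(tokens) > 1 and tokens[1] not in NOUNS:
        return False
    return all(t in MODIFIERS for t in tokens[2:])

print(follows_grammar(["enable", "ip", "cef"]))  # True  -- noun before feature word
print(follows_grammar(["enable", "cef", "ip"]))  # False -- rejected at review time
```

With a check like this on the mailing list, the ip-fast versus cef-ip inconsistency gets caught before it ships, whichever ordering the grammar arbitrarily picks.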
Yeah.
And I think it's hard when you're trying to keep up with features.
Anyway, you know,
like, it takes time and effort to pause
and ask, what is the grammar we're defining
across the product line?
Yeah.
I think even when you have consistency,
it's a great idea, by the way.
You know, you should position yourself as an advisor
to the next startup that wants to have a good grammar
for their CLI.
Even when there is consistency,
there is stuff that's consistently hard.
And I'll throw a couple of things at you.
And I'll, you know,
this is not to, you know,
just gripe about IOS, right?
There's some Junos things that have always bothered me.
You know, BGP configuration,
community strings, extended community strings.
And having used regexes to parse all that.
Like, on the one hand,
it's a great way to show your network wizardry skills, right?
And having mastered regex,
that was a big deal.
Like, people who moved to Junos BGP,
it's like, I know how to use regexes
to parse these community strings.
While that might be technically interesting,
it does reduce the potential user community.
Oh, yes.
And you're forcing people to think exclusively numerically
when there is semantic content associated
with certain community strings, right?
We probably could have done a better job of that.
And again, also not just a Junos problem, right?
I think every, yeah.
Every router OS, yeah, yeah.
Yeah, uses regexes.
And they all use it a little bit differently,
which is kind of what's maddening as well.
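To make the regex point concrete, here's a hedged sketch of the kind of thing operators end up writing. The community values and the "convention" they encode are invented for illustration; real deployments each have their own.

```python
import re

# On the wire a community is just ASN:value; any meaning lives in a convention
# that exists only in someone's head or a wiki, not in the configuration.
LOW_PREF = re.compile(r"^64512:1\d\d$")  # hypothetical "set lower local-pref" range

received = ["64512:100", "64512:250", "65000:100"]
matched = [c for c in received if LOW_PREF.match(c)]
print(matched)  # ['64512:100']
```

The match works, but nothing in the pattern says *why* 64512:100 through 64512:199 are special, which is exactly the semantic gap being described.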
Sure.
And kind of like reinforcing the,
I have to be a CCIE,
I have to be a JNCIE to do this proficiently
on this platform or set of platforms.
The other example that really made my head hurt
was very early IPsec config in Junos on the AS PICs.
There's nothing logical about that for an end user.
It was definitely a perspective from a developer
who was working with whatever constructs they had,
that, again, was a brain-twisting exercise,
not fun.
So.
Yeah.
Yeah.
And I think part of it is that as a coder,
the way you built your code comes out
in the user interface intentionally or unintentionally.
Right.
And it's easier to write the user interface
to match what you've coded.
Sure.
than it is to think, it's two o'clock in the morning,
how is this operator going to deal with this?
Yeah.
That's a real, real issue.
I have always wondered about,
you know, what can vendors do better
to bring the end user operator perspective
all the way into, you know,
now that's the UI and grammar,
but even before, you know,
that from a protocol architecture
or packet processing architecture perspective,
you know, do I have the end user in mind?
Can you shed any light here? Like,
have you seen good things happen,
either in your operator gigs or your Cisco gig,
where you've seen people try to bridge that gap?
Unfortunately, I don't know how to do it.
Okay.
I wish I did.
I don't know how to do it either.
Like, what do you do?
Do you?
I don't know.
I actually don't know the answer to this
because I don't know how you,
some of its arbitrary, right?
Again, what's a verb and what's a noun?
Sure.
Totally arbitrary.
You just made a decision.
Right.
Who knows if that's the right decision or not?
But it's a decision that you made, right?
Sure.
So I don't know how you get around that to some degree.
But on the other side,
there's a level of complexity that you have to think about
where you're like,
why did you do it that way?
You were talking about it with the regexes, right?
Right.
Okay.
I know why that happened.
Sure.
I understand why it happened,
because we needed some way to deal with these community strings,
which are all numerical.
And somebody looked around the open source repositories
and said, oh, look, there's a regex engine that I can use.
I don't have to write a new thing.
Sure.
Just reuse it.
Yeah.
Just reuse it.
So yeah.
But does that mean in the long run that it's better?
Not particularly.
It solves a near-term issue.
Yes.
Without thinking about the long term impacts of that.
Right.
I will tip the hat to what I saw
with the Distinguished Engineer program at Juniper.
And for other listeners,
they may have just heard an episode I did with Kireeti Kompella on this.
So I apologize for any overlap here.
But we did have efforts that started and slowed down
and started and slowed down, where developers,
usually DEs,
but also less experienced developers,
really wanted to talk to end customers and understand,
how is this going to be used?
We've had a separate conversation on what we like
and don't like about BGP address families.
How is the end user actually going to use it,
which often has nothing to do with what the developer has in mind?
Yeah.
Talk me through what deployment looks like.
If I gave you this feature,
explain to me internally what your process looks like
to deploy that feature for a customer.
Right.
And if you get five different operators
and you do the same thing,
you ask them the same question.
They're going to come up with five different answers.
Correct.
But you can like mesh together
a rational way to make it work.
And I think that's crucially important.
But now UI is really only one aspect of this design
for operations, right?
Yep.
I mean, UI is a big deal.
But I think about other things like,
okay, when I talk to people about aggregation and summarization,
and I think about breaking a network up into failure domains,
most people think of
breaking these things up as a matter of performance,
making the network converge.
They don't think about it in terms of troubleshooting,
or in terms of just making your configuration logically consistent
from module to module.
That's more important than your performance.
Trust me.
Well, I would even,
I'd allow for the why not both answer, right?
Yeah, of course.
Of course.
You know, I want it all,
and I want it now.
Yes.
But like in financial services, for example,
or trading, you know, stock trading environments, right?
Both are super important, right?
I need low latency and performance.
I also need to make it really hard to screw this up.
So right.
So, at two o'clock in the morning,
we had a thing back in TAC,
the two o'clock in the morning rule of thumb.
If you can't explain it at two o'clock in the morning
to somebody whose main language,
their primary language, is different than yours.
Right.
You both speak the same language fine.
But one of you primarily speaks Chinese,
and the other of you primarily speaks British English, sure.
Sure.
Or whatever it is.
If you can't bridge that communication divide
at two o'clock in the morning when the network is down,
really think about the way you're building your network.
Yeah.
Like, really think about it,
because that's a problem.
So turn that into a practical example,
if you can.
What's a bad way that you saw made better
to address the issue you just raised?
Well, I think the classical ones are things,
mostly around access lists and filtering of various kinds.
Sure.
Right?
Easiest way to break the network.
That's right.
Easiest way to break the network.
And you know, you see an access list with a thousand lines.
And you're like, well, could you have done that in five lines?
Ooh.
Like, probably you could.
Honestly, you probably could have.
Or you get into configuration situations,
particularly with route maps and stuff,
where they have these very complex,
if-then clauses that are set up.
Yeah.
And you're like, well, why?
Who understands how that works?
Right.
And then there's all sorts of magic numbers people stuff
in their configurations.
Oh, if it's community string one-two-three-four,
that means this.
If it's one-two-three-five, that means this.
Right.
No, no, stop that.
Like, just stop.
Well, that's a case of,
and that's kind of what I was hinting at earlier, right?
You know, if you have, if all you have is a hammer,
everything looks like a nail, right?
Yeah.
So you're going to use numeric values
to carry other semantic content.
Yeah.
A great example, by the way, of this type of thing,
of this thinking, is if you're redistributing,
we don't redistribute protocols a lot anymore.
We just don't.
Right.
But you used to redistribute OSPF
into EIGRP, or EIGRP into OSPF, or whatever.
And people would build their redistribution points,
and they would build these massive access lists
of all the things that, like,
this originated in EIGRP,
don't redistribute it back into EIGRP.
Right.
Okay.
Now you have this massive access list.
And two o'clock in the morning,
you're trying to figure out,
is that access list correct?
Well, why not just?
Where's the time?
Where's the time to evaluate this?
Yes, right.
Exactly.
Why not just tag the route?
Yeah.
Right?
Much much simpler.
Yep.
Much much simpler solution to the same problem.
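A rough way to see why the tag wins, sketched in Python rather than router config. The tag value, route structure, and function names are invented for illustration; the point is one check against one value versus a list you have to keep enumerating.

```python
import ipaddress

EIGRP_TAG = 100  # hypothetical tag stamped on routes at the redistribution point

def originated_in_eigrp_acl(prefix, eigrp_prefixes):
    """The big-ACL approach: enumerate every EIGRP-originated prefix and scan."""
    net = ipaddress.ip_network(prefix)
    return any(net == p for p in eigrp_prefixes)

def originated_in_eigrp_tag(route):
    """The tag approach: one check against one value, nothing to enumerate."""
    return route.get("tag") == EIGRP_TAG

# The ACL has to list (and keep listing) every prefix...
acl = [ipaddress.ip_network("10.1.0.0/16"), ipaddress.ip_network("10.2.0.0/16")]
assert originated_in_eigrp_acl("10.1.0.0/16", acl)

# ...while the tag travels with the route itself, new prefixes included.
assert originated_in_eigrp_tag({"prefix": "10.9.0.0/16", "tag": 100})
```

At two in the morning, verifying "tag equals 100" is a one-line question; verifying a thousand-line ACL is an archaeology project.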
AS-SETs coming off of,
coming off of IRRs.
We just had this long discussion about AS-SETs,
and various people have been really crazy about this.
Yeah.
Same sort of a thing.
It's simple to, like,
create a one-line thing that says,
implement this AS-SET from this IRR.
But you're hiding a lot of crap in there.
Right.
There's a lot of complexity in there that's like,
I don't know if that was a good idea.
For sure.
I think there are other artifacts too.
Well, before I hit that,
you know, the ACL growth problem,
and I have to say firewall filter
to include my Junos friends instead of ACLs.
And boy, I loved,
loved the named-term capability
in Junos firewall filters to move things around.
And I know,
I know Cisco eventually got robust
reordering capability in ACLs.
But, you know, they just grow, right?
And you can't,
you have months or years of add an entry to the ACL,
add an entry to the ACL,
with little to no thought of removal,
which is, I think,
I think, how you get the thousand-line ACL, versus,
let's refactor this,
just like we refactor code, right?
Yeah.
To say,
five lines would do this much more succinctly,
and maybe even cut down,
you know, reduce my TCAM usage, right?
Well, and by the way,
this is where tech debt really is, right?
Yep.
A lot of people think tech debt means a legacy.
And that's really not to me
what I think of when I think of tech debt.
I think it's one example.
That's one case of tech debt, yeah.
Yeah.
So to me,
if I want to express legacy,
I'm just going to say it's legacy.
Like,
I don't need any word for that.
What tech debt actually means
is the differential between the way I think the network works
and the way it actually works.
And if I have this 5,000-line ACL,
right?
Who understands that?
Yeah.
And I know it's scary to pull it apart.
Yeah.
I got that.
But you know what?
That's tech debt.
Nobody's going to understand that at two o'clock in the morning.
Yeah.
Just, you know,
take the time,
tear it apart,
figure out what it's doing,
and fix it.
Because that,
that's going to kill you.
That's going to absolutely kill you.
Take it offline and reoptimize it.
Yeah.
Right.
Yeah, of course.
And there are so many ways to do that today, too,
without impacting production.
I mean, Containerlab.
Yeah.
Right.
I think it's probably one of the, you know,
it's getting increasing adoption for stuff just like this.
So I can have a test environment and a dev environment
before I push it into production.
Yeah.
Exactly.
And yeah.
Think about it when you're building this stuff.
If I'm adding the 10th access list entry this week,
maybe I just need to redesign things.
Yeah.
Maybe this just is not the right design.
Yeah.
I mean, am I using the right tool?
The right tool for what I'm trying to accomplish.
That's right.
And like for your,
for your, you know,
route-tagging example versus another ACL entry.
Yeah.
So we're going back to design and thinking about, like, modularity
and being able to make the design repeatable.
That's a huge deal in designing things for simplicity
and thinking about the operability of the network.
Not just the ability to operate the network
in the sense of me understanding it,
but also automation.
Sure.
How can I automate it?
If I've built everything as a snowflake,
now I can't automate it.
And this actually drives into intergenerational stuff.
How many generations of each module type do I have in my network?
Like every one of those generations needs to have automation built for it.
Like that.
And that's, that's horrible.
Right.
Yeah.
I do custom automation for different SKUs.
Yeah.
And again, the bigger companies have more problems with that.
Right.
The more SKUs I have, the higher the risk there is of stuff like that happening.
Yeah.
But yeah, you don't have to convince me on, you know, simplicity,
you know, pull out Occam's razor and scrape away, hack away at,
you know, all the variants that you think you need but you really don't.
And you get to elegant engineering via the simplest design that meets all your requirements.
So.
Right.
Yeah.
Remember the old saying about protocols, right?
A protocol's not done until you've removed everything that doesn't need to be there.
That's right.
Same principle.
Same for your network design.
Yeah.
Pull it out until it's done.
So when you think about where you divide networks up, aggregation and summarization,
vertical separation or vertical modularization via overlays, all of that stuff,
you should be thinking about not just, where do I break it for performance,
but also, where do I break it to make it as repeatable as possible?
And then, where do I break it to make it where I can troubleshoot it?
Right.
Like, we often miss the troubleshooting bit.
And where do I break it to find the security breach the fastest?
Right.
Questions we don't ask.
Let's talk about one of the hardest parts of network operations, proving what actually happened.
During an incident, you're under pressure.
But most monitoring tools sample data or pull too slowly.
So when you get back to investigate, the detail just isn't there.
That's where StatSeeker stands out.
It automatically discovers your environment and polls every device every 60 seconds across your entire network.
Not just critical interfaces, everything.
And it stores that data long-term, so whether you're troubleshooting something from five minutes ago or five months ago,
you've got full fidelity history to work with.
Engineers use it to detect issues faster, prove root cause, and prevent repeat incidents.
If you want to go and see how it works in your own environment, you can run it for yourself with full access for free.
No sales calls, no credit cards, just you and your network.
Go to statseeker.com/netops and start a free trial.
Thank you, StatSeeker.
And let's get back to total network operations.
It takes a long time in your career to get beyond the, I just want to get this to work
so they'll get off my back and I can move on to the next thing.
And maybe to slow down, breathe a little bit, and say, okay, what's the better, longer-term, you know, way to implement X or Y or Z, as it were.
So yeah, exactly.
Let me poke at this.
Other artifact-type things, and the best example I can think of,
and again, this will harken back to a recent IS-IS versus OSPF discussion that you may have had.
And I'm an IS-IS fan, right?
I think, dealing with service providers, and I don't know if this is a true blanket statement,
I'm just going to make this assertion,
it seems like the approach to TLVs in IS-IS, and the implementation, the scale you needed for service providers,
always seemed to show up in IS-IS first as an IGP and then OSPF later.
That's kind of my experience.
My experience as well.
Okay.
So, not just a Junos thing, but I still have this stupid NSAP address that I need to configure.
It has nothing to do with anything else in the network.
Well, how come we've never gotten rid of that, Russ?
Oh, my.
I think it's, I don't know.
It's like, why do we, why do we still use a 32-bit address for the OSPF router ID, even in OSPFv3?
Sure.
Like, it's v6.
Like, why isn't it a v6 address?
It's v6-capable.
It's v6-capable.
But why didn't they just use a v6 address for the router ID? That's never made sense to me.
I think partially it's just habit.
And honestly, I like it, in a sense, because one of the things I struggle with when troubleshooting OSPF
is dividing out router IDs from reachable destinations.
Oh, sure.
So you automatically have a non-IP router ID.
Yeah.
All right.
10 points to the house of white.
There you go.
I do a show isis database.
And I automatically know that's a reachable destination.
And that's a router ID.
In OSPFv2, I look at it and I'm like, ah, yeah.
What's what?
10.1.1.1.
What does that mean?
Right.
Yeah, that's fair.
And maybe that's unintentional, or serendipitous.
Oh, it's serendipitous.
It's not intentional at all.
Yeah, it's not intentional.
It's just the fact that when they started adding protocols to IS-IS.
I mean, we think of IS-IS as the CLNS stack.
Right.
But, you know, there's ISO-IGRP.
And there were ISO versions of IS-IS.
There were, like, all these other things that you get into.
And you're like, oh, yeah, it was designed to be a multi-protocol routing protocol.
And you just don't want to mess around with the NSAP.
I mean, the NSAP is just the NSAP.
Sure.
Sure.
And again, defaults to, you know, hey, that's my router ID.
And that's the only time I ever need to care about that.
Yeah.
But I do think the bias, or lack of bias, in IS-IS.
Yeah, it wasn't IPv4 biased.
Right.
Yeah.
And I can just define a TLV.
And then the approach to, if I understand this type, the T in TLV, I can process it.
And if it's a type I don't understand, I can just pass it on.
Right.
Right.
Transparently.
Whereas OSPF kind of made the opposite design decision.
That's like, if this is an LSA type I don't understand,
I'm stopping it.
I'm dropping it right here.
So.
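The skip-unknown-TLV behavior Russ describes can be sketched like this. It's a toy parser, not the actual IS-IS encoding (real TLVs have registered codepoints and more structure); the type numbers and handler are invented for the example.

```python
def parse_tlvs(data, handlers):
    """Walk type-length-value records; handle known types, step over unknown ones."""
    results = []
    i = 0
    while i + 2 <= len(data):
        tlv_type, length = data[i], data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if tlv_type in handlers:
            results.append(handlers[tlv_type](value))
        # Unknown type: don't reject the whole PDU, just skip this record.
        i += 2 + length
    return results

# Type 1 = "hostname" in this toy scheme; type 99 is deliberately unknown.
handlers = {1: lambda v: ("hostname", v.decode())}
pdu = bytes([1, 2]) + b"r1" + bytes([99, 3]) + b"xyz" + bytes([1, 2]) + b"r2"
print(parse_tlvs(pdu, handlers))  # [('hostname', 'r1'), ('hostname', 'r2')]
```

The length field is what makes this work: even a parser that has never seen type 99 knows exactly how many bytes to step over, which is the opposite of the drop-unknown-LSA decision.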
And actually, I find OSPF harder to troubleshoot because of that.
Yeah.
In many senses, right?
Yep.
And so this is an example of designing a protocol for operations.
Right.
Making it easier to operate the network.
Like, okay, go back to BGP.
I find BGP very hard to troubleshoot nowadays.
Just because there's so much stuff.
Sure.
It's all carried the same way.
It's all carried.
And it's like, okay, I have to do a show bgp vrf 9846 blue, or whatever it is.
And you're like, right.
Why would I ever remember this?
Like, why should I remember this?
Like, there's so much going on here.
And it actually makes it hard for me to even do peering relationships nowadays.
Oh, you negotiated 15 things.
Oh, well, which of those were required to match, and which of those aren't?
Like, nobody knows that list.
Not nobody.
I know.
I didn't even know that you could have like, so this is, this is a revelation to me.
Right.
I can have an established peering relationship, but only certain address families not be working?
Is that what you're telling me? Or a capability not working?
Yeah.
Some capabilities have to match and some don't necessarily have to match.
Sometimes you drop the session.
Okay.
Yeah.
Not a good, rational decision.
No.
Hard to know which is which, right?
I mean, and even between protocols, like, okay.
BGP is quite intelligent around the keepalive.
Right.
You use the other person's keepalive, right?
OSPF, if the keepalives don't match, what do you do?
Right.
Oh, sorry.
I can't form an adjacency with you.
But why no adjacency?
Why?
Why?
Right.
Well, if I remember, like, one of the big early Junos issues was that.
I think it was 90 seconds versus 30 seconds.
And I think Cisco had the longer keepalive time and Junos had the shorter.
And if you went with the shorter one, the relationship is going to time out,
because the longer one, you know, isn't as responsive as you want it to be.
So that was always one of the first things we had to make sure we configured to match,
you know, if it was Cisco-Juniper peering stuff.
But they should just peer anyway.
They should just form an adjacency.
And you should just say, oh, that neighbor wants me to send a hello every 30 seconds.
That one over there wants me to send in a hello every 120 seconds.
And I want you to send me one every 200 seconds.
Like, I shouldn't care that they match.
Or they don't.
Or, you know, who needs to tune that?
Like, if we could just agree on 30 seconds, keep alive.
Right?
Well, okay.
So the problem there is different link types.
Sure.
And so, you know, if I'm running a T1, I don't want a 30-second hello.
I want 120 seconds.
Sure.
If I'm running 10 gig, I might want five seconds.
Now, could you say, if the bandwidth is X, set it to Y?
Yeah, you could.
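The "if the bandwidth is X, set it to Y" idea is just a defaulting function. The thresholds and values here are invented examples, not any protocol's actual defaults; the point is that the operator tunes nothing unless they want to.

```python
def default_hello_seconds(bandwidth_bps):
    """Pick a hello interval from link speed; thresholds are invented examples."""
    if bandwidth_bps >= 10_000_000_000:  # 10 Gbps and faster
        return 5
    if bandwidth_bps >= 100_000_000:     # 100 Mbps up to 10 Gbps
        return 30
    return 120                           # slow links, e.g. a T1

print(default_hello_seconds(1_544_000))       # 120 (T1)
print(default_hello_seconds(10_000_000_000))  # 5
```

Derive the timer from something the router already knows, and mismatched hellos stop being one of the first things you check at two in the morning.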
But yeah, I don't know.
Throwing legacy at me.
All those OSPF adjacencies being formed over T1s today, Russ.
I don't know.
Maybe not in the West.
Maybe not in the West.
Not in the West.
That's right.
Right, right.
Token bus OSPF adjacencies.
Let me, let's twist this a little different direction.
And if we believe that automation is good and should be driven,
then we're going to increasingly interact with the infrastructure programmatically.
Is this an issue?
Or is it, Scott, you should just drop it, we're programming all this stuff over APIs now anyway?
What are you saying?
No, no, I don't think so.
I think it's still important.
Because somebody still has to be able to look at it and know.
Again, 2 a.m.
Yeah, 2 a.m.
Or even when you're sitting there telling your AI system,
configure this for me.
Right.
Did it do it correctly?
I don't know.
Is my single source of truth correct?
Even in a full automation system?
I have a single source of truth.
Great.
Is it right?
You know, my source of truth? It's the network.
Yeah.
Apologies to all my source-of-truth
friends out there.
But right.
I mean, how do you know it's right?
Yeah.
And if I have to look at all these timers every time I look at a network,
think about how much of my troubleshooting time I'm spending just checking things.
Right.
It really, you know, I should be troubleshooting the real problem.
Sure.
And I'm with you like it's going to be really hard to pry away the command line
for those critical outage situations.
Yeah.
Right.
I think that's going to be with us for a very long time.
I do think programmatic access and interaction for rollouts, moves, adds, and changes,
I think that's good.
That's a right direction.
I think you can still screw things up in your Python, right?
Yeah.
We can still cause problems.
But it's funny.
And this might be a semi controversial statement.
We talk about getting away from the CLI for net ops.
If I'm programming stuff, haven't I just moved
which CLI I'm interacting with?
Yeah.
You know, I'm writing Python instead of, you know, a show system tech support
or, you know, router BGP.
Even systems with advanced GUIs and advanced user interfaces,
there's always still a CLI mode you can drop to for faster troubleshooting
and to get the detail that's hard to get otherwise.
I mean, even a car like, oh, I plug in my thing and I have it on my phone.
Oh, look, it tells me that the right headlight is out.
Okay.
Right.
There is a CLI interface to that car that, if you're an advanced user,
and you're troubleshooting something that goes beyond the right headlight,
there's a way to get in there and mess with it.
And so all of these systems still have all those CLIs.
I don't know that we'll ever actually get rid of them until we get to the point, like Star Trek.
Oh, computer.
Computer.
But, you know, and I think if you want to do the elder chief engineer Montgomery Scott,
you know, computer.
Yeah.
And whichever Star Trek movie that was, right?
Yeah.
So I am bullish on natural language interfaces.
And I do think this is something that LLMs are actually really good at.
And if I talk about intent based networking, Russ.
Yeah, it's an LLM, isn't it?
I can express my intent.
I'm expressing my intent to you right now, right?
Mm-hmm.
I think it's a good thing.
There's certainly stuff that needs to be trued up, and they'll get better over time.
But I think it's a step in the right direction.
Yeah.
And when I look at products that let me talk or type in English,
but also show me the specific commands that are being generated and say,
is that right?
Yeah, I think that's a great intermediary step.
Right.
And where I can, I can QA, you know, is my intent being understood correctly?
So.
Yeah.
Oh, yeah, definitely without a doubt.
Yeah.
So back to the beginning, design for operations.
Please design for operations.
Think about every design decision you make.
Do I use a single overlay and underlay protocol in my data center fabric?
Well, it's not that it's right or wrong.
It's a matter of, does my NOC know two protocols?
And then, is it easier for me to troubleshoot a network
that's two protocols instead of one?
And for some people, it's easier to troubleshoot a network that has one protocol.
For other people, it's easier to troubleshoot two.
At certain scales, it's different answers. Different scales, different answers,
with different modularization systems.
So you need to factor all that in.
You don't need to just say categorically, no, we're only ever using BGP on our data center fabrics.
We'll never use IS-IS or OSPF as an underlay protocol.
Well, okay, but you're limiting yourself.
Yep.
And that's not good.
And when you think about aggregation, where do I aggregate?
Where do I summarize?
Well, not only what's going to make the network scale,
but what's going to make the network one you can troubleshoot?
Sure.
What's going to make the network where you've built modules that are repeatable?
Like, ask those questions, right?
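The mechanical side of the summarization question above can be shown with Python's standard library. The prefixes here are made-up pod addressing, but the collapse is what a summary at an aggregation point does: many specific routes become one advertisement.

```python
import ipaddress

# Four contiguous /24s inside one (hypothetical) fabric pod.
pod_prefixes = [
    ipaddress.ip_network("10.1.0.0/24"),
    ipaddress.ip_network("10.1.1.0/24"),
    ipaddress.ip_network("10.1.2.0/24"),
    ipaddress.ip_network("10.1.3.0/24"),
]

# Summarizing at the pod boundary: the rest of the fabric sees one /22
# instead of four /24s, which shrinks tables and hides pod-internal churn.
summary = list(ipaddress.collapse_addresses(pod_prefixes))
# summary == [IPv4Network('10.1.0.0/22')]
```

The troubleshooting trade-off is exactly here: the /22 hides which /24 flapped, so where you place that boundary decides what the NOC can and cannot see from outside the pod.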
I know that my friend, Chris Romeo, hates the idea.
He hates the, the terminology shift left.
But that's actually what you're doing, right?
You're shifting left.
Why don't you explain that for listeners who may not have been exposed to that before?
Okay.
So shift left just means moving things earlier in the design cycle.
If I start out with requirements gathering and intent, then I go to design,
then I go to deployment, then I go to operations.
If I leave thinking about troubleshooting until I get to operations,
or leave it too limited until I get to operations,
I've really designed the network poorly for doing those things.
Sure.
Yeah, shift left all the way up to the requirements phase.
That's exactly right.
Okay.
As hard as you can.
That's what that terminology means.
Now in an agile environment, whatever agile means to you,
that might be more problematic to say shift left, right?
Because it's supposed to be iterative.
Sure.
But nonetheless, taking it, I mean, think about it, you know: include operations.
Put someone, not your senior operations person, on your design team.
Sure.
Exactly.
I know.
I hear where you're going.
Yep.
Yeah.
The person who's really going to be troubleshooting it,
let them come in and hang out with you when you're designing stuff
and let them ask questions.
Let them ask questions.
I encourage them to ask questions.
That's right.
And even, even task them with, if you don't understand this, question it.
That's right.
Because that's going to impact the 2am events.
That's right.
That's right.
That's exactly right.
That's at the customer implementation side.
If I were to look at a similar shift-left motion within the vendor,
I think it comes down to communication all the way from architecture and design
to how the end customer is going to use it, or how those five different end customers
are going to use it five different ways. That sounds like a hard problem.
How do I talk to every customer that might want to use this?
You don't have to talk to every customer.
You've got great in-house resources, and we've been them.
You know, talk to your TAC engineers.
Yep.
Talk to your pro services people.
Talk to your resident engineers.
Because they're going to see exactly, they're a great proxy for an aggregation of stuff
from multiple end customers.
Right.
And hash it out.
Like even, you know, let people play with it internally before you call that the next release.
And this has always been one of the problems of the IETF and standards bodies: they are dominated by vendors
and researchers, and not operators.
And so, unfortunately, sometimes, and I shouldn't try to quantify how often this happens,
Protocol designs come out that are really kind of not operable.
Like maybe that was not the best design decision.
IPv6.
There's been things that have come out in OSPF and BGP and RADIUS and other places where you're looking at it going,
seriously, did anybody ever think about how that's going to be deployed?
Really?
Yeah.
I know that, you know, Mahesh Jethanandani of Arrcus, and I forget the name of the group he's responsible for in the IETF now.
But he's actively going after operator input.
Okay.
He seems like he's trying to get his, he understands the problem set.
That's not an issue at all.
Just how do you incentivize the operators to provide relevant input, provide ways to provide it easily,
and close that feedback loop, rather than that open-ended code-it-and-ship-it, as it were.
Oh, yeah.
Very, very important.
And it is very hard, from an IETF perspective, to get people to do those kinds of things.
Yeah.
Before we wrap up, I do want to circle back to TAC roles and being part of this process to help really drive design with implementation in mind.
You made the comment that being a TAC engineer was one of your favorite roles ever.
I would love to hear more about that.
Global escalation.
Global escalation.
Okay.
More so than TAC even.
Yeah.
But go ahead.
Why, why did you love that so much?
What, what appealed to you about it?
Always a different problem, always a hard problem.
Always had the resources necessary to get the job done.
Regardless, always had good backup, access to the source code, whatever was needed to get the job done.
Right.
Because the customer was on the, on the verge of leaving the company as a customer almost in all cases.
And also, you just see how things break.
You really learn very quickly how things break.
And in fact, the first book I wrote for Cisco Press was Advanced IP Network Design.
And basically they sent an email and they said,
Halabi's book has been published.
We need a follow-on.
We don't have any authors offering to write a second book.
Both of them are back there over this shoulder,
behind the virtual partition.
You can see the edge of the Cisco Press side of my library right there.
So, yeah.
Don and Alvaro and I looked at each other and said,
well, we really don't think we know how to design networks very well,
but we do know how they break.
Sure.
So let's write a book that just says the opposite of everything we see break.
Okay.
So I point out that there's more than one way to be wrong here.
Is that relevant?
Well, I mean, but on the global escalation,
do you see all of them?
Yeah.
What was the, what was the progression?
Like how long were you in, how much time did you have to put in to actually get into global escalation?
Oh, I think I was in TAC for two or three years before I went into global escalation.
And I was only in global escalation for a couple of years before I went into DNA,
deployment architecture, which was like half and half:
one foot as a coder and one foot in a kind of escalation slash
super sales kind of role.
Sure.
Sales escalations came back into our team to cover particular things, basically.
Well, I'm not sure I want to call being a TAC engineer
one of my favorite roles.
And I was never an escalation guy.
But I intentionally sought a TAC role.
I was a trainer.
I had been a Cisco trainer as a partner,
at Chesapeake Computer Consultants, Terry Slattery's company,
and that whole team went to Juniper. I went as a trainer,
which I loved, but I hated the travel.
My children were young at the time.
So I traded the travel for two AM sessions on a regular basis.
But it was an education for me.
You know, being a trainer, you know, how things are supposed to work.
And you make sure the labs work.
And the latest upgrade didn't break the labs.
But in TAC, you see how things really work.
And in fact, I don't know how it is now, because I haven't been in TAC in 10 years,
15 years, whatever it is.
But people always ask me, like, how do I get into network engineering?
How do I actually get to that point where I'm really learning? Start in TAC.
Find a TAC job, whether it's at Arista or Juniper or Nokia or Cisco or Huawei.
I don't really care.
Find a TAC job.
And you will learn so fast because you have to,
because everything breaks all the time.
Yep.
And what a stepping point to go into an escalation role,
right, or move into pro services or resident engineering,
or move into SE roles, like some of the best SE hires I ever made
came straight out of the TAC organization.
Yeah.
And I always felt like that kept me grounded.
Right.
You know, five years of pain with some large international customers.
It was an incredible learning experience for me.
And I'll say the escalation team,
like they were regarded as, you know, minor deities, right?
They got there for a reason, right?
And have seen almost all of it before.
And, you know, I could call out names, but I won't embarrass people.
So I won't do that.
But like some of the, some of those escalation engineers taught me
some of the best lessons of how things actually are implemented and work.
Yeah.
Totally good use of my time.
So.
Oh, yeah, definitely.
I have, I have a two AM story that I want to share.
And I want to hear one of your best two AM stories.
I was in the middle of an outage with a large internet carrier
that you would know the name of.
And I'm pretty sure I took BGP down.
I think it was my fault.
And I had gotten one of our developers,
one of our BGP developers on the, on the bridge at, you know,
was probably two or three Eastern, you know, 11 or 12 Pacific time.
That's where he was.
And I was flailing and I was just trying different stuff.
And this guy, maybe a little snarkily, said, you know, Scott,
your lack of sleep is impeding your performance.
You should probably just get off the bridge and go to bed.
And I did not appreciate it at the time, but it was probably the right call.
So, you know, you live and you learn.
Yeah.
What about you?
What's your favorite?
I have many, many TAC stories.
You could say that. We can get to the others later,
but give me one good one.
Yeah.
So, I remember once that I was on with a large insurance company
and they were having EIGRP problems.
And so, of course, I got called because, you know,
Don Slice and I were the EIGRP people at Cisco.
Those two words always go together, by the way, EIGRP and problems.
Yeah.
Well, okay.
I don't know.
It was pretty good towards the end before people started really,
whatever.
And so, they were having tons and tons of SIAs.
That turned out to be a memory fragmentation problem in some 7200s.
Is it terrible that I remember this?
And 7500s in the network core.
Anyway.
Oh, good.
Yep.
This particular vice president kept breaking into the call,
like every 15 minutes, screaming and cussing at me, the
TAC engineer, and his own engineers.
And he was going on and on.
If you can't fix this problem, I'm going to rip it all out.
Blah, blah, blah.
He just kept going on and on and on.
And finally, it was probably two o'clock in the morning.
I don't remember.
I finally said, if you can't leave us alone to get this problem fixed,
we're never going to fix it.
And he just kept ranting.
And finally, I said, I said, your network sucks.
And I'm switching insurance companies and I hung up.
Bold move, my friend.
And the account team called me
and said, I have been waiting for you to do that for about 12 hours.
And you know what?
We got back on the call about 30 minutes later.
And that vice president never broke into the call again.
And we got the problem fixed.
Got the message.
Yeah.
Um, how calculated was that, or was it just off the cuff?
Did you hit your limit?
Off the cuff.
I was at my limit.
I was done with it.
You know, I've done more calculated things. Like, we had a small
sort of provider someplace that had like two 7500s.
A tiny, tiny mom-and-pop shop.
Sure.
Run, you know, run out of some little convenience store or whatever it was.
And so there was one network engineer.
And he was part time, I believe.
And then there was like the owner of the place.
And the owner kept breaking into the call and saying, I really need this fix.
My customers are calling me.
They're complaining and blah blah.
We kept going and going and going.
And finally, now this is how old this problem was.
I said, do you have an AOL account?
America Online.
And he said, no, but I think I saw a disc laying around on the floor here someplace.
Okay.
And I was like, can you sign up for AOL?
And I'm going to give you, I'm going to give you directions about how to download an
image that we need for this router.
And that would be really helpful if you would do that for us.
And so he went off, found the AOL disc, signed up for AOL over a dial-up modem.
And I went on Cisco Connection Online and found the largest Cisco 7500 image I could find.
You're evil, man.
Please download this.
And you know what, it was peaceful for about an hour.
And his network engineer and I fixed the problem while he was downloading that image via AOL.
You realize this is a podcast and this will be distributed to many listeners, right?
You do understand it's not just us, right?
And that was not me at the end of my limit.
That was just calculated.
I was like, you have customers that are calling you complaining.
We need to fix this problem.
We're never going to fix this problem if you break into this call every five minutes
and tell me about the customers.
Like that's just, it's just not going to happen.
So we just need to like find some way to divert your attention for about an hour.
Squirrel.
Yeah.
Well, this ends another episode of Russ White's Tips and Tricks.
The Tricks and Tips.
Tips and Tricks.
Tips and Tricks.
And how to keep random sources of noise occupied so the real work can get done.
You have probably not gotten any real work done while listening to this episode.
I don't apologize for that.
We really appreciate you joining Russ.
I just want to say it's a pleasure to have you on.
Any time you want me on, as long as I have the time to do it.
Well, you know, just to tug at the heartstrings here:
I've loved what you do through other media for years.
I always learn some stuff from what you talk about.
I want you to know it's really appreciated.
Well, thank you.
What, uh, where do you want people to reach you?
What should they, what should they look for?
What's the best way to see what Russ is up to?
Uh, LinkedIn, and rule11.tech.
Got it.
You've got that down almost as good as Tom.
Almost as good as Tom.
We appreciate you tuning in to total network operations.
Love your feedback.
Um, if you don't want Russ on this, uh, pod every other episode,
let me know other ideas of people you'd like to talk to.
I, I can do, uh, I can do a very long Russ series.
Thanks again.
And, uh, send me your feedback on LinkedIn,
or you can go to packetpushers.net slash follow-up.
Um, send it whichever way you prefer.
We will see you next time on total network operations.
