Paige Cruz

Rich: Welcome to Kube Cuddle, a podcast about Kubernetes and the people who build and use it. I'm your host, Rich Burroughs. Today I'm speaking with Paige Cruz. Paige is a Senior Developer Advocate at Chronosphere. Welcome.

Paige: Thanks for having me on.

Rich: I'm, uh, excited to have you, uh, here. We have known each other for a while now. We met, uh, several years ago, uh, through DevOpsDays Portland. I was one of the organizers, and I think you spoke at least a couple of times at the conference.

Paige: Yeah, I was very ambitious in the, my early days of being an engineer. In fact, the first conference talk I gave was within, I wanna say four months of me learning

Rich: Oh,

Paige: how to do development in a corporate setting. And, um, there's no better place than DevOpsDays Portland. I have to give our, our community regional conference a plug.

Rich: Yeah, I actually stepped down as one of the organizers because I just, uh, had too much else going on and they didn't do one this last year. So I'm not sure where things are at with the conference, but I, I hope it shows up again. It's a great event and, um, uh, really awesome team to work with. I, I enjoy doing that quite a bit.

Um, so you, uh, previously were a site reliability engineer, and now you have moved into the dark side of DevRel.

Paige: Yes, I am really enjoying things over in marketing land. Um, and, and I think, especially when we're serving a developer audience, um, I, I think a lot of developers are allergic to marketing. Um, or they, there's a bit of skepticism that is, is rightful to have, um, you know, not every vendor is, you know, selling the best product or using the most above-board tactics.

And so I really see my role as combating that. Um, is being the friendly face, being, um, the unobtrusive, I'm here to share information and I'm not trying to bring you, you know, I don't want there to be any mystery about if I'm trying to sell you something or share information. And so coming on this podcast, hey, this is just coming from my own experience as an SRE contending with Kubernetes, having to teach developers.

Um, honestly, nothing to do with my, my company here other than they're paying for my time. So, um, that, that has been interesting. I'm sure you've had to sort through those feelings of being on the dark side as you say.

Rich: Yeah. I mean, I think that, um, I think that especially when, um, you know, like I'm working at a place now where I talk mainly about open source products, but that's not always been the case. Right. And I think that, especially when you're at a place where all they have is a commercial product, it gets a little, even a little harder to navigate that stuff.

But, um, but I think that a lot of people, um, are receptive if they feel that you're coming from a good place like that.

Paige: Yeah, exactly. So that's what I'm here for. If you don't know me, um, I'm here to, to bring you the good information, um, that I have hard-won through my years of experience and, um, I'll let you know when it's an ad.

Rich: Yeah, absolutely. Um, so, uh, wanted to get into your background a little bit. Um, how is it that you got started off in computing?

Paige: Oh my gosh. Okay. So, uh, I'll say I'm a millennial, like full stop. So technology was introduced to my life, um, not in like a corporate setting or not for spreadsheets and efficiency, but I was introduced to like, AOL Instant Messenger and MySpace and I was playing PC games. And so computers to me have always been sort of a portal into creativity and kind of extending the physical world around me.

Um, it's a place for connection. Um, I can, the earliest form of coding, I think that entered my life was an uncle got me an HTML website book for kids and I was so proud. I made my little table and my headings and I was like, Mom, go look at my website. It's so sick. And I'm like, file colon slash slash.

And you know, obviously it was not on the web. Um, and, and then I kind of put, I put it down, you know, had someone in my life maybe been involved in computing, I, who knows, I could have been coding since I was seven, but, um, I kind of filed that knowledge away and ended up taking, uh, I first enrolled in mechanical engineering in college, and the mechanical section actually included a Python class.

Um, they wanted us to have a really broad foundation and so I took my CS, you know, entry level compute, you know, CS course, and was like, okay, I know Python. I, I'm comfortable on the terminal. I can ls. Okay. So, so in my mind there's HTML webpage stuff floating around and now this kind of concept of the computer, the terminal, the, the backend of the machine, the backend of the GUI.

Um, but I didn't really know how to combine those things. And luckily unluckily, um, I got halfway through my major in mechanical engineering and kind of hit a roadblock and ended up pivoting to, if you can believe this, a major called Engineering Management. So I literally have a degree in Engineering Management.

Yeah. Um, which is funny cuz I don't see that on the job applications. That's not a degree they're asking for

Rich: I honestly, I didn't know it was a degree that existed. This is literally, I think, the first time I've ever heard of that.

Paige: It's a little, I will say it's a little silly as an undergrad degree. I don't know that I trust anyone who, who maybe went straight from high school to college, um, to manage a team of engineers. But the purpose of that program was to take the technical minds within engineering and expose them to what it fully takes to run and operate and scale a business, which is marketing, it's sales, it's operations, like literal supply chains, not digital ones.

And so I walked away saying, oh my God, thank God we have people in finance who understand accounting and term sheets and thank God we have folks in marketing who are out there spreading the word and in the field. And, um, I, it stripped away a lot of the engineering elitism that I think, um, kind of persists in our society today.

I, I'm like, heck no. I am not an accountant. Please don't send me like term sheets or anything. Um, I, I just wanna look at, at the code in the computer. So I'd say that is probably what primed me, not only for SRE, to take an interdisciplinary approach to, to not just looking at your technical systems, but to say, how is it serving the business?

What is the business trying to do? How can I help marketers and sales and customer support? You know, looking at it all as one hashtag oneteam. Um

Rich: That's super interesting because there definitely is a lot of elitism in the industry. I feel like it's maybe getting better, you know, but maybe that's also because I'm in a bubble of people who aren't that way. But, um, but it's, uh, yeah, I mean, some of this stuff for me, like I, I was definitely that way, especially back in the, like the 90s, the, you know, the 2000s.

I thought we were the cool kids, you know, the ones who were like doing the Unix stuff and that, like, we were smarter than everybody. And, and, um, I, as time went on I started working at startups and, um, I worked at one startup a few years ago where I was laid off and I knew that it was directly as a result of us not making enough sales. And so, so suddenly, like some of these things start to click in your brain and it's like, okay, I really, I really need these salespeople, right? I really depend on them to even have a job.

Paige: Yeah, they are the early, I would say they are the, uh, early indicators of business health. Um, if you do not have a friend in Sales, uh, now's your chance to go make one. Um, highly recommend. Um, yeah, so, so I, I graduate with this degree, right? You know, nobody is trying to, I did not have people knocking at my door asking me to manage engineers.

Um, and I was in a little bit of a career crossroads. Where do I go to start, to start my working life. And I was very lucky through Startup Weekend, which is essentially like a hackathon over a weekend with business and software people. You kind of pitch a startup and then by the end of the weekend you pitch, um, you pitch your findings and what you've built and, um, people vote. It's really fun. But through there I met somebody who worked at New Relic and he was a product manager. And he was, he was friends with our HR person and he said, you know, I don't know if you're interested at all, but we are looking for a people ops person in the Portland office.

It's all engineers. You know, you should think about that. And I was like, oh my God, I just spent four years with engineers. I know how persnickety they can be, pedantic, you know? I know. Sometimes being technically correct is, is the most important thing. Um, I get these, these folks, these are my folks. Um, I, I would totally be a good fit. And that really started my tech career.

Even though I wasn't writing software, I think if you are working at a tech company, you've got a higher than average sense when it comes to the tech world and using applications and just, you know, you understand the massive role that technology plays in the business world, in the nonprofit world, all around us.

Um, we have those insights. So I'm there. I'm, I'm doing people ops. I'm loving it. I'm running our intern program and I happen to one day find out that the software engineering interns, their hourly rate was a lot higher than mine, and I'm sitting like, whoa. I know Python, I know HTML, right? No, I'm not an expert, but like the concept of a for loop, I'm aware of these things.

I know how to hack a Google sheet. So I'm like, what am I doing wrong here? And I looked around and all around me were, um, women who are software engineers, who are happy and fulfilled and, you know, reaching new heights. And I just, I saw an environment that I, I felt I could grow in. So I took the plunge of faith, ended up going to a bootcamp, and then thank God on the other half of that, um, at the end of that New Relic hired me back onto an internal tools team, where I got my, really, my start with this whole DevOps, SRE, infrastructure, you know, Terraform, um, I kind of got, when you're demoing, when your product is a monitoring product and you need to demo it, you've gotta build actually crappy apps.

You've gotta build apps that are flawed.

Rich: Yeah, yeah, yeah. I literally just did a webinar the other day with somebody who has a, an observability related product, and they had broken apps that we could deploy so that we could troubleshoot them.

Paige: Oh my gosh. Yeah. And, and really it, it's so. I credit so much of my growth to that team and the mentors that I had, I asked a very patient man named Gabe, what is a server? What is a server? I asked him that probably 10 times in a row cuz I'm like, I don't get it. It's a computer. It's a laptop. It's a laptop that's in a different room.

Why, you know, why can't I run it locally? And had I had someone who was, uh, I don't know, a little grumpy, a little crunchy, um, of an engineer, uh, who would've brushed me off and not taken that curiosity. Um, I probably wouldn't be sitting here today. So, uh, I hope people realize, like even within a short six years that I've been in this field, I have grown from what is a server to, oh, I've got the keys to all the clusters across the regions, and I know how they work on some level.

Um, so it's possible and we, we've really gotta open the door for the next generation of SREs and operators and, yes, this world is complicated. I was, I was a dev born in a Docker container, shipped to EC2 that was managed by infrastructure as code. You know, like I was kind of thrown into this complexity and really it's the power of events and observability, which we're gonna get into, that helped me as a newbie grapple with and understand the multiple environments at the multiple, um, companies I've worked at. Without that visibility and my strong foundation in those concepts, oh my gosh, uh, you don't wanna see me grepping through logs.

My grep skills are not great, 'cause that's, I don't find myself, you know, at the terminal to debug. I'm mostly using the tools. The tools that I've grown up with.

Rich: Yeah. Um, for folks who are outside of Portland, um, you wouldn't necessarily know the, the story, uh, in terms of New Relic and like their presence in the community here, but they're like one of the big tech employers here, one of the big startups. Um, I guess they're no longer a startup, they exited a while ago, but, um, but I still think of them that way.

Right, because I remember, uh, when they, uh, landed, they were actually in the same building that I worked in for a while. And you would see all these folks, suddenly there were all these people around with New Relic hoodies on, and it was like, what is that? Um, and uh, one of the things that I was always, uh, always kind of jealous of, is the fact that the people that I knew who worked there, um, they got to use New Relic for free. Um, which was kind of cool.

Paige: Oh my gosh, yeah.

Rich: The places that I worked couldn't afford it. Right. And it was like

Paige: Oh my God. Yeah. That is, that is the one thing when I talk to my ex-Relics, I'm like, okay, what was it like at your first company after New Relic when you didn't have this, you know, powerful... like, we, you know, the, the retention and stuff. We knew the value of what we were providing and we definitely took advantage of it.

So, um, it's been really interesting to ask that question. For, even for folks, um, I also worked at a tracing company called Lightstep. Um, even asking folks, what was tracing like at the company after Lightstep? And that's what I think these, these observability platforms do need to keep in mind is you can have a little bit of a blind spot because you, your company, will pay for all of the storage. Your company will pay to use all the features they know, you know, you can turn to a teammate and say, I don't understand what this means. And we tend to attract engineers that have a higher level of knowledge in general, just about monitoring, telemetry, operating these systems, time series metrics.

Um, and that is not always the case. You know, folks are not learning instrumentation and monitoring and thinking about the whole software lifecycle when they come out of code school or CS programs. Um, I know there's a few good ones these days, but like on the whole, I, I don't see it as a part of the curriculum.

Rich: Yeah. There's always been I think, a gap between, um, what people learn in school and what they really need, you know, to know to, to do this stuff for a living. Um, but that's, that's super cool. Um, I think that, um, I'm not sure if I might have met you before you were at New Relic, but I definitely remember you being there and then moving to Lightstep.

Um, we do want to talk about observability, that's, uh, kind of your area of specialty. It's something you're excited about, and, um, we, uh, are gonna talk about that specifically in terms of Kubernetes stuff. Um, so you had mentioned a couple things that you wanted to talk about, um, events and tracing, and I think those are both really good things to tackle.

So, um, why don't we, uh, maybe get started with Kubernetes events.

Paige: Yeah. Yeah. And I'll, I'll rewind a little bit to like, when did Kubernetes come into my life? When did I, when did I, when did, when did we cross paths, Kubernetes and I. And that was at my first, um, company after New Relic, I went as an SRE over to InVision who were really early adopters of Kubernetes such that, um, they were on the Kubernetes train before, um, Amazon had released EKS.

That was one of the first questions I asked. I'm like, okay, are we hosted Kubernetes? Are we doing it on our own? You know, that alone is a huge difference in observability and what you need to pay attention to. Um, and so I find out, oh, nope, we were early adopters. We're on, we're all on EC2 instances, we're provisioning clusters ourselves.

And, um, someone's like, go read Kubernetes Up and Running, and oh, here's Kubernetes the Hard Way. I'm like, okay, okay. I'm coming from my internal tools team, or, or technically the last team I was on was a product team, but we had a platform that kind of abstracted away the idea of, like, a node. I didn't really, as an engineer, know what region my stuff was running in, and the nodes, like, I didn't think about my infrastructure. I thought about my features. And so now I'm, I'm being given this thing called Kubernetes that has so many different components, and now my idea of an application is not just kind of one config file and my code, you know, compiled in a binary or shipped in a container, but now I'm like, I have a ConfigMap, I have a Service. I now need to think about this Ingress thing. I have a Pod. Okay, how do I configure all of this stuff? And, and all of a sudden what maybe used to be a Docker Compose file or, you know, kind of one big file of config options has turned into these separate objects that I have to think about that all have their own metrics.

And I think that is a real challenge for developers to pick up. Not that they're not capable, but we've really increased the complexity of the environment they're deploying into without giving them the training. Um, I don't know that it's okay to just expect, um, an average engineer to understand pod health metrics.

Why, why should they have to know the intricacies of that? Um, yes, all abstractions are leaky, but I think the complexity of Kubernetes is a lot to ask for developers to take in, in addition to all of their responsibilities. Um, I think it was Joe, your last guest, that, um, has a blog post of like, Kubernetes is a platform for building platforms, right? Like kind of that idea. We should not be exposing the guts of Kubernetes to our developers. And so, that, I guess is what I wanna open with, is I empathize very much, um, with the software dev side of, of this complexity because I had a real hard shift in that first job to get up to speed, not only on how applications are configured, deployed, and how to monitor their health, but also now, I've gotta manage the control plane, I've gotta manage my nodes.

I, because we didn't have the beautiful hosted Kubernetes, um, we were responsible for a heck of a lot. And the education is really the gap that I continue to see. We're now several years into Kubernetes being an enterprise standard, and I still talk to developers, um, all the time at different companies that want more information and they want to do the right things and they want to help, but they open up a dashboard with all of the different Kubernetes metrics and, and they have to learn this whole new language to just understand the code that they've already written and if it's okay or not.

So that, I guess I'll say, yeah, that is, I praise Kubernetes for the documentation. I praise it for, now there are a lot of approachable explanations, from like a comic to a zine to, um, you know, interactive sandbox scenarios. I think now is a great time to want to learn Kubernetes and upskill yourself. But it, but it is a challenge. So.

Rich: I totally get it. And I mean, I think I probably even said this in the last episode, but you know, my general line about these things is that um, I, I also have a ton of empathy for the developers, right. Because, you know, they're not hired to be Kubernetes experts, you know? Right. Like, that's not their job, and that's not what, when their um, quarterly or yearly review, whatever it is, comes up and they're maybe trying to get a promotion or a pay raise or something like that, you know, um, how much of a Kubernetes expert they are is, is not what they're gonna get measured on.

They're gonna get measured on the code that they wrote and the features they developed and the impact of those things.

Paige: Mm-hmm. I know. And we don't, I would love to see at least a Kubernetes or a production operations question thrown into software interviews. I do, I do think we, there should be some expectation you're responsible for what you write in production and we should get a sense of, of if you know how to operate things or what your level of experience is, operating things.

But, um, yeah, it's, it's complex. There's no way around it. Walking into a tech company today, whether you're a software engineer, a product manager, an SRE, a DevOps engineer, you are walking into a massive system that you've got to somehow cobble together in your head a mental model that is complete enough for you to make decisions and hope that your company's invested in not only monitoring iteratively, checking on your alerts, tuning them, not setting and forgetting them, but also evolving into observability, which is where, where I really see events and traces, um, coming into the picture. Um, do you, I'm curious, do you remember your first big incident and what, what data was there available to you?

Rich: Oh my gosh, Paige, you, you understand that I'm very old, don't you? Like my, my first big incident was probably in like 1996 or something, so I don't, I don't, I don't remember, but I do, um, I do remember that, I mean, I wouldn't say it's remembering as much, but I have this experience nowadays that's very funny, where I'll interact with tools and my thought will be like, oh, I wish I had this 10 years ago or 20 years ago, or whatever. You know, there's, there's so many great tools now and there's, um, a lot more information you can get. One of the big, uh, the big examples of this for me is eBPF, right? And like the, the amount of stuff that you can, like, learn about a system and, and, you know, I spent time troubleshooting incidents where, you know, we literally, there was not a way to get to the, the information that would've helped us solve the problem easier, you know, in the way that you can now, you know, where you can literally see like what syscalls are being made or you know, what's happening on the network. All these different things.

Yeah.

Paige: Wild. Okay. Because 'cause I have a pet theory that whatever tooling was there for you and whatever data types were there for you in your early moments of being confused and, and learning how to operate and manage incidents, that's what your right hand is. That's what you go to. And I, I don't know if it's generational so much as really what was there for you in your personal time of need.

And for me it was luckily these really fabulous traces or even within a single service, um, just sort of, I, I kind of think of APM, or application performance monitoring, as tracing just inside the bounds of one service. And that is extremely helpful, um, to look at those stack traces and stuff. So I kind of grew up with thinking of enterprise monitoring vendor tooling as this treasure chest of data that I could play with that was helpful, that could tell me things about infrastructure and my applications. Then when we get into Kubernetes, there's a whole lot of things that can go wrong. We've ratcheted up the complexity involved in not only maintaining a cluster, but, um, let alone configuring your application.

Say everything goes right. You've got every, you know, your pipelines are green for CI/CD, you've got your monitoring in place. Dashboards look good and all of a sudden you see that your deployment's failed. That is probably, that is where I like to start the conversation versus the heat of the moment, reactive, incident debugging, 'cause they're two totally different modes of investigation.

Um, and so for me, I'm always like, "describe pod." You know, "get events." Tell me what's going on. I wanna know as quick as possible, what and where. Why? I would love to know why. And, and a lot of the people that sell you why in a box, in sort of the AI Ops, I'm like, I don't know if we're there yet. I still think the human in the loop brings a lot to the table.

Um, it is our years of experience and our knowledge of the system and the people making changes to the system that lead to quick, um, troubleshooting or, or more efficient investigations.
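A minimal sketch of what "get events" surfaces, for anyone who wants to poke at this outside of kubectl: the same event stream is available programmatically through client-go. The namespace and output format here are illustrative assumptions, not anything specific to the setups discussed above.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the same kubeconfig kubectl uses (default path assumed).
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Roughly what `kubectl get events -n default` shows.
	events, err := client.CoreV1().Events("default").List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range events.Items {
		// Reason is the short "what": FailedScheduling, BackOff, NodeNotReady, and so on.
		// Message carries the detail you would read in `kubectl describe pod`.
		fmt.Printf("%-8s %-20s %s/%s: %s\n",
			e.Type, e.Reason, e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Message)
	}
}
```

Filtering for events with `Type == "Warning"` is usually enough to answer the "what and where" she mentions; the "why" still takes a human.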

Rich: No, I, I agree with that a hundred percent. And, you know, having spent time around some of the folks in the Resilience Engineering space, like some of my friends from Netflix and some other, other places, they talk a lot about, um, the role of the humans in these sociotechnical systems and the fact that, um, it's the people a lot of times that maybe even prevent the incident from happening in the first place, right?

Um, yeah.

Paige: Yeah. Yeah. So if we think about that case of a deploy gone bad, for some reason, the deploy failed. If you're working in a shop that has, uh, an internal tools team, or if we call it platform engineering, sometimes it's called DevOps, um, unless there's a lot of good developer training and enablement, um, oftentimes a developer doesn't know where or how to find that information or how to interpret "node not ready" or "image pull back off," like...

We can think, yeah, it says it right in the title, ImagePullBackOff.

Um, but does that person, you know, does that developer know about what do they know about Docker? Do they know about registries? There's a whole lot hiding behind image pull error, um, that we, we can quickly be like, oh my God, let me go, let me go check. I know where to go and I'm using my system knowledge to figure that out.

And so, I don't, so that gets at that human in the loop problem. So you could have the available data, you could have the event that tells you the reason that that event fired. Um, but if your observer, your person on the other end looking at that data doesn't understand how to interpret it, it's not, it's, I don't know, it's not as helpful.

Then is your system really observable? You've gotta consider the humans in the loop. And what I found, oh, go ahead.

Rich: Oh, no, uh, I mean, I was just gonna say that, um, I feel like, um, when I see people troubleshooting or talking about troubleshooting, I don't know that I see people doing "get events" as much. You know, like I don't know if people

Paige: Yeah.

Rich: are thinking about events or using them as, as much as they maybe could be.

Paige: Yeah. And if they do, they may not explicitly know that they're looking at events. When they look at that reason field, you know, uh, failed attached volume, node not ready, rebooted. Um, that is, that to me is the, the, like power, the power debugging, um, versus I could, I guess I'll step back and say when I'm talking about observability, I think about the data types that are available in a modern system.

We've got our metrics, we've got our logs, we've got events, which are: a thing happened at this time. Um, that's about as generic as I could get. Um, and then that concept of a trace, where you're taking a span, which you could think of as a structured event with some specific fields thrown on there, to stitch it together into a trace.
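To make "a span is a structured event stitched into a trace" concrete, here is a small OpenTelemetry Go sketch that emits a parent and a child span to stdout. The service and attribute names are invented for the example; a real setup would export to a collector or tracing backend instead.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Print spans to stdout so this runs without any tracing backend.
	exporter, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(context.Background()) }()
	otel.SetTracerProvider(tp)

	tracer := otel.Tracer("demo-service")

	// A span is "a thing happened at this time" plus structure:
	// a name, a duration, attributes, and IDs linking it to its trace.
	ctx, parent := tracer.Start(context.Background(), "handle-checkout")
	parent.SetAttributes(attribute.String("customer.tier", "free")) // hypothetical business attribute

	// The child is stitched to the parent through the context.
	_, child := tracer.Start(ctx, "query-inventory")
	time.Sleep(20 * time.Millisecond) // stand-in for real work
	child.End()

	parent.End()
}
```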

Um, that's a whole lot of information to take in. So oftentimes I've, I've been on this central SRE or DevOps team that's in charge of the journey from PR to production. Your pipelines, and then every environment in between there, you know, staging and prod.

Um, and, and I get a lot of developers saying, Hey, what, my deployment failed. I'm like, okay, why? Okay, give me the reason and, and I'll go look at the event. And then what was really the most beneficial part is when I'd stop and say, Hey, what do you already know about Kubernetes? Or have you, have you looked at, if I'm on this podcast, I'll call it kube cuddle.

Have, you know, what are your kube cuddle skills? Um, can I pair with you? And then you start to see, oh my God. Um, I'm talking at this level of abstraction layers and, and the problem is down here. And now I need to get your understanding. Um, I need to like to give you a crash course in the last 10 years of computing innovations, and explain why our company or organization has tried, why we adopted Kubernetes, which I probably wasn't even here for.

Um, so I'm gonna give you some, some of the canned answers. So like, there's, I, I don't know, I'd be curious what your take is on, on the wall between developers and operators, whether we've really crumbled that. Um, is it a chain link fence now? Is it that the Ops field is bigger cuz we have to care about Kubernetes and, and the developers are hidden from that? Where do you think we are?

Rich: Uh, well, it's interesting. Um, I think that DevOps started so long ago. I wanna say it's been 11 years now or something like that. And, and, um, I was working in operations at that time, very closely with software engineers. I was basically like an SRE before, you know, that term existed outside Google anyway.

And, um, and I feel like, I feel like we still have a lot of the same exact problems and I feel like, you know, I hear a lot of conversations where it's like, you know, platform engineers talking shit about developers, or vice versa. And, it seems like the technology has changed, and I think that in certain circles things are maybe better.

You know, I feel like there are people who have learned, and are building things in better ways. But, but definitely not everybody. And, um,

Paige: Yeah. We

Rich: You know, I'm thinking, yeah, I mean, I'm thinking of the DORA report, you know, if you're familiar with that stuff. Um, it's been a while, um, since I've kept up with it.

Um, , you know, there, there was this kind of categorization of like, these are like the really high performing teams in what they do, you know? And there were these kind of different levels, and I remember one year it came out, this was probably like even four or five years ago, but there was this really kind of brutal note in there about the fact that like, the teams that were doing the worst were actually falling farther behind, you know? And, and I think there's, there's an aspect of that, and, um, one of the things that, um, I think when people talk about platform teams, a lot of times we talk about building a platform that's easy for the developers to use, you know, and, and all of that, and abstracts things away and, and that's all great, you know, but, um, but it's interesting because, even if things are pretty easy for them to use, they're probably still gonna run into things that they don't understand if they're, if they're not a Kubernetes expert, and I feel like, that it's almost sort of a consultant kind of role, you know, in a way. That, that those people who are in those kind of central DevOps or platform teams, whatever you call it, you know, that, that it's exactly what you were talking about, about, you know, pairing with a, with an engineer and, and trying to help them learn what they need to, or working on documentation or, you know, refining the tools so they're, they're easier to use, whatever it is, you know.

But that, um,

Paige: Yeah, I've, I've, I've had the most luck when, and honestly it's not scalable, right? To do it one by one, engineer by engineer. But what, what I've had the most luck with is pairing with someone on their first config PR. And, and when I did that, I realized, you know, someone wanted to add, adjust their horizontal pod autoscaling.

I'm like, great, great. Uh, let's, so I'm like, okay, we're gonna open up your Helm chart. At the company I was at, we used Helm charts to generate the Kubernetes manifest in YAML and shoved that inside an Argo CD application. That itself was a part of an Argo CD app of apps that got deployed. And I'm like, oh.

And then we run into a Go templating error. Right? Classic. And I'm like, no wonder this is confusing. You wanted to change one line, and now you are looking at a Golang error when you're a Node developer and wondering why there's 20 different steps to even get the manifest that you're gonna be applying.
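To make the anecdote concrete, here is a hypothetical sketch of the kind of Helm template being described: the autoscaling values a developer wants to change get plumbed through Go templating before they ever become a Kubernetes manifest. The chart layout and value names are invented for illustration, not taken from any real chart mentioned here.

```yaml
# templates/hpa.yaml -- hypothetical chart; value names are made up.
# The {{ ... }} parts are Go templating, which is where a one-line values.yaml
# change can surface as a confusing Go template error at render time.
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ .Release.Name }}-web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .Release.Name }}-web
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilization }}
{{- end }}
```

The developer edits `autoscaling.maxReplicas` in values.yaml, Helm renders this into a plain manifest, and Argo CD applies it. If the `autoscaling` block is missing or misspelled, the render step tends to fail with a nil-pointer style Go template error rather than anything that mentions autoscaling, which is exactly the kind of surprise described above.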

Oh my God. And, and it just clicked for me. I'm like, oh, I've normalized a lot of this complexity and like, whoa, whoa. The best thing we can do for developers is at least give them tracing. At least give them the story of this is what happens when Kubernetes is trying to roll out a new deployment. Here's what happens when your Argo app is refreshed.

Um, the trace to me tells you the story and whether or not you intimately understand all of the details and what the kubelet is doing and, and whatever, is, is almost orthogonal to, to understanding these are all the things that happen when X event is triggered or, or, we need to update a config map and then, you know, restart your pod or whatever. To get the sequence of events in order and to be able to follow that is so powerful because then you get devs and operators looking at the same data. You know, I, I would so much rather have a dev say, Hey, my deployment failed. I took a look at this trace and I don't know what this thing is, but this is what took the longest. And boom, you know, I've got a great starting point.

And I'm like, thank you. Um, that's an ideal troubleshooting workflow.

Rich: Yeah. I mean, I think that, yeah, tracing has obviously become a lot more important over the years as we've, you know, moved more to distributed systems. Um, uh, I was looking, there's actually some interesting stuff they've been adding, like tracing to Kubernetes itself so that you can like see what's

Paige: It's in

Rich: Yeah.

Paige: 1.22. Everyone needs to get on it. Um, API server tracing is in alpha, um, powered by my favorite open source project after Kubernetes. Don't get jealous, Kubernetes. Um, OpenTelemetry.
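For anyone who wants to try the alpha feature she's referring to, enabling it looks roughly like this: turn on the APIServerTracing feature gate and point the API server at a tracing configuration file. This is a sketch; the exact apiVersion and flags are version-dependent, and the collector endpoint here is just an assumption.

```yaml
# tracing-config.yaml, passed to kube-apiserver via flags along the lines of:
#   --feature-gates=APIServerTracing=true
#   --tracing-config-file=/etc/kubernetes/tracing-config.yaml
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
# OTLP gRPC endpoint of an OpenTelemetry Collector (assumed to be running locally)
endpoint: localhost:4317
# sample roughly 1% of requests
samplingRatePerMillion: 10000
```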

And so, yeah. Share. Do you mind sharing with me your, when did tracing come into your consciousness and how has your opinion of it evolved?

As you know, it's been around for a few years.

Rich: Yeah, I mean, I, I probably heard about tracing, um, at first, or really started thinking about it more, um, in my last SRE role, which would've been like 2015 to 2017, um, in that period of time. Um, and, um,

Paige: So it was very nascent back then. It was, that was like you were writing the tracing tools. If you were tracing.

Rich: Yeah. Yeah. And I don't think that I was doing a lot of it myself, you know, but, um, but one of my, uh, one of my things I used to do a lot, um, is I was a very regular, um, attendee of the Monitorama Conference here in Portland. I haven't been in a few years. I haven't been since Covid. But, um, but that was a big thing.

And so, you know, every year I'd be sitting around with all of these monitoring nerds, observability nerds, whatever you wanna call them. And, and I learned a lot about tracing there, you know, people at Twitter and other places who were doing super cool stuff. Um, and, um, and yeah, it makes a ton of sense, right?

The, the idea that, you know, you've got data that's flowing through a system. It's getting handed off between services, you know, through API calls or, or whatever. And, and, um, having some view of what that looks like, a holistic view is, is super important.

Paige: Yeah. The, the analogy of a distributed trace is a stack trace across your distributed system. I'm like, I heard about it. I was on the, uh, one of the early teams at New Relic that, uh, launched it, at least, I forget which part of it we were responsible for, but that's when I learned about tracing.

That was in the era of everybody breaking up the monolith. Microservices are the new hotness. We don't know what we need to do. And tracing really emerged as like, hey, what used to work for, uh, you know, a three-tier web app, it does not work for this constellation of microservices that are all ping ponging.

They're controlled by different teams, maybe in different time zones. Oh my gosh. Um, now we've got so many languages and we've told, if you build it, you run it. So have fun with whatever language you want. We didn't ask the operators. We didn't ask the operators how many languages they were willing to support, um,

So, so I, it was easy for me to open my mind up to the possibilities that tracing could bring. In the years since I, I talked, I talk about tracing a lot and I talk about OpenTelemetry and people bring up a few common things. Like why is it that every company has like two zealots about tracing? It's like two people who are like, I could not live without it.

And then the whole rest of the org, no one's logging in. Like that, for sure. Okay. And I'm like, okay. It's cuz those people like were part of the early wave, they understood the value. They somehow pushed through your company's inertia to get some sort of instrumentation that was enough to help them. And they're not, they're probably not gonna quit. Like, oh my God, an end-to-end, fully end-to-end traced, um, infrastructure and app tier.

I would, I would die. If I was interested in engineering that, send that in my recruiting email. Um, but I've left those days behind, so please do not send that email. Um,

Rich: I have, I, I do have people actually hit me up once in a while for like SRE or DevOps roles on LinkedIn, and I'm like, you do not want me to sit in front of the keyboard trying to do this stuff.

Paige: Yeah. And then the other thing people say is, okay, well, tracing's been around for so long. OTel, yeah. It's, it's, you know, V1 now, so why isn't anyone using it? And I'm like hmm. There are people using it. There are, with any emergent technology or new kind of concepts or strategies or ways of doing something, you've got that peak of early adopters and that's, we're, we've passed that point with tracing.

And now we've gotten to the point where like, I know Shopify's running OTel, like bleeding edge version. They essentially help us test, um, OTel Ruby releases cause of, um, how up to date they keep their systems. And so, so part of me says, you know, sometimes tracing takes a while to take hold in a company, you have to be, you have to be conversant in it.

You have to maybe have an opinion about what tool stack you wanna use. You need to have a vision for the future. In the long term it's not just as quick as dropping it into one service and organically letting it take hold in the org. No, it is an effort. You've gotta invest. Um, but I, I think we haven't yet seen the year of tracing and maybe 2023 is the year for tracing.

Now that we're up in Kubernetes, now that OTel is sprinkled on the API server, everyone's gonna have access to it. Whether or not you run a tracing backend or you're paying for tooling, any one of your devs can spin up a little Jaeger backend and start playing with traces and start to explore it with data they care about.

Um, so I think, I think the future is really bright for OTel now that we're in Kubernetes. Now the party's just begun.

Rich: No, I, I think that's huge. Um, and I think that, I've been, I've been pretty optimistic about OpenTelemetry from the beginning, you know, because

Paige: Mm-hmm.

Rich: There were the two competing projects and it was sort of a mess. And I thought it was so great that people came together, you know, and settled on one standard and it really seemed like all the vendors were supporting it, you know, it really seemed like people came on board and were really, you know, honestly committing to, to, um, to using it, you know, which is obviously the important part, right? Like you could decide on a standard, but like if, if you know the vendors aren't gonna implement it, then, then, um, that's not gonna go anywhere. So, yeah, no, I, I've, um, I've been excited about it. There was actually a thing that came up at KubeCon that was really interesting. I saw a, a talk from, um, Alex Jones, where he talked about, there's a new thing called OpenFeature that is sort of like OpenTelemetry for feature flagging.

Paige: Huh. Okay.

Rich: And it's the same kind of deal, you know, they, they settled on a standard and they've got a bunch of vendors on board and, um, yeah. And, and I'm excited to see that that sort of, you know, pattern like repeating itself.

Paige: I think, I really think that open source is, we are just beginning to reap the benefits of like, everybody coming to the table in these standards. I, I've, I've worked for vendors and I watched, I watched OpenTracing and OpenCensus and I'm like, oh no, which is gonna win. And I was at that Monitorama when Jono was talking about, here's the history and look, we've merged and now we're OpenTelemetry.

And I was like, you're kidding me. I, unless you've been deep in the monitoring world and keep track of all the monitoring and observability vendors, you don't know what, what a like tectonic shift that is to say, no more proprietary agents. If you want to be a part of this new world of observability, you need to come to the table and we need to figure this stuff out because I want the world where you instrument once and you're done.

I mean, you need to keep going back and adding business attributes, but like, I don't wanna keep plumbing through the same stuff day after day. Um, so, and, and that is, the thing about open source that I've realized, this is my first year that I've had the time and space to start coming to SIG meetings for Kubernetes and OTel

Rich: Oh, nice.

Paige: is, oh my gosh.

Open source is powered by people. It is the people. It's open source. We're doing it for free. Like yes, we work at companies, but very few companies have pure, a hundred percent, please work on this open source project. I'm like, oh my God, we did this all. Like y'all did this all. Wow. And, and when I hear the complaints or like, oh, xyz feature is not there, I'm like, this is open source. It's incumbent upon us as the community to be involved. It is a give and take.

Um, and so yeah, I'm, I'm like excited. I'm excited to bring more people into this world. I've, um, got one last Kubernetes, uh, event thing to talk about, I know we're at time. Okay. I, I would love it if anybody has stumbled across this project called kspan, turning Kubernetes events into spans.

This one. Yeah. Let me drop a link. In lieu of, you know, maybe your company's not running 1.22 with alpha Kubernetes tracing turned on. Maybe you're a developer that doesn't have access to the cluster config. Um, I really saw this exciting project where we're gonna take these events from Kubernetes that we've talked about, you know, node not ready, pods scheduled, xyz, um, and stitch them together into a story of here's what happened when your deployment rolled out, with native Kubernetes events already.

Why has this project not had updates in two years? One year? Somebody, please tell me, because I, I think this is awesome. Um, and I would like to see more of this. Um,
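For the curious, the core idea behind kspan can be sketched in a few lines of Go: take a Kubernetes Event and re-emit it as an OpenTelemetry span using the event's own timestamps and reason. This is a toy illustration of the concept, not kspan's actual implementation (which also groups related events under parent spans per object); the attribute names are made up, and it assumes a tracer provider has already been configured elsewhere.

```go
package eventspans

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
	corev1 "k8s.io/api/core/v1"
)

// EventToSpan re-emits one Kubernetes Event as a span, so a rollout's
// events can be read as a timeline in a tracing UI.
func EventToSpan(ctx context.Context, ev corev1.Event) {
	tracer := otel.Tracer("k8s-events")

	start := ev.FirstTimestamp.Time
	end := ev.LastTimestamp.Time
	if !end.After(start) {
		end = start.Add(time.Millisecond) // give zero-length events a visible width
	}

	// Use the event's Reason ("Scheduled", "Pulled", "NodeNotReady", ...) as the span name.
	_, span := tracer.Start(ctx, ev.Reason, trace.WithTimestamp(start))
	span.SetAttributes( // attribute keys here are illustrative, not a standard
		attribute.String("k8s.object.kind", ev.InvolvedObject.Kind),
		attribute.String("k8s.object.name", ev.InvolvedObject.Name),
		attribute.String("k8s.event.type", ev.Type),
		attribute.String("k8s.event.message", ev.Message),
	)
	span.End(trace.WithTimestamp(end))
}
```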

Rich: I will, I'll definitely link to that in the show notes. And,

Paige: Please. Events are, events can be spans, spans can be events. You know, let's, let's think about the nuances of our telemetry. Um, it's not the cut and dry three pillars that some people talk about.

Rich: Yeah, for sure. Um,

I wondered if, if we could chat just for a couple minutes about, um, SRE and maybe why you got out of it, because I, I feel like we maybe had a somewhat similar experience. Um, I've read some of the things that you've written about, you know, your burnout and your radical sabbatical, where you, you took some time off and, and kind of recharged. I wondered if maybe you could tell us that story a little bit.

Paige: Absolutely. Um, I had been working at multiple startups back to back. Um, when I was at New Relic, it was just after they IPO'd. So we were in that teenage adolescent, um, startup phase, or we were growing out of it. And then I went to two companies back to back where SRE was really responsible for infrastructure, security, CI/CD, release, um, in addition to on-call, incident response processes, um, keeping the lights on, uh, capacity planning, you know, kind of all of the things that it takes, you know, all of the things that it takes to keep, um, a business humming along. And during those, during those last few years, it's been the pandemic. Um, I, I radically had to look at where I was pouring my time and energy and also think about what was sustainable for me.

Um, I'm married to someone who's on call, and on call for GitHub. So like front page news, front page of Hacker News kind of incidents. Um, and it's a lot. The human sacrifice that operators today make, even in systems that have high observability. If there's not a big enough group of people, of specialists that also understand Kubernetes, also understand the cloud.

Um, it is a big toll. It is a big, um, it's a big ask that we have of our SREs. Um, sometimes too, sometimes it's not realistic. And if I were someone who didn't, I'm a worrier, I'm a natural worrier. And, and I thought that was the best trait for an SRE. I'm always looking for issues in prod. I'm, you know, I know my tools.

I know I'm gonna check on my containers, xyz, um, but in fact that anxiety consumes you. Um, like you, you do, there, there are always things going wrong in a distributed system. There's always failures to find, and I reached a point where I was burned out in a way that I was, I wasn't even a good teammate. Um, that's one of the things I really pride myself on, um, is my ability to help others on my team. I always wanna be the bridge to understanding.

And I was so consumed with on-call burden and just the fatigue of the pandemic, and working at a startup pace is, oh my gosh, I don't know how many years I could work at a startup pace. Um, I, I... Props to everyone who's been in the game for a while. Um, and so I took a step back and said, what am I good at?

What do I like to do? What can I do? And like, I don't yet own a home. So part of me in my mind is making this practical financial decision. How much of my technical knowledge, um, can I monetize before, God forbid, the next Kubernetes comes out? Um, I'm not learning it. I'm not, I'm sticking with Kubernetes. I will be like the COBOL cowboys of Kubernetes in 20 years.

Um, I'm the buck stops here

Rich: Heh.

Paige: Yeah. So, so for me, and maybe had I had a decade of experience working with the nuts and bolts of, of operating systems, and if I, gosh, if I had a sysadmin job where I really did have to care for and configure servers and care about their security, versus, oh, we could just restart. Oh, kill that node pool. Uh, it doesn't matter. I would probably feel a lot more confident in the knowledge that I have today. But I don't know if, if the big one hits and, and we've gotta get on ham radio to talk, I'm not gonna know how to set up a new internet. And I didn't feel good about that. I'm more of a concrete person. It's wild that I ended up here, that I started with mechanical engineering. I need to see and feel and touch things. I need to have it in my space. So the fact that I talk about containers and clouds being orchestrated by other computers that I configure from a computer here, it is hilarious that this is where I ended up.

And so DevRel was just the natural next step for staying within tech, and really, I love my job. I get to bring people's aha moments with observability. It's not the scary thing, it's data that you do already look at if you are looking at Kubernetes events, and if you're not, check them out. Um, please. Uh, it's a kube cuddle away. Um, all of this rich, rich data. So yeah, to the SREs that are still on call, my thoughts are always with you. Always, all of the hugops, um, because I'm out of the game.

Rich: Yeah, I mean, I reached a point where it was a combination of having been on call for many years, and I also found myself getting a lot of imposter syndrome, and I think that it was because this cloud native space is exploding so fast all the time. You know, I was always like seeing people talking about some cool new tool that I felt like I should know about, but I didn't have time to play with it, you know?

Certainly not in my work hours. Yeah, yeah.

Paige: Service mesh. I'm like, what? Why? Why do we need this? I had, I had a debate with an engineer and I'm like, I don't think we need this. That's a really big project. Like here are things we could do with the system we already have. And yeah, that CNCF landscape, I show that when I go give talks to college students.

I'm like, look at this

Rich: Yeah.

Paige: look at this.

Rich: Yeah. Well, I'm, I'm glad that you landed somewhere that is working out well for you, and I'm, I'm glad that you're doing DevRel. It was super great to see you again. It's been so long since we've had a chance to chat and, um, it was great to catch up. Um, I will, uh, throw some links in the show notes to, um, your Twitter, and you're on Mastodon now,

I see. Um, uh, because gosh knows what's gonna happen with Twitter, but.

Paige: It's so sad. I feel so, I've never worked there, but I know what it's like to watch a system you really loved, poured your time and attention and goodwill into, and to watch it crumble is just, oh, it's heartbreaking.

Rich: It's brutal. I mean, one of my friends who worked there, I, I mentioned earlier, you know, seeing people talk about the tracing and stuff there. One of my friends was on the team that built that stuff

Paige: Mm.

Rich: and, it's like, yeah, just, just brutal stuff. But, um, you know, hopefully.

Paige: Mastodon. Find us there.

Rich: Yeah.

Yeah, yeah. We're, we're both there. Um, do you have anything else you wanna mention before we get going?

Paige: Yeah, I, I think if you are inspired at all to dig more into this world or to discover for yourself, um, how to observe Kubernetes clusters, the SIG Instrumentation, and also OpenTelemetry's various language specific SIGs, as well as the collector, are always open. They're full of friendly people who would love to have you join us.

Um, whether or not you just learn about, you know, what our top concerns are, what we're working towards, or if you wanna become a contributor, um, I think it's something you should all consider. If you're a user of Kubernetes, um, and now a user of OTel, whether or not you turn on tracing in 1.22, um, think about giving back.

Rich: I will, um, make sure to link to the docs on the, the tracing stuff too. Um, I think that's, it's super cool. I was really happy to hear about it, and I think that, like, people being able to see traces even in terms of what the kubelet's doing and stuff,

it's really, really fantastic.

Paige: Yeah. And I would say this is where I'm, a lot of times I romanticize the past. I'm like, ugh, I wish I was a sysadmin at a data center, plugging in things and looking at lights. Um, but I have heard the open source experience has not always been as pleasant, um, as it is to come to Kubernetes or OTel.

And so I really wanna give props to all of the folks that have worked hard over decades to make open source welcoming, inviting, and a place where I feel comfortable showing up to a meeting with strangers that I've never met. Um, I, it's getting better. I like it. We've got progress.

Rich: Fantastic. Yeah, I mean, we were talking about the old days and, and it's like, I remember the world before tracing, and I would not want to go back. Like grepping through logs of different services on different servers and hoping that, like, are the times, like, are they, are they synced up? You know, like, not fun.

Well, thank you so much for coming on, Paige. It was really great to chat with you and, um, I will, uh, again link to, um, those things that we talked about in the show notes.

Kube Cuddle is created and hosted by me, Rich Burroughs. If you enjoyed the podcast, please consider telling a friend. It helps a lot. Big thanks to Emily Griffin who designed the logo. You can find her at daybrighten.com. And thanks to Monplaisir for our music. You can find more of his work at loyaltyfreakmusic.com. Thanks a lot for listening.
