Maarten is in conversation with Ramón Medrano, Senior Staff Site Reliability Engineer at Google.
In this conversation Maarten and Ramón discuss how the principles and practices of Site Reliability Engineering (SRE) can be applied to the practices of Data Reliability Engineering and data quality management. They deep-dive into four topics - SLOs, lineage, debuggability, and how to operate as a team - from the book Site Reliability Engineering: How Google Runs Production Systems, co-authored by Ramón’s manager, Jennifer Petoff.
As the book explains how Google’s SRE team builds, deploys, monitors, and maintains some of the largest software systems in the world, Maarten and Ramón’s conversation explores how data practitioners can apply some of the best practices, processes, and thinking, when it comes to data and systems.
Welcome to the Soda Podcast and welcome to season one of the series In Conversation With. Just like good data helps the world go around, so do good conversations. Your host is Maarten Masschelein, CEO and founder of Soda Data. In this series, Maarten will be in conversation with practitioners, technologists, and change makers who all share a passion in making meaningful connections and rethinking traditional practices. They'll be talking about data, what makes their world go around, and sharing their thoughts, perspective, and ideas that we think will inspire you to be a part of the conversation and be a part of the change. Without further ado, here's your host Maarten Masschelein.
Hi, everyone. I'm Maarten. I'm the CEO of Soda Data, and this podcast series is called In Conversation With. Today I'm in conversation with Ramón Medrano. Ramón is a site reliability engineer at Google. He started there back in 2011 as an intern, then moved into a technical lead role, later engineering manager, and now he holds responsibility for privacy, safety, and security teams. Ramón, welcome. Let's get the conversation started. Could you introduce yourself a little bit: your background and what you do day to day at Google?
Thank you for the introduction, Maarten. So I'm originally from Spain, as my name and my accent suggest. Prior to Google, I was working at CERN, the physics laboratory in Geneva, on the ATLAS experiment, doing data management for the accelerator. We were doing the acquisition of the data from the detectors, and then the management, aggregation, and analysis of it. So we did all that discovery work at the time, 10 years ago. Then I moved to Google, where I'm part of the identity team. We manage all the data for the authentication of the user: anything that has to do with your user account and authentication, like your passwords, second factors, and some of the authorization we do. We manage all the accounts, from consumers to enterprise, and some part of cloud authentication as well.
So it's pretty data intensive, we could say. Not like all the products, but the thing with authentication is that if we are down, the whole company, or most of it, is down as well for all the customers, right? So it is a good challenge. Then, how the day to day looks in my role as an SRE for this team: basically, we need to make sure that the services are up and running. But there is as well an aspect of security and trust from the customers, because we need to prevent, for example, user accounts being hijacked by different actors, who are always trying to hijack your Google account for sending spam or whatever. And we need to make sure that the services are fast and secure for the rest of the company, because authentication is what happens at the beginning of every flow. So if that goes wrong or slow, then the rest of the flow is just doomed.
Awesome. Well, it's great to have you here. So thanks for being here and you're dialing in from Zurich, if I'm not mistaken. And I also always love to ask my guests, tell us something about the city Zurich, something you'd love to do or something that attracts you to the city.
So it's a pretty multinational city for being small. We have about 400,000 people, a bit more with the agglomeration, so it's not a huge city. The nice thing is that it's a tech hub for the center of Europe, and it's not only Google; there are many other companies. So finding people with the same interests and so on, I think that's nice. And then for my daughters, for example, something that they like very much is going to the mountains, and we have that about an hour by car from the city. That's pretty cool because in summer you can go with the bike. I like downhill, so I take the bike a bit up the hill and then ride down. And in winter you can just ski, which I think is fantastic. I come from Spain, where we don't have that much, at least in my region, in the north. We have the sea, but not a lot of these kinds of activities. So we enjoy that very, very much.
Awesome. And today's conversation is all about site reliability engineering, so let me abbreviate that for the rest of the conversation to SRE. SRE originated at Google, and we definitely want to learn a bit more about that today, but more importantly, we also want to learn how SRE applies to data. There is a lot of influence from software engineering coming into the data and data management space today, and principles and practices like SRE are definitely among them. You are part of Jennifer Petoff's team. She put you forward to be on this podcast today, as she co-authored the book Site Reliability Engineering: How Google Runs Production Systems. This book explains how Google builds, deploys, monitors, and maintains some of the largest distributed systems in the world. Could you give our audience, to get started, a beginner's introduction to SRE?
So SRE, the name comes from a few years ago, in 2003, when the VP that we have for production, Benjamin Treynor Sloss, said that we need to have software engineers running production. What would a software engineer do when running production? Instead of doing operations by hand, you would just write software to do it for you. So that's the origin. "Site" comes from the fact that Google had a single site at the time, so we were the engineers that did reliability for our site. I think now the acronym has lost a bit of its meaning, because we have many products and tons of services and so on, but it stands.
I think if I say DevOps, people will be more familiar with the role. DevOps is a set of principles: in the past we had developers on one side and operators of services on the other, and we want to bridge that gap, right? And have automation and infrastructure as code and all that stuff. So that philosophy is the origin of SRE at the end of the day, and it's the philosophy that we use. I always like to say that DevOps is like the interface, and SRE is a class that implements it. So it's an implementation of that interface, if you want.
So we bring all the DevOps principles to the table, but one thing that we use a lot is SLOs, service level objectives. It is something that we define for a service: what is the service level? We describe the service's performance and then we set an objective for that, and then all the automation, prioritization of projects, and so on goes along those lines. And then we are, let me think how to say it, the owners of production in some way. So we can contribute wherever we see fit for the reliability of the product. Automation is the thing people know best, but we can also build features for a service, right? If that makes the service or the product perform better or be more reliable, it's something that is in our hands all the time.
Awesome. That makes total sense. What would be your favorite real-life story or use case that really explains why you need SRE? Do you have one in mind?
So, well, I have one outage that we had a few years ago. This is kind of old, it's from 2015, I think. But it reflects, I think, two things: why we need SRE, or what SRE does in practice, and why data is quite a challenge even for SRE teams like Google's, where we are like 3,000-4,000 people, right? And it's in the authentication database. We have a database that has primary data: a table with your username, some identifiers, the salted hash of your password, some small metadata. And we use that for authenticating you. So whenever you go to accounts.google.com or whatever, and you need a cookie, for example, to access email, we will just check that database, right? And that database has a schema, and there was a change in the schema at some point.
So there was a batch job there. It was changing that schema, I think modifying or adding a field. Anyway, there was a batch job that would run, set up the new field, and set a default value for the accounts that didn't set it. And at some point the default value was incompatible with some particular kind of use case that we have around, and that didn't materialize until very much later, like three, four months after the modification of the database, right? So SRE could do, for example, the validation of these changes, the progressive rollout of the change. For example, testing that the new schema will work with the binaries that are running on top of it, testing that there is sufficient capacity in the database to hold all the new data, testing that the queries are optimal, so it is not going to take down the database because you want to join two tables in some particular way that is very expensive. Right?
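The kind of pre-rollout validation described here can be sketched as a simple check: run every known consumer of the table against records carrying the new default before the change ships. This is a hypothetical illustration, not Google's actual tooling; the field name `risk_weight`, the consumer `legacy_risk_score`, and the default value are all made up.

```python
# Sketch of a pre-rollout schema-change validation pass. All names are
# hypothetical illustrations, not real Google systems.

DEFAULT_VALUE = None  # default assigned to accounts that didn't set the new field


def legacy_risk_score(record: dict) -> float:
    # A hypothetical consumer of the table. It assumes the new field is
    # always a number -- exactly the kind of assumption that broke months
    # after the schema change in the story above.
    return record["risk_weight"] * record["login_count"]


def validate_schema_change(sample_records, consumers):
    """Run every known consumer against records carrying the default value."""
    failures = []
    for record in sample_records:
        migrated = {**record, "risk_weight": DEFAULT_VALUE}  # apply the migration
        for consumer in consumers:
            try:
                consumer(migrated)
            except Exception as exc:  # incompatible default detected before rollout
                failures.append((consumer.__name__, record["user"], exc))
    return failures


sample = [{"user": "alice", "login_count": 12}, {"user": "bob", "login_count": 3}]
problems = validate_schema_change(sample, [legacy_risk_score])
print(f"{len(problems)} incompatibilities found before rollout")
```

Catching the `None * 12` failure here, in a validation run, is the cheap version of the outage: the same incompatibility surfaced in production only months later.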
And then, on the other hand, data is hard, right? Because this particular outage was not affecting the availability of the service until much later down the road. There are many other corner cases and stories about how this can happen. So validating or approving these changes, in the sense of "this is safe for production," was something that was not trivial, and that we didn't see at the time, actually. And then we got the corresponding outage in exchange.
Yes, I cannot even fathom or imagine what the downstream impact of all of that was, but it probably got a bit messy and required quite some manual effort to fix and get back to the regular, normal state.
So I remember that I was paged in the morning at 7:00 AM. I was at home. When you have been operating a service for some time, for some pages you know, well, this is okay. And for some other pages, when you see them, it's like, oh crap. This was one of those, because, as you see, I don't know how many million people were affected. And then the problem was that it was not clear how to do the triaging. When you are speaking about an interactive service, like an RPC server, it's something that is synchronous, in the sense that you can say, well, this service is calling this front end, then there is a series of backends, and here is where the failure happens. And then you can see it, for example, in a stack trace or whatever it is. Whenever you have something like that, a field that has a default value that is not well processed, you have two problems.
That is usually what you are going to see in the stack trace, for example, or in your distributed tracing: you call the database and the database returns the data. And it's like, okay, it's not the database, because it's returning the data that you want. The problem is that the data you are storing there is not what the application is handling. And then another thing, and this is what was happening in that case: not all the accounts were affected, only a small subset. So you need to find a needle in a haystack, in the sense of, okay, out of these billions of accounts, there are only a bunch of them affected. And that, at the time, was manual for us to find, and then we needed all sorts of people related to the system to help us understand what changes were in flight, the latest schema modifications and so on. But this was three, four months back, and we were reviewing them in reverse chronological order. So yeah, that was hard.
It's a very interesting example. So I would love to move to the role of an SRE, a site reliability engineer: what is their background, how technical are they, for example? But also where do they organize or report into, how is that structured, maybe at Google, but if you have a perspective on how that works outside of Google, that would be fantastic too. Maybe even how SREs are evaluated: are there certain metrics or systems to evaluate how good a performer an SRE is? Those would all be areas I think would be very, very interesting to dive into.
So SREs, we are software engineers at the end of the day. At Google we have two variants: software engineers in SRE and systems engineers in SRE. I'm a systems engineer. What we are looking for is people that have hands-on experience at the intersection of writing software and running systems. It's not a perfect balance that you need to find; there are people more skewed towards systems and people more skewed towards software, but you need to be comfortable writing software and understand how, for example, a network stack works, or how processes and file systems work in a system. Then, as for how all of this is organized: Google, I think, is the only big company I have seen where SRE is its own organization.
So we have an org that is SRE, with a VP for SRE and so on and so forth. And then there is another hierarchy for the developers. We find this works well in the sense that our commitments are for the system itself. When we onboard whatever product or service in our team, our role is to say, well, we are going to bring this service to the reliability that the product needs for the business definition. Whatever it is, a chat app, or a new thing that we have launched for video, the product managers will say how much reliability they need. Obviously they're going to ask for a lot right at the beginning. But then there is a conversation between the developers, who want to ship very fast, and SRE, whose incentive at the end of the day is stability, not changing production.
Then you have the product managers, who understand costs, for example, or understand what the customers want. And you can say, "Okay, if you want five nines, you are going to need to hire 150 SREs for this." And then they're going to say, "Okay, perhaps not." And then there's a balance, a trade-off to find, where you say, okay, I don't know, four and a half nines for this service is going to be fine, we are going to need to staff this team this way, and so on. And then we start doing projects in partnership with the developers. And there is a point in time when the service is stable enough, meeting the SLOs that we have set for it, for the customers. At that point, SRE, in principle, is done. We could just say, "Look, we are handing the service back. You just take care of it." The developers will be on call for it, and SRE will just go and move on to something else. That's the theory; there are many services that are critical enough that they will in practice always have SRE support.
Other companies do this differently. I know that, for example, at Meta they have the concept of a production engineer, which is the same as an SRE, right, but they are part of the developer team. So every developer team will have X software engineers and Y production engineers, and they will balance this depending on the needs they have over time. I've seen this model as well; both work, depending on how your company is organized. And then there is another model that I saw, particularly at CERN, for example: we had DevOps engineers doing general DevOps work for the whole department. So it could be an engagement for the whole system, not for parts of it like we do at Google.
And then the evaluation, I think, is a product of how you organize the job. For example, at Google we have a completely separate job description and job ladder. So you have the role of a systems engineer in SRE, with all its levels and requirements. And there is this process that the company does twice a year: "These are your expectations, this is what you did, this is how it matched," with all the scoring and all that stuff. But at the end of the day, it comes down to the projects that you're able to do, the change you're able to effect in the system, how effective you are at on-call and operational duties, and the scope of your work.
That makes a lot of sense. A question that pops up in my mind is how Google is different from other companies in certain ways. Google is of course known for a lot of different products and services, but one that I was thinking about is everything related to document management and document storage, which I would call unstructured data, whereas other companies might have a lot less unstructured data and more structured data. I guess Google has all of it, but other companies might have a bit more or less of one or the other. Would that be a valid statement? And would there be other differences between Google and typical other companies or organizations?
I think it depends. It's a pendulum that swings depending on where you find yourself. A few years ago, for example, when I joined the company, everything was NoSQL, kind of unstructured data, because it lends itself very well to this concept of scalability: you have one global system that will scale ad infinitum and will be very reliable. And that was possible when we had use cases that were highly dominated by consumer workloads. With consumer workloads, for example, and this is not to diminish anything, consumers do not expect to have four nines on their own account forever, permanently. They sleep, for example, and they don't use their phones, and all that stuff. So you can afford to build these kinds of systems, and having unstructured data with no strong schemas and so on helps you scale these systems up.
Then there is this rise of cloud computing, and companies are moving their workloads there, and so on and so forth. So you start to see more and more workloads that are dominated by, let's say, robots. You have bots and machines running workloads that you don't know about, and they're calling APIs. You have things like containers, and products built on top of your products that have their own behavior. And then you have a compounding effect that smooths out all these availability requirements. So you start to have systems that need to be available all the time, with high performance and so on. So building these kinds of global systems with single instances starts being a challenge for reliability. And that's where the pendulum swings back to the other side, which is like, well, perhaps we should not have, for example, totally global systems.
We may have to have different instances per, I don't know, per cloud region or per continent or whatever it is, so they are differentiated. And then having a better notion of schemas, for example, helps you offer better services to your customers, because you can have APIs that are richer, that customers can understand better, and that are easier to use. Because if you say, look, here's an HTTP endpoint and you will get some JSON at the end of the day, that's not a great experience, I would say. So it depends what you're doing. I think now we are seeing a mixture of all these classes of databases.
Makes sense. As I was preparing for this podcast, and as I mentioned at the beginning, a lot of these principles and practices are becoming more commonplace in the data management space, a lot of concepts and best practices from software engineering. And I think it goes far: it starts from everything-as-code and version control for data, and it's now extending into how we operate teams, being on call, taking responsibility for the reliability of systems and for quality. So there's a lot of change and an influx of new ideas happening. As I was reading through the book and as we prepared for this conversation, there are four topics, really, that we wanted to dive a bit deeper into, also to see whether there are differences between infrastructure reliability and data reliability.
If I recall correctly, the four key topics that we wanted to dive into were: SLOs, service level objectives, which we've already talked about; the concept of lineage and traceability, I think in the context of debugging; debuggability as a topic, and what are some of the tools and things we need to do that properly; and then how to operate as a team. So I propose we take them one by one, dive into the concepts, and have a discussion around them. What are some best practices, for example? What are some things Google is doing? And what are some learnings or things that could be applied in other industries, or maybe in less data-intensive companies?
So let's maybe start off with service level objectives. You mentioned them earlier: they consist of indicators to measure, I guess, the reliability and the health of a system. But would you mind giving the audience a bit of a primer on the concept of SLOs?
So yes, an SLO is something that we SREs, since we are working with them all day, have this bias of assuming everyone knows, so I think a primer is a good thing to do. The first thing that you need to do to define an SLO is define an SLI, a service level indicator, which is a metric that describes your system in some particular way that you are interested in. The classical thing to do with services that are transactional, like, I don't know, an RPC server or a database, is to ask: of all the requests that I send, how many are successful? So, the proportion of requests that are successful. You can make it as simple as that, or more complex: for example, the proportion of requests that I send to the database that are answered with a 200 in less than 300 milliseconds, or with even more properties.
But at the end of the day, it's a number that tells you where your system is at. Then you have the SLO, the service level objective, which is an objective for the SLI that we just defined. You want to have the system in such a way that you can say: my system is available enough if the SLI is always over, I don't know, 90%. So I want over 90% of the requests to be answered successfully by the system in less than 300 milliseconds.
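The SLI/SLO definitions above can be sketched in a few lines; this is a minimal illustration, not any particular monitoring product, and the request data, threshold, and target are made-up values:

```python
# Minimal sketch of the SLI/SLO pattern described above: the SLI is the
# proportion of requests that succeeded within the latency threshold,
# and the SLO is a target that proportion must stay above.

LATENCY_THRESHOLD_MS = 300
SLO_TARGET = 0.90  # "over 90% of the time"

# (status_code, latency_ms) for a window of requests -- illustrative data
requests = [(200, 120), (200, 280), (500, 90), (200, 450), (200, 30)]

good = sum(1 for status, latency in requests
           if status == 200 and latency < LATENCY_THRESHOLD_MS)
sli = good / len(requests)  # 3 of 5 requests were "good" here

print(f"SLI = {sli:.2f}, SLO met: {sli >= SLO_TARGET}")
```

In a real system the window would be a rolling period (a day, a month) fed from monitoring, but the arithmetic is exactly this.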
And then you have the SLA, which I think is something people are more familiar with. The SLA is a service level agreement, which at the end of the day is a contract. It's a contract with your customer that says, "Look, this SLI is going to be over this threshold all the time, or 99% of the time. And if that doesn't happen for some reason, we will give you something in exchange, whatever it is." Some cloud providers give cloud credits, so you get a discount, or they will give you money back. Depending on the service you are providing, it might look different.
I was saying that this is the classic example for transactional services, like an API endpoint, right? That's very typical.
Very good. So you could apply this concept quite easily to a data flow. Let's say we have a company where there's an analyst within the finance team, and they want or need data by a certain point in time in the morning. And they need it to be completely available, especially around a certain set of columns, because those may be important for regulatory reporting, who knows? So you could apply this concept to that flow, saying, "Well, in order for us to achieve that, the data needs to arrive at least an hour earlier in our analytical platform." And then we have about half an hour to 45 minutes to do data transformation and preparation, so that it's ready nicely on time for our end consumer.
And the contract agreement would really be with the end consumer, with that analyst within the organization, saying, "Hey, we're going to make sure that data's there. We agree or commit to it." Because it's internal, it's maybe a bit hard to work with credits, but that could be an internal example of setting up SLOs and SLIs. Would that be a good one, or am I looking at it in the wrong way?
I think that's right. In your example, the SLI could be the number of days that the data got to the data analyst, or whoever needs it, on time for them to use it. And then your SLO would be: we want this to be, I don't know, 29 days out of 30 every month. Because a hundred percent... unless there is some regulatory requirement, I'd recommend not to aim for a hundred percent, because it's just super expensive. So unless you have to do it for some particular reason, let's just say 29 out of 30. And then for internal stuff, you don't really need an SLA, because you don't need to tell the analyst, "I will give you two times the data tomorrow," or whatever. It's just an SLO, which is an indicator of the quality target for the system.
The thing with data is that the SLIs can be much more complex than with traditional transactional systems. If you think about data, there are all sorts of other dimensions. Let me use an example that I have at work: hijacking. At the beginning of the podcast, we were saying that our systems need to protect user accounts from getting hijacked. That's happening all the time; there are spammers and people that want to steal your account. And we have systems that will detect the risk of someone trying to log in to an account. We have signals that tell us, for example: you are an American citizen logging in from your own town, from a device with an IP that we have already seen; that has very low risk. Now, the same login from an empty browser, on a device coming from an IP from, I don't know, South Africa, that's a completely different story.
So this allows us, for example, to ask you for a second factor, or whatever it is. The thing is that to determine these risks, there is a lot of data that we need to process to produce the data that will inform how risky a login is. And there you are going to have, for example, the freshness of the data. That's a service level indicator for the data: when was this data last generated? Because if it's very old, perhaps it's irrelevant. So you might have an SLI saying the data that we are using for these queries needs to be fresher than 24 hours, and can't be older than that. Then you have other things, for example retention, and not only for regulatory reasons: your company might just not want to keep data from customers that is older than 18 months, as an example, for logs or whatever it is, because you don't want to retain that. So you might have to have an SLO that says the data will never be stored for more than X.
And then you have derived data. We have a database here that is just the dump of a batch job that is processing data sources from many, many, many places. If you have a data warehouse or a data lake, with the very common ETL kind of processing, you might want to have indicators that the data you are consuming is correct enough, or has a level of quality that is going to be good enough. And then you start to unpack it, because it's just recursive: you might not want to generate data from data that is not good enough. And if there is a series of pipelines, it just goes bananas. It's just crazy.
And at Google, is there some tooling available for very easily setting up, for example, SLOs or SLIs on data specifically?
No. No. So perhaps we need to look into it more. But my team, at least, is looking at data with the traditional tooling. For example, the batch jobs that are processing data will export metrics about the data they processed, the success ratio, and so on and so forth. So we can extract the SLIs from there. Once you have the SLIs, the rest is easy: you have a number, and if it's over your SLO, then you're good. The hard part is arriving at the SLI.
Could you elaborate on that a little bit?
So the SLI at the end of the day is a summary of all the properties you want your system, be it data or an RPC server or whatever, or both, to fulfill. If you have an SLI that is freshness, that's easy: you can export a metric of the last timestamp of the data modification, and you will see it go up until the data is generated again. It's this kind of sawtooth thing. And then your SLO is going to be a line on top; if the metric crosses it, you are not meeting it.
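That freshness check can be sketched directly: export the last-modification timestamp, derive the age (the sawtooth), and compare it against the objective. A minimal sketch, assuming the 24-hour freshness objective mentioned earlier and made-up timestamps:

```python
# Sketch of the freshness SLI described above: the age of the data is a
# sawtooth that grows over time and resets whenever the data is regenerated;
# the SLO is the horizontal line the sawtooth must stay under.

FRESHNESS_SLO_SECONDS = 24 * 3600  # "fresher than 24 hours"


def freshness_sli(last_modified_ts: float, now: float) -> float:
    """Age of the data in seconds at the moment `now`."""
    return now - last_modified_ts


# Illustrative timestamps: the pipeline last wrote the table six hours ago.
now = 1_700_000_000.0
last_write = now - 6 * 3600

age = freshness_sli(last_write, now)
print(f"age = {age / 3600:.1f}h, within SLO: {age <= FRESHNESS_SLO_SECONDS}")
```

In practice `last_write` would come from the dataset's metadata or a metric exported by the batch job, and the comparison would run continuously in monitoring.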
If it's something like data quality, for example, you need to define that. What is data quality for your application? For example, in the hijacking database, you might want an SLI that defines data quality as the number of signals that you can collect for a sign-in. So if a user is signing in, and you can calculate the risk from IP addresses, all the metadata, and so on, you might say this is high-quality data, because I can collect data from many sources, so it's rich. If there is a sign-in from a new account or whatever, and you know nothing, your data quality is not great. So perhaps you really don't know if the login is risky or not, because the signals that you are getting are not good enough.
You have a lot less data to make a decision, ultimately, in that scenario. And that's indeed part of quality, as is the availability of all of the information that you need to produce an accurate estimate or an accurate outcome.
For data quality, one example that I think is easier to grasp is the search engine. Imagine Google Search. When you search for a topic, you are going to see a lot of features on the search results page: you see the results, there are some ads, there are boxes on the right that tell you more. If you look for a movie, it's going to show you, I don't know, the leading actors, the producers, all that stuff. When you're conducting that search, there may be some systems that are unavailable or that don't have data. You might still want to present the result to the user, with some of the things missing. So the data-quality SLI for that search, for example, will not be a hundred percent; it would be 60 or 70. If you keep collecting that over time, you're going to see how your system is providing data to the customers, and how quality is affected.
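That search-quality SLI can be sketched as the fraction of expected result features that the backends actually returned. The feature names here are invented for illustration; real search features and their weighting would of course differ:

```python
# Sketch of the search-result quality SLI described above: what fraction
# of the expected page features did the backends actually return?
# Feature names are made up for illustration.

EXPECTED_FEATURES = {"organic_results", "ads", "knowledge_panel", "cast_list"}


def quality_sli(returned_features: set) -> float:
    """Fraction of expected features present in this response."""
    return len(returned_features & EXPECTED_FEATURES) / len(EXPECTED_FEATURES)


# Two backends were unavailable for this query, so two features are missing.
score = quality_sli({"organic_results", "ads"})
print(f"quality SLI = {score:.0%}")
```

Averaged over many queries, this gives the "60 or 70 percent" quality curve Ramón describes, which you can then put an SLO line over like any other SLI.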
That makes a lot of sense. Very cool. Can we move into lineage as a topic, lineage and tracing problems? Maybe as a starting point I'll give a quick summary of what my experiences with lineage are and how I see it, and we can start from there. As I understand and see data lineage, it is ultimately about how the datasets that people use are connected to the broader ecosystem. That could be the jobs that produced the data, for example. It could be the quality controls that are set on your datasets, or the DAG of information could be one view. How long your queries take, a performance-related view, could be another form of data lineage. It could also be seeing who owns what data in an organization, and drilling down into, for example, how many published datasets an organization has.
So lineage is quite a broad topic, but it's very often used in the context of root cost analysis, especially by technical users who, for example, are responsible for the data transformation pipelines. So I see it as one of the tools, but it also can be tool for broader understanding of really what's going on and about slicing and dicing the information at hand to understand where did the problem actually originated and what is the downstream impact? Would that be, is that similar to how you see data lineage and this notion of traceability? Or do you have a different perspective on it?
No, I think it's all of the above. If you think about traceability, people who are listening to us and run some set of microservices will be familiar with this: whenever a microservice fails, the whole mesh of things it collaborates with has problems. And then you trace, and you see who is calling what and so on. There are things like OpenTelemetry and all that stuff that help you. With data it's harder, because it's not synchronous in the sense that you can just take a snapshot of the system doing your RPC at that moment. Data is generated at different moments in time and it evolves. A data store is not moving in lockstep with all the records that you have: you modify some, you add some, you delete some, some are static. So it's a more living kind of thing.
For traceability, there is one case that is very easy to understand, which is when you have derived data. You have a data store and some batch jobs processing it and dumping, say, a summary somewhere. Then another job comes along, takes that dump, does more processing, puts the result into another data store, and then you query that, or you generate other secondary data from it. Doing traceability of that is very hard. Think about it: if the first batch job has a small bug, or there is some data missing or whatever, the dump that it generates is incomplete. Then the second batch job reads that and simply doesn't process some data that should be there, but you don't know it's missing. And then there is a series of things happening, and at the end of the day you have a compounding effect where the end product might be nonsensical.
And that's when you see the outage in your sign-in product or whatever it is you are doing. And then you look: but this service is okay, it's querying the database fine. Why is the data garbled? And then it's: who generated that? You start unpacking it, and three hours into the outage you say, but why? And it's just a small batch job that went wrong somewhere down the road. So I think it's very nice to have a kind of data lineage where you can say: this data comes from this series of steps that happened to generate it. Perhaps there can even be a correlation analysis that tells you this step was not great, but at least you can see what happened to bring your data into use. And then there is all the governance stuff that you mentioned, which I think is interesting as well, because in a big company all these processes will be owned by different teams.
And we know that we end up shaping our systems after our org chart at the end of the day, per Conway's law, and the premise is that crossing these boundaries can be hard, because then you start having APIs that are designed or developed with different styles, and so on and so forth. We were mentioning retention before; I think that's a classic example for governance. If you delete some records in your original dataset because you don't want to keep them more than 18 months, or 16 or whatever it is, then all the derived data needs to catch up within that timeframe.
Right. So lineage becomes an important tool, I guess, on the one hand for finding the first point in time where the problem occurs, but possibly also for the downstream dependencies. In this case, where the data is further propagated and derived into new datasets, we need to make sure, as part of that data retention, that it gets deleted everywhere, including in all downstream processes that generated derived datasets.
Are there certain tools or methods that you're currently using at Google for this concept of lineage and traceability? I guess there's definitely some of that within the infrastructure monitoring and observability space, probably also for application monitoring and tracing, pages and their load times, et cetera. And then maybe for data: how data is evolving, what the DAGs look like, what datasets they produce, maybe what the timeliness SLOs are on when they should arrive. Is there tooling available across that whole spectrum as well?
To be honest, we need to differentiate here. For RPCs, I think we are far more advanced than with data. You can trace an RPC mostly everywhere within the company. There are things like, if people are familiar with it, there is a paper on it, and the open-source version is OpenTelemetry, and there are some implementations. Basically you can see a graph of where your traffic is flowing and so on. For data, we are using a similar approach, annotating data and so on, but this is harder because the problem is inherently different: the data is at rest. An RPC is something ephemeral at the end of the day, and propagating labels along with it is just a matter of putting them there; it can get expensive if you have many of them, but it's doable.
With data it's harder, because how are you going to annotate all this stuff? Are you going to annotate per row, for example? Because each row is different: rows might be created at different times, they may even have different importance. Or are you going to annotate per data store? So there's a trade-off between cost and the resolution that you are going to get. And then the annotations you are adding are data too, which may seem like an obvious thing to say, but where do you put them? Are you going to have a secondary system for all the data annotations? Are you going to store the annotations in the same system? Because if that system corrupts, why would it not corrupt the annotations as well? So it is not that trivial.
So the thing that we are doing, at least in my team, and in other parts of the company, because Google is huge and I'm sure someone is more advanced than me or my team, one thing that we are working on is an SLO maturity process. Which is like, look, let's start with the principled approach of SRE: let's define a service level indicator for this system. In this case, the system is data. And there are some properties that we are interested in that are able to tell us that the system is healthy.
There are the classic things: if we run a Spanner database, the availability of the Spanner database as a service, responding to RPCs. But there also has to be an SLI that tells us about the data quality. One place where we started was the classic integrity approach. That is: let's assume the database is an atomic thing. The first thing that we need to do is not lose it. So let's do the backups and restores, be able to bring up a new database from a backup, and make sure it's healthy.
The second step would be, okay, for this data we need to see, for example, what resolution of restores we can do. Can we do snapshots? Can we do point-in-time? More elaborate kinds of things. At least this is reactive, right? Because if we know the data is corrupt, at least we can go back to the point where it was still right. And then, what we are now starting to look into, and I think some people are working on it more, is managing the schema in real time. Which is like: oh, so you want to write some code for this binary, or you want to run an experiment, or you want to alter the schema. How can we validate that the change is going to be correct? And how can we see that the service level indicators for the system, as it is and after the change, are going to hold? So there is a lot of data testing, which seems weird. What's happening is: let's test the new schema with the old data, see how it goes, or the new binary.
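One way to picture "test the new schema with the old data": replay a sample of existing rows through the proposed schema and compare a validity SLI before and after, rejecting the change if quality would regress. The validity check, schemas, and sample rows below are a toy sketch, not any real Google tooling:

```python
def validates(row: dict, schema: dict) -> bool:
    """A row is valid if every schema field is present with the right type."""
    return all(isinstance(row.get(field), ftype) for field, ftype in schema.items())

def validity_sli(rows: list[dict], schema: dict) -> float:
    """Fraction of rows that satisfy the schema."""
    return sum(validates(r, schema) for r in rows) / len(rows)

old_schema = {"user_id": int, "email": str}
new_schema = {"user_id": int, "email": str, "region": str}  # proposed change

sample = [
    {"user_id": 1, "email": "a@example.com", "region": "eu"},
    {"user_id": 2, "email": "b@example.com"},                # no region yet
]

before = validity_sli(sample, old_schema)
after = validity_sli(sample, new_schema)
if after < before:
    # Half the existing rows would become invalid under the new schema.
    print(f"schema change rejected: SLI would drop {before:.2f} -> {after:.2f}")
```

The point is that the gate is expressed in the same SLI terms used to judge the live system, so "correct after the change" has a concrete meaning.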
I think there's indeed an entire world of data testing that is still quite new. Testing in software, of course, has been around for many, many years, decades probably now; since Test-Driven Development it kind of became a point of no return. I guess everyone is now building tests as they're building software. It's part of the software engineer's workflow, and it's a good practice that everyone knows. However, in data, that is not the case. There are companies, of course, and teams that have started to do some testing, but I do think there's still a lot of work to be done. For example, having test suites that allow you to very easily do an integration-type test, where you have new analytics code, so new ways of transforming the data that you're publishing or pushing to production. How can you make sure that it doesn't impact the quality or reliability of your SLIs at the end of the day? Do you feel the same? Is there still a lot of work to be done there?
Yep, 100%, yes. And it's kind of counterintuitive, because we see all these newspapers and so on saying that data is the new oil and all that stuff. However, if you look into how systems actually work, I think we are more advanced in testing the infrastructure and the processing than the data itself as a concept. And I think of data as something at rest: you have it, you keep it, you can use it and modify it. At the same time it is a big asset for companies; there are certain companies that live off the data that they have. What they are doing for their customers is transforming it, mining it, whatever their business is; without that data, they would be out of business. So it's a huge risk for companies not to be really structured and really rigorous with what they're doing with their own data.
Makes sense. I think from testing we can move to debuggability quite easily. Tests definitely help create observability: they produce test results and give information about what's going on in real time. How do you see the topic of debuggability within the SRE framework? And are there any specific highlights around it that you'd like to share with the audience?
You just mentioned observability, and I think that's the key word, because observability, at the end of the day, is a property of the system that allows you to see what is happening without having to know how the system works. That's key: if you already know how the system works, whenever there is an outcome that is not what you expected, you know where to go and look, which RPC, which microservice, and so on.
With data it's not like that, because with data you might know an abstraction of the data, which is the schema, but you don't know the actual data that you have at that moment. And I think observability for data is the bridge that we have to cross: giving me observability of what the actual data is. If you have a database with accounts, for example, we have billions of them, I'm not going to scan through them or run SQL queries to find outliers. We need a system that tells us, look: this service of yours is failing RPCs for this subset of accounts, or for accounts that have fields X, Y, and Z set to this value. That would be fantastic. If I had had that in the outage we were talking about before, I could have solved it in five minutes instead of a few hours. So that's the observability we need to have.
That makes a lot of sense. At the beginning you also mentioned the example where ultimately there was a subset of the data that had default values which were not accepted by a downstream service, or that created a corner-case scenario. And to come back to debuggability: at that point in time, the person investigating the incident ideally should have a kind of pre-baked view that says, "Hey, we've looked at all the areas where we have this new value, we compared it to all other records, and these are the segments of data in which you have the most failures or defects." So it already guides the user, or the SRE in this case, to better understand what the potential root cause might be.
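A toy version of such a pre-baked view could group failed calls by account attributes to surface the dominant slice, so the on-call sees "most failures are legacy accounts in EU" instead of scanning rows by hand. The request records and attribute names here are made up for illustration:

```python
from collections import Counter

def failing_slices(requests: list[dict]) -> Counter:
    """Count failed RPCs per (account_type, region) slice so the most
    affected segment surfaces first."""
    return Counter(
        (r["account_type"], r["region"]) for r in requests if r["failed"]
    )

requests = [
    {"account_type": "legacy", "region": "eu", "failed": True},
    {"account_type": "legacy", "region": "eu", "failed": True},
    {"account_type": "new", "region": "us", "failed": False},
]

(slice_, count), = failing_slices(requests).most_common(1)
print(f"most failures: account_type={slice_[0]}, region={slice_[1]} ({count})")
# most failures: account_type=legacy, region=eu (2)
```

In practice the slicing dimensions would come from the schema rather than being hard-coded, but the aggregation idea is the same.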
Yes, towards the root cause. And this is all reactive, but if you apply exactly the same technology, exactly the same approach, you can actually make it proactive. Whoever is changing the schema says, "Look, this is my commit that is going to go into the repo." And it's like, okay, fantastic, we are going to run these integration tests. We're going to bring up a database, for the accounts database or whatever it is we are doing, and fill it with the data we want; that can be a copy of production data if you can afford it. Sometimes you cannot, because it's production data or financial data or derived data or whatever. But it can tell you: look, your change, under these conditions, given sufficient data, is going to fail. And then you can say, oh, let me change that. And then you don't have the outage.
That makes sense. It's always better to prevent. I guess the reason we're not always there is simply that we don't necessarily have those kinds of checks and balances in place today. So we have an increased reliance on observability alone to get early warning signs of things that are incorrect. But ideally you get to a scenario in which, as you introduce changes, you have a lot more controls around them, so that you prevent a lot of problems from happening in the first place.
Yeah, the problem is that it's expensive. If you think about it, say I am going to bring up a new database on the fly for this integration test; there are certain problems that you are only going to catch if you have a database of the same size, or the same properties, as production. That can be very expensive. There is this saying that if you don't test in production, you're testing a lie, and I agree with that. However, you can't always have it. And if you're going to manage with smaller databases and so on, you need to accept how far you can get with that. In some cases you are going to be limited by scale: if there is a performance regression, for example, you are only going to find it in production. This is why, moving into production, you need to do rollouts in a slower way. Instead of changing the schema in one go, you change it for 1%, then 10%, and so on.
Gradual rollouts across your user base.
That makes sense. You're actually testing it as part of that workflow, testing it on a small number of people in the end, to limit the blast radius, I guess.
Yeah. It's a very common practice for features, for example. You roll out your feature code and it's just not used, and then there is a flag that you can push to your service that says, look, for this fraction of traffic, start calling it, and so on. I think with data we could do the same, but we have to add support to the databases for it to be easier to use. Because right now, if you want to do that with pre-made software, say MySQL or whatever it is, which is very common and I recommend, you really have to do weird things with schemas sometimes, and they are very time-consuming. It would be nice to have databases where you can say, "Look, this is a schema that I want to roll out, just do it for me," some system service that will test it against your test database and then move it into production slowly.
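The fractional rollout Ramón describes is often implemented with deterministic hashing: each account hashes to a stable bucket in [0, 1), and raising the fraction from 1% to 10% only adds accounts, never flips earlier ones back. A sketch, with the feature name and bucketing scheme chosen for illustration:

```python
import hashlib

def in_rollout(account_id: str, feature: str, fraction: float) -> bool:
    """Deterministically decide whether an account is in the rollout.
    The same account always lands in the same bucket for a given feature."""
    digest = hashlib.sha256(f"{feature}:{account_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction

# Roll the new schema path out to 1% of accounts, then 10%, then everyone.
accounts = [f"acct-{i}" for i in range(10_000)]
for fraction in (0.01, 0.10, 1.00):
    enrolled = sum(in_rollout(a, "new_schema", fraction) for a in accounts)
    print(f"fraction {fraction:.0%}: {enrolled} accounts enrolled")
```

Because the bucket is a pure function of the account and feature, no state needs to be stored to remember who is enrolled.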
That would be super helpful. Let's move to the topic of working as a team. We've talked about it a little earlier: the SRE is either part of the engineering team of a new product or service that you're building and launching, or can be more of a central competency center, involved in the same workflow but handing over information and training rather than being part of that team full time. Would that be an accurate summary of our discussion so far? And do you have any other pointers around what it means for SREs to work as a team?
As a team, we share the skill sets, the principles, the job ladder, and so on. But one thing that we share that other teams across the company, or in other places, don't is the on-call rotation. You are going to be supporting the services that are in production, but production changes all the time. Whenever I go on call, I'm going to get a different instance of production than the last time. That's something that binds people together well, because you have to hand off how the changes are going, what outages happened, and so on. Otherwise, if you remove that, we are an engineering team, like a software engineering team, whose feature is reliability, right? So it's a bit unfair to say that no one sees the features we build, because if you see our work, it basically means we screwed up.
But then the process is basically the same as for developing any software. We will say, "Hey, we need to make this change." Let's put a design doc together to see how we can do it and what the impact is, discuss the rollouts, and then the code needs to be written and put into production.
It varies a bit over time. At some point you are going to have a system that is getting a lot of changes, because there are releases going on and so on, so the system picks up some regressions in production, and you get a few more outages because it's changing faster than it should. Then you move to being a bit reactive for a while, resolving those outages, doing the postmortems, and so on. And once you finish those projects and your system is more stable, you can move back to other, proactive kinds of work. But it generally stays within limits: you will never spend more than 40-50% of your time, in the worst-case scenario, on on-call-related work, postmortems, and operations. So at a bare minimum, half of your time is going to be for proactive projects, right?
One of the interesting things I read in the book was the concept of toil, the manual intervention, and how that, as a metric, indicates how much time we're spending firefighting versus building tooling to become proactive. So there's this balance. And if you're firefighting too much, you're in a situation where you should really focus on not doing any new things and improving the reliability of the system as it is today, until the amount of toil or manual intervention reduces to an acceptable threshold, and only then can you go back to making new things.
That's correct, yes. That's what I mean by on-call operations, that's toil. I mean, our aspiration is to automate ourselves out of the job, so that there's no toil. Okay, that's a moonshot; in practice this is not happening. As I said, there are systems, products, that become mature, low-maintenance, and smooth, and at some point there's not even that much toil: incidents almost never recur, there are features going in, but not that many. At that point we detach from that system. There are others where you live in this cycle that we're discussing, where there is more toil for whatever reason. And if you don't control it, this is a problem, because you can end up with a system that burns teams out: there is so much toil that you can only do the toil.
It's a vicious cycle: there's so much toil that you don't have the time to do the projects that would prevent that toil from happening. At that point you need to prioritize the root cause of the toil, whatever it is, and try to fix it. Perhaps you need more staffing to do it. Perhaps you need to say, "Look, there are some things in the service that we are not going to fix right now; we are going to lower the SLO while we fix the toil," or whatever it is. So there are decisions that you need to make with your product and your developer teams to make sure the service goes back into a healthy state, toil-wise.
And then there is the other failure mode, which is: we removed every single piece of toil in this system. And it's like, why? You're just overachieving. It's like having a five-nines SLO for a system that only needs three. The impact of that is that you are not taking enough risk with the system, you're not getting enough velocity, so you are spending more than you need. Toil needs to be within some guardrails: you are always going to have some, but you don't need to have more than X. So the managers of the teams need to measure this all the time and have metrics to see how the health of the team and the system is going.
Yeah, that's clear. Is there some target that you set on that? How much time should an SRE, for example, spend building tools and systems to automate, versus being in the weeds firefighting?
It depends. What we measure is that toil should never be more than 40-50% of the team's time. If it goes above that, it means things are unhealthy. So my threshold for saying we need to stop here and actually focus on projects to remove the toil is 50%; it's not even a hundred. And then it depends on the system. There are systems that are less important than others, and I think that's fair to say. Some systems are very important, they are core to the company, and they have very high availability targets. Those are the ones that are going to take more of your attention. If there is automation to do, you're going to do it for them. If there are, I don't know, data lineage projects to do, you are likely going to do them for those systems first, because they are the consumers of that data, and so on and so forth.
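As a sketch of that 50% guardrail, one could track how a team's hours split between toil (on-call, postmortems, manual operations) and project work, and flag the team as unhealthy when the budget is exceeded. The time categories and numbers here are invented for illustration:

```python
TOIL_BUDGET = 0.50  # toil should never exceed half the team's time

def toil_fraction(hours: dict[str, float]) -> float:
    """Fraction of total logged hours spent on toil categories."""
    toil = sum(hours.get(k, 0.0) for k in ("oncall", "postmortems", "manual_ops"))
    total = sum(hours.values())
    return toil / total if total else 0.0

# One team's week: 24 of 40 hours went to toil.
week = {"oncall": 12, "postmortems": 4, "manual_ops": 8, "projects": 16}
frac = toil_fraction(week)
status = "unhealthy: stop feature work, fix the toil" if frac > TOIL_BUDGET else "healthy"
print(f"toil: {frac:.0%} -> {status}")  # toil: 60% -> unhealthy: stop feature work, fix the toil
```

Measured continuously, this is the signal a manager would use to decide when the team shifts from feature work to toil reduction.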
And there are others that are less interesting, where you might have to tolerate a bit more toil and say, "Look, this operation we will do by hand for this class of services. We don't do it often enough, because they release once a month or whatever it is, so it's okay; we find it easier or cheaper to do it by hand than to automate it for this class of systems." So it's a balance that you need to find. It's difficult to say in general; you need to know your system, you need to work with your product, you need to understand what your role in the business is at the end of the day. Which is a very managerial thing to say, but it's the only way to make this decision. Whether you like managers or not, that's what it is.
So to conclude: I think we've gone from how the practice first started at Google, and then of course Google became much, much bigger and much more complicated, with a lot of distributed systems and dependencies and different teams and organizations, so we covered that evolution. And as we went through it, we often drew the analogy between how things are in the infrastructure and application world and what the differences are with data, especially structured data. I'm wondering: do you see an opportunity or a need for more specific practices? For example, we could name them Data Reliability Engineering, as an overarching term for advancing and bringing these practices into the data space. Would you see an opportunity for that, or do you think SRE is already properly equipped to address most of it?
I think both, and not because I don't want to answer the question. I see an opportunity, especially for the tooling that we need, the technological implementation of it. If you look at the principles that evolved in SRE, I think those are very well suited to data management: you need to have SLIs and SLOs, you need to have automation, you need to have infrastructure as code and that class of stuff, and what we discussed about toil and so on. Those principles, I think, apply very well to data management. But the tools that we have, not only internally but also the open-source stuff and the companies offering products and services, there is an opportunity for adapting all of that to data. We just discussed defining SLIs, right? And we saw how different they are for data, but the concept behind them is exactly the same.
Yep. That makes sense.
Now, to round off our conversation, can you tell us a bit more about what it is like working at Google? It's part of the famous FAANG collective, of course, renowned for technology worldwide. What advice would you have for anyone wanting to work at Google?
That's a hard question. Well, of all of them I have only worked at Google, so this advice only works for Google. I must admit that I was very lucky throughout the whole process of getting into Google, for many, many reasons. But what I have seen very often is students at universities, and people working at other companies, thinking, "Google is just too big for me." And it's not. Yes, there are interviews.
There are all these blog posts and tweets and conversations about how the interviews go and how the process works and so on. My advice is: if you are working in technology and you like it, you will have experience, you will know stuff; build on that. You are a systems engineer? Fantastic, be a good one. And then when you apply to Google, the interviews, yes, they want to evaluate your fit for the company, but they are going to work with you on the stuff that you know. You don't need to know every single thing that happens in the world. Choose something that interests you, network engineering, machine learning, the data management we were speaking about, whatever it is, and be good at that. Love it, keep learning, and so on. I think then it's going to be easier. And there is no gated entry point, no requirement to come from a certain place or whatever; everyone is welcome at Google. So it is not too big for you. You can apply whenever you want.
So no need to spend dozens of hours on LeetCode necessarily, is what I'm hearing.
No. That's useful for a certain class of interviews, but it's more of a marathon than spending a few weeks cramming. You can obviously do it to be fresher for the interview, whatever. But what is interesting is your way of analyzing problems and that kind of thing, rather than knowing the whole Python standard library by heart when you get there.
Makes sense. And how do you see your role or career evolving?
It's hard to say. We said at the beginning that I have played many roles: I have been an intern, an IC, I have managed teams, and I am a TL again. What has been constant is that I have been working on systems, and the scope of these systems has been growing, not so much in terms of the number of people working on them, though yes, they are more complex and so on, but in terms of the importance of these services for the product. If you look at authentication, or at cloud computing these days, not even just GCP but all the cloud providers, they are societal kinds of systems. There are more and more people depending on them: hospitals, banks, things like that.
So I think the role of reliability is only going to become more important, because if we take down, I don't know, GCP or AWS or whatever, how many businesses, how many government agencies cannot work? When you think about that, it's like: oh, this is not people on Twitter talking about Gmail being down anymore; this is actually serious, so it's important. I think the role of reliability is going to move more towards proactive things, like risk management, data management, applying more systematic and rigorous processes and methods, so we can actually understand the systems better.
No, that makes total sense. There's definitely a GRC concern, or driver, I guess, is a better word, around all of the things that we're doing in reliability engineering. So it makes sense that this would fall under that broader scope or umbrella.
Ramón, thank you. I really enjoyed the conversation today. My key takeaways and learnings are definitely around what exactly the role of the site reliability engineer is: what it looks like within an organization, within Google specifically, what SREs do on a daily basis, and how they work with everyone. But also that SRE is evolving to include more scope around data, because data is a unique problem. It's a unique asset, a component that requires new tools and thinking to manage it better, to make it more debuggable, to become more proactive, and to handle incidents in that space. So thank you so much for taking the time today. And thanks, of course, to all of our listeners for joining in as well.
Thank you for your time. I think the conversation has been really, really good. Thinking about these things out loud actually helps put ideas together in a way you usually don't do by yourself. So it has been a nice time discussing all these topics.
That was a great conversation. When our peers share, it's an opportunity to listen; where others have tried, where others have succeeded, we can learn. Such is the power of trusted relationships and this data community. Join the journey and get connected. Follow Soda to be the first to know about new conversations as soon as they drop. We'll meet you back here soon at the Soda Podcast.