Data Dream Team

The Engineering Spark for Data with Holden Karau, Author and Open Source Engineer at Netflix

About this Episode

Holden Karau is best known for her work on Apache Spark™, her advocacy for open source software, and her creation and maintenance of a variety of related projects, including spark-testing-base. Get to know Holden before we look at the data dream team from the perspective of an individual contributor.

Holden is a true champion for open source and for becoming a better data engineer. Feel the power of her spark as you hear Holden’s perspective as an individual contributor in a data team, and enjoy as she shares her hobbies, practical advice and approaches to working in data.


Episode Transcript


Welcome to the Soda Podcast. We're talking about the Data Dream Team with Jesse Anderson. There's a new approach needed to align how the organization, the team, the people are structured and organized around data. New roles, shifted accountability, breaking silos, and forging new channels of collaboration. The lineup of guests is fantastic. We're excited for everyone to listen, learn, and like. Without further ado, here's your host, Jesse Anderson.


My guest today is Holden Karau. I've known Holden for quite some time, she's awesome. This is going to be an awesome podcast. With that said, there may be some salty language in this. It's going to be bleeped out. Because Holden's very emotional and very thoughtful about her use of open source and what's happening in the community, which will be awesome, because it's kind of a different sort of view into data teams. So, Holden, would you mind introducing yourself a little bit more?


No, for sure. So, I'm Holden. I've worked on data for far too long, at least seven years, maybe more than that. Currently, I'm at Netflix, which is a super awkward place to be as a trans person right now. So, all glory to the current employer who can do no wrong for as long as they remain current employer. And I haven't swore yet so that's a solid start.

Let's see here. Yeah, I've worked on Spark for several years. I've written a few books. I like writing books. It's sort of one of my hobbies. If you are ever looking for a hobby that pays really poorly but still does technically pay, I recommend writing. It's a lot of fun. I live in San Francisco with my wife and my dog, Professor Timbit. And I have my stuffed animal, Boo, and many other stuffed animals who are at home, but Boo is the one who travels with me.


Yeah. So, I've met Boo and I've met Timbit. People who aren't Canadian won't know what Timbit is.


Oh, yeah. So, timbits are small doughnut holes in Canada at the franchise, Tim Hortons.


So, Holden has a super interesting number of books that she's written, as she was just mentioning. So, Fast Data Processing with Spark, Learning Spark, High Performance Spark. I'm starting to see a pattern there about Spark. And then, the latest book is Kubeflow for Machine Learning. This gives Holden, I think, one of the more interesting, I guess, purviews and approaches to how she's dealt with the open source community. Could you elaborate more about what you think about this? Why are you the person to help management understand data?


I think I'm a really good person to help management understand data tools. So definitely, I have a lot of experience with Spark, as those are the Spark books. Kubeflow is also sort of a really interesting project, although one that, I think, unfortunately, didn't really live up to any of our hopes and dreams for a variety of reasons that are sort of more related to the implementation details of how open source works out at large companies. That's an adventure.

So, I've worked on a number of projects besides just Spark. And Kubeflow is neat because it ties together a bunch of different open source data and machine learning projects into sort of one umbrella where you can use them together. And it makes it easier for data scientists and data engineers to sort of build pipelines. And I think working with Kubeflow and certainly writing a book about Kubeflow was really informative for getting an understanding of how the different pieces interact.

And how certainly, I would say, like in a lot of life, right, the challenges are when things meet, right? Like at the margins where the pieces sort of try and come together but there ends up being a bit of friction. Just like say cogs in a machine. The difficult part isn't necessarily the cog, it's where it meets another cog. That's where the hard parts are, right?

TensorFlow is a great tool. But the challenging part is often getting the data into TensorFlow or getting the useful results out of TensorFlow. And so I think that perspective is really interesting. And it's something which, if I just did Spark, I think, would probably not be as much in my purview.

Another part is while I've written these books, I've also done developer advocacy as well as open source development. And developer advocacy is just a fancy word for taking someone else's money to fly around the world. And so, that's really fun. And if you're kind of burned out being a programmer, I really recommend trying developer advocacy out. It's a great way to waste a few $1,000 of someone else's money.

And it's also really lovely because you get to meet a whole bunch of different people. And I think it's a really great thing for tooling engineers to do, because you get to meet all of your users in person. And you also get to meet a bunch of people who tried to use your product, and hate it and aren't using it. And they'll tell you in person, very directly in a way that they will not even on the internet, right? Like on the internet, they'll cuss you out for making bad software.

But in person, like when they see you face to face, they'll realize that you're a person. And they'll take the time to actually clarify like, "Hey, this is the part where I felt really let down. This is what the impact is on me. This is how I couldn't do my job."

And so, I think, having the ability to have those conversations is something that I'm really thankful to both Google and IBM for, just giving me a corporate credit card to fly around the world. It was really chill. I like that.

Burned out of it. It was tiring. But it's cool. It's really good. And so, to tie it all back together, I think, the part why people should listen to me is I both work on the tools. I've spent a lot of time talking with and working with the people who try and use these data tools. And I also have spent a lot of time trying to integrate the different data tools, which is, in my opinion, the hard part of a lot of this data stuff, right? Like the individual pieces are pretty okay. It's trying to get everything to work together that sucks.

And so, I think, this perspective is really good. And I think, also, I have more of perhaps an individual contributor perspective than the other people you've maybe chatted with. My brief stint in management did not go well. I forgot who my team was and I tried to have a team meeting and I only invited half the team. IC for life.


IC is an individual contributor for those not in the know.


Oh yeah. Sorry. I've spent so much time in the Bay Area. Yeah, no, that was awkward. The part of the team that I invited showed up and was like, "Hey, where's the rest of the team?" And I was like, "Oh, yes, there are more of you. Could you go get them? Please?"


I was doing a MapReduce task and I was distributing you out. So, doing half now and half later. That's how ...


Yeah. Yeah. I mean, honestly, after that meeting, I went back to my boss and I was like, "Hey, this isn't going to work. You need to find a real manager. I did not remember who was on my team, let alone what they were doing. This isn't going to fly."

Thankfully, very understanding, he was like, "Yeah, it's okay. You go back to being an IC."


Being an IC, an individual contributor, and how there's been this rise of engineers, more specifically, data engineers. You and I have seen this rise. You and I have evangelized this rise. So, what insights can you give your peers who are either trying to transition from software engineering or those trying to just start off in data engineering?


I love that question. It's really awesome. And there's a few different things that I want to touch on here. The first one is actually just going to be something that's going to help me and the community and less of like helping the individual. But then, I'll definitely get some advice for the individual.

But this request is for all of the people who are new to being data engineers, regardless of what your background is. And I know that this is going to be super scary. Call us out when we're doing creative things that don't make sense, right? I believe there are so many different colloquial expressions for this. But essentially, we do a lot of really stupid. And it looks normal to us, because we've done it for so long, right, Jesse?

There are so many things that we do where it's just like, "Oh, yeah, I mean, this is kind of weird." You do it for like three years, and you're just like, "Oh, it's normal. Of course, everyone would know the distribution of the keys of their data. Like, who wouldn't?" And then, it takes someone coming from a SQL analytics background to be like, "Why the hell do I have to know the distribution of the keys in this table? That's stupid, right?"

And then it's just like I want those people to come and I want them to ask these questions, because that's how we're going to get better, right? Because otherwise, we're just going to keep making slightly better versions of things which, it's cool. But I think the newcomers are the people who have the chance to really cause us to reevaluate some of our core assumptions. And I'm really excited that you're going to join us. And I'm really excited for you to ask us really tough questions.
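As background on the key-distribution complaint above: in distributed engines, rows that share a key are commonly routed to the same worker for joins and grouping, so one dominant key can overload a single worker. A minimal sketch of checking for that is below; the function name `key_skew` and the sample customer IDs are made up for illustration and are not something discussed in the episode.

```python
from collections import Counter


def key_skew(keys):
    """Return (most_common_key, share_of_rows) for an iterable of keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    # most_common(1) gives a one-element list of (key, count) pairs.
    key, n = counts.most_common(1)[0]
    return key, n / total


# Hypothetical join keys: one customer id dominates the table.
keys = ["c1"] * 8 + ["c2", "c3"]
hot_key, share = key_skew(keys)
print(hot_key, share)  # c1 0.8
```

A share close to 1.0 suggests that most of the work for that table would land on whichever worker receives the hot key.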

And I know it's going to be really scary and hard. So you don't have to ask all of the tough questions. But please ask a few. Okay. So, regardless of whether or not you're actually going to do that, that's fine. There's a whole bunch of different things that I think can really help you get up to speed. There are some things that I like to keep in mind when I'm picking sort of what I'm going to work with.

The first one is this concept of community over code. And this is something that I'm stealing from the Apache Software Foundation, who probably stole it from someone else. But essentially, tools are important, software is important. But at the end of the day, if there's an amazing tool but everyone that works on it is. You know what? I'm okay having 10% less efficiency to work with people who are nice, because at the end of the day, the people who are nice are going to do better than the people who are. And they'll catch up eventually, right?

If you work with the nice people, you will probably advance much faster, right? I think most of us do much better when we're working around people who are encouraging us to be our best than when we're dealing with people who are telling us that we're not good enough, right. And certainly, there are times when I have not followed this. And I regret those decisions in my life. But I would encourage you to, in addition to looking at the performance of the tools, look at the community that's built around them and pick a community that you feel is welcoming to you.

Now, I love open source, right? And I think open source is amazing. And part of this is because I think a really great way to get better at something is to be involved in the making of the thing that is the level below what you're doing. So, if you are a data engineer, right, and you want to become a better data engineer, or you're switching from software engineering to data engineering, and you don't really understand what's going on in data engineering.

I think a great thing that you can do is code reviews in open source for the tools that you're going to be using. And I think this is really great. And I think that you bring a lot of value to those code reviews. First by asking the questions of sort of like, "Hey, why are we doing this?" But then also just like, honestly, in the data engineering world, we're kind of lazy when it comes to tests and we really shouldn't be. That's bitten us so many times.

And I think software engineers are more likely to call us out on some of those times when we're lazy. So please come and call us out in those code reviews. But also, just like in doing that, you're going to meet the people who are building the tools that you're using. You're going to understand how they're built. You're going to see how the different pieces interact. And of course, you can do this by writing code in the tools that you're using as well.

But I think code reviews give you the opportunity to see so much more of it than you would see if you were just picking up small starter tasks, right? Because starter tasks tend to be really narrow and very, very focused but code reviews can give you this broad view really quickly.

Another thing that I think is important to keep in mind, there's a whole bunch of tools. And just because someone is using some tools doesn't mean that they are the right tools for you. I think it's really important to take the time to evaluate what it is that you are building or what it is that you want to build. And what scale you're operating at and what scale you reasonably expect to be operating at. And pick something that matches that scale, right?

I remember meeting with people who had data that could fit on an old style floppy disk that wanted to use Spark. I'm like, "No." If you can load it in Excel, you probably don't need a distributed processing engine. You can run linear regression on your phone for that data set. And so, I think that's something that's important to think about. Of course, I'm not saying if what you want to do is you want to learn Spark and you want your employer to pay for you to learn Spark, hell yeah, party on. That's cool.
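To illustrate how little machinery a small dataset needs, here is a minimal sketch of ordinary least-squares linear regression in plain Python with no libraries. The function name `fit_line` and the sample numbers are made up for illustration; this is not code from the episode.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

Each call makes a few linear passes over the data, so a dataset that opens comfortably in a spreadsheet fits on one machine with room to spare.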

I think there's this really great saying from many, many years ago about the importance of keeping the fun in software. And I think that's true. Sometimes we're not going to be the most optimal people because we want to learn something and do something new. And I'm not saying you have to stick to the tools you know. It's totally cool to pick a project because you just want to learn it, but do it with that intention, right?

Be intentional about what you're doing if you pick something suboptimal just for the sake of learning. Be honest with yourself, you don't have to be honest with your boss, I don't care. I'm not in management. Party on.


So, thinking about those teams, and you kind of alluded to making sure that they've chosen the right things for the right reasons, how would you recommend teams be organized to get the most value from their data?


So, I love that. I think it's really important to have the right team organization to get the most value from data. And I think that the times when I've seen us really be able to deliver great insights all the way through to actually productionizing them, is the time when we have integrated teams. And this is because the data scientists need really great exploratory tools for data. But they also have to be tools, which are able to be productionized in the long run, right?

And they have to be structured in such a way that data engineers are going to be able to understand what the hell the data scientists were doing. It's really easy for someone to go off and do a whole bunch of really cool things on their laptop and just not have enough information for the data engineers to be able to productionize it. And it's also really easy, when the data engineers are off in their own little separate world, for the data to be too difficult to get access to, or for the exploratory tools to just not be there.

And I think having data engineers and data scientists work together is really key to delivering successful data products. I think, sometimes it certainly makes sense to have platform teams at a certain scale, right? I think normally that scale is at the point where you would have been considering an external vendor. I often end up working on platform teams. But I think it's important to keep the platform teams connected to their users. And I think there's a few different ways to do this.

Either the platform teams need to use their own platform or the platform teams need to be in a place where they're supporting the users, either through rotations or on call. One of the things I really think that would be great, that I haven't seen people do nearly as much as I would like, is at Google, you'd go through and you'd be a site reliability engineer for a while, if you were coming from software engineering.

And I think that one of the things I would really like to see more data platform teams do is a similar concept, where people from the platform team take rotations on to the data product teams and work there for a few months, right? Because on call answering tickets, it's also really easy to get into an us versus them mentality with stuff like that, right? But if you've occasionally worked with these people for a month, say every year, you're not going to think that they're stupid, right? You're going to be like, "Oh, yeah, no, that was super painful when I was trying to do this, too. I get you."

And on the flip side, they'll understand a lot more where you're coming from having worked alongside you and you'll be able to keep that trust. I think this is really important, especially as we shift to even more and more remote work, which I love. I love it. But I think it's important to build that trust and keep those teams as closely coupled as possible once you get to the scale where you do have to have separate platform teams.


I love the experience you have with large companies. That's unique. What sort of advice would you have for new data team managers?


Awesome. So first of all, I want to say congratulations, if you're a new manager, that's awesome. I'm really stoked that you're here. If you're a manager coming to the data space, that's cool. I love experienced managers. This is really great. So first off, for the people who are maybe coming from an engineering or data science background, and are now managers, what I want to tell you is what got you here won't get you there.

And I want you to take a break and I want you to focus on being an excellent people manager. I'm super stoked that you've got these technical chops. That's awesome. But that does not make you a good manager. In fact, it's going to be really easy to fall back on those technical chops and say forget half of your team exists. You probably won't be nearly as bad as I was.

But it's really easy to sort of fall back on the things that we know that we're good at. I think if you want to be a successful manager, there are great management books. There's really good courses out there and you should take the time and focus on being a really great people manager, building the empathy with your users and your team, building the empathy between yourself and your team like understanding their pain, understanding what it is that gets them motivated.

For people who are experienced managers and are coming to the data space, I'm super stoked that you're here. I always love having managers with experience. Data teams are a little weirder. If you're coming from managing, maybe a non-technical team, that could be a little bit of an adjustment. I want to tell another story, and it's about a company with a no pants Friday. I'm not saying you need to implement no pants Friday, that was frankly a little weird. But essentially, your engineers are probably going to be a little weird.

And I think it helps to think especially of data engineers and data scientists as more creative type people than you might be used to thinking of your engineers, right? Like giving them that freedom to be creative and understanding that you'll probably get some really interesting things done, but you might not get exactly what it was you set out to get done. But at the end of the day, right, if you're delivering value to the business, and that's more than you're spending by a substantial enough amount, that's probably good. And it's okay that you maybe didn't hit your top three goals, it's the things that you hit that knocked it out of the park.

Another important thing that I've seen happen with data teams and this tends to happen more with platform teams, is take a look around, take a look at your on-call load, make sure that you aren't just fighting fires. If what you're doing all the time is just keeping the lights on, yeah, it's important, of course. But probably, it's time to take a step back and ask yourself how you got to this position and how you can get out of it so you can really innovate, right?

Because at the end of the day, if all we're doing is keeping things up, honestly, that's a job that we can outsource. Another one is to be suspicious of any benchmark including those that I make and those from your own team. It's not that the people making benchmarks are trying to lie to you, it's just that benchmarks give you a very narrow worldview, by nature. And so, you might get really excited about a project based on a benchmark and you're going to find out, it doesn't work out.

I think incremental projects, when you're going for performance improvements, are often a much better way than huge investments. Another one is, if your team members aren't in the right place, help them find the right place. One of the best managers I ever had was only for a few months, because while I was working for him, I was working with this other partner team. And I was so much more interested in the work that they were doing. I really loved supporting the work that they were doing.

And he noticed this and he was like, "Hey, you seem really interested in working with this other team. Do you want to go and join that other team? Is that what you're interested in?" I'm like, "Oh, yeah, that is actually what I would love to do. But I totally get it. Don't worry, I'm here. I can work here for a while and then I'll try and transfer." And he was just like, "No, it's cool. I know the other manager. Let's all talk about coming up with a transition plan, if that's the work that you really want to do." And I moved to that new team. And that was a really, really great experience.

The business did better overall. And I think both of those teams did better. Because when you have that, you have someone who has the background of your team on another team and they will be able to bring that understanding of what it is you've been doing to the people who are working with you. And the alternative, of course, is like, frankly, if we look around in the job market these days, if someone's kind of bored, they're probably not going to stay in your company. They're just going to go somewhere else, right?

Recruiting is really, really expensive. And that other manager is going to owe you a favor if you give them one of your team members. The TLDR is I'm super stoked you're here, data is important. But I really want you to be a good manager first, focus on that.


Awesome. So you have a ton of experience and evangelism. So, you've talked to a lot of people. You've taught them and talked about skills. So how do we need to upskill and train our individuals? What do you think we can do and to do it so that it's best for their growth and personal development?


So, I'm a little biased here. And I can certainly give you a perspective that's going to come more from how I want to grow and how I've grown historically. But I think it's really important to take the time and talk with people about how they want to grow. And I think we've actually seen this a lot in education, right, with the things called learning styles. Everyone learns differently.

So, I love learning by reading and writing books. That is not for everyone. But if that is for you, like hell yeah, I'm super stoked you're here. The more great technical documentation we have, the better the world is going to be. That's certainly not for everyone. Most people don't wake up with a passion to write documentation when you ask them.

I think something that helps most people grow is mentorship, both in being mentors and being mentored. I would not be where I am today, if not for the mentors that I've had in my career. I think that these mentors can take the form of technical mentors. It's a lot easier to learn technical skills from videos and books and just like trying things iteratively. But it's really hard to learn the softer, fuzzier parts by trying things iteratively.

For one thing, if you, your coworker, can't do a git revert, right? That's just not something we got. And so I think having people to learn from is really important. And so, if you've got a team that has exclusively senior people, okay, that's cool. I think definitely, maybe see if some of them want to mentor some junior people. And for the love of God, hire some junior people and your senior people will be happier. And then as they leave, because that's just going to be what happens, you'll have new senior people who'll replace them in the form of the junior people you have mentored.

On the flip side, if you've got a whole bunch of really great junior people, that's awesome. Those are some really great teams. But it's probably worth it to try and invest in getting at least one senior person. And if you can't get a senior person to join the team permanently, try and get a consultant or someone to come in who can help people grow in an area that they're interested in. And of course, I'm biased, I always think code reviews are a good way to grow.

And I think this refers to both open source code reviews and code reviews inside of your company. And I think one of the things that I really appreciated about my time at Google, the first time when I was a software engineer, was just how different the code review mentality was from when I've been in the startup. It was totally okay to take a day to review someone else's code, because I didn't understand what the hell they were doing, right?

And I took that time to go through and understand all the different things that that person was touching and then ask clarifying questions to them. And at the end of the day, that code was better. I knew more about the systems that we were doing. And the person who wrote it, they were actually like, "Oh, yeah, I did this because that used to be the way it is. But that's a good point, Bigtable doesn't actually behave in that way anymore. We can change this thing here."

At a lot of smaller companies, I see code reviews don't get the same ... Like there's a lot of, like, LGTM or approved. Those can be nice from time to time. But I think making a culture of review and having that feedback of code reviews matters. It's not just engineers who need code reviews, looking at someone else's data science code is pretty important and a great way to learn.


Since you have such a wide ranging experience at these big companies, the big brand names, the marquee names, Apple, Google, IBM, you know that there's this fixation in the industry about what does Apple do? We should just copy that. So, what do you think about that now that you've been on both sides of the equation?


Oh, god, please, for the love of your operations team, for the love of your budget, for the love of just good engineering practices, do not copy what Google is doing just because they're doing it. I love Bigtable. Don't get me wrong, it is an amazing tool. And it is the wrong solution for almost everyone besides Google, right?

I think there are some great practices worth copying like code reviews and there are certainly some cultural elements that I want to copy from my different employers that I try and remember and bring with me. But, frankly, you are probably not at the scale of Google data-wise, right? And if you are, that's cool. But you're probably not there, right? And you probably don't want to structure yourself in the same way. You don't need to have a multi DC failover if you can take the occasional downtime, right?

If you're working in healthcare, then yeah, okay, you need to focus the hell on reliability, right? But on the other hand, Google certainly made some trade-offs, which if I was in the healthcare space, I wouldn't do, because I enjoy not killing people, right? And I think it's really important to remember that these big marquee names are working under a different set of constraints than you are.

One of the things that I would recommend copying that I've sort of seen come from all of these big marquee names, is the idea that if you're tied to a single vendor, you're. And I'm not saying never buy a solution from a single vendor, right? There are situations under which like, yeah, okay, it really doesn't make sense to have an alternate vendor for everything. But I think this is part of what makes open source so important, is that if you don't have an alternative, the people selling it to you, they're going to figure that out and they're going to put the screws to you when contract renewal comes.

And I've definitely seen that happen a few times. What else is worth copying from the big companies? Oh, yeah, you can definitely pay people a lot of money and I'm definitely not biased as someone who works for a living. But in all seriousness, if you want top talent, you really do need to take a look at what the compensation is that's happening out there. And you'll probably have to actually copy it. And I know, it's going to kind of suck for your bottom line. But pay them pretty well because otherwise, they're just going to go to the big companies once they have the visible attributes that the large companies are looking for.

Yeah, don't run Bigtable. Don't just use distributed data tools for the sake of using distributed data tools. And I'm saying this to you as someone who works primarily on distributed data tools, right? When the Ford salesperson says like, "Consider not buying a Ford," you should probably listen to them. I'm not saying, "Never do it." If you look at it and you're like, "Yeah, if I run this on a single machine, it's going to take all day. And if I run it in a distributed environment, I can get this done in like under an hour. And my data team can iterate." Then, yes, make that decision. Use a distributed tool. But make these decisions for the right reason, not just because they're what Google did.


So, kind of continue on with that thought about machine learning and distributed systems. What do you think the importance of machine learning is, there?


So, are we talking about machine learning on big data, like sort of the meta machine learning on our machine learning? Because I think that's really cool. Or are we just talking about the importance of machine learning for businesses in general?


It'd be management friendly, pointy haired boss, hey, pointy haired boss, why should we be thinking about machine learning?


Machine learning is really interesting, because we can use it to sort of productionize some of the elements of what a data scientist might be doing, right? If we think about it, like building things like fraud models are a great example of machine learning, right? And you can use this to augment the work that is being done by people. If people are making decisions with data, I think, a really good thing to do is to build machine learning tools to complement them. I think, human in the loop machine learning tools are really great and they're really important.

You can help your teams, which are probably frankly, at this point, pretty overworked, right? If you're looking at a part of your business where there's a bottleneck, I think that's a great place to look and think like, "Hey, could I build an ML tool to help them, right, like, they're probably doing something with data." Realistically, that's most of what we do these days. If it's like making a pricing estimate, if it's figuring out if a customer is worth investing time into, if it's prospecting, right?

All of these things can be supported by machine learning. And so, I think, maybe a little bit different. I don't think about machine learning as sort of replacing the people that you've gotten in your organization. But I think about it as a way to help them be better and faster at their job. If you do it right it's like switching from doing things by hand to doing things with a tool, right? And that's a super big change. A really important thing is to have the people whose job you're intending to make go faster involved in this process, right? Because there's so much domain expertise up in their heads that just isn't going to be represented even in the data of the historic decisions, right?

If you ask them why they made a decision, they might tell you, "Hey, I looked at this, but then I went over here and I looked at this other thing." And that going over there and looking at the other thing, that's not something that your machine learning model is going to immediately be able to figure out. By knowing that there are these different features that this person is combining, or these different data sources, your data scientists are going to be able to build much better models, because they're going to be able to encode a lot of the different parts of these people's work and help them make their decisions quickly.
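
A minimal sketch of the human-in-the-loop idea described above, offered as an illustration and not as anything discussed in the episode. Every name and number here is hypothetical: `fraud_score` is a toy stand-in for a trained model, the two thresholds are arbitrary, and `shipping_matches_billing` stands for the kind of "other thing" a domain expert says they check, encoded as a feature.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
AUTO_APPROVE_BELOW = 0.2
AUTO_FLAG_ABOVE = 0.9


@dataclass
class Transaction:
    amount: float
    account_age_days: int
    # A feature suggested by the people who review cases by hand.
    shipping_matches_billing: bool


def fraud_score(tx: Transaction) -> float:
    """Toy stand-in for a trained model: returns a score in [0, 1]."""
    score = 0.0
    if tx.amount > 1000:
        score += 0.5
    if tx.account_age_days < 30:
        score += 0.3
    if not tx.shipping_matches_billing:
        score += 0.2
    return min(score, 1.0)


def route(tx: Transaction) -> str:
    """Handle clear cases automatically; send uncertain ones to a person."""
    score = fraud_score(tx)
    if score < AUTO_APPROVE_BELOW:
        return "approve"
    if score > AUTO_FLAG_ABOVE:
        return "flag"
    return "human_review"
```

The design choice is that the model only decides the cases it scores as clear-cut, and everything in between goes to a reviewer, so the tool speeds people up without replacing their judgment.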


So, you've written a lot of books, how would you choose what you're going to write about? I know this is kind of like choosing your favorite furry, but can you choose?


I think there's a few different things that I like writing about more. The things that I think are the most interesting are the ones, for me, where after I write it, I look at it and I'm like, "You know what, I think this is going to help people do their job. I think this is going to help people accomplish their tasks." And those are normally the ones where I feel happier, right? Where I'm just like, "Yeah, this is going to be really useful to people."

Because at the end of the day, at least I write for the sake of it being read and I write for the sake of it helping people. Because I love teaching people, but there's only so many hours in the day. And for me, writing is a way of scaling that. And the ones which I'm happiest about are the ones where I think they're going to have the biggest impact on making the world a less [bleep] place.


That's worthy of Reading Rainbow or Mr. Rogers' Neighborhood. Welcome to my neighborhood.




Well, what's a big project or groundbreaking initiative that has the Holden touch, but nobody knows how important it is?


I would jump back to my early career at Amazon. Because honestly, since then, I've been pretty good at least at being able to tell people about most of my work, because I think that's a skill that I picked up later in my life.

Early in my time at Amazon, we built this tool to recommend categories for search terms. I'm assuming the code itself is dead, but the concept is still around, and you can still see it on the website. To me, the part that made that really cool was ... So, okay, okay, let me back up. When I think of the Holden touch, it's not, like, necessarily a good thing. Yeah? For Canadians out there, I think of The Red Green Show (I guess also parts of America, with PBS). For our European and broader audience friends, I don't know if duct tape has the same connotation.

But essentially, there's a bunch of distinct things and we're going to put them together and we're going to make them work and it's going to be awesome, but it might fall apart while we're writing it. And I think a lot of the things that I build are like that. And eventually, I make them better. But the first version is very much like, yeah, we're just going to glue that together and it's going to be a party. And that's because you don't know if it's going to be valuable until you've glued it together.

But honestly, also, it's just hella fun. And so building that system was just so much fun and no one knows about it. And I've got some hilarious stories from that involving what would be called health and personal care items and how to make age-appropriate suggestions. Yeah. So much fun building stuff like that and I really like it. But otherwise, honestly, most of my work is on GitHub and I give conference talks about it. So it's not a secret, right? The Amazon stuff is the only stuff that people don't know about.


Yeah, in fact, when you're talking about Amazon, I don't think I knew that you worked at Amazon before.


Amazon, I did an internship there. And that was my first job out of university. I had no idea about how communities worked back then. And so, I didn't talk to anyone outside of Amazon about it. But I definitely learned a lot working there. One of the things that I learned is that if you write Perl code at a company that uses Perl, they might try and put it in production and that's a little terrifying. Because oh god, that code should not have made it to production but it's a party.


So this is a question that I don't think I ever asked you. How did you get such good jobs right out of college?


I'm not sure I would call Amazon a good job. And no offense to my manager back then, you rock, like, hell yes. But I was on call for the entirety of Q4, with a physical Motorola pager. And to this day, the sound of a Motorola pager will give me, depending on how much sleep I've had, mild anxiety, like, I should go and find my Xanax sort of situation.

There's a few different things that helped me do that. One of them was back then, I was presenting as a dude, right. And there's just this sort of assumed competence of men, which was really convenient. The simpler times. Also, I went to the University of Waterloo. I'm very fortunate about that, honestly, that was super touch and go. I got internships in the States, which was really great. I think, for people who are in school, pick internships at companies that you think you might want to work at.

For one thing, it'll let you understand if you actually want to work there. And for another thing, it lets them understand if they actually want to hire you in a way that a four-hour interview for a college new hire doesn't. There were also some open source components to it, I would say. And I really like playing around with new things in my spare time, right? Like, one of my projects ended up on Slashdot, because a classmate of mine convinced me that Scheme was a pretty fun programming language to write websites in. And I believed him.

And in retrospect, that was not the smartest decision that I made. But it was really cool and I learned about this different way of handling state tracking for websites, which was quite fascinating. And then, it ended up on Slashdot and I spent like the following 48 hours keeping this thing from falling over, using just all the computers that I could get my questionable SSH credential to let me log into and a very sketchy load balancer that I found.

And I think honestly, to this day, I have that sort of experimentation stuff going on. I have a personal rack now, instead of using a bunch of questionable servers that I may or may not have legitimate access to. It's down in Fremont, California. I love the people there at Fremont. It's Hurricane Electric. They're great folk. But it does cost a bunch of money to do that. But it lets me live out my high school dreams of being a systems administrator.

So I think honestly ...


I think you might be the only person with a rack that isn't doing Bitcoin mining.


No, no. Okay. So, there is one of the ex-Google people who either does Tailscale or Kubernetes now, I get them mixed up. The Tailscale folks, amazing. It's a VPN solution. That's the only one that I don't hate. And Kubernetes you're very likely aware of, but it's a great way of managing cluster resources and computers. And anyways, he wants to get a rack, but he's negotiating with his spouse about spending that much money on something that will produce no economic value in return.

And there's an entire community of people there. It's called Aisle 6, because that's the aisle in Fremont where they put all of the hobbyists, essentially. None of us are mining Bitcoin. One of the people had their rack serving as a backup for the Internet Archive because they're really interested in trying out new storage layers.

There's another one where the person is a hobbyist day trader. They have a day job but they try writing algorithmic software to do trades for them. I think they normally lose money; occasionally, maybe they don't, right? And it's all these people who are like, "You know what, I have a job now and that means I can waste money on computers at scale."

And part of this is actually because Fremont, California is a really terrible place to do Bitcoin mining. And that's because bandwidth is really cheap but power is really expensive. And to do Bitcoin mining, you really don't give a [bleep] about bandwidth, but you do give a [bleep] about power cost.


Well, speaking of Bitcoin, what else is exciting to you in the technology world?


Okay, to be clear, Bitcoin is not exciting to me in the technology world, just to be clear. Although Bitcoin did help me pay off my student loans, which were admittedly a lot smaller, because Canada. So, in the software world, there's a whole bunch of things that I'm really excited about. One of the things that I'm excited about is to see the batch scheduling evolution inside of Kubernetes. And this is super esoteric.

And so, if you're just like, "What are you talking about Holden?" So essentially, a lot of big data stuff has been run on top of a system called YARN. Because of how YARN is structured, it makes it difficult for data scientists to try out different tools in a way where they're not going to step on each other's toes, right? And as I was talking about, one of the things that I think is super important is making it easy for data scientists to do different kinds of experiments.

And I think running our stuff on Kubernetes makes it easier for data scientists to run these experiments, but also do it in such a way that I can take their work and I can productionize it. And part of the challenge with productionizing that work right now is that in Kubernetes, scheduling huge numbers of batch containers, which is when you want to turn something into a repeatable fast job, is not ideal.

And there's a bunch of different technical reasons why this is the case. But what I'm excited about is that there's a few different approaches being taken to try and solve this. And some of them are finally starting to talk to each other. Not a lot, but a little bit. I suspect we'll probably end up with a situation where it's like, "We have two competing standards. You know what we really need? A standard that unifies them." And then we'll end up with three competing standards.

But I think at the end of the day, we will get something fun in this space. And I think it'll really unlock some really cool potential there. So, I'm pretty excited about that. I'm also really excited about Dask on Ray, for similar reasons. I've worked on Spark for so long that I'm just used to seeing Java stack traces show up when I write Python code. And so, I'm almost blind to them now.

But every time I'm interacting with my users, I'm reminded that, "Oh, dear god, this stuff is impossible to debug for people who just haven't spent the past seven years working on this." And I'm excited about the new tools coming into the space that might make it possible so that we don't have to depend on the JVM as much.


What do you never compromise on?


Yeah, this is a rough question. Yeah, as a trans person at Netflix, there's a lot of things that I've compromised on in the past two weeks. I would say probably my family. My chosen family is the most important thing to me in this world. If it's going to hurt my wife or my dog or my girlfriends, I'm not down for that.

But honestly, everything else has a price list, which kind of sucks but, yay, capitalism.


For what it's worth, I told Tara this is going to be something that goes down in history. We'll look back on this as what seems to be a minor event but I think it's actually going to be quite historical.


You think so? I don't know. How do I say this in such a way that I get to keep my job? I do think that there are "a lot of opportunities for learning," would be, I think, the business-appropriate phrase here. And from what I can tell, these opportunities are not being taken up in the near term.

And unfortunately, they're the same opportunities that they had a few years ago that they do not appear to have learnt from. But I think that we are seeing a slow cultural shift towards more inclusion and I hope that continues. And I think also part of what we're seeing is the backlash to that cultural shift of inclusion. And I hope that dies out.


I have one last question. If there's one thing you want our listeners to remember from this conversation, what would that be?


Can I cheat and give you two?




Hell yeah. Okay, cool. So, the first one is that what we do has real-world impact. You're going to [bleep] sometimes, and that's life, but own it and try and fix it. And number two is the voices in the room matter. If your data team all looks like you, or your vision of diversity is UC Berkeley and Stanford, that's probably not going to be a winning team. You can do better. And I believe in you and you will do better.


Thank you, Holden. This has been awesome.


Thanks for having me. It's been so much fun.


Another great story, another perspective shared on data, and the tools, technologies, methodologies, and people that use it every day. I loved it. It was informative, refreshing, and just the right dose of inspiration. Remember to check for additional resources and more great episodes. We’ll meet you back here soon at the Soda Podcast.

Nov 25, 2021
S1 Ep12
The Engineering Spark for Data with Holden Karau, Author and Open Source Engineer at Netflix