Lacking a better name .. Hiring for SRE

Mar 25, 2018

In our container management platform (Titus), we have four big areas of focus across our team. First, we have what most people call “scheduling”, which is actually our entire control plane as well as scheduling. Next, we have “container execution”, which extends through Docker, advanced isolation, and container process models all the way into our custom kernel. This area is basically everything that runs on our tens of thousands of agents. Third, we have “integrations”, where we work to integrate Titus into the entirety of the Netflix cloud platform as well as AWS EC2. Some examples are ensuring that CI/CD, telemetry, and AWS security work with containers as well as they do for VMs. Finally, we have an area that I sometimes call SRE, though I admit it doesn’t fall perfectly in line with the SRE title as defined by Google.

SRE in our team isn’t overseen or funded by a separate organization. We don’t have strict feature vs. reliability budgets. SRE in our team is built in. That means that every quarter we look at reliability work alongside feature work and size each appropriately. Based on our weekly on-call handover, we decide to pay down newly found reliability tech debt as needed. Our entire team, whether developers working on scheduling, container execution, integrations, or SRE, is on call equally. The team members who work on SRE-focused items are distributed systems software engineers (SWEs). They need to be able to work across our front end (Java) and back end (Go) as well as our operational tooling (a mix of Java, Python, and Ruby).

What then do we work on in the SRE area? I usually talk about three major things.

Operational Enablement

Titus runs containers on tens of thousands of virtual machines across seven regionally isolated stacks. Titus is a distributed system with many complex, interrelated tiers. Titus runs over 1000 different user applications. All of this makes operating Titus a challenge. Consider what it takes to roll updates out to this many virtual machines with confidence. It’s not an easy task. If we decided to solve this with humans only, we’d be in a bad place. Instead, we handle this with operational tools that make it easier. While we’re not there yet, we’d love to see an automated process that knows what end state is needed for a change across the world and slowly rolls it out, watching signals across various applications and knowing when to automatically take the next step or involve a human for manual judgement. We have a goal of keeping our team size small while Titus continues to grow in adoption, something that can only be achieved with this operational enablement.
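To make the idea of an automated, signal-driven rollout a bit more concrete, here is a minimal sketch in Python. It is not how Titus tooling works today. The stage names, the `Signal` fields, and the `apply_change` and `read_signal` callbacks are all hypothetical, made up for illustration. The sketch only shows the shape of the loop: apply the change to one stage, read a health signal, and either continue automatically or pause for a human.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Outcome(Enum):
    COMPLETED = "completed"
    PAUSED_FOR_HUMAN = "paused_for_human"


@dataclass
class Signal:
    """Hypothetical health summary for one rollout stage."""
    healthy: bool    # did application-level signals look good after the change?
    confident: bool  # False when the data is too sparse or noisy to decide


def staged_rollout(
    stages: List[str],
    apply_change: Callable[[str], None],
    read_signal: Callable[[str], Signal],
) -> Tuple[Outcome, List[str]]:
    """Roll a change out one stage at a time.

    Continues automatically only while each stage is clearly healthy.
    Returns the outcome and the list of stages the change was applied to.
    """
    applied: List[str] = []
    for stage in stages:
        apply_change(stage)
        applied.append(stage)
        signal = read_signal(stage)
        if not (signal.healthy and signal.confident):
            # Anything short of a clear "healthy" goes to manual judgement.
            return Outcome.PAUSED_FOR_HUMAN, applied
    return Outcome.COMPLETED, applied


# Usage with made-up stages and a fake signal source.
stages = ["canary", "region-a", "region-b"]
outcome, applied = staged_rollout(
    stages,
    apply_change=lambda stage: None,  # stand-in for the real update step
    read_signal=lambda stage: Signal(healthy=stage != "region-b", confident=True),
)
print(outcome.value, applied)
```

In this sketch an unhealthy or ambiguous signal always pauses the rollout rather than rolling back. That is a deliberate simplification: deciding between rollback and continuation is exactly the kind of call the paragraph above leaves to manual judgement.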

Systemic Reliability

Everyone in the team considers reliability when working on their components. However, we need to look at the reliability of Titus as a whole. How do the components interact? What are the biggest known chinks in the armor? Could we break the system if we knew where to inject bad data? Could users of the system do the same and, if so, how do we fix these problems? Chaos Monkey testing is good, but what would an informed Chaos Monkey do? Engineers who focus on SRE in Titus are tasked not only with looking for these holes, but also with recoding aspects across the entire system to close them.

Scale and Performance

Titus adoption continues to grow at an amazing clip. It took us one year to grow from one million containers launched per week to two million. Only a month later we were at two and a quarter million. Our users are challenging us on launch latency, application cluster sizes, image sizes, heterogeneity of applications, and many other areas of performance.

We have to stay ahead of this amazing organic growth. The work on performance isn’t just about benchmarking and testing Titus at higher levels of scale. It is about working on the architectural changes required to keep adding orders of magnitude of scale to our system while maintaining reliability.

One other thing I’d like to make clear: while we are looking for someone who has a passion for this “SRE” space, it is perfectly fine if they want to work across the four areas of Titus. We don’t silo team members into a single focus area. While we have focal points for various areas, they change over time as makes sense for project priorities.

I’m headed to SRECon next week, as I’m looking for software engineers who want to work across the Titus platform and are passionate about the above aspects of SRE. If you’re going to be there and want to catch up, please look me up on Twitter (@aspyker) and say hello at the conference. Or, if you aren’t in the area, start a DM on Twitter or apply here.

Written by Andrew Spyker