Scaling a Netflix Infrastructure Team as an Engineering Manager
Our compute infrastructure service supports Netflix engineering with hundreds of applications running across thousands of servers. I directly manage and have hired the team that designs, develops, operates and supports this service. Our team shares the additional roles of vendor and partner manager, project manager, product manager, and customer relationship manager. We have gotten to where we are today with a single manager and a team of ten very senior software engineers.
How have we succeeded in building this important Netflix infrastructure with such a small team?
Hiring Senior Software Engineers
In 2015, our team was three engineers, myself included. In 2016, I became the manager and the team grew to four engineers. In 2017 the team grew to six, and in 2018 to ten engineers. At each of these steps we had a team that was much smaller than teams supporting similar systems at other companies.
The biggest factor in our success has been the engineers working on the project. Members of the team have an average of eleven years of industry experience. During those years, they have built critical operational systems at scale at Microsoft, Joyent, Amazon, Yahoo, Heroku, and Netflix, among other well-known web-scale properties. They have had to deal with the complexities of similar always-on distributed systems and have had the opportunity to make and learn from many mistakes.
Also, being senior in their careers means they have held roles such as co-founder, team lead, manager, and chief architect. That means I, as a manager, know I can delegate many additional responsibilities to the team, such as project management, product management, and relationship building. Additionally, the team's maturity means they exhibit the soft skills that are critical in supporting our customers and partners.
I spend much of my time maintaining relationships with partners both internally and externally. Our infrastructure is built on top of engineering tools, technologies, and services from other Netflix infrastructure teams. Being able to run an application on top of our infrastructure requires deep alignment with our application productivity and security teams. Finally, we deeply leverage Amazon EC2 to offload and automate datacenter-level features of our infrastructure.
Our internal users also help us scale. Before I joined Netflix, I found it harder to experiment quickly and make breaking changes when needed to reduce technical debt. By developing bi-directional empathetic relationships with some of our largest users, we have been able to move faster.
I write a weekly update to the team, using Google Docs, covering all of the learnings and context I gather. I call this document “Andrew’s 1-1s and Context Document”. Throughout the week, I scribble down terse summaries of the meetings I attend. Cleaning up this document takes me an hour or two weekly. I add my own views and thoughts along with what I learned. I ping the team in Slack to let them know when it is ready for consumption. To give you an idea of the depth, last year the doc was 134 pages, and so far this year it is 49 pages.
Our team comments back in places where they need more clarification, participates in discussion comment threads, and then uses the information to make more informed decisions. When items are more urgent, I write them down and alert the team through Slack to read just a single update. I also use this document to ask for help in places where I don’t know how to delegate work, and usually the team volunteers. This is my main scalable communication medium. It takes work to do well, but the team cites it as one of the most valuable things I do as a manager.
Building a Liaison Program
One of the reasons I joined Netflix infrastructure management was that development managers product manage their own offerings. I thoroughly enjoy knowing our users, as I know it helps me build the right service. As our service grew, it became hard to frequently touch base with every user team. I found that our engineers also liked aspects of product management, and that when they touch base with users directly, they more deeply understand the engineering-level needs.
These reasons came together into our service’s liaison program. This program establishes an engineer within our team (the liaison) as a focal point for our larger customers or management leads for groups of users (we call them “customers”). That engineer is responsible for meeting with their customer once or twice a month to see how things are going, collect requirements, and gather feedback. I jump into these meetings as I can and when needed. The liaison is responsible for managing the relationship and engaging me if they need help.
The liaison is neither a gatekeeper to the customer nor the engineer who does all product work for that customer. We still spread that work across the team. The liaison is responsible for periodically updating the team on the customer and during our planning process representing that customer’s needs. After establishing the liaison program for customers, we expanded the types of relationships managed this way to partners and cross-functional initiatives.
How will we continue to scale as a team?
Each of the above has helped our team scale. However, as our number of customers, partners, and focus areas grows, each of the above tactics is hitting scaling limits. Finding and retaining stunning engineers takes focused and continued effort. Managing our ever-increasing partnerships requires deep relationships to ensure bi-directional empathy. Sharing context with the team expediently will continue to be important as our mission widens. Finally, while the liaison program is a great approach, its impact on engineering focus and its single scalability bottleneck (me) are problematic.
Our team needs to grow to deliver on its current mission and has scope to grow further to help support the Netflix business. However, we are now at the Hackman max for a team size of ten. As my next step, I plan to hire another manager to help continue to grow the team, product, and mission. If you are a manager looking to have great impact and are interested in creating additional techniques to scale such a service, check out this job posting. We’d love to talk to you.