[WIP] Reflection on Org Structure & Reorganizations

11 min read · Apr 25, 2024

One of the key responsibilities of an org leader is to architect their org. It is one of the most challenging jobs on earth and is critical to the long-term health of an organization. When you talk to someone about a reorg, you normally hear complaints: “uhh, another reorg!” In this blog post, I’d like to share some of my observations and thoughts on org structure and reorgs.

Observations & Thoughts

There is no perfect org structure

The rationale behind organization and management is to scale the same tribal knowledge across many individuals. Without a structure, tribes can rarely grow beyond 150 individuals. One big invention of human society is the company, together with management science, which lets a group scale to a very large size (though past some threshold it still tends to implode). However, as systems scale out, whether human systems or distributed systems, there is a cost associated with that.

In some sense, all org structures are inefficient. The most effective way to organize a company would be a single mind connected to every employee, with zero management in between. Unfortunately, such an individual or superintelligence doesn’t exist yet. A CEO only has 24 hours a day, so they have to prioritize their time on the most critical things and delegate some of their responsibilities to the management team. The effectiveness of the CEO is, in practice, how well the CEO can lead the company towards a shared objective.

Now that we know organization itself is a compromise, the objective of organizing a team is to find a relatively optimal org structure that minimizes the communication loss between the CEO and individuals. Since it is an approximation, there is usually no single representation of it (just as you can approximate an observation in multiple ways, such as linear regression or neural nets, with no one best approach). There is usually no single best org structure for a company. It resembles a high-dimensional optimization problem, and our job is to find the best approximation we can.

Understand the challenges before taking action

The cost of a reorganization is empirically 3–6 months of productivity, so it is not wise to reorg just because you want to pursue an ideal theoretical organization structure. You usually only want to do it when the structure is trending towards a top issue for the team and many leads have started to grumble about it, though not so late that it is already a top issue on everyone’s mind or has started to badly hurt morale and team productivity.

From the previous section about “No perfect org structure”, we understand that building a proper organization is a fairly challenging problem, and there are many problems to be solved. Now we need to find out which problems to prioritize.

Similar to the guidance in Good Strategy, Bad Strategy, the prerequisite to action is to deeply understand the problem, apply the organization’s guiding principles, and then align the actions with those guiding principles to solve the problem. As there is no perfect solution to organization structure, it is incredibly useful to think through the acute problems that impact team productivity, write down the pros and cons of the various choices, and then use the team’s values to decide which one to go with.

Talent matters

An organization is built by people and built for people. The first dimension is talent, and this really matters. One management principle I repeatedly heard from one of my previous managers was:

As a manager, your №1 job is always to fix people issues, since it is not anyone else’s job. If you don’t do that, nobody will.

So the first dimension, in my opinion, is to really understand the talent deck and align the org design with the talent. With the talent factor left out, it is just a theoretical discussion of organization structure. Talent is the first constraint of organization design. Every individual in the organization is different, with different perspectives, strengths and weaknesses, and you need to put the right person in the right spot to use their strengths. This is also mentioned in “How Google Works”: “Beware of the tendencies of different groups: engineers add complexity, marketing adds management layers, and sales adds assistants.” So the first thing to understand is the talent stack, and you should keep reminding yourself of this.

The second point is whom to organize around, which aligns with this quote: “As a company gets big and complex, you can’t just organize around people who create innovation; you need to organize around people who can create and lead entire new ventures and business.” On the other hand, we also need to take good care of people who are not interested in creating new ventures or businesses.

Learnings

Don’t separate the interface and implementation into separate VP orgs or different geo locations

The communication overhead between two VP orgs, or between two geo locations (like BLR and Pacific time), is huge. Under two VPs, escalations can take forever to happen. Also, when you pass information across two VP chains (from eng -> mgr -> dir -> vp -> vp -> dir -> mgr -> eng), 90% of the context can be replaced with assumptions. Even if you have the best intentions to begin with, you end up with tons of hatred if there is any misalignment between the two teams.
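To get a feel for that 90% figure, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every hand-off in the chain keeps the same fraction of the original context; the function names and the shorter same-manager chain are hypothetical and only there for comparison.

```python
# Back-of-the-envelope model (an assumption, not a measurement):
# every hand-off in a chain keeps the same fraction of the original context.

CROSS_VP_CHAIN = ["eng", "mgr", "dir", "vp", "vp", "dir", "mgr", "eng"]
SAME_MANAGER_CHAIN = ["eng", "mgr", "eng"]  # hypothetical: both engineers under one manager


def hops(chain):
    """Number of hand-offs needed to pass a message along the chain."""
    return len(chain) - 1


def per_hop_retention(end_to_end, chain):
    """Per-hop retention that would produce the given end-to-end retention."""
    return end_to_end ** (1 / hops(chain))


def retained(per_hop, chain):
    """Fraction of context left after traversing the whole chain."""
    return per_hop ** hops(chain)


# "90% of the context replaced with assumptions" means about 10% survives.
per_hop = per_hop_retention(0.10, CROSS_VP_CHAIN)

print(f"hand-offs across two VP chains: {hops(CROSS_VP_CHAIN)}")
print(f"implied retention per hand-off: {per_hop:.2f}")
print(f"retained under one manager:     {retained(per_hop, SAME_MANAGER_CHAIN):.2f}")
```

Under these assumptions, each hand-off only has to lose a bit more than a quarter of the context for the full seven-hop chain to lose 90% of it, while the same per-hop loss under a single manager still leaves roughly half. No single person in the chain has to communicate badly for the end result to be mostly assumptions.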

Autonomous teams and Conway’s Law

The structure of your product will indeed mirror the structure of your organization, and your customers will experience that.

I had a very painful experience with a broken org structure, to the point that I could not sleep well for a few months. Below is one of the documents I wrote to suggest that leadership combine the two teams under the same leadership chain. It was a hard one!

Reflection on Building a Coherent Deep Learning Solution

Problem Statement
The company has been operating the Training Infra stack across two VP orgs, the Big Data Platform (BDP) org and the Machine Learning Infra (MLI) org, over the past year. The MLI team is responsible for the deep learning authoring interface (TF Trainer) and serving infrastructure (TF serving), and the BDP team is responsible for training infrastructure (TF core, Training Driver and Training Infra (Kubernetes)). There have been a few challenges in this setup: (1) the training interface is separated from the training infrastructure; (2) training infrastructure and serving infrastructure are partitioned. We are not yet at a stage where interface and infrastructure can be decoupled and evolve independently (a new fds reader or a new k8s training platform requires interface changes); and in deep learning, a new innovation added to training (e.g. custom ops) also has to be validated and integrated for serving.

Issues
1. Unnecessary turf discussions. We spent more time discussing team boundaries and platform/infra interfaces than thinking about building the right deep learning solution for the customers. When the managers from both teams met, we spent at least 50% of the time discussing the interfaces between the teams, a total waste of time. We don’t even have a product we are proud of, and we are in endless turf discussions. For example, should MLI work on scalability? Should MLI work on I/O? Should BDP work on serving? Who should work on hyper parameter tuning? How to handle oncall boundaries? What is the interface between infra and platform?

2. Bad and slow decisions. Good products start from small, flat teams where the leader makes good, fast decisions and iterates on the product quickly. When a product is in its early days and not mature, disagreements are very common. Under two different VP orgs, any disagreement requires cross-org alignment, and each org naturally operates for its own benefit. For anything to go through, we have to make sure everyone is happy, from org A’s eng -> A’s manager -> A’s director -> A’s VP -> B’s VP -> B’s director -> B’s manager -> B’s engineer. No decision makes everyone happy in the early stage. If you go through one round of escalation like this, you don’t really want to do that every week. In the end, _we don’t optimize for the best interest of our users, but optimize for reducing conflicts and the best interest of different organizations_. Examples: the BDP team optimizes for training and serving is not in the thought process, as custom ops support shows; the MLI team focuses on customer support, misses holistic optimization, and is frustrated, as shown by its investment in MWMS distributed training and by dropping the I/O integration task even though we all know it is a priority; and when there is a third team in the mix, we can no longer operate: it takes a few quarters to finalize a “temporary” design for hyper parameter tuning across AIF, MLI and BigData.

3. Missing accountability. “Who owns the CSAT/success of LinkedIn deep learning” is an unsolved question. There is no single POC or team that owns the overall deep learning experience. When shit happens, we blame each other. Take the myth of low GPU utilization, for example. From the BDP team: the ProML platform sucks, so people are not using deep learning. From the MLI team: BDP folks suck, they don’t scale TF well.

4. Inflexibility in talent management. Even though we told our engineers they can do whatever they want for the benefit of the company, it is still unnatural for an engineer to work on the other team’s current investment. If an engineer wants to work on projects in the other org, we have to go through a quarter-long process to transfer them. For example, back in the MLI team, Pei Lun didn’t want to touch I/O optimization since it was mostly the BDP team’s scope, and there were also enough tasks inside the MLI team to occupy him. After transitioning to the BDP team, Pei Lun is delivering outsized impact and innovating like a rock star. Had we optimized globally, we would have put PL in a position to focus on infra scaling first.

5. High communication cost and lost trust. Prioritization across orgs is very costly. Some argue that we can just communicate more, or run virtual teams. We did try a virtual team in 2020, and the management and leadership spent a ton of time pulling the teams together and building trust. The split not only adds internal communication overhead, but also adds communication cost for our customers: they don’t know who to talk to, and they have to be cautious about who they talk to. For example, the MLI team always has the impression that the BDP team is working on secret projects with the AIF team. The high communication cost also causes knowledge silos between the two teams.

6. Lack of a coherent roadmap for the overall deep learning offering. Because we are split across two different orgs, with an unclear charter cut and no overall ownership, it is challenging to come up with a good overall deep learning strategy. For example, MLI owns serving infra and the training interface while BDP owns training infra, so it is hard to build a good strategy either for training or for the infra pieces alone.

7. Aligning culture and values across orgs is hard. Different orgs have different cultures. Influencing another org with your principles is a long process. For example, “focus” has been the mantra in BDP, but it is not really in place in the MLI deep learning team because of AIP and other challenges. There have been pushes from the BDP team to have the MLI deep learning team focus more, but you won’t have enough bandwidth to address the culture of your partner team in a different org. Shaping culture and values is a full-time job.

Guiding Principles

#CompanyFirst: we act in the best interest of the company. Folks should share the same value of Company First. If we all think as One Team, discussions stay professional rather than becoming personal and emotional.
The interface and the implementation of the same service should be in the same group (director/VP groups).
1. For example: Spark core and Spark-as-a-service should be in the same organization.
2. But Spark service and services built on top of Spark don’t have to be together. For example: feature engineering team (Frame) and Spark service team don’t have to be in the same organization.
Start from the same leaf team before splitting into multiple teams. It is far easier to start an initiative from a small core team and later expand into multiple organizations. If we prematurely split a project across multiple organizations, it is painful to coordinate.
Clarity on team charter and broadcast. When we start a new team, it should broadcast its charter with Vision -> Value, and we need to make sure there isn’t a second team sharing a similar statement. For example, we can’t have two teams both focused on deep learning scalability. And if team A’s charter is a superset of team B’s charter, it is better that they are in the same group or that B is part of A’s group.
What about permissionless innovation versus ambiguous team charters? We embrace permissionless innovation, and everyone is empowered to think big and challenge the status quo with innovations. The question is: if Team A is already doing scope X, and an engineer from Team B builds an innovative solution Y that is 10x better than X, what do we do? First align on whether X and Y are indeed the same product; likely they are solving similar problems but in completely different contexts. If X and Y are indeed directly competing products and both are maturing, it is better to consolidate at the leadership level.
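The charter rule above can be sketched as a simple check. This is only an illustration: it assumes a charter can be reduced to a set of scope labels, and the team names, scope labels and `charter_conflicts` helper are all hypothetical.

```python
from itertools import combinations

# Hypothetical charters, each reduced to a set of scope labels.
charters = {
    "Team A": {"deep learning scalability", "training infra"},
    "Team B": {"deep learning scalability"},
    "Team C": {"feature engineering"},
}


def charter_conflicts(charters):
    """Return (team, other_team, kind) for every pair of charters that share scope.

    kind is "superset" when the first team's charter fully contains the
    second team's charter, and "overlap" when the two only partially intersect.
    """
    findings = []
    for a, b in combinations(sorted(charters), 2):
        scope_a, scope_b = charters[a], charters[b]
        if not scope_a & scope_b:
            continue  # disjoint charters are fine
        if scope_a >= scope_b:
            findings.append((a, b, "superset"))
        elif scope_b >= scope_a:
            findings.append((b, a, "superset"))
        else:
            findings.append((a, b, "overlap"))
    return findings


for team, other, kind in charter_conflicts(charters):
    print(f"{team} / {other}: {kind}")
```

In this toy data, Team A’s charter contains Team B’s, which is the case where the principle says B should sit in A’s group. Real charters are prose rather than sets, so this check only helps if the scope labels are agreed on first.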

Options

Option 1: Put all deep learning inside one team, one org.

What is good:

Simplicity. The time we currently spend dealing with the challenges above can go into building a better product.

What is the problem:

Reorganization cost. We will take at least a quarter of productivity loss to rebuild the org structure and culture of the teams.
Fading horizontal context. If we move the BDP deep learning team into MLI, the team might slowly lose exposure to the innovation in other compute engines (Spark, Pinot, Trino etc.); if we move the MLI deep learning team into BDP, the team might slowly lose exposure to the overall machine learning pipeline.

Option 2: Put all training inside one org, serving in another org.

Pros:

Training will start functioning better. The separation of interface and implementation in the training stack causes most of the current challenges. With this option, we will have one team and one leader to interface with customers and make decisions.
Not too intrusive. We re-org tf2-trainer and training infrastructure together, while keeping the serving stack as is.

Cons:

Product is a copy of the organization’s communication structure. We will likely land a product with unbalanced quality between deep learning training and deep learning serving. Especially as we enter the age of large models, close collaboration between deep learning serving infra and deep learning training infra is a must to produce good results. We don’t want to end up in a state where a trained model can’t be served, or where a large model can be served really well but training takes forever.

We ended up with Option 1, which in hindsight was clearly the better option.

Technical mastery matters in infrastructure orgs

This is a corollary of “Conway’s law”. For application teams, it is critical to align the organization structure with the product interface. For infrastructure teams, just understanding the product is no longer enough; it is also critical to understand the current and future evolution of the architecture in order to propose a relatively stable org chart. Note that the key here is **understanding**, not **proposing**. It is critical for the org leader to have a vision for the platform and to understand the technical architecture, but not to propose such an architecture by themselves. They must rely on others in the org to figure out the technical architecture together, and have enough technical, product and customer exposure to choose the most sustainable one going forward. Don’t underestimate the importance of technical mastery in org design. We all want a flat org structure, but choose carefully when you try to flatten the org. Please don’t remove the person with the most technical expertise from the chain.

MORE TBA