A way to understand software architecture and its failures
There are a couple of “standard spiels” I have been giving my engineers for decades at this point. It’s about time I wrote them down and made them more widely accessible. This is the first of them!
Several years ago, I was in an engineering meeting where we were trying to organize our system. We realized that the system had three natural layers in it — and in the months and years that followed, we realized that this three-layer partitioning wasn’t just good for the one project, it was an extremely useful language for talking about engineering systems in general. Not only does it give names to things, but it lets us talk about their possible relationships, identify common problems, and identify common solutions. In retrospect, it’s been one of the most powerful engineering architecture tools I’ve found during my career.
Today I’d like to share it with you: first the basic idea, then how things work out when everything is going well, and finally the four most common problems and how to get out of them.
- The Core Idea: What W, X, Z, and Horizontal Teams are and how they work
- Healthy Relationships: How these teams work together when things work well
- Common Problems (And How to Handle Them): X-Z Mixing | Z-Z Mixing | W-Z Mixing | Horizontal Disconnects
If you’re wondering what happened to “Y”: We (Brian Stoler, Reza Behforooz, and I) initially had X, Y, and Z layers, then realized there was also a W layer, then realized that the Y layer didn’t actually exist. So we ended up with W, X, and Z and the names just sorta stuck.
The Core Idea
The Z layer is the user-facing part of a system. Typically, each system has a large number of Z’s in it — all the features of the product itself. New Z’s are being tested out, and old ones retired, all the time.
The main challenge for the teams that work on the Z layer is to figure out what the customers actually want (aka, product-market fit). As a result, people who work on the Z layer of a system care a lot about being able to experiment rapidly and measure their success by customer adoption. They tend to think in relatively short timescales — months, not years — and use infrastructure but aren’t experts in it and often don’t want to be.
The X layer is the infrastructure that directly supports the Z layer. It speaks the same language as the Z layer — e.g., it might have primitives like “users” or “feeds,” not “integers” or “cores.”
The job of the X layer is to make it as easy as possible for the Z teams to experiment rapidly; success for them is that Z teams can try things out without having to think about anything that isn’t directly related to the customer questions they’re focused on, without spending a lot of time coordinating with one another, and without a problem in one Z system taking down the whole environment.
As a result, people who work on the X layer care a lot about being able to isolate problems and measure success by Z teams’ flexibility. They think on infrastructure timescales — years rather than months — but their direct customers, the Z teams, don’t.
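To make the distinction between layers concrete, here is a minimal sketch of the same storage seen at two levels. All the names (`BlobStore`, `FeedService`, and so on) are invented for illustration: the W-level primitive speaks in keys and bytes, while the X-level primitive speaks the product’s language of users and feeds.

```python
class BlobStore:
    """W-layer primitive: generic bytes at keys; knows nothing about the product."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data.get(key, b"")


class FeedService:
    """X-layer primitive: speaks the product's language ("users", "feeds"),
    hiding the storage details from Z teams."""

    def __init__(self, store: BlobStore):
        self._store = store

    def append_post(self, user: str, post: str) -> None:
        key = f"feed/{user}"
        existing = self._store.get(key).decode()
        lines = existing.splitlines() + [post]
        self._store.put(key, "\n".join(lines).encode())

    def read_feed(self, user: str) -> list[str]:
        return self._store.get(f"feed/{user}").decode().splitlines()


# A Z team writes only product-level code and never mentions keys or bytes:
feeds = FeedService(BlobStore())
feeds.append_post("alice", "hello world")
feeds.append_post("alice", "second post")
```

The point isn’t the implementation; it’s that the Z-layer code at the bottom never has to mention keys, bytes, or encodings.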
The W layer is the deep infrastructure that supports the X layer: generic compute, storage, and networking, whose primitives really are things like “integers” and “cores.” Where X teams serve many Z’s, W teams serve a handful of X teams, and a W-layer failure affects everything above it. As a result, W teams will often have much more white-glove relationships with their customers, care about providing performance and reliability guarantees, and measure success by X teams being able to scale without incidents. Another important difference from X is that you can buy W-layer infrastructure.
This is because of the very generality of its primitives; X-layer infrastructure is all about speaking the same language as your company’s particular product, so by its very nature you need to build it. But (for reasons we’ll discuss below), even when you’re buying infra, W-layer teams are often building layers on top of that for the company’s specific needs, and exposing those to their customers, not the raw third-party tool.
Separate from these three, but closely related, are horizontal functions like security, privacy, legal, health, and site reliability. They aren’t layers at all, but stewards of shared resources of the company (like its reputation or security). They often work closely with W- and X-layer teams to make sure that as many as possible of the critical behaviors that matter to them are infrastructuralized (so that Z-layer teams don’t have to think about them explicitly), and with Z-layer teams to make sure that the integrated behavior of the system doesn’t create emergent problems. They care about making the system fail well and measure success by the health of the shared resource they’re charged with.
The “three layers” of this model aren’t quite the whole story: Each layer may have sublayers inside it. It’s very common for an X layer, for example, to have its own deep substrate (the “W of the X”) that makes it easy to build higher layers of X, a middle layer (the “X of the X”) of common primitives, and a top layer (the “Z of the X”) which is really Z-layer logic that gets reused by enough Z-layer teams that you’ve partially infrastructuralized it. And if you’re buying a W layer from someone (which you are, unless you’re mining your own silicon), they have W, X, and Z layers inside their company, and you’re the customer atop their Z.
Healthy Relationships
When everything is going well, things work something like this:
- W-layer teams are spending their time thinking about scaling, reliability, security, and the like. They are in very close contact with the X-layer teams and are aware of where the next big demands will come from. They’re looking at dashboards of overall system load, latency, errors, and so on, and watching load go up and to the right while everything bad stays small. If they’re buying rather than building a technology, they’re focusing on how to smoothly integrate this technology into the X-layer ecosystem.
- X-layer teams are acting more and more like W-layer teams, thinking about aggregate load from their customers. They’ve separated their customers into categories — a handful of very large customers with whom they have very close relationships and a handful of categories of smaller customers who have similar needs with whom they have a more arms-length, self-service relationship. They’re keeping a close eye on the kinds of experiments that Z teams want to be doing and are building out new features to enable more of that.
- Z-layer teams are experimenting like mad and rarely think about the W- or X-layer teams, or even about other Z-layer teams. Except for the handful of really big and mature ones, they’re thinking about hundreds of ideas for new product behaviors and a nonstop search for product-market fit. Occasionally, they have ideas that would require some fundamentally new infrastructure, at which point they get frustrated that it doesn’t exist yet.
- Horizontal teams have split into a central team (that acts as reviewers and full-time experts) and an embedded team (of people who are working full-time inside W, X, and Z-layer teams to help those teams design their products right and keep them happy), as well as a distributed force of people in the W, X, and Z-layer teams who understand and care about the relevant issues and bring their knowledge of that to bear every day. The teams put their energy into helping people think through failure cases and managing escalations (“This team wants to do so-and-so, but that creates a risk; should we do it?”) and incidents.
Reminder: Escalation is healthy!
Sometimes people think of escalation as a sign things have gone wrong, or something to be avoided. But it’s something you should aim for, not avoid! If two groups can’t agree on something, escalating it serves a bunch of useful functions. First, it gets the relevant decision-makers in the room — and generally they want to be in the room for that. Second, helping resolve things like this is literally these people’s job. Third, often the process of escalating — of preparing clear statements of the questions under debate and the alternatives — can help resolve the problem more easily.
Conversely, if you don’t escalate once the debate is showing signs of stagnating, you’re just stretching it out; nobody gets a decision, everybody gets more stress, time is lost, and “no decision” is actually a decision in its own right, just generally not the one you would have picked otherwise.
All of which means, escalate quickly! Your team, your managers, and your product will thank you.
Common Problems (And How to Handle Them)
With this model, you can start to see the four most common things that go wrong — and how to get out of those problems when they happen. Most of them boil down to different layers of the system not having the right separation between them: X-Z Mixing, Z-Z Mixing, and W-Z Mixing. The fourth, Horizontal Disconnects, happens when the horizontal teams and the layer teams don’t cooperate.
X-Z Mixing
The Problem: People working on Z-level problems need to spend lots of time thinking about how to modify the corresponding X-level systems. Either they need to get lots of consulting help from X teams, or they need to know and manage X-level problems themselves, or (worst of all) they need the X teams to do something for them in order to make any progress of their own. In every case, they’re frustrated because it takes forever to get an X team to do things and so they can’t experiment, and the X teams are frustrated because they’re drowning in requests from Z teams and can’t get anything done. Everyone is unhappy.
How it happens: This is a really common problem and usually happens because when a system is first built, it doesn’t have layers at all; it’s basically a single Z. Then a second Z gets built, and it’s put on as a sort of hack on the side of the first Z, maybe reusing some stuff; and a third one, and a fourth one, and suddenly you have a bunch of Z’s but there is no clear X layer, only a large morass that’s full of Z-level logic. Each Z team basically owns a bunch of stuff inside X, and they can’t easily change it because everything is full of logic from other Z teams. Even if a separate X team is created, they still own that morass.
How to get out of it: This is a sign that we need a clearer contract — an API boundary — between X and Z. When in this situation, a good checklist to follow is:
Step 1: If you haven’t already done this, make the X teams their own entity — separate from the Z teams — with the objective of making the Z teams happy.
If you want to get somewhere, first figure out where you want to be, then figure out how you’re going to get there. The order of those steps matters.
Step 2: The X teams need to get into the mindset of an enterprise software company, with the Z teams as their customers. They need to do traditional product management: talk to a huge number of their customers and figure out how to separate them into categories that have similar needs. Typically there will be a few really big (and often very unique) customers and then a few categories, each of which contains many small customers.
The output of this step should be (1) a list of the categories of customers, with a summary of the key things that each category cares about, and (2) a short (<10 item) list of typical customer requests, chosen so that if all of these were easy, everyone would be happy. You should aggressively aim this list towards things that people might ask for in the future, rather than just today’s requests; experience helps make that call.
This output should be shared with all the Z teams, and everyone should be nodding their heads vigorously. (Iterate until that happens.)
Step 3: Next, the X teams need to design where they want to be. At this stage, remember the first rule of planning: if you want to get somewhere, first figure out where you want to be, then figure out how you’re going to get there. Aggressively resist the temptation to figure out the best incremental improvement from your current architecture; you are currently very far from optimum and will end up “optimizing” (I use the term loosely) for your existing problems. Do a clean-sheet design and then figure out how to go from your current system to there.
A very effective approach here is “design-by-documentation:” you want to create a manual for Z teams explaining how to make each of the common types of changes, and making those changes should be really easy. Importantly, in the common case, the people doing this should not require any explicit assistance from the X teams, and shouldn’t need a deep understanding of X-layer systems or problems beyond simple warnings like “don’t do X.”
The output of this step should be a combination of this manual (or something roughly equivalent) and an internal general design for a system which would support all of these operations easily and scale to more.
This should be done in close collaboration with customers; they’re the ones who need to say “yes, this design would make our lives great!,” as the success or failure of your project will ultimately be determined by their happiness.
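As a sketch of what “really easy” can look like once design-by-documentation is done, here is a hypothetical manual entry reduced to code. The `register_experiment` API, its parameters, and the team names are all invented for illustration; the point is that the entire Z-facing surface for a common change is one validated, self-service call.

```python
# Registry of live experiments; in a real system this would be backed by
# the X layer's configuration service. Everything here is a toy sketch.
EXPERIMENTS: dict[str, dict] = {}


def register_experiment(name: str, *, traffic_percent: float, owner: str) -> None:
    """The whole Z-facing surface for launching an experiment: one call,
    validated up front so misuse fails fast instead of in production."""
    if not 0 < traffic_percent <= 100:
        raise ValueError("traffic_percent must be in (0, 100]")
    if name in EXPERIMENTS:
        raise ValueError(f"experiment {name!r} already exists")
    EXPERIMENTS[name] = {"traffic_percent": traffic_percent, "owner": owner}


# The manual's worked example: a Z team launches a 5% experiment in one line,
# with no X-team involvement.
register_experiment("new_ranking", traffic_percent=5.0, owner="feed-team")
```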
Step 4: Now you can start a traditional planning process.
Your overall objective is to create this system; its key results are tied to customer satisfaction.
Your milestones should talk about alleviating existing pain points for various categories of customers by building out parts of this larger system.
Each milestone should either talk about making something available for a pilot group of customers within a category (3* large customers in that category who have agreed to work closely with you to make sure the system really satisfies their needs, and to do the work on their end to adopt it; the corresponding KR is entirely about these pilot customers’ satisfaction with the process and result), or about getting it broadly adopted by customers in that category (the corresponding KR is about adoption rates and how much time you have to spend per customer).
* The number 3 is approximate but magical. Remember that in engineering, every system is built for either one use or N; 3 is the smallest number which is greater than one and doesn’t encourage building for special cases.
“Large” is relative to the category, but generally it’s very valuable to do pilot work with the largest customers possible for a few reasons. First, they have the resources to dedicate time to piloting a new thing. Second, impact on improving a large customer is the highest, so you’re getting the most value fastest by starting with them. Third, large customers drive you to build it right — if you can support them, you can support anybody. That’s important both in terms of your design and in terms of convincing other people that it’s safe to migrate later on. Picking the right pilot customers is a key thing to get right!
Step 5: And now, get executing! Stay in continuous touch with your customers; for milestones involving pilots, you should be so close that many people aren’t even sure who belongs to which team. For general-availability milestones, make sure you’re doing regular CSAT checkups.
This not only gets you into the right architectural state (because the manual really highlights that the Z teams don’t have to spend their time on things they shouldn’t), but also into the right team health state (because you have good relationships with your Z teams and they’re happy).
Z-Z Mixing
The Problem: People working on Z-level problems need to spend lots of time coordinating with other Z-level teams to make sure they don’t accidentally screw up some other feature. Z teams slow down and get frustrated, and production incidents or near-misses are common.
How it happens: Typically, the same basic story as X-Z mixing. Here the problem is that the basic primitives of the system — the nouns and verbs exposed by the API — aren’t clear or distinct enough, so the notionally small and isolated change required by a Z team turns into a big and non-isolated change that requires everyone to cooperate.
How to get out of it: There are really three possible problems here under the hood.
- If there’s X-Z mixing going on, it’s going to create Z-Z contact, so if that’s your problem, solve that and it should take care of this.
- If you’ve already done some good X-Z separation but you still have a Z-Z contact problem, your X-layer API may be incorrect, exposing the wrong nouns or verbs. The sign of this is that the interaction between Z teams happens during the detailed design or implementation stages. In this case: Make sure you have an up-to-date list of the kinds of things Z teams need (or will need in the future) to do, and do a clean-sheet design of what your API should look like to support that, with nicely orthogonal components. As usual, get the Z teams on board with the new API, confirming that it will let them do the things they want to do without having to think about other Z’s. Then implement the new API, with a glue layer so that the old API is, as far as possible, implemented on top of the new one, and migrate people off the old API.
- If, on the other hand, the problem is happening more at deployment or production time, rather than design or coding time — things like one Z team overloading the system and knocking over the X layer — then the X layer has an isolation problem. At this point, it’s time for the X team to start thinking more like a W team and maybe working closely with their own W-layer providers to make the system more isolated and robust. All the techniques of system infrastructure become useful here: prioritization, load shedding, sandboxing, and so on.
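The glue-layer migration in the second bullet can be sketched like this. All the names are invented; it’s a toy illustration of the pattern, not a real API: the old surface stays up, reimplemented as thin translation calls into the clean-sheet API, so unmigrated Z teams keep working while migrated ones move to the new nouns and verbs.

```python
class NewFeedAPI:
    """Clean-sheet API with orthogonal nouns and verbs."""

    def __init__(self):
        self._posts: dict[str, list[str]] = {}

    def publish(self, author: str, text: str) -> None:
        self._posts.setdefault(author, []).append(text)

    def timeline(self, author: str) -> list[str]:
        return list(self._posts.get(author, []))


class LegacyFeedAPI:
    """Glue layer: the old surface, kept alive as thin calls into the
    new API so no Z team breaks on day one."""

    def __init__(self, new_api: NewFeedAPI):
        self._new = new_api

    def add_item(self, user_and_text: str) -> None:
        # The old API crammed two concepts into one string; translate,
        # don't duplicate the logic.
        user, text = user_and_text.split(":", 1)
        self._new.publish(user, text)

    def get_items(self, user: str) -> list[str]:
        return self._new.timeline(user)


new_api = NewFeedAPI()
legacy = LegacyFeedAPI(new_api)
legacy.add_item("alice:old-style post")     # an unmigrated Z team
new_api.publish("alice", "new-style post")  # a migrated Z team
```

Because both surfaces hit the same implementation, the two kinds of Z team see a consistent view while the migration is in flight.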
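The isolation techniques in the last bullet can start as simply as a per-tenant quota inside the X layer. Here is a hypothetical sketch (names and limits invented) in which one noisy Z team’s overload is shed while other Z teams are unaffected:

```python
class QuotaGate:
    """Per-tenant admission control: a minimal form of load shedding."""

    def __init__(self, per_tenant_limit: int):
        self._limit = per_tenant_limit
        self._used: dict[str, int] = {}

    def admit(self, tenant: str) -> bool:
        """Admit a request unless this tenant is over its quota."""
        used = self._used.get(tenant, 0)
        if used >= self._limit:
            return False  # shed: only this tenant is throttled
        self._used[tenant] = used + 1
        return True


gate = QuotaGate(per_tenant_limit=2)
results = [
    gate.admit("noisy-z"),
    gate.admit("noisy-z"),
    gate.admit("noisy-z"),  # third request from the noisy tenant is shed
    gate.admit("quiet-z"),  # other tenants still get through
]
```

In a production system the counters would decay over time (a token bucket or similar), but the architectural point is the same: the blast radius of one Z team’s mistake stops at the X layer.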
W-Z Mixing
The Problem: People working on Z-level problems are spending time talking to W-layer teams and thinking about really deep-in-the-stack details. Everything is really complicated, and production changes are hard.
Note: There is one situation where this isn’t a problem, which is the case of really huge single systems that are a sizeable fraction of the entire company, so that a 1% improvement in performance means tens of millions of dollars per year or the like. At this scale, it starts to make sense to have that team working on performance issues much further down the stack than usual, and to staff the team to match. If you’re in this situation, you know it.
For a real-world example, when I headed high-capacity search at Google, not only did we optimize the search stack, we would retune it for each hardware platform, implement the innermost loops in hand-coded assembler, bypass the file system and talk to disks as raw block devices, modify the block device drivers in the kernel, work with the hardware team to architect the NUMA and data buses on motherboards to our needs, and convince the microchip manufacturers to change their instruction sets. (This is why x86 has an lzcnt operation!) At that scale, these were all totally reasonable things to do. At most other scales, they would not be.
How it happens: Early in the life of a system, it’s rare to know what the correct nouns and verbs are, so initial Z systems start out talking directly to more abstract lower layers, e.g. byte-level storage. (This is especially true in the era of cloud computing, where companies will happily sell you W-layer infrastructure, and Z-layer teams won’t realize that it doesn’t do the same things that X-layer infrastructure does. As Trey Harris pointed out, SaaS marketing teams are very incentivized to tell you that this will solve your problems, and not so incentivized to explain that they only give you the tools to solve your problems; they don’t actually solve them for you. Automated data deletion is a famous example, since doing it requires knowing a lot about your data schemata and so is inherently X-layer work.) This stuff persists, and later on it starts to feel like Z layers are talking all the way down to W layers.
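The data-deletion example can be made concrete with a small sketch. The schema and records below are invented: the W layer stores opaque records, and only a schema saying which field identifies the user lets you find and delete their data, which is exactly why this is X-layer work.

```python
# Hypothetical schema: for each table, which field names the user.
# A raw byte store has no equivalent of this knowledge.
SCHEMA = {"events": {"user_field": "uid"}}


def delete_user(tables: dict[str, list[dict]], user_id: str) -> dict[str, list[dict]]:
    """Drop every record belonging to user_id, consulting the schema to
    know which field to inspect in each table."""
    cleaned = {}
    for table, rows in tables.items():
        field = SCHEMA[table]["user_field"]
        cleaned[table] = [row for row in rows if row.get(field) != user_id]
    return cleaned


tables = {
    "events": [
        {"uid": "u1", "action": "click"},
        {"uid": "u2", "action": "view"},
    ]
}
remaining = delete_user(tables, "u1")
```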
Alternatively, it’s possible that the X layer didn’t hide its own abstractions clearly enough, and the X-Z API is so influenced by the W-X API that it’s almost a passthrough.
Finally, this might happen because of scaling: a W-level system that worked fine and nobody noticed at small scale suddenly requires a lot more care and feeding by its clients when you reach large scale, and so a contact point that wasn’t a problem before suddenly becomes one.
How to get out of it: Check if it’s actually a problem before doing anything; not all W-Z contact is a bad sign. If it’s actually an issue, address it like you would any infrastructure usability issue: “this task is a burden on X/Z level teams because of a leaking W abstraction.” Can the task be automated away? Can (and should) the underlying system be redesigned or replaced to eliminate this kind of task altogether? Alternatively, should there be some intermediate X layer?
Some typical signs that it is a problem: unschematized data, where you can’t easily answer questions like “what data do we have” or “which of it is user data;” underlying systems being accessed in three different ways through different libraries adopted at various points; difficulty instrumenting your infrastructure (especially storage) to measure its behavior or insert rules specific to your needs. Any of these tends to be a solid indicator that an X layer is badly needed.
Generally, an intermediate layer is useful if the W layer is something you don’t properly control (3P software), if you may need to swap it out or support multiple W layers (e.g. different kinds of cloud), or if the useful abstractions needed by your X and Z layers are different from those provided by the W layer. In all of these cases, it’s generally good to keep this intermediate layer as thin as possible. That said, having a very thin intermediate layer can sometimes be better than having none, as the presence of such a shim makes it orders of magnitude easier to swap out the W layer in the future if you need to!
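Here is a minimal sketch of such a thin intermediate layer, with invented backends standing in for two different W-layer vendors. The `Storage` interface and all the names are hypothetical; the point is that X-layer code depends only on the shim, never on a vendor SDK, so swapping the W layer touches one class, not the whole codebase.

```python
from abc import ABC, abstractmethod


class Storage(ABC):
    """The shim: just the few operations the company actually needs
    from any W-layer storage provider."""

    @abstractmethod
    def write(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def read(self, key: str) -> bytes: ...


class InMemoryStorage(Storage):
    """Stand-in for vendor A."""

    def __init__(self):
        self._d = {}

    def write(self, key: str, value: bytes) -> None:
        self._d[key] = value

    def read(self, key: str) -> bytes:
        return self._d[key]


class LoggingStorage(Storage):
    """Stand-in for vendor B: same contract, different behavior."""

    def __init__(self, inner: Storage):
        self._inner, self.log = inner, []

    def write(self, key: str, value: bytes) -> None:
        self.log.append(("write", key))
        self._inner.write(key, value)

    def read(self, key: str) -> bytes:
        self.log.append(("read", key))
        return self._inner.read(key)


def save_profile(store: Storage, user: str, blob: bytes) -> None:
    """X-layer code sees only the shim, so the backend is swappable."""
    store.write(f"profile/{user}", blob)


store = LoggingStorage(InMemoryStorage())
save_profile(store, "alice", b"hello")
```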
Horizontal Disconnects
The Problem: Horizontal teams (security, etc) aren’t being properly involved in the design and launch process of other teams. This leads to serious failures where company-wide resources are damaged, with consequences ranging from loss of uptime, to loss of revenue, to loss of user trust, to loss of life. The immediate predecessors of this are typically either teams doing something with the best of intentions but not realizing the danger until too late, or business leaders saying “no, we should take the risk and do it!” because they either don’t understand the risk properly, or (worst of all) have misaligned incentives and would rather take a risk and have their project “succeed” in some sense even if it fails in others.
Note how under the hood, there are actually two very different problems here:
- In safety engineering, unlike product engineering, what you don’t know very much can hurt you. Teams without someone who knows how to ask the right questions and raise the appropriate alarms can walk into disaster unknowingly.
- People are bad at risk assessment for many reasons: comparing frequent, slightly-good events against rare, really-bad events is hard. Perverse incentives or tunnel vision may cause people to delude themselves into thinking things aren’t really that bad until it’s too late. Teams without someone who can catch the problem and escalate until a level is reached where sanity prevails can walk into disaster knowingly. Those who cannot remember history are condemned to repeat it; those who can remember history are condemned to watch helplessly as other people repeat it.
How it happens: There are a few things that can cause this problem.
- Sometimes a big subset of the company, or even the company as a whole, doesn’t understand some category of risks enough to even realize that they need to defend against them. This is especially common in startups or in more isolated parts of large companies. On a continuum with this, there may not be enough general awareness among non-specialists (the people on W, X, and Z teams, for example) about the issue to know when they need to ask for help.
- There can be a lack of process which ensures that the horizontal teams have the ability to work with each other team and ensure things are on-track; alternatively, the horizontal teams may be radically understaffed, so many things slip through the cracks.
- The horizontal teams may not know how to effectively support other teams; for example, they may be constructed more as compliance teams (which can check boxes but can’t help people design), or otherwise have bad relationships with those teams.
- There may be a lack of support for protecting this resource at the senior leadership level, which means that all escalations eventually fail. What this really means is that senior leadership doesn’t care about this resource at all.
How to get out of it:
What you need to do depends a lot on which underlying problem is going on. For the easy ones:
- If there’s a lack of understanding of the issues, hire experts or otherwise find them and bring them into your group. Escalate if you can’t get the resources. Well-functioning horizontal teams are particularly good at helping others learn about the subject, and often love doing it, so they’re the ones to ask!
- A lack of support for protecting the resource is the most serious problem, and generally, this has to be solved from the top down: senior leadership needs to understand and continually express their focus on preserving this, and propagate that priority downwards. Setting priorities like this and preserving company-wide shared resources is a key part of their job. If you find yourself at a company where senior leadership understands but legit doesn’t care, all I can say is that this is a good time to start thinking about how you can find another job. Bad things are going to happen here.
One important technique I’ve found useful in this case is to reframe the problem away from “risks” and towards “playbooks.” It’s too easy for an executive to discount a risk as something that won’t really happen. But if you frame it as “When X happens, we are going to need to make the following decisions. Who should we escalate to in that situation? Is that you?”, the reality of the consequences becomes much more personal, and you get people’s attention.
The more complex underlying problem is when there’s a lack of process, or the horizontal teams aren’t set up to effectively support others. It turns out that there’s a known-good model for how to do this, but that’s a subject for another essay!