Another year, another wave in how we build software. We now live in a post-software-eaten world, where software is like the water that fish aren’t even aware exists. And yet, we have struggled to pin down the mysterious aether that makes the best software systems delightful while the rest are clunky, decaying, hulking beasts struggling to stay afloat.
The latest trend is Platform Engineering, the idea that a team in your company should take care of the common shared services that other development teams rely on to build their products and services.
Platform Engineering is not the first attempt to fix the universal difficulties of building and running software, and I’m almost sure it won’t be the last. Agile, DevOps, Site Reliability Engineering, and Digital Transformation have all come before with lofty promises and varying results. Software solutions have bred new problems, many of which eager vendors solve by providing solutions that create new problems, ad infinitum. Platform Engineering attempts to tame the endless cycle of innovation, adoption, fragmentation, and stagnation that the tireless march of progress demands.
Why Can’t I DIY?
You need tooling to support development efforts and keep engineers productive. It’s impractical to expect application development teams to run everything. So you assemble a set of third-party services and open-source systems and add other non-development folks to keep the system running. No matter what kind of applications you build, you write them on top of a platform.
You need a starting point to build your software. You may adopt frameworks and tools like databases, message queues, container orchestration, operating systems, and various developer tools to set up the “stack” or environment in which your applications run. Special cases and individual preferences disrupt uniformity in favor of productivity-enhancing fragmentation. There is a delicate balance between a simple, monolithic approach that is easy to understand and maintain and the ultimate flexibility of letting each team do whatever they want.
A simple dichotomy for divvying up the tasks in building and maintaining software systems is to divide development and operations. Dev builds new features and capabilities for customers, and Ops ensures that they run correctly after release.
DevOps attempts to eliminate this split by taking a worldview that “if you build it, you run it.” The theory goes that by putting operational ownership in the hands of the people who create the bugs (the developers), natural incentives will force the organization to reckon with technical debt and build better software out of the gate.
The challenge in practice is that DevOps has become a third silo that sits between Dev and Ops and takes care of things neither side wants to (or can) do. The role has proliferated despite the theory that a “DevOps Engineer” shouldn’t exist in the first place. It has become a shorthand for people who handle setting up CI/CD systems, configuring infrastructure as code, provisioning cloud environments for new applications, and frantically trying to keep up with constant change from every side.
Site Reliability Engineering
The concept of SRE sounds good in theory and can be pretty effective in the right size and type of organization. When you reach sufficient scale and complexity, ensuring reliability becomes an effort in itself. This role often handles repetitive operations tasks, then identifies and eliminates “toil” by adopting or building new delivery and productivity tools and processes, making the company’s overall systems more competitive over time.
In practice, many organizations have stumbled in achieving the promise of SRE. Some simply rebrand ops teams without changing skillsets or implementing the projects that would reduce toil. Others give SRE teams too little leeway to change anything, leaving them screaming into the void about lofty ideals without following through. The expense goes up (higher-skilled staff and preventative capacity can be up to 4x more expensive), and the return on investment remains questionable at best.
Perhaps a noble goal for aged enterprises trying to find their footing in an increasingly online world, the concept of Digital Transformation sounds like converting old family videos to MPEG files or scanning paper contracts and moving to DocuSign. Management consultants were happy to lead these expensive, slow, and painful projects, dragging organizations kicking and screaming into the cloud era. Costly data leaks added some momentary motivation, as did splashy multi-year deals with global system integrators. But of course, the biggest driver of Digital Transformation was COVID-19, which finally forced everyone to figure out how to order dinner using a QR code.
Here Comes the Platform
The word “platform” is overloaded and can mean pretty much anything. Let’s try to be a bit more precise about what people mean when talking about the “platform” in most “platform engineering” teams.
First of all, a platform implies that you can build on it. On its own, the platform sits inert, ready to be used. Unless an organization can use the platform for several purposes, it’s just part of the product. The platform serves applications, workloads, tenants, operators, developers, and customers (indirectly).
The fact that most software architectures now follow a “services” orientation means the platform has to live to serve. So we can say that a platform provides services to other teams building on top of it.
Whenever we try to pinpoint a proper definition of “the platform,” we start sounding like marketing blather. You could say we’re talking about a “computing platform,” a “digital infrastructure platform,” or even a “cloud platform.” The platform we’re talking about encompasses something simultaneously vague and tangible (and decidedly not technical): the people, processes, tools, documentation, knowledge, escalation paths, vendor contracts, SLAs, expectations, plans, and cost structure that make the platform an ongoing concern for the organization.
I propose the following tests to determine if we’re talking about a platform:
- Platforms provide valuable services.
- Platforms conform to reusable interfaces.
- Platforms need workloads to be valuable.
- Platforms support end-users indirectly.
- Platforms improve productivity.
- Platforms are funded and maintained.
- Platforms set the bar for reliability, security, and performance.
- Platforms make it easy to do some things and more challenging to do others (intentionally or not).
- Platforms have to change carefully to avoid ripple effects.
- Platforms may include manual tasks, tickets, run books, and knowledge.
The kinds of things that platforms (whether called that or not) commonly include:
- Basic computing, network, and storage.
- A ticketing system to handle bespoke or random requests.
- Identity, authentication, and authorization services.
- Readiness checklists.
- Security scanning.
- Monitoring, observability, and reporting.
- An SLA.
- Various databases.
- Message queues.
- Disaster recovery and failover.
- A mail server (how handy is that?)
If you don’t have a platform team in your company, you haven’t identified the platform yet. Maybe you’re expecting vendors to make it all work for you. Or maybe you’re counting on digital transformation consultants to integrate these systems so that they work seamlessly, reduce risk, and produce excellent ROI.
Notably, the platform engineer has emerged to step up and take ownership. They attempt to turn this grab bag of poorly documented tools, services, open source projects, elegant hacks, TODOs, and technical debt into something resembling a product. The goal is not esoteric (“productivity” or “reliability”) but practical (“let’s build a thing that people want to use”).
Platform as a Product
Consider the platform within your company as a product you are taking to “the market.” You want to build something your organization wants to use, so they will lobby management to ensure your ongoing resources and existence. So you need to go where the action is: the big teams with critical missions that you can help move faster, fail less often, and run efficiently. You must understand your (internal) customers’ priorities and pain points, find opportunities to build economies of scale, and settle on a sustainable business model. You will need to make the platform easy to adopt while also considering the real-world constraints of your current situation.
Applying a product mindset to your platform leads to an entirely different outcome than adopting DevOps, SRE, Digital Transformation, or Agile. You now have a thing (a platform!) that people want to use that takes some burden off their plate and lets them focus. They don’t have to worry about all the knobs and switches that The Cloud offers; they have a restricted menu informed by the platform’s constraints and decisions. Of course, no one will use the platform for its own sake; the team has to earn the trust and the right to serve these customers, or they will take their workload and go home.
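To make the “restricted menu” idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the `ServiceRequest` type, the `Size` and `Database` options, and the resource numbers do not come from any real platform. The point is only that application teams pick from a short list while the platform owns the underlying knobs.

```python
from dataclasses import dataclass
from enum import Enum


class Size(Enum):
    """Hypothetical T-shirt sizes: the only sizing knob teams get."""
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


class Database(Enum):
    """Hypothetical menu of supported databases."""
    NONE = "none"
    POSTGRES = "postgres"


@dataclass(frozen=True)
class ServiceRequest:
    """Everything an application team is allowed to ask for."""
    name: str
    size: Size
    database: Database = Database.NONE


# Platform-owned decisions (illustrative numbers, not recommendations).
_SIZE_TO_RESOURCES = {
    Size.SMALL: {"cpu": 1, "memory_gb": 2, "replicas": 2},
    Size.MEDIUM: {"cpu": 2, "memory_gb": 8, "replicas": 3},
    Size.LARGE: {"cpu": 8, "memory_gb": 32, "replicas": 5},
}


def parse_request(raw: dict) -> ServiceRequest:
    """Reject anything that is not on the menu."""
    unknown = set(raw) - {"name", "size", "database"}
    if unknown:
        raise ValueError(f"not offered by the platform: {sorted(unknown)}")
    return ServiceRequest(
        name=raw["name"],
        size=Size(raw["size"]),  # raises ValueError for unknown sizes
        database=Database(raw.get("database", "none")),
    )


def provision_plan(request: ServiceRequest) -> dict:
    """Expand a small request into the full set of platform decisions."""
    plan = {"service": request.name, **_SIZE_TO_RESOURCES[request.size]}
    plan["database"] = request.database.value
    plan["monitoring"] = True  # always on; teams cannot opt out
    plan["backups"] = request.database is not Database.NONE
    return plan
```

A team would call `provision_plan(parse_request({"name": "checkout", "size": "small"}))` and get sensible defaults back, while a request that tries to set, say, a kernel version is refused. Whether refusing or merely warning is the right choice depends on how much trust the platform team has earned.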
I don’t know why or where it came from, but ownership works. It might be deep in human psychology, or it might be an artifact of hierarchical management structures. In the real world, when you’re trying to run a significant endeavor, you need to know who’s doing what and trust that they will keep doing it and (to be honest) stay in their lane. Otherwise, the whole organization falls apart. Politics, finger-pointing, apathy, and resume-sharpening soon follow once no one knows the org chart.
SREs at Google don’t own reliability and mostly see themselves as internal reliability consultants (when they aren’t answering the dreaded pager). Google SREs attempt to help product teams make good choices in designing their systems. They help set service level objectives (SLOs) to make sure reliability targets are clearly defined. Perhaps most importantly, they push the team to consistently adopt standard services from the overall Google technical infrastructure (what we might describe as a platform). SREs are the customer success team of the platform-as-a-product.
(footnote: Google Cloud Platform has a team of Customer Reliability Engineers (CRE) that behaves like a customer success team for the cloud and provides technical guidance on effectively adopting GCP to get the most out of it.)
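One common way to make an SLO “clearly defined” is to turn it into an error budget: the amount of unreliability the target still permits over a window. The sketch below is an illustration under simple assumptions (an availability SLO measured in minutes of downtime over a fixed window), not a description of how Google or any particular team computes it. The function names are made up for this example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window.

    slo_target is a fraction strictly between 0 and 1, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

With a 99.9% target over 30 days, the budget works out to roughly 43.2 minutes of downtime. A team that has already had about 11 minutes of outage has spent around a quarter of it, which gives product and platform teams a shared number to argue about instead of a vague feeling.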
The Platform Engineering trend is building. The pace of articles is increasing. The number of folks on LinkedIn and Twitter adding Platform to their title is growing. CEOs are talking about Platform Engineers as their key customer.
As I was trying to understand better what was happening, I asked all my SRE friends what they thought of platform engineering. I was sure everyone would be skeptical, dismissing it as a new name for an old thing and pooh-poohing it as a wannabe SRE discipline and marketing spin.
To my surprise, more than one SRE told me, “in reality, we’re a platform team more than an SRE team.” And friends and colleagues I know who had previously worked in performance engineering, security, and data management have all moved into platform engineering.
What I like about platform engineering most is the ownership. What I like about it least is the lack of clarity about what the platform is and what it’s not. Perhaps this is a limit of our language in briefly describing the vast complexity of how software serves humanity and the immense effort, resources, and brainpower that go into wrangling the binary beast.
I take proclamations about the End of DevOps and the Failure of SRE with a grain of salt. The work of DevOps Engineers still needs to be done, even under a new name.
If platform teams come together within your organization and take ownership of the common problems that other product, application, and IT teams face, then instead of a faster boat, the platform is the rising tide.