This is a continuation of the series taking you on my journey into the world of cloud-native observability. It’s a world that is altering the way developers work in their daily jobs, creating new teams, and giving rise to new roles that attempt to keep control of the complexity these large-scale cloud-native architectures deliver.
The first article in this series covered how developers have to deal with more than just code in a cloud-native world. It shared a look at cloud-native observability (o11y) and touched on what the three pillars are versus the three phases of observability.
This second article takes you out onto the playing field where you need to understand who the players are and what teams they form. It’s no longer a world full of developers and operations teams as the cloud-native environments have pushed right on through those traditional walls.
Let’s dive right in, shall we?
The basic introduction started with developers in a world without clouds and then followed their transition to a cloud-native development world. What does this mean for them, and what are some of the challenges they are having to embrace?
The Playing Field
Over time, the traditional developer and operations teams have transitioned to different ways of working in the cloud-native world. Developers moved into DevOps teams, where development and operations activities merge and teams try to make their processes more agile. Operations teams, having tried DevOps, then moved to a more mature structure called CloudOps, with a clear focus on cloud infrastructure. Finally, today we’re seeing a role emerge known as the Site Reliability Engineer (SRE), who is part of a team focused on a broader spectrum of modern resource reliability and not just on the organization’s cloud infrastructure.
DevOps is a first step on the road to cloud-native operations and bridges both development and operations teams. As defined in the article “DevOps vs. CloudOps – What You Need to Know,” you see that they have a specific mandate:
“DevOps is primarily the automation and optimization of the application development lifecycle, including post-launch fixes and updates. It uses continuous development, integration, testing, and deployment of cloud, computer, and downloadable applications. It also focuses on IT operations as they relate to application performance and availability.”
By bringing operations and development closer together to focus on processes and automation, DevOps teams are making the push for agility, reliability, and speed in support of their organization’s business goals. DevOps remains focused on application development and delivery, often because the organization runs more than just cloud-native infrastructure.
The definition given by Professional DevOps.com puts CloudOps at the center of a business operational focus.
“…CloudOps provides organizations with proper (cloud) resource management. In an organization, CloudOps uses DevOps principles and IT operations applied to a cloud-based architecture to speed up the business processes.”
This is a shift towards operations focusing more specifically on the cloud-native infrastructure than on the other infrastructures available in an organization. Once the dependency on past infrastructure choices has been reduced, these teams are scaled up to improve the development architecture (infrastructure in the cloud). They focus on simplifying cloud provisioning and application deployment to the cloud, and they are big users of observability platforms for both applications and infrastructure in the cloud.
Site Reliability Teams
Oscar Wilde once said, “With age comes wisdom, but sometimes age comes alone.” As organizations become more active in a cloud-native world and scale up to full CloudOps teams alongside their DevOps teams, another role is emerging to fill a gap left behind. That role is the SRE, and it doesn’t focus only on the cloud-native infrastructure. As noted by Chris Tozzi:
“…an SRE is an all-purpose role that aims to manage reliability for any type of environment.”
SREs have to use both IT operations and development strategies to ensure a focus on one thing, and one thing only: reliability. It’s a full-time job avoiding downtime, optimizing the performance of all applications, and supporting infrastructure regardless of whether it is in the cloud-native world or not. Together with CloudOps teams, they are very active players in cloud-native observability and in the platforms used to assist them. They have a vested interest in cloud or multi-cloud security, costs, deployment automation, and all things that help observability at scale.
The Observability Game
This takes us from the basic introduction through a tour of the o11y playing field, where you’ve now met the players on the teams involved in cloud-native o11y.
Next up, I want to dive deeper into the pillars of monitoring and why at scale you might want to start thinking about the phases of cloud-native o11y instead.