What is cloud native and how it relates to SRE
Ask yourself or your colleagues what cloud native, as-a-Service, or cloud-first means, and you will get different answers. Responses might range from "cloud-first" to "born in the cloud" to "cloud native means microservices and containerization".
The Cloud Native Computing Foundation (CNCF) defines cloud native as follows:
- Cloud-native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach. These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.
Essentially, cloud native is about the balance between resiliency and agility. It is an approach to building and running responsive, scalable, and fault-tolerant applications that can run anywhere: in public, private, or hybrid clouds. Another lens for understanding cloud native is the Twelve-Factor App (https://12factor.net/), a set of best practices that guide building applications with built-in performance, automation, resiliency, elasticity, and diagnosability.
Let's explore the meaning of the following cloud native terms:
Designed for automation:
- Automation of development tasks
- Test automation
- Automation of infrastructure provisioning, updates, and upgrades
Designed for resiliency:
- High availability (e.g., multi-zone regions or stretch clusters)
- Fault tolerance and graceful degradation
- Backup & restore
Designed for elasticity:
- Automated scale-up and scale-down
Designed for performance:
- Responsiveness with SLO and SLI defined
- Efficiency and capacity planning
Designed for diagnosability:
- Logs, traces, and metrics (see the instrumentation sketch after this list)
Designed for efficient delivery:
- Modular, microservices-based
- Automated deployments, upgrades, and updates
- Efficient build process
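To make "designed for diagnosability" concrete, here is a minimal Go sketch of a service instrumented with a Prometheus request counter and a per-request log line. The metric name, labels, and port are illustrative choices, not prescriptions:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Request counter labeled by path and status code; counters like this
// are the raw material from which SLIs (discussed later) are computed.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests by path and status code.",
	},
	[]string{"path", "code"},
)

func handler(w http.ResponseWriter, r *http.Request) {
	requestsTotal.WithLabelValues(r.URL.Path, "200").Inc()
	log.Printf("method=%s path=%s", r.Method, r.URL.Path) // log line per request
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	http.Handle("/metrics", promhttp.Handler()) // Prometheus scrape endpoint
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```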
These concepts describe SRE practices in a nutshell. Applying these practices throughout the development life cycle steers the architecture toward common standards.
An important thing to note is that merely containerizing an application as-is does not make it cloud native. These days it is possible to containerize almost any application; however, it takes additional effort to create a containerized application that can be automated and orchestrated effectively and behave as a cloud native application on a platform such as Kubernetes. One example is an application that uses Kubernetes health probes, such as liveness and readiness probes, to enable graceful degradation; a minimal sketch follows. For more details, see the blog Are your Kubernetes readiness probes checking for readiness? Going through all the patterns is beyond the scope of this article. Kubernetes provides a portable, extensible platform for managing containerized workloads and services that facilitates both declarative configuration and automation. How each of the cloud native practices can be achieved with Kubernetes will be the topic of my subsequent blog. Some additional resources are included at the end of this post.
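Here is a sketch of the application side of that pattern. The /healthz and /readyz paths and the initDependencies helper are hypothetical; the actual probe endpoints, thresholds, and timings are configured in the pod spec:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// ready flips to true once dependencies (database connections, caches,
// config) are initialized; until then the readiness probe fails and
// Kubernetes keeps the pod out of Service endpoints.
var ready atomic.Bool

func main() {
	// Liveness: returns 200 as long as the process can serve HTTP at all.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: returns 200 only when the service can do useful work,
	// which lets the platform degrade gracefully instead of routing
	// traffic to a pod that cannot handle it.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	go func() {
		initDependencies() // hypothetical startup work
		ready.Store(true)  // mark ready only once it completes
	}()

	log.Fatal(http.ListenAndServe(":8080", nil))
}

func initDependencies() { /* connect to DB, warm caches, etc. */ }
```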
Many applications are complex and take years to build. Additionally, many are built in a layered architecture, with contributions from a number of teams and technology groups. With a layered architecture, any user action might go several levels deep: from user interaction to authorization to a backend business-logic service to automation processing, with additional layers depending on the use case. To reduce complexity, improve efficiency, and speed up development, it is critical to apply the cloud native lens to each layer of the architecture when delivering such a service. Cloud native practices apply to the software delivery model as well.
What is SRE and how SRE practices can be part of the development lifecycle
The role of the SRE is to keep the organization focused on what matters most to users: ensuring that the platform and services are reliable. If you are familiar with the traditional disciplines of development and operations, SRE bridges the two. The goal of SRE is to codify every aspect of operations in order to build resiliency into infrastructure and applications. This implies that reliability deliverables flow through the same CICD pipeline as development deliverables: managed with version control tools and checked for issues with test frameworks.
In summary, SRE treats operations as a software delivery problem: it applies a software engineering approach to solving operational problems.
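One simplified illustration of that idea: a reliability requirement expressed as an ordinary unit test that gates the pipeline. The Service type and catalog below are hypothetical stand-ins for version-controlled service metadata:

```go
package catalog

import "testing"

// Hypothetical service catalog entry; in practice this would be loaded
// from version-controlled config (YAML or JSON) in the same repository.
type Service struct {
	Name            string
	AvailabilitySLO float64 // e.g., 0.999 for 99.9%
}

var catalog = []Service{
	{Name: "checkout", AvailabilitySLO: 0.999},
	{Name: "search", AvailabilitySLO: 0.99},
}

// TestEveryServiceHasSLO fails the pipeline if a service ships without
// a plausible availability objective, making reliability a gated,
// version-controlled deliverable like any other code.
func TestEveryServiceHasSLO(t *testing.T) {
	for _, s := range catalog {
		if s.AvailabilitySLO <= 0 || s.AvailabilitySLO >= 1 {
			t.Errorf("service %q: availability SLO %v out of range (0,1)",
				s.Name, s.AvailabilitySLO)
		}
	}
}
```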
In an Embedded SRE model (described in the SRE model section), Development and SRE collaborate throughout the lifecycle of MVP delivery. As the MVP progresses through technical feature specification and development, the SRE collaborates with Development and OM to ensure that cloud native practices are enabled: for example, by identifying critical user journeys and the associated key SLIs and SLOs for each component.
The SRE should understand the service design, including front end, back end, business logic, and database dependencies. This understanding is critical in order to document all failure points and deliver automation for service restoration. Using this service design knowledge, the SRE should ensure delivery of the automation described in the cloud native section.
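As a rough sketch of what codified restoration automation can look like, consider a watchdog that probes a health endpoint and triggers a restart on failure. The endpoint URL and the systemctl command are placeholders; real automation would more likely call a platform API such as Kubernetes:

```go
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

// checkOnce probes a (hypothetical) service health endpoint and
// reports whether the service answered with HTTP 200.
func checkOnce(url string) bool {
	resp, err := http.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	const healthURL = "http://localhost:8080/healthz" // assumed endpoint
	for {
		if !checkOnce(healthURL) {
			log.Println("health check failed; attempting automated restart")
			// Placeholder restoration action; substitute whatever the
			// documented failure point actually requires.
			if err := exec.Command("systemctl", "restart", "myservice").Run(); err != nil {
				log.Printf("restart failed: %v", err)
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```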
As illustrated in the following diagram, Development and SRE collaborate to deliver functionality and reliability for the MVP by using the same CICD delivery pipelines and release processes, while each focuses on its own success metrics.
No organization starts from scratch, and shifting left for legacy services might not be as easy as for new ones. Incubating shift-left SRE with new services is a good way to start; it can then be applied iteratively to existing legacy services.
In some development models, there is a concept of "DONE, DONE, DONE", which implies code: DONE, test automation: DONE, and documentation: DONE. Enabling SRE in a development organization implies DONE, DONE, DONE, and DONE; the additional "DONE" is for SRE enablement.
Measuring SRE
As organizations build a development process in which SRE and Development collaborate to deliver instances of the MVP, the question becomes: how do we measure the effectiveness of this process? For that, we need to look at the critical metrics committed to both externally and internally.
Service Level Agreement (SLA) – An SLA reflects customer expectations. It sets a promise to the consumer in terms of service availability and performance, and there are business consequences if the promise is not kept.
Service Level Objectives (SLO) – SLOs are the reliability and performance goals that a service sets for itself. They are visible internally. Every service should have an availability SLO, and the SLO determines how much investment is needed in the reliability of the service: more critical services should have higher SLOs. From the SRE perspective, the SLO defines the goal that SRE teams have to reach and measure themselves against.
Now the question is: how is an SLO defined? The metrics that define SLOs should be limited to those that truly measure performance. Every service should consider client-side impact when defining these metrics.
Service Level Indicator (SLI) – An SLI is a metric that enables measurement of compliance against an SLO. Think of SLIs as the set of Key Performance Indicators (KPIs) that matter to customers. It is important that SRE, Development, and OM reach agreement on the SLIs that define the SLO, and hence the SLA.
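To tie the three together, consider an availability SLI computed as the ratio of good requests to total requests, compared against an error budget derived from the SLO. The 99.9% objective and the request counts below are illustrative:

```go
package main

import "fmt"

func main() {
	const slo = 0.999 // 99.9% availability objective (illustrative)

	// Hypothetical request counts for one measurement window, e.g.,
	// derived from a request counter like the one instrumented earlier.
	var total, good float64 = 1_250_000, 1_249_100

	sli := good / total         // measured availability
	budget := (1 - slo) * total // bad requests the SLO allows this window
	burned := total - good      // bad requests actually observed

	fmt.Printf("SLI: %.5f (SLO %.3f)\n", sli, slo)
	fmt.Printf("error budget: %.0f requests, burned: %.0f (%.1f%%)\n",
		budget, burned, 100*burned/budget)
}
```

When the burned portion of the error budget approaches 100%, common SRE practice is to slow feature releases and spend the remaining effort on reliability work.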
See the following diagram for examples: