Observability in DevOps – what you need to know
Patryk Plonka, Future Processing blog, 25 November 2021

In this article we look at observability in more detail – do read on to see what it is and why it is so important.


Why is observability so important?

As stated by the 2019 Accelerate State of DevOps Report, “delivering software quickly, reliably, and safely is at the heart of technology transformation and organisational performance. We see continued evidence that software speed, stability, and availability contribute to organisational performance (including profitability, productivity, and customer satisfaction).

Our highest performers are twice as likely to meet or exceed their organisational performance goals”.

But to develop software quickly and effectively, one needs reliable solutions not only to build it, but also to understand its current health. The latter can only be achieved by examining the data the system generates: logs, traces, and metrics. In a word: by observability.


Observability in Kubernetes – a few facts

So, what is observability? Its main goal is to allow you to understand what exactly is happening across all the environments within your software, so you can find and address any issues which may prevent the system from being efficient and reliable.

It helps you to:

  • understand which services a request went through, and where the performance bottlenecks were,
  • see how the execution of the request was different from the expected system behaviour,
  • establish why the request failed,
  • check how each microservice processed the request.

In the last few years, as cloud-native environments have become more complex and more widely used, observability has become more critical than ever.



Observability, monitoring and analysis

Despite the fact observability has recently become so important, there is still a lot to be said about it, and about how it differs from monitoring. The two terms are sometimes used interchangeably, which is not correct, as observability and monitoring are two different concepts.

Let’s investigate why.

In DevOps, observability means making data from the system you want to monitor available. Monitoring, on the other hand, means collecting and displaying that data.

As stated by the SRE book by Google, your monitoring system needs to answer two simple questions: what is broken, and why is it broken.

Simply put, it informs you that something is wrong, while observability enables you to understand the reason for why it is wrong. Monitoring is impossible without some level of observability.

Another component of effective observability is analysis: once you have made your system observable and collected data via monitoring, you need to conduct analysis, which will answer some of the most important questions about the system you are working on and its health.

The Pyramid of Power


Building a Continuously Observable System in DevOps: pillars of observability

Three pillars of observability

It may sound complicated but achieving observability doesn’t have to be difficult. To start with, concentrate on three key pillars that contribute to observability’s success:


Metrics

Metrics are numeric data that can be aggregated over a period of time. They can come from many different sources, such as cloud platforms, hosts or infrastructure. Metrics tell you, for example, how much of the total amount of memory is used by a method, or how many requests a service handles per second.

  • A good example of a tool used for collecting metrics is Prometheus.
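As a toy illustration of what "aggregating over a period of time" means, the sketch below (plain Python, with hypothetical names) computes a requests-per-second rate over a sliding window – the kind of calculation a tool like Prometheus performs at scale:

```python
from collections import deque

class RequestRateMetric:
    """Toy metric (illustrative names): aggregates request timestamps into
    a requests-per-second rate over a sliding window. Production systems
    delegate this kind of aggregation to tools like Prometheus."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.timestamps = deque()

    def record(self, ts):
        self.timestamps.append(ts)

    def rate(self, now):
        # Drop samples that fell out of the window, then average the rest.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window

metric = RequestRateMetric(window_seconds=10)
for t in range(20):            # one request per "second" for 20 seconds
    metric.record(t)
print(metric.rate(now=20))     # only timestamps 11-19 remain -> 0.9
```

Real metric systems also handle labels, persistence and scraping; the point here is only the aggregation step.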


Tracing

Tracing shows the activity of a transaction or a request inside applications. Capturing traces of requests and determining what is happening throughout the request chain allows you to find issues within the system and determine which components are responsible for errors.

Tracing is considered the most important part of an observability implementation, as it allows you to understand the actual reason for each issue.

  • A popular tool used for tracing is Jaeger. 
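To make the idea concrete, here is a minimal sketch (plain Python, not the Jaeger API; all names are hypothetical) of how spans sharing a trace id can be stitched together and the slowest component identified:

```python
import uuid

# Minimal illustration of trace data: every span carries the same
# trace_id, so spans emitted by different services can be stitched
# into one request chain.

def make_span(trace_id, service, duration_ms):
    return {"trace_id": trace_id, "service": service, "duration_ms": duration_ms}

def slowest_span(spans):
    # The longest span in the chain is the likely performance bottleneck.
    return max(spans, key=lambda s: s["duration_ms"])

trace_id = str(uuid.uuid4())
spans = [
    make_span(trace_id, "api-gateway", 12),
    make_span(trace_id, "orders-service", 180),
    make_span(trace_id, "payments-service", 45),
]
print(slowest_span(spans)["service"])  # -> orders-service
```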


Logs

Logs are text records of discrete events that happened within a certain timeframe; they allow you to identify unpredictable behaviour in a system. For complex ecosystems with many components, such as Kubernetes, structured logging becomes very important.

It’s recommended to ingest logs in a structured way, for example using JSON format, so that logs become easily queryable.

The number of logs grows quickly, which makes them difficult to manage and store. Fortunately, there are tools which help to increase the effectiveness of logging. One such tool is OpenTelemetry – it can be used not only for logging, but also for metric collection and tracing.

  • OpenTelemetry integrates with popular frameworks and libraries, such as Spring, ASP.NET Core, and Express. Other good tools used for logs analysis are Elastic and Loki.
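A minimal sketch of structured JSON logging with Python's standard logging module – the formatter and field names are illustrative, not a prescription:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object per line, so a log pipeline
    (Elastic, Loki, ...) can query fields instead of grepping free text."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment accepted")
# emits: {"level": "INFO", "logger": "checkout", "message": "payment accepted"}
```

In practice you would add timestamps and request/trace ids to each record, which is exactly what makes logs correlatable with traces.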

Remember that continuous automated observability lets you stay on top of any risks or problems throughout the whole software development lifecycle. It gives you insights into your infrastructure and systems, while providing you with valuable information on their health at any time.


Monitoring Kubernetes

Today, Kubernetes is the dominant platform for deploying and maintaining containers. But, as stated by Kelsey Hightower, Principal Engineer at Google working on Google’s Cloud Platform, “it is only as good as the IaaS layer it runs on top of. Like Linux, Kubernetes has entered the distro era”.

Even if your Kubernetes system does not show any errors, you may still encounter issues outside of Kubernetes that can pose certain risks. Let's see where else you can run into problems while using Kubernetes:


Cloud provider/infrastructure layer

Some problems can be linked to the infrastructure of your cloud provider or to your on-premise environment. The remedy is planning your resources: you don’t want to use them all up before your Kubernetes cluster starts to scale. To do so, you must keep track of the quotas configured on the cloud provider and monitor the usage and costs of the resources.

If you are running your Kubernetes environment on-premise, monitoring all infrastructure components is also of key importance. A good solution in both cases is log file analysis, which will allow you to detect problems before they escalate.


Operating system / Instance layer

Remember that you always need to keep your operating system up to date. Always make sure you check the status of your Kubernetes services and automatically install all security updates as soon as they become available. Log entries are a great source of information on the health of your system.


Cloud platform layer

A lot of issues within a Kubernetes environment are due to a growing number of applications while the infrastructure remains the same. The solution here is to check that all nodes, pods, and deployments are schedulable, and to always plan for a reserve in case one of the nodes fails.


Application layer

Your Kubernetes may display no errors, but that doesn’t mean you don’t have any issues on the application layer. Fortunately, you can use real user monitoring (RUM) to check the behaviour and experience your users have when using your application.

This allows you to identify errors which you haven't seen before, and which make your clients abort certain actions when using your software.


Business layer

Technical matters aside, even the best software cannot be successful if the customer doesn't like it or cannot use it. This is why it is so important to link changes and new features within your application to business-related metrics such as revenue or conversion rate.

When releasing an updated version of your application, compare metrics such as orders per hour to those from the previous version. If it looks like the update has a negative effect, you may consider rolling back to the previous, more successful version.
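As a sketch, a release check of this kind can be as simple as the following (the function name, metric and threshold are hypothetical):

```python
def release_regressed(orders_per_hour_before, orders_per_hour_after, tolerance=0.05):
    """Hypothetical rollback check: flag the release if orders per hour
    dropped by more than `tolerance` (5% by default) after the update."""
    drop = (orders_per_hour_before - orders_per_hour_after) / orders_per_hour_before
    return drop > tolerance

print(release_regressed(200, 196))  # 2% drop  -> False: keep the release
print(release_regressed(200, 170))  # 15% drop -> True: consider rolling back
```

In a real pipeline this comparison would run against monitoring data and feed an automated or human rollback decision.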


Observability in DevOps and beyond

When creating new software, there are many things to take into account, observability being just one of them. By using the right solutions, you can simplify the whole process, which will allow you to achieve better results in a shorter time!

At Future Processing we focus on delivering the best solution for your particular business. Visit our website to see how we can take care of your software development process by helping you at every stage of it, its effective observability included.

Kubernetes as a platform for containerised apps – why is it becoming so popular?
Future Processing blog, 26 August 2021

Companies that still prefer virtual machine (VM) hosting face significant problems these days, such as:

  • high total cost of ownership for infrastructure,
  • poor reliability,
  • prolonged release cycles,
  • insufficient availability,
  • lack of efficiency in computing resources,
  • weak scalability,
  • ineffective process for developing new services.

This is why it may be a good idea to start looking for a more modern alternative to classic infrastructure – one that will take the effectiveness of your work to a higher level.

Solution: Kubernetes as a next-level alternative to VM hosting

Kubernetes (aka K8s) is an open-source platform for containerised applications. It helps you automate deployment, manage apps, and scale up whenever needed. K8s was designed at Google to manage large production workloads and is strongly supported by the community. It's compatible with a wide variety of container tools and runs containers in clusters. Leveraged by a growing number of companies across the globe, Kubernetes is growing in popularity and consistently being chosen over many other solutions, both classic and modern – and here's why.

A short overview of K8s

Kubernetes is more than just a platform with a set of features for solving container orchestration needs. Of course, K8s was created to help us tame the complexity of microservices, as well as deploy and run our apps, but it’s much more than that. This is an entire ecosystem of tools, services, knowledge and support.

Together, this creates a compelling product that everyone can use according to the needs and requirements of their projects. It is especially the knowledge and support that make a real difference here.

When you bet on Kubernetes, you also receive a lot of support from the community, both in using and adopting it, not to mention the help you get whenever issues emerge. However, there are many other reasons for companies to turn to K8s.

7 top advantages of Kubernetes

K8s stands out from the competition for its:

  1. Flexibility and portability – Kubernetes works with a variety of container runtimes, underlying infrastructures and different configurations.
  2. Reliability and availability – K8s helps improve software stability and can be updated with little to no downtime.
  3. Scalability – Kubernetes is virtually future-proof since the platform supports complex, distributed and continuously growing systems.
  4. Cost-effectiveness – thanks to auto-scaling capabilities, K8s is usually cheaper than other solutions available for medium and large apps. The platform can scale up or down depending on the resources required by your app and user traffic, so you pay less for less busy hours.
  5. Efficiency – K8s enables high efficiency both in terms of developer productivity (thanks to upgraded deployment methodologies) and the use of computing resources.
  6. Support for stateful and stateless microservices – originally, Kubernetes offered architectural support for stateless applications only – now, it supports stateful apps as well.
  7. Support for different deployment strategies – these include rolling, canary, and blue/green deployments, with tools such as Flagger automating canary releases.

All of these features lay a solid foundation for our work, so we can adjust the platform to the projects that we run.
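To illustrate one of these strategies, a canary rollout boils down to sending a small fraction of traffic to the new version while the rest stays on the stable one; a minimal sketch (names and weights are illustrative):

```python
import random

def route_request(canary_weight, rng=random.random):
    """Sketch of canary routing (names are illustrative): send a
    `canary_weight` fraction of traffic to the new version and the
    rest to the stable one."""
    return "canary" if rng() < canary_weight else "stable"

# With a 10% canary weight, roughly one in ten requests hits the new
# version; a fixed seed makes this run deterministic.
rng = random.Random(42).random
sample = [route_request(0.1, rng) for _ in range(1000)]
print(sample.count("canary"), "of", len(sample), "requests went to the canary")
```

In Kubernetes this weighting is normally handled by the ingress or service mesh rather than application code; the sketch only shows the underlying idea.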

Kubernetes flexible adjustments

If you decide to use Kubernetes as a foundation for your software, you need to define your requirements and organisational structure, in order to find your golden balance between performance and costs.

  • Set clear definitions – define your services, roles and team interactions. Decide who is the provider, owner and consumer of the platform; who is responsible for what; and set clear boundaries, so that everyone knows exactly what to do, and how their work is going to affect others.
  • Simplify your Kubernetes ecosystem – DevOps engineers should reduce Kubernetes complexities to a strict minimum by providing the team with useful abstractions in order to minimise the cognitive load (in other words, the mental effort someone has to put in just to understand and use the platform).
  • Manage the platform wisely – make a plan for how and when to create and destroy clusters, isolate different environments or apps, as well as smoothly update to new versions of Kubernetes, etc.
  • Take care of security – configure role-based access control and keep GDPR data privacy regulations in mind.
  • Think about change management – remember that user needs often change with the technological ecosystem, so DevOps engineers should consistently make sure that the platform and its new features are understandable and usable for everyone involved, including new team members.

K8s: benefits for your business

If you’re now wondering how using Kubernetes in the cloud can positively affect your business, here is just a short list of benefits:

  • Reduced total cost of ownership – you don’t need to invest in any type of physical infrastructure, including the maintenance of certain pieces of equipment. Plus, you pay only for the services and resources that you actually use. And everything’s automated, from deployment down to management and scaling, so you don’t waste your precious time doing these tasks manually.
  • Increased uptime of your services – built-in scalability and reliability assurance systems will increase the availability of your software.
  • Maximised use of cloud computing resources – by using a Docker deployment platform, you can use a smaller and cheaper worker node VM to deploy your services.
  • Smooth and efficient releases – Kubernetes uses its own deployment methodology that enables quick releases. Even complex platforms can be created rapidly by reusing existing objects.

If this sounds appealing to you, we can help you get started on your Kubernetes journey. Also, if you still have any concerns, don’t shy away from contacting us, so we can answer any questions that you may have, free of charge!

Kubernetes: challenges and opportunities for DevOps
Future Processing blog, 27 May 2021
What is Kubernetes?

To start with, let’s speak about Kubernetes itself. As explained on their website:

Kubernetes is a portable, extensible, open-source platform for managing containerised workload and services, which facilitates both declarative configuration and automation. Kubernetes has a large ecosystem, is very popular and widely available. Its name means helmsman or pilot in Greek, and its abbreviation is K8s – eight being the number of letters you will find between K and s.

In the past, software development teams used to run applications on physical servers. Later on, multiple Virtual Machines (VMs) were introduced, and today these are being replaced by containers, which are lightweight, have their own filesystem and are portable across clouds and OS distributions. Containers have great advantages but need proper management to ensure there is no downtime.

And Kubernetes is a way of managing them – it provides a framework for doing so, it takes care of scaling and failover for your application and provides deployment patterns. Kubernetes builds upon Google’s extensive experience in managing containerised applications at scale.

The Kubernetes project is governed by a collaborative framework, managed by the Kubernetes Steering Committee, which guides its community and contributions.

Traditional vs Virtualised vs Container Deployment

In the case of large-scale containerised applications, the benefits offered by Kubernetes are unmatched. The Cloud Native Computing Foundation (CNCF) plays a crucial role in managing Kubernetes and is deeply involved in the evolution of containerisation technologies.

Kubernetes gives you the following options:

  • service discovery and load balancing,
  • storage orchestration,
  • automated rollouts and rollbacks,
  • self-healing,
  • secret and configuration management.


Why do you need Kubernetes?

K8s supports the automation of deployments, scaling applications, container management, and monitors workloads and changes. Application owners and development teams using the platform can focus more on the development of their product than on DevOps activities (infrastructure management and matching the product to its requirements).

Besides, Kubernetes can easily manage a cluster (a group of servers working together) thanks to its orchestrator capabilities. When K8s accepts a deployment, it divides it into workloads and distributes them across the servers in a cluster.

Workloads in K8s are created as containers and wrapped into standard cluster resources called Pods. Complex applications frequently operate over complex distributed infrastructure, and Kubernetes provides more insight into what is happening within an application, making it easier to identify and fix security problems.



What are the key components of a Kubernetes cluster?

A Kubernetes cluster consists of several key components that work together to manage containerised applications.

The control plane, also known as the master components, includes the kube-api-server, which serves as the front-end for the Kubernetes API and handles all administrative tasks. The etcd component is a distributed key-value store that maintains the cluster’s state and configuration.

Kubernetes – cluster architecture

The kube-scheduler is responsible for assigning pods to nodes based on resource requirements and constraints. The kube-controller-manager runs various controller processes to regulate the cluster’s state, while the optional cloud-controller-manager interacts with cloud provider APIs for cloud-specific resources.

On the worker nodes, the kubelet acts as an agent ensuring containers are running in pods and communicating with the control plane. The kube-proxy maintains network rules and performs connection forwarding, implementing part of the Kubernetes Service concept.

The Container Network Interface (CNI) addresses networking challenges within Kubernetes environments by enabling seamless integration and facilitating smooth communication between containers, particularly as deployments scale and complexity increases.

Each node also requires a container runtime, such as Docker or containerd, to actually run the containers. The Container Storage Interface (CSI) is significant in standardising integration with external storage systems, a pivotal move introduced in Kubernetes version 1.9 that led to the feature’s General Availability (GA) status within just a year.

Additional components include the cluster DNS server, typically implemented using CoreDNS, which enables service discovery. An optional web-based dashboard provides a user interface for cluster management and troubleshooting. Ingress controllers may also be deployed to manage external access to services within the cluster.

These components work in concert to provide the core functionality of Kubernetes, enabling efficient container orchestration, scaling, and management. Understanding the role of each component is crucial for effectively operating and maintaining a Kubernetes cluster.


7 principles for using Kubernetes in DevSecOps

Persistent storage is very important for ensuring data retention in containerised environments, especially when managing stateful applications that require data continuity despite the ephemeral nature of containers.

Additionally, implementing a robust storage system is essential for providing fast and reliable storage solutions within Kubernetes, addressing the demands of modern applications.

To achieve the best possible results with Kubernetes you need to follow the 7 principles listed by the Department of Defense Enterprise DevSecOps Reference Design.

Principles for using Kubernetes in DevSecOps


Remove bottlenecks and manual actions

Kubernetes allows developers, testers, and administrators to work hand in hand, making it easy for them to solve defects quickly and accurately. With Kubernetes, you can get rid of long delays associated with replicating development and test environments.

Besides, thanks to standardised instances, Kubernetes helps testers and developers quickly exchange precise information.


Automate as much as possible

Thanks to Kubernetes, you can automate many time-consuming tasks regarding development and deployment. Eliminating manual activities means less work and fewer errors which translates into a shorter time to market.

DevOps assessment and its benefits


Adopt common tools from planning and requirements through deployment and operations

Kubernetes provides many capabilities that allow one container to support many environment configuration contexts. In this case, there is no need for specialised containers for different environment configurations.

Besides, the ConfigMap object supports configuration data used at runtime. Importantly, the declarative syntax used to describe each deployment simplifies the management of the whole deployment process.
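A minimal, hypothetical ConfigMap illustrates the idea – the same container image can be pointed at different configuration per environment instead of baking config into the image (all names and values below are made up):

```yaml
# Hypothetical ConfigMap: keys become environment variables or files
# when mounted into a Pod, so one image serves many environments.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "info"
  FEATURE_FLAGS: "new-checkout=false"
```

A Pod can then consume it, for example via `envFrom: [{configMapRef: {name: app-config}}]` in its container spec.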


Leverage agile software principles with frequent updates

Thanks to their structure, microservices benefit the most from Kubernetes. Software designed with the twelve-factor app methodology and communicating through APIs is said to work best for scalable deployments on clusters.

Thus, Kubernetes is the best choice for orchestrating cloud-native applications, as modular distributed services favour scaling and fast recovery from failures.


Apply the cross-functional skill sets of development, cybersecurity and operations throughout the software lifecycle

Kubernetes is constructed with health-reporting metrics that enable the platform to manage lifecycle events if an instance becomes unhealthy. Thanks to robust telemetry data alerting operators, decisions can be made instantly.

It is also worth noting that Kubernetes supports liveness and readiness probes, providing a clear view of the state of containerised workloads.
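A sketch of what such probes look like in a Pod manifest – the image, paths, ports and timings below are assumptions, not recommendations:

```yaml
# Illustrative probes: the kubelet restarts the container if the
# liveness probe fails, and withholds traffic from it until the
# readiness probe succeeds.
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: example/web:1.0
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5
```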


Security risks of the underlying infrastructure must be measured and quantified

Kubernetes offers many different layers and components that ensure the highest security standards, including a scheduler that manages how workloads are distributed, controllers that manage the state of Kubernetes itself, agents that run on each node within a cluster and a key-value store where cluster configuration data is stored.

Of course, to remain immune to all types of vulnerabilities, one has to implement a cohesive defence strategy consisting of the following points.

First of all, you need to use security code scanning tools to check whether vulnerabilities exist within the container code itself. There is also a need to isolate Kubernetes nodes (servers) in a separate network, away from public networks. Access to cluster resources should be restricted with role-based access control (RBAC) policies, and resource quotas should be used to mitigate any disruptions.
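As an illustration of the quota point, a hypothetical ResourceQuota capping what one namespace can consume might look like this (names and limits are made up):

```yaml
# Illustrative ResourceQuota: caps pods, CPU and memory requests in one
# namespace, mitigating disruption from a runaway workload.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    pods: "20"
    requests.cpu: "8"
    requests.memory: 16Gi
```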

DevSecOps

One has to remember to restrict pod-to-pod traffic using Kubernetes network policies – the core data types for specifying network access controls between pods – and to implement network border controls that enforce ingress and egress rules at the network border. Besides, application-layer access control can be hardened with strong application-layer authentication, such as mutual transport-level security (mTLS).
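A hypothetical NetworkPolicy illustrating such pod-to-pod restrictions (all labels and ports are assumptions):

```yaml
# Illustrative NetworkPolicy: only pods labelled app=frontend may reach
# backend pods on TCP 8080; other ingress to them is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Note that such policies only take effect when the cluster's network plugin (CNI) supports them.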

In addition to the aforementioned, segment your Kubernetes clusters by integrity level which translates into hosting the dev and test environments in a different cluster than the production environment. We also advise utilising security monitoring and auditing to capture application logs, host-level logs, Kubernetes API audit logs and cloud provider logs.

Also, for security audit purposes, consider streaming your logs to an external location with append-only access from within your cluster. It may seem obvious, but remember to keep your Kubernetes versions up to date and use process whitelisting, which will allow you to spot unexpected running processes.


Deploy immutable infrastructure, such as containers

Kubernetes promotes scenarios in which deployed components are completely replaced rather than updated in place. Here, standardisation and emulation of common infrastructure components allow you to achieve predictable results.


Challenges around Kubernetes services

Kubernetes may seem like an easy and understandable system, but it’s not perfect. It is not a complete solution, but rather a complex operating system for a cluster, which requires many different components and tools and needs appropriate care and feeding.

Various Kubernetes tools extend the Kubernetes API, focusing on custom resources, custom controllers, and the operator pattern. These tools enable users to manage applications and services more effectively by automating tasks and providing a declarative API.

Common challenges around Kubernetes services

Among the most common plugins you will need to install to support your Kubernetes cluster are those dealing with:

  • the network layer,
  • authentication,
  • authorisation,
  • integration with the cloud,
  • a K8s management tool (or service).

In production, Kubernetes is not the easiest of solutions either. Challenges around using it include stability, security, and the management of the online applications being created. In the past, issues around using Kubernetes included repeated outages caused by the KIAM (K8s to AWS IAM) authorisation bridge.

The remedy was to restart the agent. A newer agent version introduced a feature allowing K8s to restart it in case of failure. It's a great solution, but to make use of it, someone needs to read and understand the KIAM changelog. And that's not something your team may be ready to do, with so many other things they need to take care of on a daily basis!

Other important aspects of Kubernetes are its updates – like all operating systems, Kubernetes gets updated on a quarterly release schedule, and you need to stay on top of that, as well as of its security, which is crucial for the security of your project.


Kubernetes – all you need to know

The complexity of Kubernetes and the challenges that arise around using it require solutions that make the whole process easier. When K8s becomes a part of your business, it needs to be constantly monitored and maintained, and you cannot possibly rely on one person doing it.

Observability in DevOps

What you really need is 24/7 access to help – in case the employee who is responsible for Kubernetes gets ill, someone else needs to be able to carry on as efficiently as the predecessor.

If your company is large enough and has an extensive IT department, it may be a good idea to set up an internal team responsible for this part of DevOps. Beware of bad practices though: watch out not to end up with a layered platform team instead, which limits its actions to preserving existing technology only.

If your company is a smaller business, a great solution may be hiring an external partner who could take you on your Kubernetes journey, based on their vast experience in dealing with it on behalf of other companies.

So, if you are considering using Kubernetes or would like to speak to someone about an efficient platform team for your K8s, do give us a call. Our experts will be happy to share their knowledge with you!

How can DevOps practices improve your cloud-based system?
Future Processing blog, 25 March 2021

When the Internet entered general use a few decades ago, users were deeply fascinated and ready to accept even major difficulties with connections, website loading or software functionality. In 2021, when software offers various functionalities, end-users are incredibly aware of their needs and ready to express their demands publicly. Hours-long outages of work-related apps like Slack, Teams or Office 365, social media, shops, or streaming platforms are unacceptable. Software providers need a detailed plan to build and maintain fully reliable systems, as customers' trust is at stake.


Giants’ loudest outages of 2020

2020, at first called the year of cloud computing, soon turned out to be a time of frustration, when global cloud-based software providers suffered outages that significantly impacted web services, apps, and overall business. Major cloud service providers had to draw conclusions and find effective remedies for the future.


Microsoft Azure, March 2020

  • Duration of outage: 6 hours
  • Affected region: Microsoft's East U.S. datacenter region
  • Cause: Cooling system failure required manually resetting the cooling system's controllers
  • Affected services: Storage, compute, networking, and other services


IBM Cloud, June 2020

  • Duration of outage: 4 hours
  • Cause: Multi-zone interruption of services caused by a third-party network provider that flooded the IBM Cloud network with incorrect routing
  • Affected regions: London, Frankfurt, Sydney, Washington, D.C., and Dallas
  • Affected services: 80+ data centres; general cloud services, Kubernetes services, App Connect, and Watson AI cloud services


Cloudflare, July 2020

  • Date of outage: July 17, 2020
  • Cause: A configuration error in Cloudflare’s global backbone network resulted in a 50% traffic drop across its network
  • Affected: A significant chunk of internet services, several big-name clients such as Discord, Feedly, GitLab, League of Legends, Patreon, Politico, and Shopify


Amazon Web Services, November 2020

  • Date: November 25, 2020
  • Cause: A multi-hour, global outage triggered by a small addition of capacity to Amazon Kinesis
  • Affected: The U.S. East-1 region, knocking out services of prominent AWS customers (1Password, Adobe Spark, Autodesk, Flickr, iRobot, Roku, Twilio, The Washington Post, and Glassdoor) as well as other AWS services, such as Lambda, LEX, Macie, Managed Blockchain, Marketplace, MediaLive, MediaConvert, Personalize, Rekognition, SageMaker, and Workspaces


Google Cloud, December 2020

  • Duration of outage: nearly 1 hour (December 14, 2020)
  • Cause: Google Cloud experienced a widespread global authentication system outage due to an internal storage quota issue
  • Affected: Major Google services, including YouTube, Google Maps, Google Docs and Gmail
Source: www.marketsandmarkets.com


The remedy: DevOps practices

The meaning of DevOps is still evolving, but let’s go back to the beginning: DevOps is a set of practices.

A compound of development (Dev) and operations (Ops), DevOps is the union of people, process, and technology to provide value to customers continually. What does DevOps mean for teams? DevOps enables formerly siloed roles—development, IT operations, quality engineering, and security—to coordinate and collaborate to produce better, more reliable products. By adopting a DevOps culture along with DevOps practices and tools, teams gain the ability to better respond to customer needs, increase confidence in the applications they build and achieve business goals faster.

With the DevOps methodology, development and operations teams work together, apply automation and use the same tools in shorter development cycles, so you get results much faster.
The DevOps methodology is based on nine pillars that, when combined, constitute a complete approach that is highly beneficial for your business and leads to project success. It makes delivery faster, more reliable and more secure.

DevOps assessment and its benefits

In adopting DevOps practices, teams work to ensure system reliability, high availability and aim for zero downtime while reinforcing security and governance. DevOps teams seek to identify issues before they affect the customer experience and mitigate problems immediately when they occur.

Maintaining this vigilance requires:  

  • rich telemetry,
  • actionable alerting,
  • full visibility into applications and the underlying system.
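As a minimal illustration of what "actionable alerting" on rich telemetry can mean, the sketch below fires an alert only after several consecutive threshold breaches, so a single noisy reading does not page anyone. The metric, threshold and sample values are illustrative assumptions, not tied to any specific monitoring product.

```python
def evaluate_alert(samples, threshold, min_violations):
    """Fire an alert only when enough consecutive samples breach the
    threshold; isolated spikes are ignored."""
    consecutive = 0
    for value in samples:
        if value > threshold:
            consecutive += 1
            if consecutive >= min_violations:
                return True
        else:
            consecutive = 0
    return False

# Example: hypothetical p95 latency samples in milliseconds,
# alerting after 3 breaches in a row.
latency_p95 = [120, 140, 510, 530, 560, 130]
print(evaluate_alert(latency_p95, threshold=500, min_violations=3))  # True
```

In practice this logic usually lives in the monitoring system itself (as an alert rule with a duration condition), but the principle is the same: alerts should reflect sustained problems, not single data points.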


Failure as part of a plan

Reliable software is crucial not only from a customer’s perspective; it is also essential for the developers and operators who build and run it. When an engineering team faces constant disruption, it slips into an interrupt-driven development phase, which is an easy way to burn out, both as a team and individually.


DevOps tools to build software reliability

First of all, achieving and maintaining software reliability has to be treated as a continuous process that requires engagement. You and your team decide on the level of reliability your company wants to provide, and then all team members work on it consistently, every day. Here are a few of the effective tools and practices to use:


Incident retrospectives

Analysing incidents and learning through retrospectives is a great way to understand not only why something went wrong, but also why things work at all. Your team will discover what was done to resolve the incident and why certain decisions were made. When you identify all the factors that contributed to an outage, you can analyse every single one of them in detail, explore weak areas and plan better decisions for the future.


Chaos engineering

Rule the chaos before it takes over your system. Causing outages intentionally, in a controlled way, provides your team with priceless knowledge: it is a great way to build resilience in the system and confidence among engineers, and it helps you build better-functioning software.
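A toy version of this idea is to wrap a service call so that a controlled fraction of requests fails on purpose, forcing calling code to prove it degrades gracefully. The function names and failure rate below are invented for illustration; real chaos experiments use dedicated tooling and run against production-like environments with a blast-radius limit.

```python
import random

def chaos_wrapper(func, failure_rate, rng=random.random):
    """Return a version of func that fails on purpose for a controlled
    fraction of calls (a deliberately injected fault)."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise RuntimeError("injected failure (chaos experiment)")
        return func(*args, **kwargs)
    return wrapped

def get_user(user_id):
    # Stand-in for a real downstream service call.
    return {"id": user_id, "name": "demo"}

flaky_get_user = chaos_wrapper(get_user, failure_rate=0.1)

# Calling code must now handle failure instead of assuming success:
try:
    user = flaky_get_user(42)
except RuntimeError:
    user = None  # fall back gracefully rather than crashing
```

The experiment has value only if the fallback path is actually exercised and observed; that is what turns an injected outage into knowledge.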


Quick on/off turning

There are multiple techniques to choose from: canary releases, A/B testing, blue/green deployments, rolling updates, dark launching and feature flags. The reason we use them in the software stack is simple: complex systems and deployments need a simple light switch that can make some parts go dark when a failure occurs.
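The "light switch" can be sketched as an in-process feature-flag store, assuming flags are just named booleans. The flag name below is hypothetical; production systems typically back this with a configuration service so flags can be flipped without a redeploy.

```python
class FeatureFlags:
    """Minimal in-memory feature-flag store."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name):
        # Unknown flags default to off: new code paths stay dark
        # until explicitly switched on.
        return self._flags.get(name, False)

    def set(self, name, enabled):
        self._flags[name] = enabled

flags = FeatureFlags({"new-checkout": True})

if flags.is_enabled("new-checkout"):
    result = "new checkout flow"
else:
    result = "old checkout flow"

# When the new flow misbehaves in production, one switch turns it off:
flags.set("new-checkout", False)
```

The design choice that matters is the fail-safe default: a missing or unknown flag disables the feature, so a configuration mistake degrades to the old, known-good behaviour.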


Service cataloguing

Complex systems include many running applications, and no individual can keep track of them all. Maintaining and keeping control of thousands of microservices is easier when you have a record of all of them and their inner workings: your team knows exactly what they rely on in a complex system.
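Such a record can be as simple as a map from each service to its owner and direct dependencies, from which transitive dependencies can be derived. The service names and teams below are invented for illustration; real catalogues are usually generated from deployment metadata rather than maintained by hand.

```python
# Hypothetical catalogue: every service, its owning team, and what it
# directly depends on.
catalog = {
    "checkout":    {"owner": "payments-team",  "depends_on": ["payments", "inventory"]},
    "payments":    {"owner": "payments-team",  "depends_on": ["fraud-check"]},
    "inventory":   {"owner": "warehouse-team", "depends_on": []},
    "fraud-check": {"owner": "risk-team",      "depends_on": []},
}

def all_dependencies(service, catalog):
    """Walk the catalogue to list everything a service transitively relies on."""
    seen = set()
    stack = list(catalog[service]["depends_on"])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(catalog[dep]["depends_on"])
    return sorted(seen)

print(all_dependencies("checkout", catalog))  # ['fraud-check', 'inventory', 'payments']
```

During an incident, a lookup like this answers the first triage question immediately: which services could this failure have reached, and who owns them.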


Simple runbooks

It’s essential to have a plan and a shared knowledge base in case something happens. Simple runbooks are a repository of agreed procedures, shared before a crisis hits the system. When, for example, a database approaches its maximum disk capacity, your team gets an alert together with a checklist of actions to take.
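The disk-capacity scenario above can be sketched as an alert that carries its runbook checklist with it, so whoever is on call sees the agreed first steps immediately. The threshold and the checklist steps are illustrative assumptions, not a recommendation for any specific database.

```python
# Hypothetical runbook repository keyed by alert name.
RUNBOOKS = {
    "db-disk-usage-high": [
        "1. Confirm the alert on the monitoring dashboard",
        "2. Identify the largest tables and recent growth",
        "3. Archive or delete expired data per the retention policy",
        "4. If still above threshold, extend the volume and notify the team",
    ],
}

def check_disk(used_gb, capacity_gb, threshold=0.85):
    """Return an alert with its runbook checklist attached when disk
    usage crosses the threshold, otherwise None."""
    usage = used_gb / capacity_gb
    if usage >= threshold:
        return {
            "alert": "db-disk-usage-high",
            "usage": round(usage, 2),
            "checklist": RUNBOOKS["db-disk-usage-high"],
        }
    return None

alert = check_disk(used_gb=90, capacity_gb=100)
# Fires at 90% usage, with the four-step checklist attached.
```

Writing the checklist before the crisis is the point: at 3 a.m., nobody should be deciding the retention policy from scratch.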


SLOs

Reliability is challenging to measure, but you can approach it by implementing SLOs (service-level objectives). By alerting on SLOs that reflect elements of the customer experience, you get closer to what reliability means for your particular system.
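The arithmetic behind an SLO is simple: a target such as "99.9% of requests succeed" implies an error budget, and tracking how much of that budget has been spent tells the team when to slow feature work and invest in reliability. The request counts below are illustrative figures, not from any real system.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Given an SLO target (e.g. 0.999) and observed traffic, compute
    the allowed failures, what remains, and the fraction of the error
    budget already spent."""
    allowed_failures = total_requests * (1 - slo_target)
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        "budget_spent": failed_requests / allowed_failures,
    }

budget = error_budget(slo_target=0.999,
                      total_requests=1_000_000,
                      failed_requests=600)
# 0.1% of 1,000,000 requests leaves roughly 1,000 allowed failures;
# 600 failures means about 60% of the error budget is already spent.
```

An alert on budget burn rate ("we will exhaust the budget in N days at this pace") is usually more actionable than alerting on each individual failure.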


Turn failure into high reliability

Building a reliable system is a process of constant improvement. It never ends, and it requires devoted engineers who are ready to search for the best solutions, test different paths and, most of all, eagerly learn from failure. Remember that users’ needs change, and so do the components they trust and rely on.

]]>
https://www.future-processing.com/blog/how-can-devops-practices-improve-your-cloud-based-system/feed/ 0