What makes an app reliable? If you ask most IT professionals that question, their minds immediately go to uptime. That’s not surprising; after all, it’s fashionable for software and infrastructure vendors these days to boast about how many “9s” of uptime they guarantee.
To be sure, uptime is one component of delivering a reliable application experience for developers and end-users. But it’s only one. In order to build reliability into your application patterns, you must think beyond uptime alone.
Keep reading for an overview of all the factors that go into achieving reliability.
What is reliability?
A reliable application is one that meets the needs and expectations of everyone it serves. While those needs and expectations will vary from case to case, reliability is characterized by the ability of an app to:
- Be accessible when needed.
- Respond within the timeframe needed.
- Be updated or modified as needed.
- Provide security and privacy to the extent needed.
- Meet the needs and expectations not just of end-users, but of everyone else who helps create or support the app.
Maximizing reliability for cloud-native apps
Achieving these qualities for modern, cloud-native applications requires a multi-pronged approach. The following are the core considerations to address in order to build reliability into an app pattern:
Uptime
We’ll start with uptime, which is the most obvious aspect of reliability. Uptime refers to the amount of time that your application is available, at least to some extent. (As we’ll discuss below, merely being available does not necessarily mean that the app satisfies the other requirements of reliability.)
Maximizing uptime typically involves solutions such as distributing workloads across different servers or data centers (so that if part of your infrastructure fails, the rest remains available) and using automated failover to move a workload to new infrastructure if the infrastructure hosting it fails in one location.
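To make the failover idea concrete, here is a minimal sketch. It assumes a priority-ordered list of endpoints and a caller-supplied health check. The names `pick_endpoint` and `is_healthy` and the example hostnames are hypothetical, and real failover is normally handled by a load balancer or orchestrator rather than by application code like this.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical endpoints in two locations; the primary is simulated as failed.
endpoints = ["https://us-east.example.com", "https://eu-west.example.com"]
failed = {"https://us-east.example.com"}

active = pick_endpoint(endpoints, lambda url: url not in failed)
print(active)  # the workload moves to the remaining healthy location
```

Passing the health check in as a function keeps the selection logic separate from however health is actually measured (an HTTP probe, a heartbeat, and so on).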
Culture and processes are part of the picture, too, since you need to make sure your team is prepared to respond quickly to any incidents that arise, in order to minimize downtime.
It’s worth noting, by the way, that there can be such a thing as too much uptime. If you reach the point where adding more uptime doesn’t deliver any additional value to your organization, then it’s probably time to stop. Adding another 9 to your availability just because you can might be a poor use of resources, so weigh the cost of each additional 9 against the value it delivers when you set your uptime targets and strategy.
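To see what each additional 9 actually buys, it helps to convert availability into permitted downtime. The sketch below is plain arithmetic. The function name `allowed_downtime_minutes` is illustrative, and a 365-day year is assumed.

```python
def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted per period at an availability of N nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines (99.9%) -> 0.001
    return period_days * 24 * 60 * unavailability


for n in range(2, 6):
    # 3 nines allows roughly 525.6 minutes per year; each extra 9 divides that by 10.
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes per year")
```

Each extra 9 cuts the permitted downtime by a factor of ten, so the absolute gain shrinks quickly: going from four 9s to five recovers less than an hour per year.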
Performance and efficiency
In order to meet stakeholder needs and expectations, an application must not only be available, but also able to respond within the timeframe required by those stakeholders. A website that is “up” 100 percent of the time but takes 20 seconds to load each page is not a very reliable website.
This is why performance testing and monitoring must be built into your application delivery chain. You should also “rightsize” your infrastructure (for example, by choosing the right types of cloud instances) so that your applications have enough resources available to perform adequately. (Of course, this priority must be balanced against cost control.)
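One simple form of performance check compares a latency percentile against a target. The sketch below uses the nearest-rank method. The function names, the sample values, and the 500 ms target at the 95th percentile are illustrative assumptions, not recommendations.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_target(samples_ms: Sequence[float], target_ms: float, pct: float = 95.0) -> bool:
    """True if the chosen percentile of observed latencies is within the target."""
    return percentile(samples_ms, pct) <= target_ms


# Hypothetical page-load times in milliseconds; one request took 20 seconds.
load_times_ms = [120, 150, 180, 200, 20000]
print(meets_latency_target(load_times_ms, target_ms=500))
```

A percentile is used rather than an average because a few very slow responses, like the 20-second page load above, are exactly what users notice.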
Accessibility and usability
An application that is available all of the time and responds quickly may still fail to meet stakeholder needs if it is difficult to work with due to poor design.
From an end-user’s perspective, user-experience testing is the most obvious way to help mitigate the risk of poor accessibility or usability within your applications. But keep in mind that you also need to think about usability from the perspective of your own team. Make sure that access-control systems, infrastructure architectures, and documentation can be navigated easily. You don’t want an application or a delivery chain that is too complex for your team to be able to support effectively.
Security and data privacy
A reliable application is a secure application. There are many obvious ways to help secure applications, such as using vulnerability monitoring tools. But remember that security is not just about finding vulnerabilities as they arise or deploying certain tools. You must also build security into the design of your infrastructure, processes, and culture.
Backup and recovery
You don’t need to be a seasoned IT professional to know that backing up data is important for achieving reliability. When something goes wrong — which it occasionally will, despite your best plans and efforts — having a backup in place often means the difference between experiencing a blip in service and a major catastrophe.
What is easier to overlook is the importance of having a disaster recovery plan in addition to a backup routine. Simply backing up your data isn’t enough; you must also know how you’re going to restore it to new application instances quickly if disaster strikes. If, for example, all of your cloud-based virtual machines go down, what is the exact process you’ll follow for standing them back up? Identifying the steps ahead of time (and, ideally, scripting them so that the recovery process is as automated as possible) is crucial.
It is also important to quantify how much recent data you can afford to lose, which determines how often you need to back up (a metric called Recovery Point Objective, or RPO), and how quickly you need to be able to recover to avoid serious business disruption (Recovery Time Objective, or RTO). Along similar lines, error budgets, which define how much downtime you can tolerate in your production systems, can help you determine how much time is available each month for making system improvements.
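These three metrics reduce to simple time arithmetic. The sketch below assumes a 30-day month and an availability target expressed as a fraction. All function names and sample values are hypothetical.

```python
from datetime import datetime, timedelta


def backup_within_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the newest backup is recent enough to satisfy the RPO."""
    return now - last_backup <= rpo


def recovery_within_rto(outage_start: datetime, restored_at: datetime, rto: timedelta) -> bool:
    """True if service was restored within the RTO."""
    return restored_at - outage_start <= rto


def error_budget_remaining(slo: float, downtime_minutes: float, period_days: int = 30) -> float:
    """Minutes of tolerable downtime left in the period (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    budget = (1.0 - slo) * period_days * 24 * 60
    return budget - downtime_minutes


# Hypothetical values: a 4-hour RPO, and a 99.9% target with 10 minutes already spent.
now = datetime(2024, 1, 1, 12, 0)
print(backup_within_rpo(datetime(2024, 1, 1, 9, 0), now, rpo=timedelta(hours=4)))
print(f"{error_budget_remaining(0.999, downtime_minutes=10.0):.1f} minutes of budget left")
```

Checks like these can run on a schedule, so that a stale backup or an overspent error budget surfaces before an incident rather than during one.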
Modifiability
Few apps are deployed and then never modified again. Instead, they must be updated or scaled constantly to meet changing needs and demands. That’s why designing an application pattern that can be easily modified is critical for achieving reliability over the long term.
In practice, modifiability entails having the right processes (such as clear feedback loops between IT Ops and developers) to decide efficiently which modifications to implement, and the right tools (like automated deployment solutions) to roll them out. It might also involve taking advantage of infrastructure technologies (such as containers) that make it easy to update a running application without imposing downtime on users.
Conclusion
Reliability is more complex than it may at first seem. No matter how many 9s are included in your SLAs, you are not guaranteeing true reliability unless you also address requirements such as performance, usability, and modifiability.
This post originally appeared on devops.com.