Getting 99.999% availability and beyond
This recipe discusses the differences between disaster recovery (DR) and high availability (HA), and how to architect HA solutions. Before we get into that, let’s refine the definition of a few key terms:
- High availability, or HA: This means protecting a service against the failure of a single component. Think of this as protecting against the failure of an individual system, such as a server, a disk, or a network link.
- Disaster recovery, or DR: This means protecting against, and recovering from, the failure of an entire data center or cloud region.
- Availability nines: This is a way to quantify the uptime or reliability of a system by counting the nines in its uptime percentage. For example, 99.9% is three nines and 99.999% is five nines.
Here’s a breakdown of the most commonly used nines and the maximum downtime each one allows, assuming 24x7x365 operations:
Availability (%) | Downtime per Year | Downtime per Month |
99 | 3d 14h 56m 18s | 7h 14m 41s |
99.9 | 8h 41m 38s | 43m 28s |
99.99 | 52m 10s | 4m 21s |
99.999 | 5m 13s | 26s |
99.9999 | 31s | 2.6s |
Table 6.1 – Nines downtime
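The arithmetic behind a table like this can be sketched in a few lines of Python. This is a minimal illustration rather than the method used to build the table: it assumes a 365-day year and a month of exactly one-twelfth of a year, so its results can differ from the table by a few minutes or seconds. The function name `downtime_budget` is hypothetical.

```python
from datetime import timedelta

# Assumption: a 365-day year; a month is taken as one-twelfth of that year.
SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_budget(availability_percent: float, period_seconds: float) -> timedelta:
    """Return the maximum downtime allowed in a period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    # Round to whole seconds to keep the output readable.
    return timedelta(seconds=round(period_seconds * unavailable_fraction))


if __name__ == "__main__":
    for nines in (99, 99.9, 99.99, 99.999, 99.9999):
        per_year = downtime_budget(nines, SECONDS_PER_YEAR)
        per_month = downtime_budget(nines, SECONDS_PER_YEAR / 12)
        print(f"{nines}% | {per_year} per year | {per_month} per month")
```

Passing a different `period_seconds` value gives the budget for any other window, such as a week or a quarter.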
In the preceding table, each additional nine cuts the allowed downtime by a factor of ten. Achieving higher numbers of nines typically requires implementing redundant systems, failover mechanisms, and rigorous maintenance practices to minimize downtime and ensure continuous operation. In addition, when setting up a Service-Level Agreement (SLA), you can choose to measure uptime only during business hours and to exclude scheduled maintenance. As an example, using a working schedule of Monday through Friday with 12 working hours a day, and 10 holidays off per year, the matrix would look very different!
Availability (%) | Downtime per Year | Downtime per Month |
99 | 1d 7h 2m 58s | 2h 35m 14s |
99.9 | 3h 6m 18s | 15m 31s |
99.99 | 18m 38s | 1m 33s |
99.999 | 1m 52s | 9s |
99.9999 | 11s | 1s |
Table 6.2 – Business hours downtime
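A business-hours budget follows the same arithmetic, applied to the hours the service is actually expected to be up. The sketch below assumes 52 working weeks per year and subtracts holidays as whole working days; the exact figures depend on how weeks and holidays are counted, so they will not necessarily match the table to the second. Both function names are hypothetical.

```python
def service_hours_per_year(days_per_week: int, hours_per_day: float,
                           holidays: int, weeks_per_year: int = 52) -> float:
    """Hours per year during which the SLA applies (assumes 52 weeks by default)."""
    working_days = days_per_week * weeks_per_year - holidays
    if working_days < 0:
        raise ValueError("more holidays than working days")
    return working_days * hours_per_day


def allowed_downtime_seconds(availability_percent: float, service_hours: float) -> float:
    """Seconds of downtime allowed within the given number of service hours."""
    return service_hours * 3600 * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Monday through Friday, 12 hours a day, 10 holidays per year.
    hours = service_hours_per_year(days_per_week=5, hours_per_day=12, holidays=10)
    for nines in (99, 99.9, 99.99, 99.999, 99.9999):
        seconds = allowed_downtime_seconds(nines, hours)
        print(f"{nines}%: {seconds:.0f} seconds of downtime per year")
```

Because the SLA only covers the scheduled hours, an outage outside that window does not consume any of the budget.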
Note
When setting SLAs with the business, make sure everyone understands whether maintenance windows count as downtime and which operational hours the SLA covers.
Getting ready
When designing HA systems, there are several considerations that need to be taken into account to ensure the system is resilient and can handle failures. Here are some key considerations:
- Redundancy: Having redundancy in HA systems is essential. It requires replicating components or whole systems to eliminate potential single points of failure (SPOFs). Redundancy can be implemented at different levels, such as hardware, software, and network infrastructure. To minimize the impact of localized failures, it’s crucial to distribute redundant components across different physical locations.
- Failover and load balancing: It is important for HA systems to be equipped with failover mechanisms that enable automatic switching to a backup system whenever a failure occurs. One way to achieve this is through replicating data and services across multiple servers, coupled with the use of load-balancing techniques that ensure the even distribution of workload. With load balancing, traffic can be easily redirected to available servers in the event of a server failure.
- Scalability: When designing HA systems, it is important to ensure that they can handle increased workloads and scale effortlessly. This can be achieved through horizontal scaling, which entails adding more servers to distribute the load, or vertical scaling, which involves adding resources to existing servers. Additionally, the system should be capable of dynamically adjusting resource allocation based on demand to prevent overloading.
- Data replication and backup: Maintaining data integrity and availability is crucial for HA systems. To ensure that data can still be accessed in case of a system failure, it is essential to replicate data across multiple storage systems or databases. Additionally, performing regular backups is vital to safeguard against potential data loss or corruption.
- Fault tolerance: Systems used for highly available architectures should have fault tolerance, which means they must be able to function even if specific components or subsystems malfunction. Achieving this requires creating a system that can manage errors with ease, recover automatically, and ensure continuity of service.
- Disaster recovery: Having a DR plan is crucial for HA systems to effectively deal with catastrophic events such as natural disasters or widespread outages. Such a plan typically entails keeping off-site backups, setting up secondary data centers, or relying on cloud-based services to maintain business continuity, even amid extreme situations.
- Documentation and testing: It is crucial to document the system architecture, configurations, and procedures to effectively troubleshoot and maintain the HA system. Regular testing, such as failover tests, load testing, and DR drills, plays a significant role in identifying potential issues and ensuring the system operates as intended in various scenarios.
- Cost and complexity: Designing, implementing, and maintaining HA systems can be both complex and costly. It is important to carefully consider the available budget, as well as the expertise and resources required to effectively manage and monitor the system.
By addressing these considerations, you can design a robust and resilient HA system that delivers high uptime, fault tolerance, and continuity of critical services.
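To see why redundancy matters so much, it can help to put numbers on it. The following sketch uses the standard formula for components in parallel, 1 - (1 - a)^n, and assumes that nodes fail independently of one another and that a single surviving node is enough to keep the service up. Real systems often have correlated failures, so treat the result as an upper bound. The function name is hypothetical.

```python
def redundant_availability(node_availability: float, nodes: int) -> float:
    """Availability of a group of redundant nodes where any one node suffices.

    Assumption: node failures are independent of each other.
    """
    if not 0 <= node_availability <= 1:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("at least one node is required")
    # The group is down only when every node is down at the same time.
    return 1 - (1 - node_availability) ** nodes


if __name__ == "__main__":
    for count in (1, 2, 3):
        combined = redundant_availability(0.99, count)
        print(f"{count} node(s) at 99% each: {combined:.6%}")
```

Under these assumptions, two nodes that each offer two nines combine to roughly four nines, which is why eliminating SPOFs is usually the first step toward a higher availability target.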
How to do it…
As a rule, you should pick the right technology for the right subsystem and application.
When aiming to achieve HA for a web application, the first step is to place a load balancer in front of the web servers. This enables scaling of the application while also offering some fault tolerance for these systems. However, attention should also be given to the data tier, which can be addressed either by using the database’s own clustering feature or by building a general-purpose cluster capable of running the database, depending on the limitations of the database technology.
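The behavior a load balancer provides can be illustrated with a toy model. The class below is only a sketch of round-robin distribution with failover, not a replacement for a real load balancer; the class name, method names, and server names are all hypothetical, and health checking is reduced to explicit `mark_down` and `mark_up` calls.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that skips servers marked as down."""

    def __init__(self, servers):
        self._servers = list(servers)
        if not self._servers:
            raise ValueError("at least one server is required")
        self._healthy = set(self._servers)
        self._index = 0

    def mark_down(self, server):
        """Record that a server failed its health check."""
        self._healthy.discard(server)

    def mark_up(self, server):
        """Record that a known server has recovered."""
        if server in self._servers:
            self._healthy.add(server)

    def next_server(self):
        """Return the next healthy server in rotation."""
        # Try each server at most once per call, starting from the rotation point.
        for _ in range(len(self._servers)):
            server = self._servers[self._index]
            self._index = (self._index + 1) % len(self._servers)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["web1", "web2", "web3"])
    print([balancer.next_server() for _ in range(3)])
    balancer.mark_down("web2")  # simulate a failed web server
    print([balancer.next_server() for _ in range(4)])
```

Once `web2` is marked down, requests continue to rotate between the remaining servers, which is the fault tolerance the load balancer adds to the web tier.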
If you are using a technology such as Oracle Database, you have the option to establish a database-specific cluster known as Oracle Real Application Clusters (Oracle RAC). This cluster provides both scalability and availability. With RAC, the database remains accessible for queries as long as at least one node is online. While other databases have their own clustering technology (such as MySQL Cluster), you may instead opt for generic cluster technologies such as Pacemaker for cluster management and Corosync for communication between cluster nodes. This approach has the advantage of enabling almost any technology to be made highly available in a Linux environment.
You can achieve HA in storage by implementing filesystems across the entire cluster. Gluster allows you to mount a filesystem across multiple servers, while at the same time replicating the storage across servers. This provides both scalability and reliability at the filesystem level.
Finally, the network is a common point of failure, and network bonding technologies can provide both HA and some ability to scale. Bonding works by combining at least two network ports into a single virtual port.
Note
The best HA architectures mix these approaches to cover the entire technology stack.
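One way to see why the whole stack needs attention is to multiply the availability of each tier a request depends on. The sketch below assumes that every tier is required and that tiers fail independently; the tier names and availability figures are made up purely for illustration, and both function names are hypothetical.

```python
from math import prod


def stack_availability(tiers: dict) -> float:
    """Availability of a stack in which every tier must be up.

    Assumption: tiers fail independently of each other.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    return prod(tiers.values())


def weakest_tier(tiers: dict) -> str:
    """Name of the tier with the lowest availability."""
    return min(tiers, key=tiers.get)


if __name__ == "__main__":
    # Hypothetical figures, not measurements.
    tiers = {"network": 0.9999, "web": 0.9999, "database": 0.999, "storage": 0.9999}
    print(f"End-to-end availability: {stack_availability(tiers):.4%}")
    print(f"Weakest tier: {weakest_tier(tiers)}")
```

The end-to-end figure can never exceed that of the weakest tier, so making one layer highly available while leaving another as a single point of failure yields little overall gain.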