TAILOR YOUR BASKET TO MATCH YOUR EGGS
High Availability and Disaster Recovery on Power – IBM now has an impressive portfolio
Over the last 10 years IT operations have evolved to the point where critical applications are rarely hosted on a single server or, in many cases, in a single Data Centre. However, I have found in many instances that this development has been piecemeal rather than driven by a detailed review: one that examines the application requirements for High Availability / Disaster Recovery (HA/DR) and matches these requirements to the solutions available across all the infrastructure.
For many years IBM has been recognised as a leader in HA and DR solutions for workloads on Power, designed to meet the availability requirements of critical enterprise applications. In the last few years its portfolio has been expanded to include protection for those “less critical” applications in the Data Centre. By “less critical”, I refer to those that can afford a slightly longer outage and/or have less stringent requirements around data loss. In particular, if you are looking for a simple and reasonably priced HA or DR solution, we now have VM Recovery Manager HA (VMRM HA) and VM Recovery Manager DR (VMRM DR); more details below.
It is worth noting that in the 2019 ITIC Global Server, Hardware, OS Reliability Survey (Mid-year update), 66% of the respondents said that increasing workloads are impacting negatively on their reliability and only 18% said that they hadn’t experienced a decline in reliability due to increasing workloads. The same survey deals with the estimated costs of outages which, while not examined here, need to be considered when pricing a well-matched HA or DR solution.
Now that IBM has a more comprehensive portfolio of HA/DR solutions, it is a good time to review what is available, what has changed, and how these options will match your application availability requirements.
While the primary focus of HA/DR solutions is to work around failures in the infrastructure, these tools are equally useful for managing maintenance and upgrade tasks. For example, PowerHA includes a tool on AIX to manage interim fixes (ifixes) and Service Packs across the cluster. Over the last few years, PowerHA development has focused on ease of use and has successfully countered the perception that PowerHA is difficult to manage.
Range of options
Companies typically have a range of applications with related Service Level Agreements (SLAs) based around both Recovery Time Objective (RTO – the time before clients can access the application again) and Recovery Point Objective (RPO – the point of the last saved transaction, and so how much data can be lost) – see below. To match this, IBM has a number of solutions, working either independently or together, to meet these different SLAs. Understanding the range of solutions is even more important now, as it is rare to find only one operating system running on your Power infrastructure.
RTO / RPO
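To make the two measures concrete, here is a minimal sketch in Python of how the achieved RTO and RPO of a single outage could be worked out and compared with an SLA. The `Outage` record, the function names and the timestamps are all hypothetical and purely for illustration; they do not come from any IBM tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Hypothetical record of one outage (all fields are illustrative)."""
    last_saved_transaction: datetime  # last transaction known to be safely stored
    failure: datetime                 # moment the application became unavailable
    service_restored: datetime        # moment clients could access it again


def achieved_rto(outage: Outage) -> timedelta:
    """Time during which clients could not access the application."""
    return outage.service_restored - outage.failure


def achieved_rpo(outage: Outage) -> timedelta:
    """Window of work lost: from the last saved transaction to the failure."""
    return outage.failure - outage.last_saved_transaction


def meets_sla(outage: Outage, rto_target: timedelta, rpo_target: timedelta) -> bool:
    """True only if both the RTO and the RPO targets were met."""
    return (achieved_rto(outage) <= rto_target
            and achieved_rpo(outage) <= rpo_target)


# Made-up example: failure at 11:00, service back at 11:25,
# last saved transaction at 10:58.
example = Outage(
    last_saved_transaction=datetime(2019, 9, 2, 10, 58),
    failure=datetime(2019, 9, 2, 11, 0),
    service_restored=datetime(2019, 9, 2, 11, 25),
)

print(achieved_rto(example))  # 0:25:00
print(achieved_rpo(example))  # 0:02:00
print(meets_sla(example, timedelta(minutes=30), timedelta(minutes=5)))  # True
```

In this made-up case the application was unavailable for 25 minutes and lost 2 minutes of work, so it would meet an SLA of 30 minutes RTO and 5 minutes RPO but not one of 15 minutes RTO.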
Talking about the cost of these solutions (which in many cases also includes the duplication of some expensive infrastructure) is not easy. However, to be prepared, an organisation needs to be aware of the possible cost of some of the more common failure scenarios. The setup side of the equation is changing as new features now make it easier for organisations to control costs – for example by activating licenses only when needed, or automating the shutdown of less critical workloads when required, thus freeing resources for restarted workloads.
The range of HA/DR options on Power now includes:
Live Partition Mobility or LPM (A useful tool for Administrators to move workloads for maintenance and some types of failure);
Simplified Remote Restart or SRR;
IBM Virtual Machine Recovery Manager HA;
IBM Virtual Machine Recovery Manager DR (was IBM Geographically Dispersed Resiliency for IBM Power Systems);
PowerHA – now for AIX, IBM i and Linux;
PowerHA Enterprise Edition; and
Geographic Logical Volume Manager (GLVM) – an old AIX feature that may have a future in the cloud.
Live Partition Mobility
LPM is more of an administrative tool: it is part of PowerVM and allows administrators to move running workloads from one Power system to another. It has a few configuration requirements and offers huge benefits for little effort. It is not really considered to be an HA solution, but I mention it for completeness, and note that the following 3 options will not work if your Power environment is not configured for LPM.
Simplified Remote Restart
The simplest option: a setting in the HMC, NovaLink or PowerVC enables an LPAR to be restarted on another frame if the current frame fails. If managed by PowerVC, placement of the restarted LPARs can be controlled; otherwise it is a manual process.
SRR is operating system agnostic and only recognises frame failures.
IBM VM Recovery Manager HA
VM Recovery Manager HA provides simple automation and management of SRR in the Data Centre.
It is operating system agnostic and, with an AIX or Linux agent, can be scripted to perform further monitoring beyond just frame failures. Distribution of virtual machines across the infrastructure can also be configured.
IBM VM Recovery Manager DR
VM Recovery Manager DR is similar to VM Recovery Manager HA but extends beyond the Data Centre, so it relies on replication at the storage layer, which is controlled by the Recovery Manager. It has an added cool feature that allows you to easily run DR testing by starting the LPAR at the DR site using cloned LUNs, thus not impacting production availability.
PowerHA – now for AIX, IBM i and Linux
In reality, this option consists of 3 products for the 3 operating systems, with a common logic of moving resources around the infrastructure to mask failures and allow applications to be restarted and made available for client access. It is also a useful tool for moving workloads around for maintenance, and it now includes an impressive array of management and testing tools.
PowerHA Enterprise Edition
Enterprise Edition extends the AIX and IBM i options by introducing sites (Data Centres), controlling the replication of data and coordinating the behaviour of the clusters in each Data Centre. In most of the clusters I have worked on, replication is managed by the storage subsystem, with just a few using Geographic Logical Volume Manager (GLVM).
Geographic Logical Volume Manager
GLVM is part of the AIX Logical Volume Manager and allows replication to be configured over IP. By itself, there is no control or automation, but when integrated with PowerHA EE, it is a fully managed DR solution replicating over IP. However, watch this space as GLVM may have a future in the cloud.
Comparing the features
Note: GLVM is not included in the table as it is a feature of AIX and can be managed by PowerHA for cross-site storage replication over IP.
HA/DR maintenance and management
As discussed above, there has been a perception that the configuration and management of PowerHA is complex. However, over the last 3-5 years:
IBM development has sought and acted on feedback from users;
IBM has invested heavily in a management GUI and Smart Assists (bundled solutions for “common” applications) to make the installation and management less complex and less prone to errors;
Feedback from support and analysis of the support databases mean that common mistakes and inconsistencies are now checked for during installation and regular cluster synchronisation; and
The testing process has been simplified and parts automated as experience has shown that a well tested and maintained cluster is not only less likely to fail, but will instil confidence.
Where to start?
For “Enterprise” customers, greater virtualisation of Power workloads has muddied the water and added to this perceived complexity. However, in reality, most organisations have a small number of “classes” of applications based on the SLAs – the RTOs and RPOs mentioned above. The problem can often be simplified by grouping each application into an “Availability Class”, which is then mapped to the appropriate HA / DR option. This is an ongoing process, to be updated as the landscape evolves.
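As a rough illustration of the grouping exercise, the sketch below (in Python) assigns applications to an Availability Class by their RTO target and looks up the HA and DR option for that class. The class names, the RTO thresholds, the application names and the pairing of each class with a product are all assumptions made for the example, not IBM guidance; a real review would also weigh RPO, cost and the operating systems involved.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class AvailabilityClass:
    name: str
    rto_up_to: timedelta       # applications with an RTO target up to this value
    ha_option: str             # option used within the Data Centre
    dr_option: Optional[str]   # option used across sites, if any


# Ordered from most to least demanding.
# Thresholds and product pairings are hypothetical examples only.
CLASSES = [
    AvailabilityClass("critical", timedelta(minutes=15),
                      "PowerHA", "PowerHA Enterprise Edition"),
    AvailabilityClass("standard", timedelta(hours=4),
                      "VM Recovery Manager HA", "VM Recovery Manager DR"),
    AvailabilityClass("best-effort", timedelta.max,
                      "Simplified Remote Restart", None),
]


def classify(rto_target: timedelta) -> AvailabilityClass:
    """Return the first (most demanding) class whose threshold covers the target."""
    for candidate in CLASSES:
        if rto_target <= candidate.rto_up_to:
            return candidate
    raise ValueError("no Availability Class covers this RTO target")


# Hypothetical applications and their RTO targets.
applications = {
    "payments": timedelta(minutes=5),
    "reporting": timedelta(hours=2),
    "test-lab": timedelta(days=2),
}

for app, rto in applications.items():
    chosen = classify(rto)
    dr = chosen.dr_option or "none assigned"
    print(f"{app}: class={chosen.name}, HA={chosen.ha_option}, DR={dr}")
```

Keeping the classes in one small table like this makes the ongoing review easier: when the landscape changes, only the table or an application's target needs to be updated.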
Systemethix, IBM and the partner ecosystem can advise on options, help you review your HA/DR requirements, run a proof of concept and help with the installation, testing and maintenance of the solution. At Systemethix we also have the HA/DR skills to assist with the planning, implementation, testing, training, and maintenance / upgrading of existing solutions.
The summary of each of the options is by necessity very brief and we would be happy to provide further information or assist you further if required.
“My years at the coalface – L2 support@IBM” - Red Steel
Implementing IBM VM Recovery Manager for IBM Power Systems August 2019 SG24-8426-00
IBM Geographically Dispersed Resiliency for IBM Power Systems February 2017 SG24-8382-00
IBM PowerHA SystemMirror V7.2.2 and V7.2.3 for IBM AIX and Linux December 2018 SG24-8434-00