Chapter 1: HACMP for Linux Cluster Overview
High Availability Cluster Multi-Processing (HACMP™) on Linux is the IBM tool for building Linux-based computing platforms that include more than one server and provide high availability of applications and services.
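Both the HACMP for AIX 5L and HACMP for Linux versions use a common software model and present a common user interface (WebSMIT). This chapter provides an overview of HACMP on Linux and contains the following sections:
- Overview
- Cluster Terminology
- Sample Configuration with a Diagram
- Node and Network Failure Scenarios
- Where You Go from Here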
Overview
HACMP for Linux enables your business application and its dependent resources to continue running either at its current hosting server (node) or, in case of a failure at the hosting node, at a backup node, thus providing high availability and recovery for the application.
HACMP detects component failures and automatically transfers your application to another node with little or no interruption to the application’s end users.
HACMP for Linux takes advantage of the following software components to reduce application downtime and recovery time:
- Linux operating system (RHEL or SUSE ES versions)
- TCP/IP subsystem
- High Availability Cluster Multi-Processing (HACMP™) on Linux cluster management subsystem (the Cluster Manager daemon).
HACMP for Linux provides:
- High availability for system processes, services and applications that are running under HACMP’s control. HACMP ensures continuing service and access to applications during hardware or software outages (or both), planned or unplanned, in a cluster of up to eight nodes. Nodes may have access to the data stored on shared disks over an IP-based network (although shared disks cannot be part of the HACMP for Linux cluster and are not kept highly available by HACMP).
- Protection and recovery of applications when components fail. HACMP protects your applications against node and network failures by providing automatic recovery of applications. If a node fails, HACMP recovers applications on a surviving node. If a network or a network interface card (adapter) fails, HACMP uses an alternate network, an additional network interface or an IP label alias to recover the communication links and continue providing access to the data.
- WebSMIT, a web-based user interface to configure an HACMP cluster. In WebSMIT, you can configure a basic cluster with the most widely used default settings, or configure a customized cluster with access to customizable tools and functions. WebSMIT lets you view your existing cluster configuration in different ways (node-centric view or application-centric view) and provides cluster status tools.
- Easy customization of how applications are managed by HACMP. You can configure HACMP to handle applications in the way you want:
  - Application startup. You select from a set of options for how you want HACMP to start up applications on the node(s).
  - Application recovery actions that HACMP takes. If a failure occurs with an application’s resource that is monitored by HACMP, you select whether you want HACMP to recover applications on another cluster node, or stop the applications.
  - HACMP’s follow-up after recovery. You select how you want HACMP to react after you have restored a failed cluster component. For instance, you decide on which node HACMP should restart the application that was automatically stopped (or moved to another node) due to a previously detected resource failure.
- Built-in configuration, system maintenance and troubleshooting functions. HACMP has functions to help you with your daily system management tasks, such as cluster administration, automatic cluster monitoring of the application’s health, or notification upon component failures.
- Tools for creating similar clusters from an existing “sample” cluster. You can save your existing HACMP cluster configuration in a cluster snapshot file, and later recreate it in an identical cluster in a few steps.
Related Cluster Information
For additional information on IBM offerings for Linux, see:
http://www.ibm.com/software/os/linux/software/resource.html
Cluster Terminology
The list below includes basic terms used in the HACMP environment.
Note: In general, terminology for HACMP is based on industry conventions for high availability. However, the meaning of some of the terms in HACMP may differ from the generic terms.
An application is a service, such as a database, or a collection of system services and their dependent resources, such as a service IP label and the application’s start and stop scripts, that you want to keep highly available with the use of HACMP.
An application server is a collection of application start and stop scripts that you provide to HACMP by entering the pathnames for the scripts in the WebSMIT user interface. An application server becomes a resource associated with an application. You include it in a resource group for HACMP to keep it highly available. HACMP ensures that the application can start and stop successfully no matter on which cluster node it is being started.
A cluster node is a physical machine, typically an AIX 5L or a Linux server, on which you install HACMP. A cluster node also hosts an application and serves as a server for the application’s clients. HACMP’s role is to ensure continuous access to the application, no matter on which node in the cluster the application is currently active.
A home node is a node on which the application is hosted, based on your default configuration for the application’s resource group, and under normal conditions.
A takeover node is a backup cluster node to which HACMP may move the application. You can move the application to this node manually, for instance, to free the home node for planned maintenance. Alternatively, HACMP moves the application automatically after a cluster component failure.
In HACMP for Linux v.5.4, a cluster configuration includes up to eight nodes. Therefore, you can have more than one potential takeover node for a particular application. You define the list of nodes on which you want HACMP to host your application using the WebSMIT interface. This list is called a resource group’s nodelist.
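To make the idea of a nodelist concrete, here is a minimal Python sketch of how a hosting node could be chosen from it. It is an illustration only, not HACMP code. It assumes that the order of the nodelist expresses priority (home node first), and the node names and the `pick_hosting_node` helper are hypothetical.

```python
from typing import Iterable, Optional, Sequence


def pick_hosting_node(nodelist: Sequence[str],
                      active_nodes: Iterable[str]) -> Optional[str]:
    """Return the first node in the nodelist that is currently active.

    Assumption (not stated in this guide): nodelist order is the
    priority order, with the home node listed first.
    """
    active = set(active_nodes)
    for node in nodelist:
        if node in active:
            return node
    return None  # no node in the nodelist is available


# Hypothetical three-node nodelist; the names are placeholders.
nodelist = ["node1", "node2", "node3"]

print(pick_hosting_node(nodelist, {"node1", "node2", "node3"}))  # node1
print(pick_hosting_node(nodelist, {"node2", "node3"}))           # node2
print(pick_hosting_node(nodelist, set()))                        # None
```

With more than one takeover node in the list, the loss of the home node still leaves several candidates, which is the practical benefit of a longer nodelist.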
A cluster IP network is used for cluster communications between the nodes and for sending heartbeating information. All IP labels configured on the same HACMP network share the netmask, but may be required to have different subnets.
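The netmask and subnet rule can be checked during planning with Python's standard `ipaddress` module. The sketch below is a planning aid under stated assumptions, not an HACMP tool: the label names and addresses are made up, and it treats two labels on the same subnet as a problem, although the text above only says different subnets may be required.

```python
import ipaddress
from typing import Dict, List


def check_network_labels(labels: Dict[str, str]) -> List[str]:
    """Report planning problems for the IP labels of one HACMP network.

    `labels` maps a label name to an "address/prefix" string.
    """
    interfaces = {name: ipaddress.ip_interface(cidr)
                  for name, cidr in labels.items()}
    problems: List[str] = []

    # All labels on the same HACMP network should share one netmask.
    netmasks = {str(iface.netmask) for iface in interfaces.values()}
    if len(netmasks) > 1:
        problems.append("labels do not share one netmask: "
                        + ", ".join(sorted(netmasks)))

    # Assumption: each label is expected to sit on its own subnet.
    seen = {}
    for name, iface in interfaces.items():
        subnet = iface.network
        if subnet in seen:
            problems.append(
                f"{name} and {seen[subnet]} are on the same subnet {subnet}")
        else:
            seen[subnet] = name
    return problems


# Hypothetical labels for one node: same netmask, different subnets.
good = {"node1_boot": "192.168.10.1/24", "node1_backup": "192.168.11.1/24"}
bad = {"node1_boot": "192.168.10.1/24", "node1_backup": "192.168.10.2/24"}

print(check_network_labels(good))  # []
print(check_network_labels(bad))
```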
An IP label is a name of a network interface card (NIC) that you provide to HACMP. Network configuration for HACMP requires planning for several types of IP labels:
- Base (or boot) IP labels on each node: the ones through which initial cluster connectivity is established.
- Service IP labels for each application: the ones through which a connection for a highly available application is established.
- Backup IP labels (optional).
- Persistent IP labels on each node. These are node-bound IP labels that are useful to have in the cluster for administrative purposes.
Note that to ensure high availability and access to the application, HACMP “recovers” the service IP address associated with the application on another node in the cluster in cases of network interface failures.
HACMP uses IP aliases for HACMP networks. For information, see Planning IP Networks and Network Interfaces.
An IP alias is an alias placed on an IP label. It coexists on an interface along with the IP label. Networks that support Gratuitous ARP cache updates enable configuration of IP aliases.
IP Address Takeover (IPAT) is a process whereby a service IP label on one node is taken over by a backup node in the cluster. HACMP uses IPAT to provide high availability of service IP labels that belong to resource groups. These labels provide access to applications. HACMP uses IPAT to recover the IP label on the same node or the backup node. HACMP for Linux by default supports the mode of IPAT known as IPAT via IP Aliasing. (The other method of IPAT, IPAT via IP Replacement, is not supported.)
IP Address Takeover via IP Aliasing is the default method of IPAT used in HACMP. HACMP uses IPAT via IP Aliasing in cases when it must automatically recover a service IP label on another node. To configure IPAT via IP Aliasing, you configure service IP labels and their aliases to the system. When HACMP performs IPAT during automatic cluster events, it places an IP alias recovered from the “failed” node on top of the service IP address on the takeover node. As a result, access to the application continues to be provided.
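The effect of IPAT via IP Aliasing can be pictured with a small data model. The following Python sketch is a simplification for illustration, not the HACMP implementation: it represents each node's interface as a base label plus a list of aliases that coexist with it, and the `Interface` class, `move_service_label` function and label names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Interface:
    node: str
    base_label: str                       # boot label, stays on its node
    aliases: List[str] = field(default_factory=list)

    def labels(self) -> List[str]:
        """All labels currently reachable through this interface."""
        return [self.base_label] + self.aliases


def move_service_label(source: Interface, target: Interface,
                       service_label: str) -> None:
    """Remove the alias from the failed side and add it on the takeover side."""
    if service_label not in source.aliases:
        raise ValueError(f"{service_label} is not aliased on {source.node}")
    source.aliases.remove(service_label)
    target.aliases.append(service_label)


# Hypothetical labels: app_svc is the service IP label of the application.
node1_if = Interface("Node1", "node1_boot", ["app_svc"])
node2_if = Interface("Node2", "node2_boot")

move_service_label(node1_if, node2_if, "app_svc")
print(node1_if.labels())  # ['node1_boot']
print(node2_if.labels())  # ['node2_boot', 'app_svc']
```

Because the alias is added alongside the existing label rather than replacing it, the takeover node keeps its own address while also answering for the service label.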
Cluster resources can include an application server and a service IP label. All or some of these resources can be associated with an application you plan to keep highly available. You include cluster resources into resource groups.
A resource group is a collection of cluster resources.
Resource group startup is an activation of a resource group and its associated resources on a specified cluster node. You choose a resource group startup policy from a predefined list in WebSMIT.
Resource group fallover is the action in which HACMP moves a resource group from one node to another. In other words, a resource group and its associated application fall over to another node. You choose a resource group fallover policy from a predefined list in WebSMIT.
Takeover is an automatic action during which HACMP takes over resources from one node and moves them to another node. Takeover occurs when a resource group falls over to another node. A backup node is referred to as a takeover node.
Resource group fallback is the action in which HACMP returns a resource group from a takeover node back to the home node. You choose a resource group fallback policy from a predefined list in WebSMIT.
Cluster Startup is the starting of HACMP cluster services on the node(s).
Cluster Shutdown is the stopping of HACMP cluster services on the node(s).
Pre- and post-events are customized scripts provided by you (or other system administrators), which you can make known to HACMP and which will be run before or after a particular cluster event. For more information on pre- and post-event scripts, see the chapter on Planning Cluster Events in the HACMP for AIX 5L Planning Guide.
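The ordering of pre-event processing, the event itself and post-event processing can be sketched as follows. This is a conceptual Python model only: in HACMP the pre- and post-events are scripts that you register with the product, whereas here they are plain Python callables, and the `EventRunner` class is hypothetical. The event name `network_down` is the one mentioned later in this chapter.

```python
from typing import Callable, Dict, List

Hook = Callable[[str], None]


class EventRunner:
    """Runs registered pre-hooks, then the event handler, then post-hooks."""

    def __init__(self) -> None:
        self.pre_hooks: Dict[str, List[Hook]] = {}
        self.post_hooks: Dict[str, List[Hook]] = {}

    def add_pre(self, event: str, hook: Hook) -> None:
        self.pre_hooks.setdefault(event, []).append(hook)

    def add_post(self, event: str, hook: Hook) -> None:
        self.post_hooks.setdefault(event, []).append(hook)

    def run(self, event: str, handler: Hook) -> None:
        for hook in self.pre_hooks.get(event, []):
            hook(event)
        handler(event)  # stands in for the cluster event processing itself
        for hook in self.post_hooks.get(event, []):
            hook(event)


log: List[str] = []
runner = EventRunner()
runner.add_pre("network_down", lambda e: log.append(f"pre {e}"))
runner.add_post("network_down", lambda e: log.append(f"post {e}"))
runner.run("network_down", lambda e: log.append(f"handle {e}"))
print(log)  # ['pre network_down', 'handle network_down', 'post network_down']
```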
Sample Configuration with a Diagram
The following configuration includes:
- Node1 and Node2 running Linux
- A serial network
- An IP-based network.
Node and Network Failure Scenarios
This section describes how HACMP for Linux handles failures and ensures that the application keeps running.
The following scenarios are considered:
Node Failure
If the application is configured to normally run on Node1 and Node1 fails, the resource group with the application falls over, or moves, to Node2.
At a high level, on Node2, HACMP detects that Node1, the default owner of the resource group, has failed and moves the resource group to Node2. This operation is called a resource group takeover. The application is kept highly available and the end users continue to access it.
If Node1 rejoins the cluster, HACMP performs the resource group fallback according to the resource group policy. The resource group moves back to Node1 (for example, if that is the selected fallback policy for the resource group).
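The node failure scenario, with fallover to Node2 and an optional fallback to Node1, can be traced with a small state model. The Python sketch below is illustrative and makes simplifying assumptions: a two-node cluster, one resource group, and a single boolean `fallback_to_home` standing in for the fallback policy you would select in WebSMIT. The class and method names are hypothetical.

```python
from typing import Optional, Set


class TwoNodeCluster:
    """Tracks which node owns one resource group in a two-node cluster."""

    def __init__(self, home: str = "Node1", backup: str = "Node2",
                 fallback_to_home: bool = True) -> None:
        self.home = home
        self.backup = backup
        self.fallback_to_home = fallback_to_home
        self.up: Set[str] = {home, backup}
        self.owner: Optional[str] = home  # group starts on its home node

    def node_down(self, node: str) -> None:
        """Fallover: if the owner fails, move the group to the other node."""
        self.up.discard(node)
        if self.owner == node:
            other = self.backup if node == self.home else self.home
            self.owner = other if other in self.up else None

    def node_up(self, node: str) -> None:
        """Fallback: return the group to the home node if the policy says so."""
        self.up.add(node)
        if self.owner is None:
            self.owner = node
        elif node == self.home and self.fallback_to_home:
            self.owner = self.home


cluster = TwoNodeCluster(fallback_to_home=True)
cluster.node_down("Node1")
print(cluster.owner)  # Node2
cluster.node_up("Node1")
print(cluster.owner)  # Node1
```

Constructing the cluster with `fallback_to_home=False` leaves the resource group on Node2 after Node1 rejoins, which corresponds to choosing a policy that avoids a second move.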
Network Failure
A network failure occurs when none of the cluster nodes can access each other using any of the network interface cards configured for the HACMP network.
To protect against network failures, we recommend that you have the nodes in the cluster connected by multiple networks. If one network fails, HACMP uses a network that is still available for cluster traffic and for monitoring the status of the nodes (heartbeating).
You can also specify additional actions to process a network failure—for example, re-routing through an alternate network.
How HACMP Handles Network Failures on the Local Node
A local network failure occurs when all interfaces of a specific cluster network on a node fail. For example, if you have nodes A and B, and networks net1 and net2, and all interfaces of network net1 on node A fail, then a network_down event runs for net1 with node A as the event node. You can see this in the /tmp/hacmp.out file.
In this case, the Cluster Manager takes selective recovery actions for resource groups containing a service IP label connected to that network. The Cluster Manager attempts to recover only the resource groups affected by the local network failure event.
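Selective recovery amounts to filtering the resource groups by the event node and the failed network. The following Python sketch shows that filter using the nodes A and B and networks net1 and net2 from the example above. The resource group names and the `groups_to_recover` helper are hypothetical, and the sketch does not reflect how the Cluster Manager stores its configuration.

```python
from typing import Dict, List, NamedTuple


class ResourceGroup(NamedTuple):
    hosting_node: str     # node where the group is currently active
    service_network: str  # network that carries its service IP label


def groups_to_recover(groups: Dict[str, ResourceGroup],
                      event_node: str, failed_network: str) -> List[str]:
    """Return only the groups affected by a local network failure."""
    return sorted(
        name for name, rg in groups.items()
        if rg.hosting_node == event_node
        and rg.service_network == failed_network
    )


# Hypothetical resource groups spread over nodes A and B.
groups = {
    "rg_db": ResourceGroup("A", "net1"),
    "rg_web": ResourceGroup("A", "net2"),
    "rg_batch": ResourceGroup("B", "net1"),
}

# network_down for net1 with node A as the event node
print(groups_to_recover(groups, "A", "net1"))  # ['rg_db']
```

Groups on node A that use net2, and groups on node B, are left alone because the failure does not affect their service IP labels.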
Network Interface Failure
The HACMP software handles failures of network interfaces on which a service IP label is configured. Types of such failures are:
- Out of two network interfaces configured on the same HACMP node and network, the network interface with a service IP label fails, but an additional “backup” network interface card remains available. In this case, the Cluster Manager removes the service IP label from the failed network interface, and recovers it, via IP aliasing, on the “backup” network interface. Such a network interface failure is transparent to you except for a small delay while the system reconfigures the network interface on the node.
- Out of two network interfaces configured on a node, an additional or “backup” network interface fails, but the network interface with a service IP label configured on it remains available. In this case, the Cluster Manager detects a (backup) network interface failure, logs the event, and sends a message to the system console. The application continues to be highly available. If you want additional processing, you can customize the processing for this event.
If the service IP label that is part of a resource group cannot be recovered on a local node, HACMP moves the resource group with the associated IP label to another node, using IP aliasing as the mechanism to recover the associated service IP label.
Preventing Cluster Partitioning
To prevent cluster partitioning, configure a serial network for heartbeating between the nodes, in addition to the IP-based cluster network. If the IP-based cluster network connection between the nodes fails, the heartbeating network prevents data divergence and cluster partitioning.
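The value of a second heartbeat path can be seen by listing what a node can conclude from each combination of heartbeat results. The Python sketch below is a reasoning aid under simplifying assumptions, not the Cluster Manager's detection logic: heartbeat status is reduced to two booleans, and the `diagnose_peer` function and its return strings are hypothetical.

```python
def diagnose_peer(ip_heartbeat: bool, serial_heartbeat: bool) -> str:
    """Interpret heartbeat results from a peer over two independent paths."""
    if ip_heartbeat and serial_heartbeat:
        return "peer up"
    if serial_heartbeat:
        # Heartbeats still arrive over the serial network, so the peer is
        # alive and only the IP network has failed.
        return "peer up, IP network failed"
    if ip_heartbeat:
        return "peer up, serial network failed"
    # Both paths are silent: the peer itself has most likely failed.
    return "peer presumed down"


for ip_ok in (True, False):
    for serial_ok in (True, False):
        print(ip_ok, serial_ok, "->", diagnose_peer(ip_ok, serial_ok))
```

With only the IP network configured, the second and fourth cases would look identical to each node, and each side could conclude that the other had failed and start the same application. That situation is the partitioned cluster the serial network is meant to prevent.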
For information on planning the cluster networks configuration, see Chapter 2: Planning and Installing HACMP for Linux.
For more information, see Chapter 5: Monitoring and Troubleshooting a Cluster.
Where You Go from Here
The remainder of this guide documents how to plan, install, configure and use HACMP for Linux.