
Appendix A: 7x24 Maintenance


The goal of high availability is to keep systems up and running, allowing continuous access to critical applications. In many enterprises it has become necessary to keep applications running seven days a week, 24 hours a day. With proper planning, customizing, and monitoring, an HACMP cluster can provide nearly continuous availability, interrupted only by scheduled, necessary maintenance.

This appendix is a collection of information describing the issues and procedures involved in keeping a cluster running on as close to a 7 X 24 basis as possible.

The appendix contains the following sections:

  • Planning for 7 X 24 Maintenance. This section reemphasizes the importance of careful planning and customization of the original installation of the cluster.
  • Runtime Maintenance. This section offers reminders and tips to help you avoid actions that endanger a stable, running cluster.
  • Hardware Maintenance. This section contains procedures for changing or replacing certain hardware.
  • Preventive Maintenance. This section reviews tools you can use to avoid problems or catch them early.

    Overview

    Throughout all stages of cluster administration—planning, configuration, maintenance, troubleshooting, and upgrading—here are tasks you can do and systems you can put in place to help ensure your cluster’s nearly continuous availability.

    Once you have configured the cluster and brought it online, it is very important to do maintenance tasks in as non-disruptive a way as possible. The HACMP cluster is a distributed operating system environment. Therefore maintaining an HACMP cluster requires attention to some issues that have different ramifications in the cluster environment compared to maintaining a single-server system.

    Making changes to a cluster must be thoroughly planned, since changes to one component may have cascading effects. Changes on one node may affect other nodes, but this may not be apparent until fallover occurs (or cannot occur due to a non-synchronized change to the cluster). Some of the do’s and don’ts of cluster maintenance are explained in this appendix.

    Setting up and following regular preventive maintenance procedures helps alert you to any potential problems before they occur. Then you can take timely action or plan fallovers or cluster downtime at your convenience as necessary to deal with any impending issues.

    Planning for 7 X 24 Maintenance

    Carefully planning the original installation of your cluster goes a long way toward making cluster maintenance easier. A well-configured and customized cluster is the first step to good preventive maintenance. Proper cluster configuration also makes it less likely you will have to make changes that affect cluster performance while users are accessing their applications.

    Planning the cluster starts with a single point of failure analysis. See the Planning Guide for a detailed list of issues to consider. Once the cluster is installed and running, you need to handle any failures as quickly and as automatically as possible. Planning for runtime failure recovery helps ensure that HACMP for AIX 5L does all that it is capable of doing to keep your critical resources online.

    This section includes information on the following topics:

  • Customizing the cluster, including setting up error notification to improve monitoring and management of the cluster
  • Tuning the communications system—network and nameserving issues
  • Planning disk and volume group layout
  • General planning for hardware and software maintenance.

    Customizing the Cluster

    Customizing the cluster enhances your ability to monitor the cluster and keep it running. You can define a pre-event, a post-event, and a notification method for every cluster event. Notification of events is critical to maintaining service for any HACMP cluster. Although HACMP writes messages to the hacmp.out and cluster.log log files, it is very useful to send notification to the console or mail to the system administrator when an event occurs that demands immediate attention.

    You can include automatic recovery actions as well as notification in the cluster customization. Use the HACMP and AIX 5L tools available to customize some or all of the following:

  • Hardware error notification
  • Hardware failure notification
  • Cluster event notification
  • Pre- and post-event recovery actions
  • Network failure escalation
  • ARP cache refresh
  • Pager notification
  • Application server scripts.

    It is highly recommended that you maintain a test cluster as well as your production cluster. Thus before you make any major change to the production cluster, you can test the procedure on the test cluster. HACMP supplies event emulation utilities to aid in testing.

    Customizing AIX 5L Error Notification of Hardware Errors

    Customizing notification when you configure the cluster is a good preventive measure. See Chapter 7 in the Planning Guide for complete information on customizing and setting up notification of cluster events.

    See Chapter 9: Configuring AIX 5L for HACMP in the Installation Guide for information on using AIX 5L Error Notification, and for information on setting up automatic notification for hardware errors that do not cause cluster events.

    Using the HACMP Automatic Error Notification SMIT panels, you can turn on automatic error notification for selected hard, non-recoverable error types: disk, disk adapter, and SP switch adapter errors. All disks defined as HACMP resources, and disks in the rootvg and HACMP volume groups and filesystems are included.

    You may want to set up error notification for certain media or temporary errors. You may also want to customize the error notification for some devices rather than using one of the two automatic error notification methods.
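
    For example, a customized notification can be added directly to the AIX 5L errnotify ODM class. The following stanza is a minimal sketch: the object name, the error label chosen (SSA_DISK_ERR3, a temporary SSA disk error listed below), and the notify method path are assumptions to replace with your own.

    errnotify:
        en_name = "ssa_tmp_disk_err"
        en_persistenceflg = 1
        en_class = "H"
        en_type = "TEMP"
        en_label = "SSA_DISK_ERR3"
        en_method = "/usr/local/cluster/notify_admin $1"

    Save the stanza to a file (for example, /tmp/ssa_notify.add) and add it with the odmadd command:

    odmadd /tmp/ssa_notify.add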

    List of Hardware Errors to Monitor

    The following list of hardware errors gives you a good idea of types of errors to monitor. The first list shows which errors are handled by the HACMP automatic error notification utility. The following lists show other types of errors you may want to address. For each device monitored, you can determine an additional action other than notification, such as:

  • Stop cluster services and move resource groups to another node.
  • Initiate a custom recovery action such as reconfiguration for a failed device using an alternative device.

    Hardware Errors Handled by HACMP Auto-Error Notification

    DISK_ERR2 – Permanent physical disk error (known error)
    DISK_ERR3 – Permanent physical disk error, adapter detected (known error)
    SCSI_ERR1 – Permanent SCSI adapter hardware error (known error)
    SCSI_ERR3 – Permanent SCSI adapter microcode error (known error)
    SCSI_ERR5 – Temporary SCSI bus error
    SCSI_ERR7 – Permanent unknown system error
    SCSI_ERR9 – Potential data loss condition
    SDA_ERR1 – Adapter hardware error condition
    SDA_ERR3 – Permanent unknown system error
    SDC_ERR1 – Controller/DASD link error
    SDC_ERR2 – Controller hardware error
    SSA_HDW_ERR – SSA hardware error condition
    SSA_DISK_ERR1 – Permanent microcode program error
    SSA_DISK_ERR4 – Permanent disk operation error
    DISK_ARRAY_ERR2 – Permanent disk operation error (disk failure)
    DISK_ARRAY_ERR3 – Permanent disk operation error (disk failure)
    DISK_ARRAY_ERR5 – Permanent disk operation error (disk failure)
    SCSI_ARRAY_ERR2 – SCSI hardware error
    HPS_FAULT4_ER – SP Switch error

    Disk and Adapter Errors Not Covered by HACMP Auto-Error Notification

    SSA_HDW_RECOVERED – Temporary adapter error
    SSA_DISK_ERR3 – Temporary disk operation error
    SSA_DEGRADED_ERROR – Adapter performance degraded
    SSA_LOGGING_ERROR – Permanent: unable to log an error against a disk
    SSA_LINK_OPEN – Permanent adapter detected open serial link
    SSA_SOFTWARE_ERROR – Permanent software program error
    SSA_LINK_ERROR – Temporary link error
    SSA_DETECTED_ERROR – Permanent loss of redundant power/cooling
    LVM_MISSPVADDED – PV defined as missing (unknown error)
    LVM_SA_WRT – PV defined as missing (unknown error)
    LVM_SA_PVMISS – Failed to write VGSA (unknown error)

    Disk Array Errors Not Covered by HACMP Auto-Error Notification

    DISK_ARRAY_ERR4 – Temporary disk operation error (disk media failing)
    DISK_ARRAY_ERR6 – Permanent array subsystem degradation (disk media failure)
    DISK_ARRAY_ERR7 – Permanent array subsystem degradation (controller)
    DISK_ARRAY_ERR8 – Permanent array active controller switch (controller)
    DISK_ARRAY_ERR9 – Permanent array controller switch failure

    Failed 64-port Adapter (tty device driver)

    COM_PERM_PIO – PIO exception, possible adapter failure

    You may have additional devices critical to your operation that are not supported by HACMP for AIX 5L. You can set up AIX 5L error notification to monitor microcode errors for those devices or adapter time-outs.

    Customizing Cluster Events

    Customizing cluster events to send notification or to take recovery actions is another method you can use to help keep the cluster running as smoothly as possible.

    See Chapter 7 in the Planning Guide for complete information on customizing and setting up notification of cluster events.

    See the Sample Custom Scripts section in Chapter 1: Troubleshooting HACMP Clusters in the Troubleshooting Guide for tips on writing scripts to make cron jobs and print queues highly available. There is also a plug-in for print queues in the /usr/es/sbin/cluster/samples directory.

    Customizing Application Server Scripts

    See Appendix B: Applications and HACMP in the Planning Guide for tips on handling applications.

    Some key things to keep in mind:

  • Define an HACMP application server for each node that supports applications requiring recovery.
  • Applications must be started up and shut down in an orderly fashion. Some situations exist where the timing and control of starting and stopping applications needs to be handled based on pre/post event process. You may need to take into account the order in which applications assigned to the same node are started. Optionally, you can also include applications in different resource groups and establish dependencies between resource groups. For more information, see Adding Resources and Attributes to Resource Groups Using the Extended Path in Chapter 5: Configuring HACMP Resource Groups (Extended).
  • Check for dependencies between nodes. For example, a process on node1 may not start until a process that runs on node2 is up. Include a check for remote node/application availability before issuing the local startup command.
  • You may need to perform some checks to make sure the application is not running, and to clean up logs or roll back files, before starting the application process (see the sketch after this list).
  • See the Sample Custom Scripts section in Chapter 1: Troubleshooting HACMP Clusters in the Troubleshooting Guide for tips on writing scripts to make cron jobs and print queues highly available. There is also a plug-in for print queues in the /usr/es/sbin/cluster/samples directory.
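
    The following fragment is a minimal sketch of such checks in an application server start script. The application name, prerequisite process, remote node, helper commands, and paths are assumptions for illustration only.

    #!/bin/ksh
    # Sketch of an application server start script (all names are examples)
    APP_CMD=/usr/local/bin/start_appserver      # assumed application start command
    REMOTE_NODE=node2                           # assumed node running a prerequisite process
    # do not start a second instance of the application
    if ps -e -o args | grep -v grep | grep -q appserver ; then
        exit 0
    fi
    # wait briefly for the prerequisite process on the remote node
    count=0
    while [ -z "$(rsh $REMOTE_NODE ps -e -o args | grep dbserver | grep -v grep)" ] ; do
        count=$((count + 1))
        [ $count -ge 12 ] && break              # give up after about two minutes
        sleep 10
    done
    # clean up stale lock and log files, then start the application
    rm -f /var/appserver/appserver.pid
    $APP_CMD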

    Application Monitoring

    You can monitor a set of applications that you define through the SMIT interface.

    In HACMP 5.2 and up you can configure multiple application monitors and associate them with one or more application servers. By supporting multiple monitors per application, HACMP can support more complex configurations. For example, you can configure one monitor for each instance of an Oracle parallel server in use. Or, you can configure a custom monitor to check the health of the database, and a process termination monitor to instantly detect termination of the database process.

    You assign each monitor a unique name in SMIT.

    Prior to HACMP 5.2, each application that is kept highly available could have only one of the two types of monitors configured for it. Process application monitoring detected the death of one or more processes using RSCT Resource Monitoring and Control (RMC). Custom application monitoring checked the health of an application at user-specified polling intervals.

    For example, you could supply a script to HACMP that sends a request to a database to check that it is functioning. A non-zero exit from the customized script indicated a failure of the monitored application, and HACMP responded by trying to recover the resource group that contains the application. However, you could not use two monitors for one application.

    For instructions, see Configuring Multiple Application Monitors in Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended).

    With each monitor configured, when a problem is detected, HACMP attempts to restart the application, and continues up to a specified retry count. You select one of the following responses for HACMP to take when an application cannot be restarted within the retry count:

  • The fallover option causes the resource group containing the application to fall over to the node with the next-highest priority according to the resource policy.
  • The notify option causes HACMP to generate a server_down event, to inform the cluster of the failure.
  • You can customize the restart process through the Notify Method, Cleanup Method, and Restart Method SMIT fields, and by adding pre- and post-event scripts to any of the failure action or restart events you choose.

    Note: If the System Resource Controller (SRC) is configured to restart the application, this can interfere with actions taken by application monitoring. Disable the SRC restart for the application (application start and stop scripts should not use the SRC unless the application is not restartable). If you use a custom monitor, the monitoring script is responsible for correct operation; application monitoring acts on the script's return code.
    Note: If a monitored application is under control of the System Resource Controller, check that the action and multi values are set to -O and -Q. The -O value specifies that the subsystem is not restarted if it stops abnormally. The -Q value specifies that multiple instances of the subsystem are not allowed to run at the same time. These values can be checked using the following command:
    lssrc -Ss <Subsystem> | cut -d : -f 10,11

    If the values are not -O and -Q then they must be changed using the chssys command.
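
    For example, for a hypothetical subsystem named mydb, the check and the correction might look like this:

    lssrc -Ss mydb | cut -d : -f 10,11     # should print -O:-Q
    chssys -s mydb -O -Q                   # disable SRC restart and multiple instances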

    Measuring Application Availability

    You can use the Application Availability Analysis Tool to measure the amount of time that any of your applications (with a defined application server) is available. The HACMP software collects, time stamps, and logs the following information:

  • An application starts, stops, or fails
  • A node fails or is shut down, or comes up
  • A resource group is taken offline or moved
  • Application monitoring is suspended or resumed.

    Using SMIT, you can select a time period and the tool will display uptime and downtime statistics for a given application during that period. The tool displays:

  • Percentage of uptime
  • Amount of uptime
  • Longest period of uptime
  • Percentage of downtime
  • Amount of downtime
  • Longest period of downtime

    Note: All nodes must be available when you run the tool to display these statistics. Clocks on all nodes must be synchronized in order to get accurate readings.

    See Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) and Chapter 10: Monitoring an HACMP Cluster, for complete information on application monitoring and configuring and using this tool.

    Network Configuration and Nameserving

    Setting up and maintaining clear communication paths for the Cluster Manager is a key element for efficient cluster operation.

    Setting up Serial Networks or Other Heartbeat Path

    It is crucial to have at least one serial network configured for the cluster. Without a serial network, you run the risk of a partitioned cluster if TCP/IP networks fail, since the nodes will be unable to maintain heartbeat communication. You can also use disk heartbeats or heartbeats over IP aliases to maintain cluster communications.

    Integrating HACMP with Network Services

    HACMP requires IP address-to-name resolution. The three most commonly used methods are:

  • Domain Name Service
  • Network Information Service
  • Flat file name resolution (/etc/hosts).

    By default, a name request is resolved first through DNS (/etc/resolv.conf), then through NIS, and finally through /etc/hosts. Since DNS and NIS both require certain hosts as designated servers, it is necessary to maintain the /etc/hosts file in case the DNS or NIS name server is unavailable, and to identify hosts that are not known to the name server. All HACMP IP labels must be present in the /etc/hosts file on every cluster node.

    To ensure the most rapid name resolution of cluster nodes, change the default order for name serving so that /etc/hosts is used first (at least for cluster nodes).

    To do this, edit the /etc/netsvc.conf file so that this line appears as follows:

    hosts=local,nis,bind 
    

    Putting the local option first tells the system to use /etc/hosts first, then NIS.

    You can also change the order for name resolution by changing the environment variable NSORDER as follows:

    NSORDER=local,bind,nis 
    
    Note: By default, during the process of IP address swapping, to ensure that the external name service does not cause AIX 5L to map the service IP address to the wrong network interface, HACMP automatically disables NIS or DNS by temporarily setting the AIX 5L environment variable NSORDER=local within the event scripts.
    Note: If you are using NIS, have the NIS master server outside the cluster, and have the cluster nodes run as NIS slave servers. At a minimum, every HACMP node must be able to access NIS master or slave servers on a local subnet, and not via a router.

    See the Planning Guide and the Installation Guide for information on editing the /etc/hosts file, and also for notes on NIS and cron considerations.

    Warning: You cannot use DHCP to allocate IP addresses to HACMP cluster nodes. Clients may use this method, but cluster nodes cannot.

    Tuning Networks for Best Performance

    HACMP provides easy control over several tuning parameters that affect the cluster’s performance. Setting these tuning parameters correctly to ensure throughput and adjusting the HACMP failure detection rate can help avoid “failures” caused by heavy network traffic.

    Cluster nodes sometimes experience extreme performance problems, caused by large I/O transfers, excessive error logging, or lack of memory. When this happens, the HACMP daemons can be starved for CPU time. Processes running at a priority higher than the RSCT or Cluster Manager subsystems can also cause this problem.

    The deadman switch is an AIX 5L kernel extension that halts a node when the Cluster Manager does not run for a certain amount of time, usually due to one of the problems noted above.

    See Chapter 9: Configuring AIX 5L for HACMP in the Installation Guide for information on setting tuning parameters correctly to avoid some of the performance problems noted above.

    If you are running a cluster on an SP, also consult your SP manual set for instructions on tuning SP switch networks.

    Planning Disks and Volume Groups

    Planning the disk layout is crucial for the protection of your critical data in an HACMP cluster. Follow the guidelines carefully, and keep in mind these issues:

  • All operating system files should reside in the root volume group (rootvg) and all user data should reside outside that group. This makes updating or reinstalling the operating system and backing up data more manageable.
  • A node whose resources are not designed to be taken over should not own critical volume groups.
  • When using copies, each physical volume using a mirror copy should get its power from a UPS system.
  • Volume groups that contain at least three physical volumes provide the maximum availability when implementing mirroring (one mirrored copy for each physical volume).
  • auto-varyon must be set to false (see the example below). HACMP will be managing the disks and varying them on and off as needed to handle cluster events.
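
    A minimal example, assuming a shared volume group named sharedvg:

    chvg -a n sharedvg                    # do not vary on automatically at system restart
    lsvg sharedvg | grep -i "AUTO ON"     # verify that AUTO ON is set to no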

    Quorum Issues

    Setting up quorum correctly when laying out a volume group is very important. Quorum must be enabled on concurrent volume groups. With quorum enabled, a two-disk non-concurrent volume group puts you at risk for losing quorum and data access. The failure of a single adapter or cable would cause half the disks to be inaccessible. HACMP provides some protections to avoid the failure, but planning is still important.

    Either build three-disk volume groups or disable quorum on non-concurrent volume groups. You can also use the forced varyon option to work around quorum issues.

    See the detailed section about Quorum in Chapter 5: Planning Shared LVM Components in the Planning Guide.

    HACMP selectively provides recovery for resource groups that are affected by failures of individual resources. HACMP automatically reacts to a “loss of quorum” LVM_SA_QUORCLOSE error associated with a volume group going offline on a cluster node. If quorum is lost for a volume group that belongs to a resource group on a cluster node, the system checks whether the LVM_SA_QUORCLOSE error appeared in the node’s AIX 5L error log file and informs the Cluster Manager to selectively move the affected resource group.

    Note: When the AIX 5L error log buffer is full, new entries are discarded until space becomes available in the buffer; AIX 5L then adds an error log entry to inform you of this problem. For information about increasing the size of the error log device driver internal buffer, see the AIX 5L documentation listed in the section Accessing Publications in About This Guide.
    Note: HACMP launches selective fallover and moves the affected resource group only in the case of the LVM_SA_QUORCLOSE error. Be aware that this error occurs if you use mirrored volume groups with quorum enabled. However, in many cases, different types of “volume group failure” errors could occur. HACMP does not react to any other type of volume group errors automatically. In these cases, you still need to configure customized error notification methods, or use AIX 5L Automatic Error Notification methods to react to volume group failures.

    For more information on Selective Fallover triggered by loss of quorum for a volume group, see the section Selective Fallover Caused by a Volume Group Loss in Appendix B: Resource Group Behavior during Cluster Events.

    Planning Hardware Maintenance

    Good maintenance practice in general dictates that you:

  • Check cluster power supplies periodically
  • Check the errlog and/or any other logs where you have redirected information of interest and attend to all notifications in a timely manner
  • Be prepared to replace any failed or outdated cluster hardware.

    If possible, you should have replacement parts readily available. If the cluster has no single points of failure, it will continue to function even though a part has failed. However, now a single point of failure may exist. If you have set up notification for hardware errors, you have an early warning system in place.

    This guide contains procedures detailing how to replace the following cluster components while keeping the cluster running:

  • Network
  • Network interface card
  • Disk
  • Node.

    See Hardware Maintenance for more information.

    Planning Software Maintenance

    Planning for software maintenance includes:

  • Customizing notification of software problems
  • Periodically checking and cleaning up log files
  • Taking cluster snapshots when making any change to the cluster configuration
  • Preparing for upgrading AIX 5L, applications, and HACMP for AIX 5L.

    See Preventive Maintenance for more details.

    Runtime Maintenance

    Once you have configured the cluster and brought it online, it is very important to do maintenance tasks in as non-disruptive a way as possible. Maintaining an HACMP cluster requires attention to some issues that have different ramifications in the cluster environment compared to maintaining a single system.

    This section discusses the following issues:

  • Tasks that require stopping the cluster
  • Warnings about the cascading effects caused by making certain types of changes to a stable, running cluster

    Tasks that Require Stopping the Cluster

    HACMP allows you to do many tasks without stopping the cluster; you can do many tasks dynamically using the DARE and C-SPOC utilities. However, in order to do the following tasks, you must stop the cluster:

  • Change the name of a cluster component: network module, cluster node, or network interface. Once you configure the cluster, you should not need to change these names.
  • Maintain RSCT.
  • Change automatic error notification.
  • Change SSA fence registers.
  • Convert a service IP label from IPAT via IP Replacement to IPAT via IP Aliases.

    Changing the Cluster Configuration—Cascading Effects on Cluster Behavior

    Installing HACMP makes changes to several AIX 5L files (see Chapter 1: Administering an HACMP Cluster). All the components of the cluster are under HACMP control once you configure, synchronize, and run the cluster software. Using AIX 5L to change any cluster component, instead of using the HACMP menus and synchronizing the topology and/or the cluster resources, will interfere with the proper behavior of the HACMP cluster software and thus affect critical cluster services.

    This section contains warnings about actions that will endanger the proper behavior of an HACMP cluster. It also includes some reminders about proper maintenance procedures.

    Stopping and Starting Cluster Services

    Do not directly start or stop daemons or services that are running under the control of HACMP. Any such action will affect cluster communication and behavior. You can choose to run certain daemons (Clinfo) but others are required to run under HACMP control.

    Most important, never use the kill -9 command to stop the Cluster Manager or any RSCT daemons. This causes an abnormal exit: the SRC runs the clexit.rc script and halts the system immediately, which causes the other nodes to initiate a fallover.

    TCP/IP services are required for cluster communication. Do not stop this service on a cluster node. If you need to stop HACMP or TCP/IP to maintain a node, use the proper procedure to move the node’s resources to another cluster node, then stop cluster services on this node. Follow the instructions in Chapter 15: Managing Resource Groups in a Cluster to make changes to cluster topology or resources.

    Node, Network, and Network Interface Issues

    The HACMP configuration of the nodes and IP addresses is crucial to the communication system of the cluster. Any change in the definitions of these elements must be updated in the cluster configuration and resynchronized.

    Do not change the configuration of a cluster node, network, or network interface using AIX 5L SMIT menus or commands, individually on a cluster node, outside of HACMP. See Chapter 15: Managing Resource Groups in a Cluster, for instructions on changing the configuration dynamically following the proper HACMP cluster procedures.

    Do not start or stop daemons or services that are running under the control of HACMP. This action will affect cluster communication and behavior.

    Be sure to follow proper procedures for the following types of changes:

  • Changing the IP label/address of any network interface defined to HACMP. Changes to IP addresses must be updated in the HACMP cluster definition and the cluster must then be resynchronized. Any change to network interface attributes normally requires stopping cluster services, making the change, and restarting cluster services.
  • Note that in some circumstances you can use the HACMP facility to swap a network service IP address dynamically, to another active network interface on the same node and network, without shutting down cluster services on the node. See Swapping IP Addresses between Communication Interfaces Dynamically in Chapter 13: Managing the Cluster Topology for more information.

  • Changing netmasks of network interfaces. Service and other network interfaces on the same network must have the same netmask on all cluster nodes. Changes made outside the cluster definition will affect the ability of the Cluster Manager to send heartbeat messages across the network.
  • It is important to configure the correct interface name for network interfaces. See the relevant section in Chapter 15: Managing Resource Groups in a Cluster.
  • Enabling an alternate Hardware Address for a service network interface using AIX 5L SMIT.
  • Taking down network interface cards. Do not take down all interface cards on the same network if the cluster is customized to stop cluster services and move resource groups to another node when all communications on that network fail; taking down all the interfaces will force the resource groups to move to another node whether you intend it or not.
  • Taking down network interfaces. Do not bring all network interfaces down on the same network if there is only one network and no point-to-point network is defined. Doing this will cause system contention between cluster nodes and fallover attempts made by each node. A Group Services domain merge message is issued when a node has been out of communication with the cluster and then attempts to reestablish communication. The cluster will remain unstable until you fix the problem.

    Making Changes to Network Interfaces

    In some circumstances, you can use the HACMP facility to swap a network service IP address dynamically, to an active standby network interface on the same node and network, without shutting down cluster services on the node. See Swapping IP Addresses between Communication Interfaces Dynamically in Chapter 13: Managing the Cluster Topology for more information.

    Typically, stop the cluster to make any change to network interfaces. If you must change the IP address or IP label of a network interface, make sure to make the change in both DNS or NIS and the /etc/hosts file. If DNS or NIS and /etc/hosts are not updated, you will be unable to synchronize the cluster nodes or do any DARE operations. If DNS or NIS services are interrupted, the /etc/hosts file is used for name resolution. You must also redo cl_setup kerberos if you are using Kerberos security.

    Handling Network Load/Error rates

    Dropped packets due to network loads may cause false fallovers. Also, high throughput may cause the deadman switch to time-out. If either of these conditions occurs, check the AIX 5L network options and the Failure Detection Rate you have set for the cluster. These parameters are contained in the Advanced Performance Tuning Parameters panel in SMIT.

    See Changing the Configuration of a Network Module in Chapter 13: Managing the Cluster Topology, for information on tuning the Failure Detection Rate. RSCT logging can also help with the tuning of networks.

    Maintaining and Reconfiguring Networks

    Moving Ethernet ports on a running cluster results in network interface swap or node failure. Even a brief outage results in a cluster event.

    Shared Disk, Volume Group, and Filesystem Issues

    Do not change the configuration of an HACMP shared volume group or filesystem using AIX 5L, outside of HACMP. Any such action will affect cluster behavior. The Cluster Manager and the cluster event scripts assume the shared volume groups and filesystems are under HACMP control. If you change the environment, the event scripts will not be able to complete properly and you will get unexpected results.

    Disk Issues

    Disks should always be mirrored (or use a disk array), to protect against loss of data. Once they are defined and configured within the HACMP cluster, you should always use the HACMP C-SPOC utility (smit cl_admin) to add or remove disks from a volume group with the cluster running. The cluster needs to be made aware of disks being added to or removed from a shared volume group. If you add or remove disks using the conventional method, the cluster will not be aware that these changes have occurred.

    Volume Group and Filesystems Issues

    Use the C-SPOC utility (smit cl_admin) for common maintenance tasks like creating, extending, changing, or removing a shared filesystem. See Chapter 11: Managing Shared LVM Components.

    See Chapter 5: Planning Shared LVM Components in the Planning Guide for information on using NFS and HACMP.

    When configuring volume groups and filesystems:

  • Do not set filesystems to automount; HACMP handles the mounts at startup and during cluster events.
  • Do not set volume groups to autovaryon; HACMP executes the varying on and off as needed.
  • If you are testing something when the cluster is not running and you varyon a volume group or mount a filesystem, remember to unmount the filesystem and vary off the volume group before you start HACMP.
  • Do not have any processes running that would point to a shared filesystem when cluster services are stopped with resource groups brought offline on the node that currently owns that filesystem. If cluster services are stopped with resource groups brought offline and the application stop script fails to terminate the processes that are using the filesystem, that filesystem will be unable to unmount and the fallover will not occur. The cluster will go into a config_too_long condition.
  • One of the more common reasons for a filesystem to fail to unmount when cluster services are stopped with resource groups brought offline is that the filesystem is busy. In order to unmount a filesystem successfully, no processes or users can be accessing it at the time. If a user or process is holding it, the filesystem will be “busy” and will not unmount. The same issue may result if a file has been deleted but is still open.

    This is easy to overlook when you write application stop scripts. The script to stop an application should also include a check to make sure that the shared filesystems are not in use. You can do this by using the fuser command. The script should use the fuser command to see what processes or users are accessing the filesystems in question. These processes can then be killed. This will free the filesystem so it can be unmounted.
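
    A minimal sketch of such a check in a stop script follows. The mount point is an assumed example; HACMP itself performs the unmount once the filesystem is free.

    #!/bin/ksh
    # Free a busy shared filesystem before HACMP attempts to unmount it
    FS=/sharedfs                # assumed shared filesystem mount point
    fuser -u -c $FS             # report the processes (and users) holding the filesystem
    fuser -k -c $FS             # send SIGKILL to those processes
    sleep 2                     # allow the processes to exit before HACMP unmounts $FS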

    Refer to the AIX 5L man pages for complete information on this command.

    Expanding Filesystems

    Use C-SPOC to increase the size of a filesystem:

      1. Enter smit cl_admin
      2. Go to System Management (C-SPOC) > HACMP Logical Volume Management > Shared Filesystems > JFS or Enhanced JFS (depending on the case) and press Enter.
      3. Select the option to change a cluster filesystem.
      4. Select the filesystem to change.
      5. Enter the new size for the filesystem.
      6. Return to the Logical Volume Management panel and synchronize the new definition to all cluster nodes via Synchronize a Shared Volume Group Definition.

    General Filesystems Issues

    The following are some more general filesystems concerns:

  • Full filesystems in the root volume group may cause cluster events to fail. You should monitor this volume group and clean it up periodically. You can set up a cron job to monitor filesystem size to help avoid filling a critical filesystem (for example, the hacmp.out file can get quite large).
  • Shared filesystems must have the mount option set to false, so that HACMP can mount and unmount them as needed to handle cluster events (see the example after this list).
  • Be aware of the way NFS filesystems are handled. See Using NFS with HACMP in Chapter 14: Managing the Cluster Resources.
  • For information on using GPFS, see Appendix C in the Installation Guide.
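
    A minimal example of setting the mount option to false, assuming a shared filesystem mounted at /sharedfs:

    chfs -A no /sharedfs        # do not mount automatically at system restart
    lsfs /sharedfs              # verify that the Auto column shows "no"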

    Application Issues

    Appendix B: Applications and HACMP in the Planning Guide gives many pointers on planning and maintaining applications in the HACMP environment. Some key points to remember:

  • Application maintenance will require downtime for resource groups if binaries reside on a shared disk.
  • Upgrades should be tested prior to implementation to anticipate effects on the production cluster.
  • Changes to application start and stop procedures should be thoroughly tested prior to going into production.
  • Do not have shared applications already running when you start the cluster. A second attempt at starting already running applications may cause a problem.
  • Do not manually execute the application stop script for any reason on a running cluster without starting the application back up again. Problems may occur if an attempt is made to stop the application that is already down. This could potentially cause a fallover attempt to be unsuccessful.

    Hardware Maintenance

    Hardware failures must be dealt with promptly, as they may create single points of failure in the cluster. If you have carefully set up error notification and event customization as recommended, you receive quick notification via email of any problems. You should also periodically do error log analysis. See Viewing HACMP Cluster Log Files section in Chapter 2: Using Cluster Log Files in the Troubleshooting Guide for details on error log analysis.

    Some issues to be aware of in a high availability environment include:

  • Shared disks connect to both systems, thus open loops and failed disks can result in fragmented SSA loops and the loss of access to one mirror set.
  • Set up mirroring so that the mirrored disk copy is accessible by a different controller. This prevents loss of data access when a disk controller fails. When a disk controller fails, the mirror disk is accessible through the other controller.

    System ID Licensing Issues

    The Concurrent Resource Manager is licensed to the system ID of a cluster node. Many of the clvm or concurrent access commands validate the ID against the license file. A mismatch will cause the command to fail, with an error message indicating the lack of a license.

    Restoring a system image from a mksysb tape created on a different node or replacing the planar board on a node will cause this problem. In such cases, you must recreate the license file by removing and reinstalling the cluster.clvm component of the current release from the original installation images.

    Replacing Topology Hardware

    Nodes, networks, and network interfaces and devices comprise the topology hardware. Changes to the cluster topology often involve downtime on one or more nodes when cabling must be changed or network interfaces added or removed. In most situations, you can use the DARE utilities to add a topology resource without downtime.

    The following sections indicate the conditions under which you can use DARE and the conditions under which you must plan cluster downtime.

    Note: No automatic corrective actions take place during a DARE.

    Replacing Nodes

    Using the DARE utility, you can add or remove a node while the cluster is running.

    Replacing a Node or Node Component

    If you are replacing a cluster node, keep this list in mind:

  • The new node must typically have the same amount of RAM (or more) as the original cluster node.
  • The new node must typically be same type of system if your applications are optimized for a particular processor.
  • The new node’s slot capacity typically must be the same or better than the old node.
  • NIC physical placement is important – use the same slots as originally assigned.
  • If you have a concurrent environment, you must reinstall the CRM (Concurrent Resource Manager) software. This is also a consideration if you are cloning nodes.
  • Get the new license key from the application vendor for the new CPU ID if necessary.

    If you are replacing a component of the node:

  • Be aware of CPU ID issues
  • For SCSI adapter replacement – reset external bus SCSI ID to original SCSI ID
  • For NIC replacement – use the same slots as originally assigned.

    Procedure for Adding or Removing a Node

    Also see Changing the Configuration of Cluster Nodes in Chapter 13: Managing the Cluster Topology for more complete information.

    The basic procedure for adding or removing a node:

      1. Install AIX 5L, HACMP and LPPs on new node and apply PTFs to match the levels of the previous node.
      2. Connect networks and SSA cabling and test.
      3. Configure TCP/IP.
      4. Import volume group definitions (see the sketch after this list).
      5. Connect serial network and test.
      6. Change the Configuration Database configuration on one of the existing nodes.
      7. Synchronize and verify from the node where you made the changes.
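
    A sketch of step 4, assuming the shared volume group is named sharedvg and appears as hdisk3 on the new node; the major number shown is an example only and should match the one used on the existing nodes:

    lspv                                  # identify the shared disk by its PVID
    importvg -y sharedvg -V 45 hdisk3     # import the definition with a matching major number
    chvg -a n sharedvg                    # keep automatic varyon disabled
    varyoffvg sharedvg                    # leave the volume group offline for HACMP to manage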

    Replacing Networks and Network Interfaces

    You can only protect your applications from downtime due to a network failure if you configure more than one IP network. You should also have a serial network (or use heartbeat over disk or heartbeat over IP aliases). If no backup network is configured, the cluster will be inaccessible to all but directly connected clients.

    Note: It is important to configure the correct interface name for network interfaces. See Chapter 15: Managing Resource Groups in a Cluster.

    You can replace network cabling without taking HACMP off line. You can also replace hubs, routers, and bridges while HACMP is running. Be sure to use the correct IP addresses when reconfiguring a router.

    You can use the DARE swap_adapter function to swap the IP address on the same node and network. Then you can service the failed network interface card without stopping the node.

    Procedure for Replacing a LAN adapter

    If the hardware supports hot-pluggable network interfaces, no cluster downtime is required for this procedure.

    If you cannot use the swap_adapter function, use this procedure:

      1. Move resource groups to another node using the Resource Group Management utility.
      2. Use the hotplug mechanism to replace the card.
      3. Assign IP addresses and netmasks for interfaces if they were undefined (see Configuring Cluster Topology (Extended) in Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended)).
      4. Test IP communications.

    Handling Disk Failures

    Handling shared disk failures differs depending on the type of disk and whether it is a concurrent access configuration or not.

  • SCSI non-RAID—You will have to shut down the nodes sharing the disks.
  • SCSI RAID—There may be downtime, depending on the capabilities of the array.
  • SSA non-RAID—Requires manual intervention – you can replace disks with no system downtime.
  • SSA RAID—No downtime necessary.

    See Restarting the Concurrent Access Daemon (clvmd) in Chapter 12: Managing Shared LVM Components in a Concurrent Access Environment for information on that procedure.

    Replacing a Failed SSA non-RAID Disk

    Use C-SPOC to replace a mirrored disk drive. See Maintaining Physical Volumes in Chapter 11: Managing Shared LVM Components.

    Preventive Maintenance

    If you have a complex and/or very critical cluster, it is highly recommended that you maintain a test cluster as well as your production cluster. Thus before you make any major change to the production cluster, you can test the procedure on the test cluster.

    HACMP also supplies event emulation utilities to aid in testing.

    Cluster Snapshots

    Periodically take snapshots of the cluster in case you need to reapply a configuration. You should take a snapshot any time you change the configuration. Keep a copy of the snapshot on another system, off the cluster, as protection against loss of the cluster configuration. You can use the snapshot to rebuild the cluster quickly in case of an emergency. See Chapter 18: Saving and Restoring Cluster Configurations in this Guide for complete information on cluster snapshots. You might want to consider setting up a cron job to do this on a regular basis.

    Backups

    HACMP does not provide tools for backing up the system. You should plan for periodic backups just as you do for a single system. You should do backups of rootvg and shared volume groups.

    Backups of shared volume groups should be done frequently.

    Some applications have their own online backup methods.

    You can use any of the following:

  • mksysb backups
  • Online backups (sysback, splitlvcopy)

    Using mksysb

    You should do a mksysb on each node prior to and following any changes to the node environment. Such changes include:

  • Applying PTFs
  • Upgrading AIX 5L or HACMP software
  • Adding new applications
  • Adding new device drivers
  • Changing TCP/IP configuration
  • Changing cluster topology or resources
  • Changing LVM components of rootvg (paging space, filesystem sizes)
  • Changing AIX 5L parameters (including the tuning parameters: I/O pacing, syncd)

    Using splitlvcopy

    You can use the splitlvcopy method on raw logical volumes and filesystems to do a backup while the application is still running. This method is only possible for LVM mirrored logical volumes.

    By taking advantage of the LVM’s mirroring capability, you can stop the application briefly to split off a copy of the data using the AIX 5L splitlvcopy command. Stopping the application gives the application its checkpoint. Then restart the application so it continues processing while you do a backup of the copy.

    You can do the backup using tar, cpio, or any other AIX 5L backup command that operates on a logical volume or a filesystem. Using cron, you can automate this type of backup.
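
    The following is a minimal sketch of that sequence for a two-copy mirrored logical volume. The logical volume name, the quiesce and resume commands, and the backup target are assumptions for illustration.

    #!/bin/ksh
    # Online backup of a mirrored logical volume using splitlvcopy (sketch)
    LV=datalv                               # assumed two-copy mirrored logical volume
    COPY=${LV}bkup                          # name for the split-off copy
    /usr/local/bin/app_quiesce              # assumed script: checkpoint the application
    splitlvcopy -y $COPY $LV 1              # leave one copy in $LV, build $COPY from the other
    /usr/local/bin/app_resume               # assumed script: resume the application
    # back up the split copy while the application keeps running
    dd if=/dev/r$COPY of=/backup/$LV.$(date +%Y%m%d).img bs=64k
    # remove the copy and re-establish mirroring
    rmlv -f $COPY
    mklvcopy $LV 2
    syncvg -l $LV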

    Using cron

    Use the AIX 5L cron utility to automate scheduled maintenance and to monitor the system.

    Using cron to Automate Maintenance of Log Files

    Use this utility to automate some of the administrative functions that need to be done on a regular basis. Some of the HACMP log files need cron jobs to ensure that they do not use up too much space.

    Use crontab –e to edit /var/spool/cron/crontabs/root.

    Cron recognizes the change without the need to reboot.

    You might establish a policy for each log, depending on how long you want to keep the log and what size you will allow it to grow. The hacmp.out log is already set to expire after it cycles more than 7 times.
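
    For example, a root crontab entry like the following archives and truncates cluster.log every Sunday at 2 a.m. The log path and archive location are assumptions; adjust them to your configuration.

    0 2 * * 0 /usr/bin/cp /usr/es/adm/cluster.log /archive/cluster.log.weekly && /usr/bin/cp /dev/null /usr/es/adm/cluster.log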

    The RSCT logs are stored in the /var/ha/log directory. These logs are trimmed regularly. If you want to save information for a longer period of time you can either redirect the logging to a different directory, or change the maximum size file parameter (using SMIT). See Viewing HACMP Cluster Log Files section in Chapter 2: Using Cluster Log Files in the Troubleshooting Guide.

    Using cron to Set up An Early Warning System

    Use cron to set up jobs to proactively check the system (a sketch follows this list):

  • Run a custom verification daily and send a report to the system administrator.
  • Check for full filesystems (and take action if necessary).
  • Check that certain processes are running.
  • Run event emulation and send a report to the system administrator.
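
    A minimal sketch of such a daily check follows. The filesystem threshold, subsystem name, and mail recipient are assumptions; the script is not shipped with HACMP.

    #!/bin/ksh
    # Daily cluster health check run from cron (sketch; adjust names and thresholds)
    ADMIN=root
    REPORT=/tmp/health.$$
    # flag any filesystem more than 90% full
    df -k | awk 'NR > 1 && $4 + 0 > 90 { print $7 " is " $4 " full" }' > $REPORT
    # flag the Cluster Manager subsystem if it is not active
    lssrc -s clstrmgrES | grep -q active || echo "clstrmgrES is not active" >> $REPORT
    # mail the report only if something was found
    [ -s $REPORT ] && mail -s "Cluster health warning on $(uname -n)" $ADMIN < $REPORT
    rm -f $REPORT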

    Do Regular Testing

    Regularly schedule a testing window in which a failure is conducted in a controlled environment, so that you can evaluate a fallover before anything happens in your production cluster. Testing should include fallovers of all nodes and full verification of the protected applications. Regular testing is strongly encouraged if you are changing or evolving your cluster environment.

    Upgrading Software (AIX 5L and HACMP)

    When upgrading the AIX 5L or HACMP software:

  • Take a cluster snapshot and save it in a directory outside the cluster.
  • Back up the operating system and data before performing any upgrade. Prepare a backout plan in case you encounter problems with the upgrade.
  • Whenever possible, plan and do an initial run through on a test cluster.
  • Use disk update if possible.
  • Follow this same general rule for fixes to the application; follow specific instructions for the application.

    AIX 5L fixes need to be applied according to the HACMP operations guide:

  • Apply APARs to the standby node.
  • Fall over (stopping cluster services with the Move Resource Groups option) to the standby machine.
  • Apply APARs.

    See the Installation Guide for installation and migration procedures.

