
Chapter 1: Troubleshooting HACMP Clusters


This chapter presents the recommended troubleshooting strategy for an HACMP cluster. It describes the problem determination tools available from the HACMP main SMIT menu. This guide also includes information on tuning the cluster for best performance, which can help you avoid some common problems.

For details on how to use the various log files to troubleshoot the cluster see Chapter 2: Using Cluster Log Files. For hints on how to check system components if using the log files does not help with the problem, and a list of solutions to common problems that may occur in an HACMP environment see Chapter 3: Investigating System Components and Solving Common Problems.

For information specific to RSCT daemons and diagnosing RSCT problems, see the following IBM publications:

  • IBM Reliable Scalable Cluster Technology for AIX 5L and Linux: Group Services Programming Guide and Reference, SA22-7888
  • IBM Reliable Scalable Cluster Technology for AIX 5L and Linux: Administration Guide, SA22-7889
  • IBM Reliable Scalable Cluster Technology for AIX 5L: Technical Reference, SA22-7890
  • IBM Reliable Scalable Cluster Technology for AIX 5L: Messages, GA22-7891

    Note: This chapter presents the default locations of log files. If you redirected any logs, check the appropriate location. For additional information, see Chapter 2: Using Cluster Log Files.

    The main sections of this chapter include:

  • Troubleshooting an HACMP Cluster Overview
  • Using the Problem Determination Tools
  • Configuring Cluster Performance Tuning
  • Resetting HACMP Tunable Values
  • Sample Custom Scripts

    Troubleshooting an HACMP Cluster Overview

    Typically, a functioning HACMP cluster requires minimal intervention. If a problem does occur, diagnostic and recovery skills are essential. Therefore, troubleshooting requires that you identify the problem quickly and apply your understanding of the HACMP software to restore the cluster to full operation. In general, troubleshooting an HACMP cluster involves:

  • Becoming aware that a problem exists
  • Determining the source of the problem
  • Correcting the problem

    Becoming Aware of the Problem

    When a problem occurs within an HACMP cluster, you will most often be made aware of it through:

  • End user complaints, because they are not able to access an application running on a cluster node
  • One or more error messages displayed on the system console or in another monitoring program

    You can also be notified of a cluster problem through mail notification, or through pager notification and text messaging:

  • Mail Notification. Although HACMP standard components do not send mail to the system administrator when a problem occurs, you can create a mail notification method as a pre- or post-event to run before or after an event script executes. In an HACMP cluster environment, mail notification is effective and highly recommended. See the Planning Guide for more information.
  • Remote Notification. You can also define a notification method—numeric or alphanumeric page, or a text messaging notification to any address, including a cell phone—through the SMIT interface to issue a customized response to a cluster event. For more information, see the chapter on customizing cluster events in the Planning Guide.
  • Pager Notification. You can send messages to a pager number on a given event. You can send textual information to pagers that support text display (alphanumeric page), and numerical messages to pagers that only display numbers.
  • Text Messaging. You can send cell phone text messages using a standard data modem and telephone land line through the standard Telocator Alphanumeric Protocol (TAP)—your provider must support this service.
    You can also issue a text message using a Falcom-compatible GSM modem to transmit SMS (Short Message Service) text-message notifications wirelessly. SMS messaging requires an account with an SMS service provider. GSM modems accept the TAP modem protocol as input through an RS232 or USB line and send the message wirelessly to the provider's cell phone tower; the provider then forwards the message to the addressed cell phone. Each provider has a Short Message Service Center (SMSC).

    For each person, define remote notification methods that contain all the events and nodes so you can switch the notification methods as a unit when responders change.
    Note: Manually distribute each message file to each node. HACMP does not automatically distribute the file to other nodes during synchronization unless the File Collections utility is set up specifically to do so. See the Managing HACMP File Collections section in Chapter 7: Verifying and Synchronizing a Cluster Configuration of the Administration Guide.
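The mail-notification approach described above can be sketched as a small pre- or post-event script. The script below is a hypothetical example: the recipient, the argument conventions, and the mail(1) invocation are assumptions to adapt to your own event-script setup, not a fixed HACMP interface.

```shell
#!/bin/sh
# Hypothetical post-event notification sketch. Assumes the caller
# passes the event name and exit status as arguments; adapt to your
# own event-script conventions.
ADMIN="root"                        # assumed recipient address
EVENT_NAME="${1:-unknown_event}"
EVENT_STATUS="${2:-unknown}"

notify_admin() {
    # Compose a short summary and send it with mail(1).
    {
        echo "HACMP event report from $(hostname)"
        echo "Event:  $EVENT_NAME"
        echo "Status: $EVENT_STATUS"
        echo "Time:   $(date)"
    } | mail -s "HACMP event: $EVENT_NAME" "$ADMIN"
}

# Call from an HACMP pre- or post-event hook:
# notify_admin
```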

    Application Services Are Not Available

    End-user complaints often provide the first indication of a system problem. End users may be locked out of an application, or they may not be able to access a cluster node. Therefore, when problems occur you must be able to resolve them and quickly restore your cluster to its full operational status.

    When a problem is reported, gather detailed information about exactly what has happened. Find out which application failed. Was an error message displayed? If possible, verify the problem by having the user repeat the steps that led to the initial problem. Try to duplicate the problem on your own system, or ask the end user to recreate the failure.

    Note: Being locked out of an application does not always indicate a problem with the HACMP software. Rather, the problem can be with the application itself or with its start and stop scripts. Troubleshooting the applications that run on nodes is an integral part of debugging an HACMP cluster.

    Messages Displayed on System Console

    The HACMP system generates descriptive messages when the scripts it executes (in response to cluster events) start, stop, or encounter error conditions. In addition, the daemons that make up an HACMP cluster generate messages when they start, stop, encounter error conditions, or change state. The HACMP system writes these messages to the system console and to one or more cluster log files. Errors may also be logged to associated system files, such as the snmpd.log file.

    For information on all of the cluster log files see Chapter 2: Using Cluster Log Files.

    Determining a Problem Source

    When trying to locate the source of a problem, the surface problem is sometimes misleading. To diagnose a problem, follow these general steps:

    Step
    What you do...
    1
    Save associated log files (/tmp/hacmp.out and /tmp/clstrmgr.debug are the most common). It is important to save the log files associated with the problem before they are overwritten or no longer available. Use the AIX 5L snap -e command to collect HACMP cluster data, including the log files.
    2
    Examine the log files for messages generated by the HACMP system. Compare the cluster verification data files for the last successful run and the failed run. For details on these log files see Chapter 7: Verifying and Synchronizing a Cluster Configuration in the Administration Guide. Look at the RSCT logs as well.
    3
    Investigate the critical components of an HACMP cluster using a combination of HACMP utilities and AIX 5L commands.
    4
    Activate tracing of HACMP subsystems.

    With each step, you will obtain additional information about the HACMP cluster components. However, you may not need to perform each step; examining the cluster log files may provide enough information to diagnose a problem.

    Troubleshooting Guidelines

    As you investigate HACMP system components, use the following guidelines to make the troubleshooting process more productive:

  • From every cluster node, save the log files associated with the problem before they become unavailable. Make sure you collect HACMP cluster data with the AIX 5L snap -e command before you do anything else to help determine the cause of the problem.
  • Attempt to duplicate the problem. Do not rely too heavily on the user’s problem report. The user has only seen the problem from the application level. If necessary, obtain the user’s data files to recreate the problem.
  • Approach the problem methodically. Allow the information gathered from each test to guide your next test. Do not jump back and forth between tests based on hunches.
  • Keep an open mind. Do not assume too much about the source of the problem. Test each possibility and base your conclusions on the evidence of the tests.
  • Isolate the problem. When tracking down a problem within an HACMP cluster, isolate each component of the system that can fail and determine whether it is working correctly. Work from top to bottom, following the progression described in the following section.
  • Go from the simple to the complex. Make the simple tests first. Do not try anything complex and complicated until you have ruled out the simple and obvious.
  • Make one change at a time. Do not make more than one change at a time. If you do, and one of the changes corrects the problem, you have no way of knowing which change actually fixed the problem. Make one change, test the change, and then, if necessary, make the next change and repeat until the problem is corrected.
  • Stick to a few simple troubleshooting tools. For most problems within an HACMP system, the tools discussed here are sufficient.
  • Do not neglect the obvious. Small things can cause big problems. Check plugs, connectors, cables, and so on.
  • Keep a record of the tests you have completed. Record your tests and results, and keep a historical record of the problem in case it reappears.

    Stopping the Cluster Manager

    To fix some cluster problems, you must stop the Cluster Manager on the failed node and have a surviving node take over its shared resources. If the cluster is in reconfiguration, it can only be brought down by stopping it and placing the resource group in an UNMANAGED state. The surviving nodes in the cluster will interpret this kind of stop as a node_down event, but will not attempt to take over resources. The resources will still be available on that node (enhanced concurrent volume groups do not accept placing resource groups in an UNMANAGED state if they are online). You can then begin the troubleshooting procedure.

    If all else fails, stop the HACMP cluster services on all cluster nodes. Then, manually start the application that the HACMP cluster event scripts were attempting to start and run the application without the HACMP software. This may require varying on volume groups, mounting filesystems, and enabling IP addresses. With the HACMP cluster services stopped on all cluster nodes, correct the conditions that caused the initial problem.
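The manual-start procedure above can be sketched as follows. Every name in this sketch (the volume group datavg, the mount point /data, the interface, the service address, and the application start script) is a hypothetical placeholder for your own configuration; run it only after HACMP cluster services are stopped on all nodes.

```shell
#!/bin/sh
# Sketch: bring an application up manually while HACMP services are
# stopped. All names below are hypothetical placeholders.

manual_start() {
    varyonvg datavg                   # activate the shared volume group
    mount /data                       # mount the shared filesystem
    # Configure the service IP address as an alias on an interface:
    ifconfig en0 alias 192.168.10.50 netmask 255.255.255.0
    /usr/local/bin/start_app          # start the application itself
}

# Run manually once cluster services are stopped on all nodes:
# manual_start
```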

    Using the AIX Data Collection Utility

    Use the AIX 5L snap command to collect data from an HACMP cluster.

    The -e flag collects the HACMP data. Using /usr/sbin/snap -e lets you properly document a problem with one simple command. This command gathers all files necessary for determining most HACMP problems. It provides output in a format ready to send to IBM Support personnel.

    When you run the command, the output is placed in a newly created subdirectory, /tmp/ibmsupt. Make sure at least 100 MB is available in /tmp before running the command; the output could require more, depending on the number of nodes and the actual files collected.

    See the AIX 5L man page for complete information.
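As a sketch of the space check recommended above, the following verifies that /tmp has at least 100 MB free before collecting data. It assumes POSIX-format df output (df -P), where available kilobytes appear in the fourth column; adjust for your system if the layout differs.

```shell
#!/bin/sh
# Sketch: verify free space in /tmp before running snap -e.
# Assumes POSIX df output (-P): available KB is the fourth column.

REQUIRED_KB=102400   # 100 MB, per the guideline above

avail_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')

if [ "$avail_kb" -ge "$REQUIRED_KB" ]; then
    echo "/tmp has ${avail_kb} KB free; safe to run snap -e"
    # /usr/sbin/snap -e    # uncomment on the cluster node
else
    echo "only ${avail_kb} KB free in /tmp; free up space first" >&2
fi
```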

    Checking a Cluster Configuration with Online Planning Worksheets

    The Online Planning Worksheets application lets you view a cluster definition for the following:

  • Local HACMP cluster running HACMP 5.2 or greater
  • Cluster worksheets file created from SMIT or from Online Planning Worksheets

    You can use a worksheets file to view information for a cluster configuration and to troubleshoot cluster problems. The Online Planning Worksheets application lets you review definition details on the screen in an easy-to-read format and lets you create a printable formatted report.

    Warning: Although you can import a cluster definition and save it, some of the data is informational only. Making changes to informational components does not change the actual configuration on the system if the worksheets file is exported. For information about informational components in a worksheets file, see the section Entering Data in Chapter 9: Using Online Planning Worksheets in the Planning Guide.
    Note: Cluster definition files and their manipulation in the Online Planning Worksheets application supplement, but do not replace cluster snapshots.

    For more information about using cluster definition files in the Online Planning Worksheets application, see Chapter 9: Using Online Planning Worksheets in the Planning Guide.

    Using HACMP Diagnostic Utilities

    Both HACMP and AIX 5L supply many diagnostic tools. The key HACMP diagnostic tools (in addition to the cluster logs and messages) include:

  • clRGinfo provides information about resource groups, which is useful for troubleshooting. For more information see Chapter 10: Monitoring an HACMP Cluster in the Administration Guide.
  • clstat reports the status of key cluster components—the cluster itself, the nodes in the cluster, the network interfaces connected to the nodes, the service labels, and the resource groups on each node. For more information see Chapter 10: Monitoring an HACMP Cluster in the Administration Guide.
  • clsnapshot allows you to save in a file a record of all the data that defines a particular cluster configuration. For more information see the section Using the Cluster Snapshot Utility to Check Cluster Configuration in this chapter, and the Creating (Adding) a Cluster Snapshot section in Chapter 18: Saving and Restoring Cluster Configurations in the Administration Guide.
  • The cldisp utility displays resource groups and their startup, fallover, and fallback policies. For more information see Chapter 10: Monitoring an HACMP Cluster in the Administration Guide.
  • SMIT Problem Determination Tools. For information see the section Using the Problem Determination Tools in this chapter.

    Using the Cluster Snapshot Utility to Check Cluster Configuration

    The HACMP cluster snapshot facility (/usr/es/sbin/cluster/utilities/clsnapshot) allows you to save in a file a record of all the data that defines a particular cluster configuration. You can use this snapshot for troubleshooting cluster problems.

    The cluster snapshot saves the data stored in the HACMP Configuration Database classes. In addition to this Configuration Database data, a cluster snapshot also includes output generated by various HACMP and standard AIX 5L commands and utilities. This data includes the current state of the cluster, node, network, and network interfaces as viewed by each cluster node, and the state of any running HACMP daemons. It may also include additional user-defined information if there are custom snapshot methods in place.

    In HACMP 5.1 and up, by default, HACMP no longer collects the cluster log files when you create the cluster snapshot. You can still specify in SMIT that the logs be collected if you want them. Skipping the logs collection reduces the size of the snapshot and reduces the running time of the snapshot utility.

    For more information on using the cluster snapshot utility, see Chapter 18: Saving and Restoring Cluster Configurations in the Administration Guide.

    Working with SMIT Problem Determination Tools

    The SMIT Problem Determination Tools menu includes the options offered by the cluster snapshot utility to help you diagnose and solve problems. For more information see the following section, Using the Problem Determination Tools.

    Verifying Expected Behavior

    When the highly available applications are up and running, verify that end users can access the applications. If not, you may need to look elsewhere to identify problems affecting your cluster. The remaining chapters in this guide describe ways in which you should be able to locate potential problems.

    Using the Problem Determination Tools

    The Problem Determination Tools menu options are described in the following sections:

  • HACMP Verification
  • Viewing Current State
  • HACMP Log Viewing and Management
  • Recovering from HACMP Script Failure
  • Restoring HACMP Configuration Database from an Active Configuration
  • Release Locks Set by Dynamic Reconfiguration
  • Clear SSA Disk Fence Registers
  • HACMP Cluster Test Tool
  • HACMP Trace Facility
  • HACMP Event Emulation
  • HACMP Error Notification
  • Opening a SMIT Session on a Node

    HACMP Verification

    Select this option from the Problem Determination Tools menu to verify that the configuration on all nodes is synchronized, set up a custom verification method, or set up automatic cluster verification.

    Verify HACMP Configuration
    Select this option to verify cluster topology resources.
    Configure Custom Verification Method
    Use this option to add, show and remove custom verification methods.
    Automatic Cluster
    Configuration Monitoring
    Select this option to automatically verify the cluster every twenty-four hours and report results throughout the cluster.

    Verify HACMP Configuration

    To verify cluster topology resources and custom-defined verification methods:

      1. Enter smit hacmp
      2. In SMIT, select Problem Determination Tools > HACMP Verification > Verify HACMP Configuration.
      3. Enter field values as follows:
    HACMP Verification Method
     
    By default, Pre-Installed runs all verification methods shipped with HACMP and, if applicable, HACMP/XD. You can select this field to run all Pre-Installed programs, or select none to specify a previously defined custom verification method.
    Custom Defined Verification Method
    Enter the name of a custom defined verification method. Press F4 for a list of previously defined verification methods. By default, when no methods are selected, and none is selected in the Base HACMP Verification Method field, verify and synchronize will not check the base verification methods, and will generate an error message.
    The order in which verification methods are listed determines the sequence in which the methods run. This sequence remains the same for subsequent verifications until different methods are selected. Selecting All verifies all custom-defined methods.
    See Adding a Custom Verification Method in Chapter 7: Verifying and Synchronizing a Cluster Configuration in the Administration Guide for information on adding or viewing a customized verification method.
    Error Count
    By default, Verify HACMP Configuration will continue to run after encountering an error in order to generate a full list of errors. To cancel the program after a specific number of errors, type the number in this field.
    Log File to store output
    Enter the name of an output file in which to store verification output. By default, verification output is also stored in the /usr/es/sbin/cluster/wsm/logs/wsm_smit.log file.
    Verify Changes Only?
    Select no to run all verification checks that apply to the current cluster configuration. Select yes to run only the checks related to parts of the HACMP configuration that have changed. The yes mode has no effect on an inactive cluster.
    Note: The yes option only relates to cluster Configuration Databases. If you have made changes to the AIX 5L configuration on your cluster nodes, you should select no. Only select yes if you have made no changes to the AIX 5L configuration.
    Logging
    Selecting on displays all output to the console that normally goes to /var/hacmp/clverify/clverify.log. The default is off.

    Configure Custom Verification Method

    You may want to add a custom verification method to check for a particular issue on your cluster. For example, you could add a script to check for the version of an application. You could include an error message to display and to write to the clverify.log file.

    For information on adding or viewing a customized verification method, see the Adding a Custom Verification Method section in Chapter 7: Verifying and Synchronizing a Cluster Configuration in the Administration Guide.

    Automatic Monitoring and Verification of Cluster Configuration

    The cluster verification utility runs on one user-selectable HACMP cluster node once every 24 hours. By default, the first node in alphabetical order runs the verification at midnight. During verification, any errors that might cause problems at some point in the future are displayed. You can change these defaults by selecting a node and time that suit your configuration.

    If the selected node is unavailable (for example, powered off), automatic verification does not run. When cluster verification completes on the selected cluster node, this node notifies the other cluster nodes with the following verification information:

  • Name of the node where verification was run
  • Date and time of the last verification
  • Results of the verification

    This information is stored on every available cluster node in the HACMP log file /var/hacmp/log/clutils.log. If the selected node became unavailable or could not complete cluster verification, you can detect this by the lack of a report in the /var/hacmp/log/clutils.log file.
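A quick check of this behavior can be sketched as follows. The log path comes from the text above, but the exact wording of verification entries in clutils.log varies, so this helper just looks for recent lines that mention verification at all.

```shell
#!/bin/sh
# Sketch: confirm that automatic cluster verification left a report.
# The message format in clutils.log is not fixed, so this only greps
# for lines mentioning "verification".

CLUTILS_LOG="${CLUTILS_LOG:-/var/hacmp/log/clutils.log}"

last_verification() {
    # Print the most recent verification-related lines, if any.
    grep -i "verification" "$CLUTILS_LOG" 2>/dev/null | tail -5
}

if [ -z "$(last_verification)" ]; then
    echo "no verification report found in $CLUTILS_LOG" >&2
fi
```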

    If cluster verification completes and detects configuration errors, you are notified of the potential problems in the following ways:

  • The exit status of cluster verification is communicated across the cluster along with the information about cluster verification process completion.
  • Broadcast messages are sent across the cluster and displayed on stdout. These messages inform you about detected configuration errors.
  • A cluster_notify event runs on the cluster and is logged in hacmp.out (if cluster services is running).
  • More detailed information is available on the node that completes cluster verification in /var/hacmp/clverify/clverify.log. If a failure occurs during processing, error messages and warnings clearly indicate the node and reasons for the verification failure.

    Configuring Automatic Verification and Monitoring of Cluster Configuration

    Make sure the /var filesystem on the node has enough space for the /var/hacmp/log/clutils.log file. For additional information, see the section The Size of the /var Filesystem May Need to be Increased in Chapter 10: Monitoring an HACMP Cluster in the Administration Guide.

    To configure the node and specify the time at which cluster verification runs automatically:

      1. Enter smit hacmp
      2. In SMIT, select Problem Determination Tools > HACMP Verification > Automatic Cluster Configuration Monitoring.
      3. Enter field values as follows:
    * Automatic cluster configuration verification
    Enabled is the default.
    Node name
    Select one of the cluster nodes from the list. By default, the first node in alphabetical order will verify the cluster configuration. This node will be determined dynamically every time the automatic verification occurs.
    *HOUR (00 - 23)
    Midnight (00) is the default. Verification runs automatically once every 24 hours at the selected hour.
      4. Press Enter.
      5. The changes take effect when the cluster is synchronized.

    Viewing Current State

    Select this option from the Problem Determination Tools menu to display the state of the nodes, communication interfaces, resource groups, and the local event summary for the last five events.

    HACMP Log Viewing and Management

    Select this option from the Problem Determination Tools menu to view a list of utilities related to the log files. From here you can:

  • View, save or delete Event summaries
  • View detailed HACMP log files
  • Change or show HACMP log file parameters
  • Change or show Cluster Manager log file parameters
  • Change or show a cluster log file directory
  • Collect cluster log files for problem reporting

    See Chapter 2: Using Cluster Log Files for complete information.

    Recovering from HACMP Script Failure

    Select this option from the Problem Determination Tools menu to recover from an HACMP script failure. For example, if a script failed because it was unable to set the hostname, the Cluster Manager reports the event failure. Once you correct the problem by setting the hostname from the command line, you must get the Cluster Manager to resume cluster processing.

    The Recover From HACMP Script Failure menu option invokes the /usr/es/sbin/cluster/utilities/clruncmd command, which sends a signal to the Cluster Manager daemon (clstrmgrES) on the specified node, causing it to stabilize. You must again run the script manually to continue processing.

    Make sure that you fix the problem that caused the script failure. You need to manually complete the remaining steps that followed the failure in the event script (see /tmp/hacmp.out). Then, to resume clustering, complete the following steps to bring the HACMP event script state to EVENT COMPLETED:

      1. Enter smit hacmp
      2. In SMIT, select Problem Determination Tools > Recover From HACMP Script Failure.
      3. Select the IP label/address for the node on which you want to run the clruncmd command and press Enter. The system prompts you to confirm the recovery attempt. The IP label is listed in the /etc/hosts file and is the name assigned to the service IP address of the node on which the failure occurred.
      4. Press Enter to continue. Another SMIT panel appears to confirm the success of the script recovery.
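Before recovering, you can locate the failing event in the log, as suggested by the pointer to /tmp/hacmp.out above. The sketch below assumes the usual "EVENT FAILED" marker appears in hacmp.out for a failed event script; verify the marker in your own log before relying on it.

```shell
#!/bin/sh
# Sketch: find the most recent failed events in hacmp.out. Assumes
# failed event scripts are marked with "EVENT FAILED" in the log.

failed_events() {
    # $1: path to hacmp.out (defaults to the standard location)
    grep -n "EVENT FAILED" "${1:-/tmp/hacmp.out}" 2>/dev/null | tail -3
}

# failed_events /tmp/hacmp.out
```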

    Restoring HACMP Configuration Database from an Active Configuration

    If cluster services are up and you make changes to the configuration, those changes have modified the default configuration directory (DCD). You may realize that the impact of those changes was not well considered and you want to undo them. Because nothing was modified in the active configuration directory (ACD), all that is needed to undo the modifications to the DCD is to restore the DCD from the ACD.

    Select this option from the Problem Determination Tools menu to automatically save any of your changes in the Configuration Database as a snapshot with the path /usr/es/sbin/cluster/snapshots/UserModifiedDB before restoring the Configuration Database with the values actively being used by the Cluster Manager.

      1. Enter smit hacmp
      2. In SMIT, select Problem Determination Tools > Restore HACMP Configuration Database from Active Configuration.
    SMIT displays: Are you Sure?
      3. Press Enter.
    The snapshot is saved. For complete information on snapshots, see Chapter 18: Saving and Restoring Cluster Configurations in the Administration Guide.

    Release Locks Set by Dynamic Reconfiguration

    For information on the release locks set by Dynamic Reconfiguration, see the Dynamic Reconfiguration Issues and Synchronization section in Chapter 13: Managing the Cluster Topology in the Administration Guide.

    Clear SSA Disk Fence Registers

    Select this option from the menu only in an emergency, usually only when recommended by IBM support. If SSA Disk Fencing is enabled, and a situation has occurred in which the physical disks are inaccessible by a node or a group of nodes that need access to a disk, clearing the fence registers will allow access. Once this is done, the SSA Disk Fencing algorithm will be disabled unless HACMP is restarted from all nodes.

    To clear SSA Disk Fence Registers take the following steps:

      1. Enter smit hacmp
      2. In SMIT, stop cluster services (unless you are sure no contention for the disk will occur), by selecting System Management (C-SPOC) > Manage HACMP Services > Stop Cluster Services. For more information see the chapter on Starting and Stopping Cluster Services in the Administration Guide.
      3. Select Problem Determination Tools > Clear SSA Disk Fence Registers.
      4. Select the affected physical volume(s) and press Enter.
      5. Restart cluster services to enable SSA disk fencing again.

    HACMP Cluster Test Tool

    HACMP includes the Cluster Test Tool to help you test the recovery procedures for a new cluster before it becomes part of your production environment. You can also use it to test configuration changes to an existing cluster, when the cluster is not in service. See the chapter on Testing an HACMP Cluster in the Administration Guide.

    HACMP Trace Facility

    Select this option from the Problem Determination Tools menu if the log files have no relevant information and the component-by-component investigation does not yield concrete results. Use the HACMP trace facility to attempt to diagnose the problem. The trace facility provides a detailed look at selected system events. Note that both the HACMP and AIX 5L software must be running in order to use HACMP tracing.

    For more information on using the trace facility, see Appendix C: HACMP Tracing. Interpreting the output generated by the trace facility requires extensive knowledge of both the HACMP software and the AIX 5L operating system.

    HACMP Event Emulation

    Select this option from the Problem Determination Tools menu to emulate cluster events. Running this utility lets you emulate cluster events by running event scripts that produce output but do not affect the cluster configuration status. This allows you to predict a cluster’s reaction to an event as though the event actually occurred.

    The Event Emulator follows the same procedure used by the Cluster Manager given a particular event, but does not execute any commands that would change the status of the Cluster Manager. For descriptions of cluster events and how the Cluster Manager processes these events, see the Planning Guide. For more information on the cluster log redirection functionality see the chapter on Managing Resource Groups in a Cluster in the Administration Guide.

    The event emulator runs the event scripts on every active node of a stable cluster. Output from each node is stored in an output file on the node from which you invoked the emulation. You can specify the name and location of the output file using the environment variable EMUL_OUTPUT, or you can use the default output file, /tmp/emuhacmp.out.
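Redirecting the emulator output as described above is a matter of exporting EMUL_OUTPUT before starting the emulation; the file name below is only an illustration.

```shell
#!/bin/sh
# Sketch: choose a non-default Event Emulator output file (the
# default is /tmp/emuhacmp.out). The name here is illustrative.

EMUL_OUTPUT=/tmp/emulation_$(date +%Y%m%d).out
export EMUL_OUTPUT
echo "emulation output will go to $EMUL_OUTPUT"
# Then start the emulation from SMIT or the command line as usual.
```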

    Event Emulator Considerations

    Keep the following cautions in mind when using the Event Emulator:

  • Run only one instance of the event emulator at a time. If you attempt to start a new emulation in a cluster while an emulation is already running, the integrity of the results cannot be guaranteed. Each emulation is a stand-alone process; one emulation cannot be based on the results of a previous emulation.
  • clinfoES must be running on all nodes.
  • Add a cluster snapshot before running an emulation, just in case uncontrolled cluster events happen during emulation. Instructions for adding cluster snapshots are in the chapter on Saving and Restoring Cluster Configurations in the Administration Guide.
  • The Event Emulator can run only event scripts that comply with the currently active configuration. For example:
  • The Emulator expects to see the same environmental arguments used by the Cluster Manager; if you define arbitrary arguments, the event scripts will run, but error reports will result.
  • In the case of swap_adapter, you must enter the ip_label supplied for service and non-service interfaces in the correct order, as specified in the usage statement. Both interfaces must be located on the same node at emulation time. Both must be configured as part of the same HACMP logical network.
  • For other events, the same types of restrictions apply. If errors occur during emulation, recheck your configuration to ensure that the cluster state supports the event to be emulated.

  • The Event Emulator runs customized scripts (pre- and post-event scripts) associated with an event, but does not run commands within these scripts. Therefore, if these customized scripts change the cluster configuration when actually run, the outcome may differ from the outcome of an emulation.
  • When emulating an event that contains a customized script, the Event Emulator uses the ksh flags -n and -v. The -n flag reads commands and checks them for syntax errors, but does not execute them. The -v flag indicates verbose mode. When writing customized scripts that may be accessed during an emulation, be aware that the other ksh flags may not be compatible with the -n flag and may cause unpredictable results during the emulation. See the ksh man page for flag descriptions.
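    You can apply the same check yourself to confirm that a customized script is emulation-safe before the Event Emulator reads it; `sh -n` is used here as a stand-in for `ksh -n` on systems without ksh:

```shell
# Syntax-check a pre/post-event script without executing any of its
# commands, the way the Event Emulator does with ksh -n.
cat > /tmp/post_event_demo.sh <<'EOF'
#!/bin/sh
echo "post-event actions would run here"
EOF

if sh -n /tmp/post_event_demo.sh; then
  echo "syntax OK: safe to use during emulation"
else
  echo "syntax error: fix before emulating"
fi
```

A script that fails this check will also produce errors when the emulator parses it, so run the check on every customized script you intend to emulate.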
    Running Event Emulations

    You can run the event emulator through SMIT.

    To emulate a cluster event, complete the following steps:

      1. Enter smit hacmp
      2. In SMIT, select Problem Determination Tools > HACMP Event Emulation.

    SMIT displays a panel with options. Each option provides a different cluster event to emulate. The following sections provide more information about each option.

    Emulating a Node Up Event

    To emulate a Node Up event:

      1. Select Node Up Event from the HACMP Event Emulation panel. SMIT displays the panel.
      2. Enter the name of the node to use in the emulation.
      3. Press Enter to start the emulation.

    Emulating a Node Down Event

    To emulate a Node Down event:

      1. Select Node Down Event from the HACMP Event Emulation panel. SMIT displays the panel.
      2. Enter field data as follows:
    Node Name
    Enter the node to use in the emulation.
    Node Down Mode
    Indicate the type of shutdown to emulate:
    • Bring Resource Groups Offline. The node that is shutting down releases its resources. The other nodes do not take over the resources of the stopped node.
    • Move Resource Groups. The node that is shutting down releases its resources. The other nodes do take over the resources of the stopped node.
    • Unmanage Resource Groups. HACMP shuts down immediately. The node that is shutting down retains control of all its resources. Applications that do not require HACMP daemons continue to run. Typically, you use the UNMANAGE option so that stopping the Cluster Manager does not interrupt users and clients. Note that enhanced concurrent volume groups do not accept the UNMANAGE option if they are online.
      3. Press Enter to start the emulation.

    Emulating a Network Up Event

    To emulate a Network Up event:

      1. From the HACMP Event Emulation panel, select Network Up Event. SMIT displays the panel.
      2. Enter field data as follows:
    Network Name
    Enter the network to use in the emulation.
    Node Name
    (Optional) Enter the node to use in the emulation.
      3. Press Enter to start the emulation.

    Emulating a Network Down Event

    To emulate a Network Down event:

      1. From the HACMP Event Emulation panel, select Network Down Event. SMIT displays the panel.
      2. Enter field data as follows:
    Network Name
    Enter the network to use in the emulation.
    Node Name
    (Optional) Enter the node to use in the emulation.
      3. Press Enter to start the emulation.

    Emulating a Fail Standby Event

    To emulate a Fail Standby event:

      1. Select Fail Standby Event from the HACMP Event Emulation panel. SMIT displays the Fail Standby Event panel.
      2. Enter field data as follows:
    Node Name
    Enter the node to use in the emulation.
    IP Label
    Enter the IP label to use in the emulation.
      3. Press Enter to start the emulation.

    The following messages are displayed on all active cluster nodes when emulating the Fail Standby and Join Standby events:

    Adapter $ADDR is no longer available for use as a standby, due to either a standby adapter failure or IP address takeover.

    Standby adapter $ADDR is now available.

    Emulating a Join Standby Event

    To emulate a Join Standby event:

      1. From the HACMP Event Emulation panel, select Join Standby Event. SMIT displays the Join Standby Event panel.
      2. Enter field data as follows:
    Node Name
    Enter the node to use in the emulation.
    IP Label
    Enter the IP label to use in the emulation.
      3. Press Enter to start the emulation.

    Emulating a Swap Adapter Event

    To emulate a Swap Adapter event:

      1. From the HACMP Event Emulation panel, select Swap Adapter Event. SMIT displays the Swap Adapter Event panel.
      2. Enter field data as follows:
    Node Name
    Enter the node to use in the emulation.
    Network Name
    Enter the network to use in the emulation.
    Boot Time IP Label of available Network Interface
    The name of the IP label to swap. The Boot-time IP Label must be available on the node on which the emulation is taking place.
    Service Label to Move
    The name of the Service IP label to swap. It must be located on, and available to, the same node as the Boot-time IP Label.
      3. Press Enter to start the emulation.

    Emulating Dynamic Reconfiguration Events

    To run an emulation of a Dynamic Reconfiguration event, modify the cluster configuration to reflect the configuration to be emulated and use the SMIT panels explained in this section.

    Note: The Event Emulator will not change the configuration of a cluster device. Therefore, if your configuration contains a process that makes changes to the Cluster Manager (disk fencing, for example), the Event Emulator will not show these changes. This could lead to a different output, especially if the hardware devices cause a fallover.

    You should add a cluster snapshot before running an emulation, just in case uncontrolled cluster events happen during emulation. Instructions for adding cluster snapshots are in the chapter on Saving and Restoring Cluster Configurations in the Administration Guide.

    To emulate synchronizing a cluster resource event:

      1. Enter smit hacmp
      2. In SMIT select Extended Configuration > Extended Verification and Synchronization.
      3. Enter field data as follows:
    Emulate or Actual
    If you set this field to Emulate, the synchronization will be an emulation and will not affect the Cluster Manager. If you set this field to Actual, the synchronization will actually occur, and any subsequent changes will be made to the Cluster Manager. Emulate is the default value.
    Note that these options appear only when the cluster is active.
      4. Press Enter to start the emulation.

    After you run the emulation, if you do not wish to run an actual dynamic reconfiguration, you can restore the original configuration using the SMIT panel option Problem Determination Tools > Restore System Default Configuration from Active Configuration.

    HACMP Error Notification

    For complete information on setting up both AIX 5L and HACMP error notification, see the chapter on Tailoring AIX 5L for HACMP, in the Installation Guide.

    Opening a SMIT Session on a Node

    As a convenience while troubleshooting your cluster, you can open a SMIT session on a remote node from within the Problem Determination Tool SMIT panel.

    To open a SMIT session on a remote node:

      1. Select the Problem Determination Tools > Open a SMIT Session on a Node option. SMIT displays a list of available cluster nodes.
      2. Select the node on which you wish to open the SMIT session and press Enter.

    Configuring Cluster Performance Tuning

    Cluster nodes sometimes experience extreme performance problems such as large I/O transfers, excessive error logging, or lack of memory. When this happens, the Cluster Manager can be starved for CPU time and it may not reset the deadman switch within the time allotted. Misbehaved applications running at a priority higher than the Cluster Manager can also cause this problem.

    The deadman switch is the AIX 5L kernel extension that halts a node when it enters a hung state that extends beyond a certain time limit. This enables another node in the cluster to acquire the hung node’s resources in an orderly fashion, avoiding possible contention problems. If the deadman switch is not reset in time, it can cause a system panic and dump under certain cluster conditions.

    Setting the following tuning parameters correctly may avoid some of the performance problems noted above. To prevent the possibility of having to change the HACMP Network Modules Failure Detection Rate, it is highly recommended to first set the following two AIX 5L parameters:

  • AIX 5L high and low watermarks for I/O pacing
  • AIX 5L syncd frequency rate.
    Set the two AIX 5L parameters on each cluster node.

    You may also set the following HACMP network tuning parameters for each type of network:

  • Failure Detection Rate
  • Grace Period.
    You can configure these related parameters directly from HACMP SMIT.

    Network module settings are propagated to all nodes when you set them on one node and then synchronize the cluster topology.

    Setting I/O Pacing

    In some cases, you can use I/O pacing to tune the system so that system resources are distributed more equitably during large disk-writing operations. However, the results of tuning I/O pacing are highly dependent on each system’s specific configuration and I/O access characteristics.

    I/O pacing can help ensure that the HACMP Cluster Manager continues to run even during large disk-writing operations. In some situations, it can help prevent deadman switch (DMS) time-outs. You should be cautious when considering tuning I/O pacing for your cluster configuration, since this is not an absolute solution for DMS time-outs for all types of cluster configurations.

    Remember, I/O pacing and other tuning parameters should only be set to values other than the defaults after a system performance analysis indicates that doing so will lead to both the desired and acceptable side effects.

    If you experience workloads that generate large disk-writing operations or intense amounts of disk traffic, contact IBM for recommendations on choices of tuning parameters that will both allow HACMP to function, and provide acceptable performance. To contact IBM, open a Program Management Report (PMR) requesting performance assistance, or follow other established procedures for contacting IBM.

    Although the most efficient high- and low-water marks vary from system to system, an initial high-water mark of 33 and a low-water mark of 24 provide a good starting point. These settings only slightly reduce write times and consistently generate correct fallover behavior from the HACMP software.

    See the AIX 5L Performance Monitoring & Tuning Guide for more information on I/O pacing.

    To change the I/O pacing settings, do the following on each node:

      1. Enter smit hacmp
      2. In SMIT, select Extended Configuration > Extended Performance Tuning Parameters Configuration > Change/Show I/O Pacing and press Enter.
      3. Configure the field values with the recommended HIGH and LOW watermarks:
    HIGH water mark for pending write I/Os per file
    33 is recommended for most clusters. Possible values are 0 to 32767.
    LOW watermark for pending write I/Os per file
    24 is recommended for most clusters. Possible values are 0 to 32766.
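    Outside SMIT, these values correspond to the sys0 device attributes maxpout and minpout on AIX 5L (settable with `chdev -l sys0 -a maxpout=33 -a minpout=24`). Classic AIX tuning guidance also wants the high-water mark to be one greater than a multiple of 4; this hedged helper validates a candidate pair before you commit it:

```shell
# Validate an I/O pacing watermark pair before applying it with chdev.
# The 4n+1 rule for the high-water mark comes from AIX tuning practice;
# verify it against the documentation for your AIX level.
check_pacing() {
  high=$1
  low=$2
  [ "$high" -gt "$low" ] || { echo "rejected: high must exceed low"; return 1; }
  [ $(( (high - 1) % 4 )) -eq 0 ] || { echo "rejected: high should be 4n+1"; return 1; }
  echo "ok: maxpout=$high minpout=$low"
}

check_pacing 33 24   # the documented starting point
```

A pair that fails the check is likely to be rejected or rounded by the system, so validate before synchronizing the change across nodes.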

    Setting Syncd Frequency

    The syncd setting determines the frequency with which the I/O disk-write buffers are flushed. Frequent flushing of these buffers reduces the chance of deadman switch time-outs.

    The AIX 5L default value for syncd as set in /sbin/rc.boot is 60. It is recommended to change this value to 10. Note that the I/O pacing parameters setting should be changed first.

    To change the syncd frequency setting, do the following on each node:

      1. Enter smit hacmp
      2. In SMIT, select Extended Configuration > Extended Performance Tuning Parameters > Change/Show syncd frequency and press Enter.
      3. Configure the field values with the recommended syncd frequency:
    syncd frequency (in seconds)
    10 is recommended for most clusters. Possible values are 0 to 32767.
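    Because the value lives in /sbin/rc.boot, some administrators script the change. This sketch performs the edit on a scratch copy; the rc.boot line format shown is an assumption, so inspect your file first and back it up before editing:

```shell
# Change the syncd interval from the default 60 to the recommended 10.
# Done on a scratch copy here; on a real node the edited /sbin/rc.boot
# takes effect at the next boot.
cp_rc=/tmp/rc.boot.copy
echo 'nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &' > "$cp_rc"   # assumed line format

sed 's;/usr/sbin/syncd 60;/usr/sbin/syncd 10;' "$cp_rc" > "$cp_rc.new"
grep syncd "$cp_rc.new"
```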

    Changing the Failure Detection Rate of a Network Module after the Initial Configuration

    If you want to change the failure detection rate of a network module, either change the tuning parameters of a network module to predefined values of Fast, Normal and Slow, or set these attributes to custom values.

    Also, use the custom tuning parameters to change the baud rate for TTYs if you are using RS232 networks that might not handle the default baud rate of 38400.

    For more information, see the Changing the Configuration of a Network Module section in the chapter on Managing the Cluster Topology in the Administration Guide.

    Resetting HACMP Tunable Values

    In HACMP 5.2 and up, you can take tunable values that were altered during cluster maintenance and reset them to their default settings, that is, the installation-time cluster settings. The installation-time cluster settings are equal to the values that appear in the cluster after installing HACMP from scratch.

    Note: Resetting the tunable values does not change any other aspects of the configuration, while installing HACMP removes all user-configured configuration information including nodes, networks, and resources.

    Prerequisites and Limitations

    You can change and reset HACMP tunable values to their default values under the following conditions:

  • Before resetting HACMP tunable values, HACMP takes a cluster snapshot. After the values have been reset to defaults, if you want to go back to your customized cluster settings, you can restore them with the cluster snapshot. HACMP saves snapshots of the last ten configurations in the default cluster snapshot directory, /usr/es/sbin/cluster/snapshots, with the name active.x.odm, where x is a digit between 0 and 9, with 0 being the most recent.
  • Stop cluster services on all nodes before resetting tunable values. HACMP prevents you from resetting tunable values in a running cluster.
  • In some cases, HACMP cannot differentiate between user-configured information and discovered information, and does not reset such values. For example, you may enter a service label and HACMP automatically discovers the IP address that corresponds to that label. In this case, HACMP does not reset the service label or the IP address. The cluster verification utility detects if these values do not match.

    The clsnapshot.log file in the snapshot directory contains log messages for this utility. If any of the following scenarios are run, then HACMP cannot revert to the previous configuration:

  • cl_convert is run automatically
  • cl_convert is run manually
  • clconvert_snapshot is run manually. The clconvert_snapshot utility is not run automatically, and must be run from the command line to upgrade cluster snapshots when migrating from HACMP (HAS) to HACMP 5.1 or greater.
    Listing Tunable Values

    You can change and reset the following list of tunable values:

  • User-supplied information.
  • Network module tuning parameters such as failure detection rate, grace period and heartbeat rate. HACMP resets these parameters to their installation-time default values.
  • Cluster event customizations such as all changes to cluster events. Note that resetting changes to cluster events does not remove any files or scripts that the customization used, it only removes the knowledge HACMP has of pre- and post-event scripts.
  • Cluster event rule changes made to the event rules database are reset to the installation-time default values.
  • HACMP command customizations made to the default set of HACMP commands are reset to the installation-time defaults.
  • Automatically generated and discovered information; generally, users cannot see this information. HACMP rediscovers or regenerates this information when the cluster services are restarted or during the next cluster synchronization.

    HACMP resets the following:

  • Local node names stored in the cluster definition database
  • Netmasks for all cluster networks
  • Netmasks, interface names and aliases for disk heartbeating (if configured) for all cluster interfaces
  • SP switch information generated during the latest node_up event (this information is regenerated at the next node_up event)
  • Instance numbers and default log sizes for the RSCT subsystem.
    Resetting HACMP Tunable Values Using SMIT

    To reset cluster tunable values to default values:

      1. Enter smit hacmp
      2. In SMIT, select Extended Configuration > Extended Topology Configuration > Configure an HACMP Cluster > Reset Cluster Tunables and press Enter.

    Use this option to reset all the tunables (customizations) made to the cluster. For a list of the tunable values that will change, see the section Listing Tunable Values. Using this option returns all tunable values to their default values but does not change the cluster configuration. HACMP takes a snapshot file before resetting. You can choose to have HACMP synchronize the cluster when this operation is complete.

      3. Select the options as follows and press Enter:
    Synchronize Cluster Configuration
    If you set this option to yes, HACMP synchronizes the cluster after resetting the cluster tunables.
      4. HACMP asks: “Are you sure?”
      5. Press Enter.

    HACMP resets all the tunable values to their original settings and removes those that should be removed (such as the nodes’ knowledge about customized pre- and post-event scripts).

    Resetting HACMP Tunable Values using the Command Line

    We recommend that you use the SMIT interface to reset the cluster tunable values. The clsnapshot -t command also resets the cluster tunables. This command is intended for use by IBM support. See the man page for more information.

    Sample Custom Scripts

    Two situations where it is useful to run custom scripts are illustrated here:

  • Making cron jobs highly available
  • Making print queues highly available.
    Making cron Jobs Highly Available

    To help maintain the HACMP environment, you need to have certain cron jobs execute only on the cluster node that currently holds the resources. If a cron job executes in conjunction with a resource or application, it is useful to have that cron entry fallover along with the resource. It may also be necessary to remove that cron entry from the cron table if the node no longer possesses the related resource or application.

    The following example shows one way to use a customized script to do this:

    The example cluster is a two node hot standby cluster where node1 is the primary node and node2 is the backup. Node1 normally owns the shared resource group and application. The application requires that a cron job be executed once per day but only on the node that currently owns the resources.

    To ensure that the job will run even if the shared resource group and application fall over to node2, create two files as follows:

      1. Assuming that the root user is executing the cron job, create the file root.resource and another file called root.noresource in a directory on a non-shared filesystem on node1. Make these files resemble the cron tables that reside in the directory /var/spool/crontabs.
    The root.resource table should contain all normally executed system entries, and all entries pertaining to the shared resource or application.
    The root.noresource table should contain all normally executed system entries but should not contain entries pertaining to the shared resource or application.
      2. Copy the files to the other node so that both nodes have a copy of the two files.
      3. On both systems, run the following command at system startup:
    crontab root.noresource

    This will ensure that the cron table for root has only the “no resource” entries at
    system startup.

      4. You can use either of two methods to activate the root.resource cron table. The first method is the simpler of the two.
  • Run crontab root.resource as the last line of the application start script. In the application stop script, the first line should then be crontab root.noresource. By executing these commands in the application start and stop scripts, you are ensured that they will activate and deactivate on the proper node at the proper time.
  • Run the crontab commands as a post_event to node_up_complete and node_down_complete.
  • Upon node_up_complete on the primary node, run crontab root.resource.
  • On node_down_complete, run crontab root.noresource.
  • The takeover node must also use the event handlers to execute the correct cron table. Logic must be written into the node_down_complete event to determine if a takeover has occurred and to run the crontab root.resource command. On reintegration, a pre-event to node_up must determine if the primary node is coming back into the cluster and then run a crontab root.noresource command.
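    The first method can be sketched as a pair of shell functions. The table directory is illustrative, and CRONTAB is left overridable so the sketch dry-runs safely; a real start script would call /usr/bin/crontab directly:

```shell
# Swap root's cron table from the application start/stop scripts.
# CRONTAB defaults to a dry-run echo here.
CRONTAB="${CRONTAB:-echo crontab}"
TABLE_DIR=/usr/local/hacmp        # illustrative, non-shared filesystem

app_start() {
  # ... start the application here ...
  $CRONTAB "$TABLE_DIR/root.resource"     # last step: activate resource jobs
}

app_stop() {
  $CRONTAB "$TABLE_DIR/root.noresource"   # first step: drop resource jobs
  # ... stop the application here ...
}

app_start
app_stop
```

Because the crontab command replaces the whole table atomically, the node always runs exactly one of the two tables, never a mixture.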

    Making Print Queues Highly Available

    In the event of a fallover, the currently queued print jobs can be saved and moved over to the surviving node.

    The print spooling system consists of two directories: /var/spool/qdaemon and /var/spool/lpd/qdir. The /var/spool/qdaemon directory contains files holding the data (content) of each print job; /var/spool/lpd/qdir contains the files describing each print job itself. When jobs are queued, there are files in each of the two directories. In the event of a fallover, these directories do not normally fall over, and therefore the print jobs are lost.

    The solution for this problem is to define two filesystems on a shared volume group. You might call these filesystems /prtjobs and /prtdata. When HACMP starts, these filesystems are mounted over /var/spool/lpd/qdir and /var/spool/qdaemon.

    Write a script to perform this operation as a post event to node_up. The script should do the following:

  • Stop the print queues
  • Stop the print queue daemon
  • Mount /prtjobs over /var/spool/lpd/qdir
  • Mount /prtdata over /var/spool/qdaemon
  • Restart the print queue daemon
  • Restart the print queues.
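    The node_up post-event steps above can be dry-run as follows. Setting RUN=echo prints each command instead of executing it; the qadm and stopsrc usage and the queue name are assumptions to verify against your AIX level:

```shell
# Dry-run of the node_up post-event: overlay the shared spool filesystems.
RUN=echo          # set RUN= (empty) on a real node to execute
QUEUE=lp0         # illustrative queue name

$RUN qadm -D "$QUEUE"                       # bring the print queue down
$RUN stopsrc -s qdaemon                     # stop the print queue daemon
$RUN mount /prtjobs /var/spool/lpd/qdir     # shared job-description files
$RUN mount /prtdata /var/spool/qdaemon      # shared job data files
$RUN startsrc -s qdaemon                    # restart the daemon
$RUN qadm -U "$QUEUE"                       # bring the queue back up
```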
    In the event of a fallover, the surviving node will need to do the following:

  • Stop the print queues
  • Stop the print queue daemon
  • Move the contents of /prtjobs into /var/spool/lpd/qdir
  • Move the contents of /prtdata into /var/spool/qdaemon
  • Restart the print queue daemon
  • Restart the print queues.
    To do this, write a script that is called as a post-event to node_down_complete on the takeover node. The script needs to determine whether the node_down came from the primary node.
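    A minimal sketch of that check follows. The node names are illustrative, and the way the downed node's name reaches the script is an assumption; on a real node you would derive the local name from the HACMP get_local_nodename utility:

```shell
# node_down_complete post-event sketch: act only when the primary node
# went down and this node is the surviving takeover node.
PRIMARY=node1
LOCALNODE="${LOCALNODE:-node2}"   # illustrative; normally from get_local_nodename

on_node_down_complete() {
  downed="$1"
  if [ "$downed" = "$PRIMARY" ] && [ "$LOCALNODE" != "$PRIMARY" ]; then
    echo "takeover: moving queued print jobs from shared filesystems"
  else
    echo "no takeover action"
  fi
}

on_node_down_complete node1   # primary failed: takeover path runs
```

The same guard structure fits the cron-table example earlier in this section, since both need to distinguish a primary-node failure from any other node_down.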

    Where You Go from Here

    Chapter 2: Using Cluster Log Files describes how to use the HACMP cluster log files to troubleshoot the cluster.

    For more information on using HACMP and AIX 5L utilities see Chapter 3: Investigating System Components and Solving Common Problems.


    PreviousNextIndex