
Chapter 8: Testing an HACMP Cluster


This chapter describes how to use the Cluster Test Tool to test the recovery capabilities of an HACMP cluster. The Cluster Test Tool is available for you to test a new cluster before it becomes part of your production environment, and to test configuration changes to an existing cluster, when the cluster is not in service.

The main sections of the chapter include:

  • Prerequisites
  • Overview
  • Running Automated Tests
  • Understanding Automated Testing
  • Setting up Custom Cluster Testing
  • Description of Tests
  • Running Custom Test Procedures
  • Evaluating Results
  • Recovering the Control Node after Cluster Manager Stops
  • Error Logging
  • Fixing Problems when Running Cluster Tests.

    Prerequisites

    The Cluster Test Tool runs only on a cluster that has:

  • HACMP 5.2 or greater installed
  • If the cluster is migrated from an earlier version, cluster migration must be complete.
  • If you used the Cluster Test Tool in previous releases, the custom test plans that you created in previous releases continue to work in HACMP v.5.4.
  • The cluster configuration verified and synchronized.

    Before you run the tool on a cluster node, ensure that:

  • The node has HACMP installed and is part of the HACMP cluster to be tested.
  • The node has network connectivity to all of the other nodes in the HACMP cluster.
  • You have root permissions.
  • Because log file entries include time stamps, consider synchronizing the clocks on the cluster nodes to make it easier to review log file entries produced by test processing; a quick manual check is shown after this list.
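
    For example, a rough manual check is to run the date command on each node and compare the output; to keep the clocks aligned automatically, an NTP client (the xntpd daemon on AIX) pointed at a common time server can be used:

    # Run on each cluster node and compare the results 
    date 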

    Overview

    The Cluster Test Tool utility lets you test an HACMP cluster configuration to evaluate how a cluster operates under a set of specified circumstances, such as when cluster services on a node fail or when a node loses connectivity to a cluster network. You can start a test, let it run unattended, and return later to evaluate the results of your testing. You should run the tool under both low load and high load conditions to observe how system load affects your HACMP cluster.

    You run the Cluster Test Tool from SMIT on one node in an HACMP cluster. For testing purposes, this node is referred to as the control node. From the control node, the tool runs a series of specified tests, some of them on other cluster nodes, gathers information about the success or failure of each test, and stores this information in the Cluster Test Tool log file for evaluation or future reference.

    The Cluster Test Tool lets you test an HACMP cluster in two ways, by running:

  • Automated testing (also known as Automated Test Tool). In this mode, the Cluster Test Tool runs a series of predefined sets of tests on the cluster.
  • Custom testing (also known as Test Plan). In this mode, you can create your own test plan, or a custom testing routine, that will include different tests available in the Cluster Test Tool library.

    Automated Testing

    Use the automated test procedure (a predefined set of tests) supplied with the tool to perform basic cluster testing on any cluster. No setup is required. You simply run the test from SMIT and view test results from SMIT and the Cluster Test Tool log file.

    The automated test procedure runs a predefined set of tests on a node that the tool randomly selects. The tool ensures that the node selected for testing varies from one test to another. For information about automated testing, see the section Running Automated Tests.

    Custom Testing

    If you are an experienced HACMP administrator and want to tailor cluster testing to your environment, you can create custom tests that can be run from SMIT. You create a custom test plan (a file that lists a series of tests to be run), to meet requirements specific to your environment and apply that test plan to any number of clusters. You specify the order in which tests run and the specific components to be tested. After you set up your custom test environment, you run the test procedure from SMIT and view test results in SMIT and in the Cluster Test Tool log file. For information about customized testing, see the section Setting up Custom Cluster Testing.

    Test Duration

    Running automated testing on a basic two-node cluster that has a simple cluster configuration takes approximately 30 to 60 minutes to complete. Individual tests can take around three minutes to run. The following conditions affect the length of time to run the tests:

  • Cluster complexity
    Testing in complex environments takes considerably longer.
  • Latency on the network
    Cluster testing relies on network communication between the nodes. Any degradation in network performance slows the performance of the Cluster Test Tool.
  • Use of verbose logging for the tool
    If you customize verbose logging to run additional commands from which to capture output, testing takes longer to complete. In general, the more commands you add for verbose logging, the longer a test procedure takes to complete.
  • Manual intervention on the control node
    At some points in the test, you may need to intervene. See Recovering the Control Node after Cluster Manager Stops for ways to avoid this situation.
  • Running custom tests
    If you run a custom test plan, the number of tests run also affects the time required to run the test procedure. If you run a long list of tests, or if any of the tests require a substantial amount of time to complete, then the time to process the test plan increases.

    Security

    The Cluster Test Tool uses the HACMP Cluster Communications daemon to communicate between cluster nodes to protect the security of your HACMP cluster. For information about the Cluster Communications Daemon, see Chapter 16: Managing Users and Groups.

    Limitations

    The Cluster Test Tool has the following limitations. It does not support testing of the following HACMP cluster-related components:

  • High Performance Switch (HPS) networks
  • ATM networks
  • Sites.
    You can perform general cluster testing for clusters that support sites, but not testing specific to HACMP sites or any of the HACMP/XD products. HACMP/XD for Metro Mirror, HACMP/XD for GLVM, and HACMP/XD for HAGEO all use sites in their cluster configuration.
  • Replicated resources.
    You can perform general cluster testing for clusters that include replicated resources, but not testing specific to replicated resources or any of the HACMP/XD products. HACMP/XD for Metro Mirror, HACMP/XD for HAGEO, and HACMP/XD for GLVM all include replicated resources in their cluster configuration.
  • Dynamic cluster reconfiguration.
    You cannot run dynamic reconfiguration while the tool is running.
  • Pre-events and post-events.
    Pre-events and post-events run in the usual way, but the tool does not verify that the events were run or that the correct action was taken.

    In addition, the Cluster Test Tool may not recover from the following situations:

  • A node that fails unexpectedly, that is, a failure not initiated by testing
  • The cluster does not stabilize.

    Note: The Cluster Test Tool uses the terminology for stopping cluster services that was used in HACMP prior to v.5.4 (graceful stop, graceful with takeover, and forced stop). For information about how this terminology maps to the currently used terms for stopping the cluster services, see Chapter 9: Starting and Stopping Cluster Services.

    Running Automated Tests

    You can run the automated test procedure on any HACMP cluster that is not currently in service. The Cluster Test Tool runs a specified set of tests and randomly selects the nodes, networks, resource groups, and so forth for testing. The tool tests different cluster components during the course of the testing. For a list of the tests that are run, see the section Understanding Automated Testing.

    Before you start running an automated test:

  • Ensure that the cluster is not in service in a production environment.
  • Stop HACMP cluster services (recommended but optional). Note that if the Cluster Manager is running, some of the tests will be irrational for your configuration, but the Test Tool will continue to run.
  • Ensure that cluster nodes are attached to two IP networks; one way to confirm this is shown after this list.
    One network is used to test a network becoming unavailable and then available. The second network provides network connectivity for the Cluster Test Tool. Both networks are tested, one at a time.
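
    For example, one way to confirm that two IP networks are defined is to list the cluster topology on any node. The cltopinfo utility is assumed here to reside in /usr/es/sbin/cluster/utilities, the usual location for HACMP 5.x installations; verify the path on your systems:

    /usr/es/sbin/cluster/utilities/cltopinfo 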

    Launching the Cluster Test Tool

    To run the automated test procedure:

      1. Enter smit hacmp
      2. In SMIT, select Initialization and Standard Configuration > HACMP Cluster Test Tool and press Enter.
    The Are you sure message appears. If you press Enter again, the automated test plan runs.
      3. Evaluate the test results.
    For information about evaluating test results, see the section Evaluating Results.

    Modifying Logging and Stopping Processing in the Cluster Test Tool

    You can also modify processing for automated test procedure to:

  • Turn off verbose logging
  • Turn off cycling of log files for the tool
  • Stop processing tests after the first test fails

    To modify processing for an automated test:

      1. Enter smit hacmp
      2. In SMIT, select either one of the following options:
  • Extended Configuration
  • Problem Determination Tools
    Then select HACMP Cluster Test Tool.
      3. In the HACMP Cluster Test Tool panel, select Execute Automated Test Procedure.
      4. In the Execute Automated Test Procedure panel, enter field values as follows:
    Verbose Logging
    When set to yes, includes additional information in the log file. This information may help to judge the success or failure of some tests. For more information about verbose logging and how to modify it for your testing, see the section Error Logging.
    Select no to decrease the amount of information logged by the Cluster Test Tool.
    The default is yes.
    Cycle Log File
    When set to yes, uses a new log file to store output from the Cluster Test Tool.
    Select no to append messages to the current log file.
    The default is yes.
    For more information about cycling the log file, see the section Log File Rotation.
    Abort on Error
    When set to no, the Cluster Test Tool continues to run tests after some of the tests being run fail. This may cause subsequent tests to fail because the cluster state is different from the one expected by one of those tests.
    Select yes to stop processing after the first test fails.
    For information about the conditions under which the Cluster Test Tool stops running, see the section Cluster Test Tool Stops Running.
    The default is no.
    Note: The tool stops running and issues an error if a test fails and Abort on Error is selected.
      5. Press Enter to start running the automated tests.
      6. Evaluate the test results.
    For information about evaluating test results, see the section Evaluating Results.

    Understanding Automated Testing

    This section lists the sequence that the Cluster Test Tool uses for the automated testing, and describes the syntax of the tests run during automated testing.

    The automated test procedure performs sets of predefined tests in the following order:

      1. General topology tests
      2. Resource group tests on non-concurrent resource groups
      3. Resource group tests on concurrent resource groups
      4. IP-type network tests for each network
      5. Non-IP network tests for each network
      6. Volume group tests for each resource group
      7. Site-specific tests
      8. Catastrophic failure test.

    The Cluster Test Tool discovers information about the cluster configuration, and randomly selects cluster components, such as nodes and networks, to be used in the testing.

    Which nodes are used in testing varies from one test to another. The Cluster Test Tool may select some node(s) for the initial battery of tests, and then, for subsequent tests, it may intentionally select the same node(s) or choose from nodes on which no tests were run previously. In general, the logic in the automated test sequence ensures that all components are sufficiently tested in all necessary combinations. The testing follows these rules:

  • Tests operation of a concurrent resource group on one randomly selected node—not all nodes in the resource group.
  • Tests only those resource groups that include monitored application servers or volume groups.
  • Requires at least two active IP networks in the cluster to test non-concurrent resource groups.
  • The automated test procedure runs a node_up event at the beginning of the test to make sure that all cluster nodes are up and available for testing.

    These sections list the tests in each group. For more information about a test, including the criteria to determine the success or failure of a test, see the section Description of Tests. The automated test procedure uses variables for parameters, with values drawn from the HACMP cluster configuration.

    The examples in the following sections use variables for node, resource group, application server, stop script, and network names. For information about the parameters specified for a test, see the section Description of Tests.

    General Topology Tests

    The Cluster Test Tool runs the general topology tests in the following order:

      1. Bring a node up and start cluster services on all available nodes
      2. Stop cluster services on a node and bring resource groups offline.
      3. Restart cluster services on the node that was stopped
      4. Stop cluster services and move resource groups to another node
      5. Restart cluster services on the node that was stopped
      6. Stop cluster services on another node and place resource groups in an UNMANAGED state.
      7. Restart cluster services on the node that was stopped.

    The Cluster Test Tool uses the terminology for stopping cluster services that was used in HACMP in releases prior to v.5.4. For information on how the methods for stopping cluster services map to the terminology used in v.5.4, see Chapter 9: Starting and Stopping Cluster Services.

    When the automated test procedure starts, the tool runs each of the following tests in the order shown:

      1. NODE_UP, ALL, Start cluster services on all available nodes
      2. NODE_DOWN_GRACEFUL, node1, Stop cluster services gracefully on a node
      3. NODE_UP, node1, Restart cluster services on the node that was stopped
      4. NODE_DOWN_TAKEOVER, node2, Stop cluster services with takeover on a node
      5. NODE_UP, node2, Restart cluster services on the node that was stopped
      6. NODE_DOWN_FORCED, node3, Stop cluster services forced on a node
      7. NODE_UP, node3, Restart cluster services on the node that was stopped

    Resource Group Tests

    There are two groups of resource group tests that can be run. Which group runs depends on the startup policy of the resource group: one group applies to non-concurrent resource groups and the other to concurrent resource groups.

    If a resource of the specified type does not exist in the resource group, the tool logs an error in the Cluster Test Tool log file.

    Resource Group Starts on a Specified Node

    The following tests run if the cluster includes one or more resource groups that have a startup management policy other than Online on All Available Nodes, that is, the cluster includes one or more non-concurrent resource groups.

    The Cluster Test Tool runs each of the following tests in the order shown for each resource group:

      1. Bring a resource group offline and online on a node.
    RG_OFFLINE, RG_ONLINE
      2. Bring a local network down on a node to produce a resource group fallover.
    NETWORK_DOWN_LOCAL, rg_owner, svc1_net, Selective fallover on local network down
      3. Recover the previously failed network.
    NETWORK_UP_LOCAL, prev_rg_owner, svc1_net, Recover previously failed network
      4. Move a resource group to another node.
    RG_MOVE
      5. Bring an application server down and recover from the application failure.
    SERVER_DOWN, ANY, app1, /app/stop/script, Recover from application failure

    Resource Group Starts on All Available Nodes

    If the cluster includes one or more resource groups that have a startup management policy of Online on All Available Nodes, that is, the cluster has concurrent resource groups, the tool runs one test that brings an application server down and recovers from the application failure.

    The tool runs the following test:

    RG_OFFLINE, RG_ONLINE

    SERVER_DOWN, ANY, app1, /app/stop/script, Recover from application failure

    Network Tests

    The tool runs tests for IP networks and for non-IP networks.

    For each IP network, the tool runs these tests:

  • Bring a network down and up.
  • NETWORK_DOWN_GLOBAL, NETWORK_UP_GLOBAL
  • Fail a network interface, join a network interface. This test is run for the service interface on the network. If no service interface is configured, the test uses a random interface defined on the network.
  • FAIL_LABEL, JOIN_LABEL

    For each Non-IP network, the tool runs these tests:

  • Bring a non-IP network down and up.
  • NETWORK_DOWN_GLOBAL, NETWORK_UP_GLOBAL

    Volume Group Tests

    For each resource group in the cluster, the tool runs tests that fail a volume group in the resource group: VG_DOWN

    Site-Specific Tests

    If sites are present in the cluster, the tool runs tests for them. The automated testing sequence that the Cluster Test Tool uses contains two site-specific tests:

  • auto_site. This sequence of tests runs if you have any cluster configuration with sites. For instance, this sequence is used for clusters with cross-site LVM mirroring configured that does not use XD_data networks. The tests in this sequence include:
  • SITE_DOWN_GRACEFUL Stop the cluster services on all nodes in a site while taking resources offline
  • SITE_UP Restart the cluster services on the nodes in a site
  • SITE_DOWN_TAKEOVER Stop the cluster services on all nodes in a site and move the resources to nodes at another site
  • SITE_UP Restart the cluster services on the nodes at a site
  • RG_MOVE_SITE Move a resource group to a node at another site
  • auto_site_isolation. This sequence of tests runs only if you configured sites and an XD-type network. The tests in this sequence include:
  • SITE_ISOLATION Isolate sites by failing XD_data networks
  • SITE_MERGE Merge sites by bringing up XD_data networks.

    Catastrophic Failure Test

    As a final test, the tool stops the Cluster Manager on a randomly selected node that currently has at least one active resource group:

    CLSTRMGR_KILL, node1, Kill the cluster manager on a node

    If the tool terminates the Cluster Manager on the control node, you may need to reboot this node.

    Setting up Custom Cluster Testing

    If you want to extend cluster testing beyond the scope of the automated testing and you are an experienced HACMP administrator who has experience planning, implementing, and troubleshooting clusters, you can create a custom test procedure to test the HACMP clusters in your environment. You can specify the tests specific to your clusters, and use variables to specify parameters specific to each cluster. Using variables lets you extend a single custom test procedure to run on a number of different clusters. You then run the custom test procedure from SMIT.

    Warning: If you uninstall HACMP, the program removes any files you may have customized for the Cluster Test Tool. If you want to retain these files, make a copy of these files before you uninstall HACMP.

    Planning a Test Procedure

    Before you create a test procedure, make sure that you are familiar with the HACMP clusters on which you plan to run the test. List the components in your cluster and have this list available when setting up a test. Include the following items in the list:

  • Nodes
  • IP networks
  • Non-IP networks
  • XD-type networks
  • Volume groups
  • Resource groups
  • Application servers
  • Sites.

    Your test procedure should bring each component offline then online, or cause a resource group fallover, to ensure that the cluster recovers from each failure.

    We recommend that your test start by running a node_up event on each cluster node to ensure that all cluster nodes are up and available for testing.
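
    For example, a custom test plan might begin with one NODE_UP line per cluster node (hypothetical node variables shown; see Creating a Test Plan and Description of Tests for the syntax):

    NODE_UP, node1, Start cluster services on node1 
    NODE_UP, node2, Start cluster services on node2 
    NODE_UP, node3, Start cluster services on node3 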

    Creating a Custom Test Procedure

    To create a custom test procedure:

      1. Create a Test Plan, a file that lists the tests to be run.
    For information about creating a Test Plan, see the section Creating a Test Plan.
      2. Set values for test parameters.
    For information about specifying parameters, see the section Specifying Parameters for Tests.

    Creating a Test Plan

    A Test Plan is a text file that lists cluster tests to be run in the order in which they are listed in the file. In a Test Plan, specify one test per line. You can set values for test parameters in the Test Plan or use variables to set parameter values.

    The tool supports the following tests:

    FAIL_LABEL
    Brings the interface associated with the specified label down on the specified node.
    JOIN_LABEL
    Brings the interface associated with the specified label up on the specified node.
    NETWORK_UP_GLOBAL
    Brings a specified network up (IP network or non-IP network) on all nodes that have interfaces on the network.
    NETWORK_DOWN_GLOBAL
    Brings a specified network down (IP network or non-IP network) on all nodes that have interfaces on the network.
    NETWORK_UP_LOCAL
    Brings a network interface on a node up.
    NETWORK_DOWN_LOCAL
    Brings a network interface on a node down.
    NETWORK_UP_NONIP
    Brings a non-IP network on a node up.
    NETWORK_DOWN_NONIP
    Brings a non-IP network on a node down.
    NODE_UP
    Starts cluster services on the specified node.
    NODE_DOWN_GRACEFUL
    Stops cluster services and brings the resource groups offline on the specified node.
    NODE_DOWN_TAKEOVER
    Stops cluster services with the resources acquired by another node.
    NODE_DOWN_FORCED
    Stops cluster services on the specified node with the Unmanage Resource Group option.
    CLSTRMGR_KILL
    Terminates the Cluster Manager on the specified node.
    RG_MOVE
    Moves a resource group that is already online to a specific node.
    RG_MOVE_SITE
    Moves a resource group that is already online to an available node at a specific site.
    RG_OFFLINE
    Brings a resource group offline that is already online.
    RG_ONLINE
    Brings a resource group online that is already offline.
    SERVER_DOWN
    Brings a monitored application server down.
    SITE_ISOLATION
    Brings down all XD_data networks in the cluster at which the tool is running, thereby causing a site isolation.
    SITE_MERGE
    Brings up all XD_data networks in the cluster at which the tool is running, thereby simulating a site merge.
    Run the SITE_MERGE test after running the SITE_ISOLATION test.
    SITE_UP
    Starts cluster services on all nodes at the specified site that are currently stopped.
    SITE_DOWN_TAKEOVER
    Stops cluster services on all nodes at the specified site and moves the resources to node(s) at another site by launching automatic rg_move events.
    SITE_DOWN_GRACEFUL
    Stops cluster services on all nodes at the specified site and takes the resources offline.
    VG_DOWN
    Emulates an error condition for a specified disk that contains a volume group in a resource group.
    WAIT
    Generates a wait period for the Cluster Test Tool.

    For a full description of these tests, see the section Description of Tests.

    Specifying Parameters for Tests

    You can specify parameters for the tests in the Test Plan by doing one of the following:

  • Using a variables file. A variables file defines values for variables assigned to parameters in a test plan. See the section Using a Variables File.
  • Setting values for test parameters as environment variables. See the section Using Environment Variables.
  • Identifying values for parameters in the Test Plan. See the section Using the Test Plan.

    When the Cluster Test Tool starts, it uses a variables file if you specified the location of one in SMIT. If it does not locate a variables file, it uses values set in an environment variable. If a value is not specified in an environment variable, it uses the value in the Test Plan. If the value set in the Test Plan is not valid, the tool displays an error message.

    Using a Variables File

    The variables file is a text file that defines the values for test parameters. By setting parameter values in a separate variables file, you can use your Test Plan to test more than one cluster.

    The entries in the file have this syntax:

    parameter_name=value

    For example, to specify a node as node_waltham:

    node=node_waltham 
    

    To provide more flexibility, you can:

      1. Set the name for a parameter in the Test Plan.
      2. Assign the name to another value in the variables file.

    For example, you could specify the value for node as node1 in the Test Plan:

    NODE_UP,node1, Bring up node1

    In the variables file, you can then set the value of node1 to node_waltham:

    node1=node_waltham 
    

    The following example shows a sample variables file:

    node1=node_waltham 
    node2=node_belmont 
    node3=node_watertown 
    node4=node_lexington 
    

    Using Environment Variables

    If you do not want to use a variables file, you can assign parameter values by setting environment variables for them. If a variables file is not specified but the cluster environment contains parameter_name=value settings that match the variables in the test plan, the Cluster Test Tool uses those values from the cluster environment.
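
    For example, in a Korn shell session on the control node, you might export values for the same hypothetical variables used above before starting the tool from SMIT:

    export node1=node_waltham 
    export node2=node_belmont 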

    Using the Test Plan

    If you want to run a test plan on only one cluster, you can define test parameters directly in the Test Plan. The associated test can then be run only on the cluster that includes the specified cluster attributes. For information about the syntax of test parameters, see the section Description of Tests.
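
    For example, a test plan intended for a single cluster can name the node directly rather than through a variable (node_waltham is the hypothetical node name used earlier):

    NODE_UP, node_waltham, Bring up node_waltham 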

    Description of Tests

    The Test Plan supports the tests listed in this section. The description of each test includes information about the test parameters and the success indicators for a test.

    Note: One of the success indicators for each test is that the cluster becomes stable. The definition of cluster stability takes a number of factors into account, beyond the state of the Cluster Manager. The clstat utility, by comparison, uses only the state of the Cluster Manager to assess stability. For information about the factors used to determine cluster stability for the Cluster Test Tool, see the section Evaluating Results.

    Test Syntax

    The syntax for a test is:

    TEST_NAME, parameter1, parametern|PARAMETER, comments

    where:

  • The test name is in uppercase letters.
  • Parameters follow the test name.
  • Italic text indicates parameters expressed as variables.
  • Commas separate the test name from the parameters and the parameters from each other. (Note that the HACMP 5.4 Cluster Test Tool supports spaces around commas).
  • The example syntax line shows parameters as parameter1 and parametern with n representing the next parameter. Tests typically have from two to four parameters.
  • A pipe ( | ) indicates parameters that are mutually exclusive alternatives. Select one of these parameter options.
  • (Optional) Comments (user-defined text) appear at the end of the line. The Cluster Test Tool displays the text string when the Cluster Test Tool runs.

    In the test plan, the tool ignores the following (an example follows the list):

  • Lines that start with a pound sign (#)
  • Blank lines.
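
    For example, in the following fragment (hypothetical node name), the tool skips the comment lines and the blank line and processes only the two test lines:

    # Verify that node1 can rejoin the cluster 
    NODE_UP, node1, Bring up node1 

    # Then stop it again with takeover 
    NODE_DOWN_TAKEOVER, node1, Bring down node1 with takeover 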

    Node Tests

    The node tests start and stop cluster services on specified nodes.

    NODE_UP, node | ALL, comments

    Starts cluster services on a specified node that is offline or on all nodes that are offline.

    node
    The name of a node on which cluster services start
    ALL
    Any nodes that are offline have cluster services start
    comments
    User-defined text to describe the configured test.

    Example

    NODE_UP, node1, Bring up node1 
    

    Entrance Criteria

    Any node to be started is inactive.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • The cluster services successfully start on all specified nodes
  • No resource group enters the error state
  • No resource group moves from online to offline.

    NODE_DOWN_GRACEFUL, node | ALL, comments

    Stops cluster services on a specified node and brings resource groups offline.

    node
    The name of a node on which cluster services stop
    ALL
    All nodes are to have cluster services stop
    If you specify ALL, at least one node in the cluster must be online for this test to run.
    comments
    User-defined text to describe the configured test.

    Example

    NODE_DOWN_GRACEFUL, node3, Bring down node3 gracefully 
    

    Entrance Criteria

    Any node to be stopped is active.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Cluster services stop on the specified node(s)
  • Cluster services continue to run on other nodes if ALL is not specified
  • Resource groups on the specified node go offline, and do not move to other nodes
  • Resource groups on other nodes remain in the same state.

    NODE_DOWN_TAKEOVER, node, comments

    Stops cluster services on a specified node with a resource group acquired by another node as configured, depending on resource availability.

    node
    The name of a node on which to stop cluster services
    comments
    User-defined text to describe the configured test.

    Example

    NODE_DOWN_TAKEOVER, node4, Bring down node4 gracefully with takeover 
    

    Entrance Criteria

    The specified node is active.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Cluster services stop on the specified node
  • Cluster services continue to run on other nodes
  • All resource groups remain in the same state.

    NODE_DOWN_FORCED, node, comments

    Stops cluster services on a specified node and places resource groups in an UNMANAGED state. Resources on the node remain online, that is they are not released.

    node
    The name of a node on which to stop cluster services
    comments
    User-defined text to describe the configured test.

    Example

    NODE_DOWN_FORCED, node2, Bring down node2 forced 
    

    Entrance Criteria

    Cluster services on another node have not already been stopped with its resource groups placed in an UNMANAGED state. The specified node is active.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • The resource groups on the node change to UNMANAGED state
  • Cluster services stop on the specified node
  • Cluster services continue to run on other nodes
  • All resource groups remain in the same state.

    Network Tests for an IP Network

    This section lists tests that bring network interfaces up or down on an IP network. The Cluster Test Tool requires two IP networks to run any of the tests described in this section. The second network provides network connectivity for the tool to run. The Cluster Test Tool verifies that two IP networks are configured before running the test.

    NETWORK_UP_LOCAL, node, network, comments

    Brings a specified network up on a specified node by running the ifconfig up command on the node.

    node
    The name of the node on which to run the ifconfig up command
    network
    The name of the network to which the interface is connected
    comments
    User-defined text to describe the configured test.

    Example

    NETWORK_UP_LOCAL, node6, hanet1, Bring up hanet1 on node 6 
    

    Entrance Criteria

    The specified node is active and has at least one inactive interface on the specified network.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Cluster services continue to run on the cluster nodes where they were active before the test
  • Resource groups that are in the ERROR state on the specified node and that have a service IP label available on the network can go online, but should not enter the ERROR state
  • Resource groups on other nodes remain in the same state.

    NETWORK_DOWN_LOCAL, node, network, comments

    Brings a specified network down on a specified node by running the ifconfig down command.

    Note: If one IP network is already unavailable on a node, the cluster may become partitioned. The Cluster Test Tool does not take this into account when determining the success or failure of a test.
    node
    The name of the node on which to run the ifconfig down command
    network
    The name of the network to which the interface is connected
    comments
    User-defined text to describe the configured test.


    Example

    NETWORK_DOWN_LOCAL, node8, hanet2, Bring down hanet2 on node 8 
    

    Entrance Criteria

    The specified node is active and has at least one active interface on the specified network.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Cluster services continue to run on the cluster nodes where they were active before the test
  • Resource groups on other nodes remain in the same state; however, some may be hosted on a different node
  • If the node hosts a resource group for which the recovery method is set to notify, the resource group does not move.

    NETWORK_UP_GLOBAL, network, comments

    Brings a specified network up on all nodes that have interfaces on the network. The network specified may be an IP network or a serial network.

    network
    The name of the network to which the interface is connected
    comments
    User-defined text to describe the configured test.

    Example

    NETWORK_UP_GLOBAL, hanet1, Bring up hanet1 on node 6 
    

    Entrance Criteria

    The specified network is inactive on at least one node.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Cluster services continue to run on the cluster nodes where they were active before the test
  • Resource groups that are in the ERROR state on the specified node and that have a service IP label available on the network can go online, but should not enter the ERROR state
  • Resource groups on other nodes remain in the same state.

    NETWORK_DOWN_GLOBAL, network, comments

    Brings the specified network down on all nodes that have interfaces on the network. The network specified may be an IP network or a serial network.

    Note: If one IP network is already unavailable on a node, the cluster may become partitioned. The Cluster Test Tool does not take this into account when determining the success or failure of a test.
    network
    The name of the network to which the interface is connected
    comments
    User-defined text to describe the configured test.


    Example

    NETWORK_DOWN_GLOBAL, hanet1, Bring down hanet1 on node 6 
    

    Entrance Criteria

    The specified network is active on at least one node.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Cluster services continue to run on the cluster nodes where they were active before the test
  • Resource groups on other nodes remain in the same state.

    Network Interface Tests for IP Networks

    JOIN_LABEL, iplabel, comments

    Brings up a network interface associated with the specified IP label on a specified node by running the ifconfig up command.

    Note: You specify the IP label as the parameter. The interface that is currently hosting the IP label is used as the argument to the ifconfig command. The IP label can be a service, boot, or backup (standby) label. If it is a service label, then that service label must be hosted on some interface, for example, when the resource group is actually online. You cannot specify a service label that is not already hosted on an interface.
    The only time you could have a resource group online and the service label hosted on an inactive interface would be when the service interface fails but there was no place to move the resource group, in which case it stays online.
    iplabel
    The IP label of the interface.
    comments
    User-defined text to describe the configured test.


    Example

    JOIN_LABEL, app_serv_address, Bring up app_serv_address on node 2 
    

    Entrance Criteria

    The specified interface is currently inactive on the specified node.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Specified interface comes up on specified node
  • Cluster services continue to run on the cluster nodes where they were active before the test
  • Resource groups that are in the ERROR state on the specified node and that have a service IP label available on the network can go online, but should not enter the ERROR state
  • Resource groups on other nodes remain in the same state.

    FAIL_LABEL, iplabel, comments

    Brings down a network interface associated with a specified label on a specified node by running the ifconfig down command.

    Note: You specify the IP label as the parameter. The interface that is currently hosting the IP label is used as the argument to the ifconfig command. The IP label can be a service, boot, or standby (backup) label.
    iplabel
    The IP label of the interface.
    comments
    User-defined text to describe the configured test.


    Example

    FAIL_LABEL, app_serv_label, Bring down app_serv_label, on node 2 
    

    Entrance Criteria

    The specified interface is currently active on the specified node.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Any service labels that were hosted by the interface are recovered
  • Resource groups that are in the ERROR state on the specified node and that have a service IP label available on the network can go online, but should not enter the ERROR state
  • Resource groups remain in the same state; however, the resource group may be hosted by another node.

    Network Tests for a Non-IP Network

    The testing for non-IP networks is part of the NETWORK_UP_GLOBAL, NETWORK_DOWN_GLOBAL, NETWORK_UP_LOCAL and NETWORK_DOWN_LOCAL test procedures.

    Resource Group Tests

    RG_ONLINE, rg, node | ALL | ANY | RESTORE, comments

    Brings a resource group online in a running cluster.

    rg
    The name of the resource group to bring online.
    node
    The name of the node where the resource group will come online.
    ALL
    Use ALL for concurrent resource groups only. When ALL is specified, the resource group will be brought online on all nodes in the resource group. If you use ALL for non-concurrent groups, the Test Tool interprets it as ANY.
    ANY
    Use ANY for non-concurrent resource groups to pick a node where the resource group is offline. For concurrent resource groups, use ANY to pick a random node where the resource group will be brought online.
    RESTORE
    Use RESTORE for non-concurrent resource groups to bring the resource groups online on the highest priority available node. For concurrent resource groups, the resource group will be brought online on all nodes in the nodelist.
    comments
    User-defined text to describe the configured test.

    Example

    RG_ONLINE, rg_1, node2, Bring rg_1 online on node 2. 
    

    Entrance Criteria

    The specified resource group is offline, resources are available, and all dependencies can be met.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • The resource group is brought online successfully on the specified node
  • No resource groups go offline or into ERROR state.

    RG_OFFLINE, rg, node | ALL | ANY, comments

    Brings a resource group offline that is already online in a running cluster.

    rg
    The name of the resource group to bring offline
    node
    The name of the node on which the resource group will be taken offline
    ALL
    Use ALL for concurrent resource groups to bring the resource group offline on all nodes where the resource group is hosted.
    You can also use ALL for non-concurrent resource groups to bring the group offline on the node where it is online.
    ANY
    Use ANY for non-concurrent resource groups to bring the resource group offline on the node where it is online. You can use ANY for concurrent resource groups to select a random node where the resource group is online.
    comments
    User-defined text to describe the configured test

    Example

    RG_OFFLINE, rg_1, node2, Bring rg_1 offline from node2  
    

    Entrance Criteria

    The specified resource group is online on the specified node

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Resource group, which was online on the specified node, is brought offline successfully
  • Other resource groups remain in the same state.

    RG_MOVE, rg, node | ANY | RESTORE, comments

    Moves a resource group that is already online in a running cluster to a specific or any available node.

    rg
    The name of the resource group to move
    node
    The target node; the name of the node to which the resource group will move
    ANY
    Use ANY to let the Cluster Test Tool pick a random available node to which to move the resource group.
    RESTORE
    Enable the resource group to move to the highest priority node available
    comments
    User-defined text to describe the configured test

    Example

    RG_MOVE, rg_1, ANY, Move rg_1 to any available node. 
    

    Entrance Criteria

    The specified resource group must be non-concurrent and must be online on a node other than the target node.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Resource group is moved to the target node successfully
  • Other resource groups remain in the same state.

    RG_MOVE_SITE, rg, site | OTHER, comments

    Moves a resource group that is already online in a running cluster to an available node at a specific site.

    rg
    The name of the resource group to move
    site
    The site where the resource group will move
    OTHER
    Use OTHER to have the Cluster Test Tool pick the “other” site as the resource group destination. For example, if the resource group is online on siteA, it will be moved to siteB, and conversely if the resource group is online on siteB, it will be moved to siteA.
    comments
    User-defined text to describe the configured test

    Example

    RG_MOVE_SITE, rg_1, site_2, Move rg_1 to site_2. 
    

    Entrance Criteria

    The specified resource group is online on a node other than a node at the target site.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Resource group is moved to the target site successfully
  • Other resource groups remain in the same state.

    Volume Group Tests

    VG_DOWN, vg, node | ALL | ANY, comments

    Forces an error for a disk that contains a volume group in a resource group.

    vg
    The volume group residing on the disk that will be failed
    node
    The name of the node where the resource group that contains the specified volume group is currently online
    ALL
    Use ALL for concurrent resource groups. When ALL is specified, the Cluster Test Tool will fail the volume group on all nodes in the resource group where the resource group is online.
    If ALL is used for non-concurrent resource groups, the Tool performs this test for any resource group.
    ANY
    Use ANY to have the Cluster Test Tool select the node as follows:
    • For a non-concurrent resource group, the Cluster Test Tool will select the node where the resource group is currently online.
    • For a concurrent resource group, the Cluster Test Tool will select a random node from the concurrent resource group node list, where the resource group is online
    comments
    User-defined text to describe the configured test.

    Example

    VG_DOWN, sharedvg, ANY, Fail the disk where sharedvg resides 
    

    Entrance Criteria

    The resource group containing the specified volume group is online on the specified node.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Resource group containing the specified volume group successfully moves to another node, or, if it is a concurrent resource group, it goes into an ERROR state
  • Resource groups may change state to meet dependencies.

    Site Tests

    SITE_ISOLATION, comments

    Fails all the XD_data networks, causing the site_isolation event.

    comments
    User-defined text to describe the configured test.

    Example

    SITE_ISOLATION, Fail all the XD_data networks 
    

    Entrance Criteria

    At least one XD_data network is configured and is up on any node in the cluster.

    Success Indicators

    The following conditions indicate success for this test:

  • The XD_data network fails, no resource groups change state
  • The cluster becomes stable.

    SITE_MERGE, comments

    Runs when at least one XD_data network is up to restore connections between the sites, and remove site isolation. Run this test after running the SITE_ISOLATION test.

    comments
    User-defined text to describe the configured test

    Example

    SITE_MERGE, Heal the XD_data networks 
    

    Entrance Criteria

    At least one node must be online.

    Success Indicators

    The following conditions indicate success for this test:

  • No resource groups change state
  • The cluster becomes stable.

    SITE_DOWN_TAKEOVER, site, comments

    Stops cluster services on all nodes at the specified site and moves the resource groups to other nodes.

    site
    The site that contains the nodes on which cluster services will be stopped
    comments
    User-defined text to describe the configured test

    Example

    SITE_DOWN_TAKEOVER, site_1, Stop cluster services on all nodes at 
    site_1, bringing the resource groups offline and moving the resource 
    groups. 
    

    Entrance Criteria

    At least one node at the site must be online.

    Success Indicators

    The following conditions indicate success for this test:

  • Cluster services are stopped on all nodes at the specified site
  • All primary instance resource groups move to another site
  • All secondary instance resource groups go offline
  • The cluster becomes stable.

    SITE_UP, site, comments

    Starts cluster services on all nodes at the specified site.

    site
    The site that contains the nodes on which cluster services will be started
    comments
    User-defined text to describe the configured test

    Example

    SITE_UP, site_1, Start cluster services on all nodes at site_1. 
    

    Entrance Criteria

    At least one node at the site must be offline.

    Success Indicators

    The following conditions indicate success for this test:

  • Cluster services are started on all nodes at the specified site
  • Resource groups remain in the same state
  • The cluster becomes stable.

    General Tests

    The other tests available to use in HACMP cluster testing:

  • Bring an application server down
  • Terminate the Cluster Manager on a node
  • Add a wait time for test processing.

    SERVER_DOWN, node | ANY, appserv, command, comments

    Runs the specified command to stop an application server. This test is useful when testing application availability.

    In the automated test, the test uses the stop script to turn off the application.

    node
    The name of a node on which the specified application server is to become unavailable
    ANY
    Any available node that participates in this resource group can have the application server become unavailable
    The Cluster Test Tool tries to simulate server failure on any available cluster node. This test is equivalent to failure on the node that currently owns the resource group, if the server is in a resource group that has policies other than the following ones:
    • Startup: Online on all available nodes
    • Fallover: Bring offline (on error node only)
    appserv
    The name of the application server associated with the specified node
    command
    The command to be run to stop the application server
    comments
    User-defined text to describe the configured test.

    Example

    SERVER_DOWN, node1, db_app, /apps/stop_db.pl, Kill the db app 
    

    Entrance Criteria

    The resource group is online on the specified node.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Cluster nodes remain in the same state
  • The resource group that contains the application server is online; however, the resource group may be hosted by another node, unless it is a concurrent resource group, in which case the group goes into ERROR state.

    CLSTRMGR_KILL, node, comments

    Runs the kill command to terminate the Cluster Manager on a specified node.

    Note: If CLSTRMGR_KILL is run on the local node, you may need to reboot the node. On startup, the Cluster Test Tool automatically starts again. For information about how to avoid manually rebooting the node, see the section Recovering the Control Node after Cluster Manager Stops.
    For the Cluster Test Tool to accurately assess the success or failure of a CLSTRMGR_KILL test, do not perform other activities in the cluster while the Cluster Test Tool is running.
    node
    The name of the node on which to terminate the Cluster Manager
    comments
    User-defined text to describe the configured test.


    Example

    CLSTRMGR_KILL, node5, Bring down node5 hard 
    

    Entrance Criteria

    The specified node is active.

    Success Indicators

    The following conditions indicate success for this test:

  • The cluster becomes stable
  • Cluster services stop on the specified node
  • Cluster services continue to run on other nodes
  • Resource groups that were online on the node where the Cluster Manager fails move to other nodes
  • All resource groups on other nodes remain in the same state.

    For information about potential conditions caused by a CLSTRMGR_KILL test running on the control node, see the section Recovering the Control Node after Cluster Manager Stops.

    WAIT, seconds, comments

    Generates a wait period for the Cluster Test Tool for a specified number of seconds.

    seconds
    The number of seconds that the Cluster Test Tool waits before proceeding with processing
    comments
    User-defined text to describe the configured test

    Example

    WAIT, 300, We need to wait for five minutes before the next test 
    

    Entrance Criteria

    Not applicable.

    Success Indicators

    Not applicable.

    Example Test Plan

    The following excerpt from a sample Test Plan includes the tests:

  • NODE_UP
  • NODE_DOWN_GRACEFUL

    It also includes a WAIT interval. The comment text at the end of the line describes the action to be taken by the test.

    NODE_UP,ALL,starts cluster services on all nodes 
    NODE_DOWN_GRACEFUL,waltham,stops cluster services gracefully on node waltham 
    WAIT,20 
    NODE_UP,waltham,starts cluster services on node waltham 
    

    Running Custom Test Procedures

    Before you start running custom tests, ensure that:

  • Your Test Plan is configured correctly.
    For information about setting up a Test Plan, see the section Creating a Test Plan.
  • You have specified values for test parameters.
    For information about parameter values, see the section Specifying Parameters for Tests.
  • You have logging for the tool configured to capture the information that you want to examine for your cluster.
    For information about customizing verbose logging for the Cluster Test Tool, see the section Error Logging.
  • The cluster is not in service in a production environment.

    Launching a Custom Test Procedure

    To run custom testing:

      1. Enter smit hacmp
      2. In SMIT, select either one of the following options:
  • Extended Configuration
  • Problem Determination Tools
    Then select HACMP Cluster Test Tool.
      3. In the HACMP Cluster Test Tool panel, select Execute Custom Test Procedure.
      4. In the Execute Custom Test Procedure panel, enter field values as follows:
    Test Plan
    (Required) The full path to the Test Plan for the Cluster Test Tool. This file specifies the tests for the tool to execute.
    Variable File
    (Using a variables file is optional but recommended.) The full path to the variables file for the Cluster Test Tool. This file specifies the variable definitions used in processing the Test Plan.
    Verbose Logging
    When set to yes, includes additional information in the log file that may help to judge the success or failure of some tests. For more information about verbose logging, see the section Running Automated Tests. The default is yes.
    Select no to decrease the amount of information logged by the Cluster Test Tool.
    Cycle Log File
    When set to yes, uses a new log file to store output from the Cluster Test Tool. The default is yes.
    Select no to append messages to the current log file.
    For more information about cycling the log file, see the section Log File Rotation.
    Abort on Error
    When set to no, the Cluster Test Tool continues to run tests after some of the tests being run fail. This may cause subsequent tests to fail because the cluster state is different from the one expected by one of those tests. The default is no.
    Select yes to stop processing after the first test fails.
    For information about the conditions under which the Cluster Test Tool stops running, see the section Cluster Test Tool Stops Running.
    Note: The tool stops running and issues an error if a test fails and Abort on Error is selected.
      5. Press Enter to start running the custom tests.
      6. Evaluate the test results.
    For information about evaluating test results, see the section Evaluating Results.

    Evaluating Results

    You evaluate test results by reviewing the contents of the log file created by the Cluster Test Tool. When you run the Cluster Test Tool from SMIT, it displays status messages to the screen and stores output from the tests in the file /var/hacmp/log/cl_testtool.log. Messages indicate when a test starts and finishes and provide additional status information. More detailed information, especially when verbose logging is enabled, is stored in the log file than appears on the screen. Information is also logged to the hacmp.out file. For information about the hacmp.out file, see Chapter 2: Using Cluster Log Files in the Troubleshooting Guide.

    Criteria for Test Success or Failure

    The following criteria determine the success or failure of cluster tests:

  • Did the cluster stabilize?
    For the Cluster Test Tool, a cluster is considered stable when:
    • The Cluster Manager has a status of stable on each node, or is not running.
    • Nodes that should be online are online. If a node is stopped and that node is the last node in the cluster, the cluster is considered stable when the Cluster Manager is inoperative on all nodes.
    • No events are in the event queue for HACMP.
    The Cluster Test Tool also monitors HACMP timers that may be active. The tool waits for some of these timers to complete before determining cluster stability. For more information about how the Cluster Test Tool interacts with HACMP timers, see the section Working with Timer Settings.
  • Did an appropriate recovery event run for the test?
  • Is a specific node online or offline as specified?
  • Are all expected resource groups still online within the cluster?
  • Did a test that was expected to run actually run?
    Every test checks to see if it makes sense to be run; this is called a check for “rationality”. A test returning a NOT RATIONAL status indicates the test could not be run because the entrance criteria could not be met; for example, trying to run the NODE_UP test on a node that is already up. A warning message will be issued along with the exit status to explain why the test was not run. Irrational tests do not cause the Cluster Test Tool to abort.

    The NOT RATIONAL status indicates that the test was not appropriate for your cluster. When performing automated testing, it is important to understand why the test did not run. For custom cluster tests, check the sequence of events and modify the test plan to ensure that the test runs; consider the order of the tests and the state of the cluster before running the test plan. For more information, refer to the section Setting up Custom Cluster Testing.

    The tool treats availability as the primary criterion when reporting success or failure for a test. For example, if the resource groups that are expected to be available are available, the test passes.

    Keep in mind that the Cluster Test Tool is testing the cluster configuration, not testing HACMP. In some cases the configuration may generate an error that causes a test to fail, even though the error is the expected behavior. For example, if a resource group enters the error state and there is no node to acquire the resource group, the test fails.

    Note: If a test generates an error, the Cluster Test Tool interprets the error as a test failure. For information about how the Cluster Test Tool determines the success or failure of a test, see the Success Indicators subsections for each test in the section Description of Tests.

    Recovering the Control Node after Cluster Manager Stops

    If a CLSTRMGR_KILL test runs on the control node and stops it, you must reboot the control node; the tool takes no action to recover from the failure. After the node reboots, testing continues.

    To monitor testing after the Cluster Test Tool starts again, review output in the /var/hacmp/log/cl_testtool.log file. To determine whether a test procedure completes, run the tail -f command on the /var/hacmp/log/cl_testtool.log file.
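
    For example:

    tail -f /var/hacmp/log/cl_testtool.log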

    How to Avoid Manual Intervention

    You can avoid manual intervention to reboot the control node during testing by:

  • Editing the /etc/cluster/hacmp.term file to change the default action after an abnormal exit. The clexit.rc script checks for the presence of this file and, if the file is executable, calls it instead of halting the system automatically. A minimal example appears after this list.
  • Configuring the node to auto-Initial Program Load (IPL) before running the Cluster Test Tool.
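
    The following is a minimal, hypothetical sketch of an /etc/cluster/hacmp.term script that reboots the node instead of leaving it halted, so that testing can resume once the node and cluster services are back up. It assumes that cluster services are configured to restart at boot; review it against your site's recovery policy before using anything like it, and remember that the file must be executable for clexit.rc to call it.

    #!/bin/ksh
    # Hypothetical hacmp.term sketch: record the abnormal Cluster Manager
    # exit, then reboot so that the Cluster Test Tool can continue once
    # the node rejoins the cluster.
    /usr/bin/logger -t hacmp.term "Cluster Manager exited abnormally; rebooting node"
    /usr/sbin/shutdown -Fr now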

    Error Logging

    The Cluster Test Tool has several useful functions that enable you to work with logs.

    Log Files: Overview

    If a test fails, the Cluster Test Tool collects information in automatically created log files. To collect the logs, the Cluster Test Tool creates the directory /var/hacmp/cl_testtool if it does not exist; HACMP never deletes the files in this directory. You evaluate the success or failure of tests by reviewing the contents of the Cluster Test Tool log file, /var/hacmp/utilities/cl_testtool.log.

    For each test plan that has any failures, the tool creates a new directory under /var/hacmp/cl_testtool. If the test plan has no failures, the tool does not create a log directory. The directory name is unique and consists of the name of the Cluster Test Tool plan file, and the time stamp when the test plan was run.

    Log File Rotation

    The Cluster Test Tool saves up to three log files and numbers them so that you can compare the results of different cluster tests. The tool rotates the files, overwriting the oldest one. The following list shows the three files saved:

    /var/hacmp/utilities/cl_testtool.log

    /var/hacmp/utilities/cl_testtool.log.1

    /var/hacmp/utilities/cl_testtool.log.2
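
    For example, to see which of these log files currently exist and when each was last written, you can list them by modification time:

    ls -lt /var/hacmp/utilities/cl_testtool.log*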

    If you do not want the tool to rotate the log files, you can disable this feature from SMIT. For information about turning off this feature, see the section Running Automated Tests or Setting up Custom Cluster Testing.

    Log File Entries

    The entries in the log file are in the format:

    DD/MM/YYYY_hh:mm:ss Message text . . . 
    

    where DD/MM/YYYY_hh:mm:ss indicates the day, month, and year, followed by the hour, minutes, and seconds.

    The following example shows the type of output stored in the log file:

    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55: | Initializing Variable Table 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55:   Using Variable File: /tmp/sample_variables 
    04/02/2006/_13:21:55:           data line: node1=waltham 
    04/02/2006/_13:21:55:           key: node1 - val: waltham 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55: | Reading Static Configuration Data 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55:   Cluster Name: Test_Cluster 
    04/02/2006/_13:21:55:   Cluster Version: 7 
    04/02/2006/_13:21:55:   Local Node Name: waltham 
    04/02/2006/_13:21:55:   Cluster Nodes: waltham belmont 
    04/02/2006/_13:21:55:   Found 1 Cluster Networks 
    04/02/2006/_13:21:55:   Found 4 Cluster Interfaces/Device/Labels 
    04/02/2006/_13:21:55:   Found 0 Cluster Resource Groups 
    04/02/2006/_13:21:55:   Found 0 Cluster Resources 
    04/02/2006/_13:21:55:   Event Timeout Value: 720 
    04/02/2006/_13:21:55:   Maximum Timeout Value: 2880 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55: | Building Test Queue 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55:   Test Plan: /tmp/sample_event 
    04/02/2006/_13:21:55:           Event 1: NODE_UP: NODE_UP,ALL,starts 
    cluster services on all nodes 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55: | Validate NODE_UP 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55:   Event node: ALL 
    04/02/2006/_13:21:55:   Configured nodes: waltham belmont 
    04/02/2006/_13:21:55:           Event 2: NODE_DOWN_GRACEFUL: 
    NODE_DOWN_GRACEFUL,node1,stops cluster services gracefully on node1 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55: | Validate NODE_DOWN_GRACEFUL 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55:   Event node: waltham 
    04/02/2006/_13:21:55:   Configured nodes: waltham belmont 
    04/02/2006/_13:21:55:           Event 3: WAIT: WAIT,20 
    04/02/2006/_13:21:55:           Event 4: NODE_UP: NODE_UP,node1,starts 
    cluster services on node1 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55: | Validate NODE_UP 
    04/02/2006/_13:21:55: 
    ------------------------------------------------------- 
    04/02/2006/_13:21:55:   Event node: waltham 
    04/02/2006/_13:21:55:   Configured nodes: waltham belmont 
    04/02/2006/_13:21:55:  
    . 
    . 
    . 
    

    Log File Example

    If a test fails, you will see output similar to the following:

    ===================================================================== 
    Test 1 Complete - NETWORK_DOWN_LOCAL: fail service network 
    Test Completion Status: FAILED 
    ===================================================================== 
    Copying log files hacmp.out and clstrmgr.debug from all nodes to 
    directory /var/hacmp/cl_testtool/rg_fallover_plan.1144942311 
    on node prodnode1. 
    

    After that, you can examine the directory /var/hacmp/cl_testtool/rg_fallover_plan.1144942311 on node prodnode1.

    In the log directory, the tool creates separate files for each test. The names for the specific log files stored in the directory have this structure:

    <testnum>.<testname>.<node>.<logfile>

    where

  • testnum is the order in which the test appears in the test plan file
  • testname is the name of the test that failed
  • node is the node from which the log was collected
  • logfile is the source of the logging information, either the hacmp.out or clstrmgr.debug file.

    For example, if the NETWORK_DOWN_LOCAL test fails and it is the first test that was run, and later in the test plan the fourth test, named RG_MOVE, also fails, you will see the following files in the /var/hacmp/cl_testtool/rg_fallover_plan.1144942311 directory:

    1.NETWORK_DOWN_LOCAL.prodnode1.clstrmgr.debug 
    1.NETWORK_DOWN_LOCAL.prodnode1.hacmp.out 
    1.NETWORK_DOWN_LOCAL.prodnode2.clstrmgr.debug 
    1.NETWORK_DOWN_LOCAL.prodnode2.hacmp.out 
    4.RG_MOVE.prodnode1.clstrmgr.debug 
    4.RG_MOVE.prodnode1.hacmp.out 
    4.RG_MOVE.prodnode2.clstrmgr.debug 
    4.RG_MOVE.prodnode2.hacmp.out 
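
    For example, to check one of the collected hacmp.out copies for the failure markers that the tool itself searches for (see the default search strings later in this chapter), you could run commands such as the following; the directory and file names are taken from the example above and will differ for your own test plans:

    cd /var/hacmp/cl_testtool/rg_fallover_plan.1144942311
    grep -n "EVENT FAILED" 1.NETWORK_DOWN_LOCAL.prodnode1.hacmp.out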
    

    The hacmp.out File

    The hacmp.out file also logs the start of each test that the Cluster Test Tool runs on each cluster node. This log entry has the following format:

    TestName: datetimestring1:datetimestring2

    where

    TestName
    The name of the test being processed.
    datetimestring1
    The date and time on the control node when the Cluster Test Tool starts to run the test, in the format MMDDHHmmYY (month day hour minute year).
    datetimestring2
    The date and time on the node on which the test runs, in the same MMDDHHmmYY format.
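
    For illustration only, an entry for a NODE_UP test started at 13:21 on April 2, 2006 might look similar to the following (hypothetical values):

    NODE_UP: 0402132106:0402132106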

    Note: The Cluster Test Tool uses the date and time strings to query the AIX 5L error log when necessary.

    Verbose Logging: Overview

    By default, the Cluster Test Tool uses verbose logging to provide a wealth of information about the results of cluster testing. You can customize the type of information that the tool gathers and stores in the Cluster Test Tool log file.

    Note: The Cluster Snapshot utility does not include the Cluster Test Tool log file because this file is specific to HACMP cluster testing at a specific point in time—not an indication of ongoing cluster status.

    With verbose logging enabled, the Cluster Test Tool:

  • Provides detailed information for each test run
  • Runs the following utilities on the control node between the processing of one test and the next test in the list:

    Utility
    Type of Information Collected
    clRGinfo
    The location and status of resource groups
    errpt
    Errors stored in the system error log file

  • Processes each line in the following files to identify additional information to be included in the Cluster Test Tool log file. The utilities included are run on each node in the cluster after a test finishes running.

    File
    Type of Information Specified
    cl_testtool_log_cmds
    A list of utilities to be run to collect additional status information
    cl_testtool_search_strings
    Text strings that may be in the hacmp.out file. The Cluster Test Tool searches for these strings and inserts any lines that match into the Cluster Test Tool log file.

    If you want to gather only basic information about the results of cluster testing, you can disable verbose logging for the tool. For information about disabling verbose logging for the Cluster Test Tool, see the section Running Automated Tests or Setting up Custom Cluster Testing.

    Customizing the Types of Information to Collect

    You can customize the types of logging information to be gathered during testing. When verbose logging is enabled for the Cluster Test Tool, it runs the utilities listed in the /usr/es/sbin/cluster/etc/cl_testtool_log_cmds file and collects the status information that the specified commands generate. The Cluster Test Tool runs each of the commands listed in the cl_testtool_log_cmds file after each test completes, gathers output for each node in the cluster, and stores this information in the Cluster Test Tool log file.

    You can collect information specific to a node by adding or removing utilities from the list. For example, if you have an application server running on two of the nodes in a four-node cluster, you could add application-specific commands to the list on the nodes running the application servers.

    If you want all of the cluster nodes to use the same cl_testtool_log_cmds file, you can add it to a file collection. For information about including files in a file collection, see Chapter 7: Verifying and Synchronizing an HACMP Cluster.

    By default, the cl_testtool_log_cmds file includes the following utilities:

    Utility
    Type of Information Collected
    /usr/es/sbin/cluster/utilities/cldump
    A snapshot of the status of key cluster components—the cluster itself, the nodes in the cluster, the network interfaces connected to the nodes, and the resource groups on each node
    lssrc -ls clstrmgrES
    The status of the Cluster Manager
    lssrc -ls topsvcs
    The status of Topology Services
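
    Based on the defaults listed above, the active (uncommented) portion of the cl_testtool_log_cmds file contains one command per line, similar to the following:

    /usr/es/sbin/cluster/utilities/cldump
    lssrc -ls clstrmgrES
    lssrc -ls topsvcs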

    The file also contains entries for the following utilities, but they are commented out and not run. If you want to run any of these utilities between each test, open the file and remove the comment character from the beginning of the command line for the utility.

    Utility
    Type of Information Collected
    snmpinfo -m dump -v -o /usr/es/sbin/cluster/hacmp.defs cluster
    Information on MIB cluster status
    snmpinfo -m dump -v -o /usr/sbin/cluster/hacmp.defs resGroupNodeState
    Information on MIB resource group state
    LANG=C lssrc -a | grep -vw "inoperative$"
    The status of all subsystems for each host
    svmon -C clstrmgr
    Memory usage statistics for the Cluster Manager
    /usr/sbin/rsct/bin/hatsdmsinfo
    Information about the deadman switch timer
    netstat -i ; netstat -r
    Information about configured interfaces and routes
    lssrc -ls gsclvmd
    Information about gsclvmd—the access daemon for enhanced concurrent mode volume groups
    ps auxw
    Process information
    lsvg -o
    Information about active volume groups (those that are varied on and accessible)
    lspv
    Information about the physical volumes in a volume group
    vmstat; vmstat -s
    System resource utilization information that includes statistics for virtual memory, kernel, disks, traps, and CPU activity

    You can also add and remove commands from the cl_testtool_log_cmds file.

    Note: Enter only one command on each line of the file. The tool executes one command per line.
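
    For example, on the nodes that run an application server you might append application-specific status commands such as the following; the file system and subsystem names are hypothetical and must be replaced with values from your own environment:

    # application-specific status commands (hypothetical examples)
    df -k /app01
    lssrc -s app01_server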

    Adding Data from hacmp.out to the Cluster Test Tool Log File

    You can add messages that include specified text in the hacmp.out file to the Cluster Test Tool log file. With verbose logging enabled, the tool uses the /usr/es/sbin/cluster/etc/cl_testtool/cl_testtool_search_strings file to identify text strings to search for in hacmp.out. For any text string that you specify on a separate line in the cl_testtool_search_strings file, the tool:

  • Searches the hacmp.out file for a matching string
  • Logs the line containing that string, accompanied by the line number from the hacmp.out file, to the Cluster Test Tool log file.

    You can use the line number to locate the line in the hacmp.out file and then review that line within the context of other messages in the file.

    By default, the file contains the following lines:

    !!!!!!!!!!   ERROR   !!!!!!!!!! 
              EVENT FAILED  
    

    You can edit the cl_testtool_search_strings file on each node to specify a search string specific to a node. This way, the cl_testtool_search_strings file is different on different nodes.

    If you want all of the cluster nodes to use the same cl_testtool_search_strings file, you can add it to a file collection and synchronize the cluster. For information about including files in a file collection, see Chapter 7: Verifying and Synchronizing an HACMP Cluster.

    Note: Cluster synchronization does not propagate a cl_testtool_search_strings file to other nodes in a cluster unless the file is part of a file collection.

    To edit the cl_testtool_search_strings file:

  • On each line of the file, specify a single text string that you want the tool to locate in the hacmp.out file.
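
    For example, to have the tool also copy application-related failure messages from hacmp.out into its log file, you might append lines such as the following; the strings are hypothetical and should be replaced with text that actually appears in your hacmp.out file:

    Application monitor failed
    acquire_takeover_addr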

    Fixing Problems when Running Cluster Tests

    This section discusses the following issues that you may encounter when testing a cluster:

  • Cluster Test Tool Stops Running
  • Control Node Becomes Unavailable
  • Cluster Does Not Return to a Stable State
  • Working with Timer Settings
  • Testing Does Not Progress as Expected
  • Unexpected Test Results.

    Cluster Test Tool Stops Running

    The Cluster Test Tool can stop running under the following conditions:

  • The Cluster Test Tool fails to initialize.
  • A test fails and Abort on Error is set to yes for the test procedure.
  • The tool times out waiting for cluster stabilization, or the cluster fails to stabilize after a test. See the section Working with Timer Settings.
  • An error prohibits the Cluster Test Tool from running a test, such as a configuration problem in AIX 5L or a missing script.
  • A cluster recovery event fails and requires user intervention.

    Control Node Becomes Unavailable

    If the control node experiences an unexpected failure while the Cluster Test Tool is running, the testing stops. No action is taken to recover from the failure.

    To recover from the failure:

      1. Bring the node back online and start cluster services in the usual manner.
    You may need to reboot the control node.
      2. Stabilize the cluster.
      3. Run the test again.
    Note: The failure of the control node may invalidate the testing that occurred prior to the failure.

    If a CLSTRMGR_KILL test runs on the control node, the node and cluster services need to restart. For information about handling this situation, see the section Recovering the Control Node after Cluster Manager Stops.

    Cluster Does Not Return to a Stable State

    The Cluster Test Tool stops running tests after a timeout if the cluster does not return to a stable state either:

  • While a test is running
  • As a result of a test being processed.

    The timeout is based on ongoing cluster activity and the cluster-wide event duration time until warning values. When the tool stops for this reason, an error appears on the screen and is logged to the Cluster Test Tool log file before the tool exits.

    After the cluster returns to a stable state, it is possible that the cluster components, such as resource groups, networks, and nodes, are not in a state consistent with the specifications of the list of tests. If the tool cannot run a test due to the state of the cluster, the tool generates an error. The Cluster Test Tool continues to process tests.

    If the cluster state does not let you continue a test, you can:

      1. Reboot cluster nodes and restart the Cluster Manager.
      2. Inspect the Cluster Test Tool log file and the hacmp.out file to get more information about what may have happened when the test stopped.
      3. Review the timer settings for the following cluster timers, and make sure that the settings are appropriate to your cluster:
  • Time until warning
  • Stabilization interval
  • Monitor interval.

    For information about timers in the Cluster Test Tool, and about how application monitor timers can affect whether the tool times out, see the section Working with Timer Settings.

    Working with Timer Settings

    The Cluster Test Tool requires a stable HACMP cluster for testing. If the cluster becomes unstable, the time that the tool waits for the cluster to stabilize depends on the activity in the cluster:

  • No activity. The tool waits for twice the event duration time until warning interval (also referred to as config_too_long), then times out.
  • Activity present. The tool calculates a timeout value based on the number of nodes in the cluster and the setting for the time until warning interval.

    If the time until warning interval is too short for your cluster, testing may time out. To review or change the setting for the time until warning interval, in HACMP SMIT, select HACMP Extended Configuration > Extended Performance Tuning Parameters Configuration and press Enter.

    For complete information on tuning event duration time, see the section Tuning Event Duration Time Until Warning in Chapter 5: Configuring Cluster Events.

    The settings for the following timers configured for an application monitor can also affect whether testing times out:

  • Stabilization interval
  • Monitor interval.

    The settling time for resource groups does not affect whether or not the tool times out.

    Stabilization Interval for an Application Monitor

    If this timer is active, the Cluster Test Tool does not time out when waiting for cluster stability. If the monitor fails, however, and recovery actions are underway, the Cluster Test Tool may time out before the cluster stabilizes.

    Make sure the stabilization interval configured in HACMP is appropriate for the application being monitored.

    For information about setting the stabilization interval for an application, see Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended).

    Monitor Interval for a Custom Application Monitor

    When the Cluster Test Tool runs a server_down test, it waits for the length of time specified by the monitor interval before the tool checks for cluster stability. The monitor interval defines how often to poll the application to make sure that the application is running.

    The monitor interval should be long enough to allow recovery from a failure. If the monitor interval is too short, the Cluster Test Tool may time out when a recovery is in process.

    For information about setting the monitor interval for an application, see Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended).

    Testing Does Not Progress as Expected

    If the Cluster Test Tool is not processing tests and recording results as expected, use the Cluster Test Tool log file to try to resolve the problem:

      1. Ensure that verbose logging for the tool is enabled.
    For information about verbose logging for the Cluster Test Tool, see the section Error Logging.
      2. View logging information from the Cluster Test Tool log file /var/hacmp/utilities/cl_testtool.log. The tool directs more information to the log file than to the screen.
      3. Add other tools to the cl_testtool_log_cmds file to gather additional debugging information. This way you can view this information within the context of the larger log file.
    For information about adding commands to the cl_testtool_log_cmds file, see the section Customizing the Types of Information to Collect.

    Unexpected Test Results

    The basic measure of success for a test is availability. In some instances, you may consider that a test has passed even though the tool indicates that the test failed. Be sure that you are familiar with the criteria that determine whether a test passes or fails. For information about these criteria, see the section Evaluating Results.

    Also ensure that:

  • Settings for cluster timers are appropriate to your cluster. See the section Cluster Does Not Return to a Stable State.
  • Verbose logging is enabled and customized to investigate an issue. See the section Testing Does Not Progress as Expected.
