
Chapter 10: Monitoring an HACMP Cluster


This chapter describes tools you can use to monitor an HACMP cluster.

You can use either ASCII SMIT or WebSMIT to configure and manage the cluster and view interactive cluster status. Starting with HACMP 5.4, you can also use WebSMIT to navigate, configure, and view the status and graphical displays of the running cluster. For more information about WebSMIT, see Chapter 2: Administering a Cluster Using WebSMIT.

Note: The default locations of log files are used in this chapter. If you redirected any logs, check the appropriate location.

The main topics in this chapter include:

  • Periodically Monitoring an HACMP Cluster
  • Monitoring a Cluster with HAView
  • Monitoring Clusters with Tivoli Distributed Monitoring
  • Monitoring Clusters with clstat
  • Monitoring Applications
  • Displaying an Application-Centric Cluster View
  • Using Resource Groups Information Commands
  • Using HACMP Topology Information Commands
  • Monitoring Cluster Services
  • HACMP Log Files.

    Periodically Monitoring an HACMP Cluster

    By design, HACMP provides recovery for various failures that occur within a cluster. For example, HACMP can compensate for a network interface failure by swapping in a standby interface. As a result, it is possible that a component in the cluster has failed and that you are unaware of the fact. The danger here is that, while HACMP can survive one or possibly several failures, each failure that escapes your notice threatens a cluster’s ability to provide a highly available environment, as the redundancy of cluster components is diminished.

    To avoid this situation, you should customize your system by adding event notification to the scripts designated to handle the various cluster events. You can specify a command that sends you mail indicating that an event is about to happen (or that an event has just occurred), along with information about the success or failure of the event. The mail notification system enhances the standard event notification methods.

    In addition, HACMP offers application monitoring capability that you can configure and customize in order to monitor the health of specific applications and processes.

    Use the AIX 5L Error Notification facility to add an additional layer of high availability to an HACMP environment. You can add notification for failures of resources for which HACMP does not provide recovery by default. The combination of HACMP and the high availability features built into the AIX 5L system keeps single points of failure to a minimum; the Error Notification facility can further enhance the availability of your particular environment. See the chapter on Configuring AIX 5L for HACMP in the Installation Guide for suggestions on customizing error notification.

    See Chapter 7: Planning for Cluster Events in the Planning Guide for detailed information on predefined events and on customizing event handling. Also, be sure to consult your worksheets, to document any changes you make to your system, and to periodically inspect the key cluster components to make sure they are in full working order.

    Automatic Cluster Configuration Monitoring

    Verification automatically runs on one user-selectable HACMP cluster node once every 24 hours. By default, the first node in alphabetical order runs the verification at midnight. If verification finds errors, it warns about recent configuration issues that might cause problems at some point in the future. HACMP stores the results of the automatic monitoring on every available cluster node in the /var/hacmp/log/clutils.log file.

    If cluster verification detects some configuration errors, you are notified about the potential problems:

  • The exit status of verification is published across the cluster along with the information about cluster verification process completion.
  • Broadcast messages are sent across the cluster and displayed on stdout. These messages inform you about detected configuration errors.
  • A cluster_notify event runs on the cluster and is logged in hacmp.out (if cluster services is running).
  • More detailed information is available on the node that completes cluster verification in the /var/hacmp/clverify/clverify.log file. If a failure occurs during processing, error messages and warnings clearly indicate the node and the reasons for the verification failure.
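
    For example, you might review the most recent verification results from the command line as follows (a minimal sketch using the default log locations named above; adjust the paths if you have redirected these logs):

    # Check the consolidated results of the latest automatic verification
    tail -n 50 /var/hacmp/log/clutils.log

    # On the node that ran verification, review the detailed results
    more /var/hacmp/clverify/clverify.log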

    Tools for Monitoring an HACMP Cluster

    HACMP supplies tools for monitoring a cluster. These are described in subsequent sections:

  • The HAView utility extends Tivoli NetView services so you can monitor HACMP clusters and cluster components from a single node. Using HAView, you can also view the full cluster event history in the /usr/es/sbin/cluster/history/cluster.mmddyyyy file. The event history (and other cluster status and configuration information) is accessible through Tivoli NetView’s menu bar. For more information, see Monitoring a Cluster with HAView.
  • Cluster Monitoring with Tivoli allows you to monitor clusters and cluster components and perform cluster administration tasks through your Tivoli Framework console. For more information, see Monitoring Clusters with Tivoli Distributed Monitoring.
  • clstat (the /usr/es/sbin/cluster/clstat utility) reports the status of key cluster components—the cluster itself, the nodes in the cluster, the network interfaces connected to the nodes, the service labels, and the resource groups on each node.
  • WebSMIT displays cluster information using a slightly different layout and organization. Cluster components are displayed along with their status; expanding an item reveals additional information about it, including its networks, interfaces, and active resource groups.
  • For more information, see Monitoring Clusters with clstat.
  • Application Monitoring allows you to monitor specific applications and processes and define action to take upon detection of process death or other application failures. Application monitors can watch for the successful startup of the application, check that the application runs successfully after the stabilization interval has passed, or monitor both the startup and the long-running process. For more information, see Monitoring Applications.
  • SMIT and WebSMIT give you information on the cluster.
  • You have the ability to see the cluster from an application-centric point of view.
  • The HACMP Resource Group and Application Management panel in SMIT has an option to Show Current Resource Group and Application State. The SMIT panel Show All Resources by Node or Resource Group has an option linking you to the Show Current Resource Group and Application State panel.
  • Using the WebSMIT version lets you expand and collapse areas of the information. Colors reflect the state of individual items (for example, green indicates online).
  • For more information, see Displaying an Application-Centric Cluster View.
  • The System Management (C-SPOC) > Manage HACMP Services > Show Cluster Services SMIT panel shows the status of the HACMP daemons.
  • The Application Availability Analysis tool measures uptime statistics for applications with application servers defined to HACMP. For more information, see Measuring Application Availability.
  • The clRGinfo and cltopinfo commands display useful information on resource group configuration and status and topology configuration, respectively. For more information, see Using Resource Groups Information Commands.
  • Log files allow you to track cluster events and history: The /usr/es/adm/cluster.log file tracks cluster events; the /tmp/hacmp.out file records the output generated by configuration scripts as they execute; the /usr/es/sbin/cluster/history/cluster.mmddyyyy log file logs the daily cluster history; the /tmp/cspoc.log file logs the status of C-SPOC commands executed on cluster nodes. You should also check the RSCT log files. For more information, see HACMP Log Files.

    In addition to these cluster monitoring tools, you can use the following:

  • The Event Emulator provides an emulation of cluster events. For more information, see the section on Emulating Events in the Concepts Guide.
  • The Custom Remote Notification utility allows you to define a notification method through the SMIT interface to issue a customized page in response to a cluster event. In HACMP 5.3 and up, you can also send text messaging notification to any address including a cell phone. For information and instructions on setting up pager notification, see the section on Configuring a Custom Remote Notification Method in the Planning Guide.
  • Monitoring a Cluster with HAView

    HAView is a cluster monitoring utility that allows you to monitor HACMP clusters using NetView for UNIX. Using Tivoli NetView, you can monitor clusters and cluster components across a network from a single management station.

    HAView creates and registers Tivoli NetView objects that represent clusters and cluster components. It also creates submaps that present information about the state of all nodes, networks, network interfaces, and resource groups associated with a particular cluster. This cluster status and configuration information is accessible through Tivoli NetView’s menu bar.

    HAView monitors cluster status using the Simple Network Management Protocol (SNMP). It combines periodic polling and event notification through traps to retrieve cluster topology and state changes from the HACMP management agent, the Cluster Manager.

    You can view cluster event history using the HACMP Event Browser and node event history using the Cluster Event Log. Both browsers can be accessed from the Tivoli NetView menu bar. The /usr/es/sbin/cluster/history/cluster.mmddyyyy file contains more specific event history. This information is helpful for diagnosing and troubleshooting fallover situations. For more information about this log file, see Chapter 2: Using Cluster Log Files in the
    Troubleshooting Guide.

    HAView Installation Requirements

    HAView has a client/server architecture. You must install both an HAView server image and an HAView client image, either on the same machine or on separate server and client machines. For information about installation requirements, see the Installation Guide.

    HAView File Modification Considerations

    Certain files need to be modified in order for HAView to monitor your cluster properly. When configuring HAView, you should check and edit the following files:

  • haview_start
  • clhosts
  • snmpd.conf or snmpdv3.conf

    haview_start File

    You must edit the haview_start file so that it includes the name of the node that has the HAView server executable installed. This is how the HAView client knows where the HAView server is located. Regardless of whether the HAView server and client are on the same node or different nodes, you are required to specify the HAView server node in the haview_start file.

    The haview_start file is loaded when the HAView client is installed and is stored in /usr/haview. Initially, the haview_start file contains only the following line:

    "${HAVIEW_CLIENT:-/usr/haview/haview_client}" $SERVER 
    

    You must add the following line to the file:

    SERVER="${SERVER:-<your server name>}" 
    

    For example, if the HAView server is installed on mynode, the edited haview_start file appears as follows:

    SERVER="${SERVER:-mynode}" 
    "${HAVIEW_CLIENT:-/usr/haview/haview_client}" $SERVER 
    

    where mynode is the node that contains the HAView server executable.

    Note: If you have configured a persistent node IP label on a node on a network in your cluster, it maintains a persistent “node address” on the node – this address can be used in the haview_start file.

    clhosts File

    HAView monitors a cluster’s state within a network topology based on cluster-specific information in the /usr/es/sbin/cluster/etc/clhosts file. The clhosts file must be present on the Tivoli NetView management node. Make sure this file contains the IP address or IP label of the service and/or base interfaces of the nodes in each cluster that HAView is to monitor.

    Make sure that the hostname and the service label of your Tivoli NetView nodes are exactly the same. (If they are not the same, add an alias in the /etc/hosts file to resolve the name difference.)

    Warning: If an invalid IP address exists in the clhosts file, HAView will fail to monitor the cluster. Make sure the IP addresses are valid, and there are no extraneous characters in the clhosts file.
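
    For illustration, a minimal clhosts file on the management node might contain entries such as the following (the labels shown here are hypothetical; list the service or base IP labels or addresses of your own cluster nodes, one per line):

    # /usr/es/sbin/cluster/etc/clhosts on the Tivoli NetView management node
    # (hypothetical service IP labels of the nodes in one monitored cluster)
    nodeA_svc
    nodeB_svc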

    snmpd.conf File

    The Tivoli NetView management node must also be configured in the list of trap destinations in the snmpd.conf files on the cluster nodes of all clusters you want it to monitor. This makes it possible for HAView to utilize traps in order to reflect cluster state changes in the submap in a timely manner. Also, HAView can discover clusters not specified in the clhosts file on the nodes in another cluster.

    Note: The default version of the snmpd.conf file for AIX 5L v.5.2 and AIX 5L v. 5.3 is snmpdv3.conf.

    The format for configuring trap destinations is as follows:

    trap <community name> <IP address of Tivoli NetView management node> 1.2.3 fe 
    

    For example, enter:

    trap	public	140.186.131.121 	1.2.3 fe	 
    

    Note the following:

  • You can specify the name of the management node instead of the IP address.
  • You can include multiple trap lines in the snmpd.conf file.
  • Note: HACMP now supports an SNMP Community Name other than “public.” If the default SNMP Community Name in /etc/snmpd.conf has been changed to something other than “public,” HACMP will still function correctly. The SNMP Community Name used by HACMP is the first name found that is not “private” or “system” when you run the lssrc -ls snmpd command.
    Clinfo also obtains the SNMP Community Name in the same manner. Clinfo still supports the -c option for specifying the SNMP Community Name, but its use is not required. Using the -c option is considered a security risk because running a ps command could reveal the SNMP Community Name. If it is important to keep the SNMP Community Name protected, change the permissions on /tmp/hacmp.out, /etc/snmpd.conf, /smit.log, and /usr/tmp/snmpd.log so that they are not world-readable.
    See the AIX documentation for full information on the snmpd.conf file. Version 3 has some differences from Version 1.
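
    For example, the following commands sketch the checks described in the note above: the first displays the community names known to the snmpd subsystem, and the second removes world-read access from the files that could expose the name:

    # Show the SNMP community names currently in use by snmpd
    lssrc -ls snmpd

    # Remove world-read permission from files that can reveal the community name
    chmod o-r /tmp/hacmp.out /etc/snmpd.conf /smit.log /usr/tmp/snmpd.log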

    Tivoli NetView Hostname Requirements for HAView

    The following hostname requirements apply to using HAView in a Tivoli NetView environment. If you change the hostname of a network interface, the Tivoli NetView database daemons and the default map are affected.

    Hostname Effect on the Tivoli NetView Daemon

    The hostname required to start Tivoli NetView daemons must be associated with a valid interface name or else Tivoli NetView fails to start.

    Hostname Effect on the Tivoli NetView Default Map

    If you change the hostname of the Tivoli NetView client, the new hostname does not match the original hostname referenced in the Tivoli NetView default map database and Tivoli NetView will not open the default map. Using the Tivoli NetView mapadmin command, you need to update the default map (or an invalid map) to match the new hostname.

    See the Tivoli NetView Administrator’s Guide for more information about updating or deleting an invalid Tivoli NetView map.

    Starting HAView

    Once you have installed the HAView client and server, HAView is started and stopped when you start or stop Tivoli NetView. However, before starting Tivoli NetView/HAView, check the management node as follows:

  • Make sure both client and server components of HAView are installed. See the installation or migration chapters in the Installation Guide for more information.
  • Make sure access control has been granted to remote nodes by running the xhost command with the plus sign (+) or with specified nodes:
  • xhost + (to grant access to all computers)

    or, to grant access to specific nodes only:

    xhost <computers to be given access>

  • Make sure the DISPLAY variable has been set to the monitoring node and to a label that can be resolved by and contacted from remote nodes:
  • export DISPLAY=<monitoring node>:0.0

    These actions allow you to access HACMP SMIT panels using the HAView Cluster Administration option.

    After ensuring these conditions are set, type the following to start Tivoli NetView:

    /usr/OV/bin/nv6000

    Refer to the Tivoli NetView User’s Guide for Beginners for further instructions about starting Tivoli NetView.

    When Tivoli NetView starts, HAView creates objects and symbols to represent a cluster and its components. Through submaps, you can view detailed information about these components.

    HAView places the Clusters symbol (shown below) on the Tivoli NetView map after Tivoli NetView starts. The Clusters symbol is added to the NetView Root map and is placed alongside the Tivoli NetView Collections symbol and other symbols:

    HAView Clusters Symbol 
    

    Viewing Clusters and Components

    To see which clusters HAView is currently monitoring, double-click the Clusters symbol. The Clusters submap appears. You may see one or more symbols that represent specific clusters. Each symbol is identified by a label indicating the cluster’s name. Double-click a cluster symbol to display symbols for nodes, networks, and resource groups within that cluster.

    Note that the cluster status symbol may remain unknown until the next polling cycle, even though the status of its cluster components is known. See Customizing HAView Polling Intervals for more information about the default intervals and how to change them using SMIT.

    You can view component details at any time using the shortcut ctrl-o. See Obtaining Component Details in HAView for information and instructions.

    Read-Write and Read-Only NetView Maps

    Normally, you have one master monitoring station for Tivoli NetView/HAView. This station is supplied with new information as cluster events occur, and its map is updated so it always reflects the current cluster status.

    In normal cluster monitoring operations, you will probably not need to open multiple Tivoli NetView stations on the same node. If you do, and you want the additional stations to be updated with current cluster status information, you must be sure they use separate maps with different map names. For more information on multiple maps and changing map permissions, see the Tivoli NetView Administrator’s Guide.

    Interpreting Cluster Topology States

    When using HAView to view cluster topology, symbols for clusters and cluster components such as nodes and networks are displayed in various colors depending on the object’s state. The following table summarizes the colors you may see when monitoring a cluster. (For information about the resource group symbol colors, see Interpreting Resource Group Symbol Colors in HAView.)

    Critical (symbol color: Red; connection color on network submap: Red)
        The object has failed or is not functioning. If the symbol is a node or network, the node or network is DOWN.
    Normal (symbol color: Green; connection color on network submap: Black)
        The object is functioning correctly. If the symbol is a node object, the node is UP.
    Marginal (symbol color: Yellow; connection color on network submap: Red)
        Some object functions are working correctly; others are not.
    Unknown (symbol color: Blue; connection color on network submap: Blue)
        The object’s state cannot be determined. It may not be currently monitored by HAView.

    You can select Legend at any time from the Help pull-down menu to view Tivoli NetView and HAView symbols and their associated colors.

    The Navigation Tree and Submap Windows

    In addition to the submap window, the Tivoli NetView Navigation Tree Window can help you keep track of your current location in the HAView hierarchy. Press the Tree button to see the Navigation Tree Window. In the Navigation Tree, the blue outline indicates where you are in the map, that is, which submap you are in.

    The Symbols Legend

    At any time, you can select Legend from the Help pull-down menu to view all Tivoli NetView and HAView symbols and the meanings of the symbols and their various colors.

    The Help Menu

    To view help topics, select Help > Index > Tasks > HAView Topics.

    Viewing Networks

    To view the state of the nodes and addresses connected to a network associated with a specific cluster, double-click a network symbol in the specific Cluster submap. A network submap appears displaying symbols for all nodes connected to the network. The symbols appear in a color that indicates the nodes’ current state. The vertical line representing a network is called the network connection. Its color indicates the status of the connection between the node and the network.

    See Interpreting Cluster Topology States for a table of symbol colors and how they reflect the state of a cluster and its components.

    Viewing Nodes

    To view the state of nodes associated with a particular network, double-click a network symbol. A submap appears displaying all nodes connected to the network. Each symbol’s color indicates the associated node’s current state.

    You can also view the state of any individual node associated with a cluster by double-clicking on that node’s symbol in the specific cluster submap.

    Viewing Addresses

    To view the status of addresses serviced by a particular node, double-click a node symbol from either a cluster or network submap. A submap appears displaying symbols for all addresses configured on a node. Each symbol’s color indicates the associated address’s current state.

    When you view interfaces in a node submap from a network submap, all interfaces relating to that node are shown, even if they are not related to a particular network.

    Viewing Resource Groups and Resources

    Note: In HACMP 5.2 and up, resource groups are displayed in HAView as type “unknown.”

    Resource Group Ownership Symbol

    HAView indicates the current ownership of a resource group in both the Resource Group and Node submaps by showing the owner node together with the resource group as follows.

    In the Resource Group submap, ownership is shown with this symbol:

    Resource Group Ownership Symbol in Resource Group Submap 
    

    Resource Submap—Individual Resource Symbols

    The HAView Resource Group submap displays all the individual resources configured as part of a given resource group. Each type of resource has its own symbol, as shown in the following figure:

    Symbols for Individual Resource Types 
    

    Remember that individual resource symbols always appear in the blue (unknown) state, regardless of their actual state; HAView does not monitor the status of individual resources, only their presence and location.

    Interpreting Resource Group Symbol Colors in HAView

    Each symbol’s color indicates the current state of the associated resource group, as follows:

    Online/UP (symbol color: green)
        The resource group is currently operating properly on one or more nodes in the cluster.
    Offline/DOWN (symbol color: red)
        The resource group is not operating in the cluster and is not in an error condition.
    Acquiring (symbol color: yellow)
        The resource group is currently trying to come up on one of the nodes in the cluster.
    Releasing (symbol color: yellow)
        The resource group is in the process of being released from the ownership of a node.
    Error (symbol color: blue)
        The resource group has reported an error condition, and intervention is required.
    Unknown (symbol color: blue)
        The resource group’s current status cannot be obtained, possibly due to a loss of communication between the monitoring node and the cluster.

    Obtaining Component Details in HAView

    Tivoli NetView dialog boxes allow you to view detailed information about a cluster object. A dialog box can contain information about a cluster, network, node, network interface, or resource group, or about cluster events. You can access an object’s dialog box using the Tivoli NetView menu bar or the Object Context menu, or by pressing ctrl-o at any time.

    To view details about a cluster object using the Tivoli NetView menu bar:

      1. Click on an object in any submap.
      2. Select the Modify/Describe option from the Tivoli NetView Edit menu.
      3. Select the Object option. An Object Description dialog window appears.
      4. Select HAView for AIX and click on View/Modify Object Attributes. An Attributes dialog window appears.

    You can view dialog boxes for more than one object simultaneously by either clicking the left mouse button and dragging to select multiple objects, or by pressing the Alt key and clicking on all object symbols for which you want more information.

    To view details about a cluster object using the Object Context menu:

      1. Click on an object in any submap.
      2. Click on the symbol you have highlighted to display the object context menu, using Button 3 on a three-button mouse or Button 2 on a two-button mouse.
      3. Select Edit from the object context menu.
      4. Select Modify/Describe from the Edit cascade menu.
      5. Select the Object option. An Object Description dialog window appears.
      6. Select HAView for AIX and click on View/Modify Object Attributes. An Attributes dialog window appears.

    Customizing HAView Polling Intervals

    To ensure that HAView is optimized for system performance and reporting requirements, you can customize these two parameters:

  • The polling interval (in seconds) at which HAView polls the HACMP clusters to determine if cluster configuration or object status has changed. The default is 60 seconds.
  • The polling interval (in minutes) at which HAView polls the clhosts file to determine if new clusters have been added. The default for Cluster Discovery polling is 120 minutes.

    You can change the HAView polling intervals using the SMIT interface as follows:

      1. On the HAView server node, open a SMIT panel by typing: smitty haview. The Change/Show Server Configuration window opens.
      2. Enter the polling interval numbers you want (between 1 and 32000) and press OK.
    Note: If the snmpd.conf file is not properly configured to include the Tivoli NetView server as a trap destination, HAView can detect a trap that occurs as a result of a cluster event, but information about the network topology may not be timely. Refer back to the section HAView File Modification Considerations for more information on the snmpd.conf file.

    Removing a Cluster from HAView

    If a cluster does not respond to status polling, you can use the Remove Cluster option to remove the cluster from the database. To remove a cluster, the cluster state must be UNKNOWN, as represented by a blue cluster symbol. If the cluster is in any other state, the Remove Cluster option is disabled.

    Warning: The Remove Cluster option is the only supported way to delete HAView objects from submaps. Do not delete an HAView symbol (cluster or otherwise) through the Delete Object or Delete Symbol menu items. If you use these menu items, HAView continues to poll the cluster.

    When you remove a cluster, the following actions occur:

  • The cluster name is removed from the Tivoli NetView object database and HAView stops polling the cluster.
  • The symbol for the cluster is deleted.
  • The symbols for all child nodes, networks, addresses, and resource groups specific to that cluster are deleted.
  • If you are removing the cluster permanently, remember to remove the cluster addresses from the /usr/es/sbin/cluster/etc/clhosts file. If you do not remove the cluster addresses from the clhosts file, new cluster discovery polling continues to search for the cluster.

    To remove a cluster:

      1. Click on the cluster symbol you wish to remove. The cluster must be in an UNKNOWN state, represented by a blue cluster symbol.
      2. Select HAView from the Tools pull-down menu.
      3. Select Remove Cluster from the HAView cascade menu.

    Using the HAView Cluster Administration Utility

    HAView allows you to start a SMIT HACMP session to perform cluster administration functions from within the Tivoli NetView session. The administration session is run on an aixterm opened on the chosen node through a remote shell. You can open multiple sessions of SMIT HACMP while in HAView. You must have root permissions, or enter the root password, to open a SMIT panel.

    Note: You can start an administration session for any node that is in an UP state (the node symbol is green). If you attempt to start an administration session when the state of the node is DOWN or UNKNOWN, no action occurs.

    When bringing a node up, the HAView node symbol may show green before all resources are acquired. If you select the node symbol and attempt to open an administration session before all resources are acquired, you may receive an error.

    Opening and Closing a Cluster Administration Session

    To open a cluster administration session:

      1. Click on an available node symbol (one that is green).
      2. Select Tools > HAView > Cluster Administration.
    If you are a non-root user, an AIX 5L window appears prompting you to enter the root password. When the password is verified, a SMIT window opens.
      3. Proceed with your tasks in SMIT.
      4. Exit the Cluster Administration session; the aixterm session will also close.

    Cluster Administration Notes and Requirements

    Keep in mind the following considerations when using the Cluster Administration option:

  • Be sure you have run the xhost command prior to starting Tivoli NetView, so that a remote node can start an aixterm session on your machine.
  • Be sure you have set the DISPLAY variable to a label that can be resolved and contacted from remote nodes.
  • For the cluster administration session to proceed properly, the current Tivoli NetView user (the account that started Tivoli NetView) must have sufficient permission, granted through the ~/.rhosts file or through Kerberos, to perform an rsh to the remote node.
  • If an IP Address Takeover (IPAT) occurs while a cluster administration session is running, the route between the remote node and the HAView monitoring node may be lost.

    HAView Browsers

    HAView provides two browsers that allow you to view the event history of a cluster, the Cluster Event Log and the HACMP Event Browser.

    Cluster Event Log

    Using the Cluster Event Log you can view the event history for a cluster as recorded by a specific node. The Log browser is accessible through the Tivoli NetView Tools menu, and is only selectable if an active node symbol is highlighted.

    For more detailed information on a node’s event history, log onto the specific node and check the cluster message log files. For more information on these logs see the Cluster Message Log Files section in Chapter 2: Using Cluster Log Files in the Troubleshooting Guide.

    Note: To ensure that the header for the Cluster Event Log displays properly, install all of the Tivoli NetView fonts on your system.

    To review a cluster event log:

      1. Click on the node symbol for which you wish to view a Cluster Event Log.
      2. Select HAView from the Tivoli NetView Tools menu.
      3. Select the Cluster Event Log option.
      4. Set the number of events to view. You can use the up and down arrows to change this number, or you can enter a number directly into the field. The possible range of values is 1 to 1000 records. The default value is 100.
      5. Press the Issue button to generate the list of events. The message area at the bottom of the dialog box indicates when the list is done generating.
    When the list is done generating, the dialog box displays the following view-only fields:
    Event ID       A numeric identification for each event that occurred on the cluster.
    Node Name      The name of the node on which the event occurred.
    Time           The date and time the event occurred, in the format MM DD hh:mm:ss.
    Description    A description of the event.

      6. When you are finished, press the Dismiss button to close the dialog box.

    HACMP Event Browser

    HAView provides a Tivoli NetView browser that allows you to view the cumulative event history of a cluster. The browser shows the history of all nodes in the cluster, broadcast through an assigned primary node. If the primary node fails, another node assumes the primary role and continues broadcasting the event history.

    The HACMP Event Browser provides information on cluster state events. A filter is used to block all redundant traps.

    To view the HACMP Event Browser:

      1. Select HAView from the Tivoli NetView Tools menu. The menu item is always active, and when selected will start a Tivoli NetView browser showing the event history for all active clusters.
      2. Select the HACMP Event Browser option. The HACMP Event Browser appears. Note that only one instance of the Event Browser can be accessed at a time. See the Tivoli NetView User’s Guide for Beginners for more information on the Tivoli NetView browser functions.
      3. Select the Close option from the File menu of the HACMP Event Browser menu bar to close the browser.

    When you exit the Event Browser, the HAView application restarts. At this time, the HACMP cluster icon turns blue, disappears, and then reappears.

    Monitoring Clusters with Tivoli Distributed Monitoring

    You can monitor the state of an HACMP cluster and its components and perform some cluster administration tasks through your Tivoli Framework enterprise management system.

    In order to integrate HACMP with Tivoli, you must configure your HACMP cluster nodes as subscriber (client) nodes to the Tivoli server node, or Tivoli Management Region (TMR). Each cluster node can then maintain detailed node information in its local Tivoli database, which the TMR accesses for updated node information to display.

    The following sections discuss how to monitor your cluster once you have set up the cluster nodes and Tivoli so that Tivoli can monitor the cluster. If you have not done this setup yet, see the Appendix on configuring Tivoli for HACMP in the Installation Guide for instructions.

    Cluster Monitoring and Cluster Administration Options

    Using various windows of the Tivoli interface, you can monitor the following aspects of your cluster:

  • Cluster state and substate
  • Configured networks and network state
  • Participating nodes and node state
  • Configured resource group location and state
  • Individual resource location (not state).

    In addition, you can perform the following cluster administration tasks from within Tivoli:

  • Start cluster services on specified nodes
  • Stop cluster services on specified nodes
  • Bring a resource group online
  • Bring a resource group offline
  • Move a resource group to another node.

    The initial Tivoli Desktop view is shown here:

    Tivoli Desktop Initial Panel 
    

    Tivoli’s thermometer icons provide a visual indication of whether components are up, down, in transition, or in an unknown or error state. From the window for a selected Policy Region, you can go to a cluster’s Indicator Collection window, which displays thermometer icons indicating the state of all cluster components.

    The cluster status information shown by the thermometers is updated every three minutes by default, or at another interval you specify. (Further information on changing the default polling interval appears later in this chapter; see Customizing Polling Intervals.)

    Note: The following sections provide information on monitoring an HACMP cluster through the Tivoli interface. Descriptions of Tivoli components and processes are provided here as needed, but for full information on installing, configuring, and using the Tivoli software itself, consult your Tivoli product documentation.

    For complete details on setting up HACMP cluster monitoring with Tivoli, see the
     Installation Guide.

    Using Tivoli to Monitor the Cluster

    Once you have properly installed your hativoli files and defined your nodes to Tivoli, you can view information on the status of your HACMP cluster components.

    When you monitor your cluster through Tivoli, you can access cluster information in both icon and text form, in a number of different Tivoli windows. The next few sections are meant to orient you to the flow of Tivoli cluster monitoring information.

    Note: When HATivoli is unable to contact nodes in the cluster, or when all nodes are down, node status may not be displayed accurately. You should be aware that in the event that your last remaining cluster node goes down, Tivoli may still indicate that the cluster is up. This can occur when HACMP is unable to contact the Management Information Base (MIB) for updated information. In this case, the Tivoli display will show information as of the last successful poll.

    Starting Tivoli

    If Tivoli is not already running, start Tivoli by performing these steps on the TMR node:

      1. Make sure access control has been granted to remote nodes by running the xhost command with the plus sign (+) or with specified nodes. This will allow you to open a SMIT window from Tivoli.

    If you want to grant access to all computers in the network, type:

    xhost +

    or, if you want to grant access to specific nodes only:

    xhost <computers to be given access>

      2. Also, to ensure later viewing of SMIT windows, set DISPLAY=<TMR node>.
      3. Run the command . /etc/Tivoli/setup_env.sh if it was not run earlier.
      4. Enter tivoli to start the application. The Tivoli graphical user interface appears, showing the initial Tivoli Desktop window.
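
    As an illustration, a complete startup sequence on the TMR node might look like the following (the node name tmr_node is hypothetical; substitute your own):

    xhost +                         # or: xhost <specific nodes>
    export DISPLAY=tmr_node:0.0
    . /etc/Tivoli/setup_env.sh
    tivoli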

    Note that there may be a delay as Tivoli adds the indicators for the cluster.

    Tivoli Policy Regions

    A Tivoli Policy Region groups together all entities related to a specific area of Tivoli monitoring. In this case, that area is the HACMP cluster or clusters. The HACMP Policy Region encompasses the nodes, clusters, indicator icons, and tasks related to your HACMP configuration.

    Policy Region icons appear in the initial Tivoli Desktop window (shown in the preceding figure). Clicking on a Policy Region icon opens the Policy Region window, in which you see the thermometer icons of the Indicator Collections, as well as icons for the profiles and task libraries associated with the HACMP Policy Region.

    HACMP Policy Region Window 
    

    Tivoli Distributed Monitors and Indicator Collections

    For each cluster, a group of Tivoli distributed monitors is created; these monitors query the HACMP cluster node at set intervals for information about the various cluster components. The group of monitors is associated with an indicator collection that displays the state of the cluster components. If a change is detected in the state of the cluster or one of its components, the distributed monitor takes action, changing an icon in the associated Indicator Collection window. This provides the Tivoli administrator with a visual representation of any changes in the status of the monitored items.

    New monitors are added whenever new cluster components are configured. When cluster components are removed from the cluster configuration, the associated monitors are also removed.

    The icon for cluster and cluster component status is a thermometer figure with varying levels of red color depending on the severity of the component status. When you click on a cluster’s Indicator Collection icon in the Policy Region window, the Indicator Collection window appears showing status icons for that cluster’s components.

    Indicator Collection Window 
    

    Interpreting Indicator Displays for Various Cluster Components

    The Indicator icons reflect varying degrees of severity of problems, depending on the height of the red color in the thermometer and the color-coded marker alongside it. The following tables list the indicator displays for various cluster component states:

    CLUSTER STATE Indicator Display

    Indicator    Cluster State
    Normal       UP
    Fatal        DOWN
    Severe       UNKNOWN

    CLUSTER SUBSTATE Indicator Display

    Indicator    Cluster Substate
    Normal       STABLE
    Warning      UNSTABLE
    Severe       RECONFIG
    Critical     ERROR

    NODE Indicator Display

    Indicator    Node State(s)
    Normal       All nodes ONLINE
    Warning      One or more nodes OFFLINE

    Note that node state is displayed in the Distributed Monitor Indicator Collection as a composite view of all nodes rather than an individual node view.

    RESOURCE GROUP Indicator Display

    Indicator    Resource Group State
    Normal       ONLINE
    Warning      ACQUIRING
    Warning      RELEASING
    Critical     ERROR
    Fatal        OFFLINE

    Viewing Cluster Information

    The Cluster Managed Node window gives you all information about the current cluster topology and configuration.

    The Properties section displays standard system properties information for the managed node, and the IP Interfaces section at the bottom shows standard IP Interface information for the node.

    With the addition of the HACMP cluster monitoring feature, the standard Tivoli node icon is extended to capture and display additional information specific to your HACMP cluster. The items that you see in the HACMP Properties portion of the window are detailed in the following sections.

    To view the Managed Node window, right-click on a node icon in the Policy Region window.

    Cluster Managed Node Window 
    

    In the HACMP Properties portion of the Managed Node window shown above, you see the following four items of HACMP-specific top-level cluster information: Cluster name, Cluster ID, Cluster state, and Cluster substate.

    In addition, HACMP Properties buttons lead you to further cluster and component details. Selecting a button brings up a new popup window with options for retrieving specific information. These buttons and the choices within them are detailed below.

    Cluster-Wide Information Button

    Clicking the cluster-wide button gives you the Cluster Information window, shown below, with options to view details on the configuration and status information for the cluster as a whole.

    Node-Specific Information Button

    Clicking the node-specific button brings up the Node Specific Attributes window, from which you can access further information about the attributes, networks, and interfaces associated with the node.

    Resource Group Information Button

    Clicking the resource group information button brings up the Resource Group Information window; from there, you can access further details about the resource groups in your cluster and their associated resources and nodes.

    Note: All resource groups are displayed in Tivoli as type “unknown.”

    Cluster Management Button

    Clicking the Cluster Management button brings up the Cluster Management window. From here, you can open a SMIT window to perform all of your normal cluster management tasks.

    If you are a non-root user, you will be prompted to enter the root password after clicking the Open SMIT Window button. When the password is verified, a SMIT window opens and you can proceed with cluster management activity.

    Besides having root permissions or entering the root password, in order to open a SMIT window from within Tivoli, you must have run the xhost command to grant access to remote nodes. See instructions in the section Starting Tivoli.

    Customizing Polling Intervals

    The Distributed Monitors poll the cluster nodes periodically for cluster topology and status changes. The default polling interval is three minutes. If this interval is too short for your particular cluster monitoring needs, you can change this interval through an HACMP Task found in the Modify HATivoli Properties Task Library. It is not recommended to make the polling interval shorter than the default.

    As mentioned earlier, be aware that if your last remaining cluster node goes down, Tivoli may still indicate that the cluster is up. This can occur when HACMP is unable to contact the MIB for updated information. In this case, the Tivoli display will show information as of the last successful poll.

    Modifying HATivoli Properties

    The Modify HATivoli Properties Task Library window shown below contains options to perform cluster management tasks such as configuring, modifying, and deleting various items associated with the cluster, and refreshing the cluster view.

    Modify HATivoli Properties Task Library 
    

    Click on the appropriate library to perform the desired tasks.

    Using Tivoli to Perform Cluster Administration Tasks

    You can perform several cluster administration tasks from within Tivoli: You can start or stop cluster services on a specified node, bring a resource group online or offline, and move a resource group from a specified node to a target node.

    Note: In order to perform cluster administration tasks, you must have admin level privileges under Resource Roles in the HACMP policy region.

    The cluster administration tasks are found in the Cluster Services Task Library, shown here.

    Cluster Services Task Library 
    

    Overview of Steps for Configuring Cluster Administration Tasks

    All cluster administration tasks performed through Tivoli require these basic steps:

      1. Select a task from the Task Library.
      2. Specify appropriate Task Options.
      3. Specify appropriate Task Arguments.
      4. Execute the task.

    These steps are detailed in the following sections.

    Starting and Stopping Cluster Services via Tivoli

    To configure starting and stopping of cluster services on specified nodes, perform the following steps:

      1. From the HACMP Policy Region window, select the Cluster Services Task Library.
      2. From the Cluster Services Task Library, select the task (Start_Cluster_Services or Stop_Cluster_Services) you want to perform. The Execute Task panel opens. (The panel for Start_Cluster_Services is shown here.)

    Execute Task Panel for Starting Cluster Services 
    
      3. In the Execute Task panel (the panel for Start Cluster Services is shown below), set the appropriate Task Options for this task.

    Notes on Task Options

  • Select Display on Desktop if you want to see detailed output for each task immediately, or Save to File if you do not need to view the output now. (Even if you choose to view the display on the desktop, you will have the option to save it to a file later.)
  • Task Description or Help buttons in Tivoli panels do not provide any information on HACMP-specific functionality.
  • You will probably need to increase the Timeout parameter from its default of 60 seconds. Most tasks take longer than this to execute. For resource group tasks, factors such as the number of resource groups specified can cause the task to take as long as 10 minutes.
  • If the timeout period is set too short, then when you execute the task (from the Configure Task Arguments window), a message will appear to notify you when the timeout has expired. The event may have still completed successfully, but the detailed output will not appear.

  • Specify the nodes on which you want to perform the task by moving the node names to the Selected Task Endpoints window. Click the left-facing arrow to move selected node names from the Available list to the Selected list, and the right-facing arrow to move node names out of the Available list.
  • (Notice that the TMR node appears in the Available list. You are allowed to select it, but since it is not a cluster node, no action will be taken on that node. A message informs you of this.)

  • The Selected Profile Managers and Available Profile Managers in the lower section of the panel are not used for HACMP monitoring.
      4. After setting all necessary Task Options, click the Execute & Dismiss button.
    The Configure Task Arguments panel opens. (The panel for Start_Cluster_Services is shown here.)

    Configure Task Arguments Panel for Starting Cluster Services 
    
      5. Configure the task arguments for starting or stopping cluster services, as you would when starting or stopping cluster services using SMIT.
    Note: Task Description or Help buttons in Tivoli panels do not provide any information on HACMP-specific functionality.
      6. After setting all task arguments for a task, click the Set and Execute button. As the task executes, details appear in the display if you selected Display on Desktop. (See the note about the Timeout parameter under Step 3 above.)

    The cluster will now start or stop according to the parameters you have set.

    Bringing a Resource Group Online or Offline

    Note: Resource groups are displayed in Tivoli as type “unknown.”

    To configure bringing a resource group online or offline:

      1. Follow steps one through four under Starting and Stopping Cluster Services via Tivoli to select the task and configure the task options. The Task Arguments panel displays.
      2. Select one or more resource groups and then configure the necessary task arguments for bringing a resource group online or offline, just as you would when performing this task using SMIT. (For more information on the task arguments for bringing resource groups online or offline, see Chapter 15: Managing Resource Groups in a Cluster.)
    Note: If you do not have admin level privileges, you are not allowed to select a resource group from the list. Instead, you see a message informing you that you have insufficient permissions.
      3. After setting all task arguments for a task, click the Set and Execute button. As the task executes, details appear in the display if you selected Display on Desktop. (See the note about the Timeout parameter under Notes on Task Options above.)

    The specified resource group(s) will now be brought online or offline as specified.

    Moving a Resource Group

    To configure moving a resource group:

      1. Follow steps one through four under Starting and Stopping Cluster Services via Tivoli to select the task and configure the task options. The Task Arguments panel displays.
      2. Configure the necessary task arguments for moving a resource group, just as you would when performing this task through SMIT. (For more information on the task arguments for bringing resource groups online or offline, see Requirements before Migrating a Resource Group in Chapter 15: Managing Resource Groups in a Cluster.)
    Note: If you do not have admin level privileges, you are not allowed to select a resource group from the list. Instead, you see a message informing you that you have insufficient permissions.
      3. After setting all task arguments for a task, click the Set and Execute button.
    As the task executes, details appear in the display if you selected Display on Desktop. (See the note about the Timeout parameter under Notes on Task Options above.)

    The resource group will now be moved as specified.

    Uninstalling HACMP-Related Files from Tivoli

    To discontinue cluster monitoring or administration through Tivoli, you must perform the following steps to delete the HACMP-specific information from Tivoli:

      1. Run an uninstall through the SMIT interface, uninstalling the three hativoli filesets on all cluster nodes and the TMR.
      2. If it is not already running, invoke Tivoli on the TMR by entering these commands:
  • . /etc/Tivoli/setup_env.sh
  • tivoli
      3. In the Policy Region for the cluster, select the Modify HATivoli Properties task library. A window appears containing task icons.
      4. Select Edit > Select All to select all tasks, and then Edit > Delete to delete. The Operations Status window at the left shows the progress of the deletions.
      5. Return to the Policy Region window and delete the Modify HaTivoli Properties icon.
      6. Repeat steps 3 through 5 for the Cluster Services task library.
      7. Open the Profile Manager.
      8. Select Edit > Profiles > Select All to select all HACMP Indicators.
      9. Select Edit > Profiles > Delete to delete the Indicators.
      10. Unsubscribe the cluster nodes from the Profile Manager:
      a. In the Profile Manager window, select Subscribers.
      b. Highlight each HACMP node on the left, and click to move it to the right side.
      c. Click Set & Close to unsubscribe the nodes.

    Monitoring Clusters with clstat

    HACMP provides the /usr/es/sbin/cluster/clstat utility for monitoring a cluster and its components. The clinfo daemon must be running on the local node for this utility to work properly.

    The clstat utility reports on the cluster components as follows:

  • Cluster: cluster number (system-assigned); cluster state (up or down); cluster substate (stable, or unstable).
  • Nodes: How many, and the state of each node (up, down, joining, leaving, or reconfiguring).
  • For each node, clstat displays the IP label and IP address of each network interface attached to each node, and whether that interface is up or down. clstat does not display multiple IP labels on one network interface, as in networks with aliases.
    For each node, clstat displays service IP labels for serial networks and whether they are up or down.
    Note: By default, clstat does not display whether the service IP labels for serial networks are down. Use clstat -s to display service IP labels on serial networks that are currently down.

    For each node, clstat displays the states of any resource groups (per node): online or offline.

    See the clstat man page for additional information.

    The /usr/es/sbin/cluster/clstat utility runs on both ASCII and X Window Display clients in either single-cluster or multi-cluster mode. The client display automatically corresponds to the capability of the system. For example, if you run clstat on an X Window client, a graphical display appears; however, you can run an ASCII display on an X-capable machine by specifying the -a flag.
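
    For example, to force the ASCII display on an X-capable machine, you might enter:

    /usr/es/sbin/cluster/clstat -a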

    Viewing clstat with WebSMIT

    With HACMP 5.4, you can use WebSMIT to:

  • Display detailed cluster information.
  • Navigate and view the status of the running cluster
  • Configure and manage the cluster
  • View graphical displays of sites, networks, nodes and resource group dependencies.

    For more information on installing and configuring WebSMIT, see the Installation Guide. For more information on using WebSMIT, see Chapter 2: Administering a Cluster Using WebSMIT.

    Viewing clstat in ASCII Display Mode

    In ASCII display mode, you have the option of viewing status for a single cluster or multiple clusters. You can also use the -o option to save a single snapshot of the clstat output in a cron job.
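
    For example, a crontab entry such as the following (a sketch; the output file shown is an arbitrary choice) appends one snapshot of cluster status to a file every hour:

    # Take a single clstat snapshot in ASCII mode once an hour
    0 * * * * /usr/es/sbin/cluster/clstat -a -o >> /var/hacmp/log/clstat_snapshots.log 2>&1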

    Single-Cluster ASCII Display Mode

    In single-cluster ASCII display mode, the clstat utility displays information about only one cluster. To invoke the clstat utility in single-cluster (non-interactive) mode, enter:

    /usr/es/sbin/cluster/clstat 
    

    A panel similar to the following appears:

    clstat - HACMP Cluster Status Monitor 
                    ------------------------------------- 
    Cluster: myctestcluster 	(1044370190) 
    Tue Mar 11 14:19:50 EST 2004 
                    State: UP               Nodes: 2 
                    SubState: STABLE 
            Node: holmes            State: UP 
               Interface: holmes_en1svc (0)         Address: 192.168.90.40 
                                                    State:   UP 
               Resource Group: econrg1                      State:  online 
            Node: u853              State: UP 
               Interface: u853_en1svc (0)           Address: 192.168.90.50 
                                                    State:   UP 
               Resource Group: econrg1                      State:  online 
    ***************** f/forward, b/back, r/refresh, q/quit *************** 
    clstat Single-Cluster ASCII Display Mode 
    

    The cluster information displayed shows the cluster ID and name. (Note that HACMP assigns the cluster ID number; this is not user-defined.) In this example, the cluster is up and has two nodes, both of which are up. Each node has one network interface. Note that the forward and back menu options apply when more than one page of information is available to display.

    If more than one cluster exists when you run the clstat command, the utility notifies you of this fact and requests that you retry the command specifying one of the following options:

    clstat [-c cluster ID] [-n name][ -r seconds] [-i] [-a] [-o] [-s] 
    

    where:

    -c cluster ID
    Displays information about the cluster with the specified ID if that cluster is active (HACMP generates this number). This option cannot be used with the -n option.
    If the cluster is not available, the clstat utility continues looking for it until it is found or until the program is canceled. Note that this option cannot be used if the -i option (for multi-cluster mode) is used.
    -n name
The cluster name. This option cannot be used with the -c option.
    -r seconds
    Updates the cluster status display at the specified interval, in seconds. The default is 1 second; however, the display is updated only if the cluster state changes.
    -i
    Displays information about clusters interactively. Only valid when running clstat in ASCII mode.
    -a
    Causes clstat to display in ASCII mode.
    -o
    (once) Provides a single snapshot of the cluster state and exits. This flag can be used to run clstat out of a cron job. Must be run with the -a option; ignores -i or -r flags.
    -s
    Displays service labels for serial networks and their state (up or down).

    To see cluster information about a specific cluster, enter:

    clstat [-n name] 
    

    Multi-Cluster ASCII Display Mode

    The multi-cluster (interactive) mode lets you monitor all clusters that Clinfo can access from the list of active service IP labels or addresses found in the /usr/es/sbin/cluster/etc/clhosts file. In multi-cluster mode, the clstat utility displays this list of recognized clusters and their IDs, allowing you to select a specific cluster to monitor. Multi-cluster mode requires that you use the -i flag when invoking the clstat utility. To invoke the clstat utility in multi-cluster mode, enter:

/usr/es/sbin/cluster/clstat -i 
    

    where the -i indicates multi-cluster (interactive) ASCII mode. A panel similar to the following appears.

    		clstat - HACMP for AIX Cluster Status Monitor
    		------------------------------------------------- 
    Number of clusters active: 1 
                    ID      Name            State 
    		777	ibm_26c            UP 
    Select an option: 
            
    		# - the Cluster ID                      x- quit 
    		clstat Multi-Cluster Mode Menu 
    

    This panel displays the ID, name, and state of each active cluster accessible by the local node. You can either select a cluster to see detailed information, or quit the clstat utility.

When you enter a cluster ID, a panel similar to the one that follows appears.

    		clstat - HACMP for AIX Cluster Status Monitor 
                    --------------------------------------------- 
    Cluster: ibm_26c        (777)           Thu Jul  9 18:35:46 EDT 2002 
                    State: UP Nodes: 2 
                    SubState: STABLE 
            Node: poseidon          State: UP 
               Interface: poseidon-enboot (0)       Address: 140.186.70.106 
                                                    State:   UP 
            Node: venus             State: UP 
               Interface: venus-enboot (0)          Address: 140.186.70.107 
                                                    State:   UP 
    	 Resource Group: rot		 			State: online 
	 Resource Group: rg1					State: online 
    ****************** f/forward, b/back, r/refresh, q/quit *************		 
    clstat Multi-Cluster ASCII Display Mode 
    

    After viewing this panel, press q to exit the display. The multi-cluster mode returns you to the cluster list so you can select a different cluster. Note that you can use all menu options displayed. The forward and back options allow you to scroll through displays of active clusters without returning to the previous panel.

    Viewing clstat in X Window System Display Mode

    When you start the /usr/es/sbin/cluster/clstat utility on a node capable of displaying X Window System applications, the clstat utility displays its graphical interface if the client’s DISPLAY environment variable is set to the value of the X server’s node address.
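    For example, assuming the X server runs on a workstation named xstation (a hypothetical host name), you might set DISPLAY before starting the utility:

    export DISPLAY=xstation:0       # point the display at the X server (hypothetical host name)
    /usr/es/sbin/cluster/clstat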

    To invoke the clstat utility X Window System display, enter the clstat command:

    /usr/es/sbin/cluster/clstat [-n name][-c Id][ -r #][-D debug_level][-s] 
    

    where:

    -n name
    The cluster name. This option cannot be used with the -c option.
    -c ID
    Displays information about the cluster with the specified ID if that cluster is active. This option cannot be used with the -n option.
    -r #
    The interval at which the clstat utility updates the display. For the graphical interface, this value is interpreted in tenths of seconds. By default, clstat updates the display every 0.10 seconds.
    -D debug_level
    The level of debugging to be performed. The levels range from 1 to 10 in increasing amounts of information. The default (0) turns debugging off.
    -s
    Displays service labels for serial networks and their state (up or down).

    The clstat utility graphical interface uses windows to represent cluster nodes, as in the figure shown here:

    clstat X Window System Display 
    

    The middle box in the top row indicates the cluster name and ID. If the cluster is stable, this box appears green. If the cluster destabilizes for any reason, this box changes to red.

    The large boxes in other rows represent nodes. A node name appears in a box for each active node in the cluster. You can see up to sixteen nodes per cluster. Nodes that are up are shown in green, nodes that are down are shown in red, nodes that are joining or leaving the cluster are shown in yellow (topology changes), and nodes that are undefined are shown in the background color. Colors are configured in the xclstat X Window resource file in the /usr/es/sbin/cluster/samples/clstat directory.

    On a monochrome display, gray shading represents the colors as follows:

red	dark gray
    yellow	gray
    green	light gray

    Five buttons are available on the clstat display:

    PREV
    Displays the previous cluster (loops from end to start).
    NEXT
    Displays the next cluster (loops from start to end).
    cluster:ID
    The refresh bar. Pressing this bar updates the status display.
    QUIT
    Cancels the clstat utility.
    HELP
    Displays help information.

    Viewing Network Interface and Resource Group Information in an X Window Display

    To view information about network interfaces and resource groups for a node, click mouse button 1 on the appropriate node box in the clstat display. A pop-up window similar to the following appears. The title in the example shows that you are viewing node holmes in cluster_1.

    clstat Node Information Display  
    

    clstat displays only the state (online or offline) of resource groups.

    Click on the DISMISS button to close the pop-up window and to return to the clstat display window. Do not use the Close option in the pull-down menu in the upper left corner of the window to close this display; it terminates the clstat utility.

    Viewing clstat with a Web Browser

    With an appropriately configured Web server, you can view clstat in a Web browser on any machine that can connect to the cluster node (a node with both a Web server and Clinfo running on it). Viewing clstat through a Web browser allows you to see status for all of your clusters on one panel, using hyperlinks or the scroll bar to view details for each cluster.

    When you install HACMP, an executable file called clstat.cgi is installed in the same directory (/usr/es/sbin/cluster/) as the clstat and xclstat files. When run, clstat.cgi provides a CGI interface that allows cluster status output to be formatted in HTML and viewed in a Web browser.

    This feature supports the following browsers:

  • Mozilla 1.7.3 for AIX and Firefox 1.0.6
  • Internet Explorer, version 6.0.

    Browser Display

    The clstat HACMP Cluster Status Monitor displays the clstat output for all clusters from the list of active service IP labels or addresses found in the /usr/es/sbin/cluster/etc/clhosts file.

    The example below shows clstat monitoring two clusters, cluster_1 and cluster_222. The browser window displays the status information for one of the clusters, cluster_1. To display the other cluster, click the hyperlink for cluster_222 at the top of the display or scroll down to find it.

    clstat Web Browser Display 
    

    The web browser display contains the same types of cluster status information as the ASCII or X Window displays, reorganized and color-coded for easier viewing.

    The view automatically refreshes every 30 seconds to display current cluster status.

    Note: After an automatic or manual refresh, the view should be retained; that is, the browser window should continue to display the cluster that was last clicked on before the refresh. In Internet Explorer 5.5 only, however, the refresh action causes a return to the top of the display.

    In the following example, one of the resource groups is coming online and the cluster is therefore in a reconfiguration substate:

    clstat Browser Display Showing a Resource Group in Acquiring State 
    
Note: When a cluster resource group goes offline, clstat no longer displays it. No information about that resource group appears until it is being acquired again or is back online.

    Configuring Web Server Access to clstat.cgi

    To view the clstat display through a web browser, you must have a web server installed on a machine where Clinfo is running and able to gather cluster information. This could be a client node as well as a server node. The clstat.cgi program works with any web server that supports the CGI standard, which includes most currently available web servers for AIX 5L. For instance, you might use the IBM HTTP Server, which is included on the Expansion Pack CD for AIX 5L.

    Full instructions for installing and configuring a web server are not included here. Please refer to the web server documentation or consult your web administrator if you need additional help.

    The following steps complete the configuration of web server access to clstat.cgi using the IBM HTTP Server with its default configuration. The directories and URL you use for your server and configuration may vary.

  1. Move or copy clstat.cgi to the cgi-bin or script directory of the web server, for instance the default HTTP Server directory /usr/HTTPserver/cgi-bin (a command-line sketch follows the note below).
      2. Verify that the clstat.cgi file still has appropriate permissions (that is, the file is executable by the user nobody).
      3. You can now view cluster status using a web browser by typing in a URL of the following format:
    http://<hostname or IP label of the web server node>/cgi-bin/clstat.cgi 
    
    Note: Although you can change the name of the CGI directory, do not rename the clstat.cgi file.
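    A minimal command-line sketch of steps 1 and 2 above, assuming the default IBM HTTP Server directory shown earlier; adjust the paths to match your web server:

    cp /usr/es/sbin/cluster/clstat.cgi /usr/HTTPserver/cgi-bin/
    chmod 755 /usr/HTTPserver/cgi-bin/clstat.cgi    # keep the file readable and executable by the web server user (typically nobody)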

    Changing the clstat.cgi Refresh Interval

    You can change the default clstat.cgi refresh interval by specifying the CLSTAT_CGI_REFRESH environment variable in the /etc/environment file on the node serving the web page. Setting the CLSTAT_CGI_REFRESH environment variable (in seconds) overrides the default setting.

    For example, to change the refresh interval to 15 seconds from the default setting, add the following to the /etc/environment file:

# change the clstat.cgi refresh interval to 15 seconds; 30 seconds is the default 
    CLSTAT_CGI_REFRESH=15 

    Security

    Because clstat.cgi is not run as root, there should be no immediate security threat of users gaining unauthorized access to HACMP by accessing clstat.cgi from the web server.

    Some administrators may wish to restrict access to clstat.cgi from the web server and can use methods built in to the web server to prevent access, such as password authentication or IP address blocking. HACMP does not provide any specific means of access restriction to clstat.cgi.

    Monitoring Applications

    HACMP uses monitors to check if the application is running before starting the application, avoiding startup of an undesired second instance of the application. HACMP also monitors specified applications and attempts to restart them upon detecting process death or application failure.

    Application monitoring works in one of two ways:

  • Process application monitoring detects the termination of one or more processes of an application, using RSCT Resource Monitoring and Control (RMC).
  • Custom application monitoring checks the health of an application with a custom monitor method at user-specified polling intervals.
    You can configure multiple application monitors and associate them with one or more application servers. You can assign each monitor a unique name in SMIT.

    By supporting multiple monitors per application, HACMP can support more complex configurations. For example, you can configure one monitor for each instance of an Oracle parallel server in use. Or, you can configure a custom monitor to check the health of the database along with a process termination monitor to instantly detect termination of the database process.

    Process monitoring is easier to set up, as it uses the built-in monitoring capability provided by RSCT and requires no custom scripts; however, it may not be an appropriate option for all applications. User-defined monitoring can monitor more subtle aspects of an application’s performance and is more customizable, but it takes more planning, as you must create the custom scripts.
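    As an illustration only, a custom monitor method might look like the following sketch. The PID file location and the health-check command are hypothetical; the script exits with 0 when the application appears healthy and with a nonzero code to indicate a failure:

    #!/bin/ksh
    # Hypothetical custom application monitor method.
    # Exit 0 if the application is healthy; exit nonzero to report a failure.

    PIDFILE=/var/run/myapp.pid            # hypothetical PID file written by the application

    # Fail if the PID file is missing or the process is no longer running
    [ -f "$PIDFILE" ] || exit 1
    kill -0 $(cat "$PIDFILE") 2>/dev/null || exit 1

    # Optional deeper health check (hypothetical command shipped with the application)
    /usr/local/bin/myapp_ping >/dev/null 2>&1 || exit 1

    exit 0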

    In either case, when a problem is detected by the monitor, HACMP attempts to restart the application on the current node and continues the attempts until a specified retry count is exhausted. When an application cannot be restarted within this retry count, HACMP takes one of two actions, which you specify when configuring the application monitor:

  • Choosing fallover causes the resource group containing the application to fall over to the node with the next highest priority according to the resource policy.
  • Choosing notify causes HACMP to generate a server_down event to inform the cluster of the failure.
    When you configure an application monitor, you use the SMIT interface to specify which application is to be monitored and then define various parameters such as time intervals, retry counts, and the action to be taken in the event the application cannot be restarted. You control the application restart process through the Notify Method, Cleanup Method, and Restart Method SMIT fields, and by adding pre- and post-event scripts to any of the failure action or restart events you select.

    You can temporarily suspend and then resume an application monitor in order to perform cluster maintenance.

    When an application monitor is defined, each node’s Configuration Database contains the names of monitored applications and their configuration data. This data is propagated to all nodes during cluster synchronization, and is backed up when a cluster snapshot is created. The cluster verification ensures that any user-specified methods exist and are executable on all nodes.

Note: If you specify the fallover option, the resource group containing the application may migrate from its original node. In that case, the resource group may remain offline even when its highest priority node is up; unless you bring the resource group online manually, it could remain in an inactive state. See Chapter 15: Managing Resource Groups in a Cluster, for more information.

    A Note on Application Monitors

Application monitors configured in HACMP are a critical piece of the cluster configuration; they enable HACMP to keep applications highly available. When HACMP starts an application server on a node, it uses the monitor that you configure to check whether the application is already running, so that it does not start a second instance of the application. HACMP also uses the monitor that you configure to periodically check that the application is up and running.

    A poorly written application monitor may fail to detect a failed application, in which case HACMP does not recover it; or it may report a healthy application as failed, which may cause HACMP to move the application to a takeover node and result in unnecessary downtime. For example, a custom monitor that uses an SQL command to query a database may verify that the database responds, but not that the database process is running on the local node; such a monitor is not sufficient for use with HACMP.

If you plan to start cluster services with the Manage Resources > Manually option, or to stop cluster services without stopping the applications, HACMP relies on the configured application monitors to determine whether or not to start the application on the node.

    To summarize, we highly recommend properly configured and tested application monitors for all applications that you want to keep highly available with the use of HACMP. During verification, HACMP issues a warning if an application monitor is not configured.

    For complete information on configuring application monitoring, see Configuring Multiple Application Monitors in Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended).

    Displaying an Application-Centric Cluster View

    You can use either WebSMIT or the ASCII version of SMIT to view a cluster application.

    To show a cluster application in SMIT:

      1. Enter smit hacmp
      2. In SMIT, select Extended Configuration > Extended Resource Configuration > HACMP Extended Resources Configuration > Configure HACMP Applications > Show Cluster Applications and press Enter.
    SMIT displays the list of applications.
      3. Select the application to show from the list.
    SMIT displays the application with its related components.

    To show current resource group and application state, select HACMP Resource Group and Application Management > Show Current Resource Group and Application State. This panel displays the current states of applications and resource groups for each resource group.

  • For non-concurrent groups, HACMP shows only the node on which they are online and the application states on that node.
  • For concurrent groups, HACMP shows all nodes on which they are online and the application states on those nodes.
  • For groups that are offline on all nodes, only the application states are displayed; node names are not listed.

    The SMIT panel Show All Resources by Node or Resource Group has an option linking you to the Show Current Resource Group and Application State panel, described above.

    Starting with HACMP 5.4, WebSMIT presents the application-centric information in the Navigation frame Resource Groups View tab. For more information, see Chapter 2: Administering a Cluster Using WebSMIT.

    Measuring Application Availability

    You can use the Application Availability Analysis tool to measure the amount of time that any of your applications is available. The HACMP software collects and logs the following information in time-stamped format:

  • An application starts, stops, or fails
  • A node fails, is shut down, or comes online (or cluster services are started or shut down)
  • A resource group is taken offline or moved
  • Application monitoring is suspended or resumed.
    Using SMIT, you can select a time period and the tool will display uptime and downtime statistics for a given application during that period. The tool displays:

  • Percentage of uptime
  • Amount of uptime
  • Longest period of uptime
  • Percentage of downtime
  • Amount of downtime
  • Longest period of downtime
  • Percentage of time application monitoring was suspended.
    All nodes must be available when you run the tool to display the uptime and downtime statistics. Clocks on all nodes must be synchronized in order to get accurate readings.

    The Application Availability Analysis tool treats an application that is part of a concurrent resource group as available as long as the application is running on any of the nodes in the cluster. Only when the application has gone offline on all nodes in the cluster will the Application Availability Analysis tool consider the application as unavailable.

    The Application Availability Analysis tool reports application availability from the HACMP cluster infrastructure's point of view. It can analyze only those applications that have been properly configured so they will be managed by the HACMP software. (See the Appendix on Applications and HACMP in the Planning Guide for details on how to set up a highly available application with HACMP.)

    When using the Application Availability Analysis tool, keep in mind that the statistics shown in the report reflect the availability of the HACMP application server, resource group, and (if configured) the application monitor that represent your application to HACMP.

    The Application Availability Analysis tool cannot detect availability from an end user's point of view. For example, assume that you have configured a client-server application so that HACMP manages the server, and, after the server was brought online, a network outage severed the connection between the end user clients and the server. The end users would view this as an application outage because their client software could not connect to the server, but HACMP would not detect it, because the server it was managing did not go offline. As a result, the Application Availability Analysis tool would not report a period of downtime in this scenario.

    Planning and Configuring for Measuring Application Availability

    If you have application servers defined, the Application Availability Analysis Tool automatically keeps the statistics for those applications.

    In addition to using the Application Availability Analysis Tool, you can also configure Application Monitoring to monitor each application server’s status. You can define either a Process Application Monitor or a Custom Application Monitor. (See the preceding section on Monitoring Applications for details.)

    If you configure Application Monitoring solely for the purpose of checking on uptime status and do not want the Application Monitoring feature to automatically restart or move applications, you should set the Action on Application Failure parameter to just Notify and set the Restart Count to zero. (The default is three.)

    Ensure that there is adequate space for the clavan.log file on the filesystem on which it is being written. Disk storage usage is a function of node and application stability (not availability), that is, of the number (not duration) of node or application failures in a given time period. Roughly speaking, the application availability analysis tool will use 150 bytes of disk storage per outage. For example, on a node that fails once per week and has one application running on it, where that application never fails on its own, this feature uses about 150 bytes of disk storage usage per week.

    Whenever verification runs, it determines whether there is enough space for the log on all nodes in the cluster.

    Configuring and Using the Application Availability Analysis Tool

    To use SMIT to check on a given application over a certain time period:

      1. Enter smit hacmp
      2. In SMIT, select System Management (C-SPOC) > Resource Group and Application Management > Application Availability Analysis and press Enter.
      3. Select an application. Press F4 to see the list of configured applications.
      4. Fill in the fields as follows:
    Application Name
    Application you selected to monitor.
    Begin analysis on year (1970-2038)
    month (01-12)
    day (1-31)
    Enter 2006 for the year 2006, and so on.
    Begin analysis at hour (00-23)
    minutes (00-59)
    seconds (00-59)

    End analysis on year (1970-2038)
    month (01-12)
    day (1-31)

    End analysis at hour (00-23)
    minutes (00-59)
    seconds (00-59)

      5. Press Enter. The application availability report is displayed as shown in the sample below.
                   COMMAND STATUS 
    Command: OK    stdout: yes    stderr: no 
    Before command completion, additional instructions may appear below. 
    Application: myapp 
    Analysis begins: Monday, 1-May-2002, 14:30 
    Analysis ends: 	Friday, 5-May-2002, 14:30  
    Total time:    5 days, 0 hours, 0 minutes, 0 seconds 
    Uptime: 
    Amount:        4 days, 23 hours, 0 minutes, 0 seconds 
    Percentage:    99.16 % 
    Longest period:     4 days, 23 hours, 0 minutes, 0 seconds 
    Downtime: 
    Amount:        0 days, 0 hours, 45 minutes, 0 seconds 
    Percentage:    00.62 % 
    Longest period:     0 days, 0 hours, 45 minutes, 0 seconds 
    

    If the utility encounters an error in gathering or analyzing the data, it displays one or more error messages in a Command Status panel.

    Reading the clavan.log File

    The application availability analysis log records are stored in the clavan.log file. The default directory for this log file is /var/adm. You can change the directory using the System Management C-SPOC > HACMP Log Viewing and Management > Change/Show a HACMP Log Directory SMIT panel. Each node has its own instance of the file. You can look at the logs at any time to get the uptime information for your applications.

    Note: If you redirect the log, remember it is a cumulative file. Its usefulness for statistical information and analysis will be affected if you do not keep the information in one place.

    clavan.log file format

    The clavan.log file format is described here.

    Purpose 
    Records the state transitions of applications managed by HACMP. 
    Description 
    The clavan.log file keeps track of when each application that is managed 
    by HACMP is started or stopped and when the node stops on which 
    an application is running.  By collecting the records in the 
    clavan.log file from every node in the cluster, a utility program 
    can determine how long each application has been up, as well as 
    compute other statistics describing application availability time. 
    Each record in the clavan.log file consists of a single line.  
    Each line contains a fixed portion and a variable portion: 
    AAA: Ddd Mmm DD hh:mm:ss:YYYY: mnemonic:[data]:[data]: <variable 
    portion> 
    Where:         is: 
    ------		---- 
    AAA		a keyword 
    Ddd            the 3-letter abbreviation for the day of the week 
    YYYY 		the 4-digit year 
    Mmm            The 3-letter abbreviation for month 
    DD             the 2-digit day of the month (01...31) 
    hh             the 2-digit hour of the day (00...23) 
    mm             the 2-digit minute within the hour (00...59) 
    ss             the 2-digit second within the minute (00...59) 
    variable portion: one of the following, as appropriate (note that umt 
    stands for Uptime Measurement Tool, the original name of this tool): 
     

Mnemonic          Description                  As used in clavan.log file 
    umtmonstart       monitor started              umtmonstart:monitor_name:node: 
    umtmonstop        monitor stopped              umtmonstop:monitor_name:node: 
    umtmonfail        monitor failed               umtmonfail:monitor_name:node: 
    umtmonsus         monitor suspended            umtmonsus:monitor_name:node: 
    umtmonres         monitor resumed              umtmonres:monitor_name:node: 
    umtappstart       application server started   umtappstart:app_server:node: 
    umtappstop        application server stopped   umtappstop:app_server:node: 
    umtrgonln         resource group online        umtrgonln:group:node: 
    umtrgoffln        resource group offline       umtrgoffln:group:node: 
    umtlastmod        file last modified           umtlastmod:date:node: 
    umtnodefail       node failed                  umtnodefail:node: 
    umteventstart     cluster event started        umteventstart:event [arguments]: 
    umteventcomplete  cluster event completed      umteventcomplete:event [arguments]: 

Implementation Specifics
        None.
    Files
        /var/adm/clavan.log is the default file specification for this log file. The directory can be changed with the "Change/Show a HACMP Log Directory" SMIT panel (fast path: clusterlog_redir_menu).
    Related Information
        None.

    Examples

    The following example shows output for various types of information captured by the tool.

    AAA: Thu Feb 21 15:27:59 2002: umteventstart:reconfig_resource_release: 
    Cluster event reconfig_resource_release started 
    AAA: Thu Feb 21 15:28:02 2002: 
    umteventcomplete:reconfig_resource_release: Cluster event 
    reconfig_resource_release completed 
    AAA: Thu Feb 21 15:28:15 2002: umteventstart:reconfig_resource_acquire: 
    Cluster event reconfig_resource_acquire started 
    AAA: Thu Feb 21 15:30:17 2002: 
    umteventcomplete:reconfig_resource_acquire: Cluster event 
    reconfig_resource_acquire completed 
    AAA: Thu Feb 21 15:30:17 2002: umteventstart:reconfig_resource_complete: 
    Cluster event reconfig_resource_complete started 
    AAA: Thu Feb 21 15:30:19 2002: umtappstart:umtappa2:titan: Application 
    umtappa2 started on node titan 
    AAA: Thu Feb 21 15:30:19 2002: umtrgonln:rota2:titan: Resource group 
    rota2 online on node titan 
    

    Notes

    clavan.log file records are designed to be human-readable but also easily parsed. This means you can write your own analysis programs. The Application Availability Analysis tool is written in Perl and can be used as a reference for writing your own analysis program. The pathname of the tool is /usr/es/sbin/cluster/utilities/clavan.
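    For example, a brief sketch using grep (assuming the default /var/adm/clavan.log location and a hypothetical application server named app1) could extract the records of interest:

    # show when application server app1 was started and stopped on this node
    grep -E "umtappstart:app1:|umtappstop:app1:" /var/adm/clavan.log

    # count node failure records collected in this copy of the log
    grep -c "umtnodefail:" /var/adm/clavan.log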

    Using Resource Groups Information Commands

    In addition to using the HAView utility to monitor resource group status and location, as discussed earlier in this chapter, you can locate resource groups using the command line.

    You can use the /usr/es/sbin/cluster/utilities/clRGinfo command to monitor resource group status and location. The command tells you the current location and if a node temporarily has the highest priority for this instance.

    For the complete description and examples of the command usage see the clRGinfo section in Appendix A: Script Utilities in the Troubleshooting Guide.

    Note: Alternatively, you can use the clfindres command instead of clRGinfo. clfindres is a link to clRGinfo. Only the root user can run the clRGinfo utility.

    Using the clRGinfo Command

    Running the clRGinfo command gives you a report on the location and state of one or more specified resource groups. The output of the command displays both the global state of the resource group as well as the special state of the resource group on a local node. A resource group can be in any one of the following states (if sites are configured, more states are possible):

  • Online. The resource group is currently operating properly on one or more nodes in the cluster.
  • Offline. The resource group is not operating in the cluster and is currently not in an error condition. Two particular reasons for an Offline state are displayed in these cases:
  • OFFLINE Unmet Dependencies
  • OFFLINE User Requested
  • Acquiring. A resource group is currently coming up on one of the nodes in the cluster.
  • Releasing. The resource group is in the process of being released from ownership by one node. Under normal conditions after being successfully released from a node the resource group’s status changes to offline.
  • Error. The resource group has reported an error condition. User interaction is required.
  • Unknown. The resource group’s current status cannot be determined, possibly because of a loss of communication, because not all nodes in the cluster are up, or because a resource group dependency is not met (for example, a resource group that this resource group depends on failed to be acquired first).

    Resource Group States with Sites Defined

    If sites are defined in the cluster, the resource group can be in one of the following states:

On the Primary Site        On the Secondary Site 
    ONLINE                     ONLINE SECONDARY 
    OFFLINE                    OFFLINE SECONDARY 
    ERROR                      ERROR SECONDARY 
    UNMANAGED                  UNMANAGED SECONDARY 

    Depending on the Inter-Site Management Policy defined in SMIT for a resource group, a particular resource group can be online on nodes in both primary and secondary sites. A resource group can also be online on nodes within one site and offline on nodes within another.

    You can use the Resource Group Management utility, clRGmove, to move a resource group online, offline or to another node either within the boundaries of a particular site, or to the other site. For more information on moving resource groups with sites defined, see the section Migrating Resource Groups with Replicated Resources in Chapter 15: Managing Resource Groups in a Cluster.

    Note: Only one instance of a resource group state exists for OFFLINE and ERROR states, be it on a primary or a secondary site. The clRGinfo command displays a node on a particular site, the state of a resource group, and a node that temporarily has the highest priority for this instance.

    clRGinfo Command Syntax

    The clRGinfo -a command provides information on what resource group movements take place during the current cluster event. For concurrent resource groups, it indicates on which nodes a resource group goes online or offline.

    If clRGinfo cannot communicate with the Cluster Manager on the local node, it attempts to find a cluster node with the Cluster Manager running, from which resource group information may be retrieved. If clRGinfo fails to find at least one node with the Cluster Manager running, HACMP displays an error message.

    clRGinfo has the following syntax:

    clRGinfo [-h][-v][-a][-s|-c][-p][-t][-d][groupname1] [groupname2] ... 
    

    Using clRGinfo -a in pre and post-event scripts is recommended, especially in HACMP clusters with dependent resource groups. When HACMP processes dependent resource groups, multiple resource groups can be moved at once with the rg_move event.
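    For instance, a post-event script fragment could simply capture the planned resource group movements for later review; the script and the log file path shown here are illustrative only:

    #!/bin/ksh
    # Hypothetical post-event script fragment: record the resource group
    # movements computed for the cluster event currently being processed.
    LOG=/var/hacmp/log/rg_movements.log      # illustrative location
    print "$(date) event arguments: $*" >> $LOG
    /usr/es/sbin/cluster/utilities/clRGinfo -a >> $LOG 2>&1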

    Use clRGinfo -t to query the Cluster Manager on the local node only. This option displays the resource groups on the local node with the settling time and the delayed fallback timer settings, if they were set for a resource group.

    clRGinfo Command Sample Outputs

    The following examples show the output of the clRGinfo command.

    The clRGinfo -a command lets you know the pre-event location and the post-event location of a particular resource group, as in the following examples:

Note: clRGinfo -a provides meaningful output only if you run it while a cluster event is being processed.
  • In this example, the resource group A is moving from the offline state to the online state on node B. The pre-event location is left blank, the post-event location is Node B:
  • :rg_move[112] /usr/es/sbin/cluster/utilities/clRGinfo -a
    --------------------------------------------------------
    Group Name Resource Group Movement
    --------------------------------------------------------
    rgA PRIMARY=":nodeB"
  • In this example, the resource group B is moving from Node B to the offline state. The pre-event location is node B, the post-event location is left blank:
  • :rg_move[112] /usr/es/sbin/cluster/utilities/clRGinfo -a
    --------------------------------------------------------
    Group Name Resource Group Movement
    --------------------------------------------------------
    rgB PRIMARY="nodeB:"
  • In this example, the resource group C is moving from Node A to Node B. The pre-event location is node A, the post-event location is node B:
  • :rg_move[112] /usr/es/sbin/cluster/utilities/clRGinfo -a
    --------------------------------------------------------
    Group Name Resource Group Movement
    --------------------------------------------------------
    rgC PRIMARY="nodeA:nodeB"
  • In this example with sites, the primary instance of resource group C is moving from Node A to Node B, and the secondary instance stays on node C:
  • :rg_move[112] /usr/es/sbin/cluster/utilities/clRGinfo -a
    --------------------------------------------------------
    Group Name Resource Group Movement
    --------------------------------------------------------
    rgC PRIMARY="nodeA:nodeB"
SECONDARY="nodeC:nodeC"
  • With concurrent resource groups, the output indicates each node from which a resource group is moving online or offline. In the following example, both nodes release the resource group:
  • :rg_move[112] /usr/es/sbin/cluster/utilities/clRGinfo -a
    --------------------------------------------------------
    Group Name Resource Group Movement
    --------------------------------------------------------
    rgA "nodeA:"
    rgA "nodeB:"

    Because HACMP performs these calculations at event startup, this information will be available in pre-event scripts (such as a pre-event script to node_up), on all nodes in the cluster, regardless of whether the node where it is run takes any action on a particular resource group.

    With this enhancement to clRGinfo, the specific behaviors of each resource group can be further tailored with pre- and post-event scripts.

    The clRGinfo -c|-s -p command lists the output in a colon separated format, and indicates the node that is temporarily the highest priority node, if applicable.

    Here is an example:

    $ clRGinfo -s -p  
    $ /usr/es/sbin/cluster/utilities/clRGinfo -s 
    Group1:ONLINE:merry::ONLINE:OHN:FNPN:FBHPN:ignore: : : :ONLINE: 
    Group1:OFFLINE:samwise::OFFLINE:OHN:FNPN:FBHPN:ignore: : : :ONLINE: 
    Group2:ONLINE:merry::ONLINE:OAAN:BO:NFB:ignore: : : :ONLINE: 
    Group2:ONLINE:samwise::ONLINE:OAAN:BO:NFB:ignore: : : :ONLINE: 
    
    Note: The -s flag prints the output in the following order:
    RGName:node:(empty):nodeState:startup:fallover:fallback: \ 
     intersite:nodePOL:POL_SEC: \ 
     fallbackTime:settlingTime: \ 
     globalState:siteName:sitePOL 
     

    where the resource group's startup fallover and fallback preferences are abbreviated as follows:

Resource group's startup policies:
        OHN: Online On Home Node Only
        OFAN: Online On First Available Node
        OUDP: Online Using Node Distribution Policy
        OAAN: Online On All Available Nodes
    Resource group's fallover policies:
        FNPN: Fallover To Next Priority Node In The List
        FUDNP: Fallover Using Dynamic Node Priority
        BO: Bring Offline (On Error Node Only)
    Resource group's fallback policies:
        FHPN: Fallback To Higher Priority Node In The List
        NFB: Never Fallback
    Resource group's intersite policies:
        ignore: ignore
        OES: Online On Either Site
        OBS: Online Both Sites
        PPS: Prefer Primary Site

    If an attribute is not available for a resource group, the command displays a colon and a blank instead of the attribute.

The clRGinfo -p command displays the node that temporarily has the highest priority for this instance as well as the state for the primary and secondary instances of the resource group. The command shows information about those resource groups whose locations were temporarily changed because of user-requested rg_move events.

    $ /usr/es/sbin/cluster/utilities/clRGinfo -p 
    Cluster Name: TestCluster  
    Resource Group Name: Parent 
    Primary instance(s): 
    The following node temporarily has the highest priority for this 
    instance: 
    user-requested rg_move performed on Wed Dec 31 19:00:00 1969 
    Node                         State             
    ---------------------------- ---------------  
    node3@s2 		OFFLINE  
    node2@s1		ONLINE  
    node1@s0		OFFLINE          
    Resource Group Name: Child 
    Node                         State             
    ---------------------------- ---------------  
    node3@s2		ONLINE  
    node2@s1		OFFLINE  
    node1@s0		OFFLINE  
    

    The clRGinfo -p -t command displays the node that temporarily has the highest priority for this instance and a resource group's active timers:

    /usr/es/sbin/cluster/utilities/clRGinfo -p -t        
    Cluster Name: MyTestCluster  
    Resource Group Name: Parent 
    Primary instance(s): 
    

    The following node temporarily has the highest priority for this instance:

    node4, user-requested rg_move performed on Fri Jan 27 15:01:18 2006

    Node 		Primary State    Secondary State		Delayed Timers    
    ------------------------------- --------------- 	------------------- 
    node1@siteA 		OFFLINE          ONLINE SECONDARY  
    node2@siteA  	 	OFFLINE          OFFLINE                            
    node3@siteB 		OFFLINE          OFFLINE  
    node4@siteB  		ONLINE           OFFLINE  
    Resource Group Name: Child 
    Node 		State            Delayed Timers    
    ---------------------------- --------------- ------------------- 
    node2		ONLINE                               
    node1 		OFFLINE                              
    node4 		OFFLINE                              
    node3 		OFFLINE                              
    

    The clRGinfo -v command displays the resource group's startup, fallover and fallback preferences:

$ /usr/es/sbin/cluster/utilities/clRGinfo -v 
    Cluster Name: MyCluster 
    Resource Group Name: myResourceGroup 
    Startup Policy: Online On Home-Node Only 
    Fallover Policy: Fallover Using Dynamic Node Priority 
    Fallback Policy: Fallback To Higher Priority Node In The List 
    Site Policy: Ignore 
    Location		State 
    ------------------------------------------- 
    nodeA		OFFLINE	 
    nodeB		ONLINE 
    nodeC		ONLINE 
    

    Using the cldisp Command

    The /usr/es/sbin/cluster/utilities/cldisp command provides the application-centric view of the cluster configuration. This utility can be used to display resource groups and their startup, fallover, and fallback policies.
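    You can also run the utility directly from the command line, for example:

    /usr/es/sbin/cluster/utilities/cldisp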

    To show cluster applications:

      1. Enter smit hacmp
      2. In SMIT, select Extended Configuration > Extended Resource Configuration > Configure HACMP Applications > Show Cluster Applications and press Enter.
    SMIT displays the information as shown in the example:
    #############
    APPLICATIONS
    #############
    Cluster HAES_52_Test_Cluster_Cities provides the following applications: Application_Server_1 Application_Server_NFS_10
    Application: Application_Server_1 State: {online}
    Application 'Application_Server_NFS_10' belongs to a resource group which is configured to run on all its nodes simultaneously. No fallover will occur.
    This application is part of resource group 'Resource_Group_03'.
    The resource group policies:
    Startup: on all available nodes
    Fallover: bring offline on error node
    Fallback: never
Nodes configured to provide Application_Server_1: Node_Kiev_1{up} Node_Minsk_2{up} Node_Moscow_3{up}
    Nodes currently providing Application_Server_1: Node_Kiev_1{up} Node_Minsk_2{up} Node_Moscow_3{up}
    Application_Server_1 is started by /usr/user1/hacmp/local/ghn_start_4
    Application_Server_1 is stopped by /usr/user1/hacmp/local/ghn_stop_4
    Resources associated with Application_Server_1:
    Concurrent Volume Groups:
    Volume_Group_03
    No application monitors are configured for Application_Server_1.
    Application: Application_Server_NFS_10 State: {online}
    This application is part of resource group 'Resource_Group_01'.
    The resource group policies:
    Startup: on home node only
    Fallover: to next priority node in the list
    Fallback: if higher priority node becomes available
    Nodes configured to provide Application_Server_NFS_10: Node_Kiev_1{up}...

    Here is an example of the text output from the cldisp command:

    app1{online} 
    This application belongs to the resource group rg1. 
    Nodes configured to provide app1: unberto{up} lakin{up}  
    The node currently providing app1 is: unberto {up}  
    The node that will provide app1 if unberto fails is: lakin 
    app1 is started by /home/user1/bin/app1_start 
    app1 is stopped by /home/user1/bin/app1_stop 
    Resources associated with app1: 
    srv1(10.10.11.1){online} 
    Interfaces are configured to provide srv1: 
    lcl_unberto (en1-10.10.10.1) on unberto{up} 
    lcl_lakin (en2-10.10.10.2) on lakin{up} 
    Shared Volume Groups: NONE 
    Concurrent Volume Groups: NONE 
    Filesystems: NONE 
    AIX Fast Connect Services: NONE 
    Application monitor of app1: app1 
    Monitor: app1 
    Type: custom 
    Monitor method: /home/user1/bin/app1_monitor 
    Monitor interval: 30 seconds 
    Hung monitor signal: 9 
    Stabilization interval: 30 seconds 
    Retry count: 3 tries 
    Restart interval: 198 seconds 
    Failure action: notify 
    Notify method: /home/user1/bin/app1_monitor_notify 
    Cleanup method: /home/user1/bin/app1_stop 
    Restart method: /home/user1/bin/app1_start 
    

    Using HACMP Topology Information Commands

You can see the complete topology configuration using the /usr/es/sbin/cluster/utilities/cltopinfo command. See Appendix C: HACMP for AIX Commands for the complete syntax and for examples of the various flags that organize the information by node, network, or network interface. The following example uses the basic command:

    $ /usr/es/sbin/cluster/utilities/cltopinfo 
    Cluster Description of Cluster: FVT_mycluster 
    Cluster Security Level: Standard 
    There are 2 node(s) and 1 network(s) defined 
NODE holmes: 
            Network ether_ipat 
                    sherlock_en3svc_a1      192.168.97.50 
                    holmes_en1svc_a1        192.168.95.40 
                    holmes_en1svc           192.168.90.40 
    NODE sherlock: 
            Network ether_ipat 
                    sherlock_en3svc_a1      192.168.97.50 
                    holmes_en1svc_a1        192.168.95.40 
                    sherlock_en1svc         192.168.90.50 
    Resource Group econrg1 
            Behavior                 concurrent 
            Participating Nodes      holmes sherlock 
    

    Monitoring Cluster Services

    After checking cluster, node, and network interface status, check the status of the HACMP and RSCT daemons on both nodes and clients.

    Monitoring Cluster Services on a Node

    Depending on what you need to know, you may access the following for information:

  • View Management Information Base (MIB) in the hacmp.out file.
  • Use SMIT to check the status of the following HACMP subsystems on a node:
  • Cluster Manager (clstrmgrES) subsystem
  • SNMP (snmpd) daemon
  • Clinfo (clinfoES) Cluster Information subsystem.
    To view cluster services on a node, enter the SMIT fast path smit clshow. A panel similar to the following appears:

    COMMAND STATUS
    Command: OK    stdout: yes    stderr: no
    Before command completion, additional instructions may appear below.
    Subsystem        Group      PID      Status
    clstrmgrES       cluster    18524    active
    clinfoES         cluster    15024    active

    Monitoring Cluster Services on a Client

    The only HACMP process that can run on a client is the Cluster Information (clinfo) daemon. (Not all clients run this daemon.) You can use the AIX 5L lssrc command with either the -g cluster or -s clinfoES arguments to check the status of the clinfo subsystem on a client. The output looks similar to the following:

    Subsystem		Group	PID	Status 
    clinfoES		cluster	9843	active 
    

    You can also use the ps command and grep for “clinfo.” For example:

    ps -aux | grep clinfoES

    HACMP Log Files

    HACMP writes the messages it generates to the system console and to several log files. Because each log file contains a different subset of the types of messages generated by HACMP, you can get different views of cluster status by viewing different log files. HACMP writes messages into the log files described below. For more information about these files see Chapter 2: Using Cluster Log Files in the Troubleshooting Guide.

    The default locations of log files are used in this chapter. If you redirected any logs, check the appropriate location.

    Note: If you redirect logs, they should be redirected to local filesystems and not to shared or NFS filesystems. Having logs on shared or NFS filesystems may cause problems if the filesystem needs to unmount during a fallover event. Redirecting logs to shared or NFS filesystems may also prevent cluster services from starting during node reintegration.

    Size of /var Filesystem May Need to Be Increased

    For each node in your cluster, verification requires from 500K to 4 MB of free space in the /var filesystem. HACMP stores, at most, four different copies of a node's verification data on a disk at a time:

  • /var/hacmp/clverify/current/<nodename>/* contains logs from a current execution of cluster verification
  • /var/hacmp/clverify/pass/<nodename>/* contains logs from the last time verification passed
  • /var/hacmp/clverify/pass.prev/<nodename>/* contains logs from the second-to-last time verification passed
  • /var/hacmp/clverify/fail/<nodename>/* contains information from the last time verification failed.

    The /var/hacmp/clverify/clverify.log[0-9] log files typically consume 1-2 MB of disk space.

    In addition, the standard security mechanism that runs the clcomd utility has the following requirements for the free space in the /var filesystem:

    1) 20 MB, where:

  • /var/hacmp/clcomd/clcomd.log requires 2MB
  • /var/hacmp/clcomd/clcomddiag.log requires 18MB.
    2) 1 MB x n, per node (where n is the number of nodes in the cluster), in the file /var/hacmp/odmcache.

To summarize, for a four-node cluster it is recommended to have at least 42 MB of free space in the /var filesystem (a quick check is shown after this list), where:

  • 2MB should be free for writing the clverify.log[0-9] files
  • 16MB (4MB per node) for writing the verification data from the nodes
  • 20MB for writing the clcomd log information
  • 4MB (1MB per node) for writing the ODMcache data.
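    A quick way to confirm the free space, for example:

    df -m /var      # show free space in the /var filesystem in MB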
    /tmp/clinfo.debug File

Clinfo is typically installed on both client and server systems. Client systems (cluster.es.client) do not contain any HACMP ODMs (for example, HACMPlogs) or utilities (for example, clcycle); therefore, Clinfo logging does not take advantage of log redirection or cycling.

The /tmp/clinfo.debug file records debug output generated by the Clinfo daemon as it runs.

The default debug level is 0, or OFF. You can change the log file name with the clinfo -l command. See the clinfo man page for more information.

    /tmp/clsmuxtrmgr.debug Log File

The clsmuxtrmgr.debug file is the SMUX peer function log file. By default, no debugging is performed. You can toggle the SMUX peer tracing on and off using the AIX 5L System Resource Controller (SRC).

    /tmp/hacmp.out File

    The /tmp/hacmp.out file records the output generated by the event scripts as they execute. This information supplements and expands upon the information in the /usr/es/adm/cluster.log file. To receive verbose output, the debug level runtime parameter should be set to high (the default). For details on setting runtime parameters see Chapter 1: Troubleshooting HACMP Clusters in the Troubleshooting Guide.

    Reported resource group acquisition failures (failures indicated by a non-zero exit code returned by a command) are tracked in hacmp.out, and a summary is written near the end of the hacmp.out listing for a top-level event.

    Checking this log is important, since the reconfig_too_long console message is not evident in every case where a problem exists. Event summaries make it easier for you to check the hacmp.out file for errors. For more information about this log file and about how to get a quick view of several days’ event summaries see the Understanding the hacmp.out Log File section and Displaying Compiled hacmp.out Event Summaries section in Chapter 2: Using Cluster Log Files in the Troubleshooting Guide.

In releases prior to HACMP 5.2, non-recoverable event script failures result in the event_error event being run on the cluster node where the failure occurred. The remaining cluster nodes do not indicate the failure. With HACMP 5.2 and up, all cluster nodes run the event_error event if any node has a fatal error. All nodes log the error and call out the failing node name in the hacmp.out log file.

    /tmp/clstrmgr.debug Log File

    The clstrmgr.debug log file contains time-stamped, formatted messages generated by HACMP clstrmgrES activity. This file is typically used only by IBM support personnel. The /usr/es/sbin/cluster/utilities/clgetesdbginfo command collects all the cluster log files. IBM support may ask you to run this command.

    /tmp/cspoc.log File

The cspoc.log file contains time-stamped, formatted messages generated by HACMP C-SPOC commands. The /tmp/cspoc.log file resides on the node from which you issue the C-SPOC command.

    /tmp/emuhacmp.out File

    The /tmp/emuhacmp.out file records the output generated by the event emulator scripts as they execute. The /tmp/emuhacmp.out file resides on the node from which the event emulator is invoked. You can use the environment variable EMUL_OUTPUT to specify another name and location for this file, but the format and information remains the same.
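    For example, to redirect the emulator output, you might export the variable before invoking the event emulator (the path shown is only an illustration):

    export EMUL_OUTPUT=/var/hacmp/log/emuhacmp.out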

    /usr/es/adm/cluster.log File

    The /usr/es/adm/cluster.log file is the main HACMP log file. HACMP error messages and messages about HACMP-related events are appended to this log with the time and date at which they occurred.

    /usr/es/sbin/cluster/history/cluster.mmddyyyy File

    The /usr/es/sbin/cluster/history/cluster.mmddyyyy file contains time-stamped, formatted messages generated by HACMP scripts. The system creates a cluster history file whenever cluster events occur, identifying each file by the file name extension mmddyyyy, where mm indicates the month, dd indicates the day, and yyyy indicates the year.

    While it is more likely that you will use these files during troubleshooting, you should occasionally look at them to get a more detailed idea of the activity within a cluster.

    /var/adm/clavan.log File

    The clavan.log file keeps track of when each application that is managed by HACMP is started or stopped and when the node stops on which an application is running. By collecting the records in the clavan.log file from every node in the cluster, a utility program can determine how long each application has been up, as well as compute other statistics describing application availability time.

    /var/hacmp/clcomd/clcomd.log File

    The clcomd.log file contains time-stamped, formatted messages generated by the HACMP Cluster Communication Daemon. This log file contains an entry for every connect request made to another node and the return status of the request.

    For information on space requirements for this file and for the file described below, see the section Size of /var Filesystem May Need to Be Increased.

    /var/hacmp/clcomd/clcomddiag.log File

    The clcomddiag.log file contains time-stamped, formatted messages generated by the HACMP Communication daemon when tracing is turned on. This log file is typically used by IBM support personnel for troubleshooting.

    /var/hacmp/clverify/clverify.log File

    The /var/hacmp/clverify/clverify.log file contains verbose messages, output during verification. Cluster verification consists of a series of checks performed against various HACMP configurations. Each check attempts to detect either a cluster consistency issue or an error. The verification messages follow a common, standardized format, where feasible, indicating such information as the node(s), devices, and command in which the error occurred. See Chapter 7: Verifying and Synchronizing an HACMP Cluster for complete information.

    For information on space requirements for this file, see the section Size of /var Filesystem May Need to Be Increased earlier in this chapter.

    /var/hacmp/log/clutils.log File

    The /var/hacmp/log/clutils.log file contains the results of the automatic verification that runs on one user-selectable HACMP cluster node once every 24 hours. When cluster verification completes on the selected cluster node, this node notifies the other cluster nodes with the following information:

  • The name of the node where verification had been run
  • The date and time of the last verification
  • Results of the verification.
    The /var/hacmp/log/clutils.log file also contains messages about any errors found and actions taken by HACMP for the following utilities:

  • The HACMP File Collections utility
  • The Two-Node Cluster Configuration Assistant
  • The Cluster Test Tool
  • The OLPW conversion tool.
    /var/ha/log/grpsvcs.<filename> File

The /var/ha/log/grpsvcs.<filename> log file contains time-stamped messages in ASCII format. These track the execution of internal activities of the grpsvcs daemon. IBM support personnel use this information for troubleshooting. The file gets trimmed regularly. Therefore please save it promptly if there is a chance you may need it.

    /var/ha/log/topsvcs.<filename> File

    The /var/ha/log/topsvcs.<filename> log file contains time-stamped messages in ASCII format. These track the execution of internal activities of the topsvcs daemon. IBM support personnel use this information for troubleshooting. The file gets trimmed regularly. Therefore please save it promptly if there is a chance you may need it.

    /var/ha/log/grpglsm File

    The /var/ha/log/grpglsm file tracks the execution of internal activities of the grpglsm daemon. IBM support personnel use this information for troubleshooting. The file gets trimmed regularly. Therefore please save it promptly if there is a chance you may need it.


    PreviousNextIndex