
Chapter 5: Monitoring and Troubleshooting a Cluster


This chapter presents general information for monitoring and troubleshooting an HACMP for Linux configuration.

This chapter contains the following sections:

  • Problem Determination Tools
  • Viewing Cluster Information (clstat) in WebSMIT
  • Useful Commands
  • Logging Messages
  • Solving Common Problems with Networks and Applications.

    Problem Determination Tools

    The WebSMIT Problem Determination Tools menu provides a set of tools for troubleshooting and recovering from problems that may arise in a cluster environment.

    The Problem Determination Tools panel in WebSMIT includes:

  • View Current State. WebSMIT displays cluster information using a slightly different layout and organization. Cluster components are displayed along with their status. Expanding an item reveals additional information about it, including networks, interfaces and active resource groups.
  • HACMP Log Viewing and Management. Contains utilities that display or manage logs maintained by HACMP. These include the log file named hacmp.out, which keeps a record of all local cluster events as performed by the HACMP event scripts. These event scripts automate many common system administration tasks and, in the event of a failure, manage HACMP and system resources to provide recovery.
  • Recover From HACMP Script Failure. Contains a command that HACMP will run to recover from a script failure. This is useful if the Cluster Manager is in reconfiguration due to a failed event script. Use this option after having manually fixed the error condition.
  • Restore HACMP Configuration Database from Active Configuration.

    Viewing Cluster Information (clstat) in WebSMIT

    With HACMP 5.4, you can use WebSMIT to:

  • Display detailed cluster information
  • Navigate and view the status of the running cluster
  • Configure and manage the cluster
  • View graphical displays of sites, networks, nodes and resource group dependencies.

    Useful Commands

    The following additional utilities are available:

  • To view the resource group location and status, use the clRGinfo command.
  • To view the service IP label information, run the ifconfig command on the node that currently owns the resource group.
  • For a list of commands supported in HACMP for Linux, see Command Reference in Appendix A: Command Reference and the clinfo Utility.

    Logging Messages

    HACMP for Linux uses the standard logging facilities for HACMP. For information about logging in HACMP, see the HACMP for AIX 5L Troubleshooting Guide.

    To troubleshoot the HACMP operations in your cluster, use the event summaries in the hacmp.out file and syslog.

    The system logs messages into the following files:

  • /tmp/clstrmgr.debug
  • /tmp/cspoc.log
  • /tmp/clappmond
  • /tmp/hacmp.out
  • /usr/es/adm/cluster.log
  • /var/hacmp/clcomd/clcomd.log
  • /var/hacmp/clcomd/clcomddiag.log
  • /var/hacmp/log/clutils.log
  • /usr/es/sbin/cluster/wsm/logs/wsm_smit.*
  • <APACHE_HOME>/websmit/logs/wsm_smit.*
  • /usr/es/sbin/cluster/snapshots/<snapshot_name>*
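
As a quick first check, a short shell sketch can report which of these log files exist on the local node. The function name report_logs and the variable HACMP_LOGS are illustrative and are not part of HACMP; entries that contain wildcards or the <APACHE_HOME> placeholder are left out of the list.

```shell
#!/bin/sh
# Fixed-path log files taken from the list above (wildcard entries omitted).
HACMP_LOGS="/tmp/clstrmgr.debug /tmp/cspoc.log /tmp/clappmond /tmp/hacmp.out
/usr/es/adm/cluster.log /var/hacmp/clcomd/clcomd.log
/var/hacmp/clcomd/clcomddiag.log /var/hacmp/log/clutils.log"

# report_logs FILE...: print one status line per file.
#   present = exists and is non-empty, empty = exists with size 0,
#   missing = does not exist on this node.
report_logs() {
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "present  $f"
        elif [ -e "$f" ]; then
            echo "empty    $f"
        else
            echo "missing  $f"
        fi
    done
}

# Usage on a cluster node (word splitting of $HACMP_LOGS is intentional):
#   report_logs $HACMP_LOGS
```

A missing file is not necessarily an error: some logs may only be created once the corresponding component has run on that node.
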

    Collecting Cluster Log Files for Problem Reporting

    To collect the system files and log files in an archive file and view them:

      1. In WebSMIT, go to the Collect Cluster log files for Problem Reporting menu.
      2. Type or select values in entry fields.
      3. Use an appropriate Linux tool to extract or view the archive file. The archive file contains the log and system files.

    Solving Common Problems with Networks and Applications

    This section contains the following topics:

  • Identifying Causes of Unexpected Network and Network Interface Failures
  • Troubleshooting an Unsuccessful Application Fallover to Another Node
  • Troubleshooting the Serial Connection.

    For a list of other specific problems and tips on how to check for them, see the HACMP for AIX 5L Troubleshooting Guide.

    Identifying Causes of Unexpected Network and Network Interface Failures

    This section lists some of the possible causes of the network errors you may receive.

    For example, HACMP logs network_down or interface_failed events, but the NIC appears to be functional.

    Verify that the information defined in the HACMP configuration matches the information displayed by ifconfig, for example:

    To verify the HACMP configuration for node ppstest2, run the cllsif command:

    ppstest2:~ # /usr/es/sbin/cluster/utilities/cllsif -c | grep ppstest2
    ppstest2_enstby1:boot:net_ether_01:ether:public:ppstest2:192.9.201.131::eth1::255.255.255.128::
    ppstest2_enboot:boot:net_ether_01:ether:public:ppstest2:192.9.201.2::eth0::255.255.255.128::
    ppstest2:boot:net_token_01:token:public:ppstest2:9.57.28.4::tr0::255.255.255.128::
    

    Now run the ifconfig command on node ppstest2 and compare the results:

    ppstest2:~ # ifconfig  
    eth0      Link encap:Ethernet  HWaddr 00:06:29:DC:82:7A  
              inet addr:192.9.201.2  Bcast:192.9.201.127  Mask:255.255.255.128
              inet6 addr: fe80::206:29ff:fedc:827a/64 Scope:Link  
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1  
              RX packets:7120 errors:0 dropped:0 overruns:0 frame:0  
              TX packets:12605 errors:0 dropped:0 overruns:0 carrier:0  
              collisions:3821 txqueuelen:1000  
              RX bytes:499630 (487.9 Kb)  TX bytes:15802477 (15.0 Mb)  
              Interrupt:37 Base address:0xec00  
    eth1      Link encap:Ethernet  HWaddr 00:06:29:DC:E0:2A  
              inet addr:192.9.201.131  Bcast:192.9.201.255  Mask:255.255.255.128
              inet6 addr: fe80::206:29ff:fedc:e02a/64 Scope:Link  
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1  
              RX packets:10695 errors:0 dropped:0 overruns:0 frame:0  
              TX packets:14381 errors:0 dropped:0 overruns:0 carrier:0  
              collisions:4205 txqueuelen:1000  
              RX bytes:5690673 (5.4 Mb)  TX bytes:15921948 (15.1 Mb)  
              Interrupt:38 Base address:0xec00  
    lo        Link encap:Local Loopback  
              inet addr:127.0.0.1  Mask:255.0.0.0  
              inet6 addr: ::1/128 Scope:Host  
              UP LOOPBACK RUNNING  MTU:16436  Metric:1  
              RX packets:102 errors:0 dropped:0 overruns:0 frame:0  
              TX packets:102 errors:0 dropped:0 overruns:0 carrier:0  
              collisions:0 txqueuelen:0  
              RX bytes:6624 (6.4 Kb)  TX bytes:6624 (6.4 Kb)  
    tr0       Link encap:16/4 Mbps Token Ring (New)  HWaddr 00:60:94:8A:D2:F7
              inet addr:9.57.28.4  Bcast:9.57.28.127  Mask:255.255.255.128  
              inet6 addr: fe80::260:94ff:fe8a:d2f7/64 Scope:Link  
              UP BROADCAST RUNNING MULTICAST  MTU:4056  Metric:1  
              RX packets:2332 errors:0 dropped:0 overruns:0 frame:0  
              TX packets:2577 errors:0 dropped:0 overruns:0 carrier:0  
              collisions:0 txqueuelen:100  
              RX bytes:431872 (421.7 Kb)  TX bytes:829909 (810.4 Kb)  
              Interrupt:52 Base address:0xec00  
    

    Notice that the interface name and IP address for each address defined to HACMP match the ifconfig output. The loopback interface (lo) is never defined to HACMP.
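
The comparison above can also be scripted. The following is a minimal sketch, not an HACMP utility: the function names are illustrative, and the field positions (7 = IP address, 9 = interface name, 11 = netmask) are inferred from the sample cllsif -c output shown earlier rather than from a documented format. It also assumes the older ifconfig layout shown above ("inet addr:... Mask:...").

```shell
#!/bin/sh
# hacmp_ifaces: read "cllsif -c" output on stdin and print
# "interface address netmask" for each entry that names an interface.
# Field positions are an assumption based on the sample output.
hacmp_ifaces() {
    awk -F: 'NF >= 11 && $9 != "" { print $9, $7, $11 }' | sort
}

# system_ifaces: read legacy-format ifconfig output on stdin and print
# "interface address netmask" for every IPv4 interface except lo.
system_ifaces() {
    awk '
        /^[^ \t]/ { ifname = $1 }          # unindented line starts a new interface
        /inet addr:/ {
            addr = ""; mask = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^addr:/) addr = substr($i, 6)
                if ($i ~ /^Mask:/) mask = substr($i, 6)
            }
            if (ifname != "lo") print ifname, addr, mask
        }' | sort
}

# check_ifaces CLLSIF_FILE IFCONFIG_FILE: flag every HACMP entry whose
# interface name, address and netmask do not all match the system view.
check_ifaces() {
    sys=$(system_ifaces < "$2")
    hacmp_ifaces < "$1" | while read -r entry; do
        if printf '%s\n' "$sys" | grep -qxF "$entry"; then
            echo "OK       $entry"
        else
            echo "MISMATCH $entry"
        fi
    done
}

# Usage on a cluster node (file names are arbitrary):
#   /usr/es/sbin/cluster/utilities/cllsif -c | grep ppstest2 > /tmp/hacmp_if.txt
#   ifconfig > /tmp/sys_if.txt
#   check_ifaces /tmp/hacmp_if.txt /tmp/sys_if.txt
```

Service and persistent addresses that HACMP has moved onto an interface appear as aliases in ifconfig and are not covered by this sketch.
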

    To fix the problem, do either of the following:

  • If the interface name or IP address is incorrect, use the Change/Show Communication Interfaces/Devices WebSMIT menu to change them.
  • If the netmask is incorrect, use the Change/Show Networks menu to change the netmask for all interfaces on this network.

    Note: You do not have to specify interface names for service IP addresses or for persistent addresses. HACMP keeps these addresses available by moving them to a different interface after any failure.

    Troubleshooting an Unsuccessful Application Fallover to Another Node

    If your application starts successfully on one node, but HACMP issues an EVENT FAILED message when it tries to fall the application over to another node, make sure that the application start, stop and notify scripts exist and are executable on every node in the cluster. Use the cllsserv command to list the start and stop scripts defined for each application server, then check each script with ls -l.

    For example:

    ppstest2:~ # /usr/es/sbin/cluster/utilities/cllsserv  
    app_test2_primary  /usr/local/app_start  /usr/local/app_stop  
    ppstest2:~ # ls -l /usr/local/app_start  
    -rwxr--r--  1 root root 169 May 10 22:54 /usr/local/app_start  
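
When several application servers are defined, the same check can be looped. This is a sketch only: check_app_scripts is an illustrative name, and it assumes that each cllsserv output line has the form "server start_script stop_script" as in the example above. It checks the local node, so run it on every node in the cluster; notify scripts do not appear in this output and must be checked separately.

```shell
#!/bin/sh
# check_app_scripts: read "server start stop" lines on stdin and report
# whether each start and stop script exists and is executable locally.
check_app_scripts() {
    while read -r server start stop _rest; do
        [ -n "$server" ] || continue        # skip blank lines
        for script in "$start" "$stop"; do
            if [ -x "$script" ]; then
                echo "ok: $server $script"
            else
                echo "NOT EXECUTABLE: $server $script"
            fi
        done
    done
}

# Usage on each cluster node:
#   /usr/es/sbin/cluster/utilities/cllsserv | check_app_scripts
```
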
    

    Troubleshooting the Serial Connection

    This section describes how to test an installed serial connection.

    To ensure that the RS232 cable is properly configured and transmits data, run the following test after creating the tty device on both nodes.

    Run this test while the tty device is not in use. If the cluster is active, remove the serial network dynamically from the configuration before running the test. Also, verify that the tty device is not in use by any other process.

    To determine if the device is in use, run the fuser command:

    fuser /dev/tty0  
    

    The output lists the PID of any process that uses the device.
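
In a script, the exit status of fuser is easier to act on than its output. The wrapper below is a minimal sketch with an illustrative function name (tty_in_use); it relies on fuser returning 0 only when at least one process has the file open.

```shell
#!/bin/sh
# tty_in_use DEVICE
#   returns 0 if some process has DEVICE open,
#   1 if fuser finds no process using it,
#   2 if fuser is not installed (so "not in use" is never assumed silently).
tty_in_use() {
    if ! command -v fuser >/dev/null 2>&1; then
        echo "fuser not found" >&2
        return 2
    fi
    fuser "$1" >/dev/null 2>&1
}

# Usage before running the serial test:
#   if tty_in_use /dev/tty0; then
#       echo "/dev/tty0 is busy; remove the serial network first" >&2
#   fi
```
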

    If the device is in use by RSCT, the output shows that a process hats_rs232_nim is accessing the device. After the network has been dynamically removed from the cluster configuration, no such process should exist.

    In rare cases, the hats_rs232_nim process may not terminate during a dynamic removal of the network or when cluster services are stopped. In these cases, call IBM support. However, it is safe to terminate any leftover hats_rs232_nim process if the cluster is inactive on the local node.

    Use the fuser command to terminate a process that accesses the tty device:

    fuser -k /dev/tty0  
    

    Running the stty Test

    The stty test determines whether the serial connection allows the transmission of communications.

    Running the stty Test on TTYs with RTS Flow Control Set

    To perform the stty test:

      1. On the receiving side, run:
      (stty raw -echo; cat > outputfilename) < /dev/tty2  
    
      2. On the sending side, run:
      (stty raw -echo < /dev/tty1; cat filetobesent ; sleep 5) > /dev/tty1  
    

    Running the stty Test on TTYs with XON or No Flow Control Set

    To perform the stty test:

      1. On the receiving side (node 2), run:
      (stty raw -echo ixon ixoff; cat > outputfilename) < /dev/tty2  
    
      2. On the sending side, run:
      (stty raw -echo ixon ixoff < /dev/tty1; cat filetobesent; sleep 5) > /dev/tty1  
    

    If the nodes are able to communicate over the serial cable, both nodes display their tty settings and return to the prompt.

    If the data is transmitted successfully from one node to another, the contents of filetobesent on the sending node appear in outputfilename on the receiving node. You can use any text file for this test, for example the /etc/hosts file.
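
To confirm that the data arrived intact, you can compare checksums of the sent and received files. The helper below is a sketch; file_sum is an illustrative name, and filetobesent and outputfilename are the placeholder names from the commands above.

```shell
#!/bin/sh
# file_sum FILE: print the CRC and byte count of FILE.
# Reading from stdin keeps the file name out of the cksum output,
# so results from two nodes can be compared directly.
file_sum() {
    cksum < "$1" | awk '{ print $1, $2 }'
}

# Usage:
#   on the sending node:    file_sum filetobesent
#   on the receiving node:  file_sum outputfilename
# The two lines should be identical if the transfer was clean.
```
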

    After you install and test the serial connection, you define the connection as a point-to-point network to HACMP. For information about how to configure a serial network, see Configuring Serial Networks for Heartbeating in Chapter 3: Common Task Summary.

