Chapter 5: Monitoring and Troubleshooting a Cluster
This chapter presents general information for monitoring and troubleshooting an HACMP for Linux configuration.
This chapter contains the following sections:
Problem Determination Tools
The WebSMIT Problem Determination Tools menu provides a set of tools for troubleshooting and recovering from problems that may arise in a cluster environment.
The Problem Determination Tools panel in WebSMIT includes:
View Current State. WebSMIT displays cluster information using a slightly different layout and organization. Cluster components are displayed along with their status. Expanding an item reveals additional information about it, including networks, interfaces and active resource groups.
HACMP Log Viewing and Management. Contains utilities that display or manage logs maintained by HACMP. These include the log file named hacmp.out, which keeps a record of all of the local cluster events as performed by the HACMP event scripts. These HACMP event scripts automate many common system administration tasks and, in the event of a failure, manage HACMP and system resources to provide recovery.
Recover From HACMP Script Failure. Contains a command that HACMP runs to recover from a script failure. This is useful if the Cluster Manager is in reconfiguration due to a failed event script. Use this option after you have manually fixed the error condition.
Restore HACMP Configuration Database from Active Configuration.

Viewing Cluster Information (clstat) in WebSMIT
With HACMP 5.4, you can use WebSMIT to:
Display detailed cluster information
Navigate and view the status of the running cluster
Configure and manage the cluster
View graphical displays of sites, networks, nodes and resource group dependencies.

Useful Commands
You have these additional utilities:
To view the resource group location and status, use the clRGinfo command.
To view the service IP label information, run the ifconfig command on the node that currently owns the resource group.
For a list of commands supported in HACMP for Linux, see Command Reference in Appendix A: Command Reference and the clinfo Utility.
Logging Messages
HACMP for Linux uses the standard logging facilities for HACMP. For information about logging in HACMP, see the HACMP for AIX 5L Troubleshooting Guide.
To troubleshoot the HACMP operations in your cluster, use the event summaries in the hacmp.out file and syslog.
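As a starting point, failed events can be pulled out of hacmp.out with grep. The following is a minimal sketch, not an HACMP utility: the function name failed_events is hypothetical, the default path /tmp/hacmp.out is the location listed in this chapter, and the search string is the EVENT FAILED message this chapter mentions. The exact layout of the matching lines is not assumed.

```shell
#!/bin/sh
# failed_events [FILE]: print "line-number:text" for every entry in FILE
# that contains the string "EVENT FAILED".
# FILE defaults to /tmp/hacmp.out. The function name is illustrative only.
failed_events() {
    grep -n 'EVENT FAILED' "${1:-/tmp/hacmp.out}"
}
```

Run failed_events with no argument on the node where the event failed, then read the surrounding lines of hacmp.out at the reported line numbers.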
The system logs messages to the following files:
/tmp/clstrmgr.debug
/tmp/cspoc.log
/tmp/clappmond
/tmp/hacmp.out
/usr/es/adm/cluster.log
/var/hacmp/clcomd/clcomd.log
/var/hacmp/clcomd/clcomddiag.log
/var/hacmp/log/clutils.log
/usr/es/sbin/cluster/wsm/logs/wsm_smit.*
<APACHE_HOME>/websmit/logs/wsm_smit.*
/usr/es/sbin/cluster/snapshots/<snapshot_name>*

Collecting Cluster Log Files for Problem Reporting
To collect the system files and log files in an archive file and view them:
1. In WebSMIT, go to the Collect Cluster log files for Problem Reporting menu.
2. Type or select values in entry fields.
3. Use an appropriate Linux tool to extract or view the archive file. The archive file contains the log and system files.
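Step 3 can be sketched with tar. This assumes the archive is a gzip-compressed tar file, which this chapter does not state; use the tool that matches the actual archive format. The function names are hypothetical.

```shell
#!/bin/sh
# list_archive FILE: list the contents of a collected log archive.
# Assumption: FILE is a gzip-compressed tar archive.
list_archive() {
    tar -tzf "$1"
}

# extract_archive FILE DIR: unpack the archive into DIR (created if missing).
extract_archive() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2"
}
```

Listing the archive first shows which log and system files were collected before anything is unpacked.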
Solving Common Problems with Networks and Applications
This section contains the following topics:
For a list of other specific problems and tips on how to check for them, see the HACMP for AIX 5L Troubleshooting Guide.
Identifying Causes of Unexpected Network and Network Interface Failures
This section lists some of the possible causes of the network errors you may receive.
For example, HACMP logs network_down or interface_failed events, but the NIC appears to be functional.
Verify that the information defined in the HACMP configuration matches the information displayed by ifconfig, for example:
To verify the HACMP configuration for node ppstest2, run the cllsif command:
ppstest2:~ # /usr/es/sbin/cluster/utilities/cllsif -c | grep ppstest2
ppstest2_enstby1:boot:net_ether_01:ether:public:ppstest2:192.9.201.131::eth1::255.255.255.128::
ppstest2_enboot:boot:net_ether_01:ether:public:ppstest2:192.9.201.2::eth0::255.255.255.128::
ppstest2:boot:net_token_01:token:public:ppstest2:9.57.28.4::tr0::255.255.255.128::

Now run the ifconfig command on node ppstest2 and compare the results:
ppstest2:~ # ifconfig
eth0      Link encap:Ethernet  HWaddr 00:06:29:DC:82:7A
          inet addr:192.9.201.2  Bcast:192.9.201.127  Mask:255.255.255.128
          inet6 addr: fe80::206:29ff:fedc:827a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:7120 errors:0 dropped:0 overruns:0 frame:0
          TX packets:12605 errors:0 dropped:0 overruns:0 carrier:0
          collisions:3821 txqueuelen:1000
          RX bytes:499630 (487.9 Kb)  TX bytes:15802477 (15.0 Mb)
          Interrupt:37 Base address:0xec00

eth1      Link encap:Ethernet  HWaddr 00:06:29:DC:E0:2A
          inet addr:192.9.201.131  Bcast:192.9.201.255  Mask:255.255.255.128
          inet6 addr: fe80::206:29ff:fedc:e02a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:10695 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14381 errors:0 dropped:0 overruns:0 carrier:0
          collisions:4205 txqueuelen:1000
          RX bytes:5690673 (5.4 Mb)  TX bytes:15921948 (15.1 Mb)
          Interrupt:38 Base address:0xec00

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:102 errors:0 dropped:0 overruns:0 frame:0
          TX packets:102 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6624 (6.4 Kb)  TX bytes:6624 (6.4 Kb)

tr0       Link encap:16/4 Mbps Token Ring (New)  HWaddr 00:60:94:8A:D2:F7
          inet addr:9.57.28.4  Bcast:9.57.28.127  Mask:255.255.255.128
          inet6 addr: fe80::260:94ff:fe8a:d2f7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:4056  Metric:1
          RX packets:2332 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2577 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:431872 (421.7 Kb)  TX bytes:829909 (810.4 Kb)
          Interrupt:52 Base address:0xec00

For each address defined to HACMP, check that the interface name, IP address and netmask shown by ifconfig match the values in the HACMP configuration. The loopback interface (lo) is never defined to HACMP.
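The comparison can also be done mechanically. The sketch below reduces both outputs to "interface ip netmask" lines so they can be compared with diff. The function names are hypothetical. The cllsif -c field positions (7 for the IP address, 9 for the interface, 11 for the netmask) are inferred from the sample output in this section rather than from a documented format, and the ifconfig parser assumes the "inet addr:... Mask:..." layout shown above.

```shell
#!/bin/sh
# hacmp_ifaces: read cllsif -c records on stdin and print
# "interface ip netmask" for each record that names an interface.
# Field positions are inferred from the sample output, not documented.
hacmp_ifaces() {
    awk -F: 'NF >= 11 && $9 != "" { print $9, $7, $11 }' | sort
}

# system_ifaces: read ifconfig output on stdin and print
# "interface ip netmask" for each IPv4 address, skipping lo.
# Assumes the "inet addr:... Mask:..." layout of older ifconfig versions.
system_ifaces() {
    awk '
        /^[^ \t]/ { ifname = $1 }          # unindented line starts a new interface
        /inet addr:/ {
            ip = ""; mask = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^addr:/) ip = substr($i, 6)
                if ($i ~ /^Mask:/) mask = substr($i, 6)
            }
            if (ifname != "lo") print ifname, ip, mask
        }' | sort
}

# Possible use on node ppstest2 (not executed here):
#   /usr/es/sbin/cluster/utilities/cllsif -c | grep ppstest2 | hacmp_ifaces > /tmp/hacmp.ifs
#   ifconfig | system_ifaces > /tmp/system.ifs
#   diff /tmp/hacmp.ifs /tmp/system.ifs
```

An empty diff means the interface names, IP addresses and netmasks agree. Service and persistent addresses that HACMP has placed on an interface would appear only on the ifconfig side.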
To fix the problem, do either of the following:
If the interface name or IP address is incorrect, use the Change/Show Communication Interfaces/Devices WebSMIT menu to change them.
If the netmask is incorrect, use the Change/Show Networks menu to change the netmask for all interfaces on this network.
Note: You do not have to specify interface names for service IP addresses or for persistent addresses. HACMP keeps these addresses available by moving them to a different interface after any failure.
Troubleshooting an Unsuccessful Application Fallover to Another Node
If your application starts successfully on one node, but HACMP issues an EVENT FAILED message when trying to perform an application fallover to another node, make sure that the application start, stop and notify scripts exist and are executable on every node in the cluster. Use the cllsserv command to list the scripts defined for each application server, then check each script with ls -l.
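The existence and permission check can be wrapped in a small helper and run on each node. This is a minimal sketch: check_scripts is a hypothetical name, not an HACMP command, and the script paths are passed in by hand after reading them from the cllsserv output.

```shell
#!/bin/sh
# check_scripts PATH...: report every listed script that is missing or
# not executable. Returns 0 if all scripts pass, 1 otherwise.
check_scripts() {
    rc=0
    for script in "$@"; do
        if [ ! -e "$script" ]; then
            echo "MISSING: $script"
            rc=1
        elif [ ! -x "$script" ]; then
            echo "NOT EXECUTABLE: $script"
            rc=1
        fi
    done
    return $rc
}
```

With the paths used in this section, the call would be check_scripts /usr/local/app_start /usr/local/app_stop. No output and a zero exit status on every node means the scripts are in place.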
For example:
ppstest2:~ # /usr/es/sbin/cluster/utilities/cllsserv
app_test2_primary /usr/local/app_start /usr/local/app_stop
ppstest2:~ # ls -l /usr/local/app_start
-rwxr--r-- 1 root root 169 May 10 22:54 /usr/local/app_start

Troubleshooting the Serial Connection
This section describes how to test an installed serial connection.
To ensure that the RS232 cable is properly configured and transmits data, run the following test after creating the tty device on both nodes.
Run this test while the tty device is not in use. If the cluster is active, remove the serial network dynamically from the configuration before running the test. Also, verify that the tty device is not in use by any other process.
To determine if the device is in use, run the fuser command:
The output lists the PID of any process that uses the device.
If the device is in use by RSCT, the output shows that a process hats_rs232_nim is accessing the device. After the network has been dynamically removed from the cluster configuration, no such process should exist.
In rare cases, the hats_rs232_nim process may not terminate during a dynamic removal of the network or a stop of the cluster services. In these cases, you should call IBM support. However, it is safe to terminate any leftover hats_rs232_nim process if the cluster is inactive on the local node.
Use the fuser command to terminate a process that accesses the tty device:
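A minimal sketch of both fuser steps follows. The device name /dev/ttyS0 is a placeholder, not a value from this chapter, and the function names are hypothetical. The sketch relies only on standard fuser behavior: it exits non-zero when no process has the file open, -v adds the user and command name to the listing, and -k sends SIGKILL to each process that has the file open.

```shell
#!/bin/sh
# Placeholder device name; substitute the tty used for the serial network.
TTY_DEV="${TTY_DEV:-/dev/ttyS0}"

# tty_in_use DEV: succeed if at least one process has DEV open.
tty_in_use() {
    fuser "$1" >/dev/null 2>&1
}

# tty_show_users DEV: list the processes that have DEV open,
# including the command name (for example, hats_rs232_nim).
tty_show_users() {
    fuser -v "$1"
}

# tty_kill_users DEV: kill every process that has DEV open.
# Only use this for leftover processes when the cluster is inactive
# on the local node.
tty_kill_users() {
    fuser -k "$1"
}

# Possible use (not executed here):
#   tty_in_use "$TTY_DEV" && tty_show_users "$TTY_DEV"
```

Checking with tty_in_use before and after the kill confirms that the device is free before the serial test is run.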
Running the stty Test
The stty test determines whether the serial connection can transmit data.
Running the stty Test on TTYs with RTS Flow Control Set
To perform the stty test:
1. On the receiving side, run:
2. On the sending side, run:
Running the stty Test on TTYs with XON or No Flow Control Set
To perform the stty test:
1. On the receiving side (node 2), run:
2. On the sending side, run:
If the nodes are able to communicate over the serial cable, both nodes display their tty settings and return to the prompt.
If the data is transmitted successfully from one node to another, then the text from the /etc/hosts file from the second node appears on the console of the first node. Note that you can use any text file for this test, and do not need to specifically use the /etc/hosts file.
After you install and test the serial connection, you define the connection as a point-to-point network to HACMP. For information about how to configure a serial network, see Configuring Serial Networks for Heartbeating in Chapter 3: Common Task Summary.