Chapter 3: Investigating System Components and Solving Common Problems
This chapter guides you through the steps to investigate system components, helps you identify problems that you may encounter as you use HACMP, and offers possible solutions.
Overview
If no error messages are displayed on the console and if examining the log files proves fruitless, you next investigate each component of your HACMP environment and eliminate it as the cause of the problem. The first section of this chapter reviews methods for investigating system components, including the RSCT subsystem. It includes these sections:
The second section provides recommendations for investigating the following areas:
Investigating System Components
Both HACMP and AIX 5L provide utilities you can use to determine the state of an HACMP cluster and the resources within that cluster. Using these commands, you can gather information about volume groups or networks. Your knowledge of the HACMP system is essential. You must know the characteristics of a normal cluster beforehand and be on the lookout for deviations from the norm as you examine the cluster components. Often, the surviving cluster nodes can provide an example of the correct setting for a system parameter or for other cluster configuration information.
The following sections review the HACMP cluster components that you can check and describe some useful utilities. If examining the cluster log files does not reveal the source of a problem, investigate each system component using a top-down strategy to move through the layers. You should investigate the components in the following order:
1. Application layer
2. HACMP layer
3. Logical Volume Manager layer
4. TCP/IP layer
5. AIX 5L layer
6. Physical network layer
7. Physical disk layer
8. System hardware layer.
The following sections describe what you should look for when examining each layer. They also briefly describe the tools you should use to examine the layers.
Checking Highly Available Applications
As a first step to finding problems affecting a cluster, check each highly available application running on the cluster. Examine any application-specific log files and perform any troubleshooting procedures recommended in the application’s documentation. In addition, check the following:
Do some simple tests; for example, for a database application, try to add and delete a record. Use the ps command to check that the necessary processes are running, or to verify that the processes were stopped properly. Check that the resources the application expects to be present, such as filesystems and volume groups, are available.
Checking the HACMP Layer
If checking the application layer does not reveal the source of a problem, check the HACMP layer. The two main areas to investigate are:
- HACMP components and required files
- Cluster topology and configuration.
The following sections describe how to investigate these problems.
Note: These steps assume that you have checked the log files and that they do not point to the problem.
Checking HACMP Components
An HACMP cluster is made up of several required files and daemons. The following sections describe what to check for in the HACMP layer.
Checking HACMP Required Files
Make sure that the HACMP files required for your cluster are in the proper place, have the proper permissions (readable and executable), and are not zero length. The HACMP files and the AIX 5L files modified by the HACMP software are listed in the README file that accompanies the product.
Checking Cluster Services and Processes
Check the status of the following HACMP daemons:
- The Cluster Manager (clstrmgrES) daemon
- The Cluster Communications (clcomdES) daemon
- The Cluster Information Program (clinfoES) daemon.
When these components are not responding normally, determine if the daemons are active on a cluster node. Use either the options on the SMIT System Management (C-SPOC) > Manage HACMP Services > Show Cluster Services panel or the lssrc command.
For example, to check on the status of all daemons under the control of the SRC, enter:
lssrc -a | grep active
syslogd          ras         290990   active
sendmail         mail        270484   active
portmap          portmap     286868   active
inetd            tcpip       295106   active
snmpd            tcpip       303260   active
dpid2            tcpip       299162   active
hostmibd         tcpip       282812   active
aixmibd          tcpip       278670   active
biod             nfs         192646   active
rpc.statd        nfs         254122   active
rpc.lockd        nfs         274584   active
qdaemon          spooler     196720   active
writesrv         spooler     250020   active
ctrmc            rsct        98392    active
clcomdES         clcomdES    204920   active
IBM.CSMAgentRM   rsct_rm     90268    active
IBM.ServiceRM    rsct_rm     229510   active
IBM.ERRM         rsct_rm     188602   active
IBM.AuditRM      rsct_rm     151722   active
topsvcs          topsvcs     602292   active
grpsvcs          grpsvcs     569376   active
emsvcs           emsvcs      561188   active
emaixos          emsvcs      557102   active
clstrmgrES       cluster     544802   active
gsclvmd                      565356   active
IBM.HostRM       rsct_rm     442380   active
To check on the status of all cluster daemons under the control of the SRC, enter:
lssrc -g cluster
Note: When you use the -g flag with the lssrc command, the status information does not include the status of subsystems if they are inactive. If you need this information, use the -a flag instead. For more information on the lssrc command, see the man page.
To view additional information on the status of a daemon, run the clcheck_server command. The clcheck_server command makes additional checks and retries beyond what is done by the lssrc command. For more information, see the clcheck_server man page.
To determine whether the Cluster Manager is running, or if processes started by the Cluster Manager are currently running on a node, use the ps command.
For example, to determine whether the clstrmgrES daemon is running, enter:
ps -ef | grep clstrmgrES
root 18363  3346 3 11:02:05      - 10:20 /usr/es/sbin/cluster/clstrmgrES
root 19028 19559 2 16:20:04 pts/10  0:00 grep clstrmgrES
See the ps man page for more information on using this command.
Checking for Cluster Configuration Problems
For an HACMP cluster to function properly, all the nodes in the cluster must agree on the cluster topology, network configuration, and ownership and takeover of HACMP resources. This information is stored in the Configuration Database on each cluster node.
To begin checking for configuration problems, ask yourself if you (or others) have made any recent changes that may have disrupted the system. Have components been added or deleted? Has new software been loaded on the machine? Have new PTFs or application updates been performed? Has a system backup been restored? Then run verification to ensure that the proper HACMP-specific modifications to AIX 5L software are in place and that the cluster configuration is valid.
The cluster verification utility checks many aspects of a cluster configuration and reports any inconsistencies. Using this utility, you can perform the following tasks:
- Verify that all cluster nodes contain the same cluster topology information
- Check that all network interface cards and tty lines are properly configured, and that shared disks are accessible to all nodes that can own them
- Check each cluster node to determine whether multiple RS232 non-IP networks exist on the same tty device
- Check for agreement among all nodes on the ownership of defined resources, such as filesystems, log files, volume groups, disks, and application servers
- Check for invalid characters in cluster names, node names, network names, network interface names and resource group names
- Verify takeover information.
The verification utility will also print out diagnostic information about the following:
- Custom snapshot methods
- Custom verification methods
- Custom pre- or post-events
- Cluster log file redirection.
If you have configured Kerberos on your system, the verification utility also determines that:
- All IP labels listed in the configuration have the appropriate service principals in the .klogin file on each node in the cluster
- All nodes have the proper service principals
- Kerberos is installed on all nodes in the cluster
- All nodes have the same security mode setting.
From the main HACMP SMIT panel, select Problem Determination Tools > HACMP Verification > Verify HACMP Configuration. If you find a configuration problem, correct it, then resynchronize the cluster.
Note: Some errors require that you make changes on each cluster node. For example, a missing application start script or a volume group with autovaryon=TRUE requires a correction on each affected node. Some of these issues can be taken care of by using HACMP File Collections.
For more information about using the cluster verification utility and HACMP File Collections, see Chapter 7: Verifying and Synchronizing a Cluster Configuration in the Administration Guide.
Run the /usr/es/sbin/cluster/utilities/cltopinfo command to see a complete listing of cluster topology. In addition to running the HACMP verification process, check for recent modifications to the node configuration files.
The command ls -lt /etc will list all the files in the /etc directory and show the most recently modified files that are important to configuring AIX 5L, such as:
- /etc/inetd.conf
- /etc/hosts
- /etc/services.
It is also very important to check the resource group configuration for any errors that may not be flagged by the verification process. For example, make sure the filesystems required by the application servers are included in the resource group with the application.
Check that the nodes in each resource group are the ones intended, and that the nodes are listed in the proper order. To view the cluster resource configuration information from the main HACMP SMIT panel, select Extended Configuration > Extended Resource Configuration > HACMP Extended Resource Group Configuration > Show All Resources by Node or Resource Group.
You can also run the /usr/es/sbin/cluster/utilities/clRGinfo command to see the resource group information.
Note: If cluster configuration problems arise after running the cluster verification utility, do not run C-SPOC commands in this environment as they may fail to execute on cluster nodes.
Checking a Cluster Snapshot File
The HACMP cluster snapshot facility (/usr/es/sbin/cluster/utilities/clsnapshots) allows you to save, in a file, a record of all the data that defines a particular cluster configuration. It also allows you to create your own custom snapshot methods to save additional information important to your configuration. You can use this snapshot for troubleshooting cluster problems. The default directory path for storage and retrieval of a snapshot is /usr/es/sbin/cluster/snapshots.
Note that you cannot use the cluster snapshot facility in a cluster that is running different versions of HACMP concurrently.
For information on how to create and apply cluster snapshots, see Chapter 18: Saving and Restoring Cluster Configurations in the Administration Guide.
Information Saved in a Cluster Snapshot
The primary information saved in a cluster snapshot is the data stored in the HACMP Configuration Database classes (such as HACMPcluster, HACMPnode, and HACMPnetwork). This is the information used to recreate the cluster configuration when a cluster snapshot is applied.
The cluster snapshot does not save any user-customized scripts, applications, or other non-HACMP configuration parameters. For example, the name of an application server and the location of its start and stop scripts are stored in the HACMPserver Configuration Database object class. However, the scripts themselves as well as any applications they may call are not saved.
The cluster snapshot does not save any device data or configuration-specific data that is outside the scope of HACMP. For instance, the facility saves the names of shared filesystems and volume groups; however, other details, such as NFS options or LVM mirroring configuration are not saved.
If you moved resource groups using the Resource Group Management utility clRGmove, once you apply a snapshot, the resource groups return to behaviors specified by their default nodelists. To investigate a cluster after a snapshot has been applied, run clRGinfo to view the locations and states of resource groups.
In addition to this Configuration Database data, a cluster snapshot also includes output generated by various HACMP and standard AIX 5L commands and utilities. This data includes the current state of the cluster, node, network, and network interfaces as viewed by each cluster node, as well as the state of any running HACMP daemons.
The cluster snapshot includes output from the following commands:
cllscf, cllsif, cllsnw, clchsyncd, clshowres, cltopinfo, df, exportfs, ifconfig, ls, lsfs, lslpp, lslv, lsvg, netstat, and no
In HACMP 5.1 and up, by default, HACMP no longer collects cluster log files when you create the cluster snapshot, although you can still specify to do so in SMIT. Skipping the logs collection reduces the size of the snapshot and speeds up running the snapshot utility.
You can use SMIT to collect cluster log files for problem reporting. This option is available under the Problem Determination Tools > HACMP Log Viewing and Management > Collect Cluster log files for Problem Reporting SMIT menu. It is recommended to use this option only if requested by IBM support personnel.
If you want to add commands to obtain site-specific information, create custom snapshot methods as described in the chapter on Saving and Restoring Cluster Configurations in the Administration Guide.
Note that you can also use the AIX 5L snap -e command to collect HACMP cluster data, including the hacmp.out and clstrmgr.debug log files.
Cluster Snapshot Files
The cluster snapshot facility stores the data it saves in two separate files, the Configuration Database data file and the Cluster State Information File, each displaying information in three sections.
Configuration Database Data File (.odm)
This file contains all the data stored in the HACMP Configuration Database object classes for the cluster. This file is given a user-defined basename with the .odm file extension. Because the Configuration Database information must be largely the same on every cluster node, the cluster snapshot saves the values from only one node. The cluster snapshot Configuration Database data file is an ASCII text file divided into three delimited sections:
The following is an excerpt from a sample cluster snapshot Configuration Database data file showing some of the ODM stanzas that are saved:
id = 1106245917
name = "HA52_TestCluster"
nodename = "mynode"
sec_level = "Standard"
sec_level_msg = ""
sec_encryption = ""
sec_persistent = ""
last_node_ids = ""
highest_node_id = 0
last_network_ids = ""
highest_network_id = 0
last_site_ids = ""
highest_site_id = 0
handle = 1
cluster_version = 7
reserved1 = 0
reserved2 = 0
wlm_subdir = ""
settling_time =
org_distribution_policy = "node"
noautoverification = 0
clvernodename = ""
clverhour = 0

name = "mynode"
object = "VERBOSE_LOGGING"
value = "high"
Cluster State Information File (.info)
This file contains the output from standard AIX 5L and HACMP system management commands. This file is given the same user-defined basename with the .info file extension. If you defined custom snapshot methods, the output from them is appended to this file. The Cluster State Information file contains three sections:
Checking the Logical Volume Manager
When troubleshooting an HACMP cluster, you need to check the following LVM entities:
- Volume groups
- Physical volumes
- Logical volumes
- Filesystems.
Checking Volume Group Definitions
Check to make sure that all shared volume groups in the cluster are active on the correct node. If a volume group is not active, vary it on using the appropriate command for your configuration.
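If a shared, non-concurrent volume group needs to be activated manually, a minimal sketch is shown below; the volume group name sharedvg is a placeholder, and for enhanced concurrent volume groups managed by HACMP it is safer to use the C-SPOC panels instead:

varyonvg sharedvg    # activate the shared volume group on the node that should own it
lsvg -o              # confirm that it now appears among the varied-on volume groups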
In the SMIT panel Initialization and Standard Configuration > Configure HACMP Resource Groups > Change/Show Resources for a Resource Group (standard), all volume groups listed in the Volume Groups field for a resource group should be varied on (active) on the node(s) that have the resource group online.
Using the lsvg Command to Check Volume Groups
To check for inconsistencies among volume group definitions on cluster nodes, use the lsvg command to display information about the volume groups defined on each node in the cluster:
The system returns volume group information similar to the following:
To list only the active (varied on) volume groups in the system, use the lsvg -o command as follows:
The system returns volume group information similar to the following:
To list all logical volumes in the volume group, and to check the volume group status and attributes, use the lsvg -l command and specify the volume group name as shown in the following example:
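A minimal sketch of these lsvg checks follows; the volume group name sharedvg is a placeholder:

lsvg                  # list all volume groups defined on this node
lsvg -o               # list only the volume groups that are currently varied on
lsvg -l sharedvg      # list the logical volumes in sharedvg, with their state and mount points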
You can also use HACMP SMIT to check for inconsistencies: select the System Management (C-SPOC) > HACMP Logical Volume Management > Shared Volume Groups option to display information about shared volume groups in your cluster.
Checking the Varyon State of a Volume Group
You may check the status of the volume group by issuing the lsvg <vgname> command. Depending on your configuration, the lsvg command reports the following:
- vg state could be active (if it is active varyon), or passive only (if it is passive varyon).
- vg mode could be concurrent or enhanced concurrent.
Here is an example of lsvg output:
# lsvg myvg
VOLUME GROUP:   Volume_Group_01        VG IDENTIFIER:  0002231b00004c00000000f2801blcc3
VG STATE:       active                 PP SIZE:        16 megabyte(s)
VG PERMISSION:  read/write             TOTAL PPs:      1084 (17344 megabytes)
MAX LVs:        256                    FREE PPs:       977 (15632 megabytes)
LVs:            4                      USED PPs:       107 (1712 megabytes)
OPEN LVs:       0                      QUORUM:         2
TOTAL PVs:      2                      VG DESCRIPTORS: 3
STALE PVs:      0                      STALE PPs:      0
ACTIVE PVs:     2                      AUTO ON:        no
MAX PPs per PV: 1016                   MAX PVs:        32
LTG size:       128 kilobyte(s)        AUTO SYNC:      no
HOT SPARE:      no
Using the C-SPOC Utility to Check Shared Volume Groups
To check for inconsistencies among volume group definitions on cluster nodes in a two-node C-SPOC environment:
1. Enter smitty hacmp
2. In SMIT, select System Management (C-SPOC) > HACMP Logical Volume Management > Shared Volume Groups > List All Shared Volume Groups and press Enter to accept the default (no).
A list of all shared volume groups in the C-SPOC environment appears. This list also contains enhanced concurrent volume groups included as resources in non-concurrent resource groups.
You can also use the C-SPOC cl_lsvg command from the command line to display this information.
Checking Physical Volumes
To check for discrepancies in the physical volumes defined on each node, obtain a list of all physical volumes known to the systems and compare this list against the list of disks specified in the Disks field of the Command Status panel. Access the Command Status panel through the SMIT Extended Configuration > Extended Resource Configuration > HACMP Extended Resource Group Configuration > Show All Resources by Node or Resource Group panel.
To obtain a list of all the physical volumes known to a node and to find out the volume groups to which they belong, use the lspv command. If you do not specify the name of a volume group as an argument, the lspv command displays every known physical volume in the system. For example:
lspv
hdisk0   0000914312e971a   rootvg
hdisk1   00000132a78e213   rootvg
hdisk2   00000902a78e21a   datavg
hdisk3   00000321358e354   datavg
The first column of the display shows the logical name of the disk. The second column lists the physical volume identifier of the disk. The third column lists the volume group (if any) to which it belongs.
Note that on each cluster node, AIX 5L can assign different names (hdisk numbers) to the same physical volume. To tell which names correspond to the same physical volume, compare the physical volume identifiers listed on each node.
If you specify the logical device name of a physical volume (hdiskx) as an argument to the lspv command, it displays information about the physical volume, including whether it is active (varied on). For example:
lspv hdisk2
PHYSICAL VOLUME:  hdisk2                 VOLUME GROUP:     abalonevg
PV IDENTIFIER:    0000301919439ba5       VG IDENTIFIER:    00003019460f63c7
PV STATE:         active                 VG STATE:         active/complete
STALE PARTITIONS: 0                      ALLOCATABLE:      yes
PP SIZE:          4 megabyte(s)          LOGICAL VOLUMES:  2
TOTAL PPs:        203 (812 megabytes)    VG DESCRIPTORS:   2
FREE PPs:         192 (768 megabytes)
USED PPs:         11 (44 megabytes)
FREE DISTRIBUTION: 41..30..40..40..41
USED DISTRIBUTION: 00..11..00..00..00
If a physical volume is inactive (not varied on, as indicated by question marks in the PV STATE field), use the appropriate command for your configuration to vary on the volume group containing the physical volume. Before doing so, however, you may want to check the system error report to determine whether a disk problem exists. Enter the following command to check the system error report:
You can also use the lsdev command to check the availability or status of all physical volumes known to the system.
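As a hedged sketch of these two checks (standard AIX commands; the flags shown are common options, not taken from this guide):

errpt | more       # scan the system error report for permanent disk or adapter errors
lsdev -Cc disk     # show the state (Available or Defined) of every physical disk known to the system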
Checking Logical Volumes
To check the state of logical volumes defined on the physical volumes, use the lspv -l command and specify the logical name of the disk to be checked. As shown in the following example, you can use this command to determine the names of the logical volumes defined on a physical volume:
lspv -l hdisk2
LV NAME   LPs   PPs   DISTRIBUTION         MOUNT POINT
lv02      50    50    25..00..00..00..25   /usr
lv04      44    44    06..00..00..32..06   /clusterfs
Use the lslv logicalvolume command to display information about the state (opened or closed) of a specific logical volume, as indicated in the LV STATE field. For example:
lslv nodeAlv
LOGICAL VOLUME: nodeAlv               VOLUME GROUP:  nodeAvg
LV IDENTIFIER:  00003019460f63c7.1    PERMISSION:    read/write
VG STATE:       active/complete       LV STATE:      opened/syncd
TYPE:           jfs                   WRITE VERIFY:  off
MAX LPs:        128                   PP SIZE:       4 megabyte(s)
COPIES:         1                     SCHED POLICY:  parallel
LPs:            10                    PPs:           10
STALE PPs:      0                     BB POLICY:     relocatable
INTER-POLICY:   minimum               RELOCATABLE:   yes
INTRA-POLICY:   middle                UPPER BOUND:   32
MOUNT POINT:    /nodeAfs              LABEL:         /nodeAfs
MIRROR WRITE CONSISTENCY: on
EACH LP COPY ON A SEPARATE PV ?: yes
If a logical volume state is inactive (or closed, as indicated in the LV STATE field), use the appropriate command for your configuration to vary on the volume group containing the logical volume.
Using the C-SPOC Utility to Check Shared Logical Volumes
To check the state of shared logical volumes on cluster nodes:
In SMIT select System Management (C-SPOC) > HACMP Logical Volume Management > Shared Logical Volumes > List All Shared Logical Volumes by Volume Group. A list of all shared logical volumes appears.
You can also use the C-SPOC cl_lslv command from the command line to display this information.
Checking Filesystems
Check to see if the necessary filesystems are mounted and where they are mounted. Compare this information against the HACMP definitions for any differences. Check the permissions of the filesystems and the amount of space available on a filesystem.
Use the following commands to obtain this information about filesystems:
- The mount command
- The df command
- The lsfs command.
Use the cl_lsfs command to list filesystem information when running the C-SPOC utility.
Obtaining a List of Filesystems
Use the mount command to list all the filesystems, both JFS and NFS, currently mounted on a system and their mount points. For example:
mount
node   mounted      mounted over  vfs  date          options
------------------------------------------------------------------------
       /dev/hd4     /             jfs  Oct 06 09:48  rw,log=/dev/hd8
       /dev/hd2     /usr          jfs  Oct 06 09:48  rw,log=/dev/hd8
       /dev/hd9var  /var          jfs  Oct 06 09:48  rw,log=/dev/hd8
       /dev/hd3     /tmp          jfs  Oct 06 09:49  rw,log=/dev/hd8
       /dev/hd1     /home         jfs  Oct 06 09:50  rw,log=/dev/hd8
pearl  /home        /home         nfs  Oct 07 09:59  rw,soft,bg,intr
jade   /usr/local   /usr/local    nfs  Oct 07 09:59  rw,soft,bg,intr
Determine whether and where the filesystem is mounted, then compare this information against the HACMP definitions to note any differences.
Checking Available Filesystem Space
To see the space available on a filesystem, use the df command. For example:
df
Filesystem    Total KB   free  %used  iused %iused  Mounted on
/dev/hd4         12288   5308    56%    896    21%  /
/dev/hd2        413696  26768    93%  19179    18%  /usr
/dev/hd9var       8192   3736    54%    115     5%  /var
/dev/hd3          8192   7576     7%     72     3%  /tmp
/dev/hd1          4096   3932     4%     17     1%  /home
/dev/crab1lv      8192   7904     3%     17     0%  /crab1fs
/dev/crab3lv     12288  11744     4%     16     0%  /crab3fs
/dev/crab4lv     16384  15156     7%     17     0%  /crab4fs
/dev/crablv       4096   3252    20%     17     1%  /crabfs
Check the %used column for filesystems that are using more than 90% of their available space. Then check the free column to determine the exact amount of free space left.
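As a quick filter, the following sketch (which assumes the seven-column df -k layout shown above) prints only the filesystems that are more than 90% full:

df -k | awk 'NR > 1 && $4+0 > 90 {print $7, $4}'    # mount point and %used for filesystems over 90% full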
Checking Mount Points, Permissions, and Filesystem Information
Use the lsfs command to display information about mount points, permissions, filesystem size and so on. For example:
lsfs
Name          Nodename  Mount Pt  VFS  Size    Options  Auto
/dev/hd4      --        /         jfs  24576   --       yes
/dev/hd1      --        /home     jfs  8192    --       yes
/dev/hd2      --        /usr      jfs  827392  --       yes
/dev/hd9var   --        /var      jfs  16384   --       yes
/dev/hd3      --        /tmp      jfs  16384   --       yes
/dev/hd7      --        /mnt      jfs  --      --       no
/dev/hd5      --        /blv      jfs  --      --       no
/dev/crab1lv  --        /crab1fs  jfs  16384   rw       no
/dev/crab3lv  --        /crab3fs  jfs  24576   rw       no
/dev/crab4lv  --        /crab4fs  jfs  32768   rw       no
/dev/crablv   --        /crabfs   jfs  8192    rw       no
Important: For filesystems to be NFS exported, be sure to verify that logical volume names for these filesystems are consistent throughout the cluster.
Using the C-SPOC Utility to Check Shared Filesystems
To check to see whether the necessary shared filesystems are mounted and where they are mounted on cluster nodes in a two-node C-SPOC environment:
In SMIT select System Management (C-SPOC) > HACMP Logical Volume Management > Shared Filesystems. Select from either Journaled Filesystems > List All Shared Filesystems or Enhanced Journaled Filesystems > List All Shared Filesystems to display a list of shared filesystems.
You can also use the C-SPOC cl_lsfs command from the command line to display this information.
Checking the Automount Attribute of Filesystems
At boot time, AIX 5L attempts to check all the filesystems listed in /etc/filesystems with the check=true attribute by running the fsck command. If AIX 5L cannot check a filesystem, it reports the following error:
Filesystem Helper: 0506-519 Device open failed
For filesystems controlled by HACMP, this error message typically does not indicate a problem. The filesystem check fails because the volume group on which the filesystem is defined is not varied on at boot time.
To avoid generating this message, edit the /etc/filesystems file to ensure that the stanzas for the shared filesystems do not include the check=true attribute.
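A sketch of what a shared filesystem stanza might look like after editing; the filesystem, logical volume, and log device names are placeholders, and the attributes follow the standard /etc/filesystems format:

/sharedfs:
        dev       = /dev/sharedlv
        vfs       = jfs2
        log       = /dev/sharedloglv
        mount     = false
        check     = false
        options   = rw
        account   = false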
Checking the TCP/IP Subsystem
To investigate the TCP/IP subsystem, use the following AIX 5L commands:
- Use the netstat command to make sure that the network interfaces are initialized and that a communication path exists between the local node and the target node.
- Use the ping command to check the point-to-point connectivity between nodes.
- Use the ifconfig command on all network interfaces to detect bad IP addresses, incorrect subnet masks, and improper broadcast addresses.
- Scan the /tmp/hacmp.out file to confirm that the /etc/rc.net script has run successfully. Look for a zero exit status.
- If IP address takeover is enabled, confirm that the /etc/rc.net script has run and that the service interface is on its service address and not on its base (boot) address.
- Use the lssrc -g tcpip command to make sure that the inetd daemon is running.
- Use the lssrc -g portmap command to make sure that the portmapper daemon is running.
- Use the arp command to make sure that the cluster nodes are not using the same IP or hardware address.
Use the netstat command to:
- Show the status of the network interfaces defined for a node.
- Determine whether a route from the local node to the target node is defined.
The netstat -in command displays a list of all initialized interfaces for the node, along with the network to which that interface connects and its IP address. You can use this command to determine whether the service and standby interfaces are on separate subnets. The subnets are displayed in the Network column.
netstat -in
Name  Mtu   Network      Address         Ipkts    Ierrs  Opkts  Oerrs  Coll
lo0   1536  <Link>                       18406    0      18406  0      0
lo0   1536  127          127.0.0.1       18406    0      18406  0      0
en1   1500  <Link>                       1111626  0      58643  0      0
en1   1500  100.100.86.  100.100.86.136  1111626  0      58643  0      0
en0   1500  <Link>                       943656   0      52208  0      0
en0   1500  100.100.83.  100.100.83.136  943656   0      52208  0      0
tr1   1492  <Link>                       1879     0      1656   0      0
tr1   1492  100.100.84.  100.100.84.136  1879     0      1656   0      0
Look at the first, third, and fourth columns of the output. The Name column lists all the interfaces defined and available on this node. Note that an asterisk preceding a name indicates the interface is down (not ready for use). The Network column identifies the network to which the interface is connected (its subnet). The Address column identifies the IP address assigned to the node.
The netstat -rn command indicates whether a route to the target node is defined. To see all the defined routes, enter:
netstat -rn
Information similar to that shown in the following example is displayed:
Routing tables
Destination      Gateway          Flags  Refcnt  Use    Interface

Netmasks:
(root node)
(0)0
(0)0 ff00 0
(0)0 ffff 0
(0)0 ffff ff80 0
(0)0 70 204 1 0
(root node)

Route Tree for Protocol Family 2:
(root node)
127              127.0.0.1        U      3       1436   lo0
127.0.0.1        127.0.0.1        UH     0       456    lo0
100.100.83.128   100.100.83.136   U      6       18243  en0
100.100.84.128   100.100.84.136   U      1       1718   tr1
100.100.85.128   100.100.85.136   U      2       1721   tr0
100.100.86.128   100.100.86.136   U      8       21648  en1
100.100.100.128  100.100.100.136  U      0       39     en0
(root node)

Route Tree for Protocol Family 6:
(root node)
(root node)

To test for a specific route to a network (for example 100.100.83), enter:
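One way to run this test (a sketch; the grep pattern matches the example subnet shown above):

netstat -rn | grep '100.100.83'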
The same test, run on a system that does not have this route in its routing table, returns no response. If the service and standby interfaces are separated by a bridge, router, or hub and you experience problems communicating with network devices, the devices may not be set to handle two network segments as one physical network. Try testing the devices independent of the configuration, or contact your system administrator for assistance.
Note that if you have only one interface active on a network, the Cluster Manager will not generate a failure event for that interface. For more information, see the section on network interface events in the Planning Guide.
See the netstat man page for more information on using this command.
Checking Point-to-Point Connectivity
The ping command tests the point-to-point connectivity between two nodes in a cluster. Use the ping command to determine whether the target node is attached to the network and whether the network connections between the nodes are reliable. Be sure to test all TCP/IP interfaces configured on the nodes (service and standby).
For example, to test the connection from a local node to a remote node named nodeA enter:
/etc/ping nodeA
PING testcluster.nodeA.com: (100.100.81.141): 56 data bytes
64 bytes from 100.100.81.141: icmp_seq=0 ttl=255 time=2 ms
64 bytes from 100.100.81.141: icmp_seq=1 ttl=255 time=1 ms
64 bytes from 100.100.81.141: icmp_seq=2 ttl=255 time=2 ms
64 bytes from 100.100.81.141: icmp_seq=3 ttl=255 time=2 ms
Type Control-C to end the display of packets. The following statistics appear:
----testcluster.nodeA.com PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 1/1/2 ms
The ping command sends packets to the specified node, requesting a response. If a correct response arrives, ping prints a message similar to the output shown above indicating no lost packets. This indicates a valid connection between the nodes.
If the ping command hangs, it indicates that there is no valid path between the node issuing the ping command and the node you are trying to reach. It could also indicate that required TCP/IP daemons are not running. Check the physical connection between the two nodes. Use the ifconfig and netstat commands to check the configuration. A “bad value” message indicates problems with the IP addresses or subnet definitions.
Note that if “DUP!” appears at the end of the ping response, it means the ping command has received multiple responses for the same address. This response typically occurs when network interfaces have been misconfigured, or when a cluster event fails during IP address takeover. Check the configuration of all interfaces on the subnet to verify that there is only one interface per address. For more information, see the ping man page.
In addition, you can assign a persistent node IP label to a cluster network on a node. A persistent node IP label is convenient when, for administrative purposes, you want to reach a specific node in the cluster using the ping or telnet commands without worrying about whether the service IP label you are using belongs to any of the resource groups present on that node.
For more information on how to assign persistent Node IP labels on the network on the nodes in your cluster, see the Planning Guide and Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide.
Checking the IP Address and Netmask
Use the ifconfig command to confirm that the IP address and netmask are correct. Invoke ifconfig with the name of the network interface that you want to examine. For example, to check the first Ethernet interface, enter:
ifconfig en0
en0: flags=2000063<UP,BROADCAST,NOTRAILERS,RUNNING,NOECHO>
        inet 100.100.83.136 netmask 0xffffff00 broadcast 100.100.83.255
If the specified interface does not exist, ifconfig replies:
The ifconfig command displays two lines of output. The first line shows the interface’s name and characteristics. Check for these characteristics:
The second line of output shows the IP address and the subnet mask (written in hexadecimal). Check these fields to make sure the network interface is properly configured.
See the ifconfig man page for more information.
Using the arp Command
Use the arp command to view the IP and hardware addresses that a host currently associates with the nodes listed in its arp cache. For example:
arp -a
flounder (100.50.81.133) at 8:0:4c:0:12:34 [ethernet]
cod (100.50.81.195) at 8:0:5a:7a:2c:85 [ethernet]
seahorse (100.50.161.6) at 42:c:2:4:0:0 [token ring]
pollock (100.50.81.147) at 10:0:5a:5c:36:b9 [ethernet]
This output shows what the host node currently believes to be the IP and MAC addresses for nodes flounder, cod, seahorse and pollock. (If IP address takeover occurs without Hardware Address Takeover, the MAC address associated with the IP address in the host’s arp cache may become outdated. You can correct this situation by refreshing the host’s arp cache.)
See the arp man page for more information.
Checking Heartbeating over IP Aliases
The hacmp.out file shows when a heartbeating over IP Aliases address is removed from an interface and when it is added to the interface again during an adapter_swap.
Use the following to check the configuration for heartbeating over IP Aliases:
- netstat -n shows the aliases
- clstrmgr.debug shows an IP Alias address when it is mapped to an interface.
Checking ATM Classic IP Hardware Addresses
For Classic IP interfaces, the arp command is particularly useful to diagnose errors. It can be used to verify the functionality of the ATM network on the ATM protocol layer, and to verify the registration of each Classic IP client with its server.
Example 1
The following arp command yields the output below:
arp -t atm -a
SVC - at0 on device atm2
=========================
at0(10.50.111.4) 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.a6.9b.0
IP Addr                         VPI:VCI  Handle  ATM Address
stby_1A(10.50.111.2)            0:110    21      39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.82.48.7
server_10_50_111(10.50.111.99)  0:103    14      39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a.11.0
stby_1C(10.50.111.6)            0:372    11      39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.98.fc.0

SVC - at2 on device atm1
========================
at2(10.50.110.4) 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.83.63.2
IP Addr                         VPI:VCI  Handle  ATM Address
boot_1A(10.50.110.2)            0:175    37      39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.9e.2d.2
server_10_50_110(10.50.110.99)  0:172    34      39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a.10.0
boot_1C(10.50.110.6)            0:633    20      39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.99.c1.3
The ATM devices atm1 and atm2 have connected to the ATM switch and retrieved its address, 39.99.99.99.99.99.99.0.0.99.99.1.1. This address appears in the first 13 bytes of the ATM addresses of the two clients, at0 and at2. The clients have successfully registered with their corresponding Classic IP server: server_10_50_111 for at0 and server_10_50_110 for at2. The two clients are able to communicate with other clients on the same subnet. (The clients for at0, for example, are stby_1A and stby_1C.)
Example 2
If the connection between an ATM device and the switch is not functional on the ATM layer, the output of the arp command looks as follows:
arp -t atm -a
SVC - at0 on device atm2
==========================
at0(10.50.111.4) 8.0.5a.99.a6.9b.0.0.0.0.0.0.0.0.0.0.0.0.0.0
Here the MAC address of ATM device atm2, 8.0.5a.99.a6.9b, appears as the first six bytes of the ATM address for interface at0. The ATM device atm2 has not registered with the switch, since the switch address does not appear as the first part of the ATM address of at0.
Checking the AIX 5L Operating System
To view hardware and software errors that may affect the cluster, use the errpt command. Be on the lookout for disk and network error messages, especially permanent ones, which indicate real failures. See the errpt man page for more information.
Checking Physical Networks
Checkpoints for investigating physical connections include:
- Check the serial line between each pair of nodes.
- If you are using Ethernet:
  - Use the diag command to verify that the network interface card is good.
  - Ethernet adapters for the IBM eServer pSeries can be used either with the transceiver that is on the card or with an external transceiver. There is a jumper on the NIC to specify which you are using. Verify that your jumper is set correctly.
  - Make sure that hub lights are on for every connected cable.
- If you are using Token-Ring:
  - Use the diag command to verify that the NIC and cables are good.
  - Make sure that all the nodes in the cluster are on the same ring.
  - Make sure that the ringspeed is set to the same value for all NICs.
To review HACMP network requirements, see Chapter 3: Planning Cluster Network Connectivity in the Planning Guide.
Checking Disks, Disk Adapters, and Disk Heartbeating Networks
Use the diag command to verify that the adapter card is functioning properly. If problems arise, be sure to check the jumpers, cables, and terminators along the SCSI bus.
For SCSI disks, including IBM SCSI disks and arrays, make sure that each array controller, adapter, and physical disk on the SCSI bus has a unique SCSI ID. Each SCSI ID on the bus must be an integer value from 0 through 15, although some SCSI adapters may have limitations on the SCSI ID that can be set. See the device documentation for information about any device-specific limitations. A common configuration is to set the SCSI ID of the adapters on the nodes to be higher than the SCSI IDs of the shared devices. Devices with higher IDs take precedence in SCSI bus contention.
For example, if the standard SCSI adapters use IDs 5 and 6, assign values from 0 through 4 to the other devices on the bus. You may want to set the SCSI IDs of the adapters to 5 and 6 to avoid a possible conflict when booting one of the systems in service mode from a mksysb tape or other boot device, since a service mode boot always uses an ID of 7 as the default.
If the SCSI adapters use IDs of 14 and 15, assign values from 3 through 13 to the other devices on the bus. Refer to your worksheet for the values previously assigned to the adapters.
You can check the SCSI IDs of adapters and disks using either the lsattr or lsdev command. For example, to determine the SCSI ID of the adapter scsi1 (SCSI-3), use the following lsattr command and specify the logical name of the adapter as an argument:
Do not use wildcard characters or full pathnames on the command line for the device name designation.
Important: If you restore a backup of your cluster configuration onto an existing system, be sure to recheck or reset the SCSI IDs to avoid possible SCSI ID conflicts on the shared bus. Restoring a system backup causes adapter SCSI IDs to be reset to the default SCSI ID of 7.
If you note a SCSI ID conflict, see the Planning Guide for information about setting the SCSI IDs on disks and disk adapters.
To determine the SCSI ID of a disk, enter:
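A hedged sketch of both checks (standard AIX commands; scsi1 is the adapter named above, and the grep pattern assumes the adapter's SCSI ID attribute is named id):

lsattr -E -l scsi1 | grep id     # show the SCSI ID (id attribute) currently set on adapter scsi1
lsdev -Cc disk -H                # list all disks with their status and location codes, which include the SCSI ID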
Recovering from PCI Hot Plug NIC Failure
If an unrecoverable error causes a PCI hot-replacement process to fail, you may be left in a state where your NIC is unconfigured and still in maintenance mode. The PCI slot holding the card and/or the new card may be damaged at this point. User intervention is required to get the node back in fully working order.
For more information, refer to your hardware manuals or search for information about devices on IBM’s website.
Checking Disk Heartbeating Networks
Cluster verification confirms whether a disk heartbeating network is correctly configured. RSCT logs provide information for disk heartbeating networks that is similar to the information logged for other types of networks.
Use the following commands to test connectivity for a disk heartbeating network:
- dhb_read tests connectivity for a disk heartbeating network. For information about dhb_read, see the RSCT Command for Testing Disk Heartbeating section in Appendix C: HACMP for AIX 5L Commands in the Administration Guide.
- clip_config provides information about devices discovered for disk heartbeating.
- lssrc -ls topsvcs shows network activity.
Testing a Disk Heartbeating Network
The first step in troubleshooting a disk heartbeating network is to test the connections. As with RS232 networks, a disk heartbeating network cannot be tested while the network is active.
To use dhb_read to test a disk heartbeating connection:
1. Set one node to run the command in data mode:
2. Set the other node to run the command in transmit mode:
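A sketch of the test, based on the dhb_read usage described in the RSCT documentation referenced above; the disk name hdisk1 is a placeholder, and the path and flags should be confirmed against that documentation:

# On the first node (receive/data mode):
/usr/sbin/rsct/bin/dhb_read -p hdisk1 -r

# On the second node (transmit mode):
/usr/sbin/rsct/bin/dhb_read -p hdisk1 -t

If the two nodes can communicate over the shared disk, each side reports the message shown below.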
Link operating normally.
If a device that is expected to appear in a picklist does not, view the clip_config file to see what information was discovered.
$ cat /usr/es/sbin/cluster/etc/config/clip_config | grep diskhb
nodeA:15#Serial#(none)#0#/0#0##0#0.0.0.0#hdisk1#hdisk1#DE:AD:BE:EF#(none)##diskhb#public#0#0002409f07346b43
nodeB:15#Serial#(none)#0#/0#0##0#0.0.0.0#hdisk1#hdisk1#DE:AD:BE:EF#(none)##diskhb#public#0#0002409f07346b43
Disk Heartbeating Networks and Network Failure Detection
Disk heartbeating networks are identical to other non-IP based networks in terms of the operation of the failure detection rate. However, there is a subtle difference that affects the state of the network endpoints and the events run.
Disk heartbeating networks work by exchanging heartbeat messages on a reserved portion of a shared disk. As long as the node can access the disk, the network endpoint will be considered up, even if heartbeat messages are not being sent between nodes. The disk heartbeating network itself will still be considered down.
All other non-IP networks mark the network and both endpoints as down when either endpoint fails. This difference makes it easier to diagnose problems with disk heartbeating networks: if the problem is in the connection of just one node with the shared disk, only that part of the network will be marked as down.
Disk Heartbeating Networks and Fast Node Failure Detection
HACMP 5.4 provides a method to reduce the time it takes for a node failure to be realized throughout the cluster, while reliably detecting node failures.
HACMP 5.4 uses disk heartbeating to put a departing message on a shared disk so that its neighbor(s) become immediately aware of the node failure (without waiting for missed heartbeats). Topology Services then distributes the information about the node failure throughout the cluster, and each Topology Services daemon sends a node_down event to any concerned client.
For more information see the section Decreasing Node Fallover Time in Chapter 3: Planning Cluster Network Connectivity in the Planning Guide.
Disk Heartbeating Networks and Failed Disk Enclosures
In addition to providing a non-IP network to help ensure high availability, you can use disk heartbeating networks to detect failure of a disk enclosure (cabinet). To use this function, configure a disk heartbeating network for at least one disk in each disk enclosure.
To configure a disk heartbeating network to detect a failure of a disk enclosure:
1. Configure a disk heartbeating network for a disk in the specified enclosure. For information about configuring a disk heartbeating network, see the section Configuring Heartbeating over Disk in the Administration Guide.
2. Create a pre- or post-event, or a notification method, to determine the action to be taken in response to a failure of the disk heartbeating network. (A failure of the disk enclosure would be seen as a failure of the disk heartbeating network.)
Checking the Cluster Communications Daemon
In some cases, if you change or remove IP addresses in the AIX 5L adapter configuration after the cluster has been synchronized, the Cluster Communications daemon cannot validate these addresses against the /usr/es/sbin/cluster/etc/rhosts file or against the entries in the HACMP Configuration Database, and HACMP issues an error. You may also receive an error during cluster synchronization.
In this case, you must update the information that is saved in the /usr/es/sbin/cluster/etc/rhosts file on all cluster nodes, and refresh clcomd to make it aware of the changes. When you synchronize and verify the cluster again, clcomd starts using the IP addresses added to the HACMP Configuration Database.
To refresh the Cluster Communications daemon, use:
refresh -s clcomdES
Also, configure the /usr/es/sbin/cluster/etc/rhosts file to contain all the addresses currently used by HACMP for inter-node communication, and then copy this file to all cluster nodes.
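As a sketch, /usr/es/sbin/cluster/etc/rhosts is simply a list of the cluster's communication addresses, one address or IP label per line; the addresses below are placeholders:

192.168.10.1
192.168.10.2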
For troubleshooting other related problems, also see Cluster Communications Issues in this chapter.
Checking System Hardware
Check the power supplies and LED displays to see if any error codes are displayed. Run the AIX 5L diag command to test the system unit.
Without an argument, diag runs as a menu-driven program. You can also run diag on a specific piece of hardware. For example:
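A hedged sketch of such a targeted, non-interactive run; the -d flag names the device to test and -c suppresses the interactive menus:

diag -d hdisk0 -c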
This output indicates that hdisk0 is okay.
HACMP Installation Issues
The following potential installation issues are described here:
Cannot Find Filesystem at Boot Time
Problem
At boot-time, AIX 5L tries to check, by running the fsck command, all the filesystems listed in /etc/filesystems with the check=true attribute. If it cannot check a filesystem, AIX 5L reports the following error:
+----------------------------------------------------------+
 Filesystem Helper: 0506-519 Device open failed
+----------------------------------------------------------+
Solution
For filesystems controlled by HACMP, this error typically does not indicate a problem. The filesystem check failed because the volume group on which the filesystem is defined is not varied on at boot-time. To prevent the generation of this message, edit the /etc/filesystems file to ensure that the stanzas for the shared filesystems do not include the check=true attribute.
cl_convert Does Not Run Due to Failed Installation
Problem
When you install HACMP, cl_convert is run automatically. The software checks for an existing HACMP configuration and attempts to convert that configuration to the format used by the version of the software being installed. However, if the installation fails, cl_convert fails to run as a result. Therefore, conversion from the Configuration Database of a previous HACMP version to the Configuration Database of the current version also fails.
Solution
Run cl_convert from the command line. To gauge conversion success, refer to the /tmp/clconvert.log file, which logs conversion progress.
Root user privilege is required to run cl_convert.
Warning: Before converting to HACMP 5.4, be sure that your ODMDIR environment variable is set to /etc/es/objrepos.
For information on cl_convert flags, refer to the cl_convert man page.
Configuration Files Could Not Be Merged during Installation
Problem
During the installation of HACMP client software, the following message appears:
+----------------------------------------------------------+
 Post-installation Processing...
+----------------------------------------------------------+
Some configuration files could not be automatically merged into the system during the installation. The previous versions of these files have been saved in a configuration directory as listed below. Compare the saved files and the newly installed files to determine if you need to recover configuration data. Consult product documentation to determine how to merge the data.

Configuration files, which were saved in /usr/lpp/save.config:
/usr/es/sbin/cluster/utilities/clexit.rc
Solution
As part of the HACMP installation process, copies of HACMP files that could potentially contain site-specific modifications are saved in the /usr/lpp/save.config directory before they are overwritten. As the message states, you must merge site-specific configuration information into the newly installed files.
HACMP Startup Issues
The following potential HACMP startup issues are described here:
ODMPATH Environment Variable Not Set Correctly
Problem
Queried object not found.
Solution
HACMP has a dependency on the location of certain ODM repositories to store configuration data. The ODMPATH environment variable allows ODM commands and subroutines to query locations other than the default location if the queried object does not reside in the default location. You can set this variable, but it must include the default location, /etc/objrepos, or the integrity of configuration information may be lost.
clinfo Daemon Exits after Starting
Problem
The “smux-connect” error occurs after starting the clinfoES daemon with the -a option. Another process is using port 162 to receive traps.
Solution
Check to see if another process, such as the trapgend smux subagent of NetView for AIX 5L or the System Monitor for AIX 5L sysmond daemon, is using port 162. If so, restart clinfoES without the -a option and configure NetView for AIX 5L to receive the SNMP traps. Note that you will not experience this error if clinfoES is started in its normal way using the startsrc command.
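A sketch of that restart using the standard SRC commands:

stopsrc -s clinfoES      # stop the clinfo instance that was started with -a
startsrc -s clinfoES     # restart clinfo under SRC control, without the -a option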
Node Powers Down; Cluster Manager Will Not Start
Problem
The node powers itself off or appears to hang after starting the Cluster Manager. The configuration information does not appear to be identical on all nodes, causing the clexit.rc script to issue a halt -q to the system.
Solution
Use the cluster verification utility to uncover discrepancies in cluster configuration information on all cluster nodes.
Correct any configuration errors uncovered by the cluster verification utility. Make the necessary changes using the HACMP Initialization and Standard Configuration or Extended Configuration SMIT panels. After correcting the problem, select the Verify and Synchronize HACMP Configuration option to synchronize the cluster resources configuration across all nodes. Then select the Start Cluster Services option from the System Management (C-SPOC) > Manage HACMP Services SMIT panel to start the Cluster Manager.
The Cluster Manager should not exit if the configuration has passed cluster verification. If it does exit, use the AIX 5L snap -e command to collect HACMP cluster data, including the log files, and open a Program Management Report (PMR) requesting assistance.
For more information about the snap -e command, see the section Using the AIX Data Collection Utility, in Chapter 1: Troubleshooting HACMP Clusters.
You can modify the file /etc/cluster/hacmp.term to change the default action after an abnormal exit. The clexit.rc script checks for the presence of this file, and if you have made it executable, the instructions there will be followed instead of the automatic halt called by clexit.rc. Please read the caveats contained in the /etc/cluster/hacmp.term file, before making any modifications. For more information, see the section Abnormal Termination of a Cluster Daemon in the Administration Guide.
configchk Command Returns an Unknown Host Message
Problem
The /etc/hosts file on each cluster node does not contain the IP labels of other nodes in the cluster. For example, in a four-node cluster, Node A, Node B, and Node C’s /etc/hosts files do not contain the IP labels of the other cluster nodes.
If this situation occurs, the configchk command returns the following message to the console:
which indicates that the /etc/hosts file on Node x does not contain an entry for your node.
Solution
Before starting the HACMP software, ensure that the /etc/hosts file on each node includes the service and boot IP labels of each cluster node.
Cluster Manager Hangs during Reconfiguration
Problem
The Cluster Manager hangs during reconfiguration and generates messages similar to the following:
An event script has failed.
Solution
Determine why the script failed by examining the /tmp/hacmp.out file to see what process exited with a non-zero status. The error messages in the /usr/adm/cluster.log file may also be helpful. Fix the problem identified in the log file. Then run the clruncmd command either at the command line, or by using the SMIT Problem Determination Tools > Recover From HACMP Script Failure panel. The clruncmd command signals the Cluster Manager to resume cluster processing.
clcomdES and clstrmgrES Fail to Start on Newly installed AIX 5L Nodes
Problem
On newly installed AIX 5L nodes, clcomdES and clstrmgrES fail to start.
Solution
Manually indicate to the system console (for the AIX installation assistant) that the AIX 5L installation is finished.
This problem usually occurs on newly installed AIX nodes: at the first boot, AIX runs the installation assistant from /etc/inittab and does not proceed with the other entries in this file. The AIX 5L installation assistant waits for your input on the system console, and AIX 5L will run the installation assistant on every subsequent boot until you indicate that installation is finished. Once you do so, the system proceeds to start the cluster communications daemon (clcomdES) and the Cluster Manager daemon (clstrmgrES).
Pre- or Post-Event Does Not Exist on a Node after Upgrade
Problem
The cluster verification utility indicates that a pre- or post-event does not exist on a node after upgrading to a new version of the HACMP software.
Solution
Ensure that a script by the defined name exists and is executable on all cluster nodes.
Each node must contain a script associated with the defined pre- or post-event. While the contents of the script do not have to be the same on each node, the name of the script must be consistent across the cluster. If no action is desired on a particular node, a no-op script with the same event-script name should be placed on nodes on which no processing should occur.
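A minimal no-op event script might look like the following sketch; the file name itself must match the pre- or post-event name defined to HACMP:

#!/bin/ksh
# No-op pre- or post-event script for nodes where no processing is needed.
exit 0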
Node Fails During Configuration with “869” LED Display
Problem
The system appears to be hung. “869” is displayed continuously on the system LED display.
Solution
A number of situations can cause this display to occur. Make sure all devices connected to the SCSI bus have unique SCSI IDs to avoid SCSI ID conflicts. In particular, check that the adapters and devices on each cluster node connected to the SCSI bus have a different SCSI ID. By default, AIX 5L assigns an ID of 7 to a SCSI adapter when it configures the adapter. See the Planning Guide for more information on checking and setting SCSI IDs.
Node Cannot Rejoin Cluster after Being Dynamically Removed
Problem
A node that has been dynamically removed from a cluster cannot rejoin.
Solution
When you remove a node from the cluster, the cluster definition remains in the node’s Configuration Database. If you start cluster services on the removed node, the node reads this cluster configuration data and attempts to rejoin the cluster from which it had been removed. The other nodes no longer recognize this node as a member of the cluster and refuse to allow the node to join. Because the node requesting to join the cluster has the same cluster name as the existing cluster, it can cause the cluster to become unstable or crash the existing nodes.
To ensure that a removed node cannot be restarted with outdated Configuration Database information, complete the following procedure to remove the cluster definition from the node:
1. Stop cluster services on the node to be removed using the following command:
clstop -R
The -R flag removes the HACMP entry in the /etc/inittab file, preventing cluster services from being automatically started when the node is rebooted.
2. Remove the HACMP entry from the rc.net file using the following command:
clchipat false3. Remove the cluster definition from the node’s Configuration Database using the
following command:clrmclstrYou can also perform this task by selecting Extended Configuration > Extended Topology Configuration > Configure an HACMP Cluster > Remove an HACMP Cluster from the SMIT panel.
Resource Group Migration Is Not Persistent after Cluster Startup
Problem
You have specified a resource group migration operation using the Resource Group Migration Utility, in which you have requested that this particular migration Persists across Cluster Reboot, by setting this flag to true (or, by issuing the clRGmove -p command). Then, after you stopped and restarted the cluster services, this policy is not followed on one of the nodes in the cluster.
Solution
This problem occurs if a node was down and inaccessible when you specified the persistent resource group migration. In this case, the node did not obtain information about the persistent migration, and if that node is the first to join the cluster after cluster services are restarted, it will have no knowledge of the Persist across Cluster Reboot setting. Thus, the resource group migration will not be persistent. To restore the persistent migration setting, specify it again in SMIT under the Extended Resource Configuration > HACMP Resource Group Configuration SMIT menu.
SP Cluster Does Not Startup after Upgrade to HACMP 5.4
Problem
The ODM entry for group “hacmp” is removed on SP nodes. This problem manifests itself as the inability to start the cluster or clcomd errors.
Solution
To further improve security, the HACMP Configuration Database (ODM) has the following enhancements:
Ownership. All HACMP ODM files are owned by user root and group hacmp. In addition, all HACMP binaries that are intended for use by non-root users are also owned by user root and group hacmp.
Permissions. All HACMP ODM files, except for the hacmpdisksubsystem file with 600 permissions, are set with 640 permissions (readable by user root and group hacmp, writable by user root). All HACMP binaries that are intended for use by non-root users are installed with 2555 permissions (readable and executable by all users, with the setgid bit turned on so that the program runs as group hacmp).
During the installation, HACMP creates the group “hacmp” on all nodes if it does not already exist. By default, group hacmp has permission to read the HACMP ODMs, but does not have any other special authority. For security reasons, it is recommended not to expand the authority of group hacmp.
If you use programs that access the HACMP ODMs directly, you may need to rewrite them if they are intended to be run by non-root users:
All access to the ODM data by non-root users should be handled via the provided HACMP utilities.
In addition, if you are using the PSSP File Collections facility to maintain the consistency of /etc/group, the new group “hacmp” that is created at installation time on the individual cluster nodes may be lost when the next file synchronization occurs. There are two possible solutions to this problem. Take one of the following actions before installing HACMP 5.4:
a. Turn off PSSP File Collections synchronization of /etc/group
or
b. Ensure that group “hacmp” is included in the master /etc/group file and ensure that the change is propagated to all cluster nodes.
Disk and Filesystem Issues
The following potential disk and filesystem issues are described here:
AIX 5L Volume Group Commands Cause System Error Reports
Problem
The redefinevg, varyonvg, lqueryvg, and syncvg commands fail and report errors against a shared volume group during system restart. These commands send messages to the console when automatically varying on a shared volume group. When configuring the volume groups for the shared disks, autovaryon at boot was not disabled. If a node that is up owns the shared drives, other nodes attempting to vary on the shared volume group will display various varyon error messages.
Solution
When configuring the shared volume group, set the Activate volume group AUTOMATICALLY at system restart? field to no on the SMIT System Management (C-SPOC) > HACMP Logical Volume Management > Shared Volume Groups > Create a Shared Volume Group panel. After importing the shared volume group on the other cluster nodes, use the following command to ensure that the volume group on each node is not set to autovaryon at boot:
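For example (the volume group name sharedvg is an assumption):
chvg -a n sharedvg     # disable automatic varyon of the volume group at system restart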
Verification Fails on Clusters with Disk Heartbeating Networks
Problem 1
With clusters that have disk heartbeating networks configured, running verification fails and the HACMP software reports a “PVIDs do not match” error message.
Solution 1
Run verification with verbose logging to view messages that indicate where the error occurred (for example, the node, device, or command). The verification utility uses verbose logging to write to the /var/hacmp/clverify/clverify.log file.
If the hdisks have been renumbered, the disk heartbeating network may no longer be valid. Remove the disk heartbeating network and redefine it.
Ensure that the disk heartbeating networks are configured on enhanced concurrent volume groups. You can convert an existing volume group to enhanced concurrent mode. For information about converting a volume group, see the chapter Managing Shared LVM Components in a Concurrent Access Environment in the Administration Guide.
After correcting the problem, select the Verify and Synchronize HACMP Configuration option to synchronize the cluster resources configuration across all nodes. Then select the Start Cluster Services option from the System Management (C-SPOC) > Manage HACMP Services SMIT panel to start the Cluster Manager.
varyonvg Command Fails on a Volume Group
Problem 1
The HACMP software (the /tmp/hacmp.out file) indicates that the varyonvg command failed when trying to vary on a volume group.
Solution 1
Ensure that the volume group is not set to autovaryon on any node and that the volume group (unless it is in concurrent access mode) is not already varied on by another node.
The lsvg -o command can be used to determine whether the shared volume group is active. Run it on the node that has the volume group activated, then check the AUTO ON field in the lsvg output for the volume group to determine whether it is set to vary on automatically. If AUTO ON is set to yes, correct it.
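A sketch of this check and correction, assuming the shared volume group is named sharedvg:
lsvg -o                # list the volume groups that are currently varied on
lsvg sharedvg          # inspect the AUTO ON field in the output
chvg -a n sharedvg     # if AUTO ON is yes, disable automatic varyon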
Problem 2
The volume group information on disk differs from that in the Device Configuration Data Base.
Solution 2
Correct the Device Configuration Data Base on the nodes that have incorrect information:
1. Use the smit exportvg fastpath to export the volume group information. This step removes the volume group information from the Device Configuration Data Base.
2. Use the smit importvg fastpath to import the volume group. This step creates a new Device Configuration Data Base entry directly from the information on disk. After importing, be sure to change the volume group to not autovaryon at the next system boot.
3. Use the SMIT Problem Determination Tools > Recover From HACMP Script Failure panel to issue the clruncmd command to signal the Cluster Manager to resume cluster processing.
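For reference, a command-line sketch of steps 1 and 2 above (the volume group name sharedvg and disk hdisk3 are assumptions):
exportvg sharedvg                # remove the volume group definition from the Device Configuration Data Base
importvg -y sharedvg hdisk3      # re-create the definition from the information on disk
chvg -a n sharedvg               # ensure the volume group does not autovaryon at the next system boot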
Problem 3
The HACMP software indicates that the varyonvg command failed because the volume group could not be found.
Solution 3
The volume group is not defined to the system. If the volume group has been newly created and exported, or if a mksysb system backup has been restored, you must import the volume group. Follow the steps described in Problem 2 to verify that the correct volume group name is being referenced.
Problem 4
The HACMP software indicates that the varyonvg command failed because the logical volume <name> is incomplete.
Solution 4
This indicates that the forced varyon attribute is configured for the volume group in SMIT, and that when attempting a forced varyon operation, HACMP did not find a single complete copy of the specified logical volume for this volume group.
Also, it is possible that you requested a forced varyon operation but did not specify the super strict allocation policy for the mirrored logical volumes. In this case, the success of the varyon command is not guaranteed. For more information on the forced varyon functionality, see the chapter Planning Shared LVM Components in the Planning Guide and the Forcing a Varyon of Volume Groups section in the chapter on Configuring HACMP Resource Groups (Extended) in the Administration Guide.
cl_nfskill Command Fails
Problem
The /tmp/hacmp.out file shows that the cl_nfskill command fails when attempting to perform a forced unmount of an NFS-mounted filesystem. NFS provides a level of filesystem locking that resists forced unmounting by the cl_nfskill command.
Solution
Make a copy of the /etc/locks file in a separate directory before executing the cl_nfskill command. Then delete the original /etc/locks file and run the cl_nfskill command. After the command succeeds, re-create the /etc/locks file using the saved copy.
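A sketch of this sequence (the backup location /tmp/locks.save is an assumption; run cl_nfskill with the arguments appropriate to your environment):
cp /etc/locks /tmp/locks.save    # save a copy of the locks file
rm /etc/locks                    # delete the original
# run the cl_nfskill command here; after it succeeds, restore the file:
cp /tmp/locks.save /etc/locks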
cl_scdiskreset Command Fails
Problem
The cl_scdiskreset command logs error messages to the /tmp/hacmp.out file. To break the reserve held by one system on a SCSI device, the HACMP disk utilities issue the cl_scdiskreset command. The cl_scdiskreset command may fail if back-level hardware exists on the SCSI bus (adapters, cables or devices) or if a SCSI ID conflict exists on the bus.
Solution
See the appropriate sections in Chapter 2: Using Cluster Log Files to check the SCSI adapters, cables, and devices. Make sure that you have the latest adapters and cables. The SCSI IDs for each SCSI device must be different.
fsck Command Fails at Boot Time
Problem
At boot time, AIX 5L runs the fsck command to check all the filesystems listed in /etc/filesystems with the check=true attribute. If it cannot check a filesystem, AIX 5L reports the following error:
Solution
For filesystems controlled by HACMP, this message typically does not indicate a problem. The filesystem check fails because the volume group defining the filesystem is not varied on. The boot procedure does not automatically vary on HACMP-controlled volume groups.
To prevent this message, make sure that all the filesystems under HACMP control do not have the check=true attribute in their /etc/filesystems stanzas. To delete this attribute or change it to check=false, edit the /etc/filesystems file.
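For illustration, a stanza for an HACMP-controlled filesystem might read as follows (the filesystem, logical volume, and log device names are assumptions):
/sharedfs:
        dev             = /dev/sharedlv
        vfs             = jfs2
        log             = /dev/loglv00
        mount           = false
        check           = false
        options         = rw
        account         = false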
System Cannot Mount Specified Filesystems
Problem
The /etc/filesystems file has not been updated to reflect changes to log names for a logical volume. If you change the name of a logical volume after the filesystems have been created for that logical volume, the /etc/filesystems entry for the log does not get updated. Thus when trying to mount the filesystems, the HACMP software tries to get the required information about the logical volume name from the old log name. Because this information has not been updated, the filesystems cannot be mounted.
Solution
Be sure to update the /etc/filesystems file after making changes to logical volume names.
Cluster Disk Replacement Process Fails
Problem 1
You cannot complete the disk replacement process due to a node_down event.
Solution 1
Once the node is back online, export the volume group, then import it again before starting HACMP on this node.
Problem 2
The disk replacement process failed while the replacepv command was running.
Solution 2
Delete the /tmp/replacepv directory, and attempt the replacement process again.
You can also try running the process on another disk.
Problem 3
The disk replacement process failed with a “no free disks” message while VPATH devices were available for replacement.
Solution 3
Be sure to convert the volume group from VPATH devices to hdisks, and attempt the replacement process again. When the disk is replaced, convert hdisks back to the VPATH devices. For instructions, see the Convert SDD VPATH Device Volume Group to an ESS hdisk Device Volume Group section in the chapter on Managing Shared LVM Components in the Administration Guide.
Automatic Error Notification Fails with Subsystem Device Driver
Problem
You set up automatic error notification for the 2105 IBM Enterprise Storage System (ESS), expecting it to log errors when there is a volume group loss. (The Subsystem Device Driver handles the loss.) However, the error notification fails and you get error messages in the cspoc.log and the smit.log.
Solution
If you set up automatic error notification for the 2105 IBM Enterprise Storage System (ESS), which uses the Subsystem Device Driver, all PVIDs must be on VPATHS, or the error notification fails. To avoid this failure, convert all hdisks to VPATH devices.
Filesystem Change Not Recognized by Lazy Update
Problem
If you change the name of a filesystem, or remove a filesystem and then perform a lazy update, lazy update does not run the imfs -lx command before running the imfs command. This may lead to a failure during fallover or prevent a successful restart of the HACMP cluster services.
Solution
Use the C-SPOC utility to change or remove filesystems. This ensures that imfs -lx runs before imfs and that the changes are updated on all nodes in the cluster.
Error Reporting provides detailed information about inconsistency in volume group state across the cluster. If this happens, take manual corrective action. If the filesystem changes are not updated on all nodes, update the nodes manually with this information.
Network and Switch Issues
The following potential network and switch issues are described here:
Unexpected Network Interface Failure in Switched Networks
Problem
Unexpected network interface failures can occur in HACMP configurations using switched networks if the networks and the switches are incorrectly defined/configured.
Solution
Take care to configure your switches and networks correctly. See the section on considerations for switched networks in the Planning Guide for more information.
Troubleshooting VLANs
Problem
Interface failures occur in Virtual LAN networks (hereafter referred to as VLANs, Virtual Local Area Networks).
Solution
To troubleshoot VLAN interfaces defined to HACMP and detect an interface failure, consider these interfaces as interfaces defined on single adapter networks.
For information on single adapter networks and the use of the netmon.cf file, see Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks.
In particular, list the network interfaces that belong to a VLAN in the ping_client_list variable in the /usr/es/sbin/cluster/etc/clinfo.rc script and run clinfo. This way, whenever a cluster event occurs, clinfo monitors and detects a failure of the listed network interfaces. Due to the nature of Virtual Local Area Networks, other mechanisms to detect the failure of network interfaces are not effective.
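For example, the entry in /usr/es/sbin/cluster/etc/clinfo.rc might look like the following (the client names and address are placeholders):
PING_CLIENT_LIST="vlan_client_a vlan_client_b 192.168.10.21"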
Cluster Nodes Cannot Communicate
Problem
If your configuration has two or more nodes connected by a single network, you may experience a partitioned cluster. A partitioned cluster occurs when cluster nodes cannot communicate. In normal circumstances, a service network interface failure on a node causes the Cluster Manager to recognize and handle a swap_adapter event, where the service IP label/address is replaced with another IP label/address. However, if no other network interface is available, the node becomes isolated from the cluster. Although the Cluster Managers on other nodes are aware of the attempted swap_adapter event, they cannot communicate with the now isolated (partitioned) node because no communication path exists.
A partitioned cluster can cause GPFS to lose quorum. For more information, see the Appendix on GPFS Cluster Configuration, in the Installation Guide.
Solution
Make sure your network is configured for no single point of failure.
Distributed SMIT Causes Unpredictable Results
Problem
Using the AIX 5L utility DSMIT for operations other than starting or stopping HACMP cluster services can cause unpredictable results.
Solution
DSMIT manages the operation of networked IBM eServer pSeries processors. It includes the logic necessary to control execution of AIX 5L commands on all networked nodes. Since a conflict with HACMP functionality is possible, use DSMIT only to start and stop HACMP cluster services.
Token-Ring Network Thrashes
Problem
A Token-Ring network cannot reach steady state unless all stations are configured for the same ring speed. One symptom of the adapters being configured at different speeds is a clicking sound heard at the MAU (multi-station access unit).
Solution
Configure all adapters for either 4 or 16 Mbps.
System Crashes Reconnecting MAU Cables after a Network Failure
Problem
A global network failure occurs and crashes all nodes in a four-node cluster after reconnecting MAUs (multi-station access unit). More specifically, if the cables that connect multiple MAUs are disconnected and then reconnected, all cluster nodes begin to crash.
This result happens in a configuration where three nodes are attached to one MAU (MAU1) and a fourth node is attached to a second MAU (MAU2). Both MAUs (1 and 2) are connected together to complete a Token-Ring network. If MAU1 is disconnected from the network, all cluster nodes can continue to communicate; however, if MAU2 is disconnected, node isolation occurs.
Solution
To avoid causing the cluster to become unstable, do not disconnect cables connecting multiple MAUs in a Token-Ring configuration.
TMSCSI Will Not Properly Reintegrate when Reconnecting Bus
Problem
If the SCSI bus is broken while running as a target mode SCSI network, the network will not properly reintegrate when reconnecting the bus.
Solution
The HACMP software may need to be restarted on all nodes attached to that SCSI bus. When target mode SCSI is enabled and the cfgmgr command is run on a particular machine, it will go out on the bus and create a target mode initiator for every node that is on the SCSI network. In a four-node cluster, when all four nodes are using the same SCSI bus, each machine will have three initiator devices (one for each of the other nodes).
In this configuration, use a maximum of four target mode SCSI networks. You would therefore use networks between nodes A and B, B and C, C and D, and D and A.
Target mode SCSI devices are not always properly configured during the AIX 5L boot process. Ensure that all the tmscsi initiator devices are available on all nodes before bringing up the cluster. To do this, run lsdev -Cc tmscsi and check the status reported for each tmscsix device, where x identifies the particular tmscsi device. If the status is not “Available,” run the cfgmgr command and check again.
Recovering from PCI Hot Plug NIC Failure
Problem
If an unrecoverable error causes a PCI hot-replacement process to fail, the NIC may be left in an unconfigured state and the node may be left in maintenance mode. The PCI slot holding the NIC and/or the new NIC may be damaged at this point.
Solution
User intervention is required to get the node back in fully working order. For more information, refer to the AIX Managing Hot Plug Connectors from System Management Guide: Operating System and Devices.
Unusual Cluster Events Occur in Non-Switched Environments
Problem
Some network topologies may not support the use of simple switches. In these cases, you should expect that certain events may occur for no apparent reason. These events may be:
Cluster unable to form, either all or some of the time
swap_adapter pairs
swap_adapter, immediately followed by a join_standby
fail_standby and join_standby pairs.
These events occur when ARP packets are delayed or dropped. This is correct and expected HACMP behavior, as HACMP is designed to depend on core protocols strictly adhering to their related RFCs.
For a review of basic HACMP network requirements, see the Planning Guide.
Solution
The following implementations may reduce or circumvent these events:
Increase the Failure Detection Rate (FDR) to exceed the ARP retransmit time of 15 seconds, where typical values have been calculated as follows: FDR = (2+ * 15 seconds) + >5 = 35+ seconds (usually 45-60 seconds)
“2+” is a number greater than one in order to allow multiple ARP requests to be generated. This is required so that at least one ARP response will be generated and received before the FDR time expires and the network interface is temporarily marked down, then immediately marked back up.
Keep in mind, however, that the “true” fallover is delayed for the value of the FDR.
Increase the ARP queue depth. If you increase the queue, requests that are dropped or delayed will be masked until network congestion or network quiescence (inactivity) makes this problem evident.
Use a dedicated switch, with all protocol optimizations turned off. Segregate it into a physical LAN segment and bridge it back into the enterprise network.
Use permanent ARP entries (IP address to MAC address bindings) for all network interfaces. These values should be set at boot time, and since none of the ROM MAC addresses are used, replacing network interface cards will be invisible to HACMP.
Note: The above four items simply describe how some customers have customized their unique enterprise network topology to provide the classic protocol environment (strict adherence to RFCs) that HACMP requires. IBM cannot guarantee HACMP will work as expected in these approaches, since none addresses the root cause of the problem. If your network topology requires consideration of any of these approaches, please contact the IBM Consult Line for assistance.
Cannot Communicate on ATM Classic IP Network
Problem
If you cannot communicate successfully with a cluster network interface of type atm (a cluster network interface configured over a Classic IP client), check the following:
Solution
1. Check the client configuration. Check that the 20-Byte ATM address of the Classic IP server that is specified in the client configuration is correct, and that the interface is configured as a Classic IP client (svc-c) and not as a Classic IP server (svc-s).
2. Check that the ATM TCP/IP layer is functional. Check that the UNI version settings that are configured for the underlying ATM device and for the switch port to which this device is connected are compatible. It is recommended not to use the value auto_detect for either side.
If the connection between the ATM device# and the switch is not functional on the ATM protocol layer, this can also be due to a hardware failure (NIC, cable, or switch).
[bass][/]> arp -t atm -a
SVC - at0 on device atm1 -
==========================
at0(10.50.111.6) 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.98.fc.0
IP Addr                           VPI:VCI  Handle  ATM Address
server_10_50_111(10.50.111.255)   0:888    15      39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a0.11.0

SVC - at1 on device atm0 -
==========================
at1(10.50.120.6) 39.99.99.99.99.99.99.0.0.99.99.1.1.8.0.5a.99.99.c1.1
IP Addr                           VPI:VCI  Handle  ATM Address
?(0.0.0.0)                        N/A  N/A  15     39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a0.20.0

SVC - at3 on device atm2 -
==========================
at3(10.50.110.6) 8.0.5a.99.00.c1.0.0.0.0.0.0.0.0.0.0.0.0.0.0
IP Addr                           VPI:VCI  Handle  ATM Address
?(0.0.0.0)                        0:608    16      39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a0.10.0

In the example above, the client at0 is operational. It has registered with its server, server_10_50_111.
The client at1 is not operational, since it could not resolve the address of its Classic IP server, which has the hardware address 39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a0.11.0. However, the ATM layer is functional, since the 20 byte ATM address that has been constructed for the client at1 is correct. The first 13 bytes is the switch address, 39.99.99.99.99.99.99.0.0.99.99.1.1.
For client at3, the connection between the underlying device atm2 and the ATM switch is not functional, as indicated by the failure to construct the 20 Byte ATM address of at3. The first 13 bytes do not correspond to the switch address, but contain the MAC address of the ATM device corresponding to atm2 instead.
Cannot Communicate on ATM LAN Emulation Network
Problem
You are having problems communicating with an ATM LANE client.
Solution
Check that the LANE client is registered correctly with its configured LAN Emulation server. A failure of a LANE client to connect with its LAN Emulation server can be due to the configuration of the LAN Emulation server functions on the switch. There are many possible reasons.
1. Correct client configuration: Check that the 20 Byte ATM address of the LAN Emulation server, the assignment to a particular ELAN, and the Maximum Frame Size value are all correct.
2. Correct ATM TCP/IP layer: Check that the UNI version settings that are configured for the underlying ATM device and for the switch port to which this device is connected are compatible. It is recommended not to use the value auto_detect for either side.
If the connection between the ATM device# and the switch is not functional on the ATM protocol layer, this can also be due to a hardware failure (NIC, cable, or switch).
[bass][/]> entstat -d ent3
The output will contain the following:
General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 3
Driver Flags: Up Broadcast Running
        Simplex AlternateAddress

ATM LAN Emulation Specific Statistics:
--------------------------------------
Emulated LAN Name: ETHER3
Local ATM Device Name: atm1
Local LAN MAC Address: 42.0c.01.03.00.00
Local ATM Address: 39.99.99.99.99.99.99.00.00.99.99.01.01.08.00.5a.99.98.fc.04
Auto Config With LECS: No
LECS ATM Address: 00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00
LES ATM Address: 39.99.99.99.99.99.99.00.00.99.99.01.01.88.88.88.88.00.03.00

In the example above, the client is operational as indicated by the Running flag.
If the client had failed to register with its configured LAN Emulation Server, the Running flag would not appear, instead the flag Limbo would be set.
If the connection of the underlying device atm# was not functional on the ATM layer, then the local ATM address would not contain as the first 13 Bytes the Address of the ATM switch.
3. Switch-specific configuration limitations: Some ATM switches do not allow more than one client belonging to the same ELAN and configured over the same ATM device to register with the LAN Emulation Server at the same time. If this limitation holds and two clients are configured, the following are typical symptoms.
Cyclic occurrence of events indicating network interface failures, such as fail_standby, join_standby, and swap_adapter. This is a typical symptom if two such clients are configured as cluster network interfaces. The client that first succeeds in registering with the LES will hold the connection for a specified, configuration-dependent duration. After it times out, the other client succeeds in establishing a connection with the server; hence the cluster network interface configured on it will be detected as alive, and the former as down.
Sporadic events indicating a network interface failure (fail_standby, join_standby, and swap_adapter). If one client is configured as a cluster network interface and the other outside the cluster, this configuration error may go unnoticed if the client on which the cluster network interface is configured manages to register with the switch, and the other client remains inactive. The second client may succeed at registering with the server at a later moment, and a failure would then be detected for the cluster network interface configured over the first client.
IP Label for HACMP Disconnected from AIX 5L Interface
Problem
When you define network interfaces to the cluster configuration by entering or selecting an HACMP IP label, HACMP discovers the associated AIX 5L network interface name. HACMP expects this relationship to remain unchanged. If you change the name of the AIX 5L network interface name after configuring and synchronizing the cluster, HACMP will not function correctly.
Solution
If this problem occurs, you can reset the network interface name from the SMIT HACMP System Management (C-SPOC) panel. For more information, see the chapter on Managing the Cluster Resources in the Administration Guide.
TTY Baud Rate Setting Wrong
Problem
The default baud rate is 38400. Some modems or devices are incapable of doing 38400. If this is the case for your situation, you can change the default by customizing the RS232 network module to read the desired baud rate (9600/19200/38400).
Solution
Change the Custom Tuning Parameters for the RS232 network module. For instructions, see the chapter on Managing the Cluster Topology in the Administration Guide.
First Node Up Gives Network Error Message in hacmp.out
Problem
The first node up in an HACMP cluster gives the following network error message in /tmp/hacmp.out, even if the network is healthy:
Whether the network is functional or not, the RSCT topology services heartbeat interval expires, resulting in the logging of the above error message. This message is only relevant to non-IP networks (such as RS232, TMSCSI, TMSSA). This behavior does not occur for disk heartbeating networks (for which network_down events are not logged in general).
Solution
Ignore the message and let the cluster services continue to function. You should see this error message corrected in a healthy cluster as functional network communication is eventually established between other nodes in the cluster. A network_up event will be run after the second node that has an interface on this network joins the cluster. If cluster communication is not established after this error message, then the problem should be diagnosed in other sections of this guide that discuss network issues.
Network Interface Card and Network ODMs Out of Sync with Each Other
Problem
In some situations, it is possible for the HACMPadapter or the HACMPnetwork ODMs to become out of sync with the AIX 5L ODMs. For example, HACMP may refer to an Ethernet network interface card while AIX 5L refers to a Token-Ring network interface card.
If the hardware settings have been adjusted after the HACMP cluster has been successfully configured and synchronized, or
If the wrong values were selected when configuring predefined communication interfaces to HACMP.
Solution
Run cluster verification to detect and report the following network and network interface card type incompatibilities:
The network interface card configured in HACMP is the correct one for the node’s hardware
The network interface cards configured in HACMP and AIX 5L match each other.
If verification returns an error, examine and adjust the selections made on the Extended Configuration > Extended Topology Configuration > Configuring HACMP Communication Interfaces/Devices > Change/Show Communication Interfaces/Devices SMIT panel. For more information on this screen, see Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) of the Administration Guide.
Non-IP Network, Network Adapter or Node Failures
Problem
The non-IP interface declares its neighbor down after the Failure Detection Rate has expired for that network interface type. HACMP waits the same interval again before declaring the local interface down (if no heartbeat is received from the neighbor).
Solution
The non-IP heartbeating helps determine the difference between a NIC failure, network failure, and even more importantly node failure. When a non-IP network failure occurs, HACMP detects a non-IP network down and logs an error message in the /tmp/hacmp.out file.
Use the clstat -s command to display the service IP labels for non-IP networks that are currently down on a network.
The RSCT topsvcs daemon logs messages whenever an interface changes state. These errors are visible in the errpt.
For more information, see the section Changing the Configuration of a Network Module in the chapter on Managing the Cluster Topology in the Administration Guide.
Networking Problems Following HACMP Fallover
Problem
If you are using Hardware Address Takeover (HWAT) with any gigabit Ethernet adapters supporting flow control, you may be exposed to networking problems following an HACMP fallover. If a system crash occurs on one node and power still exists for the adapters on the crashed node, even though the takeover is successful, the network connection to the takeover node may be lost or the network containing the failing adapters may lock up. The problem is related to flow control being active on the gigabit Ethernet adapter in conjunction with how the Ethernet switch handles this situation.
Solution
Turn off flow control on the gigabit Ethernet adapters.
To disable flow control on the adapter type:
ifconfig entX detach    # where entX corresponds to the gigabit adapter device
chdev -l entX -a flow_ctrl=no
Then reconfigure the network on that adapter.
Packets Lost during Data Transmission
Problem
If data is intermittently lost during transmission, it is possible that the maximum transmission unit (MTU) has been set to different sizes on different nodes. For example, if Node A sends 8 K packets to Node B, which can accept 1.5 K packets, Node B assumes the message is complete; however data may have been lost.
Solution
Run the cluster verification utility to ensure that all of the network interface cards on all cluster nodes within the same network have the same setting for MTU size. If the MTU size is inconsistent across the network, an error displays, which enables you to determine which nodes to adjust.
chdev -l en0 -a mtu=<new_value_from_1_to_8>
Verification Fails when Geo Networks Uninstalled
Problem
HAGEO uninstalled, but Geo network definitions remain and cluster verification fails.
Solution
After HAGEO is uninstalled, any HACMP networks which are still defined as type Geo_Primary or Geo_Secondary must either be removed, or their type must be modified to correspond to the network type (such as Ethernet, Token Ring, RS232). HACMP verification will fail unless these changes are made to the HACMP network definitions.
Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks
Problem
Missing entries in the /etc/hosts for the netmon.cf file may prevent your networks from being properly monitored by the netmon utility of the RSCT Topology Services.
Solution
Make sure to include the entries for the netmon.cf file — each IP address and its corresponding label — in the /etc/hosts file. If the entries are missing, it may result in the NIM process of RSCT being blocked while RSCT attempts to determine the state of the local adapters.
In general, we recommend creating the netmon.cf file for cluster configurations that include networks which, under certain conditions, can become single adapter networks. In such networks, it can be difficult for HACMP to accurately determine adapter failure, because RSCT Topology Services cannot force packet traffic over the single adapter to verify its operation. Creating the netmon.cf file allows RSCT to accurately determine adapter failure.
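A minimal netmon.cf sketch, with one IP address or IP label per line (the addresses and labels shown are placeholders; the file resides in /usr/es/sbin/cluster/netmon.cf):
192.168.10.1
router_a
192.168.20.1
router_b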
For more information on creating the netmon.cf file, see the Planning Guide.
Cluster Communications Issues
The following potential cluster communications issues are described here:
Message Encryption Fails
Problem
You have message authentication, or message authentication and encryption, enabled, and you receive a message that encryption failed or that a message could not be decrypted.
Solution
If the encryption filesets are not found on the local node, a message indicates that the encryption libraries were not found.
If you did not receive a message that encryption libraries could not be found on the local node, check the clcomd.log file to determine if the encryption filesets are not found on a remote node.
Verify whether the cluster node has the following filesets installed:
For data encryption with DES message authentication: rsct.crypt.des
For data encryption with Triple DES message authentication: rsct.crypt.3des
For data encryption with Advanced Encryption Standard (AES) message authentication: rsct.crypt.aes256.
If needed, install these filesets from the AIX 5L Expansion Pack CD-ROM.
If the filesets are installed after HACMP is already running, start and stop the HACMP Cluster Communications daemon to enable HACMP to use these filesets. To restart the Cluster Communications daemon:
stopsrc -s clcomdES
startsrc -s clcomdES
If the filesets are present, and you get an encryption error, the encryption filesets may have been installed, or reinstalled, after HACMP was running. In this case, restart the Cluster Communications daemon as described above.
Cluster Nodes Do Not Communicate with Each Other
Problem
Cluster nodes are unable to communicate with each other, and you have one of the following configured:
Message authentication, or message authentication and encryption, enabled
Use of persistent IP labels for VPN tunnels.
Solution
Make sure that the network is operational, see the section Network and Switch Issues.
Check if the cluster has persistent IP labels. If it does, make sure that they are configured correctly and that you can ping the IP label.
If you are using message authentication, or message authentication and encryption:
Make sure that each cluster node has the same setting for message authentication mode. If the modes are different, on each node set message authentication mode to None and configure message authentication again.
Make sure that each node has the same type of encryption key in the /usr/es/sbin/cluster/etc directory. Encryption keys cannot reside in other directories.
If you have configured use of persistent IP labels for a VPN:
1. Change User Persistent Labels to No.
2. Synchronize cluster configuration.
3. Change User Persistent Labels to Yes.
HACMP Takeover Issues
Note that if you are investigating resource group movement in HACMP—for instance, investigating why an rg_move event has occurred—always check the /tmp/hacmp.out file. In general, given the recent changes in the way resource groups are handled and prioritized in fallover circumstances, particularly in HACMP, the hacmp.out file and its event summaries have become even more important in tracking the activity and resulting location of your resource groups. In addition, with parallel processing of resource groups, the hacmp.out file reports details that cannot be seen in the cluster history log or the clstrmgr.debug log file. Always check the hacmp.out log early on when investigating resource group movement after takeover activity.
The following potential takeover issues are described here:
varyonvg Command Fails during Takeover
Problem
The HACMP software failed to vary on a shared volume group. The volume group name is either missing or is incorrect in the HACMP Configuration Database object class.
Solution
Check the /tmp/hacmp.out file to find the error associated with the varyonvg failure. List all the volume groups known to the system using the lsvg command; then check that the volume group names used in the HACMPresource Configuration Database object class are correct.
To change a volume group name in the Configuration Database, from the main HACMP SMIT panel select Initialization and Standard Configuration > Configure HACMP Resource Groups > Change/Show Resource Groups, and select the resource group where you want the volume group to be included. Use the Volume Groups or Concurrent Volume Groups fields on the Change/Show Resources and Attributes for a Resource Group panel to set the volume group names.
After you correct the problem, use the SMIT Problem Determination Tools > Recover From HACMP Script Failure panel to issue the clruncmd command to signal the Cluster Manager to resume cluster processing. Run the cluster verification utility to verify cluster resources.
Highly Available Applications Fail
Problem 1
Highly available applications fail to start on a fallover node after an IP address takeover. The hostname may not be set.
Solution 1
Some software applications require an exact hostname match before they start. If your HACMP environment uses IP address takeover and starts any of these applications, add the following lines to the script you use to start the application servers:
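For example (a hedged sketch; the exact lines may differ in your environment):
hostname nnn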
where nnn is the hostname of the machine the fallover node is masquerading as.
Problem 2
An application that a user has manually stopped following a stop of cluster services where resource groups were placed in an UNMANAGED state, does not restart with reintegration of the node.
Solution 2
Check that the relevant application entry in the /usr/es/sbin/cluster/server.status file has been removed prior to node reintegration.
Since an application entry in the /usr/es/sbin/cluster/server.status file lists all applications already running on the node, HACMP will not restart the applications with entries in the server.status file.
Deleting the relevant application server.status entry before reintegration, allows HACMP to recognize that the highly available application is not running, and that it must be restarted on the node.
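A quick way to review the file before reintegration:
cat /usr/es/sbin/cluster/server.status    # list the applications HACMP believes are already running on the node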
Node Failure Detection Takes Too Long
Problem
The Cluster Manager fails to recognize a node failure in a cluster configured with a Token-Ring network. The Token-Ring network cannot become stable after a node failure unless the Cluster Manager allows extra time for failure detection.
In general, a buffer time of 14 seconds is used before determining failures on a Token-Ring network. This means that all Cluster Manager failure modes will take an extra 14 seconds if the Cluster Manager is dealing with Token-Ring networks. This time, however, does not matter if the Cluster Manager is using both Token-Ring and Ethernet. If Cluster Manager traffic is using a Token-Ring network interface, the 14 extra seconds for failures applies.
Solution
If the extra time is not acceptable, you can switch to an alternative network, such as an Ethernet. Using a non-IP heartbeating network (such as RS232) as recommended for all clusters should prevent this problem.
For some configurations, it is possible to run all the cluster network traffic on a separate network (Ethernet), even though a Token-Ring network also exists in the cluster. When you configure the cluster, include only the interfaces used on this separate network. Do not include the Token-Ring interfaces.
Since the Cluster Manager has no knowledge of the Token-Ring network, the 14-second buffer does not apply; thus failure detection occurs faster. Since the Cluster Manager does not know about the Token-Ring network interfaces, it cannot monitor them, nor can it swap network interfaces if one of the network interfaces fails or if the cables are unplugged.
HACMP Selective Fallover Is Not Triggered by a Volume Group Loss of Quorum Error in AIX 5L
Problem
HACMP fails to selectively move the affected resource group to another cluster node when a volume group quorum loss occurs.
Solution
If quorum is lost for a volume group that belongs to a resource group on a cluster node, the system checks whether the LVM_SA_QUORCLOSE error appeared in the node’s AIX 5L error log file and informs the Cluster Manager to selectively move the affected resource group. HACMP uses this error notification method only for mirrored volume groups with quorum enabled.
If fallover does not occur, check that the LVM_SA_QUORCLOSE error appeared in the AIX 5L error log. When the AIX 5L error log buffer is full, new entries are discarded until buffer space becomes available and an error log entry informs you of this problem. To resolve this issue, increase the size of the AIX 5L error log internal buffer for the device driver. For information about increasing the size of the error log buffer, see the AIX 5L documentation listed in About This Guide.
Group Services Sends GS_DOM_MERGE_ER Message
Problem
A Group Services merge message is displayed and the node receiving the message shuts itself down. You see a GS_DOM_MERGE_ER error log entry, as well as a message in the Group Services daemon log file:
“A better domain XXX has been discovered, or domain master requested to dissolve the domain.”
A Group Services merge message is sent when a node loses communication with the cluster and then tries to reestablish communication.
Solution
Because it may be difficult to determine the state of the missing node and its resources (and to avoid a possible data divergence if the node rejoins the cluster), you should shut down the node and successfully complete the takeover of its resources.
For example, if a cluster node becomes unable to communicate with other nodes, yet it continues to work through its process table, the other nodes conclude that the “missing” node has failed because they no longer are receiving keepalive messages from the “missing” node. The remaining nodes then process the necessary events to acquire the disks, IP addresses, and other resources from the “missing” node. This attempt to take over resources results in the dual-attached disks receiving resets to release them from the “missing” node and to start IP address takeover scripts.
As the disks are being acquired by the takeover node (or after the disks have been acquired and applications are running), the “missing” node completes its process table (or clears an application problem) and attempts to resend keepalive messages and rejoin the cluster. Since the disks and IP address have been successfully taken over, it becomes possible to have a duplicate IP address on the network and the disks may start to experience extraneous traffic on the data bus.
Because the reason for the “missing” node remains undetermined, you can assume that the problem may repeat itself later, causing additional downtime of not only the node but also the cluster and its applications. Thus, to ensure the highest cluster availability, GS merge messages should be sent to any “missing” cluster node to identify node isolation, to permit the successful takeover of resources, and to eliminate the possibility of data corruption that can occur if both the takeover node and the rejoining “missing” node attempt to write to the disks. Also, if two nodes exist on the network with the same IP address, transactions may be missed and applications may hang.
When you have a partitioned cluster, the node(s) on each side of the partition detect this and run a node_down for the node(s) on the opposite side of the partition. If, while running this or after communication is restored, the two sides of the partition do not agree on which nodes are still members of the cluster, a decision is made as to which partition should remain up; the other partition is shut down by a GS merge from nodes in the surviving partition or by a node sending a GS merge to itself.
In clusters consisting of more than two nodes the decision is based on which partition has the most nodes left in it, and that partition stays up. With an equal number of nodes in each partition (as is always the case in a two-node cluster) the node(s) that remain(s) up is determined by the node number (lowest node number in cluster remains) which is also generally the first in alphabetical order.
Group Services domain merge messages indicate that a node isolation problem was handled to keep the resources as highly available as possible, giving you time to later investigate the problem and its cause. When a domain merge occurs, Group Services and the Cluster Manager exit. The clstrmgr.debug file will contain the following error:
"announcementCb: GRPSVCS announcement code=n; exiting" "CHECK FOR FAILURE OF RSCT SUBSYSTEMS (topsvcs or grpsvcs)"cfgmgr Command Causes Unwanted Behavior in Cluster
Problem
SMIT commands like Configure Devices Added After IPL use the cfgmgr command. Sometimes this command can cause unwanted behavior in a cluster. For instance, if there has been a network interface swap, the cfgmgr command tries to reswap the network interfaces, causing the Cluster Manager to fail.
Solution
See the Installation Guide for information about modifying rc.net, thereby bypassing the issue. You can use this technique at all times, not just for IP address takeover, but it adds to the overall takeover time, so it is not recommended.
Releasing Large Amounts of TCP Traffic Causes DMS Timeout
Large amounts of TCP traffic over an HACMP-controlled service interface may cause AIX 5L to experience problems when queuing and later releasing this traffic. When traffic is released, it generates a large CPU load on the system and prevents timing-critical threads from running, thus causing the Cluster Manager to issue a deadman switch (DMS) timeout.
To reduce performance problems caused by releasing large amounts of TCP traffic into a cluster environment, consider increasing the Failure Detection Rate beyond Slow to a time that can handle the additional delay before a takeover. See the Changing the Failure Detection Rate of a Network Module section in the chapter on Managing the Cluster Topology in the Administration Guide.
Also, to lessen the probability of a DMS timeout, complete the following steps before issuing a node_down:
1. Use the netstat command to identify the ports using an HACMP-controlled service network interface.
2. Use the ps command to identify all remote processes logged to those ports.
3. Use the kill command to terminate these processes.
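A sketch of these steps (the service address 192.168.10.21, the process name, and the PID shown are assumptions):
netstat -an | grep 192.168.10.21    # identify ports using the HACMP-controlled service address
ps -ef | grep rlogind               # identify the remote processes logged in over those ports
kill 28432                          # terminate the identified processes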
Deadman Switch Causes a Node Failure
Problem
The node experienced an extreme performance problem, such as a large I/O transfer, excessive error logging, or running out of memory, and the Topology Services daemon (hatsd) is starved for CPU time. It could not reset the deadman switch within the time allotted. Misbehaved applications running at a priority higher than the Cluster Manager can also cause this problem.
Solutions
The deadman switch describes the AIX 5L kernel extension that causes a system panic and dump under certain cluster conditions if it is not reset. The deadman switch halts a node when it enters a hung state that extends beyond a certain time limit. This enables another node in the cluster to acquire the hung node’s resources in an orderly fashion, avoiding possible contention problems. Solutions related to performance problems should be performed in the following order:
1. Tune the system using I/O pacing and increasing the syncd frequency as directed in the chapter on Configuring AIX 5L for HACMP in the Installation Guide.
2. If needed, increase the amount of memory available for the communications subsystem.
3. Tune virtual memory management (VMM). This is explained below.
4. Change the Failure Detection Rate. For more information, see the Changing the Failure Detection Rate of a Network Module section in the chapter on Managing the Cluster Topology in the Administration Guide.
Tuning Virtual Memory Management
For most customers, increasing minfree/maxfree whenever the freelist gets below minfree by more than 10 times the number of memory pools is necessary to allow a system to maintain consistent response times. To determine the current size of the freelist, use the vmstat command. The size of the freelist is the value labeled free. The number of memory pools in a system is the maximum of the number of CPUs/8 or memory size in GB/16, but never more than the number of CPUs and always at least one. The value of minfree is shown by the vmtune command.
In systems with multiple memory pools, it may also be important to increase minfree/maxfree even though minfree will not show as 120, since the default minfree is 120 times the number of memory pools. If raising minfree/maxfree is going to be done, it should be done with care, that is, not setting it too high since this may mean too many pages on the freelist for no real reason. One suggestion is to increase minfree and maxfree by 10 times the number of memory pools, then observe the freelist again. In specific application environments, such as multiple processes (three or more) each reading or writing a very large sequential file (at least 1GB in size each) it may be best to set minfree relatively high, e.g. 120 times the number of CPUs, so that maximum throughput can be achieved.
This suggestion is specific to a multi-process large sequential access environment. Maxfree, in such high sequential I/O environments, should also be set more than just 8 times the number of CPUs higher than minfree, e.g. maxfree = minfree + (maxpgahead x the number of CPUs), where minfree has already been determined using the above formula. The default for maxpgahead is 8, but in many high sequential activity environments, best performance is achieved with maxpgahead set to 32 or 64. This suggestion applies to all pSeries models still being marketed, regardless of memory size. Without these changes, the chances of a DMS timeout can be high in these specific environments, especially those with minimum memory size.
For database environments, these suggestions should be modified. If JFS files are being used for database tables, then watching minfree still applies, but maxfree could be just minfree + (8 x the number of memory pools). If raw logical volumes are being used, the concerns about minfree/maxfree do not apply, but the following suggestion about maxperm is relevant.
In any environment (HA or otherwise) that is seeing non-zero paging rates, it is recommended that maxperm be set lower than the default of ~80%. Use the avm column of vmstat as an estimate of the number of working storage pages in use, or the number of valid memory pages (observed at full load against the system’s real memory, as shown by vmtune), to determine the percentage of real memory occupied by working storage pages. For example, if avm shows as 70% of real memory size, then maxperm should be set to 25% (vmtune -P 25). The basic formula used here is maxperm = 95 - (avm as a percentage of memory size in pages). If avm is 95% or more of real memory, then the system is memory constrained. The options at this point are to set maxperm to 5% and incur some paging activity, add additional memory to the system, or reduce the total workload run simultaneously on the system so that avm is lowered.
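As a worked example (the memory size and avm value are assumptions):
# System real memory: 8 GB = 2,097,152 pages of 4 KB
# avm reported by vmstat at full load: 1,400,000 pages, or about 67% of real memory
# maxperm = 95 - 67 = 28
vmtune -P 28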
Deadman Switch Time to Trigger
The Topology Services chapter in the Parallel System Support Programs for AIX Diagnosis Guide has several hints about how to avoid having the hatsd blocked which causes the deadman switch (DMS) to hit. The relevant information is in the Diagnostic Procedure section of the chapter. See “Action 5 - Investigate hatsd problem” and “Action 8 - Investigate node crash”. The URL for this Guide follows:
http://publibfp.boulder.ibm.com/epubs/pdf/a2273503.pdf
Running the /usr/sbin/rsct/bin/hatsdmsinfo command
This command is useful for checking on the deadman switch trigger time.
Output of the /usr/sbin/rsct/bin/hatsdmsinfo command looks like this:
========================================================
Information for Topology Services -- HACMP/ES
DMS Trigger time: 20.000 seconds.
Last DMS Resets            Time to Trigger (seconds)
06/04/02 06:51:53.064      19.500
06/04/02 06:51:53.565      19.499
06/04/02 06:51:54.065      19.500
06/04/02 06:51:54.565      19.500
06/04/02 06:51:55.066      19.500
06/04/02 06:51:55.566      19.499
DMS Resets with small time-to-trigger      Time to Trigger (seconds)
Threshold value: 15.000 seconds.
A “device busy” Message Appears after node_up_local Fails
Problem
A device busy message in the /tmp/hacmp.out file appears when swapping hardware addresses between the boot and service address. Another process is keeping the device open.
Solution
Check to see if sysinfod, the SMUX peer daemon, or another process is keeping the device open. If it is sysinfod, restart it using the -H option.
Network Interfaces Swap Fails Due to an rmdev “device busy” Error
Problem
Network interfaces swap fails due to an rmdev device busy error. For example, /tmp/hacmp.out shows a message similar to the following:
Method error (/etc/methods/ucfgdevice): 0514-062 Cannot perform the requested function because the specified device is busy.
Solution
Check to see whether the following applications are being run on the system. These applications may keep the device busy:
SNA
lssrc -g sna
Use the following command to stop SNA:
stopsrc -g sna
If that does not work, use the following command:
stopsrc -f -s sna
If that does not work, use the following command:
/usr/bin/sna -stop sna -t forced
If that does not work, use the following command:
/usr/bin/sna -stop sna -t cancel
Netview / Netmon
Ensure that the sysmond daemon has been started with a -H flag. This will result in opening and closing the network interface each time SM/6000 goes out to read the status, and allows the cl_swap_HW_address script to be successful when executing the rmdev command after the ifconfig detach before swapping the hardware address.
Use the following command to stop all Netview daemons:
/usr/OV/bin/nv6000_smit stopdaemons
IPX
ps -ef | grep npsd
ps -ef | grep sapd
Use the following command to stop IPX:
/usr/lpp/netware/bin/stopnps
NetBIOS
ps -ef | grep netbios
Use the following commands to stop NetBIOS and unload NetBIOS streams:
mcsadm stop; mcs0 unload
Unload various streams if applicable (that is, if the file exists):
cd /etc
strload -uf /etc/dlpi.conf
strload -uf /etc/pse.conf
strload -uf /etc/netware.conf
strload -uf /etc/xtiso.conf
Some customer applications will keep a device busy. Ensure that the shared applications have been stopped properly.
MAC Address Is Not Communicated to the Ethernet Switch
Problem
With switched Ethernet networks, MAC address takeover sometimes appears to not function correctly. Even though HACMP has changed the MAC address of the network interface, the switch is not informed of the new MAC address. The switch does not then route the appropriate packets to the network interface.
Solution
Do the following to ensure that the new MAC address is communicated to the switch:
1. Modify the line in /usr/es/sbin/cluster/etc/clinfo.rc that currently reads:
PING_CLIENT_LIST=" "

2. Include on this line the names or IP addresses of at least one client on each subnet on the switched Ethernet (see the example after these steps).
3. Run clinfoES on all nodes in the HACMP cluster that are attached to the switched Ethernet.
If you normally start HACMP cluster services using the /usr/es/sbin/cluster/etc/rc.cluster shell script, specify the -i option. If you normally start HACMP cluster services through SMIT, specify yes in the Start Cluster Information Daemon? field.
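For example, with hypothetical client names and an address standing in for real ones, the edited line in clinfo.rc might read:

PING_CLIENT_LIST="clientA clientB 192.168.10.21"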
Client Issues
The following potential HACMP client issues are described here:
Network Interface Swap Causes Client Connectivity Problem
Problem
The client cannot connect to the cluster. The ARP cache on the client node still contains the address of the failed node, not the fallover node.
Solution
Issue a ping command to the client from a cluster node to update the client’s ARP cache. Be sure to include the client name as the argument to this command. The ping command will update a client’s ARP cache even if the client is not running clinfoES. You may need to add a call to the ping command in your application’s pre- or post-event processing scripts to automate this update on specific clients. Also consider using hardware address swapping, since it will maintain configured hardware-to-IP address mapping within your cluster.
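For instance, from a surviving cluster node (clientA is a hypothetical client name; the -c flag simply limits the number of packets sent):

ping -c 5 clientA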
Clients Cannot Access Applications
Problem
The SNMP process failed.
Solution
Check the /etc/hosts file on the node on which SNMP failed to ensure that it contains IP labels or addresses of cluster nodes. Also see Clients Cannot Find Clusters.
Clients Cannot Find Clusters
Problem
The clstat utility running on a client cannot find any clusters. The clinfoES daemon has not properly managed the data structures it created for its clients (like clstat) because it has not located an SNMP process with which it can communicate. Because clinfoES obtains its cluster status information from SNMP, it cannot populate the HACMP MIB if it cannot communicate with this daemon. As a result, a variety of intermittent problems can occur between SNMP and clinfoES.
Solution
Create an updated client-based clhosts file by running verification with automatic corrective actions enabled. This produces a clhosts.client file on the server nodes. Copy this file to the /usr/es/sbin/cluster/etc/ directory on the clients, renaming the file clhosts. The clinfoES daemon uses the addresses in this file to attempt communication with an SNMP process executing on an HACMP server.
Also, check the /etc/hosts file on the node on which the SNMP process is running and on the node having problems with clstat or other clinfo API programs.
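As a sketch of the copy step described above (node and client names are hypothetical; use whatever remote copy mechanism your site permits):

rcp /usr/es/sbin/cluster/etc/clhosts.client clientA:/usr/es/sbin/cluster/etc/clhosts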
Clinfo Does Not Appear to Be Running
Problem
The service and boot addresses of the cluster node from which clinfoES was started do not exist in the client-based clhosts file.
Solution
Create an updated client-based clhosts file by running verification with automatic corrective actions enabled. This produces a clhosts.client file on the server nodes. Copy this file to the /usr/es/sbin/cluster/etc/ directory on the clients, renaming the file clhosts. Then run the clstat command.
Clinfo Does Not Report That a Node Is Down
Problem
Even though the node is down, the SNMP daemon and clinfoES report that the node is up. All the node’s interfaces are listed as down.
Solution
When one or more nodes are active and another node tries to join the cluster, the current cluster nodes send information to the SNMP daemon that the joining node is up. If, for some reason, the node fails to join the cluster, clinfoES does not send another message to the SNMP daemon to report that the node is down.
To correct the cluster status information, restart the SNMP daemon, using the options on the HACMP Cluster Services SMIT panel.
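If you prefer the command line to the SMIT panel, the SNMP daemon is controlled by the AIX System Resource Controller, so a generic restart sketch (not the HACMP-documented SMIT path) is:

stopsrc -s snmpd
startsrc -s snmpd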
Miscellaneous Issues
The following non-categorized HACMP issues are described here:
Note that if you are investigating resource group movement in HACMP (for instance, why an rg_move event has occurred), always check the /tmp/hacmp.out file first. Given the changes in the way resource groups are handled and prioritized in fallover circumstances, the hacmp.out file and its event summaries have become even more important for tracking the activity and resulting location of your resource groups. In addition, with parallel processing of resource groups, the hacmp.out file reports details that do not appear in the cluster history log or the clstrmgr.debug file. Always check this log early when investigating resource group movement after takeover activity.
Limited Output when Running the tail -f Command on /tmp/hacmp.out
Problem
Only script start messages appear in the /tmp/hacmp.out file. The script specified in the message is not executable, or the DEBUG level is set to low.
Solution
Add executable permission to the script using the chmod command, and make sure the DEBUG level is set to high.
CDE Hangs after IPAT on HACMP Startup
Problem
If CDE is started before HACMP is started, it binds to the boot address. When HACMP is started, it swaps the IP address to the service address. If CDE has already been started, this change in the IP address causes CDE to hang.
Solution
The output of hostname and of uname -n must be the same. If the output differs, use uname -S hostname to make the uname output match the output of hostname. Define an alias for the hostname on the loopback address. This can be done by editing /etc/hosts to include an entry of the form:

127.0.0.1 loopback localhost hostname

where hostname is the name of your host. If name serving is being used on the system, edit the /etc/netsvc.conf file so that the local file is checked first when resolving names.
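For example, a minimal /etc/netsvc.conf entry that makes the resolver consult the local /etc/hosts file before the name server might read (adjust to your own resolver configuration):

hosts = local, bind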
Ensure that the hostname and the service IP label resolve to different addresses. This can be determined by viewing the output of the /bin/host command for both the hostname and the service IP label.

Cluster Verification Gives Unnecessary Message
Problem
You get the following message regardless of whether or not you have configured Auto Error Notification:
Solution
Ignore this message if you have not configured Auto Error Notification.
config_too_long Message Appears
This message appears each time a cluster event takes more time to complete than a specified time-out period.
In versions prior to 4.5, the time-out period was fixed for all cluster events and set to 360 seconds by default. If a cluster event, such as a node_up or a node_down event, lasted longer than 360 seconds, then every 30 seconds HACMP issued a config_too_long warning message that was logged in the hacmp.out file.
In HACMP 4.5 and up, you can customize the time period allowed for a cluster event to complete before HACMP issues a system warning for it.
If this message appears, the Event Start entry in hacmp.out shows the following values:

$event_name is the reconfig event that failed
$argument is the parameter(s) used by the event
$sec is the number of seconds before the message was sent out.

In versions prior to HACMP 4.5, config_too_long messages continued to be appended to the hacmp.out file every 30 seconds until action was taken.
Starting with version 4.5, for each cluster event that does not complete within the specified event duration time, config_too_long messages are logged in the hacmp.out file and sent to the console according to the following pattern:
The first five config_too_long messages appear in the hacmp.out file at 30-second intervals.
The next set of five messages appears at an interval that is double the previous interval, until the interval reaches one hour.
These messages are logged every hour until the event completes or is terminated on that node.

This message could appear in response to the following problems:
Problem
Activities that the script is performing take longer than the specified time to complete; for example, this could happen with events involving many disks or complex scripts.
Solution
Determine what is taking so long to execute, and correct or streamline that process if possible. Increase the time to wait before calling config_too_long. You can customize Event Duration Time using the Change/Show Time Until Warning panel in SMIT. Access this panel through the Extended Configuration > Extended Event Configuration SMIT panel.
For complete information on tuning event duration time, see the Tuning Event Duration Time Until Warning section in the chapter on Configuring Cluster Events in the Administration Guide.
Problem
A command is hung and the event script is waiting for it to complete before resuming execution. If so, you can probably see the command in the AIX 5L process table (ps -ef). It is most likely the last command shown in the /tmp/hacmp.out file before the config_too_long script output.
Solution
You may need to kill the hung command. See also Dynamic Reconfiguration Sets a Lock.
Console Displays SNMP Messages
Problem
The syslogd configuration file (/etc/syslog.conf) has been changed to send the daemon.notice output to /dev/console.
Solution
Edit the /etc/syslog.conf file to redirect the daemon.notice output to /usr/tmp/snmpd.log. The snmpd.log file is the default location for logging these messages.
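A sketch of the relevant syslog.conf entry and the commands to activate it (paths follow the defaults named above; verify them on your system):

daemon.notice /usr/tmp/snmpd.log

touch /usr/tmp/snmpd.log    # syslogd logs only to files that already exist
refresh -s syslogd          # make syslogd reread its configuration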
Device LEDs Flash “888” (System Panic)
Problem
Examining the system dump device with the stat subcommand of the crash command indicates that the panic was caused by the deadman switch. The hatsd daemon cannot obtain sufficient CPU time during intensive operations (df or find, for example) and may be forced to wait too long for a chance at the kernel lock. Often, more than five seconds elapse before hatsd can get the lock; the result is the invocation of the deadman switch and a system panic.
Solution
Determine what process is hogging CPU cycles on the system that panicked. Then attempt (in order) each of the following solutions that address this problem:
1. Tune the system using I/O pacing.
2. Increase the syncd frequency.
3. Change the Failure Detection Rate.
For instructions on these procedures, see the sections under Deadman Switch Causes a Node Failure earlier in this chapter.
Unplanned System Reboots Cause Fallover Attempt to Fail
Problem
Cluster nodes did not fallover after rebooting the system.
Solution
To prevent unplanned system reboots from disrupting a fallover in your cluster environment, either set the Automatically REBOOT a system after a crash field on the Change/Show Characteristics of Operating System SMIT panel to false on all nodes in the cluster, or keep the IBM eServer pSeries key in Secure mode during normal operation.
Both measures prevent a system from rebooting if the shutdown command is issued inadvertently. Without one of these measures in place, if an unplanned reboot occurs the activity against the disks on the rebooting node can prevent other nodes from successfully acquiring the disks.
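The SMIT field described above corresponds to the autorestart attribute of the sys0 device, so the setting can typically also be checked and changed from the command line (verify the attribute name on your AIX level):

lsattr -El sys0 -a autorestart      # show the current setting
chdev -l sys0 -a autorestart=false  # disable automatic reboot after a crash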
Deleted or Extraneous Objects Appear in NetView Map
Problem
Previously deleted or extraneous object symbols appeared in the NetView map.
Solution
Rebuild the NetView database.
To rebuild the NetView database, perform the following steps on the NetView server:
1. Stop all NetView daemons: /usr/OV/bin/ovstop -a
2. Remove the database from the NetView server: rm -rf /usr/OV/database/*
3. Start the NetView object database: /usr/OV/bin/ovstart ovwdb
4. Restore the NetView/HAView fields: /usr/OV/bin/ovw -fields
5. Start all NetView daemons: /usr/OV/bin/ovstart -a
F1 Does Not Display Help in SMIT Panels
Problem
Pressing F1 in a SMIT panel does not display help.
Solution
Help can be displayed only if the LANG variable is set to one of the languages supported by HACMP, and if the associated HACMP message catalogs are installed. The languages supported by HACMP 5.4 are:
To list the installed locales (the bsl LPPs), type:
To list the active locale, type:
Since the LANG environment variable determines the active locale, if LANG=en_US, the locale is en_US.
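The commands themselves are not reproduced above; as a generic substitute using standard AIX utilities (an assumption, not the documented text):

locale -a    # list the locales installed on the system
locale       # show the active locale settings, including LANG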
/usr/es/sbin/cluster/cl_event_summary.txt File (Event Summaries Display) Grows Too Large
Problem
In HACMP, event summaries are pulled from the hacmp.out file and stored in the cl_event_summary.txt file. This file continues to accumulate as hacmp.out cycles, and is not automatically truncated or replaced. Consequently, it can grow too large and crowd your /usr directory.
Solution
Clear event summaries periodically, using the Problem Determination Tools > HACMP Log Viewing and Management > View/Save/Remove HACMP Event Summaries > Remove Event Summary History option in SMIT.
View Event Summaries Does Not Display Resource Group Information as Expected
Problem
In HACMP, event summaries are pulled from the hacmp.out file and can be viewed using the Problem Determination Tools > HACMP Log Viewing and Management > View/Save/Delete Event Summaries > View Event Summaries option in SMIT. This display includes resource group status and location information at the end. The resource group information is gathered by clRGinfo and may take extra time to collect if the cluster is not running when you select the View Event Summaries option.
Solution
clRGinfo displays resource group information more quickly when the cluster is running.
If the cluster is not running, wait a few minutes and the resource group information will eventually appear.
Application Monitor Problems
If you are running application monitors you may encounter occasional problems or situations in which you want to check the state or the configuration of a monitor. Here are some possible problems and ways to diagnose and act on them.
Problem 1
Checking the State of an Application Monitor. In some circumstances, it may not be clear whether an application monitor is currently running or not. To check on the state of an application monitor, run the following command:
This command produces a long line of verbose output if the application is being monitored.
If there is no output, the application is not being monitored.
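The command itself is not reproduced above. As a rough substitute (an assumption based on clappmond, the monitor daemon HACMP starts for each configured application monitor), you can look for the monitor process directly:

ps -ef | grep clappmond    # expect one entry per running application monitor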
Solution 1
If the application monitor is not running, there may be a number of reasons, including:

No monitor has been configured for the application server.
The monitor has not started yet because the stabilization interval has not completed.
The monitor is in a suspended state.
The monitor was not configured properly.
An error has occurred.

Check that a monitor has been configured, the stabilization interval has passed, and the monitor has not been placed in a suspended state before concluding that something is wrong.
If something is clearly wrong, reexamine the original configuration of the monitor in SMIT and reconfigure as needed.
Problem 2
Application Monitor Does Not Perform Specified Failure Action. The specified failure action does not occur even when an application has clearly failed.
Solution 2
Check the Restart Interval. If set too short, the Restart Counter may be reset to zero too quickly, resulting in an endless series of restart attempts and no other action taken.
Cluster Disk Replacement Process Fails
Problem
The disk replacement process fails while the replacepv command is running.
Solution
Be sure to delete the /tmp/replacepv directory, and attempt the replacement process again.
You can also try running the process on another disk.
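For example, on the node where the replacement was attempted:

rm -rf /tmp/replacepv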
Resource Group Unexpectedly Processed Serially
Problem
A resource group is unexpectedly processed serially even though you did not request it to be this way.
Solution
Check the site policy specified for this resource group and make sure it is set to Ignore. Then delete this resource group from the customized serial processing order list in SMIT and synchronize the cluster.
rg_move Event Processes Several Resource Groups at Once
Problem
In hacmp.out, you see that an rg_move event processes multiple non-concurrent resource groups in one operation.
Solution
This is the expected behavior. In clusters with dependencies, HACMP processes all resource groups upon node_up events, via rg_move events. During a single rg_move event, HACMP can process multiple non-concurrent resource groups within one event. For an example of the output, see the Processing in Clusters with Dependent Resource Groups or Sites section.
Filesystem Fails to Unmount
Problem
A filesystem is not unmounted properly during an event such as when you stop cluster services with the option to bring resource groups offline.
Solution
One of the more common reasons a filesystem fails to unmount when you stop cluster services with the option to bring resource groups offline is that the filesystem is busy. To unmount a filesystem successfully, no processes or users can be accessing it at the time. If a user or process is holding it, the filesystem will be “busy” and will not unmount.
The same issue may result if a file has been deleted but is still open.
The script that stops an application should therefore also check that the shared filesystems are not in use and that no deleted files are still held open. Use the fuser command in the script to see which processes or users are accessing the filesystems in question; the PIDs of these processes can then be acquired and killed, which frees the filesystem so that it can be unmounted. A brief example follows the man page note below.
Refer to the AIX 5L man pages for complete information on this command.
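A minimal sketch of such a check in an application stop script, assuming a hypothetical shared filesystem mounted at /sharedfs (verify the fuser flags against the man page mentioned above):

fuser -cu /sharedfs    # list the PIDs and login names of processes using the filesystem
fuser -ck /sharedfs    # if it is safe to do so, kill the processes still holding it
fuser -cu /sharedfs    # confirm that nothing is left before the unmount is attempted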
Dynamic Reconfiguration Sets a Lock
Problem
When attempting a DARE operation, an error message may be generated regarding a DARE lock if another DARE operation is in process, or if a previous DARE operation did not complete properly.
The error message suggests that one should take action to clear the lock if a DARE operation is not in process. “In process” here refers to another DARE operation that may have just been issued, but it also refers to any previous DARE operation that did not complete properly.
Solution
The first step is to examine the /tmp/hacmp.out logs on the cluster nodes to determine the reason for the previous DARE failure. A config_too_long entry will likely appear in hacmp.out where an operation in an event script took too long to complete. If hacmp.out indicates that a script failed to complete due to some error, correct this problem and manually complete the remaining steps that are necessary to complete the event.
Run the HACMP SMIT Problem Determination Tools > Recover from HACMP Script Failure option. This should bring the nodes in the cluster to the next complete event state.
You can clear the DARE lock by selecting the HACMP SMIT option Problem Determination Tools > Release Locks Set by Dynamic Configuration if the HACMP SMIT Recover from HACMP Script Failure step did not do so.
WebSMIT Does Not “See” the Cluster
WebSMIT is designed to run on a single node. If that node goes down, WebSMIT will become unavailable. To increase availability, you can set up WebSMIT to run on multiple nodes. Since WebSMIT is retrieving and updating information from the HACMP cluster, that information should be available from all nodes in the cluster.
Typically, you will set up WebSMIT to be accessible from a cluster's internal network but not reachable from the Internet. If sites are configured, and WebSMIT is running on a node on a remote site, you must ensure HTTP connectivity to that node; it is not handled automatically by WebSMIT or HACMP. HTTPS/SSL is highly recommended for security.
Because WebSMIT runs on one node in the cluster, the functionality it provides and the information it displays correspond directly to the version of HACMP installed on that node. For HACMP 5.4 WebSMIT to work properly, you must have cluster services running on at least one node and JavaScript enabled in the client browser.